Parsoid/limitations

Note: These limitations apply to the native (fully token-based and single-pass) template expansion pipeline. We have since decided to use the PHP preprocessor for template expansions, which side-steps these issues by reverting to the traditional textual preprocessor pass.

Most limitations have to do with the single-pass architecture of Parsoid vs. the multi-pass structure of the PHP parser. In principle, this class of limitation can be lifted by doing less work in the tokenizer and more work on the token stream, at the cost of a less complete PEG grammar and more complex token stream transforms (which would then start to look like a parser too). So far we have only done this to support template and list fragments, which are widely used.
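
To make the trade-off concrete, here is a minimal sketch (plain JavaScript; all token shapes and handler names are hypothetical, not Parsoid's actual API) of what "less work in the tokenizer, more work on the token stream" looks like: the tokenizer emits cheap, generic tokens, and a later transform pass assembles structure from them.

// Minimal sketch: one pass over the token stream in which registered
// handlers may expand a token into several replacement tokens.
function transformTokens(tokens, handlers) {
    const out = [];
    for (const token of tokens) {
        const handler = handlers[token.type];
        out.push(...(handler ? handler(token) : [token]));
    }
    return out;
}

// Example: the tokenizer only emits a cheap 'listItemMarker' token;
// this later pass turns it into real list structure.
const handlers = {
    listItemMarker: (t) => [
        { type: 'tag', name: 'li' },
        { type: 'text', value: t.content },
    ],
};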

Templates returning parts of syntactic structure other than templates and lists

Example: {{echo|[}}{{echo|[}}Link]] or en:Template:YouTube. If needed, additional cases can be supported by emitting only simple start/end tokens in the tokenizer and moving the actual parsing to a token stream transformer in the sync23 phase (after templates are expanded); see the sketch at the end of this section.

Same root issue: <noinclude> spanning template or link parameters, as in {{Foo|<noinclude>|Some non-included parameter</noinclude>}}.

More examples: [1]
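
A hypothetical sketch of this approach (not Parsoid's actual sync23 code): the tokenizer emits bare bracket tokens regardless of where they come from, and a post-expansion transform pairs them up, so brackets emitted by templates participate in link parsing like any others.

function pairLinkBrackets(tokens) {
    const out = [];
    for (let i = 0; i < tokens.length; i++) {
        // Two adjacent '[' tokens open a wikilink, whether they came
        // from the page source or out of a template expansion.
        if (tokens[i].type === 'lbracket' &&
                tokens[i + 1] && tokens[i + 1].type === 'lbracket') {
            out.push({ type: 'wikilink-start' });
            i++; // consume the second bracket
        } else {
            out.push(tokens[i]);
        }
    }
    return out;
}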

Template-expansion-dependent start-of-line or end-of-line context

==foo=={{echo|
bar}} 

{{echo|
* should be listitem}}

Fixed for pres (whitespace is stripped from named, but not positional, parameters), but not yet for list items. I (gwicke) have a patch that lets list items be parsed as a listItem token in the tokenizer, but that token would need to be converted back to the original newline and bullets when it is encountered as the first token in tokenTrim.
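
The conversion the patch would need looks roughly like this (hypothetical sketch; the token shapes, the bullets field, and tokenTrim's surrounding logic are all assumptions):

function tokenTrim(tokens) {
    // If a listItem token (from the tokenizer patch) ends up as the
    // first token of a template argument, convert it back into the
    // literal newline-plus-bullets text it was tokenized from.
    if (tokens.length && tokens[0].type === 'listItem') {
        const li = tokens[0];
        tokens = [{ type: 'text', value: '\n' + li.bullets }]
            .concat(tokens.slice(1));
    }
    // ... actual argument whitespace trimming would follow here ...
    return tokens;
}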

Comments in arbitrary places

Example: [<!-- comment -->[Main Page]]

These cannot generally be supported without stripping comments before parsing. Even if parsed, this type of comment could not be represented in the DOM. Before deployment, we should check if this is common enough to warrant an automated conversion. Grepping a dump works well for this check.
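
Stripping comments before parsing would amount to little more than this (sketch only; a real pass would have to record offsets so comments can be round-tripped):

function stripComments(wikitext) {
    // Non-greedy match; [\s\S] also covers comments spanning lines.
    return wikitext.replace(/<!--[\s\S]*?-->/g, '');
}

// stripComments('[<!-- comment -->[Main Page]]') => '[[Main Page]]'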

Mis-nested parser functions

The grammar-based tokenizer assumes some degree of sane nesting. Parser functions can return full tokens or substrings of an attribute, but not the first half of a token including half an attribute. Similar to the issues above, this limitation could be largely removed by dumbing down the tokenizer and deferring actual parsing until after template expansions, at the cost of performance and PEG tokenizer grammar completeness. Mis-nested parser functions are hard for humans to figure out too, and are best avoided or removed where possible.

Example: search for 'style="}}' in Template:Navbox. A total of 12 templates and no articles in the English Wikipedia match that string (checked with: zcat enwiki-latest-pages-articles.xml.gz | node dumpGrepper.js 'style="}}').
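
For reference, the core of such a dump grepper is small. This is a simplified sketch that line-matches a decompressed dump on stdin (the real dumpGrepper.js parses the dump XML properly):

const pattern = process.argv[2];
let buf = '', title = null, matches = 0;
process.stdin.setEncoding('utf8');
process.stdin.on('data', (chunk) => {
    buf += chunk;
    const lines = buf.split('\n');
    buf = lines.pop(); // keep a partial trailing line for the next chunk
    for (const line of lines) {
        const t = line.match(/<title>(.*?)<\/title>/);
        if (t) { title = t[1]; }
        if (line.includes(pattern)) {
            console.log('match in:', title);
            matches++;
        }
    }
});
process.stdin.on('end', () => console.log(matches, 'matches'));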


Solvable, but tricky tokenizer issues (SOLVED)

  • Start-of-line position vs. templates and parser functions: \n{{#if:||* some list item|}}. See also User:GWicke/Tables and User:GWicke/Lists
  • {{#if:||!foo='bar'}}-style if-functions are used to set table headers in some templates (see the Jcttop/core source)
  • attributes conditionally set from parser functions: <div {{#if:||style="color: red"}}>Red text</div>
  • SOL td-wikitext emitted by templates. See example below.
{|
|-
{{echo|{{!}}foo}}  <-- this is the simplified test case for: {{singlechart|Australia|8|artist=Hilary Duff|song=So Yesterday}}
|}
  • td and style constructed separately (see the commit note for git SHA 12b561a); test case below:
{|
| {{Won}}
|}

Possible approaches to dealing with SOL text emitted by parser functions

  • Recently, we moved pre-handling out of the tokenizer: we now detect leading whitespace after newlines and use it to insert pre-tokens, which handles SOL-position whitespace emitted from templates that the tokenizer cannot detect. We can apply the same technique to table, list, or other wikitext characters/tokens that can show up in SOL position: code a generic SOL-text handler that looks for SOL-position tokens and inserts the relevant tokens into the token stream (see the sketch after this list).
  • Alternatively, the only scenario in which the tokenizer cannot detect SOL-position text is argument text passed into templates. The generic SOL-text handler could therefore insert marker tokens before template argument text and look for those markers to reparse the text that follows. The same technique could also deal (in a limited manner) with navbox-like templates that construct HTML tags from string pieces: the tokenizer can recognize tag fragments (ex: <td) and insert marker tags into the token stream to force a reparse of the text that follows (up to an end marker) once all templates are expanded. We could then also generate warnings in the parser logs to mark such templates as candidates for rewriting to eliminate tag construction from text fragments.
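
A hypothetical sketch of the first approach (all token shapes are assumptions): after template expansion, scan the stream for text tokens in SOL position and replace recognized SOL constructs, here just a table-cell '|', with real tokens.

function handleSolText(tokens) {
    const out = [];
    let atSol = true; // are we at start-of-line position?
    for (const token of tokens) {
        if (atSol && token.type === 'text' && token.value.startsWith('|')) {
            // A '|' in SOL position inside a table opens a table cell.
            out.push({ type: 'tag', name: 'td' });
            out.push({ type: 'text', value: token.value.slice(1) });
        } else {
            out.push(token);
        }
        atSol = (token.type === 'newline');
    }
    return out;
}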

In either case, we need to collect use cases showing how frequently this kind of non-structural wikitext shows up in templates and wiki pages. If there are very few cases, it might be better to fix up the corresponding templates and uses than to hack up the parser to deal with non-structural wikitext. This kind of cleanup can lead us towards templates that generate structured HTML (rather than unstructured text).

See also