Parsoid/limitations

Most limitations have to do with the single-pass architecture of Parsoid vs. the multi-pass structure of the PHP parser. In principle, this class of limitation can be lifted by doing less work in the tokenizer and more work on the token stream, at the cost of a less complete PEG grammar and more complex token stream transforms (which would then start to look like a parser too). So far we have only done this to support template and list fragments, which are widely used.

Templates returning parts of syntactical structure apart from templates and lists
Example:  or en:Template:YouTube. If needed, additional cases can be supported by only emitting simple start/end tokens in the tokenizer and moving the actual parsing to a token stream transformer in the sync23 phase (after templates are expanded).
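The idea of emitting only simple start/end tokens and pairing them later can be sketched roughly as follows. This is an illustration only, not Parsoid code: the `Token` and `Node` types and the `build_tree` function are hypothetical names, and the real token stream transformers are considerably more involved.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    kind: str        # "start", "end", or "text" (hypothetical token kinds)
    name: str = ""   # construct name for start/end tokens, e.g. "table"
    text: str = ""   # payload for text tokens


@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)


def build_tree(tokens):
    """Pair simple start/end tokens into nested nodes.

    Meant to run on the token stream after template expansion, so the
    start and end tokens may have come from different templates.
    """
    root = Node("root")
    stack = [root]
    for tok in tokens:
        if tok.kind == "start":
            node = Node(tok.name)
            stack[-1].children.append(node)
            stack.append(node)
        elif tok.kind == "end":
            # Close the nearest open node with the same name;
            # a stray end token with no matching start is ignored.
            for i in range(len(stack) - 1, 0, -1):
                if stack[i].name == tok.name:
                    del stack[i:]
                    break
        else:
            stack[-1].children.append(tok.text)
    return root
```

Because pairing happens on the expanded stream, it does not matter whether the start token and the end token were produced by the same template, which is the case the tokenizer alone cannot handle.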

Same root issue: Noinclude spanning template or link parameters as in.

More examples:

foo
Fixed for pre blocks (stripping named, but not positional parameters), but not yet for list items. I (gwicke) have a patch that lets list items be parsed as a listItem token in the tokenizer, but that token would need to be converted back to the original newline and bullets when it is encountered as the first token in tokenTrim.

Comments in arbitrary places
Example:

These cannot generally be supported without stripping comments before parsing. Even if parsed, this type of comment could not be represented in the DOM. Before deployment, we should check if this is common enough to warrant an automated conversion. Grepping a dump works well for this check.

Mis-nested parser functions
The grammar-based tokenizer assumes some degree of sane nesting. Parser functions can return full tokens or substrings of an attribute, but not the first half of a token including half an attribute. As with the issues above, this limitation could be largely removed by dumbing down the tokenizer and deferring actual parsing until after template expansion, at the cost of performance and PEG tokenizer grammar completeness. Mis-nested parser functions are hard for humans to figure out too, and should be avoided or removed where possible.

Example: search for 'style="}}' in Template:Navbox. There are a total of 12 templates and no articles matching that string in the English Wikipedia. (used the following statement: )
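A count like this can also be reproduced with a plain substring scan over page text. The sketch below is not the statement used for the numbers above; `count_matches` is a hypothetical helper, and it assumes the pages are already available as (title, wikitext) pairs, leaving out how they are read from a dump.

```python
from collections import Counter

# The literal string searched for in the example above.
NEEDLE = 'style="}}'


def count_matches(pages, needle=NEEDLE):
    """Tally pages containing `needle`, split into templates and other pages.

    `pages` is any iterable of (title, wikitext) pairs.
    """
    counts = Counter()
    for title, text in pages:
        if needle in text:
            kind = "template" if title.startswith("Template:") else "other"
            counts[kind] += 1
    return counts
```

A literal match only finds this one spelling of the pattern; variants with extra whitespace or a different attribute name would need their own search strings.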

expands to font-weight: bold">Some text in the PHP parser, but expands to &lt;span style="color:red; font-weight: bold"&gt;Some text in Parsoid. This can be fixed by modifying the source to  . Hopefully this is rare enough to be fixed manually. Reliable detection of this case is needed to determine how common it is.