Parsoid/Known differences with PHP parser output

From MediaWiki.org

This page tracks known HTML output differences between Parsoid and the PHP parser, along with the proposed resolution for each difference.

Differences because of implementation differences or functionality gaps

Difference: Parsoid is based on HTML5 semantics, whereas the PHP parser is based on HTML4 semantics.
Description: Parsoid uses Domino, which implements the newer, standardized HTML5 semantics, whereas the PHP parser relies on Tidy, which is based on HTML4. A number of parsing and rendering differences arise from this, primarily around broken HTML (of which there is a lot on Wikimedia wikis).
Proposed resolution: Once Tidy is replaced as per Parsing/Replacing Tidy, this difference will be fixed.
Status: In progress

Difference: Parsoid generates <figure> tags for block images, whereas the PHP parser uses <div>.
Description: This is again HTML4 / HTML5 fallout. Parsoid uses semantic markup available in HTML5 that wasn't available in HTML4 when the PHP parser was written.
Proposed resolution: T118517 is the RFC for updating PHP parser output (and https://gerrit.wikimedia.org/r/#/c/196532/ is a WIP patch). Once that code is ready to be merged, and before we deploy it, we'll work with bot and gadget authors to use the new markup that will be generated.
Status: To do

Difference: Parsoid doesn't generate redlinks and disambiguation links, whereas the PHP parser does.
Description: Parsoid currently doesn't mark up red links or disambiguation links.
Proposed resolution: Parsoid will fix this as part of T39902.
Status: To do

Difference: Parsoid doesn't handle language variants yet.
Description: Parsoid doesn't yet parse language variant markup and doesn't provide a variant-specific rendering for reading clients.
Proposed resolution: There are work-in-progress patches that will enable this in Parsoid. Actually providing an API endpoint for reading clients to access per-language / per-variant output on these pages requires T122942 to be resolved. We expect this to be addressed by the end of March 2017.
Status: In progress

Difference: Parsoid doesn't generate a wrapper <span> inside headings, whereas the PHP parser does.
Description: The PHP parser generates <h2><span class='mw-headline' id='..'>..</span></h2>, whereas Parsoid currently generates <h2>..</h2> without the span.
Proposed resolution: Parsoid is getting an update to generate the same heading ids as core. We will look at generating the mw-headline class in Parsoid as well, but we don't intend to generate the inner <span> since it is unnecessary. If any bots and gadgets are affected, we'll work with their authors to update them.
Status: Done

Difference: Edge case differences between Parsoid's native implementation of some extensions and the PHP implementations of the same.
Description: For any extension that processes wikitext (e.g. Cite, Gallery), Parsoid needs its own native implementation. Because of implementation differences, there are edge cases where the output differs (e.g. T104662, T96555, and a few others related to gallery).
Proposed resolution: Some of these (T104662, T96555) will be fixed in Parsoid. Others might be tweaked in the PHP implementation, or we might just treat the edge case differences as undefined behavior that editors shouldn't rely on. Since these are edge cases, such usage should be fairly uncommon on wikis (otherwise, we would have fixed them already).
Status: In progress

Difference: Some parser hooks available in the PHP parser are unavailable in Parsoid.
Description: Parsoid and the PHP parser have different internals, and hence not all of the PHP parser's hooks are available in Parsoid. A page with parser hook stats lists extensions and the parser hooks they use. Some hooks, like ParserBeforeStrip and ParserAfterStrip, have no equivalent in Parsoid. So, in a Parsoid-only world, this could affect the output and functioning of extensions like <translate>.
Proposed resolution: We are going to develop a parser hooks API that is implementation independent (without exposing the internal details of how parsing happens) and port all the Wikimedia extensions to use this new API. All existing parser hooks will be deprecated en masse and will eventually be removed.
Status: To do

Difference: Parsoid doesn't handle pages in some namespaces properly (e.g. File, Category).
Description: Parsoid doesn't have special handling for pages in namespaces that have generated content. For example, the content for a page in the Category namespace is generated dynamically. A page in the File namespace similarly has some generated content.
Proposed resolution: There is a good argument to be made that Parsoid shouldn't duplicate this support and that clients should fetch this content from the MediaWiki API directly. However, this leaves Parsoid clients in a bit of a bind, because they don't know which namespaces are special in this way. So, some good resolution of this problem would be helpful. Maybe Parsoid should handle requests for content in all namespaces and, where that content is better served from the MediaWiki API, redirect the client to the right URL? See T153801, T151223, T148118. To be investigated and resolved.
Status: To do
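For bot and gadget authors affected by the heading difference above, here is a minimal sketch of reading heading ids in a way that tolerates both markup styles. It is not taken from Parsoid or MediaWiki code. The class and function names are hypothetical, and it assumes that Parsoid places the id directly on the heading element, which this page does not state.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingIdCollector(HTMLParser):
    """Hypothetical helper: collect one id per heading from either markup style."""

    def __init__(self):
        super().__init__()
        self.ids = []
        self._in_heading = False
        self._found = False  # True once an id was recorded for the current heading

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in HEADING_TAGS:
            self._in_heading = True
            self._found = False
            # Assumption: Parsoid-style output carries the id on the heading itself.
            if attrs.get("id"):
                self.ids.append(attrs["id"])
                self._found = True
        elif tag == "span" and self._in_heading and not self._found:
            # PHP parser style: <span class="mw-headline" id=".."> inside the heading.
            classes = (attrs.get("class") or "").split()
            if "mw-headline" in classes and attrs.get("id"):
                self.ids.append(attrs["id"])
                self._found = True

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS:
            self._in_heading = False


def heading_ids(html):
    """Return the heading ids found in an HTML fragment, in document order."""
    collector = HeadingIdCollector()
    collector.feed(html)
    collector.close()
    return collector.ids
```

Code written this way should not need to care which parser produced the HTML, as long as the ids themselves match, which is what the planned Parsoid update is meant to ensure.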

Differences identified via visual diff testing

We run mass visual diff tests that compare the rendering of Parsoid output against PHP parser output. This table will be filled out as we inspect the visual diffs and identify their underlying causes. In addition to the sources of diffs listed above, here are a few more specific ones that we discovered.

Difference: Missing resource modules in Parsoid output.
Explanation: http://sv.wikipedia.org/wiki/Mir has a bunch of modules (ext.gadget.*) which the Parsoid output is missing.
Bug / Proposed resolution: T161278
Status: In progress

Difference: Missing CSS resources on some wikis.
Explanation: svwiki and huwiki seem to be missing infobox and other CSS styles. It is not clear whether this is related to the previous entry about missing ext.gadget.* modules on svwiki.
Bug / Proposed resolution: T161546
Status: To do

Difference: Missing magnify icons in thumbnail images.
Explanation: Thumbnail images in PHP parser output have a magnify icon with HTML output of the form <div class="magnify"><a href=".." class="internal" title="Enlarge"></a></div>. Parsoid output is missing this. This difference is the source of a lot of visual diffs.
Bug / Proposed resolution: T160960. To hide this noise, we have worked around it in visual diff testing, so it will no longer show up in visual diff test results. But this still needs fixing in Parsoid or MediaWiki.
Status: To do

Difference: CSS differences.
Explanation: External links have missing CSS classes (T58756), and Cite output needs styling (T156351 and T156350). This should also cover the styling requirements for cite ref links: some wikis, like eswiki and frwiki, skip the brackets. In addition, knwiki (Kannada) uses Kannada numerals for the ref text.
Bug / Proposed resolution: The necessary styles for these various wikis are being added to the visual diffing code. Most of these styles are good candidates to be added to commons.css on the specific wikis. However, as part of this, we've also identified some limitations in the Cite CSS output. We'll have to figure out how to resolve that.
Status: In progress

Difference: Broken wikitext tables.
Explanation: Tables in fosterable positions get different fixups in Tidy vs. an HTML5 parser (RemexHTML, Parsoid).
Bug / Proposed resolution: T161341
Status: In progress

Difference: Broken p-wrapping.
Explanation: Tidy fixes up the PHP parser's doBlockLevels brokenness differently from an HTML5 parser (RemexHTML, Parsoid).
Bug / Proposed resolution: T161306, T134469
Status: In progress

Difference: Broken / missing support for some extensions.
Explanation: Pages extension output for wikisource pages is missing some wrapping divs (with associated styles). (Example) Pages on viwiki are missing mapframe / osm maps. (Example)
Bug / Proposed resolution: To be investigated
Status: To do
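This page doesn't say how the magnify-icon workaround in visual diff testing is implemented. As an illustration only, here is one hypothetical way to drop the magnify markup from PHP parser HTML before comparing it with Parsoid output. The function name and the regular expression are assumptions, and the pattern only matches the exact form quoted above (an empty link inside a div whose class attribute is exactly "magnify").

```python
import re

# Matches only the form quoted on this page:
# <div class="magnify"><a ...></a></div>
# This is a deliberately narrow pattern, not a general HTML parser.
MAGNIFY_RE = re.compile(r'<div class="magnify">\s*<a\b[^>]*>\s*</a>\s*</div>')


def strip_magnify(html):
    """Hypothetical helper: remove magnify-icon divs from PHP parser HTML."""
    return MAGNIFY_RE.sub("", html)
```

A regular expression is enough here only because the markup has a fixed shape. Anything that needs to handle nested or varying markup would be better served by a real HTML parser.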

See also