Parsoid/Parser Unification

MediaWiki on the Wikimedia cluster (and on several third-party MediaWiki installations) currently uses two separate wikitext parsers. One is the original core parser (the legacy parser), and the other is Parsoid. At present, the core parser is used for all desktop and mobile web read views. Parsoid serves all editing clients (VisualEditor, Structured Discussions, Content Translation), linting tools (Extension:Linter), some gadgets, the mobile apps, the Kiwix offline reader, Wikimedia Enterprise, and the Google knowledge graph project.

The goal of this project is to arrive at a single parser that supports all clients and use cases.

This is the high-level tracking page for the unification project, with links to other pages that provide additional details. Updates will continue to be published here. The project is primarily driven by the Content Transform team (previously the Parsing Team). Because parser unification will touch them all, it also involves the MediaWiki Platform team, all the internal teams that develop Parsoid clients, the Community Liaisons team, Wikimedia wiki editor communities, and third-party MediaWiki projects.

Project Goals
Longer Term Goal: Parsoid is the default wikitext engine for MediaWiki and the legacy parser is removed from the codebase

Intermediate Goal: Parsoid replaces the core parser for all wikitext use cases on the Wikimedia cluster.

How are we testing this change?
TODO: Need to elaborate a bit for each of these.
 * Parser tests
 * Integration tests
 * Visual diff tests
 * In addition, Parsoid's output has been exercised over the years through VE, the mobile apps, Kiwix, and other clients that already use Parsoid HTML.
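
To make the idea of a visual diff test concrete, here is a minimal sketch of the comparison step. It is not the project's actual infrastructure, which this page does not describe. It assumes both renderings of a page have already been rasterized into equal-sized grids of RGB pixels, and the names `diff_ratio` and `tolerance` are hypothetical.

```python
from typing import List, Tuple

# Hypothetical representation: an image is a list of rows of (R, G, B) tuples.
Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]


def diff_ratio(a: Image, b: Image, tolerance: int = 0) -> float:
    """Return the fraction of pixels that differ between two renderings.

    A pixel counts as different when any colour channel differs by more
    than `tolerance` (an assumed knob for ignoring anti-aliasing noise).
    """
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("images must have identical dimensions")

    total = sum(len(row) for row in a)
    if total == 0:
        return 0.0

    differing = 0
    for row_a, row_b in zip(a, b):
        for pixel_a, pixel_b in zip(row_a, row_b):
            if any(abs(ca - cb) > tolerance for ca, cb in zip(pixel_a, pixel_b)):
                differing += 1
    return differing / total


if __name__ == "__main__":
    white, black = (255, 255, 255), (0, 0, 0)
    legacy_render = [[white, white], [white, white]]
    parsoid_render = [[white, black], [white, white]]
    print(f"{diff_ratio(legacy_render, parsoid_render):.2%} of pixels differ")
```

A score of zero would mean the two renderings are pixel-identical. In practice a small non-zero threshold is usually chosen so that harmless differences are not reported as regressions.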

What is our deployment plan / strategy?
At this stage of the project, we have split the work into six steps to achieve the intermediate goal.


 * 1) Deploy changes to core that make the media structure HTML largely identical to what Parsoid emits. This has its own deployment plan. This change has been live on mediawiki.org and officewiki since September 2021, and we expect to roll it out to all wikis gradually in 2022.
 * 2) Deploy changes to Wikimedia production that let DiscussionTools use Parsoid HTML directly. This lets us iron out bugs in a restricted use case.
 * 3) Turn on Parsoid HTML read views on testwiki.
 * 4) Turn on Parsoid HTML read views on mediawiki.org and officewiki, and possibly wikitech.
 * 5) Ensure Parsoid is able to generate the same metadata that the legacy parser generates. This is needed for tighter integration of Parsoid into MediaWiki core and to start gradually replacing the legacy parser in wikitext use cases.
 * 6) Start rolling out gradually on all wikis. A more specific deployment plan will be developed based on what we learn in the previous stages.

How does this impact wikis?
The switch to Parsoid-generated HTML should be transparent to most users. Below, we outline some possible impacts on readers, editors, and developers.

Readers
Parsoid models and processes wikitext differently from the legacy parser, and this can lead to rendering differences in some edge cases. Where a wikitext pattern is commonly used, we have attempted to support it in Parsoid where possible; where that was not possible, we have either fixed the affected wikitext or provided support for fixing it. At this time, we believe all the rendering differences we expect to run into will be edge cases that can likely be resolved by fixing wikitext either on individual pages or in templates.

Editors and bot, gadget, and skin developers

 * Parsoid's HTML for media wikitext is different from what the legacy parser has typically generated. As part of a separate project to use semantic HTML5 output for images, the legacy parser is currently being updated to generate HTML that is close to Parsoid's HTML. We expect to roll this out this year, which might require some skins, gadgets, bots, and template styles to be updated.
 * The Cite extension that targets Parsoid relies on CSS rules to localize the numbering of references rather than generating localized HTML. This requires editors with appropriate permissions to update MediaWiki:Common.css on their wikis to add suitable CSS rules targeting this HTML.

Extension developers
Parsoid's internal processing model is different from the legacy parser. To be completed.

What kind of support will we provide to impacted editor and developer communities?
The Content Transform Team is driving this project. Our goal is to make the switch to Parsoid as seamless as possible, so we have rolled out changes gradually over the years.

We started with replacing HTML4 Tidy with HTML5 RemexHtml in the 2015 - 2018 timeframe. In 2019, in preparation to integrate Parsoid into MediaWiki core more closely, we ported Parsoid from JS to PHP. This switch went very smoothly. In the 2020 - 2022 timeframe, we started work to unify the media output generated by Parsoid and by core. This has mostly involved making changes to core, but we have occasionally adjusted Parsoid's output based on feedback and other technical considerations.

Going forward, we expect to provide support in the following ways:


 * Linter
 * Communication via this page, via tech news updates, and via updates and posts to village pump and other wiki-specific forums.
 * Some kind of opt-in mechanism for early adopter users / wikis to test and report problems

Project Updates

 * July 2022:
    * Started work on adding i18n and l10n support in Parsoid.
    * (incomplete: to be updated)
 * 2022 (Jan - June):
    * Started evaluating Parsoid HTML against core HTML with respect to size differences, and identifying what parts (tags, annotations, attributes) of Parsoid HTML to strip so that Parsoid-HTML read views don't have serious impacts on bandwidth and client rendering latencies. Acceptable divergences will be selected in conversations with the Performance team.
    * Started work on making Kartographer compatible with Parsoid.
    * Updated Graph to be compatible with Parsoid.
    * Updated the MediaWiki core ParserTest runner to support running Parsoid tests in CI and development. This lets extensions that target Parsoid (especially those that operate on wikitext) test their implementation via parser tests.
    * (incomplete: to be updated with changes in core repo and any other extensions)
 * 2021:
    * Make extension functional with Parsoid
    * Substantial performance work to reduce wt2html transformation latencies.
    * Switched a few wikis to use Parsoid-style HTML for media wikitext. Complete rollout to all wikis is blocked on ironing out a number of other compatibility issues.
    * Experimental / prototyping work to switch to the Dodo DOM library from native PHP DOM. The project was put on hold indefinitely after running into performance issues.
    * Lots of bug fixes and fixes to edge case incompatibilities between Parsoid and the legacy parser.
    * (incomplete: to be updated with changes in core repo and any other extensions)
 * December 2020:
    * Finished addressing the bulk of functionality differences between the Parsoid and core versions of the Cite implementation.
    * Fixed parsing differences between Parsoid and core in the use of templates for table-cell attributes and the like.
    * Started migrating core output for media wikitext to use Parsoid-style output to reduce output differences between Parsoid and core.
    * Updated the parser tests framework to enable extension implementations to be tested against Parsoid. Finishing this lets us move extension code out of the Parsoid repository into the extension repositories and enables other extensions to start adding support for Parsoid and to run tests with Parsoid.
    * Recalibrated our original plans to migrate all extensions to be Parsoid-compatible to a smaller subset. We will initially target only those extensions that implement tag hooks or use parser hooks. Those that simply use public parser methods could continue to use the core parser for a while.
 * November 2020:
    * Introduced uniform error handling for extensions, with boilerplate code handled by Parsoid.
    * Identified resource module related issues in Parsoid output that result in rendering differences between Parsoid and core output.
    * Made the ImageMap extension Parsoid-compatible.
 * October 2020:
    * Several technical debt fixes, ending in using a single document per request with document fragments for nested pipeline parses.
    * Ongoing fixes to Parsoid's Cite implementation.
    * Identified CSS fixes to reduce rendering differences between Parsoid and core output.
    * Ongoing syncing and consultation with the Platform Engineering Team to upgrade the ParserCache infrastructure to accommodate Parsoid use cases and Parsoid clients in the future.
 * September 2020:
    * We continued outreach about Parsoid's Extension API and gathered feedback (emails on the wikitech-l and mediawiki-l lists).
    * We have been continuing to fix functionality gaps between Parsoid's Cite implementation and the default Cite implementation.
 * August 2020:
    * We have been publishing results of weekly visual diff runs comparing Parsoid rendering and core parser rendering.
    * We filed a TechCom RFC for Parsoid's Extension API. We also presented a Tech Talk about this.
    * Fixed some performance bugs, which greatly reduced out-of-memory errors seen in production.
 * July 2020:
    * We upgraded the visual diffing infrastructure and started initial test runs comparing Parsoid rendering and core rendering on a 25K sample of pages from a small set of wikis. We plan to run these tests every week and monitor progress as we fix rendering and functionality gaps. Results are accessible at http://parsoid-vs-core.wmflabs.org/.
    * We prepared Parsoid for the MediaWiki 1.35 LTS release so that Parsoid and VisualEditor can be used out of the box.
    * Reduced impedance mismatches between Parsoid and the core parser with respect to the use of concepts around HTML4 block / inline tag notions.
    * Work in Progress:
       * Integrating parser test infrastructure between Parsoid and core.
       * Enabling extension tests to be run against Parsoid.
 * April - June 2020:
    * We started addressing functionality gaps in Parsoid (especially error handling in the Cite extension) and rendering differences (output for media wikitext).
    * We fine-tuned the Parsoid Extension API further in preparation for wider consultation in the coming months.
    * We implemented a registration mechanism for extensions to hook into Parsoid.
 * Jan - March 2020:
    * We addressed a bunch of technical debt incurred during the porting and integrated Parsoid more closely with MediaWiki core.
    * Starting March 2020, Parsoid is deployed as part of the weekly MediaWiki train.
    * We also started drafting a Parsoid Extension API for extensions to hook directly into Parsoid.
 * 2019:
    * Around the end of January, we started porting Parsoid to PHP, and by the end of the year we had successfully completed the project by deploying Parsoid/PHP to the Wikimedia cluster and serving all traffic from it.
    * This blog post provides a good overview of the porting project.
 * 2018:
    * Early experimentation, prototyping and preparation to port Parsoid from JavaScript to PHP. This was low-key background work.
 * 2015 - 2018:
    * HTML4 Tidy was replaced with HTML5 RemexHtml. Besides upgrading the MediaWiki infrastructure, this also eliminated one of the biggest sources of rendering differences between Parsoid and the core parser. This blog post is a good overview of the reasons for this replacement and the process of doing this.

Related docs

 * The Pixel_Diff_Testing_Stats page tracks progress in measuring (and achieving) rendering parity between Parsoid and core parser output
 * Parsoid Performance Considerations outlines the performance work needed for Parsoid and the acceptance criteria for deployment
 * February 2019 tech talk: The long and winding road to making Parsoid the default MediaWiki parser
 * Known differences between Parsoid & core parser output
 * Replacing Tidy: Project page related to replacing Tidy
 * Historical document: Parsing/Notes/Two Systems Problem. This 2016 document explored different options for arriving at a single parser