Parsoid/Parser Unification

MediaWiki currently has two separate wikitext parsers in use on the Wikimedia cluster (and on several third-party MediaWiki installations): the original core parser and Parsoid. The core parser is used for all desktop and mobile web read views. Parsoid serves all editing clients (VisualEditor, Structured Discussions, Content Translation), linting tools (Extension:Linter), some gadgets, the mobile apps, the Kiwix offline reader, Wikimedia Enterprise, and the Google knowledge graph project.

The goal of this project is to arrive at a single parser that supports all clients and use cases.

This page is the high-level tracking page for the unification project and links to other pages with additional details. Updates will continue to be published here. The project is primarily driven by the Content Transform team (previously the Parsing Team), with participation from the MediaWiki Platform team, all the internal teams that develop Parsoid clients, the Community Liaisons team, Wikimedia wiki editor communities, and third-party MediaWiki projects, since parser unification will touch them all.

Project Goals
Longer Term Goal: Parsoid is the default wikitext engine for MediaWiki and the legacy parser is removed from the codebase.

Intermediate Goal: Parsoid replaces the core parser for all wikitext use cases on the Wikimedia cluster.

Updates

 * July 2022:
 * Started work on adding i18n and l10n support in Parsoid.
 * (incomplete: to be updated)
 * 2022 (Jan - June):
 * Started evaluating size differences between Parsoid HTML and core HTML, and identifying which parts of Parsoid HTML (tags, annotations, attributes) to strip so that Parsoid-HTML read views don't seriously affect bandwidth and client rendering latency. Acceptable divergences will be selected in conversations with the Performance team.
 * Started work on making Kartographer compatible with Parsoid.
 * Updated Graph to be compatible with Parsoid.
 * Updated the MediaWiki core ParserTest runner to support running Parsoid tests in CI and development. This lets extensions that target Parsoid (especially those that operate on wikitext) test their implementation via parser tests.
 * (incomplete: to be updated with changes in core repo and any other extensions)
 * 2021:
 * Make extension functional with Parsoid
 * A whole bunch of performance work to reduce wt2html transformation latencies
 * Switched a few wikis to use Parsoid-style HTML for media wikitext. Complete rollout to all wikis blocked on ironing out a number of other compatibility issues.
 * Experimental / prototyping work to switch to the Dodo DOM library from native PHP DOM. The project was put on hold indefinitely after running into performance issues.
 * Lots of bug fixes and fixes to edge case incompatibilities between Parsoid and legacy parser.
 * (incomplete: to be updated with changes in core repo and any other extensions)
 * December 2020:
 * Finished addressing bulk of functionality differences between Parsoid and core version of Cite implementation
 * Fixed parsing differences between Parsoid and core in use of templates for table-cell attributes and the like
 * Started migrating core output for media wikitext to use Parsoid-style output to reduce output differences between Parsoid and core
 * Updated the parser tests framework so that extension implementations can be tested against Parsoid. This lets us move extension code out of the Parsoid repository into the extension repositories, and lets other extensions start adding Parsoid support and running their tests with Parsoid
 * Scaled back our original plan to make all extensions Parsoid-compatible to a smaller subset. We will initially target only those extensions that implement tag hooks or use parser hooks. Those that simply use public parser methods can continue to use the core parser for a while
 * November 2020:
 * Introduced uniform error handling for extensions with boilerplate code handled by Parsoid
 * Identified resource module related issues in Parsoid output that result in rendering differences between Parsoid and core output
 * Made ImageMap extension Parsoid-compatible
 * October 2020:
 * Several technical debt fixes ending in using a single document per request with document fragments for nested pipeline parses
 * Ongoing fixes to Parsoid's Cite implementation
 * Identified CSS fixes to reduce rendering differences between Parsoid and core output
 * Ongoing syncing and consultation with the Platform Engineering Team to upgrade the ParserCache infrastructure to accommodate Parsoid use cases and Parsoid clients in the future
 * September 2020:
 * We continued outreach about Parsoid's Extension API and gathered feedback (emails on the wikitech-l and mediawiki-l lists).
 * We have been continuing to fix functionality gaps between Parsoid's Cite implementation and the default Cite implementation.
 * August 2020:
 * We have been publishing results of weekly visual diff runs comparing Parsoid rendering and core parser rendering here.
 * We filed a TechCom RFC for Parsoid's Extension API. We also presented a Tech Talk about this.
 * Fixed some performance bugs which greatly reduced out-of-memory errors seen in production.
 * July 2020:
 * We upgraded the visual diffing infrastructure and started initial test runs comparing Parsoid rendering and core rendering on a 25K sample of pages from a small set of wikis. We plan to run these tests every week and monitor progress as we fix rendering and functionality gaps. Results are accessible at http://parsoid-vs-core.wmflabs.org/.
 * We prepared Parsoid for the MediaWiki 1.35 LTS release so that Parsoid and VisualEditor can be used out of the box.
 * Reduced impedance mismatches between Parsoid and the core parser with respect to HTML4 block / inline tag notions.
 * Work in Progress:
 * Integrating parser test infrastructure between Parsoid and core.
 * Enabling extension tests to be run against Parsoid.
 * April - June 2020:
 * We started addressing functionality gaps in Parsoid (especially error handling in the Cite extension) and rendering differences (output for media wikitext).
 * We fine-tuned the Parsoid Extension API further in preparation for wider consultation in the coming months.
 * We implemented a registration mechanism for extensions to hook into Parsoid.
 * Jan - March 2020:
 * We addressed a bunch of technical debt incurred during the porting and integrated Parsoid closer with MediaWiki core.
 * Starting March 2020, Parsoid is deployed as part of the weekly MediaWiki train.
 * We also started drafting a Parsoid Extension API for extensions to hook directly into Parsoid.
 * 2019:
 * Around the end of January, we started porting Parsoid to PHP, and by the end of the year we had completed the project by deploying Parsoid/PHP to the Wikimedia cluster and serving all traffic from it.
 * This blog post provides a good overview of the porting project.
 * 2018:
 * Early experimentation, prototyping and preparation to port Parsoid from JavaScript to PHP. This was low-key background work.
 * 2015 - 2018:
 * HTML4 Tidy was replaced with HTML5 RemexHtml. Besides upgrading the MediaWiki infrastructure, this eliminated one of the biggest sources of rendering differences between Parsoid & the core parser. This blog post is a good overview of the reasons for this replacement and the process of doing this.

Related docs

 * The Pixel_Diff_Testing_Stats page tracks progress in measuring (and achieving) rendering parity between Parsoid and core parser output
 * Parsoid Performance Considerations outlines performance work needed for Parsoid and acceptance criteria for deployment
 * February 2019 tech talk: The long and winding road to making Parsoid the default MediaWiki parser
 * Known differences between Parsoid & core parser output
 * Replacing Tidy: Project page related to replacing Tidy
 * Historical document: Parsing/Notes/Two Systems Problem. This 2016 document explored different options for arriving at a single parser