Parsoid/Parser Unification

MediaWiki currently has two separate wikitext parsers in use on the Wikimedia cluster (and on several third-party MediaWiki installations): the original core parser and Parsoid. The core parser is used for all desktop and mobile web read views. Parsoid serves all editing clients (VisualEditor, Structured Discussions, Content Translation), linting tools (Extension:Linter), the mobile apps, the Kiwix offline reader, and the Google knowledge graph project.

The goal of this project is to arrive at a single parser that supports all clients and use cases.

This page serves as the high-level tracking page for the unification project, with links to other pages that have additional details. Updates will continue to be published here. The project is primarily driven by the Parsing team, with participation from the MediaWiki Platform team, all the internal teams that develop Parsoid clients, the Community Liaisons team, Wikimedia wiki editor communities, and third-party MediaWiki projects, since parser unification will touch them all.

Project Goals
Longer-Term Goal (beyond 2021): Parsoid is the default wikitext engine for MediaWiki.

Intermediate Goal (by end of 2021): Parsoid replaces the core parser for all wikitext use cases on the Wikimedia cluster.

Updates

 * July 2020:
   * We upgraded the visual diffing infrastructure and started initial test runs comparing Parsoid rendering and core rendering on a sample of 25K pages from a small set of wikis. We plan to run these tests every week and monitor progress as we fix rendering and functionality gaps. Results are accessible at http://parsoid-vs-core.wmflabs.org/.
   * We prepared Parsoid for the MediaWiki 1.35 LTS release so that Parsoid and VisualEditor can be used out of the box.
   * We reduced impedance mismatches between Parsoid and the core parser with respect to their use of HTML4 block / inline tag notions.
   * Work in progress:
     * Integrating parser test infrastructure between Parsoid and core.
     * Enabling extension tests to be run against Parsoid.
 * April - June 2020:
   * We started addressing functionality gaps in Parsoid (especially error handling in the Cite extension) and rendering differences (output for media wikitext).
   * We fine-tuned the Parsoid Extension API further in preparation for wider consultation in the coming months.
   * We implemented a registration mechanism for extensions to hook into Parsoid.
 * Jan - March 2020:
   * We addressed a bunch of technical debt incurred during the porting and integrated Parsoid more closely with MediaWiki core.
   * Starting March 2020, Parsoid is deployed as part of the weekly MediaWiki train.
   * We also started drafting a Parsoid Extension API for extensions to hook directly into Parsoid.
 * 2019:
   * Around the end of January, we started porting Parsoid to PHP, and by the end of the year we had completed the project by deploying Parsoid/PHP to the Wikimedia cluster and serving all traffic from it.
   * This blog post provides a good overview of the porting project.
 * 2018:
   * Early experimentation, prototyping, and preparation to port Parsoid from JavaScript to PHP. This was low-key background work.
 * 2015 - 2018:
   * HTML4 Tidy was replaced with HTML5 RemexHtml. Besides upgrading the MediaWiki infrastructure, this also eliminated one of the biggest sources of rendering differences between Parsoid and the core parser. This blog post is a good overview of the reasons for this replacement and the process of carrying it out.

Related docs

 * Parsing/Parser_Unification/Pixel_Diff_Testing_Stats tracks progress in measuring (and achieving) rendering parity between Parsoid and core parser output
 * Parsoid Performance Considerations outlines the performance work needed for Parsoid and the acceptance criteria for deployment
 * February 2019 tech talk: The long and winding road to making Parsoid the default MediaWiki parser
 * Known differences between Parsoid & core parser output
 * Replacing Tidy: Project page related to replacing Tidy
 * Historical document: Parsing/Notes/Two Systems Problem. This 2016 document explored different options for arriving at a single parser.