Parsoid/Parser Unification

Currently, we have two separate wikitext parsers that are used in MediaWiki on the Wikimedia cluster (and several other third-party MediaWiki installations). One is the original core parser (legacy parser), and the other is Parsoid. At present, the core parser is used for all desktop and mobile web read views. Parsoid is currently used to serve all editing clients (VisualEditor, Structured Discussions, Content Translation), linting tools (Extension:Linter), some gadgets, mobile apps, Kiwix offline reader, Wikimedia Enterprise, and the Google knowledge graph project.

The goal of this project is to arrive at a single parser that supports all clients and use cases.

This page will serve as a high-level page for tracking this unification project with links to other pages with additional details. Updates will continue to be published on this page. This project is primarily driven by the Content Transform team (previously Parsing Team) with participation from the MediaWiki Platform team, all the internal teams that develop Parsoid clients, m:Community Relations Specialists team, Wikimedia wiki editor communities, and third party MediaWiki projects since this parser unification will touch them all.

Project Goals
Longer Term Goal: Parsoid is the default wikitext engine for MediaWiki and the legacy parser is removed from the codebase

Intermediate Goal: Parsoid replaces the core parser for all wikitext use cases on the Wikimedia cluster.

How are we testing this change?
TODO: Need to elaborate a bit for each of these.
 * Parser tests
 * Integration tests
 * Visual diff tests
 * Also note that this output has been prototyped over the years through VE, mobile apps, Kiwix, and other clients that already use Parsoid HTML.

What is our deployment plan / strategy?
At this stage of this project, we have split this work into 6 steps to achieve the intermediate goal.


 * 1) Deploy changes to core that makes media structure HTML largely identical to what Parsoid emits. This has its own deployment plan. This change has been live on mediawiki.org and officewiki since September 2021 and we expect to roll this out to all wikis gradually in 2022.
 * 2) Deploy changes to Wikimedia production that lets DiscussionTools use Parsoid HTML directly. This lets us iron out bugs in a restricted use case.
 * 3) Turn on Parsoid HTML read views on testwiki.
 * 4) Turn on Parsoid HTML read views on mediawiki.org and officewiki and maybe wikitech
 * 5) Ensure Parsoid is able to generate identical metadata that the legacy parser generates. This is needed for tighter integration of Parsoid into MediaWiki core and to start replacing the legacy parser in wikitext use cases gradually.
 * 6) Start rolling out on all wikis gradually -- more specific deployment plan will be developed based on what we learn in previous stages.

How does this impact wikis?
For the most part, the switch to Parsoid generated HTML should be transparent to most users. But, below, we outline some possible impacts on readers, editors, and developers.

Readers
Parsoid models and processes wikitext differently compared to the legacy parser and this can sometimes lead to differences in rendering in some edge case scenarios. If some wikitext pattern is commonly used, we have attempted to support that in Parsoid where possible, and where not, by either fixing or providing support to fix them up. At this time, we believe all rendering differences we expect to run into will be edge cases that can likely be adjusted by fixing wikitext either on individual pages or on templates.

Editors and bot, gadget, skin developers

 * Parsoid's HTML for media wikitext is different from what the legacy parser has typically generated. As part of a separate project to use semantic HTML5 output for images, the legacy parser is currently being updated to generate HTML that is pretty close to Parsoid's HTML. We expect to roll this out this year which might require some skins, gadgets, bots, and template styles to be updated.
 * The Cite extension that targets Parsoid relies on CSS rules to localize numbering of references rather than generate localized HTML. This requires editors with appropriate permissions to update MediaWiki:Common.css on their wikis to add suitable CSS rules targeting this HTML.

Extension developers
Parsoid's internal processing model is different from the legacy parser. To be completed.

What kind of support will we provide to impacted editor and developer communities?
The Content Transform Team is driving this project. Our goal is to make this switch to Parsoid as seamless as possible. So, we have tried to roll out changes over the years gradually.

We started with replacing HTML4 Tidy with HTML5 RemexHtml in the 2015 - 2018 timeframe. In 2019, in preparation to integrate Parsoid into MediaWiki core more closely, we ported Parsoid from JS to PHP. This switch went very smoothly. In the 2020 - 2022 timeframe, we started work to unify the media output generated by Parsoid and by core. This has mostly involved making changes to core, but we have occasionally adjusted Parsoid's output based on feedback and other technical considerations.

Going forward, we expect to provide support in the following ways:


 * Linter
 * Communication via this page, via tech news updates, and via updates and posts to village pump and other wiki-specific forums.
 * Some kind of opt-in mechanism for early adopter users / wikis to test and report problems

Related docs

 * Parsoid/Parser_Unification/Updates Project updates
 * The Pixel_Diff_Testing_Stats page tracks progress in measuring (and achieving) rendering parity between Parsoid and core parser output
 * Parsoid Performance Considerations outlines performance work needed for Parsoid and acceptance criteria for deploy
 * February 2019 tech talk: The long and winding road to making Parsoid the default MediaWiki parser
 * Known differences between Parsoid & core parser output
 * Replacing Tidy: Project page related to replacing Tidy
 * Historical document: Parsing/Notes/Two Systems Problem. This 2016 document explored different options at arriving at a single parser