Parsoid/Parser Unification

Currently, we have two separate wikitext parsers that are used in MediaWiki on the Wikimedia cluster (and several other third-party MediaWiki installations). One is the original core parser (legacy parser), and the other is Parsoid. At present, the core parser is used for all desktop and mobile web read views. Parsoid is currently used to serve all editing clients (VisualEditor, Structured Discussions, Content Translation), linting tools (Extension:Linter), some gadgets, mobile apps, Kiwix offline reader, Wikimedia Enterprise, and the Google knowledge graph project.

The goal of this project is to arrive at a single parser that supports all clients and use cases.

This page is the high-level tracking page for this unification project and links to other pages with additional details. Updates will continue to be published here. The project is primarily driven by the Content Transform team (previously the Parsing Team) with participation from the MediaWiki Platform team, all the internal teams that develop Parsoid clients, the m:Community Relations Specialists team, Wikimedia wiki editor communities, and third-party MediaWiki projects, since this parser unification will touch them all.

Project Goals
Longer Term Goal: Parsoid is the default wikitext engine for MediaWiki and the legacy parser is removed from the codebase.

Intermediate Goal: Parsoid replaces the core parser for all wikitext use cases on the Wikimedia cluster.

How are we testing this change?
We currently rely on the testing approaches listed below. As we get closer to rollout, we will identify other QA and testing methodologies as needed to roll out this change as smoothly and with as little disruption as possible, and we will update this page as that happens.
 * Parser tests: This is how Parsoid has been developed since its inception. We ensure that Parsoid continues to pass parser tests, and where divergence is known, it is recorded after careful review. We have also vastly expanded parser test coverage over the years, and all patches against Parsoid need to pass tests.
 * Round-tripping / Integration tests: In this mode, before every production deployment, we convert wikitext to HTML and HTML back to wikitext on about 180K pages from about 50 production wikis. This testing mode primarily ensures that our HTML -> wikitext conversion is not broken (which would impact our editing client tools), but it also implicitly flags breakages in our HTML output. However, these are not the most reliable tests for verifying that our HTML output is not broken.
 * Visual diff tests: Here, we take renderings of legacy parser HTML and Parsoid HTML and compare the rendering screenshots and generate a numeric diff score. We have run this in an automated way on 25k+ pages from about 20 production wikis. This has been a really reliable way to identify various breakages and bugs in Parsoid output. As we get closer to rollout, we intend to expand our testing to a wider range of wikis.
 * Parsoid reading and editing clients: Parsoid's output has been used over the years by VisualEditor, Android and iOS mobile apps, Kiwix, and other clients. We have fixed a number of bugs and incompatibilities in Parsoid over the years and continue to fix the various long-tail edge cases as they are discovered and reported.
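
The round-trip and visual diff checks described above can be illustrated with a minimal Python sketch. It is not the project's actual test harness: the function names, the `wt2html` and `html2wt` callables, and the representation of a screenshot as a grid of grayscale values are all assumptions made for illustration, and the real visual diff score may be computed differently.

```python
import difflib
from typing import Callable, List, Sequence


def roundtrip_diff(
    wikitext: str,
    wt2html: Callable[[str], str],
    html2wt: Callable[[str], str],
) -> List[str]:
    """Convert wikitext -> HTML -> wikitext and return a unified diff.

    An empty list means the page round-tripped without changes.
    wt2html and html2wt are hypothetical stand-ins for real converters.
    """
    result = html2wt(wt2html(wikitext))
    return list(
        difflib.unified_diff(
            wikitext.splitlines(),
            result.splitlines(),
            fromfile="original",
            tofile="round-tripped",
            lineterm="",
        )
    )


def pixel_diff_score(
    a: Sequence[Sequence[int]], b: Sequence[Sequence[int]]
) -> float:
    """Return the percentage of pixels that differ between two renderings.

    Each rendering is modelled as an equally sized grid of grayscale values
    (an assumption; real screenshots would be decoded image files).
    """
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("renderings must have the same dimensions")
    total = sum(len(row) for row in a)
    if total == 0:
        return 0.0
    differing = sum(
        pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb)
    )
    return 100.0 * differing / total


if __name__ == "__main__":
    # Toy converters that only understand bold markup (illustration only).
    def toy_wt2html(wt: str) -> str:
        return wt.replace("'''", "<b>", 1).replace("'''", "</b>", 1)

    def toy_html2wt(html: str) -> str:
        return html.replace("<b>", "'''").replace("</b>", "'''")

    print(roundtrip_diff("'''bold''' text", toy_wt2html, toy_html2wt))  # []
    print(pixel_diff_score([[0, 0], [0, 0]], [[0, 255], [0, 0]]))  # 25.0
```

A non-empty diff from `roundtrip_diff` points at a page whose wikitext changed during conversion, while a non-zero `pixel_diff_score` points at a page whose two renderings differ and may deserve a closer look.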

What is our deployment plan / strategy?
At this stage of the project, we have split the work needed to achieve the intermediate goal into six steps.


 * 1) Deploy changes to core that make media structure HTML largely identical to what Parsoid emits. This has its own deployment plan. This change has been live on mediawiki.org and officewiki since September 2021, and we expect to roll it out to all wikis gradually in 2022.
 * 2) Deploy changes to Wikimedia production that let DiscussionTools use Parsoid HTML directly. This lets us iron out bugs in a restricted use case.
 * 3) Turn on Parsoid HTML read views on testwiki.
 * 4) Turn on Parsoid HTML read views on mediawiki.org and officewiki and maybe wikitech.
 * 5) Ensure Parsoid is able to generate the same metadata that the legacy parser generates. This is needed for tighter integration of Parsoid into MediaWiki core and to start replacing the legacy parser in wikitext use cases gradually.
 * 6) Start rolling out on all wikis gradually. A more specific deployment plan will be developed based on what we learn in the previous stages.

How does this impact wikis?
The switch to Parsoid-generated HTML should be transparent to most users. However, we outline below some possible impacts on readers, editors, and developers.

Readers
Parsoid models and processes wikitext differently from the legacy parser, and this can sometimes lead to rendering differences in edge cases. Where a wikitext pattern is commonly used, we have attempted to support it in Parsoid where possible; where that was not possible, we have either fixed the affected wikitext or provided support for fixing it. At this time, we believe all rendering differences we expect to run into will be edge cases that can likely be addressed by fixing wikitext either on individual pages or in templates.

Editors and bot, gadget, and skin developers

 * Parsoid's HTML for media wikitext is different from what the legacy parser has typically generated. As part of a separate project to use semantic HTML5 output for images, the legacy parser is currently being updated to generate HTML that is close to Parsoid's HTML. We expect to roll this out this year, which might require some skins, gadgets, bots, and template styles to be updated.
 * The Cite extension that targets Parsoid relies on CSS rules to localize numbering of references rather than generate localized HTML. This requires editors with appropriate permissions to update MediaWiki:Common.css on their wikis to add suitable CSS rules targeting this HTML.

Extension developers
Parsoid's internal processing model is different from the legacy parser. As a result, extensions may need to be updated. This only impacts extensions that do one or more of the following: (a) operate on wikitext (b) provide handlers for parser hooks (c) call a public method of the legacy parser.
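
As a rough aid for criterion (b), the following Python sketch lists the parser-related hooks an extension registers in its `extension.json`. It is a heuristic under stated assumptions rather than an official compatibility check: the function name is hypothetical, the rule "hook names starting with `Parser`, plus `InternalParseBeforeLinks`" is an assumed approximation, and a match only marks the extension as a candidate for review, not as broken under Parsoid.

```python
import json
from typing import List

# Assumption: parser-related hooks that do not start with "Parser".
EXTRA_PARSER_HOOKS = {"InternalParseBeforeLinks"}


def parser_hooks_in_manifest(manifest_json: str) -> List[str]:
    """Return the sorted parser-related hook names found in an extension.json.

    Heuristic only: a hook counts as parser-related if its name starts with
    "Parser" or appears in EXTRA_PARSER_HOOKS.
    """
    manifest = json.loads(manifest_json)
    hooks = manifest.get("Hooks", {})
    return sorted(
        name
        for name in hooks
        if name.startswith("Parser") or name in EXTRA_PARSER_HOOKS
    )


if __name__ == "__main__":
    # Hypothetical manifest for an imaginary extension.
    example = json.dumps(
        {
            "name": "ExampleExtension",
            "Hooks": {
                "ParserFirstCallInit": "ExampleHooks::onParserFirstCallInit",
                "BeforePageDisplay": "ExampleHooks::onBeforePageDisplay",
            },
        }
    )
    print(parser_hooks_in_manifest(example))  # ['ParserFirstCallInit']
```

Criteria (a) and (c) cannot be detected from the manifest alone and would still need a review of the extension's code.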

Extensions that process wikitext will definitely need to be updated to work with Parsoid. To date, the vast majority of such extensions have been updated. Since Parsoid continues to access the legacy parser for expanding templates and processing parser functions, any parser hooks triggered during this processing will continue to operate, and extensions that rely on these hooks will continue to work. For the rest, we are exploring strategies to minimize the updates needed to extensions.

We will file Phabricator tasks for all impacted extensions as we proceed with this work, and will fix whatever extensions we can within our team. If you are an extension developer, we would greatly appreciate any proactive work and code review (for patches we might submit).

What kind of support will we provide to impacted editor and developer communities?
The Content Transform Team is driving this project. Our goal is to make this switch to Parsoid as seamless as possible. To that end, we have rolled out changes gradually over the years.

We started by replacing HTML4 Tidy with HTML5 RemexHtml in the 2015 - 2018 timeframe. In 2019, in preparation for integrating Parsoid more closely into MediaWiki core, we ported Parsoid from JS to PHP. This switch went very smoothly. In the 2020 - 2022 timeframe, we started work to unify the media output generated by Parsoid and by core. This has mostly involved making changes to core, but we have occasionally adjusted Parsoid's output based on feedback and other technical considerations.

Going forward, we expect to provide support in the following ways:


 * Linter rules for any wikitext that needs fixing. The vast majority of this work was completed as part of the Tidy -> Remex migration, and we don't expect to introduce new linter categories here.
 * Communication via this page, via tech news updates, and via updates and posts to village pump and other wiki-specific forums.
 * Some kind of opt-in mechanism for early adopter users / wikis to test and report problems.

How can you help / be involved?
'''To be developed. Some rough draft ideas:'''


 * Update MediaWiki:Common.css on your wiki to ensure citations continue to render properly with Parsoid's output for Cite. Custom CSS styles for different wikis are available and can be added today. If your wiki isn't listed, you should be able to adapt that CSS to your wiki. Feel free to ask us for help.
 * Test your gadgets / user scripts against Parsoid HTML to identify / fix any breakages.
 * Maybe point to gadgets that let wiki users opt in to viewing Parsoid HTML on desktop
 * Maybe develop user options / beta features to opt in to viewing Parsoid HTML on desktop
 * Early adopter wikis

Related docs

 * Parsoid/Parser Unification/Updates: Project updates
 * The Pixel_Diff_Testing_Stats page tracks progress in measuring (and achieving) rendering parity between Parsoid and core parser output
 * Parsoid Performance Considerations outlines performance work needed for Parsoid and acceptance criteria for deployment
 * February 2019 tech talk: The long and winding road to making Parsoid the default MediaWiki parser
 * Known differences between Parsoid & core parser output
 * Replacing Tidy: Project page related to replacing Tidy
 * Historical documents:
 * Parsing/Notes/Two Systems Problem. This 2016 document explored different options for arriving at a single parser.
 * Parsing/Notes/Moving Parsoid Into Core