Parsoid/Parser Unification

We currently have two separate wikitext parsers in use in MediaWiki on the Wikimedia cluster (and on several third-party MediaWiki installations). One is the original PHP parser; the other is Parsoid, which is written in JavaScript as a Node.js project and runs as an independent service. The PHP parser is used for all desktop read views and for iOS Wikipedia app views. Parsoid serves all editing clients (VisualEditor, Structured Discussions, Content Translation), linting tools (Extension:Linter), the Android Wikipedia app, the Kiwix offline reader, and the Google knowledge graph project.

The goal of this project is to arrive at a single parser that supports all clients and use cases.

This is the high-level page for tracking this unification project, with links to other pages that have additional details. Updates will continue to be published here. The project is primarily driven by the Parsing team with participation from the MediaWiki Platform team, all the internal teams that develop Parsoid clients, the Community Liaisons team, Wikimedia wiki editor communities, and third-party MediaWiki projects, since this parser unification will touch them all.

Updates
As of July 1, 2018, this work will be undertaken as part of the Platform Evolution CDP.

Q1 2018-2019
In this quarter, we will be preparing the Parsoid codebase for prototyping a port. Specifically, we will be working towards the following:


 * Implement unit testing and performance testing features: These features let us port individual token and DOM transformers, verify their correctness, and test their performance without needing a fully functional port.
 * Migrate more promises in Parsoid to use newer async/yield code patterns: the benefit of this code pattern is that the code reads as if it is synchronous code and is readily migratable to PHP.
 * Ensure that the PHP parser and Parsoid generate similar media output
 * Explore migrating media processing to a post-processing step: This frees the core parsing step from blocking on database access.
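
To illustrate the async/yield pattern mentioned in the list above, here is a minimal JavaScript sketch. All names in it (`fetchTitleInfo`, `runAsync`, `linkClassWithThen`, `linkClassWithYield`) are hypothetical and are not taken from Parsoid; `runAsync` is a simplified stand-in for whatever coroutine helper a codebase actually uses.

```javascript
// Hypothetical stand-in for an asynchronous lookup (e.g. a database or API call).
function fetchTitleInfo(title) {
  return Promise.resolve({ title, exists: title !== 'Missing' });
}

// Promise-chain style: the follow-up logic lives inside a callback.
function linkClassWithThen(title) {
  return fetchTitleInfo(title).then((info) => {
    return info.exists ? 'existing' : 'new';
  });
}

// Minimal coroutine runner (illustrative only): drives a generator and
// resumes it with the resolved value of each yielded promise.
function runAsync(genFn) {
  return function (...args) {
    const gen = genFn.apply(this, args);
    return new Promise((resolve, reject) => {
      function step(method, value) {
        let result;
        try {
          result = gen[method](value);
        } catch (err) {
          reject(err);
          return;
        }
        if (result.done) {
          resolve(result.value);
        } else {
          Promise.resolve(result.value).then(
            (v) => step('next', v),
            (e) => step('throw', e)
          );
        }
      }
      step('next', undefined);
    });
  };
}

// Generator style: the body reads top to bottom like synchronous code.
const linkClassWithYield = runAsync(function* (title) {
  const info = yield fetchTitleInfo(title);
  return info.exists ? 'existing' : 'new';
});
```

Both functions return a promise for the same value. The body of the generator version contains no callbacks, so it resembles the straight-line code one would write in a synchronous language such as PHP.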

January 2018 - June 2018
In this timeframe, we ran a number of early experiments to get a sense of the feasibility of a PHP port of Parsoid.

In January, we ported two DOM transformation passes of Parsoid in about 3 days (which included lots of googling to figure out the PHP equivalents of JavaScript functionality). So, we expect that porting the DOM passes themselves will be fairly straightforward. Since PHP's DOM implementation is C-backed, the performance of these DOM passes was actually better than that of the equivalent Node.js code. So, performance-wise, DOM passes are unlikely to be a bottleneck.

Separately, we added unit testing features to Parsoid to let us port, test, and benchmark individual token transformers without requiring all of Parsoid to be ported. In Q1, we will be porting this feature to PHP and then porting a couple of token transformers to get a handle on the complexity and performance of token transformers in PHP.
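
As a rough illustration of what testing and benchmarking a single token transformer in isolation can look like, here is a small JavaScript sketch. It is not Parsoid's actual test harness: the token shape, the toy transformer `mergeTextTokens`, and the helper `checkTransformer` are all assumptions made up for this example.

```javascript
// Hypothetical token shape: { type: string, value: string }.
// A toy transformer that merges adjacent text tokens into one.
function mergeTextTokens(tokens) {
  const out = [];
  for (const tok of tokens) {
    const prev = out[out.length - 1];
    if (tok.type === 'text' && prev && prev.type === 'text') {
      // Replace the previous entry with a new merged token (input is not mutated).
      out[out.length - 1] = { type: 'text', value: prev.value + tok.value };
    } else {
      out.push(tok);
    }
  }
  return out;
}

// Runs one transformer on its own: compares its output with the expected
// tokens and measures the average time per run.
function checkTransformer(transform, input, expected, iterations = 1000) {
  const actual = transform(input);
  const passed = JSON.stringify(actual) === JSON.stringify(expected);
  const start = process.hrtime.bigint();
  for (let i = 0; i < iterations; i++) {
    transform(input);
  }
  const elapsedNs = Number(process.hrtime.bigint() - start);
  return { passed, avgMicros: elapsedNs / iterations / 1000 };
}

const input = [
  { type: 'text', value: 'Hello, ' },
  { type: 'text', value: 'world' },
  { type: 'tag', value: 'br' },
  { type: 'text', value: '!' }
];
const expected = [
  { type: 'text', value: 'Hello, world' },
  { type: 'tag', value: 'br' },
  { type: 'text', value: '!' }
];

const report = checkTransformer(mergeTextTokens, input, expected);
console.log(report.passed ? 'PASS' : 'FAIL', report.avgMicros.toFixed(2) + ' us/run');
```

Because the harness only needs a transformer function and token fixtures, the same fixtures could in principle be replayed against a ported implementation to compare output and timing.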

Background
The two parsers use different internal processing models to convert wikitext to HTML.

The PHP parser is largely based on string manipulation via regular expressions with a goal of low latency conversion from wikitext to HTML.

Parsoid was born out of the VisualEditor project to support visual editing, which required bidirectional conversion between wikitext and HTML with additional constraints on the wikitext generated from edited HTML. In 2012, when this project was in its infancy, it wasn't fully clear how viable the entire project was and where it would go. Since then, Parsoid has proved to be a successful project in its own right and has supported a number of additional projects beyond VisualEditor.

Unification
Since around 2015, it has been clear that, in the long term, this two-parser situation is untenable and that we have to consolidate behind a single parser.

There are two aspects to arriving at a single parser.


 * Bridging the differing processing models and consequent output and feature differences between the two parsers
 * Addressing the language and architectural differences between the two parsers. The Parsing/Notes/Two Systems Problem page documents these differences and the various possible scenarios for what the unified parser could look like. If you are interested in more details, please check out that page.

We are tackling these two aspects / work categories concurrently.

Replacing the HTML4-based Tidy with the HTML5-based RemexHtml was one of the biggest projects under the first work category, and it has utility and purpose of its own beyond the parser unification project. In addition, we have been continuously addressing the long tail of incompatibilities between the two parsers while continuing to address editing client features and requests.

As for the second work category, after a lot of internal debate and discussion, we have started evaluating and prototyping a port of Parsoid into PHP. Please check the Parsing/Notes/Moving Parsoid Into Core page for more details and background about this aspect of the parser unification project.