Parsing/Notes/Moving Parsoid Into Core

Updates
As of July 1, 2018, this work will be undertaken as part of the Platform Evolution CDP.

Q1 2018-2019
In this quarter, we will be preparing the Parsoid codebase for prototyping a port. Specifically, here are a few things we'll be working towards.


 * Implement unit testing and performance testing features: These features let us port individual token and DOM transformers and verify correctness and test performance without needing a full functional port.
 * Migrate more promises in Parsoid to use newer async/yield code patterns: the benefit of this code pattern is that the code reads as if it is synchronous code and is readily migratable to PHP.
 * Explore migrating media processing to a post-processing step: This frees the core parsing step from blocking on database access.

January 2018 - June 2018
In this timeframe, we did a bunch of early experiments to get a sense of feasibility of a PHP port of Parsoid.

.... to be completed ....

Background
During the 2016 Parsing Team Offsite, the parsing team decided to evaluate a port of Parsoid into PHP so that it can be integrated into MediaWiki core. That was not a commitment to evaluate that immediately. But, for a bunch of reasons, that evaluation continued to be on the backburner. However, since then, a number of things have transpired which have introduced a lot of momentum behind attempting to move Parsoid into MediaWiki core. In the rest of this document, we are going to explore this direction including the why, how, risks and concerns.

Why move Parsoid into Core?
There are two somewhat independent reasons for this.

Architectural concerns
Parsoid was originally developed as an independent stateless service. The idea was to be able to make parsing independent of the rest of MediaWiki. In an ideal interface, MediaWiki would send the parsing service the page source and any additional information and get back HTML that could be further post-processed.

Unfortunately, that ideal has been hard to meet because of the nature of wikitext. As it is used today, wikitext doesn't have a processing model that lends itself to a clean separation of the parsing interface with the rest of MediaWiki. The input wikitext -> output HTML transformation depends on state that Parsoid does not have direct access to. To deal with this, Parsoid makes API calls into MediaWiki to fetch all this state - processed templates, extensions, media information, link information, bad images, etc.

Without additional cleanup to the semantics and processing model of wikitext, a cleaner separation of concerns is hard to accomplish. That will eventually happen and is on our radar. But, for now, without going into too many additional details, this service boundary around Parsoid as it exists today is architecturally inelegant. This service boundary is a source of a large number of API calls and introduces performance overheads in two ways: (a) network traffic (b) Mediawiki API startup overheads for every API call that Parsoid makes. In the ideal service boundary, most of these API calls would be serviced internally and would be a function call.

There are two ways to address the architectural boundary issue and transform most of the API calls into function calls. Either, (a) migrate more pieces of MediaWiki core into the parsing service, or (b) integrate the parsing service back into core. Solution (a) is a hard sell for a bunch of reasons. It is also a slippery slope since it is unclear how much code will have to get pulled out and what other code dependencies that might expose. Given that, if we want to address this architectural issue, for now the best solution is to move Parsoid back into core. But, ideally, this would be done in a way that abstracts the parser behind an interface that can be gradually refined to let us pull it out as a service again in the future, if desirable.

Third party MediaWiki installation concerns
Parsoid has been written in node.js (a) to minimize I/O blocking from API calls by leveraging async functionality (b) because PHP5 had much poorer performance compared to Javascript (c) because there were HTML5 parsing libraries readily available for Javascript but none for PHP.

But, this non-PHP service meant that third party MediaWiki users would have to install and configure a new component. In some cases (like shared hosting services), it might not even be possible to install node.js and hence Parsoid. There has been a lot of debate and discussion about this shared hosting limitation and how far we should go to support this case. As part of the MediaWiki Platform's team work, a clearer direction is likely to emerge with respect to this.

2018 is not 2012
There are some things different in 2012 compared to 2018.


 * Back in 2011/2012, when Parsoid was initiated, a PHP implementation was not viable. The original plan in 2012 was to eventually port the node.js prototype into C++ to get the best performance.
 * Unlike the PHP parser, Parsoid does a lot more work to provide the functionality and support for features that depend on it. In 2012, MediaWiki ran PHP5 which would not have been performant enough for Parsoid's design. In 2018, PHP7 has much better performance compared to PHP5.
 * In 2012, There was no HTML5 parsing library available in PHP. In 2018, there is RemexHtml, which has been developed with performance in mind.
 * In 2012, Parsoid was an experimental prototype designed to support the VisualEditor project with a tight deployment timeline. In that context, node.js was the best choice for that time since the feasibility of this project in terms of its ability to support all the wikitext out there was far from clear. In 2018, Parsoid is an established project which has proved its utility multiple times over and is on the way to supplanting the legacy PHP parser in MediaWiki.
 * In 2018, we have a much better sense of the core pieces of the parsing pipeline: PEG tokenizer, token transformers, HTML5 tree builder, DOM transformations. RemexHtml has a high-performance tree building implementation. PHP's DOM implementation is C-backed and has competitive, if not better, performance compared to domino's implementation. Performance of PHP-based token transformers and PEG tokenizer is unknown at this time and is something to be prototyped and evaluated.

Overall, given this change of context both from a PHP-implementation feasibility and Parsoid's utility point of view, we feel it is time to attempt a closer integration of Parsoid into MediaWiki via a PHP port that addresses both the architectural concerns as well as the third party installation concerns. In 2012, this would have been a questionable decision. Even in 2016, we didn't have sufficient clarity that this was the right decision. But, after a lot of discussions and some early experiments, we have a bit more understanding that gives us confidence that this might be feasible.

This direction is not without its risks. Let us now examine those and think about how we can address them.