Parsing/Notes/Moving Parsoid Into Core

Updates
As of July 1, 2018, this work will be undertaken as part of the Platform Evolution CDP.

Q1 2018-2019
In this quarter, we will be preparing the Parsoid codebase for prototyping a port. Specifically, here are a few things we'll be working towards.


 * Implement unit testing and performance testing features: These features let us port individual token and DOM transformers, then verify correctness and measure performance without needing a fully functional port.
 * Migrate more promise-based code in Parsoid to the newer async/yield pattern: Code written in this style reads as if it were synchronous, which makes it much easier to port to PHP.
 * Explore migrating media processing to a post-processing step: This frees the core parsing step from blocking on database access.
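The async/yield migration mentioned above can be illustrated with a small sketch. The function names here are hypothetical, not Parsoid APIs, and modern async/await is shown as a stand-in for Parsoid's generator-based (yield) idiom, which reads the same way:

```javascript
// Illustrative only: these functions are hypothetical, not Parsoid's API.

// Promise-chain style: control flow is scattered across .then() callbacks,
// which has no direct PHP analogue.
function countWordsChained(fetchSource) {
	return fetchSource().then(function(text) {
		return text.split(/\s+/).length;
	});
}

// async/await style: reads top-to-bottom like synchronous code, so a PHP
// port can replace each `await someCall()` with a plain blocking call.
async function countWordsAsync(fetchSource) {
	const text = await fetchSource();
	return text.split(/\s+/).length;
}
```

The two functions behave identically; the second form is the one that maps almost line-for-line onto synchronous PHP.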

January 2018 - June 2018
In this timeframe, we ran several early experiments to gauge the feasibility of a PHP port of Parsoid.

.... to be completed ....

Background
During the 2016 Parsing Team Offsite, the parsing team decided to evaluate a port of Parsoid to PHP so that it could be integrated into MediaWiki core. That was not a commitment to evaluate it immediately, and for a number of reasons the evaluation stayed on the back burner. Since then, however, several developments have built considerable momentum behind moving Parsoid into MediaWiki core. In the rest of this document, we explore this direction, including the why, the how, and the risks and concerns.

Why move Parsoid into Core?
There are two somewhat independent reasons for this.

Architectural concerns
Parsoid was originally developed as an independent stateless service (and was written in node.js -- we'll explore the reasons in a bit). The idea was to make parsing independent of the rest of MediaWiki. In the ideal interface, MediaWiki would send the parsing service the page source and any additional information, and get back HTML that could be further post-processed.

Unfortunately, that ideal has been hard to meet because of the nature of wikitext. As it is used today, wikitext doesn't have a processing model that lends itself to a clean separation of the parsing interface from the rest of MediaWiki. The input wikitext -> output HTML transformation depends on state that Parsoid does not have direct access to. To deal with this, Parsoid makes API calls into MediaWiki to fetch all this state: processed templates, extension output, media information, link information, bad images, and so on.

Without additional cleanup of the semantics and processing model of wikitext, a cleaner separation of concerns is hard to accomplish. That cleanup will eventually happen and is on our radar. But for now, without going into too many additional details, the service boundary around Parsoid as it exists today is architecturally inelegant. It is the source of a large number of API calls and introduces performance overhead in two ways: (a) network traffic, and (b) MediaWiki API startup costs for every API call that Parsoid makes. With an ideal service boundary, most of these API calls would be serviced internally as simple function calls.

There are two ways to address the architectural boundary issue and turn most of these API calls into function calls: either (a) migrate more pieces of MediaWiki core into the parsing service, or (b) integrate the parsing service back into core. Solution (a) is a hard sell for a number of reasons. It is also a slippery slope, since it is unclear how much code would have to be pulled out and what other code dependencies that might expose. Given that, if we want to address this architectural issue, the best solution for now is to move Parsoid back into core. Ideally, though, this would be done in a way that abstracts the parser behind an interface that can be gradually refined, letting us pull it out as a service again in the future if that proves desirable.
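The "abstract the parser behind an interface" idea can be sketched as follows. These class names and the endpoint path are hypothetical, purely to show the shape of the design; neither MediaWiki nor Parsoid defines them:

```javascript
// Hypothetical sketch. The point is the shape: if callers depend only on
// a narrow parse() interface, the backing implementation can be swapped
// between an in-process parser and a remote service without callers changing.

// In-process backend: the "API call" is just a function call.
class InProcessParser {
	parse(wikitext) {
		// Stand-in for the real wikitext -> HTML transformation.
		return Promise.resolve('<p>' + wikitext + '</p>');
	}
}

// A future service backend would implement the same method with an HTTP
// request, letting the parser be pulled back out as a service later.
class ServiceParser {
	constructor(fetchImpl) { this.fetch = fetchImpl; }
	parse(wikitext) {
		return this.fetch('/transform/wikitext/to/html', wikitext);
	}
}
```

Core code written against this interface would not care which backend is configured, which is what keeps the "service again someday" option open.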

Third party MediaWiki installation concerns
Parsoid was written in node.js (a) to minimize I/O blocking from API calls by leveraging its async functionality, (b) because PHP5 had much poorer performance than JavaScript, and (c) because HTML5 parsing libraries were readily available for JavaScript but none existed for PHP.
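Reason (a) above is the concurrency benefit, which a minimal sketch can make concrete (the function name is hypothetical): independent I/O requests can be issued together and awaited as a batch, so the total wait is roughly the slowest request rather than the sum of all of them.

```javascript
// Illustrative only. Each fetcher is a function returning a promise for
// some I/O result (e.g. an API response). Promise.all starts them all
// before waiting, so their latencies overlap instead of adding up.
async function fetchAllConcurrently(fetchers) {
	return Promise.all(fetchers.map(f => f()));
}
```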

But this non-PHP service meant that third-party MediaWiki users would have to install and configure a new component. In some cases (such as shared hosting services), it might not even be possible to install node.js, and hence Parsoid. There has been a lot of debate and discussion about this shared hosting limitation and how far we should go to support this case. As part of the MediaWiki Platform team's work, a clearer direction is likely to emerge on this front.

2018 is not 2012
Parsoid does a lot more work than the PHP parser to provide the functionality and features it supports. Back in 2011/2012, when Parsoid was initiated, a PHP implementation was not viable. But a number of things are different in 2018. RemexHtml is a PHP-only HTML5 parsing library that has been developed with performance in mind. PHP's DOM implementation is C-backed, so its performance is not a concern either. Lastly, PHP7's performance is much better than PHP5's.

... to be completed ...