Parsoid/Parser Unification/Performance

If Parsoid is to replace the Core Parser on the Wikimedia cluster, we need to ensure that it has adequate performance compared to the core parser wrt memory usage, HTML output size, and parse latencies. This page is a summary of discussions between Parsing (Subbu + Scott), Core Platform (Tim), and Performance (Timo + Dave) teams on July 22, 2020.

The consensus seems to be that the initial focus of performance work is going to be primarily pages in the Article / Main namespace.

Memory usage
Given that Parsoid builds a DOM and operates on the DOM, its memory usage is going to be significantly higher compared to the core parser. The core parser was engineered for the early 2000s, when server memory was expensive and not as plentiful. Those memory constraints are not as relevant in 2020. So, there is no need to optimize Parsoid for low memory usage. Given that, the following considerations apply.


 * Since there is no requirement for Parsoid to match core parser memory usage, there is no need to establish a baseline metric.
 * Parsoid should not throw out-of-memory (OOM) errors on a substantially larger set of pages compared to the core parser. The ideal scenario would be one where Parsoid does not OOM on any page that the core parser does not OOM on.
 * We need to establish reasonable resource limits on wikitext usage - ideally something similar to or better than what the core parser supports. Given that, we can expand memory limits to fill available capacity to support those limits.
 * The above requires that Parsoid's memory usage increases linearly with used resources on any given piece of wikitext. Parsoid currently does not behave this way because of how Parsoid's PEG tokenizer backtracks to successfully tokenize a page and might end up filling available memory. Fixing this is likely going to be an active line of performance work in Parsoid.
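One rough way to check the linearity requirement above is to estimate a growth exponent from measurements. The following is a minimal sketch, not a description of any existing Parsoid tooling: the function name `growth_exponent` and the sample numbers are made up for illustration, and the input "size" could be any resource measure (bytes of wikitext, number of transclusions, and so on).

```python
import math


def growth_exponent(samples):
    """Least-squares slope of log(memory) against log(input size).

    samples: list of (input_size, peak_memory) pairs, all positive.
    A slope near 1 suggests roughly linear growth; a slope well above 1
    suggests super-linear growth, e.g. from tokenizer backtracking.
    """
    xs = [math.log(size) for size, _ in samples]
    ys = [math.log(mem) for _, mem in samples]
    n = len(samples)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Synthetic, illustrative data only (not real Parsoid measurements).
linear = [(n, 50 * n) for n in (1_000, 2_000, 4_000, 8_000)]
quadratic = [(n, n * n) for n in (1_000, 2_000, 4_000, 8_000)]

print(round(growth_exponent(linear), 3))
print(round(growth_exponent(quadratic), 3))
```

A fixed per-request memory overhead pulls the estimated slope below 1 on small inputs, so a check like this is more informative on larger pages.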

HTML output size
Given that Parsoid captures and represents semantic information in HTML (via wrapper tags, and attributes like rel, typeof, data-mw) as well as syntactic and other information needed for converting edited HTML back to wikitext faithfully (data-parsoid), the raw Parsoid output is much bigger than the output of core parser.

Parsoid does not inline the data-parsoid information in HTML. That information is currently stored separately in RESTBase and will be stored separately in ParserCache or some equivalent caching / storage layer over Parsoid.

That leaves us with semantic attributes. In order to not adversely impact network transfer times (especially in low-bandwidth environments), for all regular read views, Parsoid will need to strip the data-mw attribute out of the HTML and store it separately in the caching / storage layer as well. Parsoid has the ability to generate this stripped output, but before we can roll this out, editing clients like VisualEditor will need to be updated to fetch this information out-of-band on demand. This also requires the storage layer to allocate an additional storage bucket for data-mw in addition to data-parsoid. Getting this done is a blocker for deployment.
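To make the idea of stripping data-mw concrete, here is a minimal sketch using only the Python standard library. It is not Parsoid's implementation: the class and function names are hypothetical, keying the side store by element id is an assumption made for this example, and the re-serialization is deliberately simplified (doctypes and raw-text elements such as script are not handled).

```python
from html import escape
from html.parser import HTMLParser


class DataMwStripper(HTMLParser):
    """Re-serializes HTML while moving data-mw values into a side store."""

    def __init__(self):
        super().__init__()  # character references in text are decoded
        self.out = []        # pieces of the stripped HTML
        self.side_store = {}  # element id -> data-mw value (assumed keying)

    def _open_tag(self, tag, attrs, closer):
        element_id = dict(attrs).get("id")
        parts = [tag]
        for name, value in attrs:
            if name == "data-mw" and element_id is not None:
                # Move the attribute out of the HTML into the side store.
                self.side_store[element_id] = value
            elif value is None:
                parts.append(name)
            else:
                parts.append(f'{name}="{escape(value, quote=True)}"')
        self.out.append("<" + " ".join(parts) + closer)

    def handle_starttag(self, tag, attrs):
        self._open_tag(tag, attrs, ">")

    def handle_startendtag(self, tag, attrs):
        self._open_tag(tag, attrs, "/>")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text arrives decoded, so escape it again on output.
        self.out.append(escape(data, quote=False))

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")


def strip_data_mw(html):
    parser = DataMwStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out), parser.side_store


sample = ('<p id="mwAg" typeof="mw:Transclusion" '
          'data-mw="{&quot;parts&quot;:[]}">Hi &amp; bye</p>')
stripped, side = strip_data_mw(sample)
print(stripped)
print(side)
```

An editing client would then request the side store only when it actually needs the data-mw values, which is the out-of-band fetch described above.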

Even with stripped data-parsoid and data-mw attributes, Parsoid HTML is going to be somewhat larger than core parser output. However, this impact is likely to be in the single-digit increases, especially in the compressed format. The Parsing Team will need to coordinate with the Performance Team to conduct performance tests on the network payload size to confirm this hypothesis.
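A payload-size comparison of this kind could be sketched as follows. This is only an illustration under stated assumptions: the function name is hypothetical, gzip stands in for whatever compression the production stack actually applies, and the two HTML strings would come from rendering the same page with each parser.

```python
import gzip


def size_increase_percent(baseline_html, candidate_html, compressed=True):
    """Percent growth of candidate over baseline, raw or gzip-compressed."""

    def size(text):
        raw = text.encode("utf-8")
        return len(gzip.compress(raw)) if compressed else len(raw)

    base = size(baseline_html)
    return 100.0 * (size(candidate_html) - base) / base


# Placeholder strings; real inputs would be core parser and Parsoid output.
core_html = "<p>Example paragraph.</p>" * 200
parsoid_html = '<p id="mwAg">Example paragraph.</p>' * 200

print(size_increase_percent(core_html, parsoid_html, compressed=False))
print(size_increase_percent(core_html, parsoid_html, compressed=True))
```

Reporting both the raw and the compressed figure per page makes it easier to see how much of the extra markup compression absorbs.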

Parse Latencies
Given that Parsoid does a lot of work to generate semantic markup that is also editable and convertible back to wikitext without information loss or dirty diffs, Parsoid's parse latencies are expected to be higher than the core parser's latencies on equivalent hardware.

We currently don't have baseline metrics comparing Parsoid and PHP parser user latencies. However, on non-equivalent hardware on different Wikimedia production server clusters, with the caveat that Parsoid times out and OOMs on a much larger set of pages than the core parser, Parsoid has p95 latencies of 2.5s - 12s whereas the core parser has p95 latencies of 1.8s - 4.2s (both computed as movingAverage($metric, '1hour') over a 7-day window). So, while there is some work to be done, the variation is not substantial enough to be a concern.

However, it is not a performance goal to match Parsoid performance with core parser performance on equivalent hardware. Given that, here are some initial requirements:


 * Ideally, no parse timeouts on pages that the core parser doesn't time out on.
 * Capping worst case parse latencies. Maybe 30s for featured pages, and 60s for all pages. If necessary, we can utilize more powerful hardware for Parsoid to meet these performance benchmarks.
 * Ensuring some kind of linear increase of parse latencies wrt usage of wikitext resources (see note about resource limits in the memory usage section above).
 * Reasonable post-save parse latencies. No specific metric established at this time.
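The latency caps above could be checked against measured parse times with something like the sketch below. The 30s and 60s values are the tentative figures from the list, not agreed limits, and the function names and sample measurements are invented for illustration.

```python
# Tentative caps from the discussion ("maybe 30s ... and 60s").
FEATURED_CAP_S = 30.0
ALL_PAGES_CAP_S = 60.0


def over_cap(latency_s, is_featured):
    """True if a parse latency exceeds the cap that applies to the page."""
    cap = FEATURED_CAP_S if is_featured else ALL_PAGES_CAP_S
    return latency_s > cap


def cap_violations(measurements):
    """Titles of pages whose measured parse latency exceeds their cap.

    measurements: iterable of (title, latency_seconds, is_featured).
    """
    return [title for title, latency_s, is_featured in measurements
            if over_cap(latency_s, is_featured)]


# Made-up measurements, purely for illustration.
sample_measurements = [
    ("Featured page A", 12.4, True),
    ("Featured page B", 41.0, True),
    ("Large list page", 41.0, False),
    ("Huge table page", 75.5, False),
]
print(cap_violations(sample_measurements))
```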