Parsoid/reasoning behind Q1 2013 technical decisions

From mediawiki.org

In Q1 2013 we made some technical decisions in Parsoid which are reflected in Parsoid/Roadmap. This page discusses the reasoning behind these decisions.

Defer C++ port for now

The original plan was to speed up and integrate Parsoid by moving the implementation to C++. This implementation would have provided parallel template expansions complete enough to be a drop-in replacement for the PHP preprocessor. Having such an implementation, with its raw efficiency and integration potential, is still very desirable (and fun to write!), but with the available resources it would come at the cost of a long delay in Parsoid development. Working in parallel on the C++ port, on ongoing tweaks and bug fixes to the existing JS implementation for the VisualEditor (VE), and on HTML DOM storage and related optimizations does not appear realistic without a significant increase in developer resources.

If we reach the Foundation's goal of having VE as the default editor on all Wikipedias this summer, demand for VE-powered MediaWiki installs will probably be high outside Wikipedia too. Deferring the C++ port should free up some time that we can use to remove the dependency on Parsoid for VisualEditor-only wikis.

We will re-evaluate our decision on C++ in Q4 2013. If the performance gains we have achieved by then are not sufficient, we can dust off the C++ port to get the performance we need.

Use PHP preprocessor for template expansion

Calling the PHP preprocessor through the API gives us parallelism and full backwards compatibility (including Lua integration) without having to re-implement all this functionality. However, it could hamper VE's ability to edit template parameters directly in the template output (rather than at the template transclusion site), since Parsoid can no longer mark up template output to identify parameter substitution sites. This may not be a problem in practice, since there are several ways in which Parsoid can still mark up these parameter substitution sites in the PHP preprocessor's output.

Our inline caching of expanded HTML will minimize repeated calls to the API, so after an initial setup period, the number of preprocessor calls through the API should be quite small.
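The interplay of parallel expansion and caching can be illustrated with a minimal sketch. It is not Parsoid's implementation: the class name `ExpansionCache`, the `expand_fn` callable, and the stand-in `fake_preprocessor` are hypothetical, and an in-memory dictionary is assumed here in place of caching expansions inline in the stored HTML. The point is only that uncached transclusions are expanded concurrently, while repeated ones never reach the backend again.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List


class ExpansionCache:
    """Memoize template expansions so each distinct transclusion is expanded once."""

    def __init__(self, expand_fn: Callable[[str], str], max_workers: int = 4) -> None:
        # expand_fn is an assumption: any callable that sends wikitext to the
        # preprocessor (for example through the web API) and returns the expansion.
        self._expand_fn = expand_fn
        self._max_workers = max_workers
        self._cache: Dict[str, str] = {}
        self.backend_calls = 0  # counts calls that actually reached expand_fn

    def expand_all(self, transclusions: Iterable[str]) -> List[str]:
        wanted = list(transclusions)
        # De-duplicate while preserving order, then keep only uncached entries.
        missing = [t for t in dict.fromkeys(wanted) if t not in self._cache]
        if missing:
            # Expand all uncached transclusions in parallel.
            with ThreadPoolExecutor(max_workers=self._max_workers) as pool:
                results = list(pool.map(self._expand_fn, missing))
            self.backend_calls += len(missing)
            self._cache.update(zip(missing, results))
        return [self._cache[t] for t in wanted]


def fake_preprocessor(wikitext: str) -> str:
    """Hypothetical stand-in for the remote PHP preprocessor."""
    return "<span>" + wikitext.strip("{}") + "</span>"


if __name__ == "__main__":
    cache = ExpansionCache(fake_preprocessor)
    print(cache.expand_all(["{{a}}", "{{b}}", "{{a}}"]))
    print("backend calls:", cache.backend_calls)
```

In this sketch the second and later requests for the same transclusion are served from the cache, which mirrors the expectation that preprocessor calls become rare after an initial warm-up period.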

Save HTML DOM in the database

HTML can be expensive to generate: on some pages, it currently takes over 30 seconds. Fortunately, this delay can be hidden by performing the conversion asynchronously on save, before the HTML is actually needed. Both editing in the VisualEditor and our internal optimizations need quick access to this HTML, so we need a reliable storage location for it.

The compression and external store infrastructure around MediaWiki's text table already provides a lot of functionality that is desirable for our purposes. Storing HTML per revision lets us implement visual (HTML-based) diffs efficiently, and also lets us retrieve the HTML of old revisions when they are loaded into the VisualEditor. It also gives us the storage we need to experiment with VisualEditor-only wiki installations without a Parsoid dependency.
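As a rough illustration of converting on save and storing the result per revision, consider the following sketch. It makes several assumptions that are not part of the plan described here: `RevisionHtmlStore`, `on_save`, `get_html`, and `toy_convert` are hypothetical names, storage is an in-memory dictionary rather than the text table, and the converter is a trivial placeholder for the real wikitext-to-HTML conversion.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict


class RevisionHtmlStore:
    """Start HTML conversion when a revision is saved; keep one result per revision."""

    def __init__(self, convert: Callable[[str], str]) -> None:
        self._convert = convert
        self._pool = ThreadPoolExecutor(max_workers=2)
        # Maps revision id to the (possibly still running) conversion.
        self._html_by_rev: Dict[int, Future] = {}

    def on_save(self, rev_id: int, wikitext: str) -> None:
        # Returns immediately; the expensive conversion runs in the background.
        self._html_by_rev[rev_id] = self._pool.submit(self._convert, wikitext)

    def get_html(self, rev_id: int) -> str:
        # Blocks only if the conversion for this revision has not finished yet.
        return self._html_by_rev[rev_id].result()

    def close(self) -> None:
        self._pool.shutdown(wait=True)


def toy_convert(wikitext: str) -> str:
    """Hypothetical placeholder for the real wikitext-to-HTML conversion."""
    return "<p>" + wikitext + "</p>"


if __name__ == "__main__":
    store = RevisionHtmlStore(toy_convert)
    store.on_save(1, "First version")
    store.on_save(2, "Second version")
    print(store.get_html(1))  # HTML of an old revision stays retrievable
    print(store.get_html(2))
    store.close()
```

Because each revision keeps its own HTML, a reader or a diff tool in this sketch can fetch any two revisions without triggering a new conversion.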

Support simple VisualEditor-only wiki installs by storing HTML DOM only

Small and simple wikis that wish to use the VisualEditor exclusively don't really need wikitext and the Parsoid dependency it brings with it. Since the browser and editor both consume HTML and the editor returns modified HTML, storing plain HTML is a pretty obvious solution.

To be useful, an HTML-only wiki will also need a visual diff system to compare HTML documents. This might also be desirable for MediaWiki core in the long run, and using HTML-only early-adopter wikis to test and refine this functionality could turn out to be useful.

We are aware that HTML-only wikis might be controversial in some quarters. For big and established wikis like Wikipedia, it will make sense to store both wikitext and HTML in parallel for a long time. Smaller wiki installations, however, can experiment with and refine HTML-based storage in the meantime. Once HTML-based storage reaches maturity, switching the primary backend storage format for larger wikis should be hardly noticeable to users. For installations without wikitext storage, Parsoid can, if necessary, continue to provide wikitext-based editing by serializing stored HTML to wikitext and parsing the edited wikitext back to HTML for storage.
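The wikitext editing path over HTML-only storage can be sketched as a simple round trip. Everything in this example is an assumption made for illustration: the function names are hypothetical, and the two toy converters handle only bold markup, standing in for Parsoid's real serializer and parser.

```python
import re
from typing import Callable


def toy_html_to_wikitext(html: str) -> str:
    """Hypothetical serializer stand-in: handles only <b>...</b>."""
    return html.replace("<b>", "'''").replace("</b>", "'''")


def toy_wikitext_to_html(wikitext: str) -> str:
    """Hypothetical parser stand-in: handles only '''bold'''."""
    return re.sub(r"'''(.+?)'''", r"<b>\1</b>", wikitext)


def edit_as_wikitext(
    stored_html: str,
    edit: Callable[[str], str],
    serialize: Callable[[str], str] = toy_html_to_wikitext,
    parse: Callable[[str], str] = toy_wikitext_to_html,
) -> str:
    """Serialize stored HTML, let the user edit wikitext, parse the result back."""
    wikitext = serialize(stored_html)   # stored HTML -> wikitext for the editor
    edited = edit(wikitext)             # the user's change, made in wikitext
    return parse(edited)                # edited wikitext -> HTML for storage


if __name__ == "__main__":
    stored = "Hello <b>world</b>"
    new_html = edit_as_wikitext(stored, lambda wt: wt + " and '''more'''")
    print(new_html)
```

Only HTML is ever stored in this flow; wikitext exists just for the duration of the edit.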