Parsoid/reasoning behind Q1 2013 technical decisions

In Q1 2013 we made several technical decisions in Parsoid that resulted in Parsoid/RFC: Longer-term plan. This page discusses the reasoning behind these decisions.

Defer C++ port for now
The original plan was to speed up and integrate Parsoid by moving the implementation to C++. This implementation would provide parallel template expansions complete enough to be a drop-in replacement for the PHP preprocessor. Having such an implementation with its raw efficiency and integration potential is still very desirable (and fun to write!), but with the available resources it would come at the cost of a long delay in Parsoid development. Working on the C++ port, tweaks to the existing JS implementation for the VE, and HTML DOM storage with its related optimizations all in parallel does not appear realistic unless there is a significant surge in manpower.

If we reach the Foundation's goal of having the VE as the default editor on all Wikipedias this summer, demand for VE-powered MediaWiki installs will probably be high outside Wikipedia too. If we make good progress on an HTML DOM-based infrastructure in the meantime, HTML-only wikis with VE could be a possibility by then.

At some point in the (admittedly not very imminent) future, the role of Parsoid would probably change to a conversion tool and wikitext editor for HTML content, for which very high optimization might not be necessary any more.

Use PHP preprocessor for template expansion
Calling the PHP preprocessor through the API gives us parallelism and full backwards compatibility (including Lua integration) without having to duplicate all this functionality. On the flip side, we lose the potential capability to edit template parameters inline. If needed and no other way to provide it is found, we can still revisit this decision later.
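As a rough illustration of the parallelism mentioned above, the following Python sketch builds requests for MediaWiki's action=expandtemplates API module and issues them concurrently. It is not Parsoid's actual code (Parsoid is written in JS). The endpoint URL, the function names, and the injected `fetch` callable are assumptions made for the example, and the `prop=wikitext` parameter follows the current API documentation, which may differ from the API as it stood in 2013.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlencode

# Hypothetical endpoint; substitute the api.php URL of a real wiki.
API_URL = "https://example.org/w/api.php"


def build_expand_url(wikitext, title, api_url=API_URL):
    """Build an action=expandtemplates request URL for one transclusion."""
    params = {
        "action": "expandtemplates",
        "format": "json",
        "prop": "wikitext",
        "title": title,    # page context used when expanding
        "text": wikitext,  # the wikitext to expand
    }
    return api_url + "?" + urlencode(params)


def expand_all(transclusions, title, fetch, max_workers=4):
    """Expand several transclusions concurrently, preserving input order.

    `fetch` is any callable mapping a URL to a result. It is injected
    (an assumption of this sketch) so the code runs without a network.
    """
    urls = [build_expand_url(t, title) for t in transclusions]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))
```

In a real setup `fetch` would perform the HTTP request and extract the expanded text from the JSON response; here it is left to the caller.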

Our inline caching of expanded HTML will minimize repeated calls to the API, so after an initial setup period the number of preprocessor calls through the API should be quite small.
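The effect of such caching on the number of API calls can be sketched as follows. This page does not describe the cache's design, so the example substitutes a plain in-memory dictionary keyed by a hash of the transclusion source. The class name and the `expand` callable are hypothetical, and invalidation when a template changes is deliberately left out.

```python
import hashlib


class ExpansionCache:
    """Memoize expansions, keyed by a hash of the transclusion source."""

    def __init__(self, expand):
        self._expand = expand  # callable: wikitext -> expanded HTML
        self._store = {}
        self.misses = 0        # how often the expensive call was made

    def get(self, wikitext):
        key = hashlib.sha1(wikitext.encode("utf-8")).hexdigest()
        if key not in self._store:
            # Cache miss: this is the only place the preprocessor is called.
            self.misses += 1
            self._store[key] = self._expand(wikitext)
        return self._store[key]
```

After the first request for a given transclusion, repeated requests are served from the dictionary, so `misses` grows only with the number of distinct transclusions.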

Save HTML DOM in the database
HTML can be expensive to generate: on some pages it currently takes over 30 seconds. Fortunately, this delay can be hidden by performing the conversion asynchronously on save, before the HTML is actually needed. Both editing in the VisualEditor and our internal optimizations need quick access to HTML, so we need a reliable storage location for it.
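A minimal sketch of hiding the conversion delay, assuming a hypothetical `convert` callable that stands in for the wikitext-to-HTML conversion: the save hook only schedules the work and returns at once, and a later reader blocks only if the conversion has not finished yet. The class and method names are invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


class SaveHook:
    """Schedule HTML generation at save time, off the request path."""

    def __init__(self, convert, max_workers=2):
        self._convert = convert  # callable: wikitext -> HTML (assumed)
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._pending = {}       # rev_id -> Future holding the HTML

    def on_save(self, rev_id, wikitext):
        """Return immediately; the conversion runs in the background."""
        self._pending[rev_id] = self._pool.submit(self._convert, wikitext)

    def html_for(self, rev_id, timeout=None):
        """Return the HTML, waiting only if it is not ready yet."""
        return self._pending[rev_id].result(timeout=timeout)
```

A production setup would more likely use a persistent job queue and write the result to storage rather than keep futures in memory.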

The compression and external store infrastructure around MediaWiki's text table already provides a lot of functionality that is desirable for our purposes. Storing HTML per revision lets us implement visual (HTML-based) diffing efficiently, and also lets us retrieve the HTML of old revisions when they are loaded into the VisualEditor. It also gives us the storage we need to experiment with HTML-only wikis for Parsoid-less wiki installations.
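To make the idea of compressed per-revision storage concrete, here is a small self-contained sketch using SQLite and zlib. It does not reproduce MediaWiki's text table or external store. The table name `rev_html`, its columns, and the `HtmlStore` class are hypothetical.

```python
import sqlite3
import zlib


class HtmlStore:
    """Store one compressed HTML blob per revision ID."""

    def __init__(self, path=":memory:"):
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS rev_html ("
            "rev_id INTEGER PRIMARY KEY, html_blob BLOB NOT NULL)"
        )

    def put(self, rev_id, html):
        # Compress before storing; HTML markup compresses well.
        blob = zlib.compress(html.encode("utf-8"))
        self._db.execute(
            "INSERT OR REPLACE INTO rev_html (rev_id, html_blob) VALUES (?, ?)",
            (rev_id, blob),
        )
        self._db.commit()

    def get(self, rev_id):
        """Return the stored HTML for a revision, or None if absent."""
        row = self._db.execute(
            "SELECT html_blob FROM rev_html WHERE rev_id = ?", (rev_id,)
        ).fetchone()
        return None if row is None else zlib.decompress(row[0]).decode("utf-8")
```

Because each revision has its own row, the HTML of two revisions can be fetched directly for a diff, and an old revision can be loaded without re-running the conversion.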

Support simple VisualEditor-only wiki installs by storing HTML DOM only
Small and simple wikis that want to use the VisualEditor exclusively don't really need wikitext and the Parsoid dependency it brings with it. Since the browser and editor both consume HTML and the editor returns modified HTML, storing plain HTML is a pretty obvious solution.

To be useful, an HTML-only wiki will also need a visual diff system to compare HTML documents. This might also be desirable for MediaWiki core in the longer run, and using HTML-only early-adopter wikis to test and refine this functionality could turn out to be useful.

We are aware that HTML-only wikis might be controversial in some quarters. For big and established wikis, however, it will make sense to continue storing both wikitext and HTML in parallel for a long time. If at some future time the HTML-based alternatives and conversion routines have become so reliable that switching the primary storage format over to HTML makes sense, then probably nobody will notice. Even completely without HTML storage, Parsoid can be used to support wikitext-based editing.