Parsoid/reasoning behind Q1 2013 technical decisions

In Q1 2013 we made several technical decisions for Parsoid, which resulted in Parsoid/RFC: Longer-term plan. This page discusses the reasoning behind these decisions.

Defer C++ port for now
The original plan was to speed up and integrate Parsoid by moving the implementation to C++. This implementation would have provided parallel template expansions complete enough to be a drop-in replacement for the PHP preprocessor. Having such an implementation with its raw efficiency and integration potential is still very desirable (and fun to write!), but given the available resources it would come at the cost of a long delay in Parsoid development. Tackling the C++ port in parallel with ongoing tweaks and bug fixes to the existing JS implementation for the VisualEditor (VE), plus HTML DOM storage and related optimizations, does not appear realistic unless there is a significant surge in developer resources.

If we reach the Foundation's goal of having VE as the default editor on all Wikipedias this summer, demand for VE-powered MediaWiki installs will probably be high outside Wikipedia too. Deferring the C++ port should free up some time that we can use to remove the dependency on Parsoid for VisualEditor-only wikis.

We will re-evaluate our decision on C++ in Q4 2013. If the performance gains we have achieved by then are not sufficient, we can dust off the C++ port to get the performance we need.

Use PHP preprocessor for template expansion
Calling the PHP preprocessor through the API gives us parallelism and full backwards compatibility (including Lua integration) without having to re-implement all this functionality. On the flip side, we lose the potential capability to mark up (and thus edit) template parameters where they were substituted. If this capability turns out to be needed and we find no other way to provide it, we can still revisit this decision later.

Our inline caching of expanded HTML will minimize repeated calls to the API, so after an initial setup period the number of preprocessor calls through the API should be quite small.
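The combination of parallel preprocessor calls and caching described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class name `ExpansionCache`, the `expand` callback (standing in for whatever function actually calls the preprocessor through the API) and the in-memory dictionary are all hypothetical, and the dictionary is a simplification of the inline caching of expanded HTML mentioned here.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List


class ExpansionCache:
    """Hypothetical cache of expanded HTML, keyed by the template call's wikitext."""

    def __init__(self, expand: Callable[[str], str], max_workers: int = 4) -> None:
        # `expand` is an assumed callback that sends one template call
        # to the preprocessor and returns the expanded result.
        self._expand = expand
        self._cache: Dict[str, str] = {}
        self._max_workers = max_workers

    def expand_all(self, calls: Iterable[str]) -> List[str]:
        calls = list(calls)
        # Only distinct calls that are not cached yet need to reach the API.
        missing = [c for c in dict.fromkeys(calls) if c not in self._cache]
        if missing:
            # Issue the remaining expansions in parallel.
            with ThreadPoolExecutor(max_workers=self._max_workers) as pool:
                for call, html in zip(missing, pool.map(self._expand, missing)):
                    self._cache[call] = html
        # Return results in the order the calls appeared on the page.
        return [self._cache[c] for c in calls]
```

With a cache like this, a template call that appears several times on a page, or again in a later parse, triggers only one preprocessor request.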

Save HTML DOM in the database
HTML can be expensive to generate: on some pages it currently takes over 30 seconds. Fortunately, this delay can be hidden by performing the conversion asynchronously on save, before the HTML is actually needed. Both editing in the VisualEditor and our internal optimizations need quick access to HTML, so we need a reliable storage location for it.

The compression and external store infrastructure around MediaWiki's text table already provides a lot of functionality that is desirable for our purposes. Storing HTML per revision lets us implement visual (HTML-based) diffing efficiently, and also lets us retrieve the HTML of old revisions when they are loaded into the VisualEditor. It also gives us the storage we need to experiment with VisualEditor-only wiki installations without a Parsoid dependency.
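As a rough illustration of per-revision HTML storage, the sketch below keeps one compressed HTML blob per revision ID and compares two revisions. Everything in it is an assumption made for the example: `RevisionHtmlStore` is a hypothetical name, the dictionary stands in for the text table and external store, and the line-based `difflib` comparison is only a placeholder for a real visual diff.

```python
import difflib
import zlib
from typing import Dict, List, Optional


class RevisionHtmlStore:
    """Hypothetical store holding one compressed HTML blob per revision."""

    def __init__(self) -> None:
        # Stand-in for the database: revision ID -> compressed HTML.
        self._blobs: Dict[int, bytes] = {}

    def save(self, rev_id: int, html: str) -> None:
        self._blobs[rev_id] = zlib.compress(html.encode("utf-8"))

    def load(self, rev_id: int) -> Optional[str]:
        blob = self._blobs.get(rev_id)
        return None if blob is None else zlib.decompress(blob).decode("utf-8")

    def diff(self, old_rev: int, new_rev: int) -> List[str]:
        # Line-based placeholder; a visual diff would compare DOM structure.
        old = self.load(old_rev) or ""
        new = self.load(new_rev) or ""
        return list(
            difflib.unified_diff(old.splitlines(), new.splitlines(), lineterm="")
        )
```

Because the HTML of every revision is already stored, both loading an old revision and diffing two revisions work without re-running the wikitext-to-HTML conversion.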

Support simple VisualEditor-only wiki installs by storing HTML DOM only
Small and simple wikis that want to use the VisualEditor exclusively don't really need wikitext and the Parsoid dependency it brings with it. Since the browser and editor both consume HTML and the editor returns modified HTML, storing plain HTML is a pretty obvious solution.

To be useful, an HTML-only wiki will also need a visual diff system to compare HTML documents. This might also be desirable for MediaWiki core in the longer run, and using HTML-only early adopter wikis to test and refine this functionality could turn out to be useful.

We are aware that HTML-only wikis might be controversial in some quarters. For big and established wikis like Wikipedia, it will make sense to store both wikitext and HTML in parallel for a long time. Smaller wiki installations, however, can experiment with and refine HTML-based storage in the meantime. Once HTML-based storage reaches maturity, switching the primary backend storage format for larger wikis should be hardly noticeable to users. Even with no wikitext storage at all, Parsoid can continue to provide wikitext-based editing.