User:Adamw/Parsoid-DOM

After the successful December release and the clean-up work that followed, we are now considering the next steps for Parsoid. The medium-term plan for the summer is to support the VE in becoming the default editor on all wikis. The main remaining work on our part to enable this is further refinement, localization support and editing support for more content elements.

This gives us some breathing room to look at the longer term Parsoid and MediaWiki strategy. The continued work with the JavaScript implementation gave us some new information and ideas, which are mainly about leveraging the HTML DOM we are building in Parsoid.

Storing HTML DOM
The HTML/RDFa DOM spec we developed is already very close to being an equivalent representation of the content, and one that is easier to work with than pure wikitext. It can contain fully expanded templates while still providing the metadata needed to re-expand a template later.

In the short term, storing / re-generating the HTML DOM on each edit would normally remove the wait time currently experienced by VisualEditor users on large articles. In the longer term, storing the HTML DOM can enable several interesting options:

Fragment caching and incremental updates
Parsoid encapsulates the parts of the DOM generated from template expansions, extensions etc. This encapsulation can be used to classify existing templates into those emitting self-contained (properly nested) DOM output and those emitting just a start or end tag (table start / row / end templates, for example). Fortunately, most templates produce properly nested output. Those that don't can be marked with a flag in the database, after which proper nesting can be enforced for all other templates. Unbalanced templates are encapsulated in a combined DOM block, which as a whole is properly nested again. This can also be enforced when re-expanding the combined block of templates.
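The classification described above can be sketched as a strict tag-balance check on a template's expanded output. This is only an illustration, not Parsoid's actual logic: the function name is_self_contained is hypothetical, and the check deliberately ignores HTML's optional end tags, so output such as an unclosed <li> is reported as unbalanced.

```python
from html.parser import HTMLParser

# Elements that never take an end tag in HTML.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class _NestingChecker(HTMLParser):
    """Tracks open elements and flags any end tag that does not match."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.balanced = True

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            # End tag without a matching open tag, or wrong nesting order.
            self.balanced = False


def is_self_contained(html):
    """Return True if every start tag is closed in order within `html`."""
    checker = _NestingChecker()
    checker.feed(html)
    checker.close()
    return checker.balanced and not checker.stack
```

A table-start template that emits only "<table><tr>" would fail this check and could be flagged for encapsulation in a combined block, while a template emitting a complete table would pass.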

With proper nesting enforced and all template parameters available, re-rendering a template only swaps out a DOM subtree. This makes it possible to cache fast-changing template or extension output (Wikidata infoboxes, for example) as a fragment in the edge caches or the DB, or to update it dynamically in clients.
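A minimal sketch of such a subtree swap, using Python's standard ElementTree on well-formed markup. The data-fragment-id attribute and the swap_fragment helper are assumptions made for this example; they are not the markers Parsoid itself uses to identify template output.

```python
import xml.etree.ElementTree as ET


def swap_fragment(root, fragment_id, new_markup):
    """Replace the element tagged with `fragment_id` by freshly rendered markup.

    Returns True if a fragment was found and replaced, False otherwise.
    """
    new_node = ET.fromstring(new_markup)
    for parent in root.iter():
        for index, child in enumerate(list(parent)):
            if child.get("data-fragment-id") == fragment_id:
                # Keep the text that followed the old fragment in place.
                new_node.tail = child.tail
                parent[index] = new_node
                return True
    return False


page = ET.fromstring(
    '<body><p>Intro</p>'
    '<div data-fragment-id="infobox">old</div>'
    '<p>Outro</p></body>'
)
swapped = swap_fragment(
    page, "infobox", '<div data-fragment-id="infobox">new</div>'
)
result = ET.tostring(page, encoding="unicode")
```

Only the infobox element changes; the surrounding paragraphs are untouched, which is what makes per-fragment caching possible.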

With more per-fragment metadata (reference-counted links and a list of recursively used templates), the LinkUpdate jobs can be restricted to a re-expansion of the affected template transclusions rather than the full page. The general idea is to collect all dependencies during evaluation, and encode this information efficiently (likely outside the DOM) to enable quick dependency and validity checks.
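One way to picture this dependency information is a reverse index from template titles to the fragments that used them during expansion. The sketch below is illustrative only; the DependencyIndex class and its method names are hypothetical, and a real implementation would need persistent storage.

```python
from collections import defaultdict


class DependencyIndex:
    """Maps each template title to the (page, fragment id) pairs that use it."""

    def __init__(self):
        self._users = defaultdict(set)

    def record(self, page, fragment_id, templates_used):
        """Store the templates (including nested ones) used by one fragment."""
        for title in templates_used:
            self._users[title].add((page, fragment_id))

    def affected_by(self, changed_template):
        """Return the fragments that need re-expansion after a template edit."""
        return sorted(self._users.get(changed_template, ()))


index = DependencyIndex()
index.record("Berlin", "infobox", ["Template:Infobox city", "Template:Coord"])
index.record("Berlin", "navbox", ["Template:Navbox"])
index.record("Paris", "infobox", ["Template:Infobox city"])
stale = index.affected_by("Template:Coord")
```

After an edit to Template:Coord, only the Berlin infobox fragment is listed for re-expansion, rather than every page that transcludes any template.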

Some extensions like Cite use global state, for example to number citations. Sadly, this complicates independent re-expansions. It does however seem to be possible to implement numbering and similar page-wide operations using CSS and/or JS, which would also benefit the VisualEditor. Most other extensions used by the WMF, like math, poem, timeline etc., are order-independent, so this seems to be a solvable issue for the extensions we currently care about.

Parsoid can (and does in the VE deployment) use the PHP preprocessor via the 'expandtemplates' web API method. This lets it fully support parser functions, internal PHP interfaces, Lua scripting etc. without having to re-implement this functionality. The result is pre-expanded wikitext, which is then parsed and encapsulated in Parsoid. Tag extensions are expanded independently (currently via an action=parse API call).
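For illustration, a request to the expandtemplates API method can be built as below. The endpoint URL and the helper name are placeholders. The prop=wikitext parameter reflects current MediaWiki releases and is an assumption here, since the text does not say which parameters Parsoid sends. The sketch only constructs the URL and performs no network access.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; substitute the api.php URL of the target wiki.
API_URL = "https://wiki.example.org/w/api.php"


def expandtemplates_url(wikitext, page_title, api_url=API_URL):
    """Build a GET URL asking the PHP preprocessor to expand `wikitext`."""
    params = {
        "action": "expandtemplates",
        "format": "json",
        "title": page_title,   # page context for magic words like {{PAGENAME}}
        "prop": "wikitext",    # assumption: request the expanded wikitext
        "text": wikitext,
    }
    return api_url + "?" + urlencode(params)


url = expandtemplates_url("{{echo|1}}", "Main Page")
```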

In the longer term, we could extend the PHP API to provide more dependency information along with the expanded output. A list of templates, parser functions and Lua scripts used in the expansion would provide pretty complete dependency information for caching / incremental update purposes.

Treating the PHP preprocessor and its associated extensions as a self-contained 'legacy' component side-steps the problems associated with the wikitext-centric interfaces used. Emulating the wikitext-based parameters and frame objects passed to (for example) Lua from a token-based parser would involve a lot of work and will probably never work perfectly. Performance of template expansions should not matter that much with incremental updates, as they would be relatively rare. For new pages, all template expansions can be performed in parallel (Parsoid currently sends one parallel API request per transclusion), which could be refined with some batching to amortize fixed connection overheads.
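The batching refinement mentioned above could look roughly like this: transclusions are grouped into fixed-size batches, and the batches are expanded in parallel. Everything here is a sketch under assumptions. The function names are hypothetical, fake_expand_batch stands in for a real API round trip, and the batch size and worker count are arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Split `items` into consecutive lists of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def fake_expand_batch(batch):
    """Placeholder for one API request that expands several transclusions."""
    return {source: "<expanded " + source + ">" for source in batch}


def expand_all(transclusions, expand_batch=fake_expand_batch,
               batch_size=3, workers=4):
    """Expand all transclusions, one request per batch, batches in parallel."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(expand_batch, chunked(transclusions, batch_size)):
            results.update(partial)
    return results


sources = ["{{a}}", "{{b}}", "{{c}}", "{{d}}", "{{e}}"]
expanded = expand_all(sources)
```

With five transclusions and a batch size of three, two requests are issued instead of five, so the fixed connection overhead is paid twice rather than five times.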

HTML-only wikis
New wikis using the VisualEditor UI exclusively could avoid the need for external dependencies by storing HTML exclusively. This will require an HTML-based diff implementation similar to the one in localwiki or XML diff algorithms like XyDiff to replace the wikitext source-based diff.

This diff algorithm and UI could also be applied to old wikitext-based page revisions by converting those revisions to HTML on demand (or in a background job).

DOM-based templating
HTML-only wikis might want to provide templating functionality similar to the existing wikitext-based template system. This could be DOM-based.

The main things we need are:
 * Expressions: provide access to modules and logic, but cannot define infinite loops or variables
 * Iteration: Iterate over finite data structures (JSON objects for example)
 * Conditionals: Include / evaluate a sub-DOM depending on an expression
 * Variable interpolation in attributes and text content

This functionality is pretty simple to implement on the DOM (possibly using JS/jQuery, XPath or even XSLT?). It would provide an opportunity to define very minimal service-like (RESTful, for example) extension interfaces, to which extensions could be ported for a gradual transition.

One popular option is to embed control structures in attributes similar to TAL, Distal or Genshi. Another option is to provide a separate binding to a plain HTML document as in Pure. Templates themselves would still be valid HTML, which might make it possible to implement some sort of visual editing mode for templates. Any serious logic would live in Lua or JavaScript modules working on DOM fragments, JSON objects or strings instead of being embedded in the template itself. The limited expressions supported by ESI might serve as an inspiration.
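To make the attribute-based option concrete, here is a small sketch of a template that stays valid markup while carrying conditionals, iteration and variable interpolation. The data-if and data-each attribute names and the {name} placeholder syntax are invented for this example and are not taken from TAL, Distal or Genshi. Expressions are limited to name lookups in the supplied data, so templates can neither loop indefinitely nor define variables.

```python
import copy
import re
import xml.etree.ElementTree as ET

_VAR = re.compile(r"\{(\w+)\}")


def _interpolate(text, scope):
    """Replace {name} placeholders with values from `scope`."""
    if text is None:
        return None
    return _VAR.sub(lambda m: str(scope.get(m.group(1), "")), text)


def _render(node, scope):
    """Return the rendered copies of `node`: zero, one or many elements."""
    cond = node.get("data-if")
    if cond is not None and not scope.get(cond):
        return []                      # conditional: drop the sub-DOM
    each = node.get("data-each")
    if each is not None:
        out = []
        for item in scope.get(each, []):   # iteration over a finite list
            clone = copy.deepcopy(node)
            del clone.attrib["data-each"]
            clone.attrib.pop("data-if", None)
            out.extend(_render(clone, {**scope, **item}))
        return out
    result = ET.Element(node.tag)
    for key, value in node.attrib.items():
        if key != "data-if":
            result.set(key, _interpolate(value, scope))
    result.text = _interpolate(node.text, scope)
    for child in node:
        for rendered in _render(child, scope):
            result.append(rendered)
        tail = _interpolate(child.tail, scope)
        if tail:
            # Re-attach text that followed the child in the template.
            if len(result):
                result[-1].tail = (result[-1].tail or "") + tail
            else:
                result.text = (result.text or "") + tail
    return [result]


def render(template, data):
    """Render a well-formed template string against a JSON-like dict."""
    rendered = _render(ET.fromstring(template), data)
    return ET.tostring(rendered[0], encoding="unicode") if rendered else ""


html = render(
    '<ul title="{heading}">'
    '<li data-each="rows">{name}: {value}</li>'
    '<li data-if="show_footer">end</li></ul>',
    {"heading": "Stats", "show_footer": False,
     "rows": [{"name": "a", "value": 1}, {"name": "b", "value": 2}]},
)
```

Because the control attributes are ordinary attributes, the template itself can still be loaded and displayed as a plain document, which is the property that would allow some form of visual template editing.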

Incremental re-parsing after wikitext edit
After an edit to a wiki page using the wikitext UI, we currently re-parse the entire page. In most cases only a small part of the page was actually modified, so a full re-parse is not really needed.

Using the DSR (DOM source range) information stored in the HTML DOM, we can match the position of a wikitext diff to a containing HTML DOM structure and re-parse only the modified version of that node. This would normally be a top-level element like a paragraph, which does not depend on nested parser state for correct rendering. Expensive operations like template expansions would normally not need to be re-performed, which would make parse times proportional to the edit size rather than the page size.
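The DSR matching step can be sketched as follows for the simple case where the edit falls inside exactly one top-level node. The Node class, the toy_parse stand-in and the function name are assumptions made for this illustration; a real implementation would call the actual parser and handle edits that span several nodes.

```python
from dataclasses import dataclass


@dataclass
class Node:
    start: int   # DSR start offset in the wikitext
    end: int     # DSR end offset (exclusive)
    html: str    # rendered HTML for this top-level node


def toy_parse(wikitext):
    """Stand-in for the real parser: wraps the source in a paragraph."""
    return "<p>" + wikitext + "</p>"


def incremental_update(nodes, old_text, edit_start, edit_end, replacement,
                       parse=toy_parse):
    """Apply one edit and re-parse only the top-level node that contains it."""
    new_text = old_text[:edit_start] + replacement + old_text[edit_end:]
    delta = len(replacement) - (edit_end - edit_start)
    updated, reparsed = [], 0
    for node in nodes:
        if node.end <= edit_start:
            updated.append(node)                  # before the edit: reuse as-is
        elif node.start >= edit_end:
            # After the edit: reuse the HTML, shift the source offsets.
            updated.append(Node(node.start + delta, node.end + delta, node.html))
        else:
            new_end = node.end + delta
            updated.append(
                Node(node.start, new_end, parse(new_text[node.start:new_end]))
            )
            reparsed += 1
    if reparsed != 1:
        raise ValueError("edit must fall inside exactly one top-level node")
    return new_text, updated


old = "First para.\n\nSecond para."
dom = [Node(0, 11, "<p>First para.</p>"), Node(13, 25, "<p>Second para.</p>")]
# Replace "Second" (offsets 13..19) with "2nd".
text, new_dom = incremental_update(dom, old, 13, 19, "2nd")
```

Only the second paragraph is passed to the parser; the first node is reused untouched, so the work done depends on the size of the edited node rather than the size of the page.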

Fast and integrated C++ implementation
The original plan was to speed up and integrate Parsoid by moving the implementation to C++. This implementation would provide parallel template expansions complete enough to be a drop-in replacement for the PHP preprocessor. Having such an implementation with its raw efficiency and integration potential is still very desirable (and fun to write!), but would also come at the cost of a long delay in Parsoid development with the currently available resources. Tackling the C++ port, the tweaks to the existing JS implementation for the VE, and HTML DOM storage with its related optimizations all in parallel does not appear realistic unless there is a sudden surge in manpower.

If we reach our goal of having the VE as the default editor on all Wikipedias this summer, demand for VE-powered MediaWiki installs will probably be high outside Wikipedia too. If we make good progress on an HTML-DOM-based infrastructure in the meantime, HTML-only wikis with VE could be a possibility by then.

The role of Parsoid would probably change to a conversion tool and wikitext editor for HTML content at some point, for which very high optimization might not be necessary any more. Some of the ideas above also show ways to make the existing JavaScript implementation fast enough by being smart about avoiding unnecessary repeated work.