Parsoid/Roadmap

After the successful December release and the clean-up work that followed, we are now considering the next steps for Parsoid. Continued work on the JavaScript implementation has given us new information and ideas that might influence our priorities.

Storing HTML DOM
The HTML/RDFa DOM spec we developed is an equivalent representation of the content which is easier to work with than pure wikitext. It can contain fully expanded templates while still providing the metadata needed to re-expand a template when needed.

In the shorter term, storing or re-generating this HTML DOM after each edit would usually avoid the wait currently experienced by VisualEditor users on large articles. In the longer term, storing the HTML DOM can enable several interesting options:

Fragment caching and incremental updates
Parsoid encapsulates parts of the DOM generated from template expansions, extensions etc. It can detect properly nested templates vs. unbalanced table start / row / end templates. Most templates produce properly nested output. Those that don't can be marked with a flag in the database, with proper nesting being enforced for all other templates from now on. Unbalanced templates are encapsulated in a combined DOM block, which is then properly nested again. This can also be enforced when re-expanding the combined block of templates.

With proper nesting enforced and all template parameters available, re-rendering a template only needs to swap out a DOM subtree. This makes it possible to cache fast-changing template or extension output (WikiData infoboxes for example) as a fragment in the edge caches or the DB, or to update it dynamically in clients.
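As a rough illustration of the subtree swap described above, here is a minimal Python sketch. It assumes a simplified, well-formed page DOM in which each encapsulated transclusion is marked with a made-up `data-template-id` attribute; this is not the attribute used by the HTML/RDFa DOM spec, and `swap_fragment` is a hypothetical helper rather than Parsoid code.

```python
import xml.etree.ElementTree as ET

# Hypothetical page DOM: data-template-id is an illustrative marker only,
# not the attribute defined by the actual HTML/RDFa DOM spec.
PAGE = (
    '<body>'
    '<p>Intro text</p>'
    '<div data-template-id="t1">old infobox</div>'
    '<p>Trailing text</p>'
    '</body>'
)

def swap_fragment(root, template_id, new_fragment_html):
    """Replace the encapsulated subtree of one transclusion in place.

    Returns True if a matching fragment was found and replaced.
    """
    new_node = ET.fromstring(new_fragment_html)
    for parent in root.iter():
        for index, child in enumerate(list(parent)):
            if child.get("data-template-id") == template_id:
                new_node.tail = child.tail  # keep text that followed the old node
                parent.remove(child)
                parent.insert(index, new_node)
                return True
    return False

root = ET.fromstring(PAGE)
swapped = swap_fragment(
    root, "t1", '<div data-template-id="t1">new infobox</div>'
)
result = ET.tostring(root, encoding="unicode")
```

Only the marked subtree changes; the surrounding paragraphs are untouched, which is what makes per-fragment caching plausible.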

With more per-fragment metadata (reference-counted links and a list of recursively used templates), the LinkUpdate jobs can be restricted to a re-expansion of the affected template transclusions rather than the full page. The general idea is to collect all dependencies during evaluation, and to encode this information efficiently outside the DOM to enable quick dependency and validity checks.
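One possible shape for such dependency information is a reverse index from template title to the fragments that used it. The sketch below is only an assumption about how this could look; the `DependencyIndex` class, the fragment ids and the page and template names are all invented for illustration.

```python
from collections import defaultdict

class DependencyIndex:
    """Hypothetical reverse index kept outside the DOM."""

    def __init__(self):
        # template title -> set of (page title, fragment id) that used it
        self._users = defaultdict(set)

    def record(self, page, fragment_id, templates_used):
        """Store the templates collected while expanding one fragment."""
        for title in templates_used:
            self._users[title].add((page, fragment_id))

    def affected_fragments(self, changed_template):
        """Return the fragments that need re-expansion after a template edit."""
        return sorted(self._users.get(changed_template, ()))

index = DependencyIndex()
index.record("Berlin", "t1", ["Template:Infobox city", "Template:Coord"])
index.record("Berlin", "t2", ["Template:Citation"])
index.record("Paris", "t1", ["Template:Infobox city"])

# Editing Template:Coord only invalidates one fragment, not the whole page.
stale = index.affected_fragments("Template:Coord")
```

Because recursively used templates are recorded per fragment, an edit to a nested template can be traced to the outer transclusions that depend on it.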

Some extensions like Cite use global state to number citations, which complicates independent re-expansions. It does however seem to be possible to implement numbering and similar page-wide operations using CSS and/or JS, which would also benefit the VisualEditor.

Parsoid can (and in the VE deployment does) use the PHP preprocessor via the 'expandtemplates' web API method. This lets it fully support parser functions, internal PHP interfaces, Lua scripting etc. without having to re-implement this functionality. The result is pre-expanded wikitext, which is then parsed and encapsulated in Parsoid. Tag extensions are expanded independently (currently via an action=parse API call).
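For orientation, a request to the 'expandtemplates' method could be assembled roughly as follows. This is a sketch in Python rather than Parsoid's own code: the endpoint URL is a placeholder, `build_expandtemplates_url` is a hypothetical helper, and the example only builds the URL without sending it.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_expandtemplates_url(api_base, page_title, wikitext):
    """Build an action=expandtemplates request URL (no network access)."""
    params = {
        "action": "expandtemplates",
        "format": "json",
        "title": page_title,  # page context used while expanding
        "text": wikitext,     # the wikitext to expand
    }
    return api_base + "?" + urlencode(params)

# Placeholder endpoint: substitute the api.php URL of a real wiki.
url = build_expandtemplates_url(
    "https://wiki.example.org/w/api.php", "Main Page", "{{echo|hello}}"
)
query = parse_qs(urlsplit(url).query)
```

The response would carry the pre-expanded wikitext, which Parsoid then parses and encapsulates.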

In the longer term, we could extend the PHP API to provide more dependency information along with the expanded output. A list of templates, parser functions and Lua scripts used in the expansion would provide pretty complete dependency information for caching / incremental update purposes.

Treating the PHP preprocessor and its associated extensions as a 'legacy' component avoids the problems associated with the wikitext-centric interfaces they use. Emulating the wikitext-based parameters passed to (for example) Lua from a token-based parser would never work perfectly and would take a bit of work. Incremental updates should make PHP-based template re-expansions rare. For new pages, all template expansions can be performed in parallel (Parsoid currently sends one API request per transclusion, in parallel), which could be refined with some batching to amortize fixed connection overheads.
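The batching idea can be sketched as follows, under the assumption that several transclusions could be expanded in a single round trip. `expand_batch` is a stand-in that fabricates its result locally; a real version would make one API request per batch, and the batch size and worker count are arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor

def batches(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def expand_batch(transclusions):
    # Stand-in for ONE API round trip covering several transclusions.
    # The returned strings are fake; no preprocessor is involved here.
    return {t: "expanded(" + t + ")" for t in transclusions}

def expand_all(transclusions, batch_size=3, workers=4):
    """Expand all transclusions, running the batches in parallel."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(expand_batch, batches(transclusions, batch_size)):
            results.update(partial)
    return results

transclusions = ["{{a}}", "{{b}}", "{{c}}", "{{d}}", "{{e}}", "{{f}}", "{{g}}"]
expanded = expand_all(transclusions)
```

With seven transclusions and a batch size of three, the fixed connection cost is paid three times instead of seven, while the batches themselves still run concurrently.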

HTML-only wikis
New wikis using the VisualEditor UI exclusively could avoid the need for external dependencies by storing HTML exclusively. This will require an HTML-based diff implementation similar to the one in localwiki, or XML diff algorithms like XyDiff, to replace the wikitext source-based diff.

DOM-based templating
HTML-only wikis might want to provide templating functionality similar to the existing wikitext-based template system. This could be DOM-based.

One popular option is to embed control structures in attributes similar to TAL or Genshi.

The main things we seem to need are:
 * Expressions: provide access to modules and logic, but cannot define infinite loops or variables
 * Iteration: Iterate over finite data structures (JSON objects for example)
 * Conditionals: Include / evaluate a sub-DOM depending on an expression

This functionality is pretty simple to implement on the DOM (possibly using JS/jQuery, XPath and/or XSLT). It would provide an opportunity to define very minimal service-like (RESTful for example) extension interfaces, to which extensions could be ported for a gradual transition.
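To make the three requirements more concrete, here is a minimal attribute-driven sketch in Python. The attribute names `data-if`, `data-each` and `data-text` are invented for this example and are not taken from TAL or Genshi; expressions are limited to dotted key lookups, so a template cannot define loops or variables of its own (the loop item is always bound to the fixed name `item`).

```python
import copy
import xml.etree.ElementTree as ET

def lookup(context, path):
    """Expression language limited to dotted key lookups."""
    value = context
    for key in path.split("."):
        value = value[key]
    return value

def render(node, context):
    """Return the rendered copies of node: zero, one or many elements."""
    attrs = dict(node.attrib)

    # Conditional: drop the sub-DOM when the expression is falsy.
    # It is evaluated before data-each, against the enclosing context.
    condition = attrs.pop("data-if", None)
    if condition is not None and not lookup(context, condition):
        return []

    # Iteration: repeat the element once per item of a finite structure.
    each = attrs.pop("data-each", None)
    if each is not None:
        stripped = copy.deepcopy(node)
        del stripped.attrib["data-each"]
        stripped.attrib.pop("data-if", None)  # already evaluated above
        out = []
        for item in lookup(context, each):
            out.extend(render(stripped, dict(context, item=item)))
        return out

    # Expression: replace the element text with a looked-up value.
    text_path = attrs.pop("data-text", None)
    result = ET.Element(node.tag, attrs)
    result.text = str(lookup(context, text_path)) if text_path else node.text
    result.tail = node.tail
    for child in node:
        result.extend(render(child, context))
    return [result]

TEMPLATE = (
    '<ul>'
    '<li data-each="cities" data-text="item.name">placeholder</li>'
    '<li data-if="show_footer">end of list</li>'
    '</ul>'
)
context = {"cities": [{"name": "Berlin"}, {"name": "Paris"}], "show_footer": False}
rendered = ET.tostring(render(ET.fromstring(TEMPLATE), context)[0], encoding="unicode")
```

Because iteration only walks an existing finite data structure and expressions cannot assign, rendering always terminates, which is the property the list above asks for.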

Fast and integrated C++ implementation
The original plan was to speed up and integrate Parsoid by moving the implementation to C++. We planned to provide parallel template expansions with functionality equivalent to the PHP preprocessor.

Advantages:
 * Raw efficiency and performance: ASIO event loop with parallel worker pool, C++ memory management, opportunities to optimize behind sane C++ interfaces
 * Opportunities for integration with other libraries (Lua, DOM etc) and PHP

Disadvantages:
 * Despite the prototype work already done in C++, several months of rewrite time
 * Adds a compiled library dependency to MediaWiki; it is unclear what the strategy for simple 'shared hosting' wikis would be