Parsoid/Roadmap

After the successful December release and some follow-up clean-up work, we are now considering the next steps for Parsoid. The Foundation aims to make the VisualEditor (VE) the default editor on all Wikipedias by July 2013.

The main tasks we see on the Parsoid side to make this possible are:


 * Performance improvements: Loading a large wiki page through Parsoid into VisualEditor can currently take over 30 seconds. We want to make this practically instantaneous by generating and storing the HTML after each edit. This requires a throughput that can keep up with the edit rates on major Wikipedias (~10 Hz on enwiki).


 * Features and refinement: Localization support will enable the use of Parsoid on non-English Wikipedias. VisualEditor needs editing support for more content elements, including template parameters and extension tags. As usual, we will also continue to refine Parsoid's compatibility in round-trip testing and parserTests.
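
The throughput requirement in the performance item above can be put into rough numbers with Little's law (work in flight = arrival rate × time per item). The sketch below is a back-of-the-envelope estimate only: the ~10 Hz edit rate and the 30-second worst case come from the text, while the 3-second mean parse time is a purely hypothetical figure chosen for illustration.

```python
import math


def workers_needed(edits_per_second: float, seconds_per_parse: float) -> int:
    """Little's law: parses in flight = arrival rate * time per parse."""
    return math.ceil(edits_per_second * seconds_per_parse)


# Edit rate on enwiki as quoted in the roadmap (~10 Hz).
ENWIKI_EDIT_RATE = 10.0

# Pessimistic bound: every edit touches a page that takes 30 s to parse.
worst_case = workers_needed(ENWIKI_EDIT_RATE, 30.0)

# Hypothetical mean parse time of 3 s (an assumption, not a measurement).
assumed_mean = workers_needed(ENWIKI_EDIT_RATE, 3.0)

print(worst_case, assumed_mean)
```

The point of the estimate is the proportionality: halving the time per parse halves the number of concurrent parses needed to keep up.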

Apart from these main tasks, which are closely connected to supporting the VisualEditor, we also need to look at the longer-term Parsoid and MediaWiki strategy. Better support for visual editing and smarter caching in MediaWiki's templating facilities is one area we plan to look at. We would also like to make it easy to use the VisualEditor on small MediaWiki installations by removing the need to run a separate Parsoid service.

A general theme is pushing some of Parsoid's innovations back into MediaWiki core. The clean and information-rich HTML-based content model in particular opens up several attractive options. Re-parsing a page after an edit can be sped up by reusing the parser output for unmodified parts of the page. We plan to implement this in Q2 2013, and expect a significantly higher parser throughput with this feature. Similarly, updates triggered by template modifications or parser function changes (changes in date or time, for example) can be made more efficient by only re-expanding affected parts of the page (Q3).

We have also decided to narrow our focus a bit by continuing to use the PHP preprocessor to perform our template expansion. This gives us complete coverage of preprocessor functionality including Lua integration, and lets us expand templates in parallel by doing concurrent API calls. Parsoid's template expansion pipeline works very well too, but doesn't implement all parser functions natively and lacks Lua integration.
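
As a rough illustration of expanding templates through concurrent calls, the sketch below fans out one call per distinct transclusion and collects the results. It is a language-neutral sketch written in Python, not Parsoid's code: `expand_all` and `fake_expand` are hypothetical names, and `fake_expand` merely stands in for an HTTP request to the PHP preprocessor.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def expand_all(
    transclusions: List[str],
    expand: Callable[[str], str],
    max_workers: int = 8,
) -> Dict[str, str]:
    """Expand each distinct transclusion once, running the calls concurrently."""
    unique = list(dict.fromkeys(transclusions))  # de-duplicate, keep order
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(expand, unique))
    return dict(zip(unique, results))


def fake_expand(wikitext: str) -> str:
    """Hypothetical stand-in for a request to the PHP preprocessor."""
    return wikitext.strip("{}").upper()


expanded = expand_all(["{{foo}}", "{{bar|1}}", "{{foo}}"], fake_expand)
print(expanded)
```

De-duplicating before dispatch means a template used many times on a page with identical arguments costs only one call.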

For more discussion of the technical decisions we have made, see Parsoid/reasoning behind Q1 2013 technical decisions.

Features: Editing support for citations, template parameters and tag extensions
The main focus is on making citations and their associated templates editable, so that VisualEditor users can properly reference their sources. We will rework our version of the Cite extension to support dynamic re-expansion of the references tag. This will be needed both on the server side (for incremental updates) and the client side (inside the VisualEditor, potentially).

Template parameter editing and extension tag editing will be wikitext-based. This accommodates unbalanced template parameters, which are sadly relatively common in existing content. Both parameters and extension tag bodies will be restricted syntactically, so that wikitext edits in these cannot affect other parts of the page.

Features: Localization support for non-English wikis
Localization support for non-English wikis (namespaces, magic words, link trails and link prefixes, language variants, etc.) will be developed. Per-wiki configuration information is retrieved through the API (already implemented). The HTML DOM interface abstracts localization issues for the VisualEditor.
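
The text does not say which API request is used to fetch per-wiki configuration. One plausible candidate is the `siteinfo` query module of the MediaWiki API; the sketch below only builds such a request URL, and the selection of `siprop` values is an assumption made for illustration rather than a description of what Parsoid requests.

```python
from urllib.parse import urlencode


def siteinfo_url(api_base: str) -> str:
    """Build a siteinfo request URL for one wiki (no network access here)."""
    params = {
        "action": "query",
        "meta": "siteinfo",
        # Assumed selection: general settings, namespaces and magic words.
        "siprop": "general|namespaces|namespacealiases|magicwords",
        "format": "json",
    }
    return api_base + "?" + urlencode(params)


print(siteinfo_url("https://de.wikipedia.org/w/api.php"))
```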

Testing / Required: Update round-trip testing setup to test non-English pages
The RT testing setup should be updated to record the language of each page and to run round-trip tests on non-English pages. The stats output would need to report results per language. This is useful for identifying language- and interwiki-related errors in Parsoid and for catching associated regressions before going live.
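
A per-language stats report could be as simple as grouping test results by language code. The following is a minimal sketch under assumed names: the `(language, title, clean)` result tuples and the `stats_by_language` function are hypothetical and do not reflect the existing round-trip test schema.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# One result per tested page: (language code, page title, round-tripped cleanly?)
Result = Tuple[str, str, bool]


def stats_by_language(results: Iterable[Result]) -> Dict[str, Dict[str, int]]:
    """Count tested pages and clean round trips for each language."""
    stats: Dict[str, Dict[str, int]] = defaultdict(lambda: {"pages": 0, "clean": 0})
    for lang, _title, clean in results:
        stats[lang]["pages"] += 1
        stats[lang]["clean"] += int(clean)
    return dict(stats)


sample = [
    ("en", "Foo", True),
    ("de", "Bar", False),
    ("de", "Baz", True),
]
print(stats_by_language(sample))
```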

Testing / good to have: Start recording performance data from round-trip testing
For capacity planning and optimization progress tracking we need performance information on as many pages as possible. It should not be too hard to extend our round-trip testing infrastructure to collect this information. We will probably not have time for this project ourselves, but it is quite self-contained and well-suited as a project for an external contributor.

Features: Editing support for images and categories
The VisualEditor plans to add editing support for images and categories. We already support category editing, but image editing remains to be implemented.

Features: HTTP API to render extension tags directly
We currently use an action=parse API hack to expand extension tags to HTML. Instead of this hack, we want to add a dedicated extension tag expansion API endpoint that can also be used by the VisualEditor to update or insert extension tags inline.

Performance: Generate and store HTML DOM on edit
Instead of converting a (potentially large) wiki page to HTML when a user loads the page into the VisualEditor, we will do so in the background after each edit. The result will be stored in the database, which will make loading a page into the VisualEditor practically instantaneous since no more conversion needs to be performed.

The HTML/RDFa DOM content model we developed aims to be an equivalent representation of the content. It can contain fully expanded templates while still providing the metadata needed to re-expand a template later. This makes the HTML DOM an equivalent representation of a revision, with the added capability to persistently cache template expansions and extension output inline. The inline cache enables further performance improvements for subsequent edits and refreshLinks jobs, which we describe further down in this document.

Adding HTML storage will probably involve adding an additional text table and adapting the regular Revision storage logic to optionally use this. Storage space itself does not seem to be an issue (todo: double-check with ops!). The same text id as the corresponding wikitext can be used in the HTML table to avoid any schema changes in the revision table.
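
The shared-id idea can be sketched with an in-memory SQLite database. The `text` table with `old_id` and `old_text` columns mirrors MediaWiki's existing text table, while the `html_text` table, its columns, and the helper functions are hypothetical names used only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE text (
        old_id   INTEGER PRIMARY KEY,
        old_text TEXT NOT NULL
    );
    -- Hypothetical HTML table, keyed by the same id as the wikitext row.
    CREATE TABLE html_text (
        old_id INTEGER PRIMARY KEY REFERENCES text(old_id),
        html   TEXT NOT NULL
    );
    """
)


def save_revision(wikitext, html=None):
    """Store the wikitext and, optionally, its HTML under the same text id."""
    text_id = conn.execute(
        "INSERT INTO text (old_text) VALUES (?)", (wikitext,)
    ).lastrowid
    if html is not None:
        conn.execute(
            "INSERT INTO html_text (old_id, html) VALUES (?, ?)", (text_id, html)
        )
    return text_id


def load_html(text_id):
    """Return the stored HTML for a text id, or None if none was stored."""
    row = conn.execute(
        "SELECT html FROM html_text WHERE old_id = ?", (text_id,)
    ).fetchone()
    return row[0] if row else None
```

Because the HTML row reuses the text id, a revision that already points at its wikitext can locate the HTML without any new column in the revision table.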

Performance: Incremental re-parsing after wikitext edit
After an edit to a wiki page using the wikitext UI, we currently re-parse the entire page. In most cases only a small part of the page was actually modified, so a full re-parse is not really needed.

Using the DSR (DOM source range) information stored in the HTML DOM, we can match the position of a wikitext diff to a containing HTML DOM structure and re-parse only the modified version of that node. This would normally be a top-level element like a paragraph, which does not depend on nested parser state for correct rendering. Expensive operations like template expansions would normally not need to be re-performed, which would make parse times proportional to the edit size rather than the page size.
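
The matching step can be illustrated with a toy model in which every top-level block records its source range. This is a simplified sketch, not Parsoid's algorithm: `Block`, `full_parse` and `incremental_parse` are hypothetical names, blocks are assumed to be separated by blank lines, and the edit is assumed not to add or remove block boundaries.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Block:
    start: int  # DSR start offset into the wikitext source
    end: int    # DSR end offset (exclusive)
    html: str


def full_parse(src: str, parse_block: Callable[[str], str]) -> List[Block]:
    """Toy full parse: each blank-line-separated chunk is one top-level block."""
    blocks, pos = [], 0
    for chunk in src.split("\n\n"):
        blocks.append(Block(pos, pos + len(chunk), parse_block(chunk)))
        pos += len(chunk) + 2  # skip the "\n\n" separator
    return blocks


def incremental_parse(
    blocks: List[Block],
    new_src: str,
    start: int,
    old_end: int,
    new_end: int,
    parse_block: Callable[[str], str],
) -> Optional[List[Block]]:
    """Old source range [start, old_end) was replaced by new_src[start:new_end]."""
    delta = new_end - old_end
    # Blocks whose source range touches the edited range (blocks are sorted).
    hit = [i for i, b in enumerate(blocks) if b.start <= old_end and b.end >= start]
    if not hit:
        return None  # edit lies outside every block: fall back to a full parse
    first, last = hit[0], hit[-1]
    span_start = blocks[first].start
    span_end = blocks[last].end + delta
    fresh = Block(span_start, span_end, parse_block(new_src[span_start:span_end]))
    # Blocks after the edit keep their HTML; only their offsets shift.
    shifted = [Block(b.start + delta, b.end + delta, b.html) for b in blocks[last + 1:]]
    return blocks[:first] + [fresh] + shifted
```

In this model only the source text of the touched block is handed to `parse_block`; all other blocks are reused, which is what makes the cost proportional to the edit size.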

There are some cases where a wikitext diff inside a block element can affect other top-level blocks when tokenized as a full page. We need to rule these cases out before applying this optimization. We can probably employ heuristics similar to those we currently use in wikitext escaping. Details of potential strategies are discussed in Parsoid/Incremental re-parsing after wikitext edit.

Performance: Batch API calls
We currently naïvely perform all API requests in parallel. With most API calls returning very quickly, this wastes resources in Parsoid as well as on the API servers. If necessary, we could batch several requests into a single larger one to amortize the setup and connection costs, although this would be a special-case feature for template expansion and extension hooks. Incremental re-parsing can, however, eliminate the need for many repetitive API requests, and might make this task unnecessary if implemented before the July release. See bug 43888 for the details.
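
Batching amounts to grouping pending requests into chunks and paying the connection cost once per chunk. The sketch below assumes a hypothetical `send_batch` callable that takes a list of requests and returns their results in the same order; no such batch endpoint is described in the text.

```python
from typing import Callable, Iterator, List, Sequence


def batches(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` items."""
    if size < 1:
        raise ValueError("batch size must be at least 1")
    for i in range(0, len(items), size):
        yield list(items[i:i + size])


def expand_batched(
    requests: Sequence[str],
    send_batch: Callable[[List[str]], List[str]],
    size: int = 50,
) -> List[str]:
    """Send requests in chunks: one round trip per chunk instead of per item."""
    results: List[str] = []
    for batch in batches(requests, size):
        results.extend(send_batch(batch))
    return results
```

The batch size trades latency against overhead: larger batches amortize more setup cost, but the first result only arrives once the whole batch has been processed.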

Features: Support HTML-only wikis without Parsoid
We will support simple HTML-only wikis with VisualEditor front-end without the need for a Parsoid installation. Besides other tweaks, this means that we will be providing a visual diff interface in place of the current wikitext-based diff.

Performance: Efficient LinksUpdate and fragment caching prototyping
Parsoid encapsulates DOM subtrees generated from template expansions and extensions and adds the information needed to re-expand the subtrees. Parsoid can enforce proper nesting for template output, which can be used to splice the re-expanded output into the original subtree locations without expanding the affected DOM scope. This makes it possible to leave templates in expanded form in the HTML stored in the DB, to cache them as fragments in the edge caches, or to update them dynamically in clients. This is especially useful for fast-changing templates or extension output (Wikidata infoboxes, for example) which might otherwise prompt repeated reparsing of the entire document.

If Parsoid can collect all dependencies during template expansion (recursively used templates, parser functions, reference-counted links) and encode this information efficiently (likely outside the DOM), the dependency information can be used to implement efficient validity checks and restrict the scope of the LinksUpdate jobs to a re-expansion of the affected template transclusions rather than the full page.
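
One way to picture the dependency information is a reverse index from each dependency to the transclusions that used it. The sketch below is an assumption about a possible shape of that data, not a design from the text: `DependencyIndex`, the page titles and the transclusion ids are all illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


class DependencyIndex:
    """Reverse index: dependency title -> transclusions that used it."""

    def __init__(self) -> None:
        self._users: Dict[str, Set[Tuple[str, str]]] = defaultdict(set)

    def record(self, page: str, transclusion_id: str, deps: Iterable[str]) -> None:
        """Remember every dependency collected while expanding one transclusion."""
        for dep in deps:
            self._users[dep].add((page, transclusion_id))

    def affected(self, changed: str) -> Dict[str, List[str]]:
        """Transclusions to re-expand after `changed` was edited, grouped by page."""
        by_page: Dict[str, Set[str]] = defaultdict(set)
        for page, transclusion_id in self._users.get(changed, ()):
            by_page[page].add(transclusion_id)
        return {page: sorted(ids) for page, ids in by_page.items()}


index = DependencyIndex()
index.record("Berlin", "t1", ["Template:Infobox", "Template:Coord"])
index.record("Berlin", "t2", ["Template:Cite web"])
index.record("Paris", "t1", ["Template:Infobox"])
print(index.affected("Template:Coord"))
```

With such an index, a change to one template leads to re-expanding only the listed transclusions on each page, rather than re-parsing every page that uses it.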

Independent extension expansions are slightly complicated by the fact that some extensions like Cite use global state, for example to number citations in original textual order. This global state introduces ordering dependencies between individual expansions. Thankfully, on Wikipedias, the Cite extension seems to be the only extension that requires global state. We plan to move the document-global citation update to JavaScript and CSS in Q2 2013. Complex extensions used in other projects like Wikibooks still need to be investigated, but in the worst case, a fallback to a re-expansion of all calls per page to such an order-dependent extension seems to be possible.
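
The idea of separating the global numbering from the individual expansions can be shown with a single document-order pass. The roadmap plans to do this with JavaScript and CSS; the Python below only illustrates the principle. The empty `<sup class="ref"></sup>` placeholder markup is hypothetical, and named or reused references are ignored.

```python
import re

# Hypothetical placeholder left behind by an independently expanded citation.
PLACEHOLDER = re.compile(r'<sup class="ref"></sup>')


def number_citations(html: str) -> str:
    """Fill in citation numbers in document order, after all expansions."""
    counter = 0

    def fill(_match: "re.Match[str]") -> str:
        nonlocal counter
        counter += 1
        return f'<sup class="ref">[{counter}]</sup>'

    return PLACEHOLDER.sub(fill, html)


print(number_citations('A<sup class="ref"></sup> and B<sup class="ref"></sup>.'))
```

Since the numbers are assigned only in this final pass, each citation can be expanded or re-expanded without knowing about the citations before it.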

To summarize, we hope to at least implement an efficient LinksUpdate mechanism which is used to keep HTML copies of pages up-to-date when dependencies change. We will also investigate pushing some of this into the edge or the client, but might not build a complete solution for this in this quarter.

Research / prototype: HTML-only wiki support
The Parsoid web service adds a complex dependency to MediaWiki installations, which is problematic for simple MediaWiki installations that just want to use the VisualEditor. Wikis interested in editing through the VisualEditor exclusively don't necessarily need wikitext-based storage. Instead, they could use HTML storage natively. We already intend to add the capability for HTML storage in MediaWiki, which makes the storage part relatively easy.

In addition, HTML-only wikis will need an HTML-based visual diff implementation.

We will investigate which other issues we need to solve to make an HTML-based wiki possible.

Research / prototype: DOM-based templating
MediaWiki's templating is strongly tied to wikitext: Template parameters are (wikitext) strings, and the template output is wikitext which is further interpreted by a multi-pass parser. Templates are a mix of logic (typically heavily using parser functions) and wikitext snippets, which has given them a reputation for being hard to read. The unstructured nature makes visual editing of templates difficult.

The prospect of HTML-only wikis without a dependency on Parsoid prompts us to re-examine how we do templating in MediaWiki. DOM-based templating with a clear separation between logic and the actual templates looks like a particularly promising option to us.

The main things we need in templates are:
 * Simple expressions: provide access to modules and logic, but cannot define infinite loops or variables
 * Iteration: Iterate over finite data structures (JSON objects for example)
 * Conditionals: Include / evaluate a sub-DOM depending on an expression
 * Variable interpolation in attributes and text content
 * Ability to compute expressions and splice output in attributes and text content
 * Ability to invoke other templates and splice the output DOM into the template

This minimal functionality is relatively simple to implement on the DOM. It is desirable to make templates valid HTML documents, so that they can be edited in a visual editor. This can be achieved by encoding the control directives above in attributes similar to TAL, Distal or Genshi.
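
A minimal sketch of attribute-encoded directives, in the spirit of TAL, is shown below. The attribute names `data-each`, `data-if` and `data-text` are invented for this example and are not a proposed syntax. Expressions are limited to dotted lookups in the context, so no loops or variables can be defined. Attribute interpolation and invoking other templates are left out.

```python
import copy
import xml.etree.ElementTree as ET

DIRECTIVES = ("data-each", "data-if", "data-text")


def lookup(path, ctx):
    """Resolve a dotted path such as 'item.name' against nested dicts."""
    value = ctx
    for key in path.split("."):
        value = value[key]
    return value


def render(node, ctx):
    """Return the rendered copies of `node`: zero, one or several elements."""
    each = node.get("data-each")
    if each is not None:  # iteration over a finite list, e.g. "item in items"
        var, path = each.split(" in ", 1)
        single = copy.deepcopy(node)
        del single.attrib["data-each"]
        out = []
        for item in lookup(path, ctx):
            out.extend(render(single, {**ctx, var: item}))
        return out
    cond = node.get("data-if")
    if cond is not None and not lookup(cond, ctx):
        return []  # conditional: drop this sub-DOM
    plain = {k: v for k, v in node.attrib.items() if k not in DIRECTIVES}
    result = ET.Element(node.tag, plain)
    result.tail = node.tail
    text_path = node.get("data-text")
    if text_path is not None:  # interpolation replaces the placeholder content
        result.text = str(lookup(text_path, ctx))
        return [result]
    result.text = node.text
    for child in node:
        result.extend(render(child, ctx))
    return [result]


TEMPLATE = (
    '<ul><li data-each="item in items" data-if="item.show" '
    'data-text="item.name">placeholder</li></ul>'
)
context = {"items": [
    {"name": "A", "show": True},
    {"name": "B", "show": False},
    {"name": "C", "show": True},
]}
rendered = render(ET.fromstring(TEMPLATE), context)[0]
print(ET.tostring(rendered, encoding="unicode"))
```

Because the directives live in ordinary attributes and the placeholder text is real content, the template itself remains a well-formed document that a visual editor could display.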

Type information for template parameters can be used to improve the user interface for editing individual parameters. Instead of (wikitext) strings, we plan to support parameters with JSON-compatible types (Objects, Arrays, Numbers, Strings, Booleans, Date) or DOM fragments. The return type of a template is a DOM fragment instead of wikitext.

Logic can be implemented in an actual programming language (Lua and possibly JavaScript through Scribunto), and can return the same JSON-compatible types. This adds some dependencies, but should still be within the reach of shared hosting installs. Logic should also be able to call templates and return the resulting DOM fragment, in which case it acts as a controller to a template. In this case, the logic should be called in the same namespace as templates so that adding a controller to a template does not require changes to existing callers.

On wikis with Parsoid installed, the wikitext-based template system can be integrated into a DOM-based template system to provide a transition path. Wikitext templates would accept only strings as parameters (other types would be coerced to strings), and would expand to DOM fragments after being parsed by Parsoid.
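
The coercion of typed parameters to strings could look roughly like the sketch below. The concrete rules (booleans mapping to "1" or the empty string, dates to ISO 8601, objects and arrays to JSON) are assumptions chosen for illustration, since the text does not specify them, and DOM fragments are not handled.

```python
import json
from datetime import date


def to_wikitext_param(value) -> str:
    """Coerce a JSON-compatible value to a string for a wikitext template."""
    if isinstance(value, str):
        return value
    if isinstance(value, bool):  # checked before int: bool is an int subclass
        return "1" if value else ""  # assumed convention: empty string is false
    if isinstance(value, (int, float)):
        return str(value)
    if isinstance(value, date):
        return value.isoformat()
    return json.dumps(value)  # objects and arrays


params = {"title": "Berlin", "population": 3500000, "capital": True,
          "founded": date(1237, 1, 1), "tags": ["city", "state"]}
print({name: to_wikitext_param(v) for name, v in params.items()})
```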

In this quarter, we plan to implement a first prototype of an HTML DOM-based templating system in PHP (possibly using the built-in XML DOM and XPath bindings), which should provide a good basis for a deeper evaluation.

If still necessary for performance: Fast and integrated C++ implementation
If all of the above fails to produce the required performance (unlikely by our calculations), then we can still dust off the C++ port we started and move part or all of Parsoid to C++.