Parsoid/About

Wikitext has always been both MediaWiki's edit interface and storage format. It has been a great success: the simplicity of wikitext made it possible to start writing Wikipedia with Netscape 4.7 when WYSIWYG editing was technically impossible. A relatively simple PHP script converted the wikitext to HTML.

About 12 years later, the world has changed a bit. Wikitext makes it very difficult to implement visual editing, which browsers now support for HTML documents. With the many features added to the runtime over the years, the conversion from wikitext to HTML can also be very slow: on large Wikipedia pages, it can take up to 40 seconds to render the page after an edit.

The Parsoid project is working on addressing these issues by complementing existing wikitext with an equivalent HTML5 version of the content. In the short term, the HTML representation lets us use HTML technology for visual editing. In the longer term, using HTML as the storage format can eliminate conversion overhead when rendering pages, and can also enable more efficient updates after an edit to a part of the page. This might all sound pretty straightforward. So why has this not been done before?

Lossless conversion between wikitext and HTML is really difficult
For the wikitext and HTML5 representations to be considered equivalent, it must be possible to convert between them without introducing any semantic differences. It turns out that the ad-hoc structure of wikitext makes such a lossless conversion to HTML and back extremely difficult.
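
Put as a testable property, the requirement looks roughly like this (a minimal TypeScript sketch; parse and serialize are illustrative stand-ins for the two conversion directions, not Parsoid's actual API):

 // Hypothetical round-trip property: converting unmodified wikitext to
 // HTML5 and back must reproduce the input exactly.
 declare function parse(wikitext: string): Document;    // wikitext -> HTML5 DOM
 declare function serialize(dom: Document): string;     // HTML5 DOM -> wikitext

 function roundTrips(wikitext: string): boolean {
   // Unmodified content must survive the trip character-for-character.
   return serialize(parse(wikitext)) === wikitext;
 }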

 * Context-sensitive parsing: Wikitext is not context-free, so it cannot be completely described and parsed with a context-free grammar. The only complete specification of wikitext's syntax and semantics is the MediaWiki PHP-based runtime implementation, which is still heavily based on regular-expression-driven text transformation.
 * Text-based templating: The PHP runtime supports an elaborate text-based preprocessor and template system. This works much like a macro processor in C or C++, and creates very similar issues. As an example, there is no guarantee that the expansion of a template will parse to a self-contained DOM structure. In fact, many templates produce only a table start tag, a table row, or a table end tag. They can even produce only the first half of an HTML tag or wikitext element, which is practically impossible to represent in HTML. Despite all this, content generated by an expanded template (or several templates in combination) needs to be clearly identified in the HTML DOM.
 * No invalid wikitext: There is no invalid wikitext. Wiki constructs and HTML tags can be freely mixed into a tag soup, which still needs to be converted to a DOM tree that ideally resembles the user's intention. The behavior for rare edge cases is often more accident than design, and reproducing it bug-by-bug for every edge case is not feasible. We use automated round-trip testing on 100,000 Wikipedia articles, unit test cases, and statistics on Wikipedia dumps to help us identify the common cases we need to support.
 * Character-based diffs: MediaWiki uses a character-based diff interface to show the changes between the wikitext of two versions of a wiki page. Any character difference introduced by a round trip from wikitext to HTML and back would show up as a dirty diff, which would annoy editors and make it hard to find the actual changes. This means that the conversion needs to preserve not just the semantics of the content, but also the syntax of unmodified content character by character. Put differently, wikitext-to-HTML is a many-to-one mapping where different snippets of wikitext all result in the same HTML rendering (Ex: " * list " versus " *list "), so a reverse conversion would effectively normalize wikitext syntax. Character-based diffs, however, force us to treat the wikitext-to-HTML mapping as one-to-one. We use a combination of complementary techniques to achieve clean diffs (the first of these is sketched in code after the list):
   * We detect changes to the HTML5 DOM structure and use a corresponding substring of the source wikitext when serializing an unmodified DOM part (selective serialization).
   * We record variations from some normalized syntax in private round-trip data (Ex: excess spaces, variants of table-cell wikitext).
   * We collect and record information about ill-formed HTML that is auto-corrected while building the DOM tree (Ex: auto-closed inline tags in block context).
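
To make the first technique concrete, here is a minimal sketch of selective serialization in TypeScript. It assumes each DOM node carries the source offsets of the wikitext it was parsed from (Parsoid records such private data in the data-parsoid attribute) plus a dirty flag set by DOM diffing; all names are illustrative, not Parsoid's actual API:

 interface SerializableNode {
   modified: boolean;             // flagged by diffing against the original DOM
   srcRange?: [number, number];   // offsets into the original wikitext
 }

 declare function serializeFromScratch(node: SerializableNode): string;

 function selectiveSerialize(node: SerializableNode, origWikitext: string): string {
   if (!node.modified && node.srcRange) {
     // Unmodified subtree: reuse the original wikitext verbatim, preserving
     // every space and syntactic variant, so the diff stays clean.
     const [start, end] = node.srcRange;
     return origWikitext.slice(start, end);
   }
   // Modified subtree: generate (normalized) wikitext from the DOM.
   return serializeFromScratch(node);
 }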

How we tackle these challenges in Parsoid
Parsoid is implemented as a node.js-based web service. The conversion from wikitext to HTML DOM starts with a PEG-based tokenizer, which emits tokens into an asynchronous token-stream transformation pipeline, which in turn feeds fully processed tokens to an HTML5 tree builder. The resulting DOM is further post-processed before it is stored or delivered to a client.
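
In rough outline, the stages compose like this (a TypeScript sketch with illustrative names, not Parsoid's actual module interfaces):

 interface Token { type: string; value?: string }

 declare function pegTokenize(wikitext: string): AsyncIterable<Token>;             // PEG tokenizer
 declare function transformTokens(ts: AsyncIterable<Token>): AsyncIterable<Token>; // async token transforms
 declare function buildTree(ts: AsyncIterable<Token>): Promise<Document>;          // HTML5 tree builder
 declare function postProcessDom(dom: Document): Document;                         // DOM post-passes

 async function wikitextToHtml(wikitext: string): Promise<Document> {
   const transformed = transformTokens(pegTokenize(wikitext));
   const dom = await buildTree(transformed);
   return postProcessDom(dom);
 }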

The asynchronous token transformation pipeline lets us perform expensive template and extension tag expansions in parallel. We use MediaWiki's web API for these expansions, which distributes the work for a single parse request across a cluster of machines.
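
A simplified sketch of such parallel expansion, using the web API's expandtemplates action (error handling and the exact response shape are elided; the helper names and endpoint URL are just examples):

 // Expand one transclusion through the MediaWiki web API.
 async function expandTemplate(apiUrl: string, transclusion: string): Promise<string> {
   const params = new URLSearchParams({
     action: 'expandtemplates',
     text: transclusion,            // e.g. a single '{{...}}' transclusion
     format: 'json',
   });
   const res = await fetch(`${apiUrl}?${params}`);
   const body = await res.json();
   return body.expandtemplates['*']; // expanded wikitext (legacy JSON format)
 }

 // Because the token pipeline is asynchronous, independent transclusions
 // on a page can be expanded concurrently rather than one after the other.
 async function expandAll(apiUrl: string, transclusions: string[]): Promise<string[]> {
   return Promise.all(transclusions.map((t) => expandTemplate(apiUrl, t)));
 }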

The conversion from HTML DOM to wikitext is performed by a serializer, which needs to make sure that the generated wikitext parses back to the original DOM. For this, it needs a deep understanding of the various syntactic constructs and their constraints. It also needs to escape wikitext-like constructs in text content, which is not trivial for a context-sensitive language.
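
To illustrate why this escaping is context-sensitive, here is a deliberately incomplete TypeScript sketch; Parsoid's real escaping handles many more constructs:

 function escapePlainText(text: string, atStartOfLine: boolean): string {
   // '*', '#', ':', ';' and '=' only act as wikitext syntax at the start
   // of a line...
   if (atStartOfLine && /^[*#:;=]/.test(text)) {
     return '<nowiki>' + text + '</nowiki>';
   }
   // ...while link and template delimiters are syntax anywhere in the text.
   return text.replace(/(\[\[|\]\]|\{\{|\}\})/g, '<nowiki>$1</nowiki>');
 }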

Let us now have a look at some examples in more detail.

Example: Wiki link with templated content
Consider the wikitext:

 [[Foo|bar]]

The HTML generated by Parsoid for this is:

 <a rel="mw:WikiLink" href="./Foo">bar</a>

The a-tag itself should be obvious, given that the wikitext is a wiki link. However, in addition to wiki links, external links, images, ISBN links and others also generate an a-tag. In order to properly convert the a-tag back to the correct wikitext that generated it, Parsoid needs to be able to distinguish between them. To this end, Parsoid marks the a-tag with the mw:WikiLink property (or mw:ExtLink, mw:Image, etc.). This kind of RDFa markup also provides clients (like the Visual Editor) with additional semantic information about HTML DOM subtrees.
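
For example, a serializer can dispatch on this RDFa markup along the following lines (a simplified sketch; the cases and output are far from complete):

 function serializeLink(a: HTMLAnchorElement): string {
   const href = a.getAttribute('href') ?? '';
   switch (a.getAttribute('rel')) {
     case 'mw:WikiLink':
       // './Foo' style hrefs map back to the page title.
       return `[[${href.replace(/^\.\//, '')}|${a.textContent}]]`;
     case 'mw:ExtLink':
       return `[${href} ${a.textContent}]`;
     default:
       // Unrecognized link type: fall back to emitting the HTML as-is.
       return a.outerHTML;
   }
 }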

Let us now change the wikitext slightly, so that the link target is generated by a template ({{echo}} is a template that simply returns its argument):

 [[{{echo|Foo}}|bar]]

The HTML generated by Parsoid for this is:

 <a rel="mw:WikiLink" href="./Foo"><span about="#mwt1" typeof="mw:Object/Template" data-parsoid='{"src":"{{echo|Foo}}"}'>bar</span></a>

First of all, note that this wikitext renders identically to the wikitext above in the browser -- so semantically, there is no difference between the two snippets. However, Parsoid adds additional markup for the link target. The span-tag inside the link has an about attribute and an RDFa type. Once again, this is to let clients know that the target came from a template, and to let Parsoid serialize this back to the original wikitext. Parsoid also maintains private information for round-tripping in the data-parsoid HTML attribute (the original template source in this example). The about attribute on the span lets us mark template output that expands to several DOM subtrees as a group.
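
A client (or the serializer) can then recover all the pieces of one expansion along these lines (a sketch; the helper function is hypothetical, not Parsoid's API):

 // Collect the sibling DOM subtrees that belong to a single template
 // expansion by matching their about attribute.
 function templateGroup(doc: Document, aboutId: string): Element[] {
   return Array.from(doc.querySelectorAll(`[about="${aboutId}"]`));
 }

 // templateGroup(doc, '#mwt1') would return every element marked as part
 // of the '#mwt1' expansion, even when the template produced several
 // top-level nodes (say, multiple table rows).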

What's coming up next
Our roadmap describes our plans for the coming months and beyond. Apart from new features and refinements in support of the Visual Editor project, we plan to assimilate several Parsoid features into the core of MediaWiki. HTML storage in parallel with wikitext is the first major step in this direction. This will enable several optimizations and might eventually lead to HTML becoming the primary storage format in MediaWiki. We are also working on a DOM-based templating solution with better support for visual editing, separation between logic and presentation, and the ability to cache fragments for better performance.

Join us!
If you like the technical challenges in Parsoid and would like to get involved, please join us in irc://chat.freenode.net/mediawiki-parsoid. You could even get paid to work on Parsoid: we are looking for a full-time software engineer (todo: link) and 1-2 contractors. Join the small Parsoid team and help make the sum of all knowledge easier and more efficient to edit, render, and reuse!