Parsoid/About

Wikitext has always been both MediaWiki's edit user interface and its storage format. It has been a great success: the simplicity of wikitext made it possible to start writing Wikipedia with Netscape 4.7, when WYSIWYG editing was technically impossible. A simple PHP script converted the wikitext to HTML.

About 12 years later, the world has changed a bit. Wikitext makes it very difficult to implement visual editing, which browsers now support well for HTML documents. With the many features added to the MediaWiki runtime since then, the conversion from wikitext to HTML can also be very slow. On large Wikipedia pages, it can take up to 40 seconds to render a page after an edit.

The Parsoid project is addressing these issues by complementing existing wikitext with an equivalent HTML5 version of the content. In the short term, the HTML representation lets us use HTML technology for visual editing. In the longer term, using the HTML for regular page views as well can save some conversion overhead and enable more efficient updates after an edit to part of a page. This might all sound pretty straightforward. So why has this not been done before?

Lossless conversion between wikitext and HTML is really difficult
It turns out that the ad hoc structure of wikitext makes a lossless conversion to HTML and back extremely difficult:


 * Wikitext is not context-free, so it cannot be completely described and parsed with a context-free grammar. The only complete specification of wikitext's syntax and semantics is the MediaWiki runtime implementation.
 * There is no invalid wikitext. Wiki constructs and some HTML tags can be freely mixed in a tag soup, which still needs to be converted to a DOM tree that ideally resembles the user's intention.
 * The PHP runtime supports an elaborate text-based preprocessor and template system. This works much like a macro processor in C or C++, and creates very similar issues. As an example, there is no guarantee that the expansion of a template will parse to a self-contained DOM structure. In fact, there are many templates that only produce a table start tag, a table row or a table end tag. A template can even produce only the first half of an HTML tag or wikitext element, which is practically impossible to represent in HTML. Despite all this, content generated by an expanded template needs to be clearly identified in the HTML DOM. For a good editing experience, a DOM subtree generated by a sequence of several templates should be encapsulated as a single template-affected unit.
 * MediaWiki uses a character-based diff interface to show the changes between the wikitext of two versions of a wiki page. Any character difference introduced by a round-trip from wikitext to HTML and back would show up as a dirty diff, which would annoy editors and make it hard to find the actual changes. This means that the conversion needs to preserve not just the semantics of the content, but also the syntax of unmodified content character-by-character.
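
The last two points can be illustrated with a small Python sketch. It is a toy model, not Parsoid or MediaWiki code: the template names, the `expand` and `is_self_contained` helpers, and the two italic-only converters are all hypothetical and exist only for this example. The first half shows that a purely textual template expansion can yield fragments that are not self-contained on their own, even though the sequence as a whole is. The second half shows how a round trip that normalizes syntax keeps the meaning but changes characters, which is what produces a dirty diff.

```python
import re

# Hypothetical templates, modelled on the table start / row / end pattern.
TEMPLATES = {
    "table-start": "<table>",
    "table-row": "<tr><td>cell</td></tr>",
    "table-end": "</table>",
}

def expand(text):
    """Substitute every {{name}} with its template text, purely textually."""
    return re.sub(r"\{\{([a-z-]+)\}\}", lambda m: TEMPLATES[m.group(1)], text)

def is_self_contained(fragment):
    """Return True if every tag opened in the fragment is closed in order."""
    stack = []
    for closing, name in re.findall(r"<(/?)([a-z]+)>", fragment):
        if not closing:
            stack.append(name)
        elif stack and stack[-1] == name:
            stack.pop()
        else:
            return False
    return not stack

# Toy round trip (assumption: italics only) that maps both italic
# spellings to <i> and always serializes <i> back to '' quotes.
def to_html(wikitext):
    return re.sub(r"''(.+?)''", r"<i>\1</i>", wikitext)

def to_wikitext(html):
    return re.sub(r"<i>(.+?)</i>", r"''\1''", html)

page = "{{table-start}}{{table-row}}{{table-end}}"
original = "An <i>old</i> and a ''new'' spelling."
round_tripped = to_wikitext(to_html(original))

print(is_self_contained(expand("{{table-start}}")))  # one template alone
print(is_self_contained(expand(page)))               # the whole sequence
print(original)
print(round_tripped)
```

In this sketch only the three templates taken together form a balanced subtree, which is why they would have to be treated as a single template-affected unit. The round-tripped line renders the same as the original, yet a character-based diff would flag the rewritten `<i>` tags even though nobody edited them.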

How we tackle these challenges in Parsoid


(discuss 1-2 examples in more depth)

Things we plan to tackle next
(1-2 paragraphs about HTML storage, incremental parsing / link updates)

Join us!
(Volunteer, contract or interview)