Parsoid/About

Wikitext has always been both MediaWiki's edit user interface and its storage format. It has been a great success: the simplicity of wikitext made it possible to start writing Wikipedia with early Netscape browsers, when WYSIWYG editing was technically impossible. A simple PHP script converted the wikitext to HTML.

About 12 years later, the world has changed a bit. Wikitext makes it very difficult to implement visual editing, which browsers now support well for HTML documents. With the many features added to the runtime since then, the conversion from wikitext to HTML can also be very slow. On large Wikipedia pages, it can take up to 40 seconds to render a page after an edit.

The Parsoid project aims to address these issues by complementing existing wikitext with an equivalent HTML5 version of the content. The HTML representation lets us use HTML editors for visual editing. In the longer term, we could even use the HTML as the primary representation of the content. This promises to avoid some conversion overhead and to enable more efficient updates after an edit to part of a page. Storing MediaWiki's content in HTML might sound pretty obvious. So why has this not been done before?

It turns out that the ad-hoc structure of wikitext makes a lossless conversion to HTML and back extremely difficult.


 * Wikitext is not context-free, so it cannot be completely described or parsed with a context-free grammar. The only complete specification of wikitext's syntax and semantics is the PHP runtime implementation.
 * There is no invalid wikitext. Wiki constructs and some HTML tags can be freely mixed in a tag soup, which still needs to be converted to a DOM tree that ideally resembles the user's intention.
 * The PHP runtime supports an elaborate text-based preprocessor and template system. This works very similarly to the macro preprocessor in C or C++, and creates very similar issues. As an example, there is no guarantee that the expansion of a template will parse to a self-contained DOM structure. In fact, there are many templates that produce only a table start tag, a table row or a table end tag. Templates can even produce only the first half of an HTML tag or wikitext element, which is practically impossible to represent in HTML.
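
The template problem in the last point can be illustrated with a toy sketch in Python. This is not how the PHP runtime or Parsoid works internally. The template names (`table-start`, `table-row`, `table-end`), the `expand` function and the `is_self_contained` check are all hypothetical and exist only for this illustration. The sketch expands templates by plain text substitution, as a macro processor would, and then checks whether each expansion on its own has matching open and close tags.

```python
import re
from html.parser import HTMLParser

# Hypothetical templates for illustration only; not real MediaWiki templates.
TEMPLATES = {
    "table-start": '<table class="wikitable">',
    "table-row": "<tr><td>cell</td></tr>",
    "table-end": "</table>",
}


def expand(text):
    """Replace each {{name}} with its template body, purely as text."""
    return re.sub(
        r"\{\{([^{}|]+)\}\}",
        lambda match: TEMPLATES[match.group(1).strip()],
        text,
    )


class BalanceChecker(HTMLParser):
    """Records whether open and close tags pair up within one fragment.

    Simplification: void elements such as <br> are not special-cased.
    """

    def __init__(self):
        super().__init__()
        self.stack = []
        self.balanced = True

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            # Close tag with no matching open tag in this fragment.
            self.balanced = False


def is_self_contained(fragment):
    """True if the fragment's tags are balanced when parsed in isolation."""
    checker = BalanceChecker()
    checker.feed(fragment)
    checker.close()
    return checker.balanced and not checker.stack


page = "{{table-start}}{{table-row}}{{table-end}}"
for name, body in TEMPLATES.items():
    print(name, is_self_contained(body))
print("whole page", is_self_contained(expand(page)))
```

In this sketch only the row template and the fully expanded page are balanced. The start and end templates make sense only in combination, which is why a converter cannot simply map each template call to its own DOM subtree.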