Parsoid/About

About 12 years later, the world has changed a bit. Wikitext makes it very difficult to implement visual editing, which is now well supported in browsers for HTML documents. As features have accumulated in the runtime, the conversion from wikitext to HTML has also become very slow: on large Wikipedia pages, it can take up to 40 seconds to render the page after an edit.

The Parsoid project is working on addressing these issues by complementing existing wikitext with an equivalent HTML5 version of the content. In the short term, the HTML representation lets us use HTML technology for visual editing. In the longer term, using HTML as the storage format can eliminate conversion overhead when rendering pages, and can also enable more efficient updates after an edit to a part of the page. This might all sound pretty straightforward. So why has this not been done before?

Lossless conversion between wikitext and HTML is really difficult
For the wikitext and HTML5 representations to be considered equivalent, it should be possible to convert between them without introducing any semantic errors. It turns out that the ad-hoc structure of wikitext makes such a lossless conversion to HTML and back extremely difficult.


 * Context-sensitive parsing: Wikitext is not context-free, so it cannot be completely described and parsed based on a context-free grammar. The only complete specification of Wikitext's syntax and semantics is the MediaWiki PHP-based runtime implementation.
 * Text-based templating: The PHP runtime supports an elaborate text-based preprocessor and template system. This works very similarly to a macro processor in C or C++, and creates very similar issues. As an example, there is no guarantee that the expansion of a template will parse to a self-contained DOM structure. In fact, many templates produce only a table start tag, a table row, or a table end tag. They can even produce just the first half of an HTML tag or wikitext element, which is practically impossible to represent in HTML. Despite all this, content generated by an expanded template needs to be clearly identified in the HTML DOM. For a good editing experience, a DOM subtree generated by a sequence of several templates should be encapsulated as a single template-affected unit.
 * No invalid wikitext: There is no invalid wikitext. Wiki constructs and HTML tags can be freely mixed in a tag soup, which still needs to be converted to a DOM tree that ideally resembles the user's intention. This also introduces a vast number of edge cases whose behavior might vary between different implementations of the wikitext conversion. In practice, this means identifying well-defined and well-supported wikitext whose behavior remains consistent across implementations. This can be achieved with a combination of test cases, a grammar for the context-free subset of wikitext, and textual descriptions.
 * Character-based diffs: MediaWiki uses a character-based diff interface to show the changes between the wikitext of two versions of a wiki page. Any character difference introduced by a round-trip from wikitext to HTML and back would show up as a dirty diff, which would annoy editors and make it hard to find the actual changes. This means that the conversion needs to preserve not just the semantics of the content, but also the syntax of unmodified content, character by character. Put differently, wikitext-to-HTML is a many-to-one mapping: different snippets of wikitext (Ex:  * list  versus  *list ) all produce the same HTML rendering, so a reverse conversion would effectively normalize wikitext syntax. Character-based diffs, however, force the wikitext-to-HTML mapping to be treated as a one-to-one mapping. In practice, this burdens the implementation by requiring one or more of the following techniques:
   * relying on access to the source wikitext to re-render original wikitext for unmodified HTML5 output.
   * requiring the HTML5 representation to record variations from some normalized syntax (Ex: recording excess spaces, variants of table-cell wikitext).
   * recording information about DOM fixups when ill-formed HTML is auto-corrected by HTML clients (Ex: auto-closed inline tags in block context).
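The templating problem above can be sketched with hypothetical templates (the template names and bodies are illustrative, not actual templates from any wiki). Each template body is a fragment that does not parse to a self-contained DOM structure on its own; only the concatenation of all three expansions yields a complete table:

```wikitext
<!-- {{table-start}}: hypothetical template whose entire body is the next line -->
{| class="wikitable"
<!-- {{table-row}}: hypothetical template producing only a single row -->
|-
| cell 1 || cell 2
<!-- {{table-end}}: hypothetical template producing only the closing delimiter -->
|}
```

Even though no individual expansion is well-formed, Parsoid still has to attribute the resulting table subtree in the DOM to the sequence of templates that produced it, and treat that subtree as one template-affected unit.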
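The dirty-diff problem can be illustrated with the list example above. Both spellings of the list item render to the same HTML, so a converter that always serialized the normalized spelling would change a line the editor never touched (a hypothetical round-trip):

```text
wikitext before round-trip:  *list
rendered HTML (for both):    <ul><li>list</li></ul>
wikitext after round-trip:   * list

diff presented to editors:
-*list
+* list
```

To avoid such diffs, the serializer must reproduce the original, un-normalized spelling for any content the user did not modify, using one of the techniques listed above.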

Example 1
Let us look at an example to see how Parsoid tackles some of these challenges. Consider a simple wiki-link such as the wikitext [[Foo|bar]], which links to the page Foo while displaying the text "bar". The a-tag in Parsoid's HTML output should be obvious given that the wikitext is a wiki-link. However, Parsoid also marks it with the mw:WikiLink property, which lets Parsoid serialize this HTML back to a wikilink. This kind of RDFa markup provides clients (like the Visual Editor) additional semantic information, tells Parsoid what wikitext generated the a-tag, and lets it distinguish between external links, images, ISBN links, etc.
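For the wiki-link [[Foo|bar]], the generated HTML looks roughly like this (a sketch consistent with the description above; exact attribute details may differ from actual Parsoid output):

```html
<a rel="mw:WikiLink" href="./Foo">bar</a>
```

The rel="mw:WikiLink" annotation is what allows Parsoid to serialize the a-tag back to wiki-link syntax rather than to a plain external link.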

Let us now change the wikitext slightly so that the link target Foo is generated by a template, for example by a template that simply returns its argument. First of all, note that this wikitext renders identically in the browser to the wikitext above -- so semantically, there is no difference between the two wikitext snippets. However, Parsoid adds additional markup to the link target. The span-tag wrapping the target has an about attribute and an RDFa type. Once again, this lets clients know that the target came from a template, and lets Parsoid serialize this back to the original wikitext. Parsoid also maintains private round-tripping information in the data-parsoid HTML attribute (the original template source, in this example). The about attribute on the span is useful when a template generates multiple non-nested DOM nodes, as it collectively identifies them as the output of a single template.
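The HTML for the templated link target looks roughly like the sketch below. The structure follows the description above, but the details are illustrative: the template {{echo|Foo}}, the about value "#mwt1", and the exact shape of the data-parsoid payload are assumptions, not verbatim Parsoid output:

```html
<a rel="mw:WikiLink" href="./Foo">
  <span about="#mwt1" typeof="mw:Transclusion"
        data-parsoid='{"src":"{{echo|Foo}}"}'>bar</span>
</a>
```

If the same template expansion had produced several sibling nodes instead of one span, each of them would carry the same about value, marking them collectively as the output of that one template.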

(discuss 1-2 examples in more depth)

Things we plan to tackle next
(1-2 paragraphs about HTML storage, incremental parsing / link updates)

Join us!
(Volunteer, contract or interview)