Parsoid/Internals

The following links should give you a good overview of the technical challenges and how we tackle them.
 * Parsoid: How Wikipedia catches up with the web -- blog post from March 2013 outlining why this problem is difficult and how we tackle it.
 * A preliminary look at Parsoid internals [ Slides, Video ] -- tech talk from April 2014; it should still be a useful overview of how Parsoid tackles this problem.

Docs

 * DOM Spec -- documents the HTML that Parsoid generates
 * Data-Parsoid attribute -- documents the information recorded in the data-parsoid attribute. This is considered private information and can be changed at any time without notice.

Architecture
The broad architecture looks like this:

 wikitext
     |
     V
 PEG tokenizer
     |  Chunks of tokens
     V
 Token stream transformations
     |  Chunks of tokens
     V
 HTML5 tree builder
     |  HTML5 DOM tree
     V
 DOM Postprocessors
     |  HTML5 DOM tree
     V
 (X)HTML serialization
     |
     +--> Browser
     |
     V
 Parsoid clients

This is essentially an HTML parser pipeline, with the regular HTML tokenizer replaced by a combined wikitext/HTML tokenizer, and with additional functionality implemented as (mostly syntax-independent) token stream transformations.
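The overall shape of the pipeline can be sketched as a chain of stage functions. This is a minimal illustration only -- the stage names and data shapes here are hypothetical stand-ins; Parsoid's real stages are streaming, asynchronous, and far richer:

```javascript
// Toy sketch of the pipeline shape. Each stage consumes the previous
// stage's output; real Parsoid streams chunks of tokens instead.
function tokenize(wikitext) {
  // Stand-in for the PEG tokenizer: emit one text token per word.
  return wikitext.split(/\s+/).map(w => ({ type: 'text', value: w }));
}
function transformTokens(tokens) { return tokens; }  // context-sensitive wiki rules
function buildTree(tokens) { return { name: 'body', children: tokens }; }
function postprocess(dom) { return dom; }            // e.g. paragraph wrapping
function serialize(dom) {
  return '<body>' + dom.children.map(t => t.value).join(' ') + '</body>';
}

function wikitextToHtml(wikitext) {
  return serialize(postprocess(buildTree(transformTokens(tokenize(wikitext)))));
}
```

For example, `wikitextToHtml('hello world')` yields `<body>hello world</body>` in this toy version.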


 * 1) The PEG-based tokenizer (https://phabricator.wikimedia.org/diffusion/GPAR/browse/master/lib/wt2html/pegTokenizer.pegjs.txt) produces a combined token stream from wiki and HTML syntax. The PEG grammar is a context-free grammar that can be ported to different parser generators, mostly by adapting the parser actions to the target language. Currently we use pegjs to build the actual JavaScript tokenizer for us. We try to do as much work as possible in the grammar-based tokenizer, so that the emitted tokens are already mostly syntax-independent.
 * 2) Token stream transformations are used to implement context-sensitive wiki-specific functionality (wiki lists, quotes for italic/bold, etc.). Templates are also expanded at this stage, which makes it possible to still render unbalanced templates like table start / row / end combinations.
 * 3) The resulting tokens are then fed to an HTML5 tree builder (currently the 'html5' node.js module), which builds an HTML5 DOM tree from the token soup. This step already sanitizes nesting and enforces some content-model restrictions according to the rules of the HTML5 parsing spec.
 * 4) The resulting DOM is further manipulated using postprocessors. Currently, any remaining top-level inline content is wrapped into paragraphs in such a postprocessor. For output meant for viewing, further document-model sanitization can be added here to get very close to what Tidy does in the production parser.
 * 5) Finally, the DOM tree can be serialized as XML or HTML.
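To make step 1 concrete, here is a hand-rolled sketch of tokenizing one small piece of wikitext syntax (quote runs). This is an illustration only -- the real tokenizer is generated by pegjs from the grammar linked above, not written by hand:

```javascript
// Toy tokenizer for wikitext quote runs ('' = italic, ''' = bold).
// Emits a flat stream of quote and text tokens; lone quotes are
// treated as text by the real grammar but ignored in this sketch.
function tokenizeQuotes(src) {
  const tokens = [];
  const re = /('{2,3})|([^']+)/g;
  let m;
  while ((m = re.exec(src)) !== null) {
    if (m[1]) {
      tokens.push({ type: 'quote', value: m[1] });
    } else {
      tokens.push({ type: 'text', value: m[2] });
    }
  }
  return tokens;
}
```

For example, `tokenizeQuotes("''hi''")` produces a quote token, a text token, and a closing quote token; note that the tokenizer itself does not decide whether a quote run opens or closes italics -- that context-sensitive decision is left to a later token stream transformation.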
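Step 2 (context-sensitive token stream transformations) can be illustrated with a drastically simplified quote handler that turns quote runs into italic/bold tag tokens. The real QuoteTransformer resolves many ambiguous cases (e.g. runs of four or more quotes) that this sketch ignores:

```javascript
// Toy token stream transformation: pair up quote tokens into
// open/close tag tokens for <i> and <b>. State carried across the
// stream is what makes this context-sensitive.
function transformQuotes(tokens) {
  const out = [];
  let italicOpen = false;
  let boldOpen = false;
  for (const tok of tokens) {
    if (tok.type === 'quote' && tok.value === "''") {
      out.push({ type: italicOpen ? 'endtag' : 'tag', name: 'i' });
      italicOpen = !italicOpen;
    } else if (tok.type === 'quote' && tok.value === "'''") {
      out.push({ type: boldOpen ? 'endtag' : 'tag', name: 'b' });
      boldOpen = !boldOpen;
    } else {
      out.push(tok);  // pass everything else through unchanged
    }
  }
  return out;
}
```

Because transformations like this operate on the token stream rather than on a tree, they can also emit unbalanced tag tokens (as template expansion does); it is the HTML5 tree builder in step 3 that later fixes up nesting.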
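The paragraph-wrapping postprocessor mentioned in step 4 can be sketched over a toy node structure (plain objects with `name` and `children`; Parsoid operates on a real HTML5 DOM). Any run of top-level inline content between block elements gets wrapped in a paragraph node:

```javascript
// Block-level element names that terminate an open paragraph.
// Hypothetical subset for illustration.
const BLOCK = new Set(['p', 'table', 'ul', 'ol', 'div']);

// Wrap top-level inline content of <body> into <p> nodes.
function wrapInlineInParagraphs(body) {
  const children = [];
  let para = null;  // currently open paragraph, if any
  for (const node of body.children) {
    if (node.name && BLOCK.has(node.name)) {
      para = null;          // a block element closes any open paragraph
      children.push(node);
    } else {
      if (!para) {
        para = { name: 'p', children: [] };
        children.push(para);
      }
      para.children.push(node);  // inline content accumulates in the <p>
    }
  }
  return { name: 'body', children };
}
```

Running this over a body containing text, a table, and more text yields paragraph / table / paragraph, which matches the intuition that block elements split surrounding inline runs.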