The following links should give you a good overview of the technical challenges and how we tackle them.
- Parsoid: How Wikipedia catches up with the web -- blog post from March 2013 outlining why this problem is difficult and how we approach it.
- A preliminary look at Parsoid internals [Slides, Video] -- tech talk from April 2014 that should still be a useful overview of how Parsoid tackles this problem.
- DOM Spec -- documents the HTML that Parsoid generates
- Data-Parsoid attribute -- documents the information recorded in the data-parsoid attribute. This is considered private information and can be changed at any time without notice.
The broad architecture looks like this:
        wikitext
           |
           V
     PEG tokenizer
           | Chunks of tokens
           V
     Token stream transformations
           | Chunks of tokens
           V
     HTML5 tree builder
           | HTML5 DOM tree
           V
     DOM Postprocessors
           | HTML5 DOM tree
           V
     (X)HTML serialization
           |
           +------------------> Browser
           |
           V
     Parsoid clients
In short, this is an HTML parser pipeline in which the regular HTML tokenizer is replaced by a combined wiki/HTML tokenizer, with additional functionality implemented as (mostly syntax-independent) token stream transformations.
- Token stream transformations are used to implement context-sensitive wiki-specific functionality (wiki lists, quotes for italic/bold, etc.). Templates are also expanded at this stage, which makes it possible to still render unbalanced templates like table start / row / end combinations.
- The resulting tokens are then fed to an HTML5-spec compatible DOM tree builder (currently the 'html5' node.js module), which builds an HTML5 DOM tree from the token soup. This step already sanitizes nesting and enforces some content-model restrictions according to the rules of the HTML5 parsing spec.
- The resulting DOM is further manipulated using postprocessors. Currently, any remaining top-level inline content is wrapped into paragraphs in such a postprocessor. For output intended for viewing, further document model sanitization can be added here to get very close to what Tidy does in the production parser.
- Finally, the DOM tree can be serialized as XML or HTML.
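The stages above can be illustrated with a toy pipeline. This is only a minimal sketch in Python, not Parsoid's actual code: the function names (`tokenize`, `transform_quotes`, `build_tree`, `wrap_in_paragraph`, `parse`) are hypothetical, the regex split stands in for the PEG tokenizer, and the stack-based builder is a much simplified stand-in for a real HTML5 tree builder. It only handles `''` (italic) and `'''` (bold) quote runs.

```python
import re
from xml.etree import ElementTree as ET


def tokenize(wikitext):
    """Toy stand-in for the PEG tokenizer: split into quote runs and text."""
    return [t for t in re.split(r"('{2,})", wikitext) if t]


def transform_quotes(tokens):
    """Token stream transformation: map '' and ''' runs to i/b tag tokens.

    Assumption: runs of any other length are passed through as plain text.
    """
    names = {"''": "i", "'''": "b"}
    open_tags = set()
    out = []
    for tok in tokens:
        name = names.get(tok)
        if name is None:
            out.append(("text", tok))
        elif name in open_tags:
            open_tags.remove(name)
            out.append(("end", name))
        else:
            open_tags.add(name)
            out.append(("start", name))
    return out


def append_text(parent, text):
    """Append text after the last child of parent, or as its leading text."""
    if len(parent):
        last = parent[-1]
        last.tail = (last.tail or "") + text
    else:
        parent.text = (parent.text or "") + text


def build_tree(tokens):
    """Simplified tree builder: always yields a balanced DOM under <body>."""
    body = ET.Element("body")
    stack = [body]
    for kind, value in tokens:
        if kind == "start":
            stack.append(ET.SubElement(stack[-1], value))
        elif kind == "end":
            # Close up to the matching open element; ignore stray end tags.
            if any(el.tag == value for el in stack[1:]):
                while stack.pop().tag != value:
                    pass
        else:
            append_text(stack[-1], value)
    return body


def wrap_in_paragraph(body):
    """DOM postprocessor: wrap top-level content in a single <p>.

    Assumption: every top-level child is inline content.
    """
    if body.text is None and len(body) == 0:
        return body
    p = ET.Element("p")
    p.text, body.text = body.text, None
    for child in list(body):
        body.remove(child)
        p.append(child)
    body.append(p)
    return body


def parse(wikitext):
    """Run all stages and serialize the resulting DOM as HTML."""
    dom = wrap_in_paragraph(build_tree(transform_quotes(tokenize(wikitext))))
    return ET.tostring(dom, encoding="unicode", method="html")


print(parse("Hello ''wiki'' and '''bold''' text"))
# <body><p>Hello <i>wiki</i> and <b>bold</b> text</p></body>
```

Because the quote handling lives in a token transformation rather than in the tokenizer, an unclosed `''` still produces a balanced tree: `parse("''open")` yields `<body><p><i>open</i></p></body>`, since the tree builder closes whatever is left open.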