Parser 2011/Parser development

Current development happens in the VisualEditor extension in SVN trunk (see modules/parser and tests/parser). The broad architecture looks like this:

PEG tokenizer -> Token stream transformations -> HTML5 tree builder -> DOM tree -> DOM Postprocessors -> HTML

So this is basically an HTML parser pipeline, with the regular HTML tokenizer swapped out for a combined wiki/HTML tokenizer.
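
Putting the stages together, the pipeline could be driven roughly as in the sketch below. Only the pegjs module and its PEG.buildParser() entry point are real here; transformTokens, buildTree, runPostprocessors and the grammar file name are hypothetical stand-ins for code in modules/parser.

 // Rough sketch of driving the pipeline in node.js. Only the pegjs
 // module and its PEG.buildParser() entry point are real here;
 // transformTokens, buildTree and runPostprocessors are hypothetical
 // stand-ins for code in modules/parser, and the grammar file name is
 // made up.
 var fs = require('fs');
 var PEG = require('pegjs');
 // 1. Build the combined wiki/HTML tokenizer from the PEG grammar.
 var grammar = fs.readFileSync('wikiTokenizer.pegjs', 'utf8');
 var tokenizer = PEG.buildParser(grammar);
 // 2. Tokenize wikitext into a combined wiki/HTML token stream.
 var tokens = tokenizer.parse("Some ''wiki'' text");
 // 3. Token stream transformations: lists, quotes, template expansion.
 tokens = transformTokens(tokens);
 // 4. Feed the token soup to the HTML5 tree builder ('html5' module);
 //    buildTree stands in for the token format conversion + tree build.
 var doc = buildTree(tokens);
 // 5. Run the DOM postprocessors, then serialize via innerHTML.
 runPostprocessors(doc);
 console.log(doc.body.innerHTML);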

 * The PEG-based wiki tokenizer produces a combined token stream from wiki and HTML syntax. The PEG grammar is an essentially context-free, declarative description of the syntax that can be ported to different parser generators, mostly by adapting the parser actions to the target language. Currently we use pegjs to build the actual tokenizer for us (see the toy tokenizer sketch after this list).
 * Token stream transformations are used to implement wiki-specific functionality that is not context-free (wiki lists, quotes for italic/bold, etc.). Templates will also be expanded at this stage, which makes it possible to still render unbalanced templates such as table start / row / end combinations (a quote-handling transform is sketched below).
 * The resulting tokens are then converted into the internal token format of an HTML5-spec compatible DOM tree builder (currently the 'html5' node.js module), which builds an HTML5 DOM tree from the token soup. This crucial step already sanitizes nesting and enforces some content-model restrictions according to the rules of the HTML5 parsing spec (see the misnesting example below).
 * The resulting DOM is further manipulated using postprocessors. Currently, any remaining top-level inline content is wrapped into paragraphs by such a postprocessor (sketched below).
 * Finally, the DOM tree can be serialized using .innerHTML. For editing, the idea is to convert the HTML DOM tree to the editing-optimized WikiDom format. This will involve merging adjacent formatting elements that were split up by the HTML tree builder to satisfy nesting constraints (see the last sketch below).
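
To give a flavor of the tokenizer, here is a deliberately tiny pegjs grammar that emits link and text tokens. The grammar and the token shape ({type: ..., value: ...}) are illustrative assumptions only; the real grammar in modules/parser is far larger, but it is built and invoked the same way via PEG.buildParser().

 // Toy pegjs tokenizer: emits LINK and TEXT tokens. The token shape is
 // an illustrative assumption; only the PEG.buildParser()/parse() usage
 // matches how the real tokenizer is generated.
 var PEG = require('pegjs');
 var grammar = [
     "start = token*",
     "token = link / text",
     "link  = '[[' target:[^\\]]+ ']]'",
     "        { return { type: 'LINK', target: target.join('') }; }",
     "text  = t:[^\\[]+ { return { type: 'TEXT', value: t.join('') }; }"
 ].join('\n');
 var tokenizer = PEG.buildParser(grammar);
 console.log(tokenizer.parse('plain [[Some page]] text'));
 // => [ { type: 'TEXT', value: 'plain ' },
 //      { type: 'LINK', target: 'Some page' },
 //      { type: 'TEXT', value: ' text' } ]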
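
The quote handling gives an idea of what a token stream transformation looks like. The sketch below is a strong simplification under assumed token shapes (including a QUOTES token the toy grammar above does not emit); the real transforms also track bold/italic precedence, lists and template expansion.

 // Hypothetical transform: turn '' quote tokens into balanced i tag
 // tokens. Token shapes are assumptions; the real transforms keep much
 // more state (bold vs. italic precedence, lists, templates).
 function transformQuotes(tokens) {
     var out = [];
     var openItalic = false;
     tokens.forEach(function (tok) {
         if (tok.type === 'QUOTES' && tok.value === "''") {
             out.push({ type: openItalic ? 'ENDTAG' : 'TAG', name: 'i' });
             openItalic = !openItalic;
         } else {
             out.push(tok);
         }
     });
     // Close a dangling italic so the tree builder sees balanced tags.
     if (openItalic) {
         out.push({ type: 'ENDTAG', name: 'i' });
     }
     return out;
 }
 console.log(transformQuotes([
     { type: 'TEXT', value: 'plain ' },
     { type: 'QUOTES', value: "''" },
     { type: 'TEXT', value: 'italic' },
     { type: 'QUOTES', value: "''" }
 ]));
 // => TEXT 'plain ', TAG i, TEXT 'italic', ENDTAG i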
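
The nesting sanitization done by the tree builder is easiest to see with misnested formatting tags. Any HTML5-compliant parser performs the same repair, so the effect can be reproduced in a browser console; the html5 node.js module applies the same spec-defined rules when building the DOM from our token stream.

 // The HTML5 parsing algorithm repairs misnested formatting elements.
 // Runnable in a browser console; the html5 node module follows the
 // same spec rules when it builds the DOM from our token soup.
 document.body.innerHTML = "<b>bold <i>both</b> italic</i>";
 console.log(document.body.innerHTML);
 // => "<b>bold <i>both</i></b><i> italic</i>"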
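
A paragraph-wrapping postprocessor could look roughly like the following. This is a hypothetical simplification (the block-element list is incomplete, and whitespace-only text nodes are not special-cased), but it shows the kind of plain DOM manipulation these postprocessors do.

 // Sketch of a DOM postprocessor that wraps runs of top-level inline
 // content into paragraphs. Hypothetical simplification of the real one.
 function wrapInlineContent(body) {
     var inlineRun = null;
     // Copy childNodes up front, since we reparent nodes while iterating.
     Array.prototype.slice.call(body.childNodes).forEach(function (node) {
         var isBlock = node.nodeType === 1 &&
             /^(p|div|table|ul|ol|dl|pre|h[1-6]|blockquote)$/i.test(node.nodeName);
         if (isBlock) {
             inlineRun = null;              // block content ends the run
         } else {
             if (!inlineRun) {
                 inlineRun = body.ownerDocument.createElement('p');
                 body.insertBefore(inlineRun, node);
             }
             inlineRun.appendChild(node);   // reparent inline node into the <p>
         }
     });
 }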
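
Merging split-up formatting elements back together is a straightforward DOM walk. The sketch below is hypothetical and assumes that adjacent elements with the same name should be merged; the real conversion would also have to compare attributes, which this ignores.

 // Sketch: merge adjacent identical formatting elements, so that e.g.
 // <i>a</i><i>b</i> becomes <i>ab</i> before conversion to WikiDom.
 function mergeAdjacentFormatting(parent) {
     var node = parent.firstChild;
     while (node) {
         var next = node.nextSibling;
         if (next && node.nodeType === 1 &&
                 node.nodeName === next.nodeName &&
                 /^(i|b|u|s|small|big)$/i.test(node.nodeName)) {
             // Pull the second element's children into the first, then
             // drop the emptied element. Stay on the same node: it may
             // now be adjacent to yet another mergeable sibling.
             while (next.firstChild) {
                 node.appendChild(next.firstChild);
             }
             parent.removeChild(next);
         } else {
             // No merge here: recurse into the children, then advance.
             mergeAdjacentFormatting(node);
             node = next;
         }
     }
 }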