Parser 2011/Parser development

Current development happens in the VisualEditor extension in SVN trunk (see modules/parser and tests/parser). The broad architecture looks like this:

PEG wiki/HTML tokenizer -> Token stream transformations -> HTML5 tree builder -> DOM tree -> DOM Postprocessors +-> (X)HTML +-> WikiDom -> Visual Editor

This is essentially an HTML parser pipeline in which the regular HTML tokenizer is replaced by a combined wiki/HTML tokenizer.


 * 1) The PEG-based wiki tokenizer produces a combined token stream from wiki and HTML syntax. The PEG grammar is a context-free grammar that can be ported to different parser generators, mostly by adapting the parser actions to the target language. Currently we use pegjs to build the actual JavaScript tokenizer.
 * 2) Token stream transformations implement context-sensitive, wiki-specific functionality (wiki lists, quotes for italic/bold, etc.). Templates will also be expanded at this stage, which makes it possible to still render unbalanced templates such as table start / row / end combinations.
 * 3) The resulting tokens are then converted to the internal format of an HTML5-spec-compatible DOM tree builder (currently the 'html5' node.js module), which builds an HTML5 DOM tree from the token soup. This crucial step already sanitizes nesting and enforces some content-model restrictions according to the rules of the HTML5 parsing spec.
 * 4) The resulting DOM is further manipulated using postprocessors. Currently, any remaining top-level inline content is wrapped into paragraphs in such a postprocessor. For output to HTML (but not editing), further document model sanitation can be added here to get very close to what tidy does in the current parser.
 * 5) Finally, the DOM tree can be serialized using .innerHTML. For editing, the idea is to convert the HTML DOM tree to the editing-optimized WikiDom format. This will involve merging of adjacent formatting elements that were split up by the HTML tree builder to satisfy nesting constraints.
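As a rough illustration of stage 2, here is a minimal sketch of a token stream transformation that turns pairs of quote tokens into italic tags. The token shapes, the function names (italicTransform, runTransforms, serialize) and the direct serialization standing in for stages 3 to 5 are all assumptions made for this example; none of them are taken from the actual VisualEditor code, and bold or mixed quote runs are left out.

```javascript
// Hypothetical token shapes: { type: 'TAG' | 'ENDTAG' | 'TEXT' | 'QUOTE', ... }.
// These names are assumptions for illustration, not the real tokenizer output.

// Stage 2 sketch: turn runs of two single quotes into balanced <i> / </i> tags.
function italicTransform(tokens) {
    var out = [], open = false;
    tokens.forEach(function (tok) {
        if (tok.type === 'QUOTE' && tok.count === 2) {
            out.push({ type: open ? 'ENDTAG' : 'TAG', name: 'i' });
            open = !open;
        } else {
            out.push(tok);
        }
    });
    // Close a dangling italic so the next stage sees balanced tags.
    if (open) {
        out.push({ type: 'ENDTAG', name: 'i' });
    }
    return out;
}

// Run a list of transformations over the token stream, in order.
function runTransforms(tokens, transforms) {
    return transforms.reduce(function (stream, fn) {
        return fn(stream);
    }, tokens);
}

// Stand-in for stages 3 to 5: serialize tokens directly instead of building a DOM.
function serialize(tokens) {
    return tokens.map(function (tok) {
        if (tok.type === 'TAG') { return '<' + tok.name + '>'; }
        if (tok.type === 'ENDTAG') { return '</' + tok.name + '>'; }
        return tok.value;
    }).join('');
}

var tokens = [
    { type: 'TEXT', value: 'a ' },
    { type: 'QUOTE', count: 2 },
    { type: 'TEXT', value: 'b' },
    { type: 'QUOTE', count: 2 },
    { type: 'TEXT', value: ' c' }
];
console.log(serialize(runTransforms(tokens, [italicTransform])));
// prints "a <i>b</i> c"
```

Keeping each transformation a plain function from token list to token list makes it easy to add further stages (lists, template expansion) without touching the tokenizer.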

Trying it out
The code is in the VisualEditor extension in SVN. The parser tests use the parserTests.txt file from the phase3 module.

svn checkout http://svn.wikimedia.org/svnroot/mediawiki/trunk/phase3
svn checkout http://svn.wikimedia.org/svnroot/mediawiki/trunk/extensions

You need node.js, npm and the npm modules listed in extensions/VisualEditor/tests/parser/README:
 * pegjs
 * colors
 * html5
 * jquery
 * jsdom
 * diff
 * libxmljs (requires native compilation)
 * optimist (for argument handling)
 * webworker (not needed for parser tests)

You can install these locally using npm install, or globally (on Linux) using sudo npm install -g, each followed by the module names.

When this is in place, you should be able to run all parser tests using:

cd extensions/VisualEditor/tests/parser
node ./parserTests.js

parserTests.js now accepts quite a few command-line options.

Enjoy!

Tokenizer
General tokenizer support for larger structures is already relatively good, but many details, such as table parameter parsing, are still missing. Completely missing:
 * magic words
 * signatures
 * ISBN, RFC
 * language conversion syntax ('-{')
 * general HTML entity decoding
 * Comments in arbitrary places. Supporting these is not sanely possible, and they cannot be represented in the DOM either. We need to grep a dump to figure out whether they are common enough to make a conversion necessary.

The heading productions should be collapsed into a single one to avoid backtracking. The heading level can then be figured out in the action.
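A minimal sketch of what such an action could compute, assuming the single production captures the opening run of '=' characters, the content, and the closing run. The regular expression only stands in for the PEG production here, and the names headingLevel and parseHeadingLine are hypothetical.

```javascript
// Hypothetical action logic: the level is the shorter of the two '=' runs,
// capped at 6 because HTML only has h1..h6.
function headingLevel(openRun, closeRun) {
    return Math.min(openRun.length, closeRun.length, 6);
}

// Stand-in for the single heading production (a regex instead of PEG rules).
function parseHeadingLine(line) {
    var m = /^(=+)(.+?)(=+)\s*$/.exec(line);
    if (!m) {
        return null; // not a heading line
    }
    var level = headingLevel(m[1], m[3]);
    // Surplus '=' characters on either side stay part of the heading content.
    var content = m[1].slice(level) + m[2] + m[3].slice(level);
    return { level: level, content: content.trim() };
}

console.log(parseHeadingLine('== Foo =='));
console.log(parseHeadingLine('=== Bar =='));
```

With this approach an unbalanced line such as "=== Bar ==" is handled by one match instead of trying a level-3 production first and backtracking to level 2.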

Token stream transforms

 * Template and parser function expansion, or alternatively template argument serialization for the editor
 * Internal links: handle images, files and categories. This needs access to the plain text of parameters to allow specialized reparsing depending on the link type.
 * Filter attributes and convert non-whitelisted tags into text tokens
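The last item could look roughly like the following sketch. The whitelists, the token shape (including a source field holding the original wikitext of the tag) and the function name sanitizeToken are assumptions for illustration only; the real lists would have to match what the current MediaWiki sanitizer allows.

```javascript
// Hypothetical whitelists, deliberately tiny for the example.
var ALLOWED_TAGS = { b: true, i: true, p: true, span: true };
var ALLOWED_ATTRIBUTES = { 'class': true, id: true, title: true };

// Assumed token shape: { type, name, attribs: [[key, value], ...], source }.
function sanitizeToken(tok) {
    if (tok.type !== 'TAG' && tok.type !== 'ENDTAG') {
        return tok;
    }
    if (!ALLOWED_TAGS.hasOwnProperty(tok.name)) {
        // Non-whitelisted tags become text tokens, so they show up literally.
        return { type: 'TEXT', value: tok.source };
    }
    if (tok.type === 'ENDTAG') {
        return tok;
    }
    // Keep only whitelisted attributes on allowed start tags.
    var attribs = (tok.attribs || []).filter(function (kv) {
        return ALLOWED_ATTRIBUTES.hasOwnProperty(kv[0].toLowerCase());
    });
    return { type: 'TAG', name: tok.name, attribs: attribs, source: tok.source };
}

console.log(sanitizeToken({
    type: 'TAG', name: 'span', source: '<span class="x" onclick="evil()">',
    attribs: [['class', 'x'], ['onclick', 'evil()']]
}));
```

Doing this on the token stream, before the tree builder runs, means disallowed tags never influence the DOM structure.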

DOM tree builder

 * Spurious end tags are ignored by the tree builder, while some of them are displayed as text in current MediaWiki. Text display is helpful for authors, but we need to decide whether departing from the plain HTML tree builder behavior is worth it. The visual editor will hopefully avoid this kind of problem in the medium term.
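If text display were wanted, one option that leaves the tree builder untouched would be an extra token stream transformation. The following is only a sketch under that assumption: the token shapes and the name spuriousEndTagsToText are hypothetical, and the simple open-element stack ignores void elements and the finer points of the HTML5 tree construction rules.

```javascript
// Convert end tags that have no matching open element into text tokens.
function spuriousEndTagsToText(tokens) {
    var open = []; // names of currently open elements, innermost last
    return tokens.map(function (tok) {
        if (tok.type === 'TAG') {
            open.push(tok.name);
        } else if (tok.type === 'ENDTAG') {
            var idx = open.lastIndexOf(tok.name);
            if (idx === -1) {
                // Nothing to close: show the end tag literally.
                return { type: 'TEXT', value: '</' + tok.name + '>' };
            }
            // Close the element together with anything still nested inside it.
            open.length = idx;
        }
        return tok;
    });
}

console.log(spuriousEndTagsToText([
    { type: 'TEXT', value: 'a' },
    { type: 'ENDTAG', name: 'b' },
    { type: 'TAG', name: 'i' },
    { type: 'ENDTAG', name: 'i' }
]));
```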