Parser 2011/Parser development

Current development happens in the VisualEditor extension in SVN trunk (see modules/parser and tests/parser). The broad architecture looks like this:

| wikitext V PEG wiki/HTML tokenizer        (or other tokenizers / SAX-like parsers) | Chunks of tokens V Token stream transformations | Chunks of tokens V HTML5 tree builder | HTML 5 DOM tree V DOM Postprocessors | HTML5 DOM tree +--> (X)HTML serialization |   V DomConverter | WikiDom V JSON serialization | JSON WikiDom serialization V Visual Editor

So basically a HTML parser pipeline, with the regular HTML tokenizer replaced by a combined Wiki/HTML tokenizer.


 * 1) The PEG-based wiki tokenizer produces a combined token stream from wiki and html syntax. The PEG grammar is a context-free grammar that can be ported to different parser generators, mostly by adapting the parser actions to the target language. Currently we use pegjs to build the actual JavaScript tokenizer for us.
 * 2) Token stream transformations are used to implement context-sensitive wiki-specific functionality (wiki lists, quotes for italic/bold etc). Templates will also be expanded at this stage, which makes it possible to still render unbalanced templates like table start / row / end combinations.
 * 3) The resulting tokens are then converted to be compatible with the internal format of a HTML5-spec compatible DOM tree builder (currently the 'html5' node.js module), which builds a HTML5 DOM tree from the token soup. This step already sanitizes nesting and enforces some content-model restrictions according to the rules of the HTML5 parsing spec.
 * 4) The resulting DOM is further manipulated using postprocessors. Currently, any remaining top-level inline content is wrapped into paragraphs in such a postprocessor. For output to HTML (but not editing), further document model sanitation can be added here to get very close to what tidy does in the production parser.
 * 5) Finally, the DOM tree can be serialized using .innerHTML. For editing, the HTML DOM tree is converted to the editing-optimized WikiDom format.

Trying it out
The code is in the VisualEditor extension in SVN. The parser tests uses the parserTests.txt file from the phase3 module.

svn checkout http://svn.wikimedia.org/svnroot/mediawiki/trunk/phase3 svn checkout http://svn.wikimedia.org/svnroot/mediawiki/trunk/extensions

You need node.js, npm and the npm modules listed in extensions/VisualEditor/tests/parser/README:
 * pegjs
 * colors
 * html5
 * jquery
 * jsdom
 * diff
 * libxmljs (requires native compilation)
 * optimist (for argument handling)

You can install these using  or globally, on Linux using.

When this is in place, you should be able to run all parser tests using:

cd extensions/VisualEditor/tests/parser node ./parserTests.js

parserTests has quite a few options now which can be listed using.

Enjoy!

Tokenizer
General tokenizer support for larger structures is relatively good already, but some details are still missing. A few simple things are completely missing, but easy to add:
 * magic words (the __UNDERSCORED__ variant)
 * signatures and timestamps
 * ISBN, RFC
 * language conversion syntax ('-{')

Intermediate level todos:
 * Convert parts of wiki tables to tokens rather than matching it all up in the tokenizer. This is needed for the table start / row / end template use case.
 * Move out list building to token transform framework, to run after template expansions.

Issues:
 * Make sure that (potential) extension end tags are always matched, even if parsing the content causes a switch to a plain-text parsing mode. Access to the unparsed source is already provided with source position attributes in tag tokens, but tokens for the parsed content should also be available to extensions. The output of extensions will be parsed (with different sanitizer settings?) as well, which should fix bug 2700.
 * Configuration-dependent syntax. It would be nice to keep the tokenizer independent of local configurations. This appears to be difficult at least for url protocols recognized in links. Most other configuration-dependent things including extensions can however be handled in token stream transforms.
 * Comments in arbitrary places (e.g., ) cannot generally be supported without stripping comments before parsing. Even if parsed, this type of comment could not be represented in the DOM. Before deployment, we should check if this is common enough to warrant an automated conversion. Grepping a dump works well for this check.

Things to check:
 * Tim's Preprocessor ABNF
 * User documentation for preprocessor rewrite

Token stream transforms

 * Parser functions and magic variables: Wikitext_parser/Environment, ParserNotesExtensions Etherpad
 * Internal links: handle images, files, categories. Need access to plain text of parameters to allow specialized reparsing depending on type.
 * Generic attribute expand to support templates and template arguments in them: Expand all non-string arguments, presumably convertible to plain text after phase 2. Use AttributeTransformManager, and move its use out of TemplateHandler.
 * Filter attributes, convert non-whitelisted tags into text tokens. See ext.core.Sanitizer.js for an outline, should be a relatively straightforward port from the PHP version. Good task if you'd like to dive into the JS parser.
 * Fix-ups for things documented in the following parser tests: 'External links: wiki links within external link (Bug 3695)'

DOM tree builder

 * Spurious end-tags are ignored by the tree builder, while (some) are displayed as text in current MediaWiki. Text display is helpful for authors. The necessary change to the html tree builder to replicate this would be small, but is not possible if a browser's built-in parser is used. The visual editor hopefully reduces the need for this kind of debugging aid in the medium term.

parserTests.js result history

 * 15:04, 29 November 2011 (UTC): 50 passed, 4m45
 * 15:07, 29 November 2011 (UTC): 55 passed, 4m40
 * 16:27, 1 December 2011 (UTC): 139 passed, 8m50
 * 22:14, 6 December 2011 (UTC): 169 passed, 7m30
 * 17:32, 7 December 2011 (UTC): 180 passed, 7m35
 * 11:13, 12 December 2011 (UTC): 180 passed, 0m14 (and 5 seconds with --cache) after avoiding to re-build the tokenizer for each test
 * 00:11, 22 January 2012 (UTC): 220 passed, 0m6.1 seconds with --cache