Parsoid

The Parsoid project aims to develop a more consistent WikiText parser that translates MediaWiki's well-known syntax into an equivalent representation with better support for automated processing and visual editing. It is developed in parallel with, and in support of, the visual editor project as a future core project. A major requirement is the ability to reverse this translation (serialize back to WikiText) without introducing 'dirty diffs' or information loss. Wiki pages remain editable as plain WikiText.

Architecture
The broad architecture looks like this:

       | wikitext
       V
 PEG wiki/HTML tokenizer        (or other tokenizers / SAX-like parsers)
       | Chunks of tokens
       V
 Token stream transformations
       | Chunks of tokens
       V
 HTML5 tree builder
       | HTML5 DOM tree
       V
 DOM Postprocessors
       | HTML5 DOM tree
       V
 (X)HTML serialization
       |
       +--> Browser
       |
       V
 Visual Editor

In short, this is an HTML parser pipeline in which the regular HTML tokenizer is replaced by a combined wiki/HTML tokenizer, with additional functionality implemented as (mostly syntax-independent) token stream transformations.


 * 1) The PEG-based wiki tokenizer produces a combined token stream from wiki and HTML syntax. The PEG grammar is a context-free grammar that can be ported to different parser generators, mostly by adapting the parser actions to the target language. Currently we use pegjs to build the actual JavaScript tokenizer for us. We try to do as much work as possible in the grammar-based tokenizer, so that the emitted tokens are already mostly syntax-independent.
 * 2) Token stream transformations are used to implement context-sensitive wiki-specific functionality (wiki lists, quotes for italic/bold etc). Templates are also expanded at this stage, which makes it possible to still render unbalanced templates like table start / row / end combinations.
 * 3) The resulting tokens are then fed to an HTML5-spec-compatible DOM tree builder (currently the 'html5' node.js module), which builds an HTML5 DOM tree from the token soup. This step already sanitizes nesting and enforces some content-model restrictions according to the rules of the HTML5 parsing spec.
 * 4) The resulting DOM is further manipulated using postprocessors. Currently, any remaining top-level inline content is wrapped into paragraphs in such a postprocessor. For output for viewing, further document model sanitation can be added here to get very close to what tidy does in the production parser.
 * 5) Finally, the DOM tree can be serialized as XML or HTML.
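
To make the data flow concrete, the five stages can be sketched as a chain of plain functions. This is a toy illustration and not Parsoid's code: the token shapes, the function names (tokenize, transformQuotes, buildTree, wrapParagraphs, serialize) and the regex-based tokenizer are all assumptions made for the example. The real tokenizer is generated from a PEG grammar and the real tree builder is the html5 module. The sketch only understands '' (italic) quotes.

```javascript
// Toy pipeline: wikitext -> tokens -> transformed tokens -> tree -> HTML.
// All names and token shapes here are hypothetical.

// 1) Tokenizer: split the text into text tokens and quote tokens ('' only).
function tokenize(wikitext) {
  return wikitext.split(/('')/).filter((s) => s !== '').map((s) =>
    s === "''" ? { type: 'QUOTE' } : { type: 'TEXT', value: s });
}

// 2) Token stream transformation: pair up quote tokens as <i> start/end tags.
function transformQuotes(tokens) {
  let open = false;
  const out = tokens.map((t) => {
    if (t.type !== 'QUOTE') return t;
    open = !open;
    return { type: open ? 'TAG' : 'ENDTAG', name: 'i' };
  });
  // An unclosed quote is closed at the end of the stream.
  if (open) out.push({ type: 'ENDTAG', name: 'i' });
  return out;
}

// 3) Tree builder: turn the flat token list into a nested tree.
function buildTree(tokens) {
  const root = { name: 'body', children: [] };
  const stack = [root];
  for (const t of tokens) {
    const top = stack[stack.length - 1];
    if (t.type === 'TEXT') {
      top.children.push(t.value);
    } else if (t.type === 'TAG') {
      const node = { name: t.name, children: [] };
      top.children.push(node);
      stack.push(node);
    } else if (stack.length > 1) {
      stack.pop(); // an end tag closes the innermost open element
    }
  }
  return root;
}

// 4) Postprocessor: wrap top-level inline content in a paragraph.
function wrapParagraphs(root) {
  if (root.children.length > 0) {
    root.children = [{ name: 'p', children: root.children }];
  }
  return root;
}

// 5) Serializer: emit HTML for a node (text is escaped).
function serialize(node) {
  if (typeof node === 'string') {
    return node.replace(/&/g, '&amp;').replace(/</g, '&lt;');
  }
  return `<${node.name}>${node.children.map(serialize).join('')}</${node.name}>`;
}

function parse(wikitext) {
  return serialize(wrapParagraphs(buildTree(transformQuotes(tokenize(wikitext)))));
}

console.log(parse("Some ''italic'' text"));
// prints <body><p>Some <i>italic</i> text</p></body>
```

Because the quote handling lives in a token transformation rather than in the tokenizer, the tree builder and serializer stay unaware of wiki syntax, which is the point of the layering described above.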

Getting started
Development happens in the VisualEditor extension in SVN trunk (see modules/parser and tests/parser). The parser tests use the parserTests.txt file from the phase3 module.

svn checkout http://svn.wikimedia.org/svnroot/mediawiki/trunk/phase3
svn checkout http://svn.wikimedia.org/svnroot/mediawiki/trunk/extensions

You need node.js, npm and the npm modules listed in extensions/VisualEditor/tests/parser/README:
 * jquery
 * jsdom
 * buffer
 * optimist
 * pegjs
 * querystring
 * html5
 * request (implicitly installed by jsdom)
 * assert

The following additional modules are used in parserTests:
 * colors (for parserTests eye candy)
 * diff (parserTests output diffing)

You can install these with npm, either locally or, on Linux, globally.

When this is in place, you should be able to run all parser tests using:

cd extensions/VisualEditor/tests/parser
node ./parserTests.js

parserTests has quite a few options now which can be listed using.

An alternative wrapper taking wikitext on stdin and emitting WikiDom JSON or HTML on stdout is modules/parser/parse.js:

cd extensions/VisualEditor/modules/parser
echo '' | node parse.js

This example will transclude the English Wikipedia's en:Main Page including its embedded templates. Also check out  for options.

Enjoy!

Todo
If you would like to hack on the Parsoid parser, these are the tasks we currently see ahead. Some of them are marked as especially well suited for newcomers. If you have questions, try to ping gwicke on #mediawiki or send a mail to the wikitext-l mailing list. If all that fails, you can also contact Gabriel Wicke by mail.

Tokenizer
General tokenizer support for larger structures is relatively good already, but some details are still missing. A few simple things are completely missing, but easy to add:
 * magic words (the __UNDERSCORED__ variant)
 * signatures and timestamps
 * ISBN, RFC
 * language conversion syntax ('-{')
 * html vs wiki syntax annotations, try harder to preserve whitespace
 * source range and arg/source annotation for templates, extensions etc

Issues:
 * Make sure that (potential) extension end tags are always matched, even if parsing the content causes a switch to a plain-text parsing mode. Access to the unparsed source is already provided with source position attributes in tag tokens, but tokens for the parsed content should also be available to extensions. The output of extensions will be parsed (with different sanitizer settings?) as well, which should fix bug 2700.
 * Configuration-dependent syntax. It would be nice to keep the tokenizer independent of local configurations. This appears to be difficult at least for url protocols recognized in links. Most other configuration-dependent things including extensions can however be handled in token stream transforms.
 * Comments in arbitrary places (e.g., ) cannot generally be supported without stripping comments before parsing. Even if parsed, this type of comment could not be represented in the DOM. Before deployment, we should check if this is common enough to warrant an automated conversion. Grepping a dump works well for this check.

Things to check:
 * Tim's Preprocessor ABNF
 * User documentation for preprocessor rewrite

Token stream transforms

 * More complete implementation of Parser functions and magic words. Some implementation and lots of stubs (FIXME, quite straightforward!) in ext.core.ParserFunctions.js.
 * Fall-back to action=parse api for extensions and other unsupported constructs. Basically build a page of unsupported elements in document order with each element prefixed/postfixed with unique (non-wikisyntax) delimiters. Then extract results between delimiters. See ParserNotesExtensions and Wikitext_parser/Environment.
 * Filter attributes, convert non-whitelisted tags into text tokens. See ext.core.Sanitizer.js for an outline, should be a relatively straightforward port from the PHP version. Good task if you'd like to dive into the JS parser.
 * Internal links: handle images, files, categories. The tokenizer classifies everything after the first pipe as description. Image parameters fortunately end up in a plain text token and can be easily re-parsed in a token stream transformer.
 * Generic attribute expansion to support templates and template arguments in attributes: Expand all non-string arguments, presumably convertible to plain text after phase 2. Use AttributeTransformManager, and move its use out of TemplateHandler. Might be cleaner to split attribute expansion into phase 1 / 2 instead of calling both from AsyncTokenTransformer (phase 2). Improvement: only expand branches selected by parser functions.
 * Map template attributes to HTML5 microdata.
 * Fix-ups for things documented in the following parser tests: 'External links: wiki links within external link (Bug 3695)'
 * Optimize token representation: Plain string for text, objects with appropriate constructor for others. Basically eliminate the type attribute.
 * Handle table foster-parenting with round-tripping by reordering and marking tokens
 * Handle dynamically generated nowiki sections: . Template arguments are already tokenized and expanded before substitution, so we need to revert this. Idea: Re-serialize tokens to original text using source position annotations and other round-trip information. Icky, but doable. Try to structure HTML DOM to WikiText serializer around SAX-like start/end handlers, so that the same handlers can serialize the token stream back to wikitext.
 * Validate bracketed external link target. Removed validation in tokenizer to support templates and arguments in link target. Links with non-validating targets need to be turned back into plain text (including the brackets). Need to investigate if there are cases where a non-valid link target would change parsing with surrounding structures, e.g. by matching up the closing bracket with an earlier opened bracket. If so, can these be handled in the token stream?
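
The action=parse fall-back from the list above (batch unsupported elements into one page, then split the result at unique delimiters) could look roughly like the following. This is a minimal sketch under assumptions: the delimiter format and the helper names (delimiter, buildBatch, splitResults) are made up for illustration, and no API request is made. The parser output is faked with string replacement.

```javascript
// Hypothetical helpers for batching unsupported constructs into one page.
// The delimiter format is an assumption; it only needs to be unique and
// free of wiki syntax so that it survives parsing unchanged.
function delimiter(id, index, kind) {
  return `UNIQ${id}x${index}x${kind}QINU`;
}

// Build one wikitext page containing all elements in document order.
function buildBatch(elements, id) {
  return elements
    .map((src, i) => delimiter(id, i, 'start') + src + delimiter(id, i, 'end'))
    .join('\n');
}

// Extract the parsed result for each element from the returned HTML.
function splitResults(html, count, id) {
  const results = [];
  for (let i = 0; i < count; i++) {
    const start = delimiter(id, i, 'start');
    const end = delimiter(id, i, 'end');
    const from = html.indexOf(start);
    const to = html.indexOf(end, from + start.length);
    // null marks an element whose delimiters did not survive parsing.
    results.push(from < 0 || to < 0 ? null : html.slice(from + start.length, to));
  }
  return results;
}

// Usage with a stand-in for the parser output (no request is made here).
const id = 'a1b2';
const batch = buildBatch(['<ref>note</ref>', '<math>x^2</math>'], id);
const fakeHtml = batch
  .replace('<ref>note</ref>', '<sup>[1]</sup>')
  .replace('<math>x^2</math>', '<img alt="x^2">');
console.log(splitResults(fakeHtml, 2, id));
```

Sending all elements in one page keeps them in document order, so constructs that depend on earlier ones (such as numbered references) see the same context as in the original page.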

Not really a token stream transform, but closely related:
 * Port the visual editor WikiText serializer to work on HTML DOM, and convert handlers to support per-token (or SAX event style) serialization. Per-token handlers allow us to use the same handlers for DOM or token stream serialization.
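
The idea of per-token handlers can be illustrated with a small sketch in which one set of start/end handlers is driven either by a flat token stream or by a tree walk. Everything here is hypothetical: the handler table, the token shapes and the plain-object tree stand in for the real serializer and a real HTML DOM.

```javascript
// Hypothetical SAX-style handlers: one start/end pair per element name.
const handlers = {
  i: { start: () => "''", end: () => "''" },
  b: { start: () => "'''", end: () => "'''" },
  h2: { start: () => '== ', end: () => ' ==' },
};

function emit(name, kind) {
  const h = handlers[name];
  // Elements without a handler contribute no markup in this sketch.
  return h ? h[kind]() : '';
}

// Driver 1: serialize a flat token stream.
function serializeTokens(tokens) {
  return tokens.map((t) => {
    if (t.type === 'TEXT') return t.value;
    return emit(t.name, t.type === 'TAG' ? 'start' : 'end');
  }).join('');
}

// Driver 2: walk a tree and fire the same handlers as start/end events.
function serializeTree(node) {
  if (typeof node === 'string') return node;
  return emit(node.name, 'start') +
    node.children.map(serializeTree).join('') +
    emit(node.name, 'end');
}

const tokens = [
  { type: 'TAG', name: 'b' }, { type: 'TEXT', value: 'bold' },
  { type: 'ENDTAG', name: 'b' }, { type: 'TEXT', value: ' and plain' },
];
const tree = {
  name: 'p',
  children: [{ name: 'b', children: ['bold'] }, ' and plain'],
};

console.log(serializeTokens(tokens)); // prints '''bold''' and plain
console.log(serializeTree(tree));     // prints '''bold''' and plain
```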

DOM tree builder
Generally we would like to avoid any changes to the default HTML5 tree builder algorithm, as this would allow us to use the built-in HTML parser in modern browsers, or unchanged libraries. There are however some tasks that are very hard to solve otherwise, and require only small changes to the tree builder. These tasks all have to do with unbalanced token soup, which should be confined to the server.


 * Spurious end-tags are ignored by the tree builder, while (some) are displayed as text in current MediaWiki. Text display is helpful for authors. The necessary change to the html tree builder to replicate this would be small, but is not possible if a browser's built-in parser is used. The visual editor hopefully reduces the need for this kind of debugging aid in the medium term.
 * Propagate attribute information for end tag tokens (especially source information for tokens originating from templates) to matching start tag, to make sure that the full scope of token-affected subtrees is captured. This is hard to do without a modification to the tree builder. Only needs to be performed on the server side. Relatively simple modification.
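
The end-tag attribute propagation described above can be sketched on a plain token list. This is only an illustration under assumptions: a real HTML5 tree builder matches tags with considerably more involved rules (implied end tags, foster parenting), and the token shapes, the function name and the 'source' attribute are hypothetical.

```javascript
// Copy one attribute from each end tag to its matching start tag,
// stored under an 'end-' prefixed key. Tokens are modified in place.
function propagateEndTagAttributes(tokens, attrName) {
  const open = []; // stack of start tag tokens that are still open
  for (const t of tokens) {
    if (t.type === 'TAG') {
      open.push(t);
    } else if (t.type === 'ENDTAG') {
      // Find the innermost open start tag with the same name.
      let i = open.length - 1;
      while (i >= 0 && open[i].name !== t.name) i--;
      if (i < 0) continue; // spurious end tag: nothing to match
      const start = open[i];
      open.length = i; // everything above the match is implicitly closed
      if (t.attribs && attrName in t.attribs) {
        start.attribs = Object.assign({}, start.attribs, {
          ['end-' + attrName]: t.attribs[attrName],
        });
      }
    }
  }
  return tokens;
}

// A table opened by one template and closed by another.
const tableTokens = propagateEndTagAttributes([
  { type: 'TAG', name: 'table', attribs: { source: 'Template:Start' } },
  { type: 'TEXT', value: 'cell' },
  { type: 'ENDTAG', name: 'table', attribs: { source: 'Template:End' } },
], 'source');

console.log(tableTokens[0].attribs);
```

After this pass the start tag carries both sources, so the element built from it records the full template-affected range even though the tree builder discards end tag attributes.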

DOM postprocessing

 * Some document model enforcement on HTML DOM to aid editor, should be able to run either on server or client.
 * Longer term fun project: move DOM building and transformations to webworker to provide fast Lua-extension-like or DOM/Tal/Genshi template functionality and multi-core support. See some ideas.

Middlewarish bits

 * Wikitext source / modified DOM serialization splicing
 * Set up a basic web service wrapper. Neil mostly did this already by hooking parse.js up to the MW API. A longer-running node server will be needed at some stage to support splicing and to avoid reconstructing the tokenizer for each request (which takes a few seconds). The wrapper also needs to be modified to use the HTML DOM serialization instead of JSON.
 * Clean-up of round-trip data-wiki* attributes for pure view HTML
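
The benefit of a longer-running server comes from building the tokenizer once per process instead of once per request. A minimal sketch of that pattern follows. The helper once, the stand-in tokenizer and handleRequest are assumptions for illustration. The actual tokenizer construction (via pegjs) is replaced by a trivial function so that the example stays self-contained.

```javascript
// Memoize an expensive factory so that it runs at most once per process.
function once(build) {
  let cached;
  let built = false;
  return () => {
    if (!built) {
      cached = build();
      built = true;
    }
    return cached;
  };
}

// Stand-in for the expensive tokenizer construction (hypothetical).
let buildCount = 0;
const getTokenizer = once(() => {
  buildCount++;
  return { tokenize: (text) => text.split(/\s+/).filter(Boolean) };
});

// Hypothetical request handler: every request reuses the same tokenizer.
function handleRequest(wikitext) {
  return getTokenizer().tokenize(wikitext);
}

handleRequest('first request');
handleRequest('second request');
console.log(buildCount); // prints 1
```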

Monthly status summary
Shared with the Visual editor project.

(See all status reports)

parserTests.js result history
Total 672 tests (all, including normally disabled ones).
 * 15:04, 29 November 2011 (UTC): 50 passed, 4m45
 * 15:07, 29 November 2011 (UTC): 55 passed, 4m40
 * 16:27, 1 December 2011 (UTC): 139 passed, 8m50
 * 22:14, 6 December 2011 (UTC): 169 passed, 7m30
 * 17:32, 7 December 2011 (UTC): 180 passed, 7m35
 * 11:13, 12 December 2011 (UTC): 180 passed, 0m14 (and 5 seconds with --cache) after the tokenizer was no longer rebuilt for each test
 * 00:11, 22 January 2012 (UTC): 220 passed, 0m6.1 seconds with --cache
 * 12:53, 1 February 2012 (UTC): 222 passed, 0m6.3 seconds with --cache
 * 17:36, 7 February 2012 (UTC): 232 passed, 0m6.6 seconds with --cache

Technical documents

 * /HTML5 DOM with microdata: The HTML5 DOM produced by the parser, and used for communication with the visual editor. Not yet implemented.
 * /test cases: Please add interesting snippets or pages.