Parsoid/Todo

If you would like to hack the Parsoid parser, these are the tasks we currently see ahead. Some of them are marked as especially well suited for newbies. If you have questions, try to ping gwicke on #mediawiki or send a mail to the wikitext-l mailinglist. If all that fails, you can also contact Gabriel Wicke by mail.

Tokenizer
Earlier minor syntactical changes in Tim's preprocessor rewrite:
 * Tim's Preprocessor ABNF
 * User documentation for preprocessor rewrite

Low-hanging fruit
Simple and fun tasks for somebody wishing to dive into the tokenizer.
 * magic words (the __UNDERSCORED__ variant) : extraction mostly implemented by Ori and Adam, see for example https://gerrit.wikimedia.org/r/4050. Actual behavior still needs to be implemented.
 * ISBN, RFC, PMID - implemented by Ori
 * Language variants ('-{')

Clean-up

 * Move list handler from tokenizer to sync23 phase token transformer to support list items from templates. (Stabs being made by Adam: https://gerrit.wikimedia.org/r/3729)

Round-trip info

 * Add more round-trip information using the dataAttribs object property on tokens. This is serialized as JSON into a data-mw attribute on DOM nodes.
 * HTML vs wiki syntax
 * Try hard to preserve variable whitespace: Search for uses of the space production (or equivalent) in the grammar and capture the value into round-trip info.
 * Add source offset information to most blocks
 * Source offset/range annotation for elements to support selective re-serialization of modified DOM fragments to minimize dirty diffs

Extension tags
Make sure that (potential) extension end tags are always matched, even if parsing the content causes a switch to a plain-text parsing mode. An example would be an unclosed html comment ( <!-- ) in content between extension tags. Idea: Treat any non-html tags as potential extension tags, and handle unbalanced comments and nowiki sections accordingly. In other words, give unknown html tags precedence over unbalanced / unclosed comments or nowiki blocks.
 * Did some statistics on a dump of the English Wikipedia: There are about 10000 non-html/wiki end-tags, and none of them is preceded by an unclosed paragraph on the same line. List of end tags: en:User:GWicke/Martian endtags. So this strategy looks quite safe. GWicke (talk) 20:11, 20 February 2012 (UTC)

Chunked tokenization
The current tokenizer emits chunks of tokens for each parsed top-level block, but continues to parse a document in one go. Combined with the single-threaded nature of node this prevents pipelined processing of tokens, which causes a very high memory usage for template-heavy articles. The expansion of en:Barack Obama runs out of memory after consuming 1.6GB for this reason.

This can be fixed by only parsing a single top-level block, and returning the chunk and the length of the consumed source. Using this information, another tokenizer iteration on the remaining text can be scheduled cooperatively after doing some processing on the current chunk's tokens.

An alternative would be to run the tokenizer in a parallel OS thread, but this does not fit very well into the node framework.

Configuration-dependent syntax
It would be nice to keep the tokenizer independent of local configurations. This appears to be difficult at least for url protocols recognized in links. Most other configuration-dependent things including extensions can however be handled in token stream transforms.

Issues

 * Start-of-line position vs. templates and parser functions: . See also User:GWicke/Tables and User:GWicke/Lists
 * attributes conditionally set from parser functions:

Limitations
Limitation mainly stem from the single-pass nature of Parsoid vs. the multi-pass structure of the PHP parser. Those can be lifted by doing less work in the tokenizer and more work on the token stream, at the cost of a less complete PEG grammar and more complex token stream transforms (which would then be worth implementing as a parser in their own right). So far we have only done this to support template and list fragments, which are widely used.

Templates returning parts of syntactical structure apart from templates and lists
Example:  or en:Template:YouTube. If needed, additional cases can be supported by only emitting simple start/end tokens in the tokenizer and moving the actual parsing to a token stream transformer in the sync23 phase (after templates are expanded).

Same root issue: Noinclude spanning template or link parameters as in.

Comments in arbitrary places
Example:

These cannot generally be supported without stripping comments before parsing. Even if parsed, this type of comment could not be represented in the DOM. Before deployment, we should check if this is common enough to warrant an automated conversion. Grepping a dump works well for this check.

Mis-nested parser functions
The grammar-based tokenizer assumes some degree of sane nesting. Parser functions can return full tokens or substrings of an attribute, but not the first half of a token including half an attribute. Similar to the issues above, this limitation could be largely removed by dumbing down the tokenizer and deferring actual parsing until after template exansions at the cost of performance and PEG tokenizer grammar completeness. Mis-nested parser functions are hard to figure out for humans too, and should better be avoided / removed if possible.

Example: search for 'style="}}' in Template:Navbox. There are a total of 12 templates and no articles matching that string in the English Wikipedia. (used the following statement: )

expands to font-weight: bold">Some text in the PHP parser, but expands to {{#if:foo| Some text in Parsoid. This can be fixed by modifying the source to  . Hopefully this is rare enough to allow fixing this manually. Reliable detection of this case is needed to analyze how common this is.

Token stream transforms
Generally, try to run most transformations after template expansions to retain widest possible support for constructs that only match up after expansion. Try to pick up attributes for non-template tokens as late as possible.

Parser functions and magic words
Some implementation and lots of stubs (FIXME, quite straightforward!) in ext.core.ParserFunctions.js. Many magic words in particular depend on information from the wiki. Idea for now is to fall-back to action=parse api for extensions and other unsupported constructs. Basically build a page of unsupported elements in document order with each element prefixed/postfixed with unique (non-wikisyntax) delimiters. Then extract results between delimiters. See Parsoid/Interfacing with MW API and Wikitext_parser/Environment.

Sanitizer
Filter attributes, convert non-whitelisted tags into text tokens. See ext.core.Sanitizer.js for an outline, should be a relatively straightforward port from the PHP version. Good task if you'd like to dive into the JS parser.

Internal links, categories and images
The tokenizer is still independent of configuration data, so it does not pay attention to a wiki link's namespace. This means that image parameters are not parsed differently from normal link parameters, leaving specialized treatment to the LinkHandler token stream transformer. For images, arguments need to be separated from the caption. Full rendering requires information about the image dimensions, which needs to be retrieved from the wiki using either the generic fall-back described in Parsoid/Interfacing with MW API, or a specialized Image-specific API method. For action=parse, templates and -arguments in image options need to be fully expanded using the AttributeExpander before converting options back to wikitext. The (mostly)plain-text nature of options makes this quite easy fortunately. External link tokens produced by a link= option need to map back to the plain URL.

The LinkHandler transformer also needs to handle the Help:Pipe_trick while rendering.

HTML5 microdata
Track provenance on all tokens, preserve this information while building the tree and convert it into microdata / RDFa markup.

WikiText serialization
Not really a token stream transform, but closely related:
 * Port the visual editor WikiText serializer to work on HTML DOM, and convert handlers to support per-token (or SAX event style) serialization. Per-token handlers allow us to use the same handlers for DOM or token stream serialization.

Miscellaneous

 * Fix-ups for things documented in the following parser tests: 'External links: wiki links within external link (Bug 3695)'
 * Handle table foster-parenting with round-tripping by reordering and marking tokens
 * Handle dynamically generated nowiki sections: . Template arguments are already tokenized and expanded before substitution, so we need to revert this. Idea: Re-serialize tokens to original text using source position annotations and other round-trip information. Icky, but doable. Try to structure HTML DOM to WikiText serializer around SAX-like start/end handlers, so that the same handlers can serialize the token stream back to wikitext.
 * Validate bracketed external link target. Removed validation in tokenizer to support templates and arguments in link target. Links with non-validating targets need to be turned back into plain text (including the brackets). Need to investigate if there are cases where a non-valid link target would change parsing with surrounding structures, e.g. by matching up the closing bracket with an earlier opened bracket. If so, can these be handled in the token stream?
 * Handle named template parameters from a template or template arg:  should set the argument named 'style' on the template transclusion of Template:Foo. Mark as being in attribute position and re-parse after expansion in AttributeTransformManager? This is actually no longer supported since the preprocessor rewrite:  ->  Phew!
 * Generic attribute expand to support templates and template arguments in them: Expand all non-string arguments, presumably convertible to plain text after phase 2. Use AttributeTransformManager, and move its use out of TemplateHandler. Might be cleaner to split attribute expansion into phase 1 / 2 instead of calling both from AsyncTokenTransformer (phase 2). Improvement: only expand branches selected by parser functions.
 * Create global queue of outstanding and ready-to-process (input available) expansion processes to release tokens as early as possible. Right now, all processes waiting for a given template are processed sequentially, and node seems to process ready expansion processes in FIFO order. This blows up the memory usage, as tokens are not freed as early as possible.

DOM tree builder
Generally we would like to avoid any changes to the default HTML5 tree builder algorithm, as this would allow us to use the built-in HTML parser in modern browsers, or unchanged libraries. There are however some tasks that are very hard to solve otherwise, and require only small changes to the tree builder. These tasks all have to do with unbalanced token soup, which should be confined to the server.


 * Spurious end-tags are ignored by the tree builder, while (some) are displayed as text in current MediaWiki. Text display is helpful for authors. The necessary change to the html tree builder to replicate this would be small, but is not possible if a browser's built-in parser is used. The visual editor hopefully reduces the need for this kind of debugging aid in the medium term.
 * Propagate attribute information for end tag tokens (especially source information for tokens originating from templates) to matching start tag, to make sure that the full scope of token-affected subtrees is captured. This is hard to do without a modification to the tree builder. Only needs to be performed on the server side. Relatively simple modification.

DOM postprocessing

 * Some document model enforcement on HTML DOM to aid editor, should be able to run either on server or client.
 * Longer term fun project: move DOM building and transformations to webworker to provide fast Lua-extension-like or DOM/Tal/Genshi template functionality and multi-core support. See some ideas.

Middlewarish bits

 * Wikitext source / modified DOM serialization splicing
 * set up a basic web service wrapper - Neil mostly did this already by hooking parse.js up to the MW API. Will need a longer-running node server at some stage to support splicing and avoid the reconstruction of the tokenizer for each request (takes a few seconds). Opportunity to collaborate with Ashish Dubey on server portion . If we continue to use the API wrapper, then it would need to be modified to use the HTML DOM serialization instead of JSON.
 * clean-up of round-trip data-wiki* attributes for pure view html

Testing
See tests/parser, in particular parserTests.js.

parserTests

 * Set up a more complete testing environment including the time, predefined images and so on (see phase3/tests/parser/parserTests.inc).

Round-trip tests
There is a dumpReader in tests/parser, which can be used to run full dumps through the parser. For round-tripping the WikiText serializers needs to be ported and extended from the client code. See.