Parsoid/Todo

If you would like to hack the Parsoid parser, these are the tasks we currently see ahead. Some of them are marked as especially well suited for newbies.

'''Please report issues in the Parsoid product in bugzilla. You can also add problematic wikitext snippets in Parsoid/Bug_test_cases.''' See also the list of open issues on Bugzilla.

If you have questions, try to ping gwicke or subbu on #mediawiki or send a mail to the wikitext-l mailinglist.

Next tasks

 * Tasks with priority 'normal' in the bug list
 * Talk:Parsoid/Todo
 * set up a test wiki with current VE in parsoid.wmflabs.org VM, and test saving and round-tripping
 * duplicate  and a few other bugs in  and

Tokenizer
Earlier minor syntactical changes in Tim's preprocessor rewrite:
 * Tim's Preprocessor ABNF
 * User documentation for preprocessor rewrite

Low-hanging fruit
Simple and fun tasks for somebody wishing to dive into the tokenizer.
 * Horizontal rules
 * Language variants ('-{')
 * Add tokenizer support for Template:(( et al (see 'see also' section in that template's documentation section).

Round-trip info

 * Add more round-trip information using the dataAttribs object property on tokens. This is serialized as JSON into a data-mw attribute on DOM nodes.
 * HTML vs wiki syntax
 * Try hard to preserve variable whitespace: Search for uses of the space production (or equivalent) in the grammar and capture the value into round-trip info.
 * Add source offset information to most elements to support selective re-serialization of modified DOM fragments to minimize dirty diffs (mostly done)

Extension tags
Make sure that (potential) extension end tags are always matched, even if parsing the content causes a switch to a plain-text parsing mode. An example would be an unclosed html comment ( <!-- ) in content between extension tags. Idea: Treat any non-html tags as potential extension tags, and handle unbalanced comments and nowiki sections accordingly. In other words, give unknown html tags precedence over unbalanced / unclosed comments or nowiki blocks.
 * Did some statistics on a dump of the English Wikipedia: There are about 10000 non-html/wiki end-tags, and none of them is preceded by an unclosed paragraph on the same line. List of end tags: en:User:GWicke/Martian endtags. So this strategy looks quite safe. GWicke (talk) 20:11, 20 February 2012 (UTC)

Parallel or cooperatively concurrent tokenization
The current tokenizer emits chunks of tokens for each parsed top-level block, but continues to parse a document in one go. With cooperative multitasking, this means that template expansions etc are queued up until the full document is parsed.

The tokenizer should either run in a separate OS thread, or cooperatively yield a chunk and schedule the tokenization of the remaining text after each top-level block.

Configuration-dependent syntax
It would be nice to keep the tokenizer independent of local configurations. This appears to be difficult at least for url protocols recognized in links. Most other configuration-dependent things including extensions can however be handled in token stream transforms.

Issues

 * Start-of-line position vs. templates and parser functions: . See also User:GWicke/Tables and User:GWicke/Lists
 * attributes conditionally set from parser functions:

Token stream transforms
See the recipe map in mediawiki.parser.js for the current parser transforms and their phases.

Internal links, categories and images
The tokenizer is still independent of configuration data, so it does not pay attention to a wiki link's namespace. This means that image parameters are not parsed differently from normal link parameters, leaving specialized treatment to the LinkHandler token stream transformer. For images, arguments need to be separated from the caption. Full rendering requires information about the image dimensions, which needs to be retrieved from the wiki using either the generic fall-back described in Parsoid/Interfacing with MW API, or a specialized Image-specific API method. For action=parse, templates and -arguments in image options need to be fully expanded using the AttributeExpander before converting options back to wikitext. The (mostly)plain-text nature of options makes this quite easy fortunately. External link tokens produced by a link= option need to map back to the plain URL.

The LinkHandler should also learn about mw:Manual:$wgCapitalLinks.

Parser functions and magic words
Some implementation and lots of stubs (FIXME, quite straightforward!) in ext.core.ParserFunctions.js. Many magic words in particular depend on information from the wiki. Idea for now is to fall-back to action=parse api for extensions and other unsupported constructs. Basically build a page of unsupported elements in document order with each element prefixed/postfixed with unique (non-wikisyntax) delimiters. Then extract results between delimiters. See Parsoid/Interfacing with MW API and Wikitext_parser/Environment.

Miscellaneous

 * Handle dynamically generated nowiki sections: . Template arguments are already tokenized and expanded before substitution, so we'd need to revert this. Idea: Re-serialize tokens to original text using source position annotations and other round-trip information. Icky, but doable. Try to structure HTML DOM to WikiText serializer around SAX-like start/end handlers, so that the same handlers can serialize the token stream back to wikitext.


 * Refactor the link handler to allow subclassing for the modification of 1) the mapping of namespace to handler method, 2) the mapping of file content types to handler method and 3) individual handler methods. This can then be used by Wikia to add custom handling for videos or other content. See for the complementary serializer extension API.

DOM tree builder
Content moved to Parsoid/Todo:Template round-tripping.

Inline element nesting minimization
Consider this wikitext example:. There are two distinct DOMs that we can parse this into:


 * Non-minimal DOM: .  Serialized wikitext of this DOM =
 * Minimal DOM: .  Serialized wikitext of this DOM =

If the 5 leading apostrophes are parsed as  followed by   we get the first result. But, if it is parsed as  followed by   we get the second form. So, given the above wikitext, we have two possible DOMs. Clearly, the minimal DOM is the desirable DOM in this example. While it might seem that we might be able to pick the right parse order, there is no context-free way of determining the right parse order. To illustrate this, consdier a different wikitext example:. The minimal parse order here requires us to parse the 5 leading apostrophes as  followed by   which is the flip of the first example. Since we can have arbitrary wikitext between the first 5 apostrophes and the closing apostrophes, we will need to look ahead as far as necessary to match up apostrophes appropriately.

So, we have to use a deterministic parsing order (always parse 5 apostrophes as  followed by   OR the other way around -- it shouldn't matter for this problem). We then have two possible strategies for generating a minimal DOM:
 * Transform the token stream to reorder tokens appropriately.
 * Generate a DOM and process the DOM to generate a minimal DOM for I and B HTML tags.

It seems simpler to use the second strategy because we can recognize this DOM structural pattern:   and reorganize it using extremely simple rules. The first strategy (token stream transformation), in the general case, will effectively require a deep stack to push the dom-subtrees which is wasteful from a performance standpoint, since we are half-building the DOM only to reorder tokens and then discarding it altogether. Unless there is information in the token stream that lets us extract this information without a stack, the second strategy is desirable. (Original Gabriel text: Minimization involves opening the inline element with the longest span first, so requires look-ahead. There is code in mediawiki.DOMConverter.js that extracts run lengths of inline elements that could be used as a starting point for a DOM minimization pass.)

Note that this pass needs to be run on both the DOM produced by the parser, and the DOM returned by the editor for serialization back to wikitext. The DOM returned by the editor needs to be minimized in case it introduced excess tags, the user added explicit HTML tags, used wikitext of the form above, etc. This serialization will be run only on the modified parts of the DOM that the editor returns (the editor is responsible for marking modified DOM subtrees).

However, note that running this minimization pass always will introduce dirty diffs in certain scenarios (Ex: Consider the wikitext ). This kind of content seems to be relatively rare, and a simplification / minimization should be desirable in the longer run.

Also note that the problem is broader than just the I and B tags. This minimization routine will be run on a larger set of inline tags. Ex: I, B, U, span are definite candidates.

Misc

 * Some document model enforcement on HTML DOM to aid editor, should be able to run either on server or client.
 * Longer term fun project: move DOM building and transformations to webworker to provide fast Lua-extension-like or DOM/Tal/Genshi template functionality and multi-core support. See some ideas.

Wikitext serializer
Basic idea: ( HTML DOM -> ) tokens -> SAX-style serializer handlers -> wikitext
 * uses data-mw round-trip data
 * Will introduce some normalization- at the very least the tree builder has to fix up stuff when building a tree from tag soup, so the full round-trip cannot be 100% perfect for broken inputs.

Serializing only modified parts of a page

 * The round-trip info on elements contains source offsets in original wikitext
 * The editor marks modified parts of the DOM
 * The serializer splices original source of unmodified DOM parts with serialization of modified subtrees. This avoids dirty diffs from normalization in unmodified parts of the page.

Challenges for offset retrieval:
 * Balancing of tags and foster-parenting in tree builder
 * Attributes on end tags are dropped
 * No offsets on text content

Provide API for the registration of custom content serializer handlers by RDFa type
Needed to support serialization of things like custom DOM for videos linked to in the file namespace. Parser hook extensions would normally be handled generically (with source-based editing support at most), but might also want to register custom serializers when using DOM-based editing of contents. Examples for this would be the gallery or cite extensions.

Testing
See tests/parser, in particular parserTests.js.

parserTests

 * Set up a more complete testing environment including the time, predefined images and so on (see phase3/tests/parser/parserTests.inc).

Round-trip tests on dumps
There is a dumpReader in tests/parser, which can be used to run full dumps through the parser.