Parsoid/Todo

If you would like to hack the Parsoid parser, these are the tasks we currently see ahead. Some of them are marked as especially well suited for newbies.

Please report issues in Talk:Parsoid/Todo or add problematic wikitext snippets in Parsoid/Bug_test_cases. See also the list of open issues on Bugzilla.

If you have questions, try to ping gwicke or subbu on #mediawiki or send a mail to the wikitext-l mailing list.

Post-release tasks

 * Tasks with priority 'normal' in the bug list
 * round-trip thumbs and images (bug 37920), category links (bug 37908), behavior switches (bug 37909) and unexpanded templates (bug 37910)
 * add oldid support to the web service to support retrieval of old revisions (bug 37914)
 * add a public link target attribute with the same content as sHref (bug 37916)
 * Detect and handle changes to generated link content ( Foo -> Bar ) in the serializer (see the sketch after this list). We will need to collect the content instead of dropping it outright, and delay the serialization of the start tag until the end tag is reached. Then compare the collected content with sHref, and recursively serialize the explicit content if it was changed from sHref.
 * set up a test wiki with current VE in parsoid.wmflabs.org VM, and test saving and round-tripping
 * strip the first paragraph tag in a list item or table cell context if there are no attributes on it and stx:html is not set (bug 37913)
 * Round-trip nowiki using a span instead of a meta, since the content is always text and the span should make it easier to handle in the VE. Alternatively, strip all nowiki from the output and rely on the wikitext escaper to re-add it where needed. The latter would likely introduce some diffs by slightly changing the span of nowiki sections in cases where they are mixed with plain text.
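
A minimal sketch of the delayed start-tag approach for the link-content item above. All names here (LinkSerializer, state.pendingLink, the handler signatures) are illustrative assumptions, not the actual Parsoid API; only dataAttribs.sHref comes from the round-trip data described on this page.

 // Sketch only: collect link content instead of dropping it, then decide
 // at the end tag whether the explicit content differs from the stored
 // sHref.
 var LinkSerializer = {
     onLinkStart: function ( state, token ) {
         // Delay serialization: remember the start token and start
         // buffering the link content.
         state.pendingLink = { token: token, content: [] };
     },
     onLinkContent: function ( state, textToken ) {
         // A real implementation would recursively serialize non-text
         // content here instead of assuming plain text tokens.
         state.pendingLink.content.push( textToken.value );
     },
     onLinkEnd: function ( state ) {
         var link = state.pendingLink,
             sHref = link.token.dataAttribs.sHref,
             content = link.content.join( '' );
         state.pendingLink = null;
         if ( content === sHref ) {
             // Unchanged: round-trip the simple form.
             return '[[' + sHref + ']]';
         }
         // Content was edited: serialize it explicitly as a piped link.
         return '[[' + sHref + '|' + content + ']]';
     }
 };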

Tokenizer
Track the minor syntactic changes introduced earlier by Tim's preprocessor rewrite:
 * Tim's Preprocessor ABNF
 * User documentation for preprocessor rewrite

Low-hanging fruit
Simple and fun tasks for somebody wishing to dive into the tokenizer.
 * Horizontal rules
 * Language variants ('-{')
 * Add tokenizer support for Template:(( et al. (see the 'see also' section in that template's documentation).

Round-trip info

 * Add more round-trip information using the dataAttribs object property on tokens. This is serialized as JSON into a data-mw attribute on DOM nodes (see the sketch after this list).
 * HTML vs. wiki syntax (e.g. the stx:html flag mentioned above)
 * Try hard to preserve variable whitespace: search for uses of the space production (or equivalent) in the grammar and capture the value into round-trip info.
 * Add source offset information to most elements to support selective re-serialization of modified DOM fragments and minimize dirty diffs (mostly done)
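
For illustration, a rough sketch of how round-trip data could travel from a token into the DOM and back. The data-mw attribute and the dataAttribs property are from the list above; the token shape and helper names are assumptions.

 // A token carries private round-trip info in dataAttribs ...
 var token = {
     name: 'a',
     attribs: [ { k: 'href', v: './Main_Page' } ],
     dataAttribs: {
         sHref: 'Main Page', // original wikitext link target
         dsr: [ 12, 27 ]     // source offset range in the wikitext
     }
 };

 // ... which the tree builder serializes as JSON into data-mw:
 function setRoundTripData( node, token ) {
     node.setAttribute( 'data-mw', JSON.stringify( token.dataAttribs ) );
 }

 // The serializer later recovers it:
 function getRoundTripData( node ) {
     var json = node.getAttribute( 'data-mw' );
     return json ? JSON.parse( json ) : {};
 }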

Extension tags
Make sure that (potential) extension end tags are always matched, even if parsing the content causes a switch to a plain-text parsing mode. An example would be an unclosed HTML comment that swallows following markup, including link syntax like [[Main Page]].

Such comments cannot generally be supported without stripping comments before parsing. Even if parsed, this type of comment could not be represented in the DOM. Before deployment, we should check whether this is common enough to warrant an automated conversion; grepping a dump works well for this check.

Mis-nested parser functions
The grammar-based tokenizer assumes some degree of sane nesting. Parser functions can return full tokens or substrings of an attribute, but not the first half of a token including half an attribute. Similar to the issues above, this limitation could be largely removed by dumbing down the tokenizer and deferring actual parsing until after template expansion, at the cost of performance and PEG tokenizer grammar completeness. Mis-nested parser functions are hard for humans to figure out too, and are better avoided / removed where possible.

Example: search for 'style="}}' in Template:Navbox. There are a total of 12 templates and no articles matching that string in the English Wikipedia.

The example expands to font-weight: bold">Some text in the PHP parser, but to &lt;span style="color:red; font-weight: bold"&gt;Some text in Parsoid. This can be fixed by modifying the template source. Hopefully such mis-nesting is rare enough to allow fixing it manually; reliable detection of this case is needed to analyze how common it actually is.

Token stream transforms
See the recipe map in mediawiki.parser.js for the current parser transforms and their phases.

Cache transforms per token type, and add a more specific 'any tag' transform category
TokenTransformManager.prototype._getTransforms in mediawiki.TokenTransformManager.js could be sped up a bit by caching the merged list of type-specific and 'any' transforms. This cache needs to be cleared or reconstructed whenever transform registrations change. Additionally, it would be useful to add a separate 'tag' '*' category that matches any tag token, but no text. The AttributeExpander would be a good use case for this. A sketch of the caching idea follows.
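
A rough sketch of the caching idea. It assumes simplified internals (a transformers map keyed by token name plus an 'any' list); the real TokenTransformManager differs in its method signatures and bookkeeping.

 // Cache the merged per-token-type transform lists, including the 'any'
 // transforms, and invalidate the cache whenever registrations change.
 TokenTransformManager.prototype._getTransforms = function ( token ) {
     var key = token.name || token.type;
     if ( !this._transformCache ) {
         this._transformCache = {};
     }
     var cached = this._transformCache[ key ];
     if ( !cached ) {
         // Merge type-specific transforms with the 'any' transforms
         // once, sorted by rank, instead of on every token.
         cached = ( this.transformers[ key ] || [] )
             .concat( this.transformers.any || [] )
             .sort( function ( a, b ) { return a.rank - b.rank; } );
         this._transformCache[ key ] = cached;
     }
     return cached;
 };

 TokenTransformManager.prototype.addTransform = function ( transform, type ) {
     ( this.transformers[ type ] = this.transformers[ type ] || [] )
         .push( transform );
     // Registrations changed: drop the cache so it is rebuilt lazily.
     this._transformCache = null;
 };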

Sanitizer
Filter attributes and convert non-whitelisted tags into text tokens. See ext.core.Sanitizer.js for an outline; this should be a relatively straightforward port from the PHP version, and a good task if you'd like to dive into the JS parser. A sketch of the basic shape follows.
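
A minimal sketch of the token-level behavior. The whitelists here are tiny illustrative subsets of what the PHP Sanitizer allows, and the token shape is assumed.

 // Keep whitelisted tags, filter their attributes, and downgrade
 // everything else to a plain text token.
 var tagWhitelist = { b: true, i: true, span: true, div: true },
     attrWhitelist = { 'class': true, id: true, style: true, title: true };

 function sanitizeToken( token ) {
     if ( token.type !== 'TAG' && token.type !== 'ENDTAG' ) {
         return token;
     }
     if ( !tagWhitelist[ token.name ] ) {
         // Non-whitelisted tag: convert back to a plain text token.
         return {
             type: 'TEXT',
             value: '<' + ( token.type === 'ENDTAG' ? '/' : '' ) +
                 token.name + '>'
         };
     }
     // Whitelisted tag: drop any attribute not on the whitelist.
     token.attribs = ( token.attribs || [] ).filter( function ( attr ) {
         return attrWhitelist[ attr.k.toLowerCase() ];
     } );
     return token;
 }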

Internal links, categories and images
The tokenizer is still independent of configuration data, so it does not pay attention to a wiki link's namespace. This means that image parameters are not parsed differently from normal link parameters, leaving specialized treatment to the LinkHandler token stream transformer. For images, arguments need to be separated from the caption (see the sketch below). Full rendering requires information about the image dimensions, which needs to be retrieved from the wiki using either the generic fall-back described in Parsoid/Interfacing with MW API, or a specialized image-specific API method. For action=parse, templates and template arguments in image options need to be fully expanded using the AttributeExpander before converting options back to wikitext; fortunately, the (mostly) plain-text nature of the options makes this quite easy. External link tokens produced by a link= option need to map back to the plain URL.
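
Illustrative only: a sketch of splitting image link parameters into known options and the caption, roughly what the LinkHandler has to do after the AttributeExpander has fully expanded templates in the options. The option list is a small subset, and the function name is an assumption.

 var imageOptions = { thumb: true, frame: true, left: true, right: true,
     center: true, none: true };

 function parseImageParams( params ) {
     var result = { options: [], caption: '' };
     params.forEach( function ( param ) {
         var trimmed = param.trim();
         if ( imageOptions[ trimmed ] ||
                 /^\d+px$/.test( trimmed ) ||
                 /^link=/.test( trimmed ) ) {
             result.options.push( trimmed );
         } else {
             // Anything unrecognized is (part of) the caption; the
             // real rule is 'last unrecognized parameter wins'.
             result.caption = param;
         }
     } );
     return result;
 }

 // parseImageParams( [ 'thumb', '200px', 'A caption' ] )
 // -> { options: [ 'thumb', '200px' ], caption: 'A caption' }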

The LinkHandler transformer also needs to handle the Help:Pipe_trick while rendering. Also, it should learn about mw:Manual:$wgCapitalLinks.

Parser functions and magic words
Some implementation and lots of stubs (FIXME, quite straightforward!) in ext.core.ParserFunctions.js. Many magic words in particular depend on information from the wiki. The idea for now is to fall back to the action=parse API for extensions and other unsupported constructs: basically, build a page of unsupported elements in document order, with each element prefixed/suffixed with unique (non-wikisyntax) delimiters, then extract the results from between the delimiters (see the sketch below). See Parsoid/Interfacing with MW API and Wikitext_parser/Environment.
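
A sketch of the delimiter idea: batch unsupported constructs into one action=parse request and split the rendered result back apart. The delimiter format and the fetchParsed callback are assumptions, not an existing API.

 function expandUnsupported( constructs, fetchParsed, callback ) {
     var uid = 'mw-parsoid-' + Date.now(),
         delim = function ( i ) { return '\x7f' + uid + '-' + i + '\x7f'; },
         // Build one page with each construct wrapped in unique,
         // non-wikisyntax delimiters, in document order.
         page = constructs.map( function ( wt, i ) {
             return delim( i ) + wt + delim( i );
         } ).join( '\n' );

     fetchParsed( page, function ( html ) {
         // Extract each result from between its delimiters.
         var results = constructs.map( function ( wt, i ) {
             var d = delim( i ),
                 start = html.indexOf( d ) + d.length,
                 end = html.indexOf( d, start );
             return html.slice( start, end );
         } );
         callback( results );
     } );
 }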

Round-trip expanded templates with Microdata RDFa
Track provenance on all tokens, preserve this information while building the tree, and convert it into RDFa markup. This also needs to handle foster parenting. See bug 37911.

Miscellaneous

 * Handle dynamically generated nowiki sections. Template arguments are already tokenized and expanded before substitution, so we'd need to revert this. Idea: re-serialize tokens to the original text using source position annotations and other round-trip information. Icky, but doable. Try to structure the HTML DOM to wikitext serializer around SAX-like start/end handlers, so that the same handlers can serialize the token stream back to wikitext.


 * Refactor the link handler to allow subclassing for the modification of 1) the mapping of namespace to handler method, 2) the mapping of file content types to handler method and 3) individual handler methods. This can then be used by Wikia to add custom handling for videos or other content. See the section on registering custom serializer handlers by RDFa type below for the complementary serializer extension API.

DOM tree builder
Generally we would like to avoid any changes to the default HTML5 tree builder algorithm, since sticking to the standard algorithm allows us to use the built-in HTML parser in modern browsers, or unchanged libraries. There are, however, some tasks that seem hard to solve otherwise and that require only small changes to the tree builder. These tasks all have to do with unbalanced token soup, which should be confined to the server.

 * Self-closing tags like meta, and text nodes, are never stripped, but might be subject to foster-parenting. The relative order between text and self-closing tags is stable, and attributes on self-closing tags are preserved. This is easy to verify by assigning such markup to document.body.innerHTML in a browser and logging the result.
 * Spurious end-tags are ignored by the tree builder, while (some) are displayed as text in current MediaWiki. Text display is helpful for authors. The necessary change to the html tree builder to replicate this would be small, but is not possible if a browser's built-in parser is used. The visual editor hopefully reduces the need for this kind of debugging aid in the medium term.
 * Propagate attribute information for end tag tokens (especially source information for tokens originating from templates) to matching start tag, to make sure that the full scope of template-affected subtrees is captured. This is hard to do without a modification to the tree builder. Only needs to be performed on the server side. Relatively simple modification.
 * Idea for a possible solution:

 Before tree building, for each template:
   generate uid for template transclusion (simple counter)
   if first token is an element:
     add tplstart=uid attribute to the element
   else:
     add meta element with tplstart=uid attribute
   add tplcontent=uid to each table start tag in the content of the template
   add meta element with tplend=uid attribute after the end of the template;
     if the last token is text, also insert a meta with tplcontent=uid
     before the trailing text tokens

 On the DOM:
   build a list of nodes with tplstart / tplcontent / tplend attribute set
   depth-first traverse the DOM
   when encountering an element with tplstart set:
     meta element:
       if tplend for this uid was already found:
         the template ended somewhere in the next sibling table element
         and was foster-parented
       else:
         following content until the corresponding tplend was produced
         by the template
     non-meta element:
       the element and possibly following siblings were produced by a
       template: look for tplend
   when encountering a (meta) element with tplcontent or tplend set:
     if there is a following sibling table node:
       if that has the corresponding tplstart or tplcontent set:
         the DOM up to that table was template-affected
         (foster-parented meta)
     else if tplstart was already found:
       match all sibling DOM trees between the tplstart and tplend nodes
       (walk up the tree until a common parent is reached)
     else:
       there must be a sibling table node with tplstart set;
       match all content up to it

Matching sibling DOM nodes:
 * wrap text nodes in span with attributes
 * set meta info on first element

Inline element nesting minimization
Consider the wikitext example '''''X''' Y'' (bold-italic X followed by italic Y). There are two distinct DOMs that we can parse this into:


 * Non-minimal DOM: <b><i>X</i></b><i> Y</i>. Serializing this DOM requires extra markup to keep the two quote runs apart.
 * Minimal DOM: <i><b>X</b> Y</i>. Serializing this DOM yields the original wikitext.

If the 5 leading apostrophes are parsed as <b> followed by <i> we get the first result. But if they are parsed as <i> followed by <b> we get the second form. So, given the above wikitext, we have two possible DOMs. Clearly, the minimal DOM is the desirable one in this example. While it might seem that we could simply pick the right parse order, there is no context-free way of determining it. To illustrate this, consider a different wikitext example: '''''X'' Y''' (bold-italic X followed by bold Y). The minimal parse order here requires us to parse the 5 leading apostrophes as <b> followed by <i>, which is the flip of the first example. Since there can be arbitrary wikitext between the first 5 apostrophes and the closing apostrophes, we would need to look ahead as far as necessary to match up apostrophes appropriately.

So, we have to use a deterministic parsing order (always parse 5 apostrophes as <b> followed by <i>, or always the other way around; it shouldn't matter for this problem). We then have two possible strategies for generating a minimal DOM:
 * Transform the token stream to reorder tokens appropriately.
 * Generate a DOM and process the DOM to generate a minimal DOM for I and B HTML tags.

It seems simpler to use the second strategy, because we can recognize the DOM structural pattern <b><i>X</i></b><i>Y</i> and reorganize it using extremely simple rules (see the sketch below). The first strategy (token stream transformation) will, in the general case, effectively require a deep stack to push the DOM subtrees, which is wasteful from a performance standpoint, since we would be half-building the DOM only to reorder tokens and then discard it. Unless there is information in the token stream that lets us extract this information without a stack, the second strategy is preferable. (Original Gabriel text: Minimization involves opening the inline element with the longest span first, so requires look-ahead. There is code in mediawiki.DOMConverter.js that extracts run lengths of inline elements that could be used as a starting point for a DOM minimization pass.)
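
A sketch of the DOM-based strategy under simplifying assumptions: it only handles the exact <b><i>X</i></b><i>Y</i> shape (no attributes, no longer runs) and rewrites it to <i><b>X</b>Y</i>. The function name is illustrative.

 function minimizePair( outer ) {
     var inner = outer.firstChild,
         next = outer.nextSibling;
     if ( !inner || inner.nodeType !== 1 || inner !== outer.lastChild ||
             !next || next.nodeName !== inner.nodeName ) {
         return; // pattern does not match
     }
     var doc = outer.ownerDocument,
         merged = doc.createElement( inner.nodeName ),  // new outer <i>
         hoisted = doc.createElement( outer.nodeName ); // new inner <b>
     while ( inner.firstChild ) {
         hoisted.appendChild( inner.firstChild );       // move X
     }
     merged.appendChild( hoisted );
     while ( next.firstChild ) {
         merged.appendChild( next.firstChild );         // move Y
     }
     outer.parentNode.replaceChild( merged, outer );
     next.parentNode.removeChild( next );
 }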

Note that this pass needs to be run on both the DOM produced by the parser, and the DOM returned by the editor for serialization back to wikitext. The DOM returned by the editor needs to be minimized in case it introduced excess tags, the user added explicit HTML tags, used wikitext of the form above, etc. This serialization will be run only on the modified parts of the DOM that the editor returns (the editor is responsible for marking modified DOM subtrees).

However, note that always running this minimization pass will introduce dirty diffs in certain scenarios (for example, wikitext that deliberately spells out the non-minimal nesting). This kind of content seems to be relatively rare, and a simplification / minimization should be desirable in the longer run.

Also note that the problem is broader than just the I and B tags. This minimization routine will be run on a larger set of inline tags; I, B, U and span are definite candidates.

Misc

 * Some document model enforcement on the HTML DOM to aid the editor; this should be able to run either on the server or the client.
 * Longer-term fun project: move DOM building and transformations to a web worker to provide fast Lua-extension-like or DOM/Tal/Genshi template functionality and multi-core support. See some ideas.

Wikitext serializer
Basic idea: ( HTML DOM -> ) tokens -> SAX-style serializer handlers -> wikitext
 * uses data-mw round-trip data
 * Will introduce some normalization; at the very least the tree builder has to fix up stuff when building a tree from tag soup, so the full round-trip cannot be 100% perfect for broken input. A sketch of the handler layout follows this list.
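
A sketch of the SAX-style handler layout, with a tiny illustrative handler table; the real serializer needs per-handler access to the data-mw round-trip info and much more context.

 var handlers = {
     '#text': { start: function ( node ) { return node.nodeValue; } },
     b: { start: function () { return "'''"; },
          end: function () { return "'''"; } },
     i: { start: function () { return "''"; },
          end: function () { return "''"; } },
     h2: { start: function () { return '\n== '; },
           end: function () { return ' ==\n'; } }
 };

 function serializeDOM( node, out ) {
     var h = handlers[ node.nodeName.toLowerCase() ] || {};
     if ( h.start ) { out.push( h.start( node ) ); }
     for ( var c = node.firstChild; c; c = c.nextSibling ) {
         serializeDOM( c, out );
     }
     if ( h.end ) { out.push( h.end( node ) ); }
 }
 // The same start/end handlers could be driven by start/end tag tokens
 // instead of a DOM traversal, as suggested above.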

Serializing only modified parts of a page

 * The round-trip info on elements contains source offsets in original wikitext
 * The editor marks modified parts of the DOM
 * The serializer splices the original source of unmodified DOM parts with the serialization of modified subtrees (see the sketch below). This avoids dirty diffs from normalization in unmodified parts of the page.
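
A sketch of the splicing step. It assumes dsr = [start, end] source offsets stored in the data-mw round-trip info, and a hypothetical data-modified marker set by the editor.

 function serializeSelective( node, srcWikitext, serializeSubtree ) {
     var out = [];
     for ( var c = node.firstChild; c; c = c.nextSibling ) {
         var data = c.nodeType === 1 ?
                 JSON.parse( c.getAttribute( 'data-mw' ) || '{}' ) : {};
         if ( c.nodeType === 1 && !c.hasAttribute( 'data-modified' ) &&
                 data.dsr ) {
             // Unmodified subtree: splice in the original source verbatim.
             out.push( srcWikitext.slice( data.dsr[ 0 ], data.dsr[ 1 ] ) );
         } else {
             // Modified (or offset-less) subtree: serialize from scratch.
             out.push( serializeSubtree( c ) );
         }
     }
     return out.join( '' );
 }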

Challenges for offset retrieval:
 * Balancing of tags and foster-parenting in tree builder
 * Attributes on end tags are dropped
 * No offsets on text content

Provide API for the registration of custom content serializer handlers by RDFa type
Needed to support serialization of things like custom DOM for videos linked to in the file namespace. Parser hook extensions would normally be handled generically (with source-based editing support at most), but might also want to register custom serializers when using DOM-based editing of their contents. Examples would be the gallery or cite extensions. One possible shape for such an API is sketched below.
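
A possible shape for the registration API; all names, including the mw:Extension/video type, are illustrative assumptions.

 var serializerHandlers = {};

 function registerSerializer( rdfaType, handler ) {
     serializerHandlers[ rdfaType ] = handler;
 }

 function serializeNode( node, fallback ) {
     var type = node.getAttribute( 'typeof' ),
         handler = type && serializerHandlers[ type ];
     // Use the registered handler for known RDFa types, else fall back
     // to the generic serializer.
     return handler ? handler( node ) : fallback( node );
 }

 // A video extension (for example Wikia's) could then register:
 registerSerializer( 'mw:Extension/video', function ( node ) {
     return '<video>' + node.getAttribute( 'data-src' ) + '</video>';
 } );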

Testing
See tests/parser, in particular parserTests.js.

parserTests

 * Set up a more complete testing environment including the time, predefined images and so on (see phase3/tests/parser/parserTests.inc).

Round-trip tests on dumps
There is a dumpReader in tests/parser, which can be used to run full dumps through the parser.

Pre-save transform functionality
Much of the current parser's pre-save transform functionality can be moved to the editor, but there are also cases where this won't be possible. An example of the latter is the self-preserving en:Template:Selfsubst template.