Parsoid/Todo

If you would like to hack on the Parsoid parser, these are the tasks we currently see ahead. Some of them are marked as especially well suited for newbies.

'''Please report issues in the Parsoid product in Bugzilla. You can also add problematic wikitext snippets to Parsoid/Bug_test_cases.''' See also the list of open issues on Bugzilla.

If you have questions, try to ping gwicke or subbu on #mediawiki, or send a mail to the wikitext-l mailing list.

Q1/2 2013 planning
-> Worked into our roadmap.

Next tasks

 * Tasks with priority 'normal' in the bug list
 * Talk:Parsoid/Todo
 * set up a test wiki with the current VE in the parsoid.wmflabs.org VM, and test saving and round-tripping
 * duplicate  and a few other bugs in  and
 * Work on round-trip test pages: Parsoid/Roundtrip testpages

Tokenizer
Earlier minor syntactical changes in Tim's preprocessor rewrite:
 * Tim's Preprocessor ABNF
 * User documentation for preprocessor rewrite

Low-hanging fruit
Simple and fun tasks for somebody wishing to dive into the tokenizer.
 * Horizontal rules
 * Language variants ('-{')
 * Add tokenizer support for Template:(( et al. (see the 'see also' section in that template's documentation).

Round-trip info

 * Add more round-trip information using the dataAttribs object property on tokens. This is serialized as JSON into a data-mw attribute on DOM nodes.
 * HTML vs wiki syntax: record which of the two was used for an element, so that it can be serialized back to the same syntax.
 * Try hard to preserve variable whitespace: Search for uses of the space production (or equivalent) in the grammar and capture the value into round-trip info.
 * Add source offset information to most elements to support selective re-serialization of modified DOM fragments to minimize dirty diffs (mostly done)
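
As a rough illustration of the kind of round-trip data described in the items above, here is a minimal, hypothetical sketch; dataAttribs, data-mw and dsr follow the terminology used on this page, everything else is made up for illustration:

 // Sketch: attach round-trip info to a token via dataAttribs, and serialize
 // it as JSON into a data-mw attribute when the corresponding DOM node is
 // built. The helper names and extra fields are illustrative only.
 function newTagToken(name, srcStart, srcEnd, spaceBefore) {
     return {
         name: name,
         attribs: [],
         dataAttribs: {
             dsr: [srcStart, srcEnd], // source offsets into the original wikitext
             spc: spaceBefore         // variable whitespace captured for round-tripping
         }
     };
 }

 function setRoundTripAttr(domNode, token) {
     domNode.setAttribute('data-mw', JSON.stringify(token.dataAttribs));
 }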

Extension tags
Now tracked in Parsoid/Todo:Extension_tag_precedence.

Parallel or cooperatively concurrent tokenization
The current tokenizer emits chunks of tokens for each parsed top-level block, but still parses the document in one go. With cooperative multitasking, this means that template expansions etc. are queued up until the full document is parsed.

The tokenizer should either run in a separate OS thread, or cooperatively yield a chunk and schedule the tokenization of the remaining text after each top-level block.
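
A minimal sketch of the cooperative option, assuming a hypothetical tokenizeBlock() method that tokenizes a single top-level block and reports where the next block starts:

 // Sketch: emit the tokens for one top-level block, then yield to the event
 // loop (setImmediate) before tokenizing the rest, so that queued work such
 // as template expansions can proceed in between.
 function tokenizeIncrementally(tokenizer, text, emitChunk, done) {
     var offset = 0;
     function step() {
         if (offset >= text.length) {
             return done();
         }
         var res = tokenizer.tokenizeBlock(text, offset); // hypothetical API
         emitChunk(res.tokens);
         offset = res.nextOffset;
         setImmediate(step); // yield instead of parsing the document in one go
     }
     step();
 }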

Configuration-dependent syntax
It would be nice to keep the tokenizer independent of local configurations. This appears to be difficult at least for the URL protocols recognized in links. Most other configuration-dependent things, including extensions, can however be handled in token-stream transforms.
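
One possible shape for the link case, sketched with hypothetical token fields: the tokenizer over-matches anything that looks like an external link, and a token-stream transform checks the protocol against the per-wiki configuration:

 // Sketch: downgrade external-link tokens whose protocol is not in the
 // wiki's configured protocol list back to plain text. All names here are
 // assumptions for illustration.
 function makeProtocolFilter(wikiConfig) {
     var known = {};
     wikiConfig.urlProtocols.forEach(function (p) { known[p] = true; });

     return function (token) {
         if (token.type === 'extlink' && !known[token.protocol]) {
             return { type: 'text', value: token.src }; // not a link on this wiki
         }
         return token;
     };
 }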

Issues

 * Start-of-line position vs. templates and parser functions: . See also User:GWicke/Tables and User:GWicke/Lists
 * style if-function is used to set table headers in some templates (See Jcttop/core source)
 * attributes conditionally set from parser functions:
 * SOL td-wikitext emitted by templates. See example below.
 * td and style constructed separately (see commit note for git SHA 12b561a). But, test case below:

Limitations
Main article: Parsoid/limitations.

Possible approaches to dealing with SOL text emitted by parser functions

 * Recently, we restructured the pre-handling to move it away from the tokenizer: we now detect leading whitespace after newlines and use that as the basis for inserting pre tokens, which handles SOL-position whitespace output by templates that cannot be detected in the tokenizer. We can move to a similar technique for table, list, or other wikitext chars/tokens that can show up in SOL position. So, code a generic SOL-text handler that looks for SOL-position tokens and inserts the relevant tokens into the token stream.
 * Alternatively, the only scenario in which the tokenizer cannot detect SOL-position text is argument text passed into templates. So, the generic SOL-text handler can insert marker tokens before template argument text and look for these markers to reparse the text that follows. This technique could also be used to deal (in a limited manner) with navbox-like templates that construct HTML tags from string pieces: the tokenizer can recognize tag fragments (e.g. <td) and insert marker tags into the token stream to force a re-parse of the text that follows (up to an end marker) once all templates are expanded. We could then also generate warnings in the parser logs to mark the template as a candidate for rewriting to eliminate tag construction from text fragments.

In either case, we need to collect use-cases for how frequently this kind of non-structural wikitext shows up in templates and wiki pages. If there are very few cases, then it might be better to fix up the corresponding templates and uses rather than hack up the parser to deal with non-structural wikitext. This kind of cleanup can lead us towards moving templates to generate structured HTML (rather than unstructured text).
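
A minimal, hypothetical sketch of the generic SOL-text handler from the first approach above (token shapes and handler names are made up for illustration):

 // Sketch: after template expansion, look at text tokens sitting in
 // start-of-line position and turn leading wikitext characters into the
 // corresponding structural tokens instead of leaving them as plain text.
 var SOL_HANDLERS = {
     '*': function () { return { type: 'listItem', bullets: '*' }; },
     '#': function () { return { type: 'listItem', bullets: '#' }; },
     ' ': function () { return { type: 'pre' }; }
     // ... table ('|', '!', '{|') and other SOL constructs would be added here
 };

 function handleSolText(tokens) {
     var out = [];
     var atSol = true;
     tokens.forEach(function (t) {
         if (atSol && t.type === 'text' && SOL_HANDLERS[t.value.charAt(0)]) {
             out.push(SOL_HANDLERS[t.value.charAt(0)]());
             out.push({ type: 'text', value: t.value.slice(1) });
         } else {
             out.push(t);
         }
         atSol = (t.type === 'newline');
     });
     return out;
 }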

Token stream transforms
See the recipe map in mediawiki.parser.js for the current parser transforms and their phases.

Internal links, categories and images
The tokenizer is still independent of configuration data, so it does not pay attention to a wiki link's namespace. This means that image parameters are not parsed differently from normal link parameters, leaving the specialized treatment to the LinkHandler token stream transformer. For images, the arguments need to be separated from the caption. Full rendering requires information about the image dimensions, which needs to be retrieved from the wiki using either the generic fall-back described in Parsoid/Interfacing with MW API or a specialized image-specific API method. For action=parse, templates and template arguments in image options need to be fully expanded using the AttributeExpander before converting the options back to wikitext. Fortunately, the (mostly) plain-text nature of the options makes this quite easy. External link tokens produced by a link= option need to map back to the plain URL.


 * Support $wgCapitalLinks in the LinkHandler
 * Support interwiki / language links
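
A rough sketch of the option/caption separation; the regular expression, token fields, and the env.isFileNamespace helper are all assumptions for illustration:

 // Sketch: the tokenizer emits a generic wikilink token; a LinkHandler-style
 // transform checks the namespace and separates recognized image options
 // from the caption (the last parameter that is not a recognized option).
 var IMAGE_OPTIONS = /^(thumb|frame|frameless|border|left|right|center|none|\d+px|upright(=[\d.]+)?|link=.*|alt=.*)$/;

 function handleWikiLink(token, env) {
     var target = token.target; // e.g. "File:Foo.jpg"
     if (!env.isFileNamespace(target.split(':')[0])) {
         return token; // ordinary wikilink, leave it alone
     }
     var options = [];
     var caption = '';
     token.params.forEach(function (p) {
         if (IMAGE_OPTIONS.test(p.trim())) {
             options.push(p.trim());
         } else {
             caption = p; // last non-option parameter wins
         }
     });
     // Image dimensions still need to be fetched from the wiki (see
     // Parsoid/Interfacing with MW API) before a figure can be rendered.
     return { type: 'image', name: target, options: options, caption: caption };
 }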

Parser functions and magic words
Some implementations and lots of stubs (FIXME, quite straightforward!) live in ext.core.ParserFunctions.js. Many magic words in particular depend on information from the wiki. The idea for now is to fall back to the action=parse API for extensions and other unsupported constructs: basically, build a page of unsupported elements in document order, with each element prefixed/suffixed with unique (non-wiki-syntax) delimiters, then extract the results between the delimiters. See Parsoid/Interfacing with MW API and Wikitext_parser/Environment.
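
For example, a straightforward stub might look roughly like this; the real implementations live in ext.core.ParserFunctions.js, and the exact calling convention here is an assumption:

 // Sketch: each parser function receives its already-expanded arguments and
 // returns replacement text. Magic words that need wiki information would
 // instead go through the action=parse fallback described above.
 var parserFunctions = {
     // {{lc:TEXT}} -> lower-cased text
     'lc': function (args) {
         return (args[0] || '').toLowerCase();
     },
     // {{#if: test | then | else }} -> a non-empty test selects the 'then' arm
     '#if': function (args) {
         var test = (args[0] || '').trim();
         return test !== '' ? (args[1] || '') : (args[2] || '');
     }
 };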

Dependency Graph
Express the dependency graph (a DAG) between the transformations more directly than the relatively implicit rank mechanism does. The difficult part will be managing dynamic changes to the graph efficiently and providing a convenient notation for dependencies that avoids having to specify all dependencies explicitly for each transform.
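
A minimal sketch of the idea, assuming each transform simply declares which transforms it must run after, and a topological sort derives the execution order:

 // Sketch: replace implicit numeric ranks with declared dependencies.
 // transforms: [{ name: 'templates', after: [] },
 //              { name: 'links',     after: ['templates'] }, ...]
 function orderTransforms(transforms) {
     var byName = {}, visited = {}, order = [];
     transforms.forEach(function (t) { byName[t.name] = t; });

     function visit(name, stack) {
         if (visited[name]) { return; }
         if (stack[name]) { throw new Error('dependency cycle at ' + name); }
         stack[name] = true;
         (byName[name].after || []).forEach(function (dep) { visit(dep, stack); });
         visited[name] = true;
         order.push(byName[name]);
     }
     transforms.forEach(function (t) { visit(t.name, {}); });
     return order;
 }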

Miscellaneous

 * Handle dynamically generated nowiki sections: . Template arguments are already tokenized and expanded before substitution, so we'd need to revert this. Idea: re-serialize tokens to the original text using source position annotations and other round-trip information. Icky, but doable. Try to structure the HTML-DOM-to-wikitext serializer around SAX-like start/end handlers, so that the same handlers can serialize the token stream back to wikitext.


 * Refactor the link handler to allow subclassing for the modification of 1) the mapping of namespace to handler method, 2) the mapping of file content types to handler method and 3) individual handler methods. This can then be used by Wikia to add custom handling for videos or other content. See for the complementary serializer extension API.

DOM tree builder
Content moved to Parsoid/Todo:Template round-tripping.

Inline element nesting minimization
Consider this wikitext example:. There are two distinct DOMs that we can parse it into:


 * Non-minimal DOM: .  Serialized wikitext of this DOM =
 * Minimal DOM: .  Serialized wikitext of this DOM =

If the 5 leading apostrophes are parsed as one of the two tags nested inside the other, we get the first result; if they are parsed in the opposite nesting order, we get the second form. So, given the above wikitext, we have two possible DOMs. Clearly, the minimal DOM is the desirable one in this example. While it might seem that we could simply pick the right parse order, there is no context-free way of determining it. To illustrate this, consider a different wikitext example:. The minimal parse order here requires us to parse the 5 leading apostrophes in the opposite nesting order from the first example. Since we can have arbitrary wikitext between the first 5 apostrophes and the closing apostrophes, we will need to look ahead as far as necessary to match up apostrophes appropriately.

So, we have to use a deterministic parsing order (always parse 5 apostrophes with the same fixed nesting of the two tags -- which one shouldn't matter for this problem). We then have two possible strategies for generating a minimal DOM:
 * Transform the token stream to reorder tokens appropriately.
 * Generate a DOM and process the DOM to generate a minimal DOM for I and B HTML tags.

It seems simpler to use the second strategy, because we can recognize the relevant DOM structural pattern and reorganize it using extremely simple rules. The first strategy (token stream transformation) would, in the general case, effectively require a deep stack to push DOM subtrees, which is wasteful from a performance standpoint, since we would be half-building the DOM only to reorder tokens and then discard it altogether. Unless there is information in the token stream that lets us extract this without a stack, the second strategy is preferable. (Original Gabriel text: Minimization involves opening the inline element with the longest span first, so requires look-ahead. There is code in mediawiki.DOMConverter.js that extracts run lengths of inline elements that could be used as a starting point for a DOM minimization pass.)
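
A hedged sketch of one such rule, using standard DOM APIs (the set of minimizable tags and the traversal that would drive this are assumptions): it rewrites the pattern <X><Y>…</Y></X><Y>…</Y> into <Y><X>…</X>…</Y>, saving one element.

 // Sketch: if an inline element's only child is another minimizable inline
 // element, and the next sibling has that same inner tag name, merge the
 // pair under a single outer element. Attributes are ignored in this sketch.
 var MINIMIZABLE = { B: true, I: true, U: true, SPAN: true };

 function minimizePair(x) {
     var y = x.firstChild;
     var sib = x.nextSibling;
     if (!MINIMIZABLE[x.nodeName] || !y || y !== x.lastChild ||
         !MINIMIZABLE[y.nodeName] || !sib || sib.nodeName !== y.nodeName) {
         return false;
     }
     var doc = x.ownerDocument;
     var outer = doc.createElement(y.nodeName); // new outer <Y>
     var inner = doc.createElement(x.nodeName); // new inner <X>
     while (y.firstChild) { inner.appendChild(y.firstChild); }
     outer.appendChild(inner);
     while (sib.firstChild) { outer.appendChild(sib.firstChild); }
     x.parentNode.insertBefore(outer, x);
     x.parentNode.removeChild(x);
     sib.parentNode.removeChild(sib);
     return true;
 }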

Note that this pass needs to be run on both the DOM produced by the parser, and the DOM returned by the editor for serialization back to wikitext. The DOM returned by the editor needs to be minimized in case it introduced excess tags, the user added explicit HTML tags, used wikitext of the form above, etc. This serialization will be run only on the modified parts of the DOM that the editor returns (the editor is responsible for marking modified DOM subtrees).

However, note that always running this minimization pass will introduce dirty diffs in certain scenarios (e.g. consider the wikitext ). This kind of content seems to be relatively rare, and a simplification / minimization should be desirable in the longer run.

Also note that the problem is broader than just the I and B tags: this minimization routine will be run on a larger set of inline tags. I, B, U, and SPAN are definite candidates.

Misc

 * Some document-model enforcement on the HTML DOM to aid the editor; this should be able to run either on the server or the client.
 * Longer-term fun project: move DOM building and transformations to a web worker to provide fast Lua-extension-like or DOM/TAL/Genshi template functionality and multi-core support. See some ideas.

Wikitext serializer
Basic idea: ( HTML DOM -> ) tokens -> SAX-style serializer handlers -> wikitext
 * uses data-mw round-trip data
 * Will introduce some normalization: at the very least, the tree builder has to fix up things when building a tree from tag soup, so the full round trip cannot be 100% perfect for broken input.
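
A sketch of what the SAX-style handler table could look like; the tag set and emitted wikitext are illustrative, and real handlers would consult the data-mw round-trip data mentioned above:

 // Sketch: per-tag start/end handlers that emit wikitext while walking the
 // DOM (or, equivalently, while consuming a token stream).
 var handlers = {
     b:  { start: function () { return "'''"; }, end: function () { return "'''"; } },
     i:  { start: function () { return "''"; },  end: function () { return "''"; } },
     h2: { start: function () { return '== '; }, end: function () { return ' ==\n'; } }
     // ... one entry per serializable element
 };

 function serialize(node, out) {
     if (node.nodeType === 3) { // text node
         out.push(node.nodeValue);
         return;
     }
     var h = handlers[node.nodeName.toLowerCase()];
     if (h) { out.push(h.start(node)); }
     for (var c = node.firstChild; c; c = c.nextSibling) {
         serialize(c, out);
     }
     if (h) { out.push(h.end(node)); }
 }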

Serializing only modified parts of a page

 * The round-trip info on elements contains source offsets in original wikitext
 * The editor marks modified parts of the DOM
 * The serializer splices original source of unmodified DOM parts with serialization of modified subtrees. This avoids dirty diffs from normalization in unmodified parts of the page.

Challenges for offset retrieval:
 * Balancing of tags and foster-parenting in tree builder
 * Attributes on end tags are dropped
 * No offsets on text content
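
Setting those challenges aside, the splicing step itself is simple. A sketch, in which the editor's modification marker (data-ve-changed) and the exact attribute layout are assumptions:

 // Sketch: copy unmodified subtrees verbatim from the original wikitext via
 // their dsr source offsets, and re-serialize only modified subtrees.
 function selectiveSerialize(node, origSrc, serializeSubtree, out) {
     for (var c = node.firstChild; c; c = c.nextSibling) {
         if (c.nodeType !== 1) {
             out.push(c.nodeValue || ''); // text content carries no offsets
             continue;
         }
         var data = c.getAttribute('data-mw');
         var dsr = data && JSON.parse(data).dsr;
         if (dsr && !c.hasAttribute('data-ve-changed')) {
             out.push(origSrc.substring(dsr[0], dsr[1])); // no dirty diff here
         } else {
             out.push(serializeSubtree(c));
         }
     }
     return out.join('');
 }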

Provide API for the registration of custom content serializer handlers by RDFa type
Needed to support serialization of things like custom DOM for videos linked to in the file namespace. Parser hook extensions would normally be handled generically (with source-based editing support at most), but might also want to register custom serializers when using DOM-based editing of contents. Examples for this would be the gallery or cite extensions.
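
The registration API could be as simple as a map keyed by RDFa type (the typeof attribute on the node); the type name and handler below are purely illustrative:

 // Sketch: extensions register a serializer for their RDFa type; the
 // serializer falls back to generic handling for everything else.
 var serializerRegistry = {};

 function registerSerializer(rdfaType, handler) {
     serializerRegistry[rdfaType] = handler;
 }

 function serializeTyped(node, fallback) {
     var type = node.getAttribute('typeof');
     var handler = type && serializerRegistry[type];
     return handler ? handler(node) : fallback(node);
 }

 // Hypothetical example: a video extension registers its own handler.
 registerSerializer('mw:Extension/video', function (node) {
     return '<video>' + node.getAttribute('data-src') + '</video>';
 });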

Testing
See tests/parser, in particular parserTests.js.

parserTests
Todo:


 * Fix Jenkins integration so that it can be re-enabled.
 * Move random selser changes to use --randomchanges and make the default run use the included changes.

Later:
 * Set up a more complete testing environment including the time, predefined images and so on (see phase3/tests/parser/parserTests.inc).
 * Write tests for the following commits:
 * git SHA 683a485
 * git SHA e72e46f
 * git SHA 4b2e27a
 * git SHA f67cb40
 * https://gerrit.wikimedia.org/r/28691 (git SHA 46c24c2)
 * https://gerrit.wikimedia.org/r/27851 (git SHA fa52c48)
 * Fix image tests to be insensitive to the order of attributes, and to the use of figure and figcaption tags instead of a and img tags.
 * Add/update the testing setup for DSR computation, to spec out DSR expectations on different kinds of DOMs/wikitext.


 * Tests DONE
 * https://gerrit.wikimedia.org/r/29338 (git SHA e4785f4)
 * https://gerrit.wikimedia.org/r/29333 (git SHA e89caca)
 * https://gerrit.wikimedia.org/r/28760 (git SHA b3ba624)
 * https://gerrit.wikimedia.org/r/28707 (git SHA 87e7fab)
 * https://gerrit.wikimedia.org/r/28147 (git SHA d858818) (covered by other tests)
 * https://gerrit.wikimedia.org/r/28686 (git SHA 7dba7a6)
 * git SHA 81b0102 -- in ParserFunctions/funcsParserTests.txt
 * git SHA bde798f
 * git SHA ecb7a44
 * git SHA edd1a14
 * https://gerrit.wikimedia.org/r/30065 (SHA 058718ccb0d)
 * https://gerrit.wikimedia.org/r/30794 (git SHA 12b561a)
 * NeedsParserTests keyword in commit messages
 * 77b94472265df
 * 6dc0dff494899d2fc63

Round-trip tests on dumps
Now running on 100k randomly selected pages. Current output at http://parsoid.wmflabs.org:8001/stats. See Parsoid/Round-trip testing for documentation.

Wishlist for the stat server:

Regressions per revision
Show regressions and fixes per revision, relative to the preceding revision (by commit timestamp).

Per-revision query:

 select pages.title, s.errors, s.fails, s.skips,
   ( select stats.score from stats
     join commits on stats.page_id = pages.id
       and stats.commit_hash = commits.hash
       and stats.id != s.id
     order by commits.timestamp desc limit 1 ) as oldscore,
   s.score
 from pages
 join stats as s on s.commit_hash = '6394c9b398298906bf527c06120cc305164c2fcf'
   and s.page_id = pages.id
 where oldscore < s.score
 order by (s.score - oldscore) desc limit 10;

The runtime seems to be good enough to make it feasible to present a page with regressions / fixes per revision, for the last three revisions or so (pageable?). Ideally, for each revision, a link to the old and new result along with the change in stats and a link to the current rt result (at parsoid.wmflabs.org/_rt/) is provided.

List of results per article
Provide a list of all results / stats per article so that the development of particular issues across revisions can be followed.

Enable #items per page (topfails, topfixes, regressions)
Provide a ?items=N query parameter so we can fetch more than 40 items per page.

Test / improve error reporting
Fix / improve error reporting so that:
 * Articles / tests that exceeded their retry limit are listed as errors
 * Sync and async errors on the client are properly reported to the server and listed as an error in the results. This should be mostly implemented now, but could be much improved for async error reporting.

Provide a way to prioritize a selection of tests
Testing all 100k articles currently takes between 24 and 48 hours, so it is not feasible for each revision. We currently prioritize failing articles, but this set is not stable across revisions. A way to prioritize a smaller selection of articles in the DB would make it feasible to test all of those articles for each revision, and thus to compare statistics (average skips/fails/errors, etc.) between revisions.

Improve classification of differences
There are still quite a lot of syntactic differences that are misclassified as semantic. Improving this would make it easier to focus on real semantic differences.

A statistical classification of semantic diffs would also be useful to identify pages with similar issues, and their frequency. A standard document classifier could probably be employed for this.

Update older stats
A bug in the server caused the number of skips and fails for older tests to be recorded as one less than are actually present in the result XML. Update the stats table to properly reflect the number of skips/fails in the result XML, so that we get accurate stats for older revisions.

Compress result XML
The result XML blows up the DB quite a bit (>11G now) and contains highly compressible diff markup. Use compression to reduce the DB size.
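
A minimal sketch, assuming Node's built-in zlib module and a node-sqlite3 style db.run(); the table and column names are assumptions:

 // Sketch: gzip the (highly compressible) result XML before inserting it,
 // and gunzip it again when serving results.
 var zlib = require('zlib');

 function storeResult(db, pageId, commitHash, resultXml, cb) {
     zlib.gzip(resultXml, function (err, compressed) {
         if (err) { return cb(err); }
         db.run('INSERT INTO results (page_id, commit_hash, result) VALUES (?, ?, ?)',
                [pageId, commitHash, compressed], cb);
     });
 }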

(Lower priority) Chunk tests
Handing out single titles has a relatively high overhead, especially if the tests themselves have a very short runtime. After improvements to the getTitle query, the coordinator is fast enough for ~20 clients working on (pretty slow) round-trip testing. For shorter tests or more clients, handing out chunks of titles / tests to clients could improve performance. An old patch is available in Gerrit.

Lowest priority: Switch database to not sqlite
Don't do this until everything else is finished. Just don't. But we would maybe like to switch to a different database system. Pick your favorite: it should be more of a "proper" database system than SQLite (so CouchDB and dirtyDB might be out), but other than that, if it improves performance, we don't much mind which one.

And of course, be awesome and include a lot of verbose descriptions as to what exactly the new database is, and how it works.

Categories of roundtrip problems (with example pages) that need fixing
RT test pages -- this page lists the different kinds of RT issues we are trying to fix, along with test pages to replicate the problems.