Parsoid/Todo

If you would like to hack on the Parsoid parser, these are the tasks we currently see ahead. Some of them are marked as especially well suited for newbies.

'''Please report issues in the Parsoid product in Bugzilla. You can also add problematic wikitext snippets to Parsoid/Bug_test_cases.''' See also the list of open issues on Bugzilla.

If you have questions, try pinging gwicke or subbu on #mediawiki, or send mail to the wikitext-l mailing list.

Q1/2 2013 planning
-> Worked into our roadmap.

Next tasks

 * Tasks with priority 'normal' in the bug list
 * Talk:Parsoid/Todo
 * set up a test wiki with current VE in parsoid.wmflabs.org VM, and test saving and round-tripping
 * duplicate  and a few other bugs in  and
 * Work on round-trip test pages: Parsoid/Roundtrip testpages

Limitations
See Parsoid/limitations.

Provide API for the registration of custom content serializer handlers by RDFa type
Needed to support serialization of things like custom DOM for videos linked to in the file namespace. Parser hook extensions would normally be handled generically (with source-based editing support at most), but might also want to register custom serializers when using DOM-based editing of contents. Examples for this would be the gallery or cite extensions.
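One possible shape for such a registration API, as a minimal sketch only. It is written in Python for brevity rather than in Parsoid's own JavaScript, and every name in it (SerializerRegistry, register, lookup, the dict-based node, the "x:Gallery" type) is hypothetical and not part of Parsoid:

```python
from typing import Callable, Dict, Optional

# Hypothetical handler signature: takes a DOM-like node (a plain dict here)
# and returns the wikitext for it.
Handler = Callable[[dict], str]


class SerializerRegistry:
    """Maps RDFa type strings (values of a node's typeof attribute) to handlers."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}

    def register(self, rdfa_type: str, handler: Handler) -> None:
        """Register (or replace) the handler for one RDFa type."""
        self._handlers[rdfa_type] = handler

    def lookup(self, node: dict) -> Optional[Handler]:
        """Return the handler for the first registered type on the node, if any."""
        # typeof may hold several space-separated types.
        for rdfa_type in node.get("typeof", "").split():
            handler = self._handlers.get(rdfa_type)
            if handler is not None:
                return handler
        return None

    def serialize(self, node: dict, default: Handler) -> str:
        """Serialize with a custom handler, falling back to the generic one."""
        handler = self.lookup(node) or default
        return handler(node)


# Usage: an extension registers a handler for its own (made-up) type.
registry = SerializerRegistry()
registry.register("x:Gallery", lambda node: "<gallery>\n" + node["src"] + "\n</gallery>")
```

Nodes without a registered type fall through to the default handler, which matches the idea that extensions are handled generically unless they opt in.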

Testing
See tests/parser, in particular parserTests.js.

parserTests
Todo:


 * Fix Jenkins integration so that it can be re-enabled.
 * Move random selser changes to use --randomchanges and make the default run use the included changes.

Later:
 * Set up a more complete testing environment including the time, predefined images and so on (see phase3/tests/parser/parserTests.inc).
 * Write tests for the following commits:
 * git SHA 683a485
 * git SHA e72e46f
 * git SHA 4b2e27a
 * git SHA f67cb40
 * https://gerrit.wikimedia.org/r/28691 (git SHA 46c24c2)
 * https://gerrit.wikimedia.org/r/27851 (git SHA fa52c48)
 * Fix image tests to be insensitive to the order of attributes, and to the use of figure and figcaption tags instead of a and image tags.
 * Add/update the testing setup for DSR computation to specify DSR expectations on different kinds of DOMs/wikitext.
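For the attribute-order item in the list above, one approach is to normalize both expected and actual HTML before comparing them. The following is a rough sketch in Python using only the standard library. The function name normalize_attr_order is made up for illustration, and the sketch does not address the figure/figcaption difference:

```python
from html import escape
from html.parser import HTMLParser


class _AttrSorter(HTMLParser):
    """Re-emits HTML with each tag's attributes sorted by name."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def _open(self, tag, attrs, close=""):
        parts = [tag]
        # Sort on the attribute name only; values may be None (bare attributes).
        for name, value in sorted(attrs, key=lambda a: a[0]):
            if value is None:
                parts.append(name)
            else:
                parts.append('%s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<" + " ".join(parts) + close + ">")

    def handle_starttag(self, tag, attrs):
        self._open(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._open(tag, attrs, "/")

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def normalize_attr_order(html):
    """Return html with attributes in sorted order (comments are dropped)."""
    parser = _AttrSorter()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Two outputs would then count as equal when normalize_attr_order(expected) == normalize_attr_order(actual). Differences in attribute values or tag names still show up as failures.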


 * Tests DONE
 * https://gerrit.wikimedia.org/r/29338 (git SHA e4785f4)
 * https://gerrit.wikimedia.org/r/29333 (git SHA e89caca)
 * https://gerrit.wikimedia.org/r/28760 (git SHA b3ba624)
 * https://gerrit.wikimedia.org/r/28707 (git SHA 87e7fab)
 * https://gerrit.wikimedia.org/r/28147 (git SHA d858818) (covered by other tests)
 * https://gerrit.wikimedia.org/r/28686 (git SHA 7dba7a6)
 * git SHA 81b0102 -- in ParserFunctions/funcsParserTests.txt
 * git SHA bde798f
 * git SHA ecb7a44
 * git SHA edd1a14
 * https://gerrit.wikimedia.org/r/30065 (SHA 058718ccb0d)
 * https://gerrit.wikimedia.org/r/30794 (git SHA 12b561a)
 * NeedsParserTests keyword in commit messages
 * 77b94472265df
 * 6dc0dff494899d2fc63

Round-trip tests on dumps
Now running on 100k randomly selected pages. Current output at http://parsoid.wmflabs.org:8001/stats. See Parsoid/Round-trip testing for documentation.

Wishlist for the stat server:

Regressions per revision
Show regressions and fixes per revision, relative to the preceding revision (by commit timestamp).

Per-revision query:

 select pages.title, s.errors, s.fails, s.skips,
   ( select stats.score from stats join commits
     on stats.page_id = pages.id and stats.commit_hash = commits.hash and stats.id != s.id
     order by commits.timestamp desc limit 1 ) as oldscore,
   s.score
 from pages join stats as s
   on s.commit_hash = '6394c9b398298906bf527c06120cc305164c2fcf' and s.page_id = pages.id
 where oldscore < s.score
 order by (s.score - oldscore) desc limit 10;

The runtime seems good enough to make it feasible to present a page with regressions / fixes per revision for the last three revisions or so (pageable?). Ideally, each revision would come with a link to the old and new results, the change in stats, and a link to the current rt result (at parsoid.wmflabs.org/_rt/).
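To experiment with the per-revision query outside the stat server, it can be run against a small in-memory SQLite database. The sketch below is in Python. The table layout is inferred from the column names used in the query and is an assumption, as are the helper names make_demo_db and regressions and the sample rows. It also assumes that a higher score is worse, as the comparison in the query suggests:

```python
import sqlite3

# The query from above, with the commit hash and limit as bound parameters.
REGRESSIONS_SQL = """
select pages.title, s.errors, s.fails, s.skips,
       (select stats.score
          from stats join commits
            on stats.page_id = pages.id
           and stats.commit_hash = commits.hash
           and stats.id != s.id
         order by commits.timestamp desc
         limit 1) as oldscore,
       s.score
  from pages join stats as s
    on s.commit_hash = :hash and s.page_id = pages.id
 where oldscore < s.score
 order by (s.score - oldscore) desc
 limit :limit
"""


def regressions(conn, commit_hash, limit=10):
    """Return pages whose score got worse in the given commit."""
    params = {"hash": commit_hash, "limit": limit}
    return conn.execute(REGRESSIONS_SQL, params).fetchall()


def make_demo_db():
    """Build a tiny database with an assumed schema and made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        create table pages (id integer primary key, title text);
        create table commits (hash text primary key, timestamp integer);
        create table stats (id integer primary key, page_id integer,
                            commit_hash text, errors integer, fails integer,
                            skips integer, score integer);
    """)
    conn.executemany("insert into pages values (?, ?)", [(1, "A"), (2, "B")])
    conn.executemany("insert into commits values (?, ?)",
                     [("old", 100), ("new", 200)])
    conn.executemany("insert into stats values (?, ?, ?, ?, ?, ?, ?)", [
        (1, 1, "old", 0, 1, 0, 10),  # page A before
        (2, 1, "new", 0, 3, 0, 30),  # page A after: score went up
        (3, 2, "old", 0, 2, 0, 20),  # page B before
        (4, 2, "new", 0, 1, 0, 10),  # page B after: score went down
    ])
    return conn


if __name__ == "__main__":
    for row in regressions(make_demo_db(), "new"):
        print(row)
```

With the sample rows, only page A is reported for commit "new", since page B improved. Note that the subquery picks the most recent other result for the page, so the comparison is against the preceding revision only when the queried commit is the newest one tested for that page.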

List of results per article
Provide a list of all results / stats per article so that the development of particular issues across revisions can be followed.

Enable #items per page (topfails, topfixes, regressions)
Provide a ?items=N query parameter so we can fetch more than 40 items per page.
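A sketch of how the parameter could be parsed and bounded, in Python. The function name items_per_page and the upper limit of 1000 are assumptions for illustration. The default of 40 is the current page size mentioned above:

```python
from urllib.parse import parse_qs, urlsplit

DEFAULT_ITEMS = 40   # current page size
MAX_ITEMS = 1000     # hypothetical cap so one request cannot dump the whole table


def items_per_page(url, default=DEFAULT_ITEMS, maximum=MAX_ITEMS):
    """Read ?items=N from a request URL, falling back to the default."""
    query = parse_qs(urlsplit(url).query)
    raw = query.get("items", [""])[0]
    try:
        n = int(raw)
    except ValueError:
        return default       # missing or non-numeric value
    if n < 1:
        return default       # zero and negative values make no sense here
    return min(n, maximum)
```

For example, items_per_page("/topfails?items=100") gives 100, while a request without the parameter keeps the current behaviour.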

Test / improve error reporting
Fix / improve error reporting so that:
 * Articles / tests that exceeded their retry limit are listed as errors
 * Sync and async errors on the client are properly reported to the server and listed as an error in the results. This should be mostly implemented now, but could be much improved for async error reporting.

Provide a way to prioritize a selection of tests
Testing all 100k articles currently takes between 24 and 48 hours, so it is not feasible for each revision. We currently prioritize failing articles, but this set is not stable over revisions. A way to prioritize a smaller selection of articles in the DB could provide a way to compare statistics (average skips/fails/errors etc.) between revisions by making it feasible to actually test all those articles for each revision.

Improve classification of differences
There are still quite a lot of syntactic differences that are misclassified as semantic. Improving this would make it easier to focus on real semantic differences.

A statistical classification of semantic diffs would also be useful to identify pages with similar issues, and their frequency. A standard document classifier could probably be employed for this.

Update older stats
A bug in the server caused the number of skips and fails for older tests to be recorded as one less than the number actually present in the result XML. Update the stats table to properly reflect the number of skips/fails in the result XML, so that we get accurate stats for older revisions.

Compress result XML
The result XML blows up the DB quite a bit (>11G now) and contains highly compressible diff markup. Use compression to reduce the DB size.
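A minimal sketch of compressing results before storage, using zlib from the Python standard library. The helper names are made up, and how the blob would be stored (column type, migration of existing rows) is left open:

```python
import zlib


def compress_result(xml_text):
    """Compress result XML to bytes suitable for a BLOB column."""
    return zlib.compress(xml_text.encode("utf-8"), 9)


def decompress_result(blob):
    """Inverse of compress_result."""
    return zlib.decompress(blob).decode("utf-8")


if __name__ == "__main__":
    # Repetitive made-up markup standing in for a real result document.
    sample = "<testsuite>" + "<testcase name='x'><skipped/></testcase>" * 200 + "</testsuite>"
    blob = compress_result(sample)
    print(len(sample.encode("utf-8")), "->", len(blob), "bytes")
```

The actual savings depend on how repetitive the real diff markup is, so it would be worth measuring on a sample of stored results first.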

(Lower priority) Chunk tests
Handing out tests one at a time has a relatively high overhead, especially if the tests themselves have a very short runtime. After improvements to the getTitle query, the coordinator is fast enough for ~20 clients working on (pretty slow) round-trip testing. For shorter tests or more clients, handing out chunks of titles / tests to clients could improve performance. An old patch is available in Gerrit.
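The chunking idea itself is simple. The following Python sketch shows the core of it and is not related to the old Gerrit patch. The name chunked and the chunk size are assumptions:

```python
from itertools import islice


def chunked(titles, size):
    """Yield lists of up to `size` titles, so a client fetches many per request."""
    if size < 1:
        raise ValueError("chunk size must be at least 1")
    it = iter(titles)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


if __name__ == "__main__":
    for chunk in chunked(["Foo", "Bar", "Baz", "Quux", "Xyzzy"], 2):
        print(chunk)
```

A larger chunk size lowers coordinator load but means more work is lost or must be reassigned when a client dies mid-chunk.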

Lowest priority: Switch database away from SQLite
Don't do this until everything else is finished. Just don't. Eventually, though, we might like to switch to a different database system. Pick your favorite, as long as it is more of a "proper" database system than SQLite (so CouchDB and dirtyDB might be out). Other than that, if it improves performance, we don't much mind which one.

And of course, be awesome and include a lot of verbose descriptions as to what exactly the new database is, and how it works.

Categories of roundtrip problems (with example pages) that need fixing
RT test pages -- this page lists the different kinds of RT issues we are trying to fix, along with test pages to replicate the problems.