Parser 2011/Wikitext.next

This page is a work in progress; more to come and open calls for participation coming soon! --brion

a project portal to be...


 * "Imagine a world where any wiki page can be parsed consistently by anyone in the world. That's what we're doing."

Problems
MediaWiki's markup syntax has grown organically since 2002, itself based on earlier wiki syntaxes going back to 1995. None of it is very consistent, and there are a lot of edge cases that tend to surprise folks.


 * Raw markup can look really ugly and intimidating to editors
 * Tables, templates, tags, etc. have many unexpected boundary conditions, which makes some uses of these constructs hard to deal with even for experts
 * Lack of structure or standardization means that changes to the parser code can unexpectedly change those cases
 * Combination of edge cases makes round-tripping to HTML and back very hard, which has made it difficult to get rich text editing fully integrated

There have been many attempts at making a more self-consistent parser that works with syntax similar to the current one, but it's been very difficult for any of them to really solidify and be adopted in MediaWiki itself.

Requirements

 * Document structure that can be used in-memory for transformations, and can be easily sent over the wire, e.g. in JSON serialization
 ** Being able to address and manipulate page sections, paragraphs, links, images, template parameters, etc. is very useful for editing tools, for people extracting data from bulk dumps, and for making the rest of the wiki sane!
 * Consistent way to do template transformations etc. at the document structure level
 ** Current templates can be hard to reason about because interactions between levels of parsing and expansion are confusing. A normalized templating system that doesn't pop outside of its boundaries, and leaves no ambiguous syntax, would be valuable.
 * Consistent way to parse wikitext into a document structure, which should be reasonably compact and easy to port to other languages/environments
 ** The current preprocessor aims a little in this direction, but it's internal and doesn't cover everything. Defining which bits do what should make it easier to adapt code to other implementations, or to create a compatible implementation for an external tool.
 * Consistent way to serialize document structure back to wikitext, to keep compatibility with source-editing workflows and source-based storage
 ** At a minimum, we should be able to serialize any document tree out to a parseable chunk of text that round-trips back to the document. However, there are multiple ways to write some constructions; when starting from source, we should be able to return to the original source if we just save it back out, so that tools built on manipulation of document structure don't cause surprises for folks looking at diffs of source.
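To make the "JSON-friendly document structure" requirement concrete, here's a small sketch of what such a tree might look like and how tools could address nodes in it. The node type names and fields here are invented for illustration; they are not a proposed schema:

```python
import json

# A hypothetical JSON-friendly document tree for a tiny wiki page.
# Node "type" names and fields are invented for this sketch only.
doc = {
    "type": "document",
    "children": [
        {"type": "heading", "level": 2,
         "children": [{"type": "text", "value": "History"}]},
        {"type": "paragraph",
         "children": [
             {"type": "text", "value": "See "},
             {"type": "link", "target": "MediaWiki",
              "children": [{"type": "text", "value": "the wiki engine"}]},
             {"type": "text", "value": " for details."},
         ]},
        {"type": "template", "name": "citation needed",
         "params": {"date": "February 2011"}},
    ],
}

# Easy to send over the wire...
wire = json.dumps(doc)

# ...and easy for tools to address and manipulate structurally,
# e.g. finding every template invocation in a page:
def find_templates(node):
    found = []
    if node.get("type") == "template":
        found.append(node["name"])
    for child in node.get("children", []):
        found.extend(find_templates(child))
    return found

print(find_templates(json.loads(wire)))  # ['citation needed']
```

A bulk-dump consumer or editing tool working against a structure like this never has to re-parse raw markup to find, say, all template parameters on a page.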

Milestones

 * Define a wiki document structure that conceptually matches how we structure wikitext pages
 ** JSON-friendly structure
 ** Tables, templates, etc. will be more limited in structure: templates must be workable by fairly straightforward tree transformations
 *** This will introduce some incompatibilities... more on this later!
 * Define parsing rules for taking wikitext and producing the document structure
 ** Remain very nearly compatible with the current parser, but know that some breaks will be made.
 ** (optionally) Retain enough information in the structure that we can round-trip to the original source
 ** Every input must produce valid output, with the caveat that the output may include chunks marked "wasn't sure what to do with this".
 * Build tools to identify incompatibilities in existing usage to aid in migration
 * Devise a progressive migration scheme whereby pages or templates get confirmed as working in the new parser, and their uses get bumped over gradually.
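The "every input must produce valid output" milestone can be satisfied by a parser that falls back to an explicit opaque-chunk node whenever it can't classify its input, rather than guessing or dropping text. A minimal sketch of that idea — the toy line grammar and node shapes here are assumptions for illustration, not the real design:

```python
def parse_line(line):
    """Toy line-level parser: recognizes headings and plain paragraphs.
    Anything it can't classify becomes an explicit 'unknown' chunk that
    still carries its original source text."""
    if line.startswith("==") and line.endswith("=="):
        return {"type": "heading", "source": line,
                "text": line.strip("= ").strip()}
    if line.startswith("{|"):
        # Table syntax isn't handled by this sketch: keep the raw text
        # so the output is always valid and nothing is silently dropped.
        return {"type": "unknown", "source": line}
    return {"type": "paragraph", "source": line, "text": line}

def serialize(nodes):
    # Round-trip guarantee: every node retains its original source,
    # so saving an unmodified tree back out reproduces the input exactly.
    return "\n".join(node["source"] for node in nodes)

text = "== History ==\n{| class=wikitable\nPlain paragraph."
nodes = [parse_line(l) for l in text.split("\n")]
assert serialize(nodes) == text          # lossless round-trip
assert nodes[1]["type"] == "unknown"     # table flagged, not mangled
```

Keeping the original source on every node is one way to meet both the round-trip milestone and the "valid output for every input" milestone at once; the 'unknown' chunks also double as a worklist for migration tooling.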

Iteration
We'll be iterating on this a lot before it solidifies; we'll need some tools to help:
 * Automated HTML comparison for identifying mismatches between the old and new parsers (with fewer false positives from irrelevant whitespace, ordering of attributes, etc.)
 * Pattern-matching of known problematic structures to aid in the next steps of supporting them or planning for migration
 * Side-by-side rendering & parse tree breakdown, with live rendering to help figure out what's happening
 * Batch reports of consistency on our existing datasets
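The automated HTML comparison above could work by normalizing both parsers' output before diffing, so cosmetic differences don't register as mismatches. A rough sketch of that normalization using Python's standard-library HTML parser (not an existing MediaWiki tool):

```python
from html.parser import HTMLParser

class Normalizer(HTMLParser):
    """Reduce HTML to a canonical token list: attributes sorted,
    runs of whitespace collapsed, so cosmetic differences between
    old- and new-parser output don't count as mismatches."""
    def __init__(self):
        super().__init__()
        self.tokens = []
    def handle_starttag(self, tag, attrs):
        self.tokens.append((tag, sorted(attrs)))
    def handle_endtag(self, tag):
        self.tokens.append(("/" + tag, []))
    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace runs
        if text:
            self.tokens.append(("#text", text))

def normalize(html):
    n = Normalizer()
    n.feed(html)
    return n.tokens

# Differ only in attribute order and whitespace -> treated as equal:
old = '<p class="a" id="x">Hello   world</p>'
new = '<p id="x" class="a">Hello world</p>'
assert normalize(old) == normalize(new)
```

Batch reports would then just be this comparison run across an existing dump, counting pages whose normalized token streams diverge.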

Migration
Problem cases include:
 * page sections with odd boundaries
 * templates with odd boundaries
 * weird table/template mixes

Prior work and related stuff

 * Data summit 2011/Parsers
 * Alternative parsers
 * Markup spec
 * Wikitext standard
 * Category:Parser