Parser 2011/Wikitext.next

This page is a work in progress; more to come and open calls for participation coming soon! --brion

a project portal to be...


 * "Imagine a world where any wiki page can be parsed consistently by anyone in the world. That's what we're doing."

Problems
MediaWiki's markup syntax has grown organically since 2002, itself based on earlier wiki syntaxes going back to 1995. None of it's very consistent, and there are a lot of edge cases which tend to surprise folks.


 * Raw markup can look really ugly and intimidating to editors
 * Tables, templates, tags, etc. have many unexpected boundary conditions, which makes some uses of these constructs hard to deal with even for experts
 * Lack of structure or standardization means that changes to the parser code can unexpectedly change those cases
 * Combination of edge cases makes round-tripping to HTML and back very hard, which has made it difficult to get rich text editing fully integrated

There have been many attempts at making a more self-consistent parser that works with something similar to the current syntax, but it's been very difficult for any of them to really solidify and be used in MediaWiki itself.

Requirements

 * Document structure that can be used in-memory for transformations, and can be easily sent over the wire, e.g. in JSON serialization
 ** Being able to address and manipulate page sections, paragraphs, links, images, template parameters etc. is very useful for editing tools, people extracting data from bulk dumps, and making the rest of the wiki sane!
 ** Having the structure should be as useful as having the flat source, but will allow many tools to work with the structured document data instead of worrying about duplicating the parser.
 * Consistent way to do template transformations etc. at the document structure level
 ** Current templates can be hard to reason about because interactions between levels of parsing and expansion are confusing. A normalized templating system that doesn't pop outside of its boundaries, and leaves no ambiguous syntax, would be valuable.
 * Consistent way to parse wiki text into a document structure, which should be reasonably compact and easy to port to other languages/environments
 ** The current preprocessor aims a little in this direction, but it's internal and doesn't cover everything. Defining which bits do what should make it easier to adapt code to other implementations, or create a compatible implementation for an external tool.
 * Consistent way to serialize document structure back to wikitext to keep compat with source-editing workflow and source-based storage
 ** At a minimum, we should be able to serialize any document tree out to a parseable chunk of text that round-trips back to the document. However, there are multiple ways to write some constructions; when starting from source, we should be able to return to the original source if we just save it back out, so tools built on manipulation of document structure don't cause surprises for folks looking at diffs of source.
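
To make the first requirement concrete, here is a minimal sketch of what a JSON-friendly document tree could look like. The node shape (`type`, `children`, `params`) and the helper names `walk` and `find` are assumptions made up for illustration; nothing here is a decided format.

```python
import json

# Hypothetical node shape: a dict with a "type", optional type-specific
# fields, and a "children" list. Template parameters hold node lists too.
doc = {
    "type": "page",
    "children": [
        {"type": "heading", "level": 2,
         "children": [{"type": "text", "text": "History"}]},
        {"type": "paragraph", "children": [
            {"type": "text", "text": "See "},
            {"type": "link", "target": "Main Page",
             "children": [{"type": "text", "text": "the main page"}]},
            {"type": "text", "text": "."},
        ]},
        {"type": "template", "name": "Infobox",
         "params": {"title": [{"type": "text", "text": "Example"}]},
         "children": []},
    ],
}


def walk(node):
    """Yield a node and all of its descendants, depth-first."""
    yield node
    for child in node.get("children", []):
        yield from walk(child)
    for value in node.get("params", {}).values():
        for child in value:
            yield from walk(child)


def find(node, node_type):
    """Return every node of the given type, e.g. all links on a page."""
    return [n for n in walk(node) if n["type"] == node_type]


# The same structure works in memory and over the wire.
wire = json.dumps(doc)
restored = json.loads(wire)
link_targets = [n["target"] for n in find(restored, "link")]
```

A tool working on bulk dumps could use something like `find` to pull out links or template parameters without reimplementing the parser.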

Milestones

 * 1) Define a wiki document structure that conceptually matches how we structure wikitext pages
 ** JSON-friendly structure
 ** Tables, templates etc. will be more limited in structure: templates must be workable by fairly straightforward tree transformations
 *** this will introduce some incompatibilities... more on this later!
 * 2) Define parsing rules for taking wikitext and producing the document structure
 ** Remain very nearly compatible with the current parser, but know that some breaks will be made.
 ** (optionally) retain enough information in the structure that we can round-trip to original source
 ** Every input must produce valid output, with the caveat that the output may include chunks that are marked 'wasn't sure what to do with this'.
 * 3) Build tools to identify incompatibilities in existing usage to aid in migration
 * 4) Devise progressive migration scheme whereby pages or templates get confirmed as working in the new parser, and their uses get bumped over gradually.
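
Two of the parsing goals above (every input produces valid output, and enough information is retained to round-trip to the original source) can be illustrated with a toy line-based parser. This is only a sketch under made-up assumptions: the node types, the `src` field, and the rule that table, template and tag lines become `unknown` nodes are all hypothetical, and a real parser would not be line-based.

```python
import re

# Matches "== Title ==" style headings with 1 to 6 equals signs per side.
HEADING = re.compile(r"^(={1,6})\s*(.*?)\s*\1\s*$")


def parse(source):
    """Parse any string into a tree; never raises on odd input."""
    nodes = []
    for line in source.splitlines(keepends=True):
        body = line.rstrip("\r\n")
        match = HEADING.match(body)
        if match:
            node = {"type": "heading", "level": len(match.group(1)),
                    "text": match.group(2)}
        elif body.startswith(("{|", "|}", "{{", "<")):
            # This toy parser does not understand these constructs,
            # so it marks them instead of failing or guessing.
            node = {"type": "unknown"}
        elif body.strip() == "":
            node = {"type": "blank"}
        else:
            node = {"type": "text", "text": body}
        node["src"] = line  # exact original text, kept for round-tripping
        nodes.append(node)
    return {"type": "page", "children": nodes}


def serialize(doc):
    """Write the tree back out; untouched nodes reproduce their source."""
    return "".join(node["src"] for node in doc["children"])


sample = "==  Odd spacing==\n{| class=x\nplain text\n|}"
tree = parse(sample)
```

Because each node keeps its own source span, saving an unmodified tree gives back the original text byte for byte, odd spacing included, so diffs of source stay quiet.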

Iteration
We'll be iterating this a lot before it solidifies; we'll need some tools to help:
 * Automated HTML comparison for identifying mismatches between old and new parser (with fewer false positives from irrelevant whitespace, ordering of attributes, etc.)
 * Pattern-matching of known problematic structures to aid in next steps in supporting or planning for migration
 * Side-by-side rendering & parse tree breakdown, live rendering to help in figuring out what things happen
 * Batch reports of consistency on our existing datasets
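
As a rough idea of what the HTML comparison tool might do, the sketch below normalizes two HTML fragments before comparing them, so that attribute order and runs of whitespace don't count as differences. It uses Python's standard `html.parser`; the function names `normalize` and `same_rendering` are invented here, and dropping whitespace-only text is a deliberately loose choice that a real tool would need to refine (whitespace between inline elements can matter).

```python
from html.parser import HTMLParser


class _Normalizer(HTMLParser):
    """Collects a list of parse events with cosmetic differences removed."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Sort attributes so their order in the source does not matter.
        ordered = tuple(sorted((name, value or "") for name, value in attrs))
        self.events.append(("start", tag, ordered))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Collapse whitespace runs and skip whitespace-only text.
        text = " ".join(data.split())
        if text:
            self.events.append(("text", text))


def normalize(html_text):
    parser = _Normalizer()
    parser.feed(html_text)
    parser.close()
    return parser.events


def same_rendering(old_html, new_html):
    """True if the two fragments differ only cosmetically."""
    return normalize(old_html) == normalize(new_html)


old = '<p class="a" id="x">Hello   world</p>\n<ul><li>one</li></ul>'
new = '<p id="x" class="a">Hello world</p><ul>\n  <li>one</li>\n</ul>'
```

Run over a batch of pages, a check like `same_rendering(old, new)` could feed the consistency reports mentioned above.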

Performance
Ideally, we want to be at least as fast as the old parser in a pure-PHP implementation in MediaWiki, for the whole source->HTML end to end.

However, there may be trade-offs in some areas; there are also some very different possible performance characteristics:
 * Parsing to document form should be entirely independent of user, and of details of external resources (presence or absence of pages, content of templates, etc).
 * In principle we can save the parse tree permanently along with saved source text, making that step skippable on future renderings.
 * User-specific options should generally only affect the final stages converting from the transformed parse tree to HTML. (But... language switching for instance may not.)
 * Interpolation of templates, functions etc is more aggressively separated from parsing
 * Transformation of the parse tree for template inclusion etc could be done client-side in certain circumstances.
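
The split described above (user-independent parsing, user-specific rendering) could look something like the following sketch. Everything here is a stand-in: `toy_parse` is not the real parser, `ParseCache` is a hypothetical in-memory cache keyed by a hash of the source, and `paragraph_class` is an invented user option.

```python
import hashlib
import html


def toy_parse(source):
    """Stand-in parser: depends only on the source text, never on the user."""
    return [{"type": "paragraph", "text": block.strip()}
            for block in source.split("\n\n") if block.strip()]


class ParseCache:
    """Keeps parse trees keyed by a hash of the source text."""

    def __init__(self, parser):
        self._parser = parser
        self._trees = {}
        self.misses = 0  # how many times we actually had to parse

    def get(self, source):
        key = hashlib.sha256(source.encode("utf-8")).hexdigest()
        if key not in self._trees:
            self.misses += 1
            self._trees[key] = self._parser(source)
        return self._trees[key]


def render(tree, user_options):
    """Only this final step looks at user-specific options."""
    css_class = user_options.get("paragraph_class", "")
    attr = f' class="{html.escape(css_class, quote=True)}"' if css_class else ""
    return "".join(f"<p{attr}>{html.escape(node['text'])}</p>" for node in tree)


cache = ParseCache(toy_parse)
source = "First paragraph.\n\nSecond & last."
plain = render(cache.get(source), {})
large = render(cache.get(source), {"paragraph_class": "large"})
```

Two users with different options share one parse; only `render` runs twice. A persistent store alongside the saved source text would play the role of the dictionary here.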

Compatibility (back)
There will likely be significant backward-compatibility breakages with some template & table constructs. These will need to be planned for in migration, which will likely not be an immediate switchover for Wikimedia sites, but piecemeal.

Compatibility (future)
For the most part, we should be able to avoid adding crazy new syntax in the future; existing parser function and tag hook interfaces allow pretty much anything to be plugged in as inline or block bits and will be able to feed into rich editing.

The main difficulty in compatibility of data & parsing between sites or software suites will likely be in the actual implementation of extensions, particularly with things like #if style functions that might have their own logic, or those that access state in the wiki. If a tool that doesn't know a given extension works with the page, it will round-trip the invocations through just fine but won't know how to render them, so they'll push through as ugly source or such.

This is something to consider when devising new things to use in fancy compound templates.
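
A small sketch of that fallback behaviour, assuming a hypothetical node shape where each function invocation keeps its original source in `src`. The `#if` handler below is a simplified stand-in, not MediaWiki's actual implementation.

```python
import html


def render_call(node, handlers):
    """Render a parser-function node, or fall back to its raw source."""
    handler = handlers.get(node["name"])
    if handler is None:
        # Unknown extension: we cannot evaluate it, so show the source.
        # The node itself is untouched and still round-trips on save.
        return html.escape(node["src"])
    return handler(node["args"])


def simple_if(args):
    """Simplified stand-in for an #if style function."""
    condition, then_text, else_text = args
    return then_text if condition.strip() else else_text


call = {
    "type": "function",
    "name": "#if",
    "args": ["x", "yes", "no"],
    "src": "{{#if: x | yes | no}}",
}

with_extension = render_call(call, {"#if": simple_if})
without_extension = render_call(call, {})
```

A site with the extension installed gets the evaluated result; a tool without it shows the invocation as ugly source but does not lose it.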

Standalone magic words, and "magical" behavior of special links based on local namespace names or interwiki setup, are another source of inconsistency as the wiki's state (configured language, namespaces, related wikis) will affect how some bits are dealt with.

Long term and source
It's conceivable that in the long term we'll dump wikitext source-level editing entirely, always using the structured document form and wrapping friendly, flexible editing tools around it. In the foreseeable future we expect to keep all the source around though!

Migration
Problem cases include:
 * page sections with odd boundaries
 * templates with odd boundaries
 * weird table/template mixes
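
A first pass at spotting templates with odd boundaries could be as crude as counting openers and closers in a template's source. The sketch below is only an assumption about how such a scan might start: the `PAIRS` table and the name `odd_boundaries` are invented, and plain substring counting will give false positives (for example, `{{{x|}}}` contains `|}`).

```python
# Hypothetical constructs to check: name -> (opening marker, closing marker).
PAIRS = {
    "table": ("{|", "|}"),
    "div": ("<div", "</div>"),
    "span": ("<span", "</span>"),
}


def odd_boundaries(source):
    """List constructs that are opened and closed an unequal number of times."""
    return [name for name, (opener, closer) in PAIRS.items()
            if source.count(opener) != source.count(closer)]


# A "table start" template that leaves its table open for another
# template to close, which a tree-based parser would struggle with.
table_start = '{| class="wikitable"\n|-\n| cell'
balanced = "{|\n| a\n|}"
```

Here `odd_boundaries(table_start)` reports the table and `odd_boundaries(balanced)` reports nothing; flagged templates would then be candidates for manual review during migration.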

Prior work and related stuff

 * Data summit 2011/Parsers
 * Alternative parsers
 * Extension:XML Bridge, Extension:XML_Bridge/Examples
 * Markup spec
 * Wikitext standard
 * Category:Parser