Parser 2011/Parser plan

Draft starting at http://etherpad.wikimedia.org/mwhack11Sat-ParserDraft -- more will be copied to the wiki after more polishing.

Terms
Parsing terminology tends to be used inexactly within MediaWiki, so we want to use the following terms consistently. :)
 * wikitext: the markup format used by MediaWiki
 * page: an object in MediaWiki containing wikitext data. Pages are referred to by a site-unique title, and may have metadata and versioning.
 * title: MediaWiki page titles are site-unique within each namespace in the wiki's page database
 * parsing: the entire process of transforming wikitext source into an output-ready form -- output may be HTML, renormalized save-ready wikitext, or a syntax tree.
 * Parser: the Parser class and its related parts, which together perform parsing
 * Preprocessor: the Preprocessor class converts wikitext to an XML or hashmap tree representing parts of the document -- template invocations, parser functions, tag hooks, section headings, and a few other structures.

Description of parsing context
Wikitext parser/Context


 * page title, text contents, notions of other pages accessible
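
As a rough illustration of the item above, the context could be modelled as a small record. This is only a sketch: the names ParseContext and fetch_page are hypothetical and do not come from MediaWiki.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ParseContext:
    """Hypothetical container for what a parser needs to know about a page."""
    title: str   # site-unique title of the page being parsed
    text: str    # wikitext contents of the page
    # Lookup for the wikitext of other pages; returns None if the page is missing.
    fetch_page: Callable[[str], Optional[str]]


# Usage: an in-memory dict stands in for the wiki's page database.
pages = {"Template:Hello": "Hello, world!"}
ctx = ParseContext(title="Main Page", text="{{Hello}}", fetch_page=pages.get)
```

Passing the page lookup in as a callback keeps the parser independent of any particular storage backend.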

Stage 1 formal grammar
Wikitext parser/Stage 1: Formal grammar

Rule sets to parse a wikitext source string into an abstract syntax tree (AST). It should be possible to use fairly standard parser generators to produce stub code in many languages with a minimum of manual construction. Some rule combinations will depend on context information, such as being inside or outside of a template, but the rules themselves remain consistent. See Preprocessor ABNF for a similar description of the current MediaWiki Preprocessor, which covers a smaller subset of the syntax.
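
To make the target of this stage concrete, here is a toy hand-written recursive-descent parser for a tiny subset of the syntax (plain text and nested {{name|arg}} invocations). It is a stand-in for generator-produced code, not the planned grammar; the node types Text and Template are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union


@dataclass
class Text:
    value: str


@dataclass
class Template:
    name: str
    args: List[List["Node"]] = field(default_factory=list)


Node = Union[Text, Template]


def parse(src: str) -> List[Node]:
    """Parse a wikitext string into a list of AST nodes."""
    nodes, _ = _parse_nodes(src, 0, stop=())
    return nodes


def _parse_nodes(src: str, pos: int, stop: Tuple[str, ...]):
    """Collect nodes until a `stop` string or the end of input is reached."""
    nodes: List[Node] = []
    buf: List[str] = []
    while pos < len(src):
        if any(src.startswith(s, pos) for s in stop):
            break
        if src.startswith("{{", pos):
            if buf:
                nodes.append(Text("".join(buf)))
                buf = []
            node, pos = _parse_template(src, pos + 2)
            nodes.append(node)
        else:
            buf.append(src[pos])
            pos += 1
    if buf:
        nodes.append(Text("".join(buf)))
    return nodes, pos


def _parse_template(src: str, pos: int):
    """Parse the inside of {{...}}: a name followed by |-separated arguments."""
    parts = []
    while True:
        part, pos = _parse_nodes(src, pos, stop=("|", "}}"))
        parts.append(part)
        if src.startswith("|", pos):
            pos += 1
            continue
        if src.startswith("}}", pos):
            pos += 2
        # Simplification: an unclosed template at end of input is still
        # returned as a Template; a real grammar would fall back to text.
        break
    name = "".join(n.value for n in parts[0] if isinstance(n, Text)).strip()
    return Template(name, parts[1:]), pos


tree = parse("a{{b|c{{d}}}}e")
```

Here tree holds three nodes: the text "a", a template "b" whose single argument contains the text "c" and a nested template "d", and the text "e".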

Stage 2 annotated steps
Wikitext parser/Stage 2: Informal grammar

Informal description of processing stages applied to the AST to perform steps that can't be expressed in the formal grammar. We hope to cut down the number of explicit steps significantly compared with the current Parser class.

Expansion stage annotated steps
Wikitext parser/Stage 3: Expansion

Informal description of how to handle expansion of templates: combining the parent page's tree with the template page's tree, resolving template parameters, etc.
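
A minimal sketch of tree-level expansion, assuming positional parameters only and a pre-parsed dict of template trees. The node types, the expand function and the depth limit are all hypothetical choices for illustration, not the planned design.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Union


@dataclass
class Text:
    value: str


@dataclass
class Param:
    name: str  # "1" stands for {{{1}}}


@dataclass
class Template:
    name: str
    args: List[List["Node"]] = field(default_factory=list)


Node = Union[Text, Param, Template]


def expand(nodes: List[Node],
           templates: Dict[str, List[Node]],
           params: Optional[Dict[str, List[Node]]] = None,
           depth: int = 0,
           max_depth: int = 40) -> List[Node]:
    """Splice template trees into the parent tree, resolving parameters."""
    params = params or {}
    out: List[Node] = []
    for node in nodes:
        if isinstance(node, Text):
            out.append(node)
        elif isinstance(node, Param):
            # Unbound parameters are left visible as literal text.
            out.extend(params.get(node.name, [Text("{{{" + node.name + "}}}")]))
        else:
            body = templates.get(node.name)
            if body is None or depth >= max_depth:
                # Missing template or runaway recursion: keep a placeholder.
                out.append(Text("{{" + node.name + "}}"))
                continue
            # Arguments are expanded in the caller's frame, then bound by position.
            bound = {
                str(i + 1): expand(arg, templates, params, depth, max_depth)
                for i, arg in enumerate(node.args)
            }
            out.extend(expand(body, templates, bound, depth + 1, max_depth))
    return out


templates = {"greet": [Text("Hello, "), Param("1"), Text("!")]}
page = [Template("greet", [[Text("world")]])]
flat = "".join(n.value for n in expand(page, templates))  # "Hello, world!"
```

The depth guard is there only so a template that includes itself terminates; the limit of 40 is an arbitrary value for the sketch.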

Parser function addenda
Wikitext parser/Core parser functions

Parser functions are roughly like template invocations, but they call code instead of fetching contents directly from another page. This is the primary mechanism for extending the document format's features in MediaWiki, as it creates no new syntax requirements. Most parser functions will expand into a subtree like normal templates do, but some could expand into custom node types (e.g. extensions). The core section describes an abstract API between a core parser and callbacks for parser functions; the addenda describe the standard parser functions (those shipped as part of MediaWiki core today). For example, a time-formatting parser function formats a time taken from a parameter or from the current local time as provided by the parsing context. The description should be sufficient to write a compatible implementation, or a reasonable fallback behavior if the exact function isn't suitable for some implementation.
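
One possible shape for the abstract API is a registry of callbacks that receive the parsing context and the already-expanded arguments. Everything named here (ParseContext, REGISTRY, the "formattime" function, the fallback format) is hypothetical. The format strings are Python's strftime codes, not MediaWiki's, and the callback returns a plain string rather than a subtree to keep the sketch short.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Dict, List


@dataclass
class ParseContext:
    title: str
    now: datetime  # current local time, supplied by the embedding application


ParserFunction = Callable[[ParseContext, List[str]], str]
REGISTRY: Dict[str, ParserFunction] = {}


def parser_function(name: str):
    """Decorator that registers a callback under a parser function name."""
    def register(fn: ParserFunction) -> ParserFunction:
        REGISTRY[name] = fn
        return fn
    return register


@parser_function("formattime")
def format_time(ctx: ParseContext, args: List[str]) -> str:
    # First argument: format string. Optional second argument: ISO timestamp.
    fmt = args[0] if args else "%Y-%m-%d"
    when = datetime.fromisoformat(args[1]) if len(args) > 1 else ctx.now
    return when.strftime(fmt)


def call_parser_function(ctx: ParseContext, name: str, args: List[str]) -> str:
    fn = REGISTRY.get(name)
    if fn is None:
        # Fallback for unknown functions: leave the invocation visible.
        return "{{#" + name + ":" + "|".join(args) + "}}"
    return fn(ctx, args)


ctx = ParseContext(title="Main Page", now=datetime(2011, 5, 14, 12, 0))
today = call_parser_function(ctx, "formattime", ["%Y-%m-%d"])  # "2011-05-14"
```

Taking the clock from the context rather than calling datetime.now() inside the callback keeps the function deterministic and easy to test.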

Tag hook addenda
Wikitext parser/Core tag hooks

Tag hooks have an XML-like source syntax rather than the curly-brace MediaWiki template/parser function syntax. Unlike parser functions, the parameters and text content passed to a tag hook are not automatically expanded as parse trees, though a specific tag hook may choose to run its data back through the parser for this purpose. The core section describes an abstract API between a core parser and callbacks for tag hooks; the addenda describe the standard tag hooks (those shipped as part of MediaWiki core today: nowiki, pre, gallery, html).
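
The difference from parser functions can be sketched as follows: the hook receives raw, unexpanded content plus a callback it may use to re-enter the parser. The hook signature, the "boxed" tag and toy_parse are assumptions for the example, and nowiki_like only approximates the idea of a hook that never reparses its content.

```python
import html
import re
from typing import Callable, Dict

Reparse = Callable[[str], str]
TagHook = Callable[[str, Dict[str, str], Reparse], str]


def nowiki_like(content: str, attrs: Dict[str, str], reparse: Reparse) -> str:
    # The raw content is emitted literally and never handed back to the parser.
    return html.escape(content, quote=False)


def boxed(content: str, attrs: Dict[str, str], reparse: Reparse) -> str:
    # Hypothetical hook that opts in to parsing its own content.
    return '<div class="box">' + reparse(content) + "</div>"


HOOKS: Dict[str, TagHook] = {"nowiki": nowiki_like, "boxed": boxed}


def run_tag_hook(name: str, content: str, attrs: Dict[str, str],
                 reparse: Reparse) -> str:
    return HOOKS[name](content, attrs, reparse)


def toy_parse(text: str) -> str:
    # Stand-in for the real parser: only converts '''bold''' markup.
    return re.sub(r"'''(.+?)'''", r"<b>\1</b>", text)


literal = run_tag_hook("nowiki", "'''x''' <b>", {}, toy_parse)
parsed = run_tag_hook("boxed", "'''x'''", {}, toy_parse)
```

With these definitions, literal keeps the apostrophes untouched and escapes the angle brackets, while parsed contains the bold markup converted by toy_parse.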

Get involved

 * Join the wikitext-l mailing list if interested in following along or getting involved; there should be posts from Brion, Trevor, or Neil at least a couple times a week, and we're going to need feedback and help!
 * Give feedback on the initial prelim docs & demos via Future/AST (to come soon)
 * Collect references to existing alternate parser output formats via Future/AST
 * Collect test cases (example pages, known problematic pages, corpus from Wikipedia, adapted parser tests) via Future/Parser test cases