Manual:Parser

This is an overview of the design of the MediaWiki parser.

Design principles
The MediaWiki parser is not really a parser in the strict sense of the word: it does not recognise a grammar, but rather translates wikitext to HTML. It was called a parser for want of a better word. Even before the term was introduced as a class name, it was generally understood what was meant by "the MediaWiki parser".

Performance is its primary goal, taking precedence over readability of the code and the simplicity of the markup language it defines. As such, changes which improve the performance of the parser will be warmly received.

Since the parser operates on potentially malicious user input up to 2 MB in size, it is essential that its worst-case execution time is proportional to the input size (linear), rather than proportional to the square of the input size (quadratic).
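
To make the linear-time requirement concrete, here is a minimal Python sketch of matching "{{" and "}}" pairs in a single left-to-right pass with a stack. It is an illustration only, not the algorithm the MediaWiki preprocessor uses, and the function name match_braces is hypothetical. A naive alternative that searches forward for a closer from every opener would rescan the rest of the input each time, and so would take quadratic time on input made of thousands of unclosed openers.

```python
def match_braces(text):
    """Return (open_index, close_index) pairs for "{{" ... "}}" in one pass.

    Every character is visited at most once, so the running time is
    proportional to len(text) even for adversarial input.
    """
    stack = []   # positions of "{{" openers that are still unmatched
    pairs = []
    i = 0
    n = len(text)
    while i < n - 1:
        two = text[i:i + 2]
        if two == "{{":
            stack.append(i)
            i += 2
        elif two == "}}" and stack:
            pairs.append((stack.pop(), i))
            i += 2
        else:
            i += 1
    return pairs


# Inner pairs are reported before the pairs that enclose them.
print(match_braces("{{a{{b}}c}}"))
# Unclosed openers cost one stack push each and are never rescanned.
print(match_braces("{{" * 1000))
```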

The parser targets a low-memory environment, assuming a few hundred MB of RAM, and thus it uses markup as intermediate state where possible instead of generating inefficient PHP data structures.

Security is also a critical goal: user input cannot be allowed to leak through into unvalidated HTML output, unless this is specifically configured for the wiki. Remote images, and other markup which causes the client to send a request to an arbitrary remote server, are not allowed by default, for privacy reasons.

History
Lee Daniel Crocker wrote the initial version of MediaWiki in 2002. His wikitext parser was originally inside the OutputPage class, with the main entry point being OutputPage::addWikiText. The basic structure was similar to the current parser. It stripped out non-markup sections such as <nowiki>, replacing them with temporary strip markers. Then it ran a security pass (removeHTMLtags), then a series of transformation passes, and then finally put the strip markers back in.

The transformation passes used plain regex replacement where possible, and tokenization based on explode or preg_split for more complex operations. The complete implementation was about 700 lines.
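
The strip, transform and unstrip structure described above can be sketched in a few lines of Python. This is a toy under stated assumptions, not the historical or current implementation: the marker format, the function names and the single bold pass are all hypothetical, the security pass is omitted, and a real implementation would also have to guard against marker-like sequences in user input.

```python
import html
import re

NOWIKI_RE = re.compile(r"<nowiki>(.*?)</nowiki>", re.DOTALL)
MARKER_RE = re.compile("\x7fSTRIP-\\d+\x7f")


def strip(text):
    """Replace each <nowiki> section with a unique marker; return text and saved content."""
    saved = {}

    def stash(match):
        marker = "\x7fSTRIP-%d\x7f" % len(saved)
        # Escape the protected content so it is shown literally in the HTML.
        saved[marker] = html.escape(match.group(1), quote=False)
        return marker

    return NOWIKI_RE.sub(stash, text), saved


def transform(text):
    """One toy regex pass: '''bold''' becomes <b>bold</b>."""
    return re.sub(r"'''(.+?)'''", r"<b>\1</b>", text)


def unstrip(text, saved):
    """Put the saved content back in place of the markers."""
    return MARKER_RE.sub(lambda m: saved[m.group(0)], text)


def render(text):
    stripped, saved = strip(text)
    return unstrip(transform(stripped), saved)


print(render("'''a''' <nowiki>'''b'''</nowiki>"))
```

Because the markers contain nothing that the transformation passes recognise, the protected content passes through every pass untouched and is restored only at the end.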

Many of the passes still exist with their original names, although almost all of them have been rewritten.

In 2004, Tim Starling split the parser out to Parser.php, and introduced ParserOptions and ParserOutput. He also introduced templates and template arguments. Significant work was contributed by Brion Vibber, Gabriel Wicke, Jens Frank, Wil Mahan and others.

In 2008, for MediaWiki 1.12, Tim merged the strip and replaceVariables passes into a new preprocessor, which was based on building an in-memory parse tree, and then walking the tree to produce expanded wikitext.

In 2011, the Parsoid project began. Parsoid is an independent wikitext parser in JavaScript, introduced to support VisualEditor. It includes an HTML-based DOM model and a serializer which generates wikitext from a (possibly user-edited) DOM. At this point, harmonization with Parsoid became a development goal for the MediaWiki parser.

For some time, it was unclear whether the MediaWiki parser would continue to exist in the long term, or whether it would be deprecated in favour of Parsoid. Current thinking is that at least the preprocessor component of the MediaWiki parser will be retained. Parsoid lacks a complete preprocessor implementation, and relies on remote calls to MediaWiki to provide this functionality.

Entry points
The main public entry points which start a parse operation are:


 * parse: Generates a ParserOutput object, which includes the HTML content area and structured data defining changes to the HTML outside the content area, such as JavaScript modules and navigation links.
 * preSaveTransform: Wikitext to wikitext transformation, called before saving a page.
 * getSection and replaceSection: Section identification and extraction to support section editing.
 * preprocess: Wikitext to wikitext transformation with template expansion, roughly equivalent to the first stage of HTML parsing. This is used by Parsoid to remotely expand templates. Message transformation also uses this function.
 * startExternalParse: This sets up the parser state so that an external caller can directly call the individual passes.

Input
The input to the parser is:


 * Wikitext
 * A ParserOptions object
 * A Title object and revision ID

There are also some dependencies on global state and configuration, notably the content language.

ParserOptions has many options, which collectively represent:


 * User preferences which affect the parser output. This was originally the main application for ParserOptions, which is why it takes a User object as a constructor parameter. It is important that the caching system is aware of such user options, so that users with different options have cached HTML stored in different keys. This is handled via ParserOptions::outputHash.
 * Caller-dependent options. For example, Tidy and limit reporting are only enabled when parsing the main content area of an article. Different options are set for normal page views, previews and old revision views.
 * Test injection data. For example, there is setCurrentRevisionCallback and setTemplateCallback which can be used to override certain database calls.
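
The idea of storing cached HTML under different keys for users with different options can be illustrated with a short Python sketch. It shows only the general technique of hashing the options that actually affected the output; the function names, the key format and the option names (userlang, thumbsize, skin) are hypothetical and are not taken from MediaWiki.

```python
import hashlib


def options_hash(options, used):
    """Hash only the options that influenced the rendered output."""
    parts = ["%s=%s" % (name, options[name]) for name in sorted(used)]
    return hashlib.sha256("!".join(parts).encode("utf-8")).hexdigest()


def cache_key(page_id, options, used):
    """Build a cache key that varies with the page and the used options."""
    return "pcache:%d:%s" % (page_id, options_hash(options, used))


used = {"userlang", "thumbsize"}
alice = {"userlang": "en", "thumbsize": 220, "skin": "a"}
bob = {"userlang": "en", "thumbsize": 220, "skin": "b"}
carol = {"userlang": "de", "thumbsize": 220, "skin": "a"}

# Alice and Bob differ only in an option the parse did not use: same key.
print(cache_key(7, alice, used) == cache_key(7, bob, used))
# Carol differs in a used option: different key.
print(cache_key(7, alice, used) == cache_key(7, carol, used))
```

Restricting the hash to the options that were used avoids fragmenting the cache across users whose preferences did not change the output.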

During a parse operation, the ParserOptions object and the title and revision context are available via the relevant accessors. The input text is not stored in a member variable; it is available only via formal parameters.

Output
Some entry points only return text, but there is always a ParserOutput object available which can be fetched with Parser::getOutput.

The ParserOutput object contains:
 * The "text" HTML fragment, set shortly before parse returns.
 * Extensive metadata about "links", which is used by LinksUpdate to update SQL caches of link information. This includes category membership, image usage, interlanguage and interwiki links, and extensible "page properties". In addition to being used to update database index tables, category and interlanguage links also affect the page display.
 * Various properties which affect the page display outside the content area. This includes JavaScript modules (to be loaded via ResourceLoader), the page title (for the <h1> and <title> elements), indicators, categories and language links.

ParserOutput is a serializable object. It is stored into the ParserCache, often on page save, and retrieved on page view.

The current OutputPage object represents the output from the current request. It is vital that no parser extension directly modifies OutputPage, since such modifications will not be reproduced when the ParserOutput object is retrieved from the cache. Similarly, it is not possible to hook into the skin and use a class static property set during parse to affect the skin output, since on a cache hit no parse takes place and the property is never set.

Instead, extensions wishing to modify the page outside the content HTML can use ParserOutput::setExtensionData to store serializable data which they will need when the page is displayed. Then ParserOutput::addOutputHook can be used to set a hook which will be called when the ParserOutput is retrieved and added to the current OutputPage.
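
The reason for this pattern can be shown with a small Python model of the cache round trip. Everything here is a hypothetical stand-in: the toy ParserOutput and OutputPage classes, the OUTPUT_HOOKS registry, the hook name exampleMap and the JSON serialization are assumptions made for illustration and do not reproduce the MediaWiki API.

```python
import json

OUTPUT_HOOKS = {}  # hook name -> callable(output_page, parser_output, data)


class ParserOutput:
    """Toy stand-in: everything stored here must survive serialization."""

    def __init__(self, text=""):
        self.text = text
        self.extension_data = {}
        self.output_hooks = []  # list of [hook name, data]

    def to_cache(self):
        return json.dumps({
            "text": self.text,
            "extension_data": self.extension_data,
            "output_hooks": self.output_hooks,
        })

    @classmethod
    def from_cache(cls, blob):
        raw = json.loads(blob)
        out = cls(raw["text"])
        out.extension_data = raw["extension_data"]
        out.output_hooks = raw["output_hooks"]
        return out


class OutputPage:
    """Toy stand-in for the per-request output object."""

    def __init__(self):
        self.body = ""
        self.head_items = []

    def add_parser_output(self, parser_output):
        self.body += parser_output.text
        # Hooks run on every page view, including views served from the cache.
        for name, data in parser_output.output_hooks:
            OUTPUT_HOOKS[name](self, parser_output, data)


def add_map_script(output_page, parser_output, data):
    coords = parser_output.extension_data["example-map-coords"]
    output_page.head_items.append("map:%s,%s" % tuple(coords))


OUTPUT_HOOKS["exampleMap"] = add_map_script


def parse_page():
    """Parse time: record data and a hook name instead of touching OutputPage."""
    out = ParserOutput("<div class='map'></div>")
    out.extension_data["example-map-coords"] = [51.5, -0.1]
    out.output_hooks.append(["exampleMap", None])
    return out


blob = parse_page().to_cache()                        # stored on page save
page = OutputPage()                                   # a later request
page.add_parser_output(ParserOutput.from_cache(blob))
print(page.head_items)
```

The later request never runs parse_page, yet the page modification is still applied, because both the data and the name of the hook travelled inside the serialized object.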

State
The Parser object is both a long-lived configuration object and a parse state object.

The configuration aspect of the Parser object is initialised when clearState calls Parser::firstCallInit. This sets up extensions and core built-ins, and builds regexes and hashtables. It is quite slow (~10ms) so multiple calls should be avoided if possible.

The parse state aspect of the Parser object is initialised by the entry point, which sets several variables, and calls clearState, which clears local caches and accumulators.

It is difficult to run more than one parse operation at a time. Attempting to re-enter Parser::parse from a parser hook will lead to destruction of the previous parse state and corruption of the output. In theory one can set the $clearState parameter to parse to false to prevent the clearState call and allow re-entry, but in practice this is almost never done and probably doesn't work.

In practice, there are two options for recursive re-entry:


 * Cloning the Parser object: This is often done and will probably work. As long as all extensions cooperate, it provides an independent state which allows a second parse operation to be started immediately via an entry point such as Parser::parse. However, note that PHP's clone operator is a shallow copy. This means that if any parse state is stored in object references, that parse state will be shared with the clone, and modifications to the clone will affect the original object. The core tries to work around this by breaking object references in __clone. Extensions that store state in object references attached to the Parser object should hook ParserCloned and manually break such references.
 * Using the recursive entry points: These allow text to be parsed in the same state as the currently executing parse operation, without clearing the current state. Notably:
   * recursiveTagParse: This returns "half-parsed" HTML, with strip markers still included, suitable for return from a tag or function hook.
   * recursiveTagParseFully: This returns fully parsed HTML, suitable for direct output to the user, for example via ParserOutput::setExtensionData.
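
The shallow-copy pitfall of cloning can be reproduced in Python, whose copy.copy behaves like PHP's clone operator in this respect. The sketch below is an analogy only: the class names are hypothetical, and the __copy__ method stands in for the role that __clone plays in PHP.

```python
import copy


class LinkHolders:
    """Toy piece of parse state held by object reference."""

    def __init__(self):
        self.links = []


class ToyParser:
    def __init__(self):
        self.holders = LinkHolders()

    def add_link(self, target):
        self.holders.links.append(target)


class IsolatedParser(ToyParser):
    def __copy__(self):
        # Break the shared reference so that the copy has independent state.
        dup = type(self).__new__(type(self))
        dup.__dict__.update(self.__dict__)
        dup.holders = copy.deepcopy(self.holders)
        return dup


# Shallow copy: the copy and the original share one LinkHolders object.
original = ToyParser()
original.add_link("A")
shallow = copy.copy(original)
shallow.add_link("B")
print(original.holders.links)   # the original sees the copy's change

# Reference broken on copy: the original is unaffected.
safe_original = IsolatedParser()
safe_original.add_link("A")
safe_copy = copy.copy(safe_original)
safe_copy.add_link("B")
print(safe_original.holders.links, safe_copy.holders.links)
```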

TO DO: write more here.

Markup transformation passes

 * doTableStuff
 * doDoubleUnderscore
 * doHeadings
 * replaceInternalLinks
 * doAllQuotes
 * replaceExternalLinks
 * doMagicLinks
 * formatHeadings

internalParseHalfParsed

 * Guillemet
 * doBlockLevels
 * replaceLinkHolders
 * Language conversion
 * Tidy
 * The non-tidy cases