Markup spec

There are many documents aiming at creating a formal representation of the MediaWiki markup and the parser behaviour. So far, none of them is actually complete, but there are a number of drafts in different syntaxes such as BNF, EBNF or ANTLR.

In this document all of these efforts are collected, discussed and coordinated.

Goals
Produce a specification of MediaWiki's markup format that is sufficiently complete and consistent. Future parser implementations can be built from it. Also, features that are currently either Not Possible or Very Hard (e.g. WYSIWYG editing) could benefit from such a specification.

The specification might include both a grammar description and a description of the parser behaviour.

Concerning the grammar description, the specification might use a standard notation such as BNF or EBNF.

As for the parser behaviour, the specification should avoid deviating from present behaviour where it is reasonable and well-defined. It must therefore avoid adding new behaviour without considering whether it may break existing pages. Where the current parser's behaviour is undefined or obviously buggy, the specification may define new, different behaviour. The parser might be described using tools such as ANTLR.

A data model for a parse tree can also be defined. The data model should be representable in XML. An official XML schema for such a representation may or may not be defined. Round-trip conversion between source code and the data model should also be possible. There might be a many-to-one relationship between source code and parse trees, but the canonical transformation from parse tree to source code should always parse back to the same parse tree.
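
The round-trip requirement can be stated as a testable property. The following is a minimal sketch, not a proposal for the actual data model: the node kinds ("text", "bold"), the function names and the choice of ''' as the canonical bold form are all assumptions made for illustration.

```python
import re

# Hypothetical toy model: a parse tree is a list of (kind, text) nodes.
# Two source spellings of bold are accepted, so source -> tree is many-to-one.
BOLD = re.compile(r"'''(.*?)'''|<b>(.*?)</b>")

def parse(source):
    """Turn source text into a flat list of ('text', s) and ('bold', s) nodes."""
    nodes, pos = [], 0
    for m in BOLD.finditer(source):
        if m.start() > pos:
            nodes.append(("text", source[pos:m.start()]))
        inner = m.group(1) if m.group(1) is not None else m.group(2)
        nodes.append(("bold", inner))
        pos = m.end()
    if pos < len(source):
        nodes.append(("text", source[pos:]))
    return nodes

def serialize(tree):
    """Canonical tree -> source transformation (always uses ''' for bold)."""
    return "".join("'''%s'''" % s if kind == "bold" else s for kind, s in tree)

def round_trips(source):
    """The canonical source of a tree must parse back to the same tree."""
    tree = parse(source)
    return parse(serialize(tree)) == tree
```

Here both "a <b>b</b> c" and "a '''b''' c" map to the same tree, and the canonical form of either parses back to that tree.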

Feasibility study
It has been broadly asserted that the wiki markup is a context-sensitive language, and therefore that it cannot be expressed with a context-free grammar (such as those defined with BNF or EBNF). To shed some light on the topic, it is useful to first define some concepts:


 * Formal grammar: a set of rules to form strings in a language. According to the Chomsky hierarchy a grammar can be:
 * type 3 (or regular): when all the rules are of the form A → a or A → aB (A and B being single nonterminal symbols and a a terminal symbol);
 * type 2 (or context-free): when all the rules are of the form A → γ (γ being any combination of terminal and nonterminal symbols);
 * type 1 (or context-sensitive): when all the rules are of the form αAβ → αγβ (α and β being any combination of terminal and nonterminal symbols); or
 * type 0 (or unrestricted): when there are no restrictions in the production rules (α → β).
 * Syntactic analysis: the process of analyzing a text to determine its grammatical structure with respect to a given formal grammar.
 * Parser: the program performing this syntactic analysis. Should errors be found in the input text, the parser must do its best to identify them and, if possible, recover and continue with the analysis.
 * Context-free language: a language which can be generated with a context-free grammar.
 * Context-sensitive language: a language which can be generated with a context-sensitive grammar.
 * Ambiguous grammar: a grammar in which at least one string can be generated in more than one way.
 * Inherently ambiguous language: a language that can only be generated by ambiguous grammars.
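
The rule forms in the list above can be checked mechanically. The sketch below classifies a single production by the most specific type whose form it matches. The representation is an assumption made for illustration: symbols are strings, uppercase strings are nonterminals, and both sides of a rule are tuples of symbols.

```python
def is_nonterminal(symbol):
    # Assumption of this sketch: uppercase symbols are nonterminals.
    return symbol.isupper()

def is_context_sensitive(lhs, rhs):
    """True if the rule has the form alpha A beta -> alpha gamma beta."""
    for i, symbol in enumerate(lhs):
        if not is_nonterminal(symbol):
            continue
        alpha, beta = lhs[:i], lhs[i + 1:]
        gamma_len = len(rhs) - len(alpha) - len(beta)
        if (gamma_len >= 1
                and rhs[:len(alpha)] == alpha
                and rhs[len(rhs) - len(beta):] == beta):
            return True
    return False

def rule_type(lhs, rhs):
    """Return 3, 2, 1 or 0 for the most specific form the rule matches."""
    if len(lhs) == 1 and is_nonterminal(lhs[0]):
        terminal_only = len(rhs) == 1 and not is_nonterminal(rhs[0])
        terminal_then_nonterminal = (
            len(rhs) == 2
            and not is_nonterminal(rhs[0])
            and is_nonterminal(rhs[1])
        )
        if terminal_only or terminal_then_nonterminal:
            return 3  # A -> a  or  A -> aB
        return 2      # A -> gamma
    if is_context_sensitive(lhs, rhs):
        return 1
    return 0
```

For example, A → aB is reported as type 3, A → aAb as type 2, aAb → acdb as type 1 and AB → BA as type 0.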

Language generation and grammar type
As can be seen from the rule forms of each grammar type, regular grammars are very relaxed about context, whereas context-sensitive and unrestricted grammars can describe very restraining rules. Relaxed in this context means that symbols can be generated in regular grammars without taking into account what has been produced so far. By contrast, the rules of context-sensitive and unrestricted grammars may require certain already-generated strings to be present before they can be applied.

Contrary to what might seem intuitive, context-sensitive grammars can have more restrictive rules. It is no accident that Chomsky conceived context-sensitive grammars as a way to describe natural language. Natural languages are clearly more restrictive than classical context-free computer languages (such as C or Java), because a word may or may not be appropriate in a certain place depending on the context.

Wikicode is, like many other computer languages, a context-free language. Wikicode is composed of many tokens for formatting, title description, text linking or list representation. Some of the tokens need to be placed in a certain place (such as those which need to go at the beginning of a line), but otherwise tokens may appear anywhere in the document, regardless of the context. Given that, it could be argued that Wikicode is in fact a regular language. However, Wikicode does have nested structures, which cannot be expressed with regular grammar rules.

In short, Wikicode does not need context-sensitive grammar rules, for any token can be placed anywhere (with few restrictions such as #REDIRECT, include modes and similar structures). It cannot, however, be expressed with regular grammar rules, as nested structures (such as bold markup inside header markup) cannot be described in type 3 grammars. Therefore, a context-free grammar suffices for describing Wikicode.
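
To illustrate why nesting calls for context-free rules, here is a minimal recursive-descent sketch for a toy rule of the form seq ::= (CHAR | "{{" seq "}}")*. The function name and the choice of double curly braces as the nested construct are assumptions made for this example; it is not a description of how MediaWiki parses templates.

```python
def parse_nested(src, open_tok="{{", close_tok="}}"):
    """Return nested lists mirroring the nesting of open_tok/close_tok in src."""

    def parse_seq(i, depth):
        items, buf = [], []

        def flush():
            if buf:
                items.append("".join(buf))
                buf.clear()

        while i < len(src):
            if src.startswith(open_tok, i):
                flush()
                # Recursion is what a regular grammar cannot express.
                child, i = parse_seq(i + len(open_tok), depth + 1)
                items.append(child)
            elif depth > 0 and src.startswith(close_tok, i):
                flush()
                return items, i + len(close_tok)
            else:
                buf.append(src[i])
                i += 1
        flush()  # end of input: any open structure is closed implicitly
        return items, i

    tree, _ = parse_seq(0, 0)
    return tree
```

The depth of the resulting lists is unbounded, which is exactly what the A → a and A → aB rule forms cannot track.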

Ambiguities in the language
The fact that Wikicode uses the same characters for different tokens leads to strings that can be interpreted in many different ways. This does not mean, however, that the language is not context-free, but instead that there are many different combinations of rules in the grammar that lead to the same final string. Consider the following string of Wikicode:

The '''''dog''''''s bone

Here the word dog is enclosed in italic marks (two apostrophes) and bold marks (three apostrophes), with an additional apostrophe to indicate a Saxon genitive. This does not necessarily mean that the language is inherently ambiguous either, as there might be a grammar which can generate that structure without ambiguity. In terms of language recognition, the ambiguity can easily be avoided by defining a precedence among the rules (much like the operator precedence rules for mathematical expressions in computer languages).
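
As an illustration of resolving the ambiguity by precedence, the sketch below applies one fixed rule to every run of apostrophes, using a string with five apostrophes before dog and six after it. The rule itself (surplus apostrophes become literal text placed before the toggle) is an assumption chosen for this example and is not claimed to match the current parser in every case.

```python
import re

def classify_run(n):
    """Map a run of n apostrophes to (literal_apostrophes, markup_or_None)."""
    if n == 1:
        return 1, None
    if n == 2:
        return 0, "italic"
    if n == 3:
        return 0, "bold"
    if n == 4:
        return 1, "bold"
    return n - 5, "bold+italic"  # n >= 5: surplus apostrophes are literal

def tokenize(src):
    """Split src into ('text', s) and ('toggle', markup) tokens."""
    tokens = []
    for part in re.split(r"('+)", src):
        if not part:
            continue
        if part[0] == "'":
            literal, markup = classify_run(len(part))
            if literal:
                tokens.append(("text", "'" * literal))
            if markup:
                tokens.append(("toggle", markup))
        else:
            tokens.append(("text", part))
    return tokens
```

Because every run length has exactly one reading, the same input always yields the same token sequence, whatever the number of derivations the grammar would allow.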

Language recognition
Parsers normally use grammars to analyze or recognize strings of a certain language. When a mismatch (or error) is found, the parser needs to figure out what to do with the unexpected input, and find a way to recover from the error in order to analyze the rest of the string.

Whereas the grammar is quite easy to describe, the parser behaviour is somewhat complex. This is due to the fact that every input string should produce the most likely result, even if it contains syntax errors. MediaWiki's parser performs complex error recovery, for instance when dealing with wrongly nested structures such as:

The quick brown fox

where the opening bold mark is inside an italics structure, and the closing bold mark is outside. For a valid output to be produced, the parser closes the italics structure and opens it again, producing an output where the two structures are properly nested. Also, should a tag mismatch occur, such as in the following example:

The quick ''brown fox

the parser will add the missing closing tags. The huge number of recovery rules makes the language recognizer hard to describe, for every single rule should be reflected in the specification. Since there is no such thing as right or wrong Wikicode, an extensive list of recovery rules is as important as the grammar itself when aiming to create a complete description of how the wiki language is interpreted.
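
The two recovery cases above can be sketched with a stack of open tags. This is one common strategy, shown under the assumption that the input has already been tokenized into ("open", tag), ("close", tag) and ("text", s) tuples; the function name is hypothetical and the exact output of the MediaWiki parser may differ.

```python
def renest(tokens):
    """Emit properly nested HTML-like output from possibly mis-nested tokens."""
    out, stack = [], []
    for kind, value in tokens:
        if kind == "text":
            out.append(value)
        elif kind == "open":
            stack.append(value)
            out.append("<%s>" % value)
        elif kind == "close" and value in stack:
            reopened = []
            # Close everything opened after `value`, remembering it.
            while stack[-1] != value:
                tag = stack.pop()
                out.append("</%s>" % tag)
                reopened.append(tag)
            stack.pop()
            out.append("</%s>" % value)
            # Reopen the interrupted structures in their original order.
            for tag in reversed(reopened):
                stack.append(tag)
                out.append("<%s>" % tag)
        # A close with no matching open is silently dropped.
    while stack:  # add the missing closing tags at the end of input
        out.append("</%s>" % stack.pop())
    return "".join(out)
```

Each additional recovery rule in the real parser would correspond to another branch of this kind, which is why the full list of rules matters as much as the grammar.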

Current efforts
So far progress has been made in both grammar definition and parser behaviour. Although none of the descriptions seems to be complete, some have managed to describe a good part of the language.

Current parser descriptions try their best to follow the behaviour of MediaWiki's parser. It is, however, very hard to properly describe all the error recovery and rule precedence performed by MediaWiki's parser, for there are a number of different cases of grammar mismatch (e.g. unclosed tags) or operator precedence, considering the ambiguity that symbols shared by several tokens bring.


 * Grammar descriptions:
 * DTD
 * EBNF
 * Parser description:
 * ABNF plus natural language
 * BNF plus natural language
 * ANTLR
 * OCaml
 * flex
 * bison
 * Parser testing:
 * compare to current

Resources

 * Alternative parsers
 * m:Help:Editing
 * m:Help:Magic words
 * User:HappyDog/WikiText parsing - some observations based on 1.3.10 by HappyDog
 * wikitext-l - Wikitext-l mailing list
 * EBNF help
 * More EBNF help
 * Python parser for MediaWiki articles
 * JamWiki claims to have MediaWiki-compatible syntax - see the code for an attempt to write a parser.
 * MediaCloth is a Ruby library for parsing MediaWiki-compatible syntax to XHTML.
 * WikiCloth is another Ruby library for MediaWiki markup.
 * Raid Magnus's wiki2xml work for some starting points; examine how his parser works (and how it differs from the main one) and the intermediate XML format he uses
 * Riehle et al.'s work on an EBNF grammar for Wiki Creole (a subset of MediaWiki syntax), XML interchange and XSLT transformations: http://www.riehle.org/category/wiki-tech/

The Markup Language
The MediaWiki markup language (commonly referred to within the MediaWiki community as wikitext, though this usage is ambiguous within the larger wiki community) uses non-textual ASCII characters, sometimes in pairs, to indicate to the parser how the editor wishes an item or section of text to be displayed. The parser translates these tokens into (X)HTML as closely as semantically possible.

v1.6 markup tokens
The markup tokens fall into two broad categories: unary tokens (like : or * used at the beginning of a line), which stand alone, and binary tokens (like those for italic or boldface) which must be used in matched pairs. Unary tokens may only be preceded by comments or whitespace; otherwise, they will not be interpreted.

Start of line only

 * blank line: paragraph break (HTML &lt;p&gt;)
 * Horizontal line: ---- (4 or more hyphens), specified in /BNF/Article
 * Pre-formatted text: (space)
 * Lists
 * Bulleted: *
 * Numbered: #
 * Indent with no marking: :
 * Definition list: ;
 * Notes:
 * These may be combined at the start of the line to create nested lists, e.g. *** to give a bulleted list three levels deep, or **# to have a numbered list within two-levels of bulleted list nesting.
 * Redirects: #redirect or #REDIRECT (followed by wikilink)
 * The whole quagmire that is table formatting:  {| ... |}  with in between  |- |+ || | ! ! .
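
The combination of list markers described in the notes above can be read off the start of a line one character at a time. The sketch below is a simplification with hypothetical names; it deliberately ignores other start-of-line tokens such as redirects, horizontal lines and tables.

```python
# Hypothetical labels for the four list markers listed above.
LIST_KINDS = {"*": "bulleted", "#": "numbered", ":": "indent", ";": "definition"}

def split_list_prefix(line):
    """Return (nesting, content): one list kind per leading marker."""
    i = 0
    while i < len(line) and line[i] in LIST_KINDS:
        i += 1
    nesting = [LIST_KINDS[c] for c in line[:i]]
    return nesting, line[i:].strip()
```

For example, a line starting with **# yields a numbered item nested inside two levels of bulleted list.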

Can be used anywhere

 * "Magic words" (see Help:Magic words)
 * Signatures:
 * ~~~ Replaced with your username
 * ~~~~ Replaced with your username and the date
 * ~~~~~ Replaced with the date.
 * Notes:
 * These tags are replaced at the point the edit is saved.
 * Magic links: ISBN ..., RFC ..., PMID ... (see /BNF/Magic links/)
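
Signature replacement at save time can be sketched as a single substitution over runs of three to five tildes. This is a simplified illustration: the function name and its parameters are assumptions, and the real parser inserts user links and a formatted timestamp rather than plain strings.

```python
import re

def expand_signatures(text, user, timestamp):
    """Replace ~~~, ~~~~ and ~~~~~ with user, user + date, and date."""

    def replacement(match):
        n = len(match.group(0))
        if n == 3:
            return user
        if n == 4:
            return "%s %s" % (user, timestamp)
        return timestamp  # n == 5

    return re.sub(r"~{3,5}", replacement, text)
```

Since the substitution happens when the edit is saved, the stored source contains the expanded text and the tildes never reach the rendering parser.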

Binary
The ellipses (...) are used to indicate where the content goes and are not part of the markup.

Beginning of a line

 * Equals signs are used for headings (must be at start of line)
 * 1st level heading:  = ... = 
 * 2nd level heading:  == ... == 
 * 3rd level heading:  === ... === 
 * 4th level heading:  ==== ... ==== 
 * 5th level heading:  ===== ... ===== 
 * 6th level heading:  ====== ... ====== 
 * Specified in /BNF/Article
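
The heading forms listed above can be recognized with a single regular expression that requires the same run of equals signs on both sides. This is a sketch with a hypothetical function name; it does not attempt to reproduce how the current parser treats unbalanced runs or trailing comments.

```python
import re

# One to six equals signs, the heading text, then the same run again.
HEADING = re.compile(r"^(={1,6})\s*(.+?)\s*\1\s*$")

def parse_heading(line):
    """Return (level, title) for a heading line, or None otherwise."""
    match = HEADING.match(line)
    if match is None:
        return None
    return len(match.group(1)), match.group(2)
```

Anchoring the pattern with ^ reflects the rule that headings must begin at the start of a line.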

Anywhere

 * Square brackets are used for links:
 * Internal/interwiki link + language links + category links + images: [[ ... ]] (see also Namespaces below)
 * vertical bars separate optional parameters, which are:
 * link: first parameter: display text (also defaulted using "pipe trick") (also trailing concatenated text included in display, e.g. s for plural)
 * image: many parameters; see w:Wikipedia:Extended image syntax; may contain nested links (and images!) in caption text.
 * category: first parameter: sort order in category list
 * link contents have to be parsed for whether they're dates if $wgUseDynamicDates is on
 * External link:  [ ... ] 
 * space separates optional first parameter, which is display text
 * undecorated URLs are also recognized and hotlinked
 * Specified in /BNF/Links
 * Apostrophes are used for formatting:
 * Italic: '' ... ''
 * Bold: ''' ... '''
 * Bold + Italic: ''''' ... '''''
 * Note that improper nesting of bold and italics is currently permitted.
 * Curly braces are used for transclusion:
 * Include template: {{ ... }} (see also Namespaces below)
 * Unlimited number of optional pipe-delimited parameters, each of which may optionally start with a parameter name preceding an equals sign
 * Include template parameter: {{{ ... }}}
 * Optionally including a pipe followed by the parameter default: {{{1|Bob}}} will use the first passed in parameter, and if none is received, will insert "Bob" instead.
 * Use a built-in variable:   (see m:Help:Variable)
 * Call a parser function:  
 * Various HTML style tags:
 * &lt;nowiki&gt; do not interpret wiki markup, do allow newline in list and indent elements (but still flow text, still allow SGML entities)
 * &lt;pre&gt; do not interpret wiki markup, do not flow text (but still allow SGML entities)
 * &lt;math&gt; if $wgUseTeX is set
 * &lt;html&gt; if $wgRawHtml is set
 * &lt;gallery&gt;
 * &lt;onlyinclude&gt; &lt;noinclude&gt; &lt;includeonly&gt;
 * Parser extension tags, like &lt;ref&gt; (using Cite.php)
 * Plus most 'non-dangerous' HTML tags: 'b', 'del', 'i', 'ins', 'u', 'font', 'big', 'small', 'sub', 'sup', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'cite', 'code', 'em', 's', 'strike', 'strong', 'tt', 'var', 'div', 'center', 'blockquote', 'ol', 'ul', 'dl', 'table', 'caption', 'pre', 'ruby', 'rt', 'rb', 'rp', 'p', 'span', 'br', 'hr', 'li', 'dt', 'dd', 'td', 'th', 'tr'
 * &lt;!-- ... --&gt; HTML-style comments
 * SGML entities: &amp;...;
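
The internal-link syntax from the list above (double square brackets with vertical bars separating optional parameters) can be split as follows. This is a deliberately naive sketch with a hypothetical function name: it assumes a complete link with no nested links, so it would not handle image captions that themselves contain links.

```python
def parse_internal_link(link):
    """Split '[[target|p1|p2]]' into (target, [p1, p2]); None if not a link."""
    if not (link.startswith("[[") and link.endswith("]]")):
        return None
    target, *params = link[2:-2].split("|")
    return target.strip(), params
```

How the parameters are interpreted then depends on the kind of target: display text for a plain link, a sort key for a category, and so on.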

Namespaces
In wikilinks and template inclusions, colons set off namespaces and other modifiers:
 * proper namespaces: <tt>Talk:</tt>, <tt>User:</tt>, project, etc.
 * "special" namespaces: <tt>File:</tt> (was <tt>Image:</tt>), <tt>Category:</tt>, <tt>Template:</tt>
 * pseudo-namespaces: <tt>Special:</tt>, <tt>Media:</tt>
 * lone/leading <tt>:</tt>
 * lone <tt>:</tt> forces main namespace
 * leading <tt>:</tt> allows link to image page rather than inline image, or similarly to category or template page
 * interwiki links:
 * same project, different language: code of two or more letters
 * different project, same language: <tt>w:</tt> for Wikipedia, <tt>wikt:</tt> for Wiktionary, <tt>m:</tt> for Meta, etc. -- see m:Help:Interwiki_linking for more information (especially when using in templates; transwiki transclusion, iw_trans)
 * <tt>subst:</tt> force one-time template substitution upon edit, rather than dynamic expansion on each view
 * <tt>int:</tt>, <tt>msg:</tt>, <tt>msgnw:</tt>, <tt>raw:</tt> -- see m:Help:Magic words
 * <tt>MediaWiki:</tt> magically access mediawiki formatting and boilerplate text (e.g. MediaWiki:copyrightwarning)
 * Standard parser functions: <tt>UC:</tt>, <tt>LC:</tt>, etc. (see m:Help:Parser function)
 * Additional parser functions: <tt>#expr:</tt>, <tt>#if:</tt>, <tt>#switch:</tt>, etc.
 * other extensions?

Several combinations of the above are possible, e.g. m:Help:Variable -- help namespace within Meta project.
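
Combined prefixes such as the one above can be peeled off a link target from left to right. In this sketch the set of known prefixes is a small hypothetical subset; a real implementation would consult the wiki's configured namespaces and interwiki table, since a colon can also be part of an ordinary title.

```python
# Hypothetical subset of recognized prefixes, compared case-insensitively.
KNOWN_PREFIXES = {
    "m", "w", "wikt", "help", "talk", "user", "file", "category", "template",
}

def split_target(target):
    """Return (leading_colon, prefixes, title) for a link target."""
    leading_colon = target.startswith(":")
    rest = target[1:] if leading_colon else target
    prefixes = []
    while ":" in rest:
        head, tail = rest.split(":", 1)
        if head.strip().lower() not in KNOWN_PREFIXES:
            break  # the colon belongs to the title itself
        prefixes.append(head.strip())
        rest = tail
    return leading_colon, prefixes, rest
```

Stopping at the first unknown prefix keeps titles that merely contain a colon intact.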

MetaWiki markup description
The following text was at Wikitext Metasyntax and needs to be merged with the description above.

Basic Markup
Define markups

Parser outline
Another way to check whether we've covered everything in the grammar is to look at the steps the parser actually goes through:


 * The preprocessor does:


 * 1) Strip (hooks before/after)
 * 2) Remove HTML-like comments (&lt;!-- ... --&gt;)
 * 3) Replace variables
 * 4) Subst
 * 5) MSG, MSGNW, RAW
 * 6) Parser functions
 * 7) Templates


 * The parser does:


 * 1) Strip (hooks before, after)
 * 2) Treat nowiki, pre, math and possibly others with "userfunc tag hooks" (e.g. hiero)
 * 3) Removes HTML-like comments
 * 4) *HTML comments are removed. (this text by HappyDog)
 * 5) *Any tags that are not allowed by the software are replaced by HTML entities, so they display as literals and are not treated as HTML by the browser.
 * 6) *Any badly formed tags (e.g. nested tags that shouldn't be nested, tags outside their required enclosing tag, etc.) are also replaced by HTML entities so they are not treated as HTML.
 * 7) *Any attributes that are not allowed by the software (e.g. onMouseOver) are removed from otherwise valid tags.
 * 8) *A small amount of minor source formatting is applied (basically, the removal of unnecessary whitespace).
 * 9) *A closing tag is added at the end for all tags that are not closed properly. Note that some tags don't need to be closed.
 * 10) Internal parse
 * 11) Noinclude/onlyinclude/includeonly sections
 * 12) Remove HTML tags
 * 13) Replace variables
 * 14) Hooks: Internalparsebeforelinks
 * 15) Tables
 * 16) Magic words
 * 17) Strip TOC
 * 18) Strip no gallery ( __NOGALLERY__ )
 * 19) do headings
 * 20) Do dynamic dates
 * 21) Do quotes ('' and ''')
 * 22) Replace internal links
 * 23) Process images (do the caption recursively as it might contain links, or even other images...)
 * 24) Process categories
 * 25) Replace external links
 * 26) Re-replace masked internal links
 * 27) Do magic links (ISBN, RFC...)
 * 28) Format headings ( __NEWSECTIONLINK__, ...)
 * 29) Unstrip general
 * 30) Fix tags (french spaces, guillemet)
 * 31) Blocks (lists etc)
 * 32) Replace link holders
 * 33) Language converter:
 * 34) Normal text converted on a word by word basis(?) if autoconvert is enabled
 * 35) Text in -{code1:text1;code2:text2;...}- blocks converted manually
 * 36) Text in -{...}- not converted at all.
 * 37) Unstrip no wiki
 * 38) Extra tags and params
 * 39) User funcs?
 * 40) Un strip general
 * 41) Normalise char references
 * 42) Tidy + hook


 * The save parser does:
 * 1) Convert newlines
 * 2) Strips
 * 3) Pass 2
 * 4) Substs
 * 5) Strip again? gallery something.
 * 6) Signatures
 * 7) Pipe tricks
 * 8) Trim trailing whitespace
 * 9) Unstrips