Preprocessor ABNF

MediaWiki preprocessor syntax in ABNF (RFC 5234).

Ideal rules

 * START = start of string
 * END = end of string
 * LINE-START = start of line
 * LINE-END = end of line
 * The string starts with LINE-START. An LF input produces the tokens
 * LINE-END LF LINE-START, and the string ends with LINE-END.
 * The starting symbol of the grammar is wikitext-L1.
 * The starting symbol of the grammar is wikitext-L1.
 * The starting symbol of the grammar is wikitext-L1.

xml-char = %x9 / %xA / %xD / %x20-D7FF / %xE000-FFFD / %x10000-10FFFF sptab = SP / HTAB

attr-char = %x9 / %xA / %xD / %x20-3D / %x3F-D7FF / %xE000-FFFD / %x10000-10FFFF
 * everything except "&gt;"

literal        = *xml-char part           = ( part-name "=" part-value ) / ( part-value ) part-name      = wikitext-L3 part-value     = wikitext-L3 parts          = [ part 1*( "|" part ) ] tplarg         = "" template       = "" link           = "" wikitext-L3 ""

comment        = "" unclosed-comment = "

The nowiki-element wins.

In ambiguity between heading, template, tplarg and link, the structure with the rightmost opening takes precedence. For example:

[[

The template wins because it was opened after the link.

tplarg takes precedence over template</tt> where braces alone are involved. But it is neither higher nor lower in precedence than link</tt>. Sequences of matching braces are thus interpreted as follows:


 * 4: &rarr; {&middot;&middot;}
 * 5: &rarr;
 * 6: &rarr;
 * 7: &rarr; {&middot;&middot;}

Practicalities
The main implementation challenge is avoiding infinite backtracking when disambiguating between competing bracketed constructs: template</tt>, tplarg</tt>, link</tt> and heading</tt>. The xmlish elements (including comments) don't suffer this problem because an unclosed xmlish element runs to the end, forcing a literal interpretation of the contents.

For example:

The square brackets are unclosed, and so the pipe characters should be interpreted as separating the parts</tt> of a template</tt>. But we don't know if the link is valid until the cursor reaches the of the long string. This has traditionally been dealt with by adding a number of "broken" rules with the same precedence as the unbroken rules.

Since forever: broken-tplarg  = "{{{" parts-L2 broken-template = "{{" parts-L2 broken-link    = "[[" wikitext-L2

Since MW 1.12: broken-heading = LINE-START 1*6"=" wikitext-L3 LINE-END

Where parts-L2</tt> is like parts</tt> except that it allows headings inside it:

part-L2        = ( part-name-L2 "=" part-value-L2 ) / ( part-value-L2 ) part-name-L2   = wikitext-L2 part-value-L2  = wikitext-L2 parts-L2       = [ part-L2 1*( "|" part-L2 ) ]

These "broken" rules, when matched, produce output similar to a literal start followed by ordinary wikitext. The difference is that they compete on the same precedence level as the unbroken rules. So the previous example is parsed as a broken-template</tt> containing a broken-link</tt> containing a long string and a literal "}}". Based on the ideal rules, we would expect the literal</tt> interpretation of "}}" to have a lower precedence than its interpretation as the end of a template</tt>. But with the "broken" rules, the broken-link</tt> takes precedence over the template</tt>, being the rightmost-opened structure.

Broken rules always run to the end of the input string, because the only other way to terminate a broken rule is to turn it into an unbroken rule by closing it.

Because a <tt>heading</tt> or a <tt>broken-heading</tt> can appear in a <tt>part-L2</tt>, there is now ambiguity between the equals sign between the name/value separator, and the equals sign for the heading. We resolve it in the following way:


 * For level 1 headings (i.e. one equals sign on each side), the <tt>part</tt> takes precedence.
 * For level 2-6 headings, the heading takes precedence.

If the <tt>part-L2</tt> later becomes a <tt>part</tt> because the <tt>template</tt> or <tt>tplarg</tt> is closed, we could now have an errant <tt>heading</tt> in <tt>wikitext-L3</tt>, where it's not allowed. The <tt>heading</tt> can easily be disabled, but the name/value separator can't easily be recovered. To represent the syntactic effect of this, we introduce another rule:

disabled-heading  = heading wikitext-L3       =/ disabled-heading

The disambiguation of <tt>disabled-heading</tt> with <tt>part</tt> works in the same way as the disambiguation of <tt>heading</tt> with <tt>part-L2</tt>, described above.

Possible improvements
If an efficient algorithm could be found for disambiguating the ideal rules, without introducing "broken" rules, that would be great. It would be a b/c break, but probably beneficial. Backwards compatibility was broken anyway by introducing <tt>broken-heading</tt> (the "newsome" bug on MNPP).

Line-eating comments could very easily be made to match at the start of the string. Currently they don't since there is no <tt>LF</tt> at the start of the string, just a <tt>LINE-START</tt>.