Preprocessor ABNF

MediaWiki preprocessor syntax in ABNF (RFC 5234).

Ideal rules

 * START = start of string
 * END = end of string
 * LINE-START = start of line
 * LINE-END = end of line
 * The string starts with LINE-START. An LF input produces the tokens LINE-END LF LINE-START, and the string ends with LINE-END.
 * The starting symbol of the grammar is wikitext-L1.
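The line-boundary convention above can be sketched in a few lines of Python. This is only an illustration, not the preprocessor's actual tokenizer: the function name `tokenize_lines` and the string markers are hypothetical, and the START and END symbols are left out for brevity.

```python
from typing import List

LINE_START = "LINE-START"
LINE_END = "LINE-END"


def tokenize_lines(text: str) -> List[str]:
    """Wrap each line of `text` in LINE-START / LINE-END marker tokens."""
    tokens = [LINE_START]  # the string starts with LINE-START
    for ch in text:
        if ch == "\n":
            # An LF input produces the tokens LINE-END LF LINE-START.
            tokens += [LINE_END, "LF", LINE_START]
        else:
            tokens.append(ch)
    tokens.append(LINE_END)  # the string ends with LINE-END
    return tokens
```

Note that an empty input still yields one LINE-START and one LINE-END, so rules anchored on line boundaries can match on an empty string.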

xml-char = %x9 / %xA / %xD / %x20-D7FF / %xE000-FFFD / %x10000-10FFFF
sptab = SP / HTAB

attr-char = %x9 / %xA / %xD / %x20-3D / %x3F-D7FF / %xE000-FFFD / %x10000-10FFFF
 * everything except ">"
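As a quick check of the two character classes, the ranges can be transcribed directly into Python. The helper names (`is_xml_char`, `is_attr_char`) are made up for this sketch. The only difference between the two tables is the gap at %x3E.

```python
from typing import List, Tuple

Ranges = List[Tuple[int, int]]

# Code point ranges copied from the xml-char rule.
XML_CHAR_RANGES: Ranges = [
    (0x9, 0x9), (0xA, 0xA), (0xD, 0xD),
    (0x20, 0xD7FF), (0xE000, 0xFFFD), (0x10000, 0x10FFFF),
]

# Code point ranges copied from the attr-char rule: %x20-D7FF is split
# into %x20-3D and %x3F-D7FF, which leaves out %x3E.
ATTR_CHAR_RANGES: Ranges = [
    (0x9, 0x9), (0xA, 0xA), (0xD, 0xD),
    (0x20, 0x3D), (0x3F, 0xD7FF), (0xE000, 0xFFFD), (0x10000, 0x10FFFF),
]


def _in_ranges(code_point: int, ranges: Ranges) -> bool:
    return any(low <= code_point <= high for low, high in ranges)


def is_xml_char(ch: str) -> bool:
    """True if the single character `ch` matches xml-char."""
    return _in_ranges(ord(ch), XML_CHAR_RANGES)


def is_attr_char(ch: str) -> bool:
    """True if the single character `ch` matches attr-char."""
    return _in_ranges(ord(ch), ATTR_CHAR_RANGES)
```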

literal        = *xml-char
part           = ( part-name "=" part-value ) / ( part-value )
part-name      = wikitext-L3
part-value     = wikitext-L3
parts          = [ part 1*( "|" part ) ]
tplarg         = "{{{" parts "}}}"
template       = "{{" parts "}}"
link           = "[[" wikitext-L3 "]]"

comment        = "" unclosed-comment = "

The nowiki-element wins.

In ambiguity between template, tplarg and link, the structure with the rightmost opening takes precedence. For example:

[[

The template wins because it was opened after the link.

tplarg takes precedence over template where braces alone are involved, but it is neither higher nor lower in precedence than <tt>link</tt>. Sequences of matching braces are thus interpreted as follows:


 * 4: → {··}
 * 5: →
 * 6: →
 * 7: → {··}
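One way to read the list above is as a greedy split of the brace run, working from the inside out. The sketch below assumes exactly that: each group of three braces becomes a tplarg, a remainder of two becomes a template, and a remainder of one is left as a literal brace on each side (as in the rows for 4 and 7). The function name `split_brace_run` is hypothetical, and the greedy strategy is an assumption rather than something the grammar spells out.

```python
from typing import List, Tuple


def split_brace_run(count: int) -> Tuple[List[str], int]:
    """Split a run of `count` matching braces into structures.

    Returns (structures, leftover): `structures` lists the structures from
    innermost to outermost, and `leftover` is the number of braces per side
    that remain literal.
    """
    structures: List[str] = []
    while count >= 3:
        # tplarg takes precedence over template where braces alone are involved.
        structures.append("tplarg")
        count -= 3
    if count == 2:
        structures.append("template")
        count = 0
    return structures, count
```

Under these assumptions a run of five braces gives a tplarg wrapped in a template, and a run of seven gives two nested tplargs with one literal brace on each side.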

Practicalities
The main implementation challenge is avoiding infinite backtracking when disambiguating between competing bracketed constructs: <tt>template</tt>, <tt>tplarg</tt>, <tt>link</tt> and <tt>heading</tt>. The xmlish elements (including comments) don't suffer from this problem, because an unclosed xmlish element runs to the end of the input, forcing a literal interpretation of the contents.

For example:

The square brackets are unclosed, and so the pipe characters should be interpreted as separating the <tt>parts</tt> of a <tt>template</tt>. But we don't know if the link is valid until the cursor reaches the end of the long string. This has traditionally been dealt with by adding a number of "broken" rules with the same precedence as the unbroken rules.

Since forever:

broken-tplarg   = "{{{" parts-L2
broken-template = "{{" parts-L2
broken-link     = "[[" wikitext-L2

Since MW 1.12:

broken-heading  = LINE-START 1*6"=" wikitext-L3 LINE-END

Where <tt>parts-L2</tt> is like <tt>parts</tt> except that it allows headings inside it:

part-L2        = ( part-name-L2 "=" part-value-L2 ) / ( part-value-L2 )
part-name-L2   = wikitext-L2
part-value-L2  = wikitext-L2
parts-L2       = [ part-L2 1*( "|" part-L2 ) ]

These "broken" rules, when matched, produce output similar to a literal start followed by ordinary wikitext. The difference is that they compete on the same precedence level as the unbroken rules. So the previous example is parsed as a <tt>broken-template</tt> containing a <tt>broken-link</tt> containing a long string and a literal "}}". Based on the ideal rules, we would expect the <tt>literal</tt> interpretation of "}}" to have a lower precedence than its interpretation as the end of a <tt>template</tt>. But with the "broken" rules, the <tt>broken-link</tt> takes precedence over the <tt>template</tt>, being the rightmost-opened structure.

Broken rules always run to the end of the input string, because the only other way to terminate a broken rule is to turn it into an unbroken rule by closing it.
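The effect of the "broken" rules can be illustrated with a small stack-based matcher. This is a deliberately reduced sketch, not the real preprocessor: it only knows "{{" and "[[" (no tplarg, headings, pipes or xmlish elements), and the names `match_brackets`, `CLOSE_FOR` and `NAME_FOR` are hypothetical. The key assumption is that a closing delimiter is only ever compared against the rightmost open structure. Anything still open at the end of the input is reported as broken.

```python
from typing import List, Tuple

CLOSE_FOR = {"{{": "}}", "[[": "]]"}
NAME_FOR = {"{{": "template", "[[": "link"}


def match_brackets(text: str) -> List[Tuple[str, int]]:
    """Return (rule-name, start-offset) pairs, ordered by opening position."""
    stack: List[Tuple[str, int]] = []  # open structures, rightmost on top
    found: List[Tuple[str, int]] = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in CLOSE_FOR:
            stack.append((pair, i))
            i += 2
        elif stack and pair == CLOSE_FOR[stack[-1][0]]:
            # Only the rightmost opening may be closed.
            opener, start = stack.pop()
            found.append((NAME_FOR[opener], start))
            i += 2
        else:
            # Literal text, including closers that do not match the
            # rightmost opening.
            i += 1
    # Whatever is still open has run to the end of the input: it is broken.
    for opener, start in stack:
        found.append(("broken-" + NAME_FOR[opener], start))
    return sorted(found, key=lambda item: item[1])
```

With this sketch, `match_brackets("{{a|[[b|c}}")` reports a broken-template at offset 0 and a broken-link at offset 4, because the "}}" is seen while the link is the rightmost open structure and is therefore literal. Because a single left-to-right pass suffices, no backtracking is needed.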

Because a <tt>heading</tt> or a <tt>broken-heading</tt> can appear in a <tt>part-L2</tt>, there is now ambiguity between the equals sign that acts as the name/value separator and the equals sign that belongs to the heading. We resolve it in the following way:


 * For level 1 headings (i.e. one equals sign on each side), the <tt>part</tt> takes precedence.
 * For level 2-6 headings, the heading takes precedence.
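The two rules above amount to a small decision function. The sketch below is only an illustration: the name `equals_winner` is hypothetical, and computing the level as the smaller of the leading and trailing runs of equals signs (capped at 6) is an assumption of this example, not something stated by the grammar.

```python
import re
from typing import Optional

# A heading-shaped line: equals signs, any content, equals signs,
# optional trailing spaces or tabs.
_HEADING_SHAPE = re.compile(r"(=+)(.*?)(=+)[ \t]*")


def equals_winner(line: str) -> Optional[str]:
    """Say which interpretation wins for a heading-shaped line in a part-L2.

    Returns "part", "heading", or None if the line is not heading-shaped.
    """
    match = _HEADING_SHAPE.fullmatch(line)
    if match is None:
        return None
    # Assumed level calculation: the shorter run of equals signs, at most 6.
    level = min(len(match.group(1)), len(match.group(3)), 6)
    # Level 1: the part (name/value separator) wins. Levels 2-6: the heading wins.
    return "part" if level == 1 else "heading"
```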

If the <tt>part-L2</tt> later becomes a <tt>part</tt> because the <tt>template</tt> or <tt>tplarg</tt> is closed, we could now have an errant <tt>heading</tt> in <tt>wikitext-L3</tt>, where it's not allowed. The <tt>heading</tt> can easily be disabled, but the name/value separator can't easily be recovered. To represent the syntactic effect of this, we introduce another rule:

disabled-heading  = heading
wikitext-L3       =/ disabled-heading

The disambiguation of <tt>disabled-heading</tt> with <tt>part</tt> works in the same way as the disambiguation of <tt>heading</tt> with <tt>part-L2</tt>, described above.

Possible improvements
If an efficient algorithm could be found for disambiguating the ideal rules without introducing "broken" rules, that would be great. It would be a backwards-compatibility break, but probably a beneficial one. Backwards compatibility was broken anyway by introducing <tt>broken-heading</tt> (the "newsome" bug on MNPP).

Line-eating comments could very easily be made to match at the start of the string. Currently they don't since there is no <tt>LF</tt> at the start of the string, just a <tt>LINE-START</tt>.

The "rightmost opening" rule for bracketed precedence is arbitrary, an artifact of implementation. Leftmost opening would probably be more intuitive.