Preprocessor ABNF

MediaWiki preprocessor syntax in augmented Backus–Naur Form (ABNF) (RFC 5234).

Ideal rules
In inclusion mode, these rules are added:

In non-inclusion mode, these rules are added:

Ideal precedence

 * 1) Angle bracket constructs: onlyinclude-sequence, xmlish-element, comment, unclosed-comment, line-eating-comment, inclusion-ignored-tags, noninclusion-ignored-tags
 * 2) Bracketed syntax: <tt>tplarg</tt>, <tt>template</tt>, <tt>link</tt>
 * 3) <tt>heading</tt>
 * 4) <tt>literal</tt>

In ambiguity between angle-bracket constructs, the first-opened structure takes precedence. For example:

&lt;nowiki&gt;&lt;!--&lt;/nowiki&gt;--&gt;

The <tt>nowiki-element</tt> wins.
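The "first-opened" rule can be illustrated with a minimal Python sketch. It is not the preprocessor's implementation: the function name <tt>first_opened</tt> and the two-entry <tt>CLOSERS</tt> table are hypothetical, and only <tt>nowiki</tt> and comments are modelled.

```python
# Hypothetical opener -> closer table; only two angle-bracket constructs are modelled.
CLOSERS = {"<nowiki>": "</nowiki>", "<!--": "-->"}

def first_opened(text):
    """Return (opener, body, rest) for the earliest-opened construct, or None."""
    hits = [(text.find(o), o) for o in CLOSERS if o in text]
    if not hits:
        return None
    start, opener = min(hits)          # the leftmost opening wins
    body_start = start + len(opener)
    end = text.find(CLOSERS[opener], body_start)
    if end == -1:
        # Unclosed: assume the construct runs to the end of the input.
        return opener, text[body_start:], ""
    return opener, text[body_start:end], text[end + len(CLOSERS[opener]):]

print(first_opened("<nowiki><!--</nowiki>-->"))
```

For the example above, the sketch picks <tt>&lt;nowiki&gt;</tt> as the opener, so the comment opener ends up as literal body text and the trailing comment closer is left over as ordinary input.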

In ambiguity between <tt>template</tt>, <tt>tplarg</tt> and <tt>link</tt>, the structure with the rightmost opening takes precedence. For example:

[[

The <tt>template</tt> wins because it was opened after the <tt>link</tt>.
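One way to picture the "rightmost opening" rule is a stack of open constructs in which a closer may only match the top entry. The Python sketch below makes that assumption explicit. The names <tt>closed_constructs</tt>, <tt>OPENERS</tt> and <tt>NAMES</tt> are hypothetical, <tt>tplarg</tt> is left out for brevity, and the input string is an illustrative one chosen for this sketch.

```python
# Hypothetical tables; tplarg ("{{{") is deliberately not modelled.
OPENERS = {"{{": "}}", "[[": "]]"}
NAMES = {"{{": "template", "[[": "link"}

def closed_constructs(text):
    """Return (closed, unclosed) construct names using a top-of-stack match."""
    stack = []    # openers seen so far; the last entry is the rightmost opening
    closed = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in OPENERS:
            stack.append(pair)
            i += 2
        elif stack and pair == OPENERS[stack[-1]]:
            closed.append(NAMES[stack.pop()])
            i += 2
        else:
            i += 1    # literal character, including closers that do not match the top
    unclosed = [NAMES[o] for o in stack]
    return closed, unclosed

print(closed_constructs("[[ {{ ]] }}"))
```

In this sketch the link closer is skipped as literal text because the template sits on top of the stack, so only the template is closed and the link is left open at the end of the input.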

<tt>tplarg</tt> takes precedence over <tt>template</tt> where braces alone are involved, but it is neither higher nor lower in precedence than <tt>link</tt>. Sequences of matching braces are thus interpreted as follows:


 * 4: → {··}
 * 5: →
 * 6: →
 * 7: → {··}
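The following Python sketch shows one reading of how a run of matching braces could be divided when <tt>tplarg</tt> is preferred over <tt>template</tt>. It assumes that braces are consumed innermost first, three at a time while at least three remain, then two, with any single remaining brace per side left as a literal. The function name <tt>split_brace_run</tt> is hypothetical, and the greedy strategy is an assumption rather than something stated by the grammar.

```python
def split_brace_run(n):
    """Split a run of n matching braces into constructs, innermost first.

    Returns (pieces, leftover), where leftover is the number of literal
    braces remaining on each side (0 or 1).
    """
    pieces = []
    while n >= 2:
        size = 3 if n >= 3 else 2   # prefer tplarg ("{{{") over template ("{{")
        pieces.append("tplarg" if size == 3 else "template")
        n -= size
    return pieces, n

for count in range(4, 8):
    print(count, split_brace_run(count))
```

Under these assumptions, runs of 4 and 7 braces leave one literal brace on each side, while runs of 5 and 6 are consumed entirely.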

Practicalities
The main implementation challenge is avoiding infinite backtracking when disambiguating between competing bracketed constructs: <tt>template</tt>, <tt>tplarg</tt>, <tt>link</tt> and <tt>heading</tt>. The xmlish elements (including comments) don't suffer this problem because an unclosed xmlish element runs to the end of the input, forcing a literal interpretation of the contents.

For example:

The square brackets are unclosed, and so the pipe characters should be interpreted as separating the <tt>parts</tt> of a <tt>template</tt>. But we don't know if the link is valid until the cursor reaches the end of the long string. This has traditionally been dealt with by adding a number of "broken" rules with the same precedence as the unbroken rules.

Since forever:

 broken-tplarg   = "{{{" parts-L2
 broken-template = "{{" parts-L2
 broken-link     = "[[" wikitext-L2

Since MW 1.12:

 broken-heading = LINE-START 1*6"=" wikitext-L3 LINE-END

Where <tt>parts-L2</tt> is like <tt>parts</tt> except that it allows headings inside it:

 part-L2        = ( part-name-L2 "=" part-value-L2 ) / ( part-value-L2 )
 part-name-L2   = wikitext-L2
 part-value-L2  = wikitext-L2
 parts-L2       = [ part-L2 1*( "|" part-L2 ) ]

These "broken" rules, when matched, produce output similar to a literal start followed by ordinary wikitext. The difference is that they compete on the same precedence level as the unbroken rules. So the previous example is parsed as a <tt>broken-template</tt> containing a <tt>broken-link</tt> containing a long string and a literal "}}". Based on the ideal rules, we would expect the <tt>literal</tt> interpretation of "}}" to have a lower precedence than its interpretation as the end of a <tt>template</tt>. But with the "broken" rules, the <tt>broken-link</tt> takes precedence over the <tt>template</tt>, being the rightmost-opened structure.

Broken rules always run to the end of the input string, because the only other way to terminate a broken rule is to turn it into an unbroken rule by closing it.

Because a <tt>heading</tt> or a <tt>broken-heading</tt> can appear in a <tt>part-L2</tt>, there is now ambiguity between the equals sign of the name/value separator, and the equals sign for the heading. We resolve it in the following way:


 * For level 1 headings (i.e. one equals sign on each side), the <tt>part</tt> takes precedence.
 * For level 2-6 headings, the heading takes precedence.
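The two cases above can be sketched as a small Python classifier for a single line inside a <tt>part-L2</tt>. This is only an illustration: the function name <tt>equals_role</tt> is hypothetical, the heading level is assumed to be the smaller of the leading and trailing runs of equals signs (capped at 6), and lines consisting only of equals signs are not modelled as headings.

```python
def equals_role(line):
    """Classify a line inside a part-L2 as 'heading', 'part' or 'neither'."""
    stripped = line.rstrip()
    lead = len(stripped) - len(stripped.lstrip("="))
    trail = len(stripped) - len(stripped.rstrip("="))
    if lead == 0 or trail == 0 or lead == len(stripped):
        # Not heading-shaped (or all equals signs, a simplification):
        # any "=" present may still act as a name/value separator.
        return "part" if "=" in stripped else "neither"
    level = min(lead, trail, 6)     # assumed definition of the heading level
    # Level 1: the part's name/value separator wins; levels 2-6: the heading wins.
    return "part" if level == 1 else "heading"

for sample in ["=foo=", "==foo==", "name=value", "plain"]:
    print(repr(sample), equals_role(sample))
```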

If the <tt>part-L2</tt> later becomes a <tt>part</tt> because the <tt>template</tt> or <tt>tplarg</tt> is closed, we could now have an errant <tt>heading</tt> in <tt>wikitext-L3</tt>, where it's not allowed. The <tt>heading</tt> can easily be disabled, but the name/value separator can't easily be recovered. To represent the syntactic effect of this, we introduce another rule:

 disabled-heading  = heading
 wikitext-L3       =/ disabled-heading

The disambiguation of <tt>disabled-heading</tt> with <tt>part</tt> works in the same way as the disambiguation of <tt>heading</tt> with <tt>part-L2</tt>, described above.

Note that even with the changes described in this section, the grammar outlined here still has ambiguities and precedence issues, and it does not correspond to the implementation of the PHP Preprocessor. This spec shouldn't be relied on as an authoritative machine-readable reference; it is a guide to help a human understand the intended precedence and semantics of the preprocessor.

Possible improvements
If an efficient algorithm could be found for disambiguating the ideal rules without introducing "broken" rules, it would be worth adopting. It would break backwards compatibility, but the change would probably be beneficial. Backwards compatibility was already broken by the introduction of <tt>broken-heading</tt> (the "newsome" bug on MNPP).

Line-eating comments could very easily be made to match at the start of the string. Currently they don't since there is no <tt>LF</tt> at the start of the string, just a <tt>LINE-START</tt>.

The "rightmost opening" rule for bracketed precedence is arbitrary, an artifact of implementation. Leftmost opening would probably be more intuitive.