Preprocessor ABNF

MediaWiki preprocessor syntax in augmented Backus–Naur Form (ABNF) (RFC 5234).

Ideal rules
In inclusion mode, these rules are added:

In non-inclusion mode, these rules are added:

Ideal precedence

 * 1) Angle bracket constructs: onlyinclude-sequence, xmlish-element, comment, unclosed-comment, line-eating-comment, inclusion-ignored-tags, noninclusion-ignored-tags
 * 2) Bracketed syntax: <tt>tplarg</tt>, <tt>template</tt>, <tt>link</tt>
 * 3) <tt>heading</tt>
 * 4) <tt>literal</tt>

In ambiguity between angle-bracket constructs, the first-opened structure takes precedence. For example:

&lt;nowiki&gt;&lt;!--&lt;/nowiki&gt;--&gt;

The <tt>nowiki-element</tt> wins.
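The "first-opened wins" rule can be sketched as a scan that always picks the earliest opener and then skips to its closer, treating everything in between as literal. This is a minimal illustration, not the preprocessor's implementation: it covers only <tt>nowiki</tt> and comments, and the names <tt>CONSTRUCTS</tt> and <tt>first_opened</tt> are hypothetical.

```python
# Hypothetical table of two angle-bracket constructs: (opener, closer).
CONSTRUCTS = {
    "nowiki-element": ("<nowiki>", "</nowiki>"),
    "comment": ("<!--", "-->"),
}


def first_opened(text: str) -> list[tuple[str, str]]:
    """Return (rule name, matched span) pairs, resolving overlaps by first opener."""
    found = []
    pos = 0
    while True:
        # Locate the next opener of every construct at or after pos.
        candidates = [
            (text.find(opener, pos), name)
            for name, (opener, _closer) in CONSTRUCTS.items()
        ]
        candidates = [c for c in candidates if c[0] != -1]
        if not candidates:
            return found
        start, name = min(candidates)  # the earliest opener wins
        opener, closer = CONSTRUCTS[name]
        end = text.find(closer, start + len(opener))
        # An unclosed construct is assumed to run to the end of the input.
        end = len(text) if end == -1 else end + len(closer)
        found.append((name, text[start:end]))
        pos = end  # anything opened inside the span was literal


print(first_opened("<nowiki><!--</nowiki>-->"))
```

For the example above this reports a single <tt>nowiki-element</tt> spanning up to <tt>&lt;/nowiki&gt;</tt>; the trailing <tt>--&gt;</tt> is left as literal text.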

In ambiguity between <tt>template</tt>, <tt>tplarg</tt> and <tt>link</tt>, the structure with the rightmost opening takes precedence. For example:

[[ {{ ]] }}

The <tt>template</tt> wins because it was opened after the <tt>link</tt>.
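One way to picture the rightmost-opening rule is a stack of open delimiters in which only the top entry may be closed. The sketch below assumes exactly that model and handles only <tt>[[</tt> and <tt>{{</tt>; <tt>tplarg</tt> is left out for brevity. The function name and the input string <tt>[[ {{ ]] }}</tt> are illustrative assumptions.

```python
OPENERS = {"[[": "link", "{{": "template"}
CLOSERS = {"]]": "[[", "}}": "{{"}


def matched_constructs(text: str) -> list[str]:
    """Return the names of the constructs that close, in closing order."""
    stack: list[str] = []    # open delimiters, rightmost opening last
    matched: list[str] = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in OPENERS:
            stack.append(pair)
            i += 2
        elif pair in CLOSERS:
            # Only the rightmost-opened construct may close here;
            # a closer for anything else is treated as literal text.
            if stack and stack[-1] == CLOSERS[pair]:
                matched.append(OPENERS[stack.pop()])
            i += 2
        else:
            i += 1
    return matched


print(matched_constructs("[[ {{ ]] }}"))
```

With this input only the <tt>template</tt> closes: the <tt>]]</tt> arrives while <tt>{{</tt> is on top of the stack, so it is literal and the <tt>link</tt> never completes.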

<tt>tplarg</tt> takes precedence over <tt>template</tt> where braces alone are involved, but it is neither higher nor lower in precedence than <tt>link</tt>. Sequences of matching braces are thus interpreted as follows:


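 * 4: {{{{&middot;}}}} &rarr; {&middot;{{{&middot;}}}&middot;}
 * 5: {{{{{&middot;}}}}} &rarr; {{&middot;{{{&middot;}}}&middot;}}
 * 6: {{{{{{&middot;}}}}}} &rarr; {{{&middot;{{{&middot;}}}&middot;}}}
 * 7: {{{{{{{&middot;}}}}}}} &rarr; {&middot;{{{&middot;{{{&middot;}}}&middot;}}}&middot;}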
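A minimal sketch of one reading of this rule, assuming the run consists of braces alone and that grouping starts from the innermost braces: each group of three becomes a <tt>tplarg</tt>, a leftover pair becomes a <tt>template</tt>, and a single leftover brace on each side is literal. The function names are hypothetical.

```python
def split_brace_run(n: int) -> list[int]:
    """Split n matching braces into group widths, innermost group first.

    A width of 3 stands for tplarg, 2 for template, 1 for a literal brace.
    """
    widths = []
    while n >= 3:        # tplarg takes precedence while three braces remain
        widths.append(3)
        n -= 3
    if n > 0:            # 2 left over: template; 1 left over: literal
        widths.append(n)
    return widths


def render(n: int, dot: str = "\u00b7") -> str:
    """Draw the grouping, with a dot marking the boundary between groups."""
    text = dot
    for index, width in enumerate(split_brace_run(n)):
        if index > 0:
            text = dot + text + dot
        text = "{" * width + text + "}" * width
    return text


for count in range(4, 8):
    print(count, split_brace_run(count), render(count))
```

Under these assumptions four braces split into widths <tt>[3, 1]</tt>, that is, a <tt>tplarg</tt> wrapped in one literal brace on each side, and seven braces split into <tt>[3, 3, 1]</tt>.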

Practicalities
The main implementation challenge is avoiding infinite backtracking when disambiguating between competing bracketed constructs: <tt>template</tt>, <tt>tplarg</tt>, <tt>link</tt> and <tt>heading</tt>. The xmlish elements (including comments) don't suffer this problem because an unclosed xmlish element runs to the end of the input, forcing a literal interpretation of the contents.

For example:

The square brackets are unclosed, and so the pipe characters should be interpreted as separating the <tt>parts</tt> of a <tt>template</tt>. But we don't know if the link is valid until the cursor reaches the end of the long string. This has traditionally been dealt with by adding a number of "broken" rules with the same precedence as the unbroken rules.

Since forever:

 broken-tplarg   = "{{{" parts-L2
 broken-template = "{{" parts-L2
 broken-link     = "[[" wikitext-L2

Since MW 1.12:

 broken-heading  = LINE-START 1*6"=" wikitext-L3 LINE-END

Where <tt>parts-L2</tt> is like <tt>parts</tt> except that it allows headings inside it:

 part-L2        = ( part-name-L2 "=" part-value-L2 ) / ( part-value-L2 )
 part-name-L2   = wikitext-L2
 part-value-L2  = wikitext-L2
 parts-L2       = [ part-L2 1*( "|" part-L2 ) ]

These "broken" rules, when matched, produce output similar to a literal start followed by ordinary wikitext. The difference is that they compete on the same precedence level as the unbroken rules. So the previous example is parsed as a <tt>broken-template</tt> containing a <tt>broken-link</tt> containing a long string and a literal "}}". Based on the ideal rules, we would expect the <tt>literal</tt> interpretation of "}}" to have a lower precedence than its interpretation as the end of a <tt>template</tt>. But with the "broken" rules, the <tt>broken-link</tt> takes precedence over the <tt>template</tt>, being the rightmost-opened structure.

Broken rules always run to the end of the input string, because the only other way to terminate a broken rule is to turn it into an unbroken rule by closing it.
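The effect of the "broken" rules can be sketched with a single forward pass over a stack: a closer is honoured only if it matches the rightmost open delimiter, and whatever is still open at the end of the input is reported as broken. This is an illustration under those assumptions, not the actual parser; the function name and the sample input are hypothetical, and <tt>tplarg</tt> and <tt>heading</tt> are omitted.

```python
OPEN_TO_CLOSE = {"{{": "}}", "[[": "]]"}
BROKEN_NAME = {"{{": "broken-template", "[[": "broken-link"}


def broken_at_end(text: str) -> list[str]:
    """Return the broken rules left open at end of input, outermost first."""
    stack: list[str] = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in OPEN_TO_CLOSE:
            stack.append(pair)
            i += 2
        elif stack and pair == OPEN_TO_CLOSE[stack[-1]]:
            stack.pop()      # closing turns the rule into an unbroken one
            i += 2
        else:
            i += 1           # literal text, including mismatched closers
    # No backtracking is needed: every character is visited once.
    return [BROKEN_NAME[opener] for opener in stack]


print(broken_at_end("{{ a | [[ long | string }}"))
```

Here the <tt>}}</tt> is seen while <tt>[[</tt> is the rightmost open delimiter, so it stays literal and both structures run to the end of the input as a <tt>broken-template</tt> containing a <tt>broken-link</tt>.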

Because a <tt>heading</tt> or a <tt>broken-heading</tt> can appear in a <tt>part-L2</tt>, there is now ambiguity between the equals sign of the name/value separator, and the equals sign for the heading. We resolve it in the following way:


 * For level 1 headings (i.e. one equals sign on each side), the <tt>part</tt> takes precedence.
 * For level 2-6 headings, the heading takes precedence.
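The two cases can be written as a small decision function. The sketch below assumes a heading's level is the smaller of its leading and trailing runs of equals signs, capped at 6, and that trailing spaces and tabs are ignored; those details, and the function names, are assumptions made for illustration rather than part of the grammar above.

```python
def heading_level(line: str) -> int:
    """Return the assumed heading level of a line, or 0 if it is not a heading."""
    stripped = line.rstrip(" \t")
    lead = len(stripped) - len(stripped.lstrip("="))
    trail = len(stripped) - len(stripped.rstrip("="))
    # Require at least one character of content between the two runs.
    if lead == 0 or trail == 0 or lead + trail >= len(stripped):
        return 0
    return min(lead, trail, 6)


def equals_winner(line: str) -> str:
    """Decide which rule claims the equals signs on a line inside a part-L2."""
    level = heading_level(line)
    if level >= 2:
        return "heading"     # levels 2-6: the heading takes precedence
    return "part"            # level 1 or no heading: name/value separator


for sample in ("=a=", "==a==", "name=value"):
    print(repr(sample), equals_winner(sample))
```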

If the <tt>part-L2</tt> later becomes a <tt>part</tt> because the <tt>template</tt> or <tt>tplarg</tt> is closed, we could now have an errant <tt>heading</tt> in <tt>wikitext-L3</tt>, where it's not allowed. The <tt>heading</tt> can easily be disabled, but the name/value separator can't easily be recovered. To represent the syntactic effect of this, we introduce another rule:

 disabled-heading  = heading
 wikitext-L3       =/ disabled-heading

The disambiguation of <tt>disabled-heading</tt> with <tt>part</tt> works in the same way as the disambiguation of <tt>heading</tt> with <tt>part-L2</tt>, described above.

Possible improvements
If an efficient algorithm could be found for disambiguating the ideal rules without introducing "broken" rules, it would be worth adopting. It would be a backwards-compatibility break, but probably a beneficial one. Backwards compatibility was broken anyway by introducing <tt>broken-heading</tt> (the "newsome" bug on MNPP).

Line-eating comments could very easily be made to match at the start of the string. Currently they don't since there is no <tt>LF</tt> at the start of the string, just a <tt>LINE-START</tt>.

The "rightmost opening" rule for bracketed precedence is arbitrary, an artifact of implementation. Leftmost opening would probably be more intuitive.