Preprocessor ABNF

MediaWiki preprocessor syntax in augmented Backus–Naur Form (ABNF) (RFC 5234).

Ideal rules
In inclusion mode, these rules are added:

In non-inclusion mode, these rules are added:

Ideal precedence

 * 1) Angle bracket constructs: ,  ,  ,  ,  ,  ,
 * 2) Bracketed syntax: ,   ,

In ambiguity between angle-bracket constructs, the first-opened structure takes precedence. For example:

<nowiki><!--</nowiki>-->

The <nowiki> wins.
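The first-opened rule can be sketched as follows. This is a minimal illustration, not the preprocessor's actual code: it assumes only two angle-bracket constructs (nowiki and comments), and the names OPEN_CLOSE and first_opened are hypothetical. An unclosed construct is taken to run to the end of the input.

```python
# Hypothetical, simplified table of angle-bracket openers and their closers.
OPEN_CLOSE = {"<nowiki>": "</nowiki>", "<!--": "-->"}


def first_opened(text):
    """Return (opener, matched_text) for the construct that opens first, or None."""
    best = None
    for opener, closer in OPEN_CLOSE.items():
        pos = text.find(opener)
        # Keep whichever opener appears earliest in the input.
        if pos != -1 and (best is None or pos < best[1]):
            best = (opener, pos, closer)
    if best is None:
        return None
    opener, pos, closer = best
    end = text.find(closer, pos + len(opener))
    # An unclosed construct runs to the end of the input.
    end = len(text) if end == -1 else end + len(closer)
    return opener, text[pos:end]


print(first_opened("<nowiki><!--</nowiki>-->"))
```

The comment opener lies inside the nowiki span, so it is never considered as a construct in its own right, and the trailing "-->" is left as literal text.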

In ambiguity between tplarg, template and link, the structure with the rightmost opening takes precedence. For example:

[[ {{ ]] }}

The template wins because it was opened after the link.
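One way to picture the rightmost-opening rule is a stack of open constructs in which a closer is only ever compared with the most recently opened one. The sketch below assumes a pre-tokenized input containing just "[[", "]]", "{{" and "}}"; the function and table names are hypothetical and it is not a description of the real implementation.

```python
# Hypothetical token tables for a two-construct toy grammar.
OPENER_FOR = {"]]": "[[", "}}": "{{"}
NAMES = {"[[": "link", "{{": "template"}


def match_rightmost(tokens):
    """Return (name, open_index, close_index) for each construct that closes."""
    stack = []    # indexes of constructs that are still open
    matched = []
    for i, tok in enumerate(tokens):
        if tok in NAMES:
            stack.append(i)
        elif tok in OPENER_FOR:
            # Only the rightmost (most recently opened) construct may close.
            if stack and tokens[stack[-1]] == OPENER_FOR[tok]:
                start = stack.pop()
                matched.append((NAMES[tokens[start]], start, i))
            # Otherwise the closer is treated as literal text.
    return matched


print(match_rightmost(["[[", "{{", "]]", "}}"]))
```

With the opening order reversed, `match_rightmost(["{{", "[[", "}}", "]]"])`, the link is the construct that closes instead.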

tplarg takes precedence over template where braces alone are involved. But it is neither higher nor lower in precedence than link. Sequences of matching braces are thus interpreted as follows:

 * 4: {{{{·}}}} → {·{{{·}}}·}
 * 5: {{{{{·}}}}} → {{·{{{·}}}·}}
 * 6: {{{{{{·}}}}}} → {{{·{{{·}}}·}}}
 * 7: {{{{{{{·}}}}}}} → {·{{{·{{{·}}}·}}}·}
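The grouping of a run of matching braces can be sketched as a greedy split working from the inside out: three braces at a time (tplarg) while at least three remain, then two (template) or one (a literal brace) for whatever is left. The helper names split_brace_run and render are hypothetical, and the "·" separator is used only to mark the boundaries between layers.

```python
def split_brace_run(n):
    """Split n matching braces into layer widths, innermost first."""
    widths = []
    while n >= 3:
        widths.append(3)   # tplarg: three braces
        n -= 3
    if n:
        widths.append(n)   # 2 = template, 1 = literal brace
    return widths


def render(n, sep="·"):
    """Show the grouping of n matching braces, with sep marking layer boundaries."""
    text = sep
    for i, width in enumerate(split_brace_run(n)):
        if i:
            # Separate each outer layer from the layer it encloses.
            text = sep + text + sep
        text = "{" * width + text + "}" * width
    return text


for count in range(4, 8):
    print(count, render(count))
```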

Practicalities
The main implementation challenge is avoiding infinite backtracking when disambiguating between competing bracketed constructs: tplarg, template, link and heading. The xmlish elements (including comments) don't suffer this problem because an unclosed xmlish element runs to the end of the input, forcing a literal interpretation of the contents.

For example:

The square brackets are unclosed, and so the pipe characters should be interpreted as separating the  of a. But we don't know if the link is valid until the cursor reaches the end of the long string. This has traditionally been dealt with by adding a number of "broken" rules with the same precedence as the unbroken rules.

Since forever:

 broken-tplarg   = "{{{" parts-L2
 broken-template = "{{" parts-L2
 broken-link     = "[[" wikitext-L2

Since MW 1.12:

 broken-heading  = LINE-START 1*6"=" wikitext-L3 LINE-END

Where  is like   except that it allows headings inside it:

 part-L2        = ( part-name-L2 "=" part-value-L2 ) / ( part-value-L2 )
 part-name-L2   = wikitext-L2
 part-value-L2  = wikitext-L2
 parts-L2       = [ part-L2 1*( "|" part-L2 ) ]

These "broken" rules, when matched, produce output similar to a literal start followed by ordinary wikitext. The difference is that they compete on the same precedence level as the unbroken rules. So the previous example is parsed as a  containing a   containing a long string and a literal "}}". Based on the ideal rules, we would expect the  interpretation of "}}" to have a lower precedence than its interpretation as the end of a. But with the "broken" rules, the  takes precedence over the , being the rightmost-opened structure.

Broken rules always run to the end of the input string, because the only other way to terminate a broken rule is to turn it into an unbroken rule by closing it.
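The effect of the "broken" rules can be illustrated with a single left-to-right pass that never backtracks: whatever is still open when the input ends is reported as a broken construct running to the end. This is a sketch under simplifying assumptions (pre-tokenized input, no headings, no xmlish elements); classify and the lookup tables are hypothetical names.

```python
# Hypothetical lookup tables for the toy tokenizer used in this sketch.
CLOSER_FOR = {"{{{": "}}}", "{{": "}}", "[[": "]]"}
NAMES = {"{{{": "tplarg", "{{": "template", "[[": "link"}


def classify(tokens):
    """Return (rule, start, end) tuples, ordered by start index."""
    stack = []     # indexes of openers not yet closed
    result = []
    for i, tok in enumerate(tokens):
        if tok in CLOSER_FOR:
            stack.append(i)
        elif stack and CLOSER_FOR[tokens[stack[-1]]] == tok:
            # The closer matches the rightmost open construct: unbroken rule.
            start = stack.pop()
            result.append((NAMES[tokens[start]], start, i))
        # Anything else, including a non-matching closer, is literal text.
    end = len(tokens) - 1
    for start in stack:
        # Still open at end of input: a broken rule that runs to the end.
        result.append(("broken-" + NAMES[tokens[start]], start, end))
    return sorted(result, key=lambda item: item[1])


print(classify(["{{", "[[", "a", "|", "b", "|", "c", "}}"]))
```

In this run the "}}" does not close the template, because the unclosed link was opened later; both constructs end up broken and the "}}" stays literal.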

Because a  or a   can appear in a , there is now ambiguity between the equals sign of the name/value separator, and the equals sign for the heading. We resolve it in the following way:


 * For level 1 headings (i.e. one equals sign on each side), the  takes precedence.
 * For level 2-6 headings, the heading takes precedence.

If the  later becomes a   because the   or   is closed, we could now have an errant   in , where it's not allowed. The  can easily be disabled, but the name/value separator can't easily be recovered. To represent the syntactic effect of this, we introduce another rule:

 disabled-heading  = heading
 wikitext-L3       =/ disabled-heading

The disambiguation of  with   works in the same way as the disambiguation of   with , described above.

Note that even with the changes described in this section, the grammar outlined here has ambiguities and precedence issues and does not correspond exactly to the implementation of the PHP Preprocessor. This spec shouldn't be relied on as an authoritative machine-readable reference; treat it as a guide that helps a human reader understand the intended precedence and semantics of the preprocessor.

Possible improvements
If an efficient algorithm could be found for disambiguating the ideal rules, without introducing "broken" rules, that would be great. It would be a b/c break, but probably beneficial. Backwards compatibility was broken anyway by introducing  (the "newsome" bug on MNPP).

Line-eating comments could very easily be made to match at the start of the string. Currently they don't since there is no  at the start of the string, just a.

The "rightmost opening" rule for bracketed precedence is arbitrary, an artifact of implementation. Leftmost opening would probably be more intuitive.