
Preprocessor ABNF

From mediawiki.org

MediaWiki preprocessor syntax in augmented Backus–Naur Form (ABNF) (RFC 5234).

Ideal rules

; START = start of string
; END = end of string
; LINE-START = start of line
; LINE-END = end of line
;
; The string starts with LINE-START. An LF input produces the tokens 
; LINE-END LF LINE-START, and the string ends with LINE-END.
;
; The starting symbol of the grammar is wikitext-L1.

xml-char = %x9 / %xA / %xD / %x20-D7FF / %xE000-FFFD / %x10000-10FFFF
sptab = SP / HTAB

; everything except ">" (%x3E)
attr-char = %x9 / %xA / %xD / %x20-3D / %x3F-D7FF / %xE000-FFFD / %x10000-10FFFF

literal         = *xml-char
title           = wikitext-L3
part-name       = wikitext-L3
part-value      = wikitext-L3
part            = ( part-name "=" part-value ) / ( part-value )
parts           = [ title *( "|" part ) ]
tplarg          = "{{{" parts "}}}"
template        = "{{" parts "}}"
link            = "[[" wikitext-L3 "]]"

comment         = "<!--" literal "-->"
unclosed-comment = "<!--" literal END
; the 1* repetition in the line-eating-comment rule (several comments on one line)
; was absent between MW 1.12 and MW 1.22
line-eating-comment = LF LINE-START *SP 1*( comment *SP ) LINE-END

attr            = *attr-char
nowiki-element  = "<nowiki" attr ( "/>" / ( ">" literal ( "</nowiki>" / END ) ) )
; ...and similar rules added by XML-style extensions.

xmlish-element  = nowiki-element / ... extensions ...

heading = LINE-START heading-inner [ *sptab comment ] *sptab LINE-END

heading-inner   =       "=" wikitext-L3 "="                / 
                        "==" wikitext-L3 "=="              /
                        "===" wikitext-L3 "==="            /
                        "====" wikitext-L3 "===="          /
                        "=====" wikitext-L3 "====="        /
                        "======" wikitext-L3 "======"

; wikitext-L1 is a simple proxy to wikitext-L2, except in inclusion mode, where it
; has a role in <onlyinclude> syntax (see below)
wikitext-L1     = wikitext-L2 / *wikitext-L1
wikitext-L2     = heading / wikitext-L3 / *wikitext-L2
wikitext-L3     = literal / template / tplarg / link / comment / 
                  line-eating-comment / unclosed-comment / xmlish-element / 
                  *wikitext-L3
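
The comments at the top of the grammar describe how line boundaries become tokens. As a minimal sketch of that description only (the function name `line_tokens` is hypothetical, and the START and END markers are left out for brevity), the marker stream could be produced like this:

```python
def line_tokens(text):
    """Return the characters of text interleaved with line-boundary markers.

    Follows the grammar comments: the string starts with LINE-START, each LF
    yields LINE-END LF LINE-START, and the string ends with LINE-END.
    """
    tokens = ["LINE-START"]
    for ch in text:
        if ch == "\n":
            tokens += ["LINE-END", "LF", "LINE-START"]
        else:
            tokens.append(ch)
    tokens.append("LINE-END")
    return tokens
```

With this sketch, even an empty input yields a LINE-START followed by a LINE-END, so rules anchored on line boundaries can still be tested against it.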

In inclusion mode, these rules are added:

noinclude-element               = "<noinclude" attr ( "/>" / ( ">" literal ( "</noinclude>" / END ) ) )
inclusion-ignored-tag           = "<includeonly>" / "</includeonly>"
closed-onlyinclude-item         = ignored-text "<onlyinclude>" wikitext-L2 "</onlyinclude>"
unclosed-onlyinclude-item       = ignored-text "<onlyinclude>" wikitext-L2
ignored-text                    = literal
onlyinclude-sequence            = *closed-onlyinclude-item *unclosed-onlyinclude-item
xmlish-element                  =/ noinclude-element
wikitext-L1                     =/ onlyinclude-sequence
wikitext-L3                     =/ inclusion-ignored-tag / onlyinclude-sequence

In non-inclusion mode, these rules are added:

includeonly-element             = "<includeonly" attr ( "/>" / ( ">" literal ( "</includeonly>" / END ) ) )
noninclusion-ignored-tag        = "<noinclude>" / "</noinclude>" / "<onlyinclude>" / "</onlyinclude>"
xmlish-element                  =/ includeonly-element
wikitext-L3                     =/ noninclusion-ignored-tag

Ideal precedence

  1. Angle bracket constructs: onlyinclude-sequence, xmlish-element, comment, unclosed-comment, line-eating-comment, inclusion-ignored-tag, noninclusion-ignored-tag
  2. Bracketed syntax: tplarg, template, link
  3. heading
  4. literal

In ambiguity between angle-bracket constructs, the first-opened structure takes precedence. For example:

<nowiki><!--</nowiki>-->

The nowiki-element wins.
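
The first-opened rule can be illustrated with a small scanner. This is only a sketch under simplifying assumptions: it knows just two constructs, it ignores attributes on `<nowiki>`, and the names `OPENERS` and `first_angle_construct` are hypothetical.

```python
# Opening delimiter -> closing delimiter (a simplified subset of the grammar).
OPENERS = {"<nowiki>": "</nowiki>", "<!--": "-->"}

def first_angle_construct(text):
    """Return (opener, start, end) for the leftmost-opened construct, or None.

    An unclosed construct runs to the end of the string, as in the
    unclosed-comment and nowiki-element rules.
    """
    best = None
    for opener in OPENERS:
        pos = text.find(opener)
        if pos != -1 and (best is None or pos < best[1]):
            best = (opener, pos)
    if best is None:
        return None
    opener, start = best
    closer = OPENERS[opener]
    close = text.find(closer, start + len(opener))
    end = len(text) if close == -1 else close + len(closer)
    return (opener, start, end)
```

Applied to the example above, the sketch selects the `<nowiki>` opener at offset 0 and consumes up to and including `</nowiki>`, leaving `-->` as trailing text.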

In ambiguity between template, tplarg and link, the structure with the rightmost opening takes precedence. For example:

[[ {{ ]] }}

The template wins because it was opened after the link.
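
One way to picture the rightmost-opening rule is a stack of open brackets in which a closing delimiter can only close the most recently opened structure. The following is a sketch, not the preprocessor's actual code: it handles only two-character delimiters (so tplarg is ignored), and `PAIRS` and `match_rightmost` are hypothetical names.

```python
PAIRS = {"[[": "]]", "{{": "}}"}

def match_rightmost(text):
    """Return (matched, unclosed).

    matched  -- list of (opener, start, end) for structures that were closed
    unclosed -- list of (opener, start) still open at the end of the string
    """
    stack = []
    matched = []
    i = 0
    while i < len(text):
        two = text[i:i + 2]
        if two in PAIRS:
            stack.append((two, i))
            i += 2
        elif stack and two == PAIRS[stack[-1][0]]:
            opener, start = stack.pop()
            matched.append((opener, start, i + 2))
            i += 2
        else:
            # Includes closers that do not match the top of the stack:
            # they are treated as literal text.
            i += 1
    return matched, stack
```

For `[[ {{ ]] }}`, the `]]` is skipped as literal text because the top of the stack is `{{`, the `}}` then closes the template, and the `[[` remains unclosed.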

tplarg takes precedence over template where braces alone are involved. But it is neither higher nor lower in precedence than link. Sequences of matching braces are thus interpreted as follows:

  • 4: {{{{·}}}} → {·{{{·}}}·}
  • 5: {{{{{·}}}}} → {{·{{{·}}}·}}
  • 6: {{{{{{·}}}}}} → {{{·{{{·}}}·}}}
  • 7: {{{{{{{·}}}}}}} → {·{{{·{{{·}}}·}}}·}
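
The pattern in the list above amounts to peeling off groups of three braces from the inside out, with any remainder of one or two braces forming the outermost layer. A minimal sketch that reproduces the listed interpretations (the function names are hypothetical, and it covers only runs of matching braces with nothing else involved):

```python
def split_braces(n):
    """Group sizes for n matching braces, innermost group first."""
    if n < 1:
        raise ValueError("need at least one brace")
    groups = []
    while n >= 3:
        groups.append(3)   # tplarg takes precedence over template
        n -= 3
    if n:
        groups.append(n)   # 2 = template, 1 = literal brace
    return groups

def render(n):
    """Render the grouping in the notation used in the list above."""
    groups = split_braces(n)
    out = "{" * groups[0] + "·" + "}" * groups[0]
    for size in groups[1:]:
        out = "{" * size + "·" + out + "·" + "}" * size
    return out
```

For instance, `render(5)` gives `{{·{{{·}}}·}}`, a template wrapped around a tplarg, matching the second entry in the list.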

Practicalities


The main implementation challenge is avoiding infinite backtracking when disambiguating between competing bracketed constructs: template, tplarg, link and heading. The xmlish elements (including comments) don't suffer this problem because an unclosed xmlish element runs to the end, forcing a literal interpretation of the contents.

For example:

{{ [[ x | y |  ...long string... }}

The square brackets are unclosed, and so the pipe characters should be interpreted as separating the parts of a template. But we don't know if the link is valid until the cursor reaches the end of the long string. This has traditionally been dealt with by adding a number of "broken" rules with the same precedence as the unbroken rules.

Since forever:

broken-tplarg   = "{{{" parts-L2
broken-template = "{{" parts-L2
broken-link     = "[[" wikitext-L2

Since MW 1.12:

broken-heading  = LINE-START 1*6"=" wikitext-L3 LINE-END

Where parts-L2 is like parts except that it allows headings inside it:

part-L2         = ( part-name-L2 "=" part-value-L2 ) / ( part-value-L2 )
part-name-L2    = wikitext-L2
part-value-L2   = wikitext-L2
parts-L2        = [ part-L2 1*( "|" part-L2 ) ]

These "broken" rules, when matched, produce output similar to a literal start followed by ordinary wikitext. The difference is that they compete on the same precedence level as the unbroken rules. So the previous example is parsed as a broken-template containing a broken-link containing a long string and a literal "}}". Based on the ideal rules, we would expect the literal interpretation of "}}" to have a lower precedence than its interpretation as the end of a template. But with the "broken" rules, the broken-link takes precedence over the template, being the rightmost-opened structure.

Broken rules always run to the end of the input string, because the only other way to terminate a broken rule is to turn it into an unbroken rule by closing it.

Because a heading or a broken-heading can appear in a part-L2, there is now ambiguity between the equals sign of the name/value separator, and the equals sign for the heading. We resolve it in the following way:

  • For level 1 headings (i.e. one equals sign on each side), the part takes precedence.
  • For level 2-6 headings, the heading takes precedence.
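
The two cases above can be written as a small decision function. This is a sketch under stated assumptions: the heading level is taken as the smaller of the leading and trailing runs of equals signs, capped at 6, trailing comments are not handled, and the names `HEADING_LINE`, `heading_level` and `equals_winner` are hypothetical.

```python
import re

# Assumed shape of a candidate heading line: a leading run of "=", some
# content, a trailing run of "=", then optional spaces or tabs.
HEADING_LINE = re.compile(r"^(=+)(.*?)(=+)[ \t]*$")

def heading_level(line):
    """Return the candidate heading level (1-6), or 0 if not heading-shaped."""
    m = HEADING_LINE.match(line)
    if m is None:
        return 0
    return min(len(m.group(1)), len(m.group(3)), 6)

def equals_winner(line):
    """Return which reading of "=" wins inside a part-L2, or None if no conflict."""
    level = heading_level(line)
    if level == 0:
        return None
    return "part" if level == 1 else "heading"
```

Under these assumptions, a line such as `= a =` is read as a name/value part, while `== a ==` is read as a heading.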

If the part-L2 later becomes a part because the template or tplarg is closed, we could now have an errant heading in wikitext-L3, where it's not allowed. The heading can easily be disabled, but the name/value separator can't easily be recovered. To represent the syntactic effect of this, we introduce another rule:

disabled-heading   = heading
wikitext-L3        =/ disabled-heading

The disambiguation of disabled-heading with part works in the same way as the disambiguation of heading with part-L2, described above.

Note that even with the changes described in this section, the grammar outlined here has ambiguities and precedence issues and does not correspond to the implementation of the PHP Preprocessor. This spec shouldn't be relied on as an authoritative machine-readable reference; it is a guide to help a human reader understand the intended precedence and semantics of the preprocessor.

Possible improvements


If an efficient algorithm could be found for disambiguating the ideal rules without introducing "broken" rules, that would be a clear improvement. It would break backwards compatibility, but would probably be beneficial. Backwards compatibility was broken anyway by introducing broken-heading (the "newsome" bug on m:MNPP).

Line-eating comments could very easily be made to match at the start of the string. Currently they don't since there is no LF at the start of the string, just a LINE-START.
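
The effect of the leading LF can be seen with a regular-expression approximation of the line-eating-comment rule. This is only a sketch: the pattern and the names `LINE_EATING` and `eat_comment_lines` are assumptions for illustration, and a comment is taken to end at the first `-->`.

```python
import re

# LF, optional spaces, one or more comments each followed by optional spaces,
# then end of line (next LF or end of string). The LF that ends the line is
# not consumed.
LINE_EATING = re.compile(
    r"\n *(?:<!--(?:(?!-->).)*--> *)+(?=\n|\Z)",
    re.DOTALL,
)

def eat_comment_lines(text):
    """Remove every line that consists only of comments and spaces."""
    return LINE_EATING.sub("", text)
```

Because the pattern begins with an LF, a comment-only first line is not matched, which mirrors the behaviour described above: `eat_comment_lines("a\n <!-- c --> \nb")` collapses to `"a\nb"`, whereas `"<!-- c -->\nb"` is left unchanged.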

The "rightmost opening" rule for bracketed precedence is arbitrary, an artifact of implementation. Leftmost opening would probably be more intuitive.