Markup spec/BNF/Article

Wiki-page
The top-level element is wiki-page which describes the contents of a page. A page can either be a redirect or a normal article.

<wiki-page>      ::= [ ] | [ ]
                 ::=    ( | | EOL)
                 ::= FROM_LANGUAGE_FILE

Several nonterminals used above are defined in ../Links/.

Notes:

The redirect keyword is language-specific, and may have more than one possible value. By default the right-hand side of the expression (replacing FROM_LANGUAGE_FILE) is #REDIRECT, but other languages, for example Estonian, use a different value. The match is case-insensitive (though this, too, may be overridden in the language file).

The text between the redirect keyword and the link should be matched non-greedily: it is the longest run of characters that does not contain the start of a link.

For example, the rule above will match the following, and treat it as a redirect to foo:
 * <tt>#REDireCTnon%^sense[[foo|and this is parsed as article content</tt>


 * Interwiki prefixes may not be supported in redirect links. (Is this configurable?)
 * The content following the redirect link is not rendered. However, it is parsed, so interwiki links, category links and even normal links are still processed and behave "normally".
 * Anchors (Article#Section) are supported, but not yet described in the grammar.
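
The redirect rule and the notes above can be sketched with a regular expression. This is only an illustration, not the parser's actual code: the function name <tt>parse_redirect</tt> and the keyword list are hypothetical, the default keyword is assumed to be #REDIRECT as the example suggests, and anchors are simply dropped because the grammar does not describe them yet.

```python
import re

# Hypothetical default; a real parser would load the accepted keywords
# from the language file (FROM_LANGUAGE_FILE in the grammar).
REDIRECT_KEYWORDS = ["#REDIRECT"]


def parse_redirect(page_text, keywords=REDIRECT_KEYWORDS):
    """Return the redirect target, or None if the page is a normal article."""
    alternatives = "|".join(re.escape(k) for k in keywords)
    pattern = re.compile(
        r"(?:" + alternatives + r")"  # language-specific keyword
        r".*?"                        # non-greedy filler before the link
        r"\[\[([^\]|#\n]+)",          # target: stops at "]", "|", "#" or end of line
        re.IGNORECASE,                # the keyword match is case-insensitive
    )
    match = pattern.match(page_text)
    return match.group(1).strip() if match else None


print(parse_redirect("#REDireCTnon%^sense[[foo|and this is parsed as article content"))
```

Anything after the link target is ignored by this sketch; per the notes above, a full parser would still parse it.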

Article
This describes the contents of an article. An article consists of blocks, which come in two flavours: paragraphs and special blocks. Both of them end with a newline. Paragraphs are separated by empty lines.

<special-block-and-more> ::= <special-block> ( EOF | [ ] <special-block-and-more>
                                                   | ( | "") <paragraph-and-more> )
<paragraph-and-more>     ::= ( EOF | [ ] <special-block-and-more>
                                   | <paragraph-and-more> )

The nonterminals <tt>special-block-and-more</tt> and <tt>paragraph-and-more</tt> are not disjoint; the parser should first try to match against <tt>special-block-and-more</tt>.

The expression <tt>( | "")</tt> is a greedy version of <tt>[ ]</tt>. If both the empty string and a newline can be matched, then the former expression matches the newline, while the latter expression would match the empty string according to the conventions on ../.

Paragraph
Every paragraph ends with a newline character. A paragraph is translated into a &lt;p&gt; element.

<paragraph>              ::= [<lines-of-text>] | <lines-of-text>
<lines-of-text>          ::= <line-of-text> [<lines-of-text>]
<line-of-text>           ::= <inline-text>
<inline-text>            ::= <inline-element> [<inline-text>]
<inline-element>         ::= <link> | <magic-link> | <nowiki-tag> | <image-inline> | <gallery-block> | <media-inline> | | | | <magic-word> | <html-block> | <math-block> | <pre-block> | <inline-html> | (more missing)...
<text>                   ::= ( - ) [ ]

In the penultimate rule, <tt>link</tt>, <tt>magic-link</tt> and <tt>nowiki-tag</tt> are described in ../Links/, ../Magic links/ and ../Nowiki/, respectively. Math-block, html-block and pre-block are also specified on the nowiki page. Again, <tt>link</tt> and <tt>text</tt> are not disjoint; the parser should try <tt>text</tt> last.

The recursion in the second rule should be non-greedy, i.e., it should match as few lines as possible. For instance,
 * <tt>abc</tt>
 * <tt>----</tt>

should be parsed as one <tt>line-of-text</tt> and one <tt>horizontal-rule</tt>, but
 * <tt>abc</tt>
 * <tt>---</tt>

should be parsed as two <tt>line-of-text</tt> nonterminals.
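
A minimal sketch of this behaviour, assuming the only special block of interest is the horizontal rule (four or more dashes, as stated in the Horizontal rule section) and that the second line of the first example is four dashes. The function name <tt>classify_lines</tt> is hypothetical. The idea is that each new line is first tested as a special block, so <tt>lines-of-text</tt> never swallows a line that could start one.

```python
import re

# Four or more dashes at the start of a line open a horizontal rule.
HORIZONTAL_RULE = re.compile(r"-{4,}")


def classify_lines(lines):
    """Label each line, trying the special block before plain text."""
    labels = []
    for line in lines:
        if HORIZONTAL_RULE.match(line):
            labels.append("horizontal-rule")
        else:
            labels.append("line-of-text")
    return labels


print(classify_lines(["abc", "----"]))
print(classify_lines(["abc", "---"]))
```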

If a paragraph starts with a newline, the newline is rendered as a &lt;br&gt; element.

Formatting
Bold/italics is the biggest problem with switching to a consume-parse-render parser. It will not be possible to describe the current, extremely esoteric rules in simple (E)BNF. The best we can hope for is to store tokens representing the apostrophe clumps and do a second pass to make more sense of them. It would be very useful to define a second, unambiguous set of formatting syntax (most likely // and **), and encourage people to use those wherever apostrophes and bold/italics meet.

It is currently possible to do this: <tt>'''''Some text in bold-italics''' followed by just italics'' followed by normal text.</tt>

The grammar and parser will be simpler if we disallow this.

<bold-italic-toggle>     ::= "'''''"
<bold-toggle>            ::= "'''"
<italic-toggle>          ::= "''"
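
The first pass suggested above, storing tokens for the apostrophe clumps, might look like the sketch below. It assumes the usual apostrophe counts (two for italic, three for bold, five for bold-italic); the function name and the <tt>apostrophe-clump</tt> label for other run lengths are hypothetical, and deciding what those clumps mean is left to the second pass.

```python
import re

# Run lengths that map directly onto the toggles in the grammar.
TOGGLES = {2: "italic-toggle", 3: "bold-toggle", 5: "bold-italic-toggle"}


def tokenize_apostrophes(text):
    """Split text into plain text and apostrophe-clump tokens."""
    tokens = []
    # The capturing group makes re.split keep the clumps in the result.
    for piece in re.split(r"('{2,})", text):
        if not piece:
            continue
        if piece.startswith("''"):
            kind = TOGGLES.get(len(piece), "apostrophe-clump")
            tokens.append((kind, piece))
        else:
            tokens.append(("text", piece))
    return tokens


for token in tokenize_apostrophes("'''''bold-italics''' italics'' normal"):
    print(token)
```

A single apostrophe, as in "it's", stays inside the surrounding text token.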

Inline HTML
A decision has to be made here on whether to attempt to parse these things as a matched set, or whether to leave that to a later pass.

<InlineHTML>             ::=  BoldHTML | ItalicHTML | UnderlineHTML | Superscript | Subscript | Strikethrough | " " | Small | Big | Code | Span ;

In particular, span is a bit different from the others, as this doesn't work: Some text

The same code does work with &lt;B> and others.

Block HTML
(not referred to yet)

BlockHTML = Pre | Blockquote | TableHTML | Div | HeaderHTML ;

String Types
''This text came from Meta-Wiki. It's not immediately compatible with the surrounding text (it's EBNF, rather than BNF, for a start). However it is much more precise about the nature of lines and captures rules about whitespace normalisation.''

Fundamental strings

WikiMarkupCharacters = "|" | "[" | "]" | "*" | "#" | ":" | ";" | "<" | ">" | "=" | "'" | "{" | "}" ;

UnicodeCharacter = ? all supported Unicode characters ? - WhiteSpaces ;
UnicodeWiki = UnicodeCharacter - WikiMarkupCharacters ;
PlainText = UnicodeWiki
          | "  " { "|" | "[" | "]" | "<" | ">" | "{" | "}" } "   "
          | UnicodeWiki { " " } ( "*" | "#" | ":" | ";" )
          | UnicodeWiki [ " " ] "=" [ " " ] UnicodeWiki
          | UnicodeWiki "'"
          | " '" UnicodeWiki ;
WhiteSpaces = " " | NewLine | ? carriage return ? | ? line feed ? | ? tab ? | ? variants of spaces ? ;
NewLine = ? carriage return and line feed ? ;

Article strings

Line = PlainText { PlainText } { " " { " " } PlainText { PlainText } } ;
Text = Line { Line } { NewLine { NewLine } Line { Line } } ;

Titles

PageName = TitleCharacter, { [ " " ] TitleCharacter } ;
PageNameLink = TitleCharacter, { [ " " | "_" ] TitleCharacter } ;
SectionTitle = ( SectionLinkCharacter - "=" ) { [ " " ] ( SectionLinkCharacter - "=" ) } ;
SectionLink = SectionLinkCharacter { [ "_" ] SectionLinkCharacter } ;
LinkTitle = { UnicodeCharacter { " " } } ( UnicodeCharacter - "]" ) ;

TitleCharacter = UnicodeCharacter - BadTitleCharacters ;
BadTitleCharacters = "[" | "]" | "{" | "}" | "<" | ">" | "_" | "|" | "#" ;
SectionLinkCharacter = UnicodeCharacter - BadSectionLinkCharacters ;
BadSectionLinkCharacters = "[" | "]" | "|" ;
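
As a rough check of the PageName and PageNameLink rules, they can be turned into regular expressions. This is a sketch under assumptions: the function names are hypothetical, and Python's <tt>\s</tt> class stands in for the WhiteSpaces rule.

```python
import re

BAD_TITLE_CHARACTERS = "[]{}<>_|#"

# TitleCharacter: any non-whitespace character that is not a bad title character.
TITLE_CHAR = r"[^\s" + re.escape(BAD_TITLE_CHARACTERS) + r"]"

# PageName allows at most one space between title characters;
# PageNameLink additionally allows a single underscore as a separator.
PAGE_NAME = re.compile(TITLE_CHAR + r"(?: ?" + TITLE_CHAR + r")*")
PAGE_NAME_LINK = re.compile(TITLE_CHAR + r"(?:[ _]?" + TITLE_CHAR + r")*")


def is_page_name(title):
    return PAGE_NAME.fullmatch(title) is not None


def is_page_name_link(title):
    return PAGE_NAME_LINK.fullmatch(title) is not None


for candidate in ["Main Page", "Main  Page", "Main_Page", "Foo#Bar"]:
    print(repr(candidate), is_page_name(candidate), is_page_name_link(candidate))
```

Note that, as written, the rules reject leading, trailing and doubled spaces rather than normalising them.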

Magic words
Not to be confused with magic links. These seem to be usable virtually anywhere: even a table of contents in an image caption works. See m:Help:Magic words.

<magic-word>              ::= <magicword-notoc> | <magicword-toc> | <magicword-noeditsection>
<magicword-toc>           ::= "__TOC__"
<magicword-notoc>         ::= "__NOTOC__"
<magicword-noeditsection> ::= "__NOEDITSECTION__"

Special block
Special blocks are things like itemized lists starting with <tt>*</tt>; they can only be specified at the start of a line and usually run until the end of the line.

<special-block>          ::= <horizontal-rule> | | <list-item> |   | <space-block> | ...

The dots need to be filled in.

Horizontal rule
A horizontal rule is specified by 4 or more dashes. It is translated to an &lt;hr&gt; element.

<horizontal-rule>        ::= "----" [ ] [<inline-text>]
                         ::= "-" [ ]

If the <tt>inline-text</tt> is present, it is not wrapped in a &lt;p&gt; element.

Heading
A level-n heading is translated to an &lt;hn&gt; element.

<heading>                ::= <level-6-heading> | <level-5-heading> | <level-4-heading>
                           | <level-3-heading> | <level-2-heading> | <level-1-heading>
<level-6-heading>        ::= "======" <inline-text> "======" <space-tabs>
<level-5-heading>        ::= "====="  <inline-text> "====="  <space-tabs>
<level-4-heading>        ::= "===="   <inline-text> "===="   <space-tabs>
<level-3-heading>        ::= "==="    <inline-text> "==="    <space-tabs>
<level-2-heading>        ::= "=="     <inline-text> "=="     <space-tabs>
<level-1-heading>        ::= "="      <inline-text> "="      <space-tabs>

The alternatives in the first rule need to be tried from left to right.

Some notes (as implied by the grammar):
 * An unterminated heading tag is treated as normal text.
 * Unbalanced tags are treated as the shorter of the two tags (i.e. ==== heading == renders as the level 2 heading == heading)
 * More than 6 = signs are treated as 6, with the extra symbols being included in the header.
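
The three notes above can be checked with a small sketch. It is not the real parser: the function name <tt>parse_heading</tt> is hypothetical, and trimming whitespace from the heading text is an assumption made for readability.

```python
import re


def parse_heading(line):
    """Return (level, text) for a heading line, or None for normal text."""
    # Trailing spaces and tabs are allowed after the closing tag.
    match = re.fullmatch(r"(=+)(.+?)(=+)", line.rstrip(" \t"))
    if match is None:
        return None  # unterminated heading tag: treated as normal text
    opening, body, closing = match.groups()
    # Unbalanced tags use the shorter one; more than 6 counts as 6.
    level = min(len(opening), len(closing), 6)
    # Surplus "=" signs stay in the heading text.
    text = "=" * (len(opening) - level) + body + "=" * (len(closing) - level)
    return level, text.strip()


print(parse_heading("==== heading =="))
print(parse_heading("======= x ======="))
print(parse_heading("== unterminated"))
```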

List item
<list-item>              ::= <indent-item> | <enumerated-item> | <bullet-item>
<indent-item>            ::= ":" [(<list-item> | <item-body>)]
<enumerated-item>        ::= "#" [(<list-item> | <item-body>)]
<bullet-item>            ::= "*" [(<list-item> | <item-body>)]
<item-body>              ::= <defined-term> | [ ] <inline-text>
<defined-term>           ::= ";" [ ]
                         ::= ":" <inline-text>

Semantics:
 * An <indent-item>, like the definition part of a <defined-term>, is translated to a &lt;dd> element wrapped in a &lt;dl>.
 * A <bullet-item> is translated to a &lt;li> element wrapped in a &lt;ul>.
 * An <enumerated-item> is translated to a &lt;li> element wrapped in a &lt;ol>.
 * A <defined-term> is translated to a &lt;dt> element wrapped in a &lt;dl>.

Notes:
 * The grouping of successive list items cannot be captured in EBNF. The simplest approach would appear to be a second pass whereby successive pairings of close/open list are eliminated. For example, <ol><li>Foo</li> </ol><ol> <li>Boo</li></ol> would be rewritten as <ol><li>Foo</li><li>Boo</li></ol>
 * <list-item> and <defined-term> are obviously matched in preference to <inline-text>. The user has to insert whitespace in order to get inline-text starting with #, ;, * or :.
 * The current parser accepts a wider range of syntax than the above, allowing other list items to appear after a definition list. This appears to be arbitrary, unpredictable and not particularly useful. See bug11894.
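
The second pass described in the first note could be sketched as a single substitution over the rendered HTML. The function name is hypothetical, and the sketch handles only flat lists; nested lists would probably need a pass over a real tree rather than a regular expression.

```python
import re

# A closing list tag directly followed by an opening tag of the same type,
# with optional whitespace around them.
ADJACENT_LISTS = re.compile(r"\s*</(ol|ul|dl)>\s*<\1>\s*")


def merge_adjacent_lists(html):
    """Remove close/open pairs so successive items share one list."""
    return ADJACENT_LISTS.sub("", html)


print(merge_adjacent_lists("<ol><li>Foo</li> </ol><ol> <li>Boo</li></ol>"))
```

Lists of different types, such as a &lt;ul> followed by an &lt;ol>, are left untouched because the backreference requires the same tag name.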

Table
From meta...minor reformatting

<Table>                  ::= "{|" [ " " TableParameters ] NewLine TableFirstRow "|}" ;
<TableFirstRow>          ::= TableColumnLine NewLine | TableColumnMultiLine | TableRow ;
<TableRow>               ::= "|-" [ CSS ] NewLine TableColumn [ TableRow ] ;
<TableColumn>            ::= TableColumnLine | TableColumnMultiLine ;
<TableColumnLine>        ::= "|" InlineText [ "|" TableColumnLine ] ;
<TableColumnMultiLine>   ::= "|" [ TableCellParameters "|" ] AnyText NewLine [ TableColumnMultiLine ] ;
<TableParameters>        ::= CSS | ? HTML table attributes ? ;
<TableCellParameters>    ::= CSS | ? HTML cell attributes ? ;

Space block
Starting a line with a space creates a pre-formatted block of text similar to using &lt;pre>. The big difference is that the contained text is still parsed and rendered normally.

<space-block>            ::= " " <inline-text> [ {<space-block-2>} ]
<space-block-2>          ::= " " [<inline-text>]


Rendering:
 * The block is surrounded with &lt;pre>. White space and newlines are preserved literally.
 * Note that the first line of a space block must have text in it. Subsequent lines can be composed of just spaces.
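
The grouping of lines into space blocks can be illustrated as follows. This is a sketch only: the function name and the <tt>"pre"</tt> / <tt>"other"</tt> labels are hypothetical, and the contained text is returned unparsed even though a real parser would still parse it as inline text.

```python
def split_space_blocks(lines):
    """Group consecutive space-prefixed lines into ("pre", [...]) blocks."""
    blocks = []
    current = None  # lines of the space block being built, if any
    for line in lines:
        starts_with_space = line.startswith(" ")
        has_text = line.strip() != ""
        if current is not None and starts_with_space:
            # Later lines may consist of spaces only.
            current.append(line[1:])
        elif starts_with_space and has_text:
            # The first line of a block must contain text.
            current = [line[1:]]
            blocks.append(("pre", current))
        else:
            current = None
            blocks.append(("other", line))
    return blocks


print(split_space_blocks([" code", "  ", " more", "text"]))
```

The leading space is removed from each line, while the remaining whitespace is kept literally.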

Images, media, gallery
Links to images and media should be handled as normal links. It's inline images and media that are being dealt with here. From meta...minor reformatting

Images
ImageInline               ::= "" ;
ImageExtension            ::= "jpg" | "jpeg" | "png" | "svg" | "gif" | "bmp" ;
ThumbImageParameter       ::= "thumb" | "frame" | "enframed" | "thumbnail" ;
SizeImageParameter        ::= PositiveNumber "px" ;
AlignImageParameter       ::= "left" | "center" | "centre" | "right" ;
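
A small sketch of how a single image parameter could be matched against the three parameter rules. The function name is hypothetical, and PositiveNumber is assumed to mean a decimal integer without leading zeros, since the grammar does not define it here.

```python
import re

THUMB_PARAMETERS = {"thumb", "frame", "enframed", "thumbnail"}
ALIGN_PARAMETERS = {"left", "center", "centre", "right"}
# Assumption: PositiveNumber is a decimal integer greater than zero.
SIZE_PARAMETER = re.compile(r"[1-9][0-9]*px")


def classify_image_parameter(parameter):
    """Return the name of the matching rule, or None if no rule matches."""
    if parameter in THUMB_PARAMETERS:
        return "ThumbImageParameter"
    if parameter in ALIGN_PARAMETERS:
        return "AlignImageParameter"
    if SIZE_PARAMETER.fullmatch(parameter):
        return "SizeImageParameter"
    return None


for parameter in ["thumb", "200px", "centre", "0px"]:
    print(parameter, classify_image_parameter(parameter))
```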

Media
MediaInline              ::= "", "Media:" , PageName "." MediaExtension "" ;
MediaExtension           ::= "ogg" | "wav" ;

Gallery
GalleryBlock              ::= "" ;
GalleryImage              ::= (to be defined: essentially   foo.jpg[|caption] )

Remarks:
 * The gallery block can technically be used in the middle of a sentence so is not a "special block". It doesn't render particularly nicely when you do that though.