Markup spec/BNF/Article

Wiki-page
The top-level element is wiki-page which describes the contents of a page. A page can either be a redirect or a normal article.

              ::= [ ] | [ ] ::=    ( | | EOL)            ::= FROM_LANGUAGE_FILE

, ,  and are defined in ../Links/

Notes:

The  is language-specific, and may have more than one possible value. By default the value for the right-hand-side of the expression (replacing FROM_LANGUAGE_FILE) is, but in Estonian it is. This match is case-insensitive (though this again may be overridden in the language file).

should be non-greedy, matching the largest subset of characters that does not contain .

For example,   will match the following, and treat it as a redirect to foo:
 * <tt>#REDireCTnon%^sense[[foo|and this is parsed as article content</tt>


 * Interwiki prefixes may not be supported in redirect links. (Is this configurable?)
 * The text following the redirect link is not rendered. However, it is parsed, so interwiki links, category links and even normal links are still processed and behave "normally".
 * Anchors (Article#Section) are supported, but not yet described in the grammar.
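The redirect rule can be sketched roughly as a regular expression. This is only an illustration under stated assumptions: the keyword <tt>#REDIRECT</tt> stands in for the value from the language file, the ignored text is assumed to stay on the first line, and the names <tt>REDIRECT_KEYWORD</tt>, <tt>REDIRECT_RE</tt> and <tt>redirect_target</tt> are hypothetical. Anchors are not split off, since the grammar does not describe them yet.

```python
import re

# Assumption: stand-in for the language-file value (FROM_LANGUAGE_FILE).
REDIRECT_KEYWORD = "#REDIRECT"

REDIRECT_RE = re.compile(
    re.escape(REDIRECT_KEYWORD)  # keyword, matched case-insensitively
    + r"[^\n]*?"                 # non-greedy: as little text as possible before the link
    + r"\[\[([^\]|\n]+)",        # link target, ending at "|" or "]]"
    re.IGNORECASE,
)


def redirect_target(page_text):
    """Return the redirect target, or None if the page is a normal article."""
    match = REDIRECT_RE.match(page_text)
    return match.group(1).strip() if match else None
```

With these assumptions, <tt>redirect_target("#REDireCTnon%^sense[[foo|bar]] more")</tt> yields <tt>foo</tt>, and a page that does not start with the keyword yields <tt>None</tt>.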

Article
This describes the contents of an article. An article consists of blocks, which come in two flavours: paragraphs and special blocks. Both of them end with a newline. Paragraphs are separated by empty lines.

<special-block-and-more> ::= <special-block> ( EOF | [ ] <special-block-and-more> | ( | "") <paragraph-and-more> )
<paragraph-and-more>     ::= ( EOF | [ ] <special-block-and-more> | <paragraph-and-more> )

The nonterminals <tt>special-block-and-more</tt> and <tt>paragraph-and-more</tt> are not disjoint; the parser should first try to match against <tt>special-block-and-more</tt>.

The expression <tt>( | "")</tt> is a greedy version of <tt>[ ]</tt>. If both the empty string and a newline can be matched, then the former expression matches the newline, while the latter expression would match the empty string according to the conventions on ../.


 * For the definition of special block, see ../Special block.

Paragraph
Every paragraph ends with a newline character. A paragraph is translated into a &lt;p&gt; element.

 ::= [<lines-of-text>] | <lines-of-text>
<lines-of-text>          ::= <line-of-text> [<lines-of-text>]
<line-of-text>           ::= <inline-text>
<inline-text>            ::= <inline-element> [<inline-text>]
<inline-element>         ::= | <magic-link> | <nowiki-tag> | <image-inline> | <gallery-block> | <media-inline> | | | | <magic-word> | <html-block> | <math-block> | <pre-block> | <inline-html> | (more missing)...
 ::= ( - ) [ ]

In the penultimate rule, <tt>link</tt>, <tt>magic-link</tt> and <tt>nowiki-tag</tt> are described in ../Links/, ../Magic links/ and ../Nowiki/, respectively. Math-block, html-block and pre-block are also specified on the nowiki page. Again, <tt>link</tt> and <tt>text</tt> are not disjoint; the parser should try <tt>text</tt> last.

The recursion in the second rule should be non-greedy, i.e., it should match as few lines as possible. For instance,
 * <tt>abc</tt>
 * <tt>----</tt>

should be parsed as one <tt>line-of-text</tt> and one <tt>horizontal-rule</tt>, but
 * <tt>abc</tt>
 * <tt>---</tt>

should be parsed as two <tt>line-of-text</tt> nonterminals.
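The distinction can be illustrated with a small sketch that tries the special block first and falls back to <tt>line-of-text</tt>. It assumes that a horizontal rule is a line starting with four or more hyphens (the real definition lives on ../Special block), and the function names are hypothetical.

```python
def classify_line(line):
    """Classify one line, trying the special block before plain text."""
    # Assumption: four or more leading hyphens make a horizontal rule.
    if line.startswith("----"):
        return "horizontal-rule"
    return "line-of-text"


def classify_lines(text):
    """Classify every line of a block of text."""
    return [classify_line(line) for line in text.split("\n")]
```

Under this assumption, <tt>classify_lines("abc\n----")</tt> gives one <tt>line-of-text</tt> followed by one <tt>horizontal-rule</tt>, while <tt>classify_lines("abc\n---")</tt> gives two <tt>line-of-text</tt> entries.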

If a paragraph starts with a newline, the newline is rendered as a &lt;br&gt; element.

Formatting
Bold/italics is the biggest problem with switching to a consume-parse-render parser. It will not be possible to describe the current, extremely esoteric rules in simple (E)BNF. The best we can hope for is to store tokens representing the apostrophe clumps and do a second pass to make more sense of them. It would be very useful to define a second, unambiguous set of formatting syntax (most likely // and **), and encourage people to use those wherever apostrophes and bold/italics meet.

It is currently possible to do this: Some text in bold-italics followed by just italics followed by normal text.

The grammar and parser will be simpler if we disallow this.

<bold-italic-toggle>     ::= "'''''"
<bold-toggle>            ::= "'''"
<italic-toggle>          ::= "''"
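The first pass suggested above (store tokens for the apostrophe clumps, make sense of them later) might look like the following sketch. It assumes that runs of two, three and five apostrophes are the italic, bold and bold-italic toggles; any other run length is kept as an unresolved clump for the second pass. The names <tt>tokenize_formatting</tt> and <tt>apostrophe-clump</tt> are hypothetical.

```python
import re

# Runs of two or more apostrophes; a single apostrophe is ordinary text.
APOSTROPHE_RUN = re.compile(r"'{2,}")

# Assumption: 2 = italic, 3 = bold, 5 = bold-italic.
TOGGLES = {2: "italic-toggle", 3: "bold-toggle", 5: "bold-italic-toggle"}


def tokenize_formatting(line):
    """Split a line into ("text", ...) tokens and apostrophe-run tokens."""
    tokens = []
    pos = 0
    for match in APOSTROPHE_RUN.finditer(line):
        if match.start() > pos:
            tokens.append(("text", line[pos:match.start()]))
        run = match.group()
        # Other run lengths are left for the second pass to resolve.
        tokens.append((TOGGLES.get(len(run), "apostrophe-clump"), run))
        pos = match.end()
    if pos < len(line):
        tokens.append(("text", line[pos:]))
    return tokens
```

The sketch deliberately does no pairing or nesting; that is the job of the second pass.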

Inline HTML
The parser recognises and cleans a large number of HTML tags, as defined in Sanitizer.php.

A decision has to be made here on whether to attempt to parse these things as a matched set, or whether to leave that to a later pass.

A loose definition:

<InlineHTML>             ::=  "<" <InlineHTMLtagname> [characters - ">"] ">"


 * The list of "tags that must be closed":
 * b, del, i, ins, u, font, big, small, sub, sup, h1, h2, h3, h4, h5, h6, cite, code, em, s,
 * strike, strong, tt, var, div, center, blockquote, ol, ul, dl, table, caption, pre,
 * ruby, rt, rb, rp, p, span


 * Tags that can appear singly or paired:
 * br, hr, li, dt, dd


 * Tags that must not be paired: br, hr


 * Tags that can be nested (source code is dubious on this):
 * table, tr, td, th, div, blockquote, ol, ul, dl, font, big, small, sub, sup, span


 * Tags that can only appear inside a table:
 * td, th, tr


 * Tags that make lists: ul, ol


 * And tags that can appear inside lists: li
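A parser could hold these groupings as plain lookup sets. The sketch below is only an illustration: the group labels, the <tt>tag_groups</tt> helper and the tag-name pattern in <tt>INLINE_HTML</tt> are assumptions, not taken from Sanitizer.php.

```python
import re

GROUPS = {
    "must-be-closed": set(
        "b del i ins u font big small sub sup h1 h2 h3 h4 h5 h6 cite code em s "
        "strike strong tt var div center blockquote ol ul dl table caption pre "
        "ruby rt rb rp p span".split()
    ),
    "single-or-paired": {"br", "hr", "li", "dt", "dd"},
    "must-not-be-paired": {"br", "hr"},
    "nestable": set(
        "table tr td th div blockquote ol ul dl font big small sub sup span".split()
    ),
    "table-only": {"td", "th", "tr"},
    "list-maker": {"ul", "ol"},
    "list-item": {"li"},
}

# Loose definition: "<", a tag name, any characters except ">", then ">".
# Assumption: tag names are ASCII letters followed by letters or digits.
INLINE_HTML = re.compile(r"<([A-Za-z][A-Za-z0-9]*)([^>]*)>")


def tag_groups(tag):
    """Return the sorted names of every group the tag belongs to."""
    name = tag.lower()
    return sorted(group for group, members in GROUPS.items() if name in members)
```

For example, <tt>tag_groups("span")</tt> reports both <tt>must-be-closed</tt> and <tt>nestable</tt>.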

The significance of these groupings is shown as follows:

A "B C" D E

Here, blockquote and span are both "nesting" tags. When the close-blockquote tag is found inside the span block, it is escaped.

This doesn't work: Some text But this does: '''Some text

Block HTML
(not referred to yet)

BlockHTML = Pre | Blockquote | TableHTML | Div | HeaderHTML ;

String Types
''This text came from Meta-Wiki. It's not immediately compatible with the surrounding text (it's EBNF, rather than BNF, for a start). However it is much more precise about the nature of lines and captures rules about whitespace normalisation.''

Fundamental strings

WikiMarkupCharacters = "|" | "[" | "]" | "*" | "#" | ":" | ";" | "<" | ">" | "=" | "'" | "{" | "}" ;

UnicodeCharacter = ? all supported Unicode characters ? - WhiteSpaces ;
UnicodeWiki = UnicodeCharacter - WikiMarkupCharacters ;
PlainText = UnicodeWiki
          | "  " { "|" | "[" | "]" | "<" | ">" | "{" | "}" } "   "
          | UnicodeWiki { " " } ( "*" | "#" | ":" | ";" )
          | UnicodeWiki [ " " ] "=" [ " " ] UnicodeWiki
          | UnicodeWiki "'"
          | " '" UnicodeWiki ;
WhiteSpaces = " " | NewLine | ? carriage return ? | ? line feed ? | ? tab ? | ? variants of spaces ? ;
NewLine = ? carriage return and line feed ? ;

Article strings

Line = PlainText { PlainText } { " " { " " } PlainText { PlainText } } ;
Text = Line { Line } { NewLine { NewLine } Line { Line } } ;

Titles

PageName = TitleCharacter, { [ " " ] TitleCharacter } ;
PageNameLink = TitleCharacter, { [ " " | "_" ] TitleCharacter } ;
SectionTitle = ( SectionLinkCharacter - "=" ) { [ " " ] ( SectionLinkCharacter - "=" ) } ;
SectionLink = SectionLinkCharacter { [ "_" ] SectionLinkCharacter } ;
LinkTitle = { UnicodeCharacter { " " } } ( UnicodeCharacter - "]" ) ;

TitleCharacter = UnicodeCharacter - BadTitleCharacters ;
BadTitleCharacters = "[" | "]" | "{" | "}" | "<" | ">" | "_" | "|" | "#" ;
SectionLinkCharacter = UnicodeCharacter - BadSectionLinkCharacters ;
BadSectionLinkCharacters = "[" | "]" | "|" ;
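As a rough check of how the PageName and PageNameLink rules behave, here is a small sketch. It assumes that Python's <tt>str.isspace</tt> is an acceptable approximation of the whitespace set excluded from UnicodeCharacter; the function names and the <tt>separators</tt> parameter are hypothetical.

```python
BAD_TITLE_CHARACTERS = set("[]{}<>_|#")


def is_title_character(ch):
    """TitleCharacter = UnicodeCharacter - BadTitleCharacters."""
    # Assumption: str.isspace() approximates the excluded whitespace set.
    return not ch.isspace() and ch not in BAD_TITLE_CHARACTERS


def is_page_name(s, separators=" "):
    """Check PageName; pass separators=" _" to check PageNameLink instead."""
    if not s or not is_title_character(s[0]):
        return False
    previous_was_separator = False
    for ch in s[1:]:
        if is_title_character(ch):
            previous_was_separator = False
        elif ch in separators and not previous_was_separator:
            # At most one separator between two title characters.
            previous_was_separator = True
        else:
            return False
    # A name must end with a title character, not a separator.
    return not previous_was_separator
```

Under these rules "Main Page" is a PageName, while "Main_Page" is only a PageNameLink, and doubled or trailing spaces are rejected.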

Magic words
Not to be confused with magic links. These seem to be usable virtually anywhere: even a table of contents in an image caption works. See m:Help:Magic words.

<magic-word>              ::= <magicword-notoc> | <magicword-toc> | <magicword-noeditsection>
<magicword-toc>           ::= "__TOC__"
<magicword-notoc>         ::= "__NOTOC__"
<magicword-noeditsection> ::= "__NOEDITSECTION__"

Images, media, gallery
Links to images and media should be handled as normal links. It's inline images and media that are being dealt with here. From meta...minor reformatting

Images
ImageInline               ::= "" ;
ImageExtension            ::= "jpg" | "jpeg" | "png" | "svg" | "gif" | "bmp" ;
ThumbImageParameter       ::= "thumb" | "frame" | "enframed" | "thumbnail" ;
SizeImageParameter        ::= PositiveNumber "px" ;
AlignImageParameter       ::=  "left" | "center" | "centre" | "right" ;
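The three parameter rules are simple enough to check directly. The sketch below assumes that PositiveNumber (not defined on this page) means a run of decimal digits with a value greater than zero, and that parameters are matched exactly as written; <tt>classify_image_parameter</tt> is a hypothetical name.

```python
import re

THUMB_PARAMETERS = {"thumb", "frame", "enframed", "thumbnail"}
ALIGN_PARAMETERS = {"left", "center", "centre", "right"}

# Assumption: PositiveNumber is decimal digits containing a non-zero digit.
SIZE_PARAMETER = re.compile(r"[0-9]*[1-9][0-9]*px")


def classify_image_parameter(param):
    """Return the name of the rule the parameter matches, or None."""
    if param in THUMB_PARAMETERS:
        return "ThumbImageParameter"
    if param in ALIGN_PARAMETERS:
        return "AlignImageParameter"
    if SIZE_PARAMETER.fullmatch(param):
        return "SizeImageParameter"
    return None
```

Anything that matches none of the three rules, such as a caption, comes back as <tt>None</tt>.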

Media
MediaInline              ::=  "", "Media:" , PageName "." MediaExtension "" ;
MediaExtension = "ogg" | "wav" ;

Gallery
GalleryBlock              ::=   "" ;
GalleryImage              ::=   (to be defined: essentially   foo.jpg[|caption] )

Remarks:
 * The gallery block can technically be used in the middle of a sentence so is not a "special block". It doesn't render particularly nicely when you do that though.