Markup spec/BNF/Inline text

"Inline text" is the guts of Wikitext formatting. It covers every situation where "normal" text is allowed, such as image captions, table data, and to an extent (yet to work out how to enforce this restriction...), headings and link text.

            ::=  []          ::= |  |  |  |  |  |  |  | 

   ::= |                             |  |                             |  |  | <close-guillemet> | <html-entity> | <html-unsafe-symbol> |                             | <random-character> | (more missing?)... ::= { <harmless-character> }+

Detail:
 * noparse-block: ../Noparse-block
 * link: ../Links/
 * magic-link: ../Magic links/

The parser should try the options in order, with <tt>text</tt> matching if all else fails. In particular, <category-link> should be matched before because a category is a special type of link, and we don't want the normal parsing to occur.

HTML entity
The parser recognises validly constructed HTML entities and leaves them alone. <html-entity>          ::= "&" <html-entity-name> ";" | "&#" <decimal-number> ";" | "&#x" <hex-number> ";" <html-entity-name>     ::= Sanitizer::$wgHtmlEntities (case sensitive) (* "Aacute" | "aacute" | ... *)

Rendering

 * The whole sequence is output literally.
 * The current parser does very complicated things with escaping and de-escaping. So maybe there are places where something more complicated needs to happen.

HTML unsafe symbol
These "unsafe" symbols are turned into HTML entities if they haven't matched part of a valid HTML entity above. It's probably not too efficient having single-character level matching rules...perhaps should be combined with "text". <html-unsafe-symbol>    ::= <unescaped-ampersand> | <unespaced-less-than> | <unescaped-greater-than> <unescaped-ampersand>   ::= "&" <unescaped-less-than>   ::= "<" <unescaped-greater-than> ::= ">"

Rendering

 * <unescaped-ampersand> &rarr;
 * <unescaped-less-than> &rarr;
 * <unescaped-greater-than> &rarr;

Text
Harmless-characters mean characters that couldn't be anything else. I'm not sure how useful this is as a distinction, but perhaps it will help speed things up?

A "random character" is any character which hasn't matched anything else.

<harmless-characters>   ::= /[A-Za-z0-9] etc <random-character>      ::= ? any character ... ?

Rendering
Both types are written literally.

'''This section from the "fundamental elements" section...time to mangle!



::= [ ] <space-tabs>           ::= <space-tab> [<space-tabs>]

<whitespace-char>      ::= <space-tab> |

<space-tab>            ::= | TAB ::= [ ]                ::= " "

<EOL>                  ::= | EOF

<non-whitespace-char>  ::= | <decimal-digit> | ::= <ucase-letter> | <lcase-letter> <ucase-letter>         ::= "A" | "B" | ... | "Y" | "Z" <lcase-letter>         ::= "a" | "b" | ... | "y" | "z" ::= <html-unsafe-symbol> | | "." | "," | ...



<decimal-number>       ::= <decimal-digit> [<decimal-number>] <decimal-digit>        ::= "0" | "1" | ... | "8" | "9"

<hex-number>           ::= <hex-digit> [<hex-number>] <hex-digit>            ::= <decimal-digit> | "A" | "B" | "C" | "D" | "E" | "F" | "a" | "b" | "c" | "d" | "e" | "f"

Formatting
Bold/italics is the biggest problem with switching to a consume-parse-render parser. It will not be possible to describe the current, extremely esoteric rules in simple (E)BNF. The best we can hope for is to store tokens representing the apostrophe clumps and do a second pass to make more sense of them. It would be very useful to define a second, unambiguous set of formatting syntax (most likely // and **), and encourage people to use those wherever apostrophes and bold/italics meet.

Some rules for parsing bold/italics as recognised by the current parser. These must be implemented (Brion said so). In increasing order of complexity:
 * 1) italics, bold, bold-italics
 * italics, bold, bold-italics
 * 1) bold-italics just italics normal
 * bold-italics just italics normal
 * 1) Some text about l'Arc de triomphe.
 * Some text about l'Arc de triomphe.
 * 1) However: bold is l'Arc de triomphe'''.
 * However: bold is l'Arc de triomphe'''.

Optimistic view: ::= <bold-italic-toggle> | <bold-toggle> | <italic-toggle> <bold-italic-toggle>     ::= "'" <bold-toggle>            ::= "'''" <italic-toggle>          ::= "''"

Reality: ::= <apostrophe-jungle> <apostrophe-jungle>      ::= "''" { "'" }

Rendering

 * Once the parser has decided which way the toggles go:
 * bold-toggle-on -> &lt;b>
 * bold-toggle-off -> &lt;/b>
 * italic-toggle-on -> &lt;i>
 * italic-toggle-off -> &lt;/i>
 * bold-italic-toggle-on -> &lt;b> &lt;i>
 * bold-italic-toggle-off-> &lt;/i> &lt;/b>

Determining the behaviour of apostrophes
The following describes the behaviour of repeated postrophes. "Bold" means "toggle bold", rather than "turn bold on". "Bold, italics" means "Toggle bold and italics independently", rather than "turn bold and italics on" or "toggle bold and italics the same way".


 * One : Always a single apostrophe.
 * e.g.  &rarr; hello ' blah
 * Two : Always italics on or off
 * e.g.  &rarr; hello '' blah
 * Three :
 * Bold (default)
 * e.g.  &rarr; hello ''' blah
 * Apostrophe, italics
 * If there is otherwise an odd number of both bold and italics
 * If the preceding characters are <non-space> (and there are no earlier such sequences)
 * e.g.  &rarr; hello lamour louest' blah
 * Else if the preceding characters are <non-space><non-space> (and there are no earlier such sequences)
 * e.g.  &rarr; hello mon'amour blah
 * Else (the preceding character is ) (and there are no earlier such sequences)
 * e.g.  &rarr; hello amour blah 'blah
 * Italics, apostrophe (never)
 * Four :
 * Bold, apostrophe (never)
 * Apostrophe, bold (default, if either bold or italics ends up balanced)
 * e.g.  &rarr; hello 'amour now ''italics unbalanced, but that's ok
 * e.g.  &rarr; hello 'amour now, '''bold unbalanced, but that's ok
 * Apostrophe, apostrophe, italics
 * If the default treatment leads to an odd number of bold and italics then this can meet condition 1 under the second case of three italics, above.
 * e.g.  &rarr; hello 'amour' now bold and italics unbalanced, so invoke this special case
 * Five :
 * Bold, italics; or italics, bold (default, the two cases are equivalent)
 * e.g.  &rarr; hello ' blah
 * Italics, apostrophe, italics (never)
 * More than five:
 * Apostrophes, bold+italics (default)
 * e.g.  &rarr; hello  blah
 * e.g.  &rarr; hello '''bold  blah
 * Bold+italics, apostrophes (never)

Inline HTML
The parser recognises and cleans a large number of HTML tags, as defined in Sanitizer.php.

A decision has to be made here on whether to attempt to parse these things as a matched set, or whether to leave that to a later pass.

A loose definition assuming they are treated individually: <InlineHTML>             ::=  <InlineHTML-Open> | <InlineHTML-Close> | <InlineHTML-OpenClose> | <HTMLComment> <InlineHTML-Open>        ::=  "<" <InlineHTMLtagname> [<extra-characters>] ">" <InlineHTML-Close>       ::=  "</" <InlineHTMLtagname> [<extra-characters>] ">" <InlineHTML-OpenClose>   ::=  "<" <InlineHTMLtagname> [<extra-characters>] "/>" <extra-characters>     ::= <word-boundary-char> {characters - ">"} <word-boundary-char>     ::= " " | "-" | ":" | " " | "\"" | "/" | "*" | "#" | "!" | "$" | "%" | ...


 * Remarks:
 * The range of "word-boundary-char" seems to be an artefact of the regular expression:


 * The list of "tags that must be closed" :
 * b, del, i, ins, u, font, big, small, sub, sup, h1 h2, h3, h4, h5, h6, cite, code, em, s,
 * strike, strong, tt, var, div, center, blockquote, ol, ul, dl, table, caption, pre,
 * ruby, rt, rb , rp, p, span, u


 * Tags that can appear singly or paired
 * br, hr, li, dt, dd


 * Tags that must not be paired:br, hr


 * Tags that can be nested (source code is dubious on this)
 * table, tr, td, th, div, blockquote, ol, ul, dl, font, big, small, sub, sup, span


 * Tags that can only appear inside a table:
 * td, th, tr,


 * Tags that make lists:ul,ol,


 * And tags that can appear inside lists:li

The significance of these groupings is shown as follows: A "B C" D E Here, blockquote and span are both "nesting" tags. When the close-blockquote tag is found inside the span block, it is escaped.

This doesn't work: Some text But this does: '''Some text

Rendering

 * Tags that have to be paired are forced closed according to some sort of logic.
 * <extra-characters> are "sanitized", strip all but pre-approved attributes and styles on a whitelist.
 * Tags are then written out literally:  etc.
 * HTML comments are completely discarded, with some whitespace massaging: (sanitizer.php)
 * To avoid leaving blank lines, when a comment is both preceded and followed by a newline (ignoring spaces), trim leading and trailing spaces and one of the newlines.

Open/close guillemet
This is pretty trivial and used to improve the appearance of quotations in European languages, notably French. Performed directly in the parse method. <open-guillemet>         ::= "&laquo;" " " <close-guillemet>        ::= <non-space-character> "&raquo;" " "

Rendering

 * In both cases, the space is converted to a "&amp;nbsp;" string.
 * In the second, the non-space character is presumably rendered literally. This is probably not a good way to express the original regular expression:

Magic words
Not to be confused with magic links. These seem to be able to be used virtually anywhere: a table of contents in an image caption even works. See m:Help:Magic words. <magic-word>              ::= <magicword-toc> | <magicword-forcetoc> | <magicword-notoc> | <magicword-noeditsection> | <magicword-nogallery> <magicword-toc>           ::= mw("toc") <magicword-forcetoc>      ::= mw("forcetoc") <magicword-notoc>         ::= mw("notoc") <magicword-noeditsection> ::= mw("noeditsection") <magicword-nogallery>     ::= mw("nogallery") <magicword-defaultsort>   ::=, mw("defaultsort"), <defaultsort-key>,

(* I don't really get how these work... *) ::= ""

(* defaults, i->case insensitive, s->case sensitive *) mw("notoc")               ::= ""i mw("forcetoc")            ::= ""i mw("notoc")               ::= ""i mw("noeditsection")       ::= ""i mw("nogallery")           ::= "__NOGALLERY__"i mw("defaultsort")         ::= "DEFAULTSORT:"s | "DEFAULTSORTKEY:"s | "DEFAULTCATEGORYSORT:"s

Notes:
 * These are the "default" strings to be matched. They can be modified in  where Xx is the language.
 * Each magicword can have more than one string associated with it.
 * The magic words are by default case insensitive but this can be changed in the file.
 * Plenty of other "magic words" exist, including "magic variables" (eg August ) which will be handled by the preprocessor. However it looks like all sorts of other "magic words" exist and are processed in different places.

Semantics

 * magicword-toc: a miniature contents page will be rendered and inserted at the first instance of this token.
 * magicword-forcetoc: a contents box will be rendered even if the normal criteria (typically, 4 sections) have not been met. Irrelevant if magicword-toc is present.
 * magicword-notoc: no miniature contents pages will be rendered. Only takes effect if neither magicword-toc nor magicword-forcetoc are present.
 * magicword-noeditsection: no edit links are to be displayed for any sections.
 * magicword-nogallery: unclear. According to the code (parser::stripNoGallery): if the string __NOGALLERY__ (not case-sensitive) occurs in the HTML, do not add TOC. Perhaps it only has an effect in certain namespaces.
 * magicword-defaultsort: sets the default sort key if this page is added to any categories. Overruled by manual sort keys for any category inclusions.

Images, media, gallery
Links to images and media should be handled as normal links. It's inline images and media that are being dealt with here.

Originally from MetaWiki.

Images
ImageInline               ::= "" ; ImageName                 ::= PageName, ".", ImageExtension ImageExtension            ::= "jpg" | "jpeg" | "png" | "svg" | "gif" | "bmp" ; ImageOption               ::= ImageModeParameter | ImageSizeParameter | ImageAlignParameter | ImageVAlignParameter | Caption

ImageModeParameter        ::= ImageModeManualThumb | ImageModeThumb | ImageModeFrame | ImageModeFrameless

ImageModeManualThumb      ::= mw("img_manualthumb"); ImageModeAutoThumb        ::= mw("img_thumbnail"); ImageModeFrame            ::= mw("img_frame"); ImageModeFrameless        ::= mw("img_frameless");

/* Default settings: */ mw("img_manualthumb")     ::= "thumbnail=", ImageName | "thumb=", ImageName mw("img_thumbnail")       ::= "thumbnail" | "thumb"; mw("img_frame")           ::= "framed" | "enframed" | "frame"; mw("img_frameless")       ::= "frameless";

ImageOtherParameter       ::= ImageParamPage | ImageParamUpright | ImageParamBorder ImageParamPage            ::= mw("img_page") ImageParamUpgright        ::= mw("img_upright") ImageParamBorder          ::= mw("img_border")

/* Default settings: */ mw("img_page")            ::= "page=$1" | "page $1"  ??? (where is this used?) mw("img_upright")         ::= "upright" [, ["=",] PositiveInteger] mw("img_border")          ::= "border"

ImageSizeParameter        ::= mw("img_width"); /* Default setting: */ mw("img_width")           ::= PositiveNumber "px" ;

ImageAlignParameter       ::= ImageAlignLeft | ImageAlign|Center | ImageAlignRight | ImageAlignNone ImageAlignLeft            ::= mw("img_left") ImageAlignCenter          ::= mw("img_center") ImageAlignRight           ::= mw("img_right") ImageAlignNone            ::= mw("img_none")

/* Default settings: */ mw("img_left")            ::= "left" mw("img_center")          ::= "center" | "centre" mw("img_right")           ::= "right" mw("img_none")            ::= "none"

ImageValignParameter      ::= ImageValignBaseline | ImageValignSub | ImageValignSuper | ImageValignTop | ImageValignTextTop | ImageValignMiddle | ImageValignBottom | ImageValignTextBottom

ImageValignBaseline       ::= mw("img_baseline") ImageValignSub            ::= mw("img_sub") ImageValignSuper          ::= mw("img_super") ImageValignTop            ::= mw("img_top") ImageValignTextTop        ::= mw("img_text_top") ImageValignMiddle         ::= mw("img_middle") ImageValignBottom         ::= mw("img_bottom") ImageValignTextBottom     ::= mw("img_text_bottom")

/* By default: */ mw("img_baseline")        ::= "baseline" mw("img_sub")             ::= "sub" mw("img_super")           ::= "super" | "sup" mw("img_top")             ::= "top" mw("img_text_top")        ::= "text-top" mw("img_middle")          ::= "middle" mw("img_bottom")          ::= "bottom" mw("img_text_bottom")     ::= "text-bottom" Caption                   ::= <inline-text>

Semantics

 * Renders an image inline using the  tag.
 * It is not an error to specify multiple alignment parameters; the first specified is the one used.
 * It is not an error to specify multiple captions; the last specified is the one used.
 * The caption has no effect if ThumbImageParameter is not given.

Media
MediaInline              ::=  "", "Media:" , PageName "." MediaExtension "" ; MediaExtension = "ogg" | "wav" ;

Gallery
GalleryBlock              ::=   "" ; GalleryImage              ::=   (to be defined: essentially   foo.jpg[|caption] )

Remarks:
 * The gallery block can technically be used in the middle of a sentence so is not a "special block". It doesn't render particularly nicely when you do that though.