Specs/HTML/2.1.0

This page defines a MediaWiki-specific DOM based on HTML5 and RDFa. The semantics of MediaWiki-specific functionality are encoded using RDFa.

Changes since Specs/HTML/2.0.0

 * Links to missing media are no longer rendered as broken images, but instead as redlinks. See T169975
 * Block figures are always rendered with a figcaption for consistency and styling, even if no caption content was provided.

RDFa structures
Global prefix mappings:
 * Convention: Capital for types, lowercase for attributes.
 * Generally use the prefix instead of vocab definitions to avoid clashes (and allow mixing) with user-supplied RDFa. User-supplied RDFa with the mw prefix is moved to a non-clashing prefix in Parsoid.

Versioning
An integer version number is set in the head section of the returned HTML document. This version is incremented whenever this DOM spec or any other important aspect of the Parsoid HTML output changes. See for details.

ID attributes on all elements
In pagebundles, we assign ID attributes to all elements, and use this to associate external metadata with those elements: Element_IDs. So far, we've moved data-parsoid (private, so should not matter to users) and will likely also move data-mw (public) from the DOM into JSON objects keyed on the ID.
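A minimal sketch of how a client might join ID-keyed metadata back onto elements. The metadata shape (a plain mapping from element ID to a JSON object), the sample IDs, the `dsr` field, and the helper names are assumptions for illustration only; this is not the pagebundle format itself.

```python
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Records (tag, id) for every element that carries an id attribute."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        element_id = dict(attrs).get("id")
        if element_id is not None:
            self.elements.append((tag, element_id))


def attach_metadata(html_text, metadata_by_id):
    """Return {element id: metadata}, in document order.

    Elements without an entry in metadata_by_id get an empty dict.
    """
    collector = IdCollector()
    collector.feed(html_text)
    return {eid: metadata_by_id.get(eid, {}) for _tag, eid in collector.elements}


# Hypothetical sample data, not real Parsoid output.
sample_html = '<p id="mwAQ">Hello <b id="mwAg">world</b></p>'
sample_metadata = {"mwAQ": {"dsr": [0, 17]}, "mwAg": {"dsr": [6, 17]}}
joined = attach_metadata(sample_html, sample_metadata)
```

Keying on the ID keeps the HTML free of bulky attributes while still letting a client look up the metadata for any element it touches.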

Expectations of editing clients
This section only applies to clients that edit HTML and expect to convert that HTML to wikitext without introducing unrelated diffs in that wikitext.

A  protects DOM structures from any editing. Clients are expected to preserve / protect subtrees marked as such. Clients are also expected to preserve any DOM subtrees marked up with,  ,   in the http://mediawiki.org/rdf/ namespace they don't understand. This decouples clients from Parsoid development, and lets them concentrate on editing constructs whose special semantics they understand without having to implement all possible content elements.
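The preservation rule can be sketched as a scan for subtree roots whose RDFa annotations the client does not understand. This sketch assumes that the relevant attributes are `typeof`, `rel`, and `property`, and that values in the http://mediawiki.org/rdf/ namespace are written with the `mw:` prefix; the class name and the sample markup are hypothetical.

```python
from html.parser import HTMLParser

# Assumed set of RDFa attributes to inspect.
RDFA_ATTRS = ("typeof", "rel", "property")


class ProtectedFinder(HTMLParser):
    """Lists elements annotated with mw: values the client does not understand."""

    def __init__(self, understood):
        super().__init__()
        self.understood = set(understood)
        self.protected = []  # (tag, [unknown mw: values])

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        unknown = []
        for name in RDFA_ATTRS:
            # An RDFa attribute may hold several space-separated values.
            for value in (attr_map.get(name) or "").split():
                if value.startswith("mw:") and value not in self.understood:
                    unknown.append(value)
        if unknown:
            self.protected.append((tag, unknown))


sample = (
    '<p>Text <span typeof="mw:Transclusion" about="#mwt1">x</span> '
    '<a rel="mw:WikiLink" href="./Foo">Foo</a>'
    '<span typeof="mw:Entity">&amp;</span></p>'
)
finder = ProtectedFinder(understood={"mw:WikiLink", "mw:Entity"})
finder.feed(sample)
```

A client would treat each reported element, together with its whole subtree, as read-only.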

Exceptions
Captions contained within media can be modified, for example a  within a. An exception to this exception would be when the media is itself generated by a transclusion, so a  which also has   or an about attribute. See template markup.

Headings and Sections
Status: Implemented. Tracking bug for sections; tracking bug for HTML5 IDs.

Given the illustrative wikitext: The corresponding HTML DOM would be: Note the following properties:
 * There is always a  tag for the lead section, even if it is empty.  It will be the first   tag in the document, and will have   unless it is uneditable.
 * The  attribute will either be   or greater and correspond exactly to a PHP section ID (as used for   for example), or will be   (indicating an uneditable non-pseudo section) or   (indicating a pseudo-section, which is also uneditable). Further discussion at Parsing/Notes/Section Wrapping.
 * There will be a  element as the first child of   iff it is not a pseudo-section.
 * The  attribute on the   element matches the 'html5'  .  If needed, an empty   with   will be added to hold an   attribute matching the 'legacy'  . The   attributes and the empty spans are ignored during serialization back to wikitext; only the contents of the heading element are significant.
 * The  tags are properly nested.
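As a sketch, a client could classify sections by their section ID value as follows. The concrete values used here (0 or greater for editable sections, -1 for uneditable non-pseudo sections, -2 for pseudo-sections) are assumptions, since they are not spelled out in the list above; the function name is hypothetical.

```python
def classify_section(section_id):
    """Map a numeric section ID to a coarse editability class.

    The numeric conventions are assumed, not taken from this page.
    """
    if section_id >= 0:
        return "editable"  # corresponds to a PHP section ID
    if section_id == -1:
        return "uneditable"  # uneditable non-pseudo section
    if section_id == -2:
        return "pseudo"  # pseudo-section, also uneditable
    raise ValueError(f"unexpected section id: {section_id}")
```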

Images
Status: Implemented. Tracking bug.

In the examples below, the original size of the example image is 1941 × 220 pixels (these are the dimensions of the Foobar.jpg used in parserTests). The width and height in the DOM represent the actual scaled image size (not the bounding box dimensions specified in the wikitext). When image dimensions are modified or images with a non-default size are created, we will serialize to a square bounding box around the given width and/or height attributes. In the future: When using a (possibly scaled) version of the default thumbnail size, we will serialize using the  or   option to enforce a square thumbnail bounding box (see ).
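The difference between the bounding box and the scaled size can be illustrated with a little arithmetic. This is a sketch only: the function name is hypothetical, and the rounding rule (round to nearest, at least 1 pixel) is an assumption rather than a statement of how the thumbnailer rounds.

```python
def fit_in_box(width, height, box_width, box_height):
    """Scale (width, height) to fit inside the box, preserving aspect ratio."""
    if box_width * height <= box_height * width:
        # The width is the limiting dimension.
        return box_width, max(1, round(height * box_width / width))
    # The height is the limiting dimension.
    return max(1, round(width * box_height / height)), box_height


# Foobar.jpg (1941 x 220) in a square 220 x 220 bounding box.
scaled = fit_in_box(1941, 220, 220, 220)
```

For the example image, the width is the limiting dimension, so the scaled image is 220 pixels wide and 25 pixels high (220 × 220 / 1941 ≈ 24.9).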

The basic tree structure of all images, regardless of formatting options, alignment, or thumbnails, is:

The outer <figure> element needs to become a <figure-inline> element when the figure is rendered inline, since otherwise the HTML5 parser will interrupt a surrounding block context. The inner <figcaption> element is rendered as a  attribute in this case (since block content in an invisible caption would otherwise break parsing). The inner <a> element needs to become a span if there is no link; see. An "alt" attribute on the <img> is present if (and only if) the "alt=" options are present in the wikitext markup. If the "lang=" option is present, the <img> tag will have a "lang" attribute. The "resource" attribute on the <img> tag specifies the wiki title and namespace for the image (so it doesn't have to be reverse-engineered from the "src" attribute); it should point to a relative URL based on the image title. The "link=" option will be present in generated wikitext if and only if the "resource" attribute of <img> differs from the "href" attribute of the <a> tag.
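The rule for the "link=" option reduces to a single comparison. In this sketch the function name is hypothetical, and treating a missing link wrapper (a span instead of an a element) as `None` is an assumption.

```python
def needs_link_option(img_resource, anchor_href):
    """Return True when serialization would need an explicit link= option.

    anchor_href is None when the image is wrapped in a span (no link).
    """
    return anchor_href != img_resource
```

For example, an image whose wrapper links to its own file page needs no "link=" option, while one that links to another page does.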

The <img> tag will have,  , and   attributes indicating the original (unscaled) size and type of the image. See.

Summary of semantic info for images
Summary of semantic info that is present in the HTML generated for images:
 * wrapper node: for block images and  for inline images
 * typeof attribute on the wrapper: mw:Image, mw:Image/Thumb, mw:Image/Frame, mw:Image/Frameless for different image uses
 * figure classes: mw-valign-{baseline,middle,sub,super,text-top,text-bottom,top,bottom} are only applied to inline media (rendered as ): mw-halign-{left,right,center,none} and optionally mw-image-border and mw-default-size for full-size images and thumbs scaled to the wiki's and user's default thumb size
 * figcaption sub-element: The caption
 * resource attribute on image: link to image resource page. TODO: what to use for images from commons?
 * width and / or height on image: scaled image size. Specifying only one of width or height is acceptable, and makes client-side scaling easier without aspect ratio issues.
 * alt attribute on image: alt property
 * src attribute on image: thumb governed by explicit thumb option or implicit from image
 * href attribute on the a element around the image: link target, normally just the image page. However, the a element can be absent if the link is explicitly empty.
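A client could collect the semantic information listed above with a small parser along these lines. The sample markup is hand-written for illustration (including the example.com URL) and is not verbatim Parsoid output; the class and key names are hypothetical.

```python
from html.parser import HTMLParser


class ImageInfo(HTMLParser):
    """Collects the semantic attributes of a single image wrapper."""

    def __init__(self):
        super().__init__()
        self.info = {"classes": [], "caption": ""}
        self._in_caption = False

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag in ("figure", "figure-inline"):
            self.info["wrapper"] = tag
            self.info["typeof"] = attr_map.get("typeof")
            self.info["classes"] = (attr_map.get("class") or "").split()
        elif tag == "a":
            self.info["href"] = attr_map.get("href")
        elif tag == "img":
            for name in ("resource", "src", "alt", "width", "height"):
                if name in attr_map:
                    self.info[name] = attr_map[name]
        elif tag == "figcaption":
            self._in_caption = True

    def handle_endtag(self, tag):
        if tag == "figcaption":
            self._in_caption = False

    def handle_data(self, data):
        # Only the text content of the caption is kept in this sketch.
        if self._in_caption:
            self.info["caption"] += data


sample = (
    '<figure class="mw-default-size mw-halign-left" typeof="mw:Image/Thumb">'
    '<a href="./File:Foobar.jpg">'
    '<img resource="./File:Foobar.jpg" '
    'src="//example.com/thumb/220px-Foobar.jpg" width="220" height="25"/>'
    "</a><figcaption>caption</figcaption></figure>"
)
parser = ImageInfo()
parser.feed(sample)
info = parser.info
```

The absence of an `href` key in the result would indicate an image without a link wrapper.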

Specific image examples
(Note 1)

Without a link, we use the same basic DOM structure, but use a span instead of an a wrapper:

(Note 1)

Adding 'left' causes the image to be rendered in block context, so the outer <span> becomes a <figure>:

(Note 2, Note 5)

Scaling, vertical alignment of an inline image:

(Note 1)

Caption (containing disallowed markup) on an inline image:

(Note 2, Note 5)

(Note 2)

(Note 3, Note 4)

(Note 3)

(Note 5)

Note that "border" can be combined with "frameless".

(Note 5)

Manual thumbnails; note that the  attribute points at the original image, the   attribute points to the manually specified thumbnail image, and the   attribute indicates the resource name of the thumbnail (so it doesn't have to be inferred from the  ):

Resizing images with the "scale" option:

Resizing thumbs with the "scale" option (this is a square 220x220px bounding box, see ):

Resizing with the "upright" option (note that this is converted to an appropriate "scale" option, see above):

See enwiki help for all options, see mw for inline/float details

Note 1: The PHP parser adds a default alt attribute to the <img> tag, with content "Foobar.jpg". Client-side post-processing will need to add this for compatibility. (Parsoid does not add this attribute because it does not correspond to anything in the wikitext.)

Note 2: In this case the PHP parser adds a title attribute to the <a> and an alt attribute to the <img>, both with the value "caption". Note that this is a markup-stripped version of the supplied caption in some cases. Client-side post-processing will need to add these.

Note 3: The PHP parser adds a  element inside the <figure>. Parsoid adds this with CSS.

Note 4: The default thumbnail width is a user-specified preference for the PHP parser. Parsoid uses a fixed 220px thumbnail width. The "mw-default-size" class indicates "no size given" and can be used to resize thumbs according to user preferences.

Note 5: In this example, the caption is not visible in PHP output, so there should be a rule in the default stylesheet like (IE7+ and other modern browsers): In the PHP parser output, the caption does appear as a title attribute on the <a> and an alt attribute on the <img>; client-side post-processing should add these (unless there are existing title and alt attributes, resulting from "title=" and "alt=" properties in the wikitext).

Audio/Video
Status: Implemented. See tracking bug for details.

The basic  wrapper for audio and video media is identical to that for images, described in the section above, including provisions for inline players and captions. (Note that the PHP implementation does not properly render manual thumbnails or inline.)

The inner  element tracks the elements emitted by the video.js implementation in T100106.

Notes:
 * As a general rule, attributes derived from inspection of the original media file (original size, etc.) get  prefixes.  Attributes of derived/transcoded media get plain   attributes.  See T133670
 * The  and   tags are ignored during HTML-to-wikitext serialization; all information encoded in wikitext is represented on the ,  ,  , and   elements.
 * The wikitext  option does not exist for audio / video (it can be specified but is not added to output, since the html5 spec defines that it should not be present since accessibility for a/v is via captions specified by the   element).  It is represented in our html as a hidden attribute in.
 * The wikitext  option does not exist for audio / video (it can be specified but is not added to output, since we want the clicks to play the media, not follow a link) --videos always produce , never  .  It is represented in our html as a hidden attribute in.


 * The wikitext,  , and   options are deprecated and we mark them as bogus options, surfaced in linter for editors to clean up.  See T134880 and T135537
 * Since it is not guaranteed that the original file is one of the sources listed, the  attribute on   represents that data.

More examples
The  option is editable through a   attribute and influences the seek time of the poster.

The  and   options are editable through   attributes and influence the media fragments  on the source urls.

Browsers will ignore dimensions on elements but we supply them to be enforced dynamically, if desired. The wrapper is annotated with  to indicate audio files. See T133673

Other Media
Some complex media, like PDFs, permit previewing with the "page" option:

Missing media
If Parsoid fails to fetch the media info for a file, it keeps the same structure with stuffed span in place of the media element and links it to the missing media. See T169975

Wiki links

 * The href attribute is UTF-8 (as is everything else), with a relative link prefix that always navigates up to the top of the wiki namespace, which matters especially in subpages / pages containing slashes in the title. Example: './Foo', or (in a subpage) './../Foo'. We percent-encode percents and question marks in hrefs to support following links to wiki pages with question marks in their name. On the way in (when posting HTML to Parsoid) we assume href values to be urlencoded and decode them during serialization. For now, modified link hrefs without a ./ or ../ prefix are assumed to be absolute to the wiki namespace, but they will soon be interpreted as relative to the page in order to support relative links in other HTML content. After that change, the equivalent of an absolute wikilink  would need to return an href="/Foo" instead.
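The href conventions above can be sketched as a pair of helpers. The function names are hypothetical, and the sketch deliberately covers only what is stated above (the relative prefix, and percent-encoding of percents and question marks); any further title normalization is out of scope.

```python
from urllib.parse import unquote


def title_to_href(target_title, current_title=""):
    """Build a relative href to target_title as seen from current_title.

    One '../' is added per slash in the current page title, so the link
    always resolves from the top of the wiki namespace.
    """
    depth = current_title.count("/")
    # Encode '%' first so the '%' introduced for '?' is not re-encoded.
    encoded = target_title.replace("%", "%25").replace("?", "%3F")
    return "./" + "../" * depth + encoded


def href_to_title(href):
    """Strip the relative prefix and percent-decode an href."""
    path = href[2:] if href.startswith("./") else href
    while path.startswith("../"):
        path = path[3:]
    return unquote(path)
```

For a page titled `Bar/Baz`, `title_to_href("Foo", "Bar/Baz")` yields `./../Foo`, matching the subpage example above.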

Link with tail:

Media links
Media links of the form  or   are a special case of wikilinks and are represented as below. Note the mw:MediaLink rel attribute value.

Nowiki blocks
There are two options to handle nowiki editing:
 * 1) Strip the tags from the DOM and let the serializer add those that are needed after each edit
 * 2) Keep them in the DOM for more accurate round-tripping of manually created nowiki blocks, and prevent non-text content from being entered into these blocks in the editor (TODO)

We picked option 2 for now. The nowiki content remains editable. If the content is modified in a way that makes nowiki unnecessary, Parsoid can remove the wrapper in the serializer.

HTML entities
HTML entities are wrapped with a span tag with a mw:Entity typeof attribute. For example, in wikitext generates the following HTML output:

Editing clients that wish to prevent the entities from being decoded when transformed to wikitext have to wrap them with a span tag like above.
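A minimal sketch of the wrapping an editing client might apply to a text fragment containing entity references. The helper name and the regular expression are assumptions; only the span with the mw:Entity typeof comes from the description above.

```python
import html
import re

# Matches named, decimal, and hexadecimal character references.
ENTITY_RE = re.compile(r"&(?:#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def wrap_entities(fragment):
    """Return HTML in which each recognized entity is wrapped in a span."""
    out = []
    pos = 0
    for match in ENTITY_RE.finditer(fragment):
        # Plain text between entities is escaped as ordinary content.
        out.append(html.escape(fragment[pos:match.start()]))
        decoded = html.unescape(match.group(0))
        if decoded == match.group(0):
            # Not a known entity: treat it as literal text.
            out.append(html.escape(match.group(0)))
        else:
            out.append(
                '<span typeof="mw:Entity">' + html.escape(decoded) + "</span>"
            )
        pos = match.end()
    out.append(html.escape(fragment[pos:]))
    return "".join(out)
```

The span content holds the decoded character; the wrapper is what tells the serializer to emit an entity rather than the literal character.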

Display space
An  is a non-breaking space, added for the purpose of improving the visual display of punctuation, particularly for the French language. It's not present in the wikitext but added as a post-processing step on the output. (Previously, this had an additional  typeof, which was removed in T254502.)

Behavior switches
Behavior switches are primarily represented as a meta tag as a placeholder to mark the presence and place where the behavior switch showed up on the page. This lets editing clients support editing of these behavior switches in some fashion. The actual page modification that the behavior switch targets is not always flagged right now.

The table below shows the property string for the different behavior switches. The meta tag is of the form

Redirects
(T104502: This no longer creates a category.)

Note that interwiki links generate redirect tags; the client is responsible for not doing an HTTP 301 or 302 redirect to an external site.

Note that, unlike the PHP parser, using language links still generates correct redirect tags in Parsoid. The client is again responsible for not doing an HTTP redirect to an external wiki.

Regular transclusions
Many transclusion parameters contain arbitrary wikitext, styles, template names and other non-semantic / DOM strings. We also have very little information about which attributes are semantic and which are presentational. For now, we therefore expose all attributes in the "wt" (wikitext) format:

The  property is used to associate additional information with each transclusion or extension fragment. This lets us support inline editing of things like infobox parameters in the future without changes to the JSON data structure.

Parameter names are represented by their index, if not explicitly named, or by the name that will be used when replacing them. In the case that the normalized parameter named is different from the actual parameter name in the text, a key.wt attribute is used to keep the name as it appears in the text. For example:

Compound content blocks that include output from several transclusions, like this football table, are represented by interspersing wikitext strings with transclusion information in the data-mw.parts array:
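As a sketch of how a client could walk a data-mw.parts array, the following turns it back into wikitext. The object layout (`template.target.wt`, `params`, `i`), the template names, and the helper names are assumptions for illustration, and the serialization is deliberately simplified: a positional value containing "=" would need an explicit numeric name, which this sketch does not handle.

```python
def param_to_wikitext(name, param):
    """Serialize one parameter, honoring key.wt when it is present."""
    value = param.get("wt", "")
    if "key" in param:
        # key.wt keeps the parameter name exactly as written in the text.
        return param["key"]["wt"] + "=" + value
    if name.isdigit():
        return value  # positional parameter, emitted without a name
    return name + "=" + value


def parts_to_wikitext(data_mw):
    """Concatenate wikitext strings and serialized transclusions in order."""
    pieces = []
    for part in data_mw["parts"]:
        if isinstance(part, str):
            pieces.append(part)  # interspersed wikitext
            continue
        template = part["template"]
        fields = [template["target"]["wt"]]
        fields += [param_to_wikitext(n, p) for n, p in template["params"].items()]
        pieces.append("{{" + "|".join(fields) + "}}")
    return "".join(pieces)


# Hypothetical compound block: two transclusions around a wikitext string.
data_mw = {
    "parts": [
        {"template": {"target": {"wt": "table-start"}, "params": {}, "i": 0}},
        "\n| cell\n",
        {
            "template": {
                "target": {"wt": "row"},
                "params": {
                    "1": {"wt": "a"},
                    "team": {"wt": "b", "key": {"wt": " team "}},
                },
                "i": 1,
            }
        },
    ]
}
wikitext = parts_to_wikitext(data_mw)
```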




Editing support for the interspersed wikitext is difficult to implement on the server side, as those wikitext edits need to be restricted in their effect to the original DOM range. A potential solution to this could be to wrap the multi-template compound block into a template hook that expands its content to a well-balanced DOM structure. Arbitrary wikitext edits within this tag would still only affect the original DOM range, both in Parsoid and the PHP parser. This is lower priority though, so for now the interspersed wikitext will be read-only.

Variables and Parser Functions
These other magic words, apart from the two defined in the behaviour switches section above (DEFAULTSORT and DISPLAYTITLE), render similarly to templates but have a function property in their data-mw, as opposed to an href. For example, the wikitext  renders as,

Generated attributes of HTML tags
Status: Implemented. See

This is the representation of attributes in links, tables, and html tags whose keys and/or values were not present as literal text in the input. When only attributes are affected, the element is assigned an  typeof attribute and the   JSON object will provide additional specific information about the keys or values that are fully or partially generated. If other parts of the content are also transclusion-affected, the element will be marked up as a general transclusion instead.

There are plausible use cases in which part of an attribute value is generated by a template (for example, the color in the background-color of a style attribute), but fewer for attribute keys. This spec also assumes that a template can only generate one attribute rather than multiple attributes.

A few examples are worked out below.

Example 1:

Example 2:

Example 3:

Parameter Substitution at the top-level
This section is only present for the sake of completeness. Unexpanded parameter markup is unlikely to be useful in top-level content; where it is found, it usually indicates a draft, a syntax error, or copy-pasted text that was not fixed up.

This section specifies wrapping for parameter uses in the top-level namespace where all parameter substitutions evaluate to a null value. The structure borrows heavily from transclusion content, described above, with some slight divergences. The target corresponds to the parameter name, and the params contain the default value.

Extension tags
The data-mw attribute is a JSON object. It is meant as an extensible public interface, so more top-level members can be added. The top-level structure depends on the content type, with the main types being transclusions and extensions. See also the transclusion content section.

At present, Parsoid has few native extension handlers. See Specs/HTML/2.1.0/Extensions for details on editing their content.

noinclude / includeonly / onlyinclude
We only care about these in the actual page context, not in transcluded pages / templates.

The content in includeonly blocks is exposed to clients for editing and diffing, etc. using the data-mw attribute.

Language conversion blocks
Status: experimental.

The attribute is named  since it affects the read-only rendering of the page, and   attributes are supposed to be ignored for rendering and only needed for editing.

Top-level fields in the JSON are:,  ,  ,  ,  , and. If the wikitext "show" flag is not present or implicit, the DOM markup will use the  element. If "show" is present or implicit, the DOM markup will use  if all possible contents are inlineable, or   otherwise.

Further discussion (and historical background) at and Language conversion blocks.

Error handling
See :
 * For API errors because of a non-existing image, data-mw.errors.key is set to "missing-image".
 * For API errors getting image info, data-mw.errors.key is set to "api-error" and data-mw.errors.message has more information about the specific error.
 * For image wikitext where a manual thumbnail is specified and it is not present, the data-mw.errors.key is set to "missing-thumbnail" and data-mw.errors.message is set to "This thumbnail does not exist.".
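A client could read these error annotations from the data-mw attribute along the following lines. Whether `errors` is a single object or a list of objects is not stated above, so this sketch accepts both; the function name is hypothetical.

```python
import json


def media_errors(data_mw_attr):
    """Return a list of (key, message) pairs from a data-mw attribute value."""
    data_mw = json.loads(data_mw_attr)
    errors = data_mw.get("errors", [])
    if isinstance(errors, dict):
        # Assumed alternative shape: a single error object.
        errors = [errors]
    return [(error.get("key"), error.get("message", "")) for error in errors]
```

A client might, for instance, show a placeholder for "missing-image" and surface the message text for "api-error".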

Ex:

Ex: