A page about using HTML5 features in MediaWiki. See the spec.

The use of HTML5 by MediaWiki is controlled by $wgHtml5. This is set to true (output normal wiki pages as HTML5) by default, and has been the default on Wikimedia wikis since September 2012.

FAQ about MediaWiki use of HTML5

What exactly do you mean by HTML5?
This page only discusses using HTML5 for our static HTML markup, instead of XHTML 1.0 Transitional. Although HTML5 includes JavaScript APIs (getElementsByClassName(), drag-and-drop, etc.) and other things too, using these is entirely uncontroversial and isn't covered here.
Why should we use HTML5?
First, HTML5 is the next HTML standard. All new HTML features will be added there, and we'll have to use it eventually if we want to take advantage of them. Currently MediaWiki uses XHTML 1.0, which became a W3C Recommendation in January 2000 – 16 years ago. There are some modest immediate benefits to using HTML5 already (see #Already done and #Short term below), so we may as well switch sooner rather than later.

Second, and more idealistically, a major goal of HTML5 is to advance the open web by supplanting proprietary technologies like Flash and Silverlight. It goes to great lengths to reduce the need for closed-source and vendor-locked software by introducing elements like <video> and <canvas>. It also aims to drastically lower the bar to creating a new browser by specifying huge amounts of behavior that previously had to be painstakingly reverse-engineered from existing browsers. All of this accords very closely with Wikimedia's values of "thriving open formats and open standards on the web", and both MediaWiki and Wikipedia should do what they can to be trendsetters and help advance these goals.

But HTML5 is tag soup!
HTML5 doesn't require XML well-formedness – e.g., you can omit attribute quote marks – but it does permit it. MediaWiki currently still outputs well-formed XML by default. This means that by default, you can still (modulo bugs) parse MediaWiki pages using XML libraries, transform them via XSLT, etc. MediaWiki administrators who want to reduce the size of output HTML can disable $wgWellFormedXml. When HTML5 has been around for a while and HTML5 parsing libraries are as prevalent as XML parsing libraries, this benefit might not be so compelling anymore.
So MediaWiki outputs XHTML5?
According to the HTML5 spec, XHTML5 must be output with an XML MIME type. If you configure MediaWiki and/or your server to serve an XML MIME type, then it's possible you can get it to serve XHTML5, because XHTML5 is fairly close to a subset of HTML5, and it's possible by design to create documents that can be served as either HTML5 or XHTML5. However, for practical purposes, people don't use XML MIME types much, so normally MediaWiki won't output XHTML5, but rather HTML5 that happens to be well-formed XML (if $wgWellFormedXml is true). This gives most of the advantages of XML (such as they are), but doesn't cause minor output bugs or typos in system messages to break the site.
See the W3C's Polyglot Markup: HTML-Compatible XHTML Documents.
Will MediaWiki continue to support XHTML 1.0 Transitional?
MediaWiki trunk still supports XHTML 1.0 Transitional (just disable $wgHtml5). There is a proposal to drop XHTML 1.0 in the 1.22 release.
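
To illustrate the well-formedness point from the FAQ above, here is a minimal Python sketch. It assumes two hypothetical page fragments (they are not real MediaWiki output): one written as HTML5 that is also well-formed XML, and one using the terser syntax HTML5 permits. A stock XML parser accepts the first and rejects the second, which is roughly what would happen to XML-based tools if $wgWellFormedXml were disabled.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: HTML5 doctype, but every tag is closed and every
# attribute is quoted, so the document is also well-formed XML.
WELL_FORMED = """<!DOCTYPE html>
<html><head><title>Example</title></head>
<body><p id="intro">Hello <a href="/wiki/Main_Page">wiki</a></p><br /></body></html>"""

# The same content as terse HTML5: unquoted attributes, unclosed <p> and <br>.
TERSE = """<!DOCTYPE html>
<html><head><title>Example</title></head>
<body><p id=intro>Hello <a href=/wiki/Main_Page>wiki</a><br></body></html>"""


def extract_links(markup):
    """Return all href values, or None if the markup is not well-formed XML."""
    try:
        root = ET.fromstring(markup)
    except ET.ParseError:
        return None
    return [a.get("href") for a in root.iter("a")]


print(extract_links(WELL_FORMED))
print(extract_links(TERSE))
```

A tool that receives None here would have to switch to an HTML5 parser or to the API.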

Already done

Useful things that have already been at least partially implemented in trunk, and that require $wgHtml5 to be on.

  • HTML5 form attributes: required, pattern, etc. Maybe also new input types like date. Started in r54567.
  • Remove useless elements and attributes: <head>, type="" on script/style. Started in r54695, lots more done in later revisions but lots more still to do.
  • Use the spellcheck attribute where useful. r59360 does this on edit summaries.

Short term

Things we can start doing as soon as we like, without too much effort and without harmful side effects.

  • Support <video>/<audio> without JavaScript.
    • Need to confirm fallback behavior on Safari without XiphQT, to make sure the fallback content (e.g., Java player) is displayed correctly.
      • Should be possible to do with JavaScript. Currently we require JavaScript to get videos to work on any browser, so this will be a step forward! At least Firefox and Chrome won't need it.
    • See Extension:TimedMediaHandler (it does not output the object tag or Java Cortado as a child of the video tag, but we could add that in).
  • Use data-* if that's useful anywhere (HTML diffs used something like this at some point before removing them for XHTML validity reasons).
  • Use more comprehensible HTML IDs. Currently we restrict them to forms acceptable to HTML4, meaning a subset of ASCII, which results in horrible stuff for foreign languages (or even punctuation in English) with lots of dots and hex codes when we auto-generate header anchors. HTML5 says IDs can be any Unicode string that doesn't contain whitespace, which would allow us to output much nicer-looking IDs. (XHTML1 is more generous than HTML4 as well, but significantly less generous than HTML5: it doesn't allow most punctuation.)
    • Things to be careful of: what do browsers accept in practice? (Need to test anchors in links, in HTTP redirects, in CSS selectors, in JavaScript, anything else?) Should we rule out some ASCII characters for the sake of better compatibility with, e.g., CSS? (IDs with "." or ">" or such will require escaping for use in CSS.) How should we handle backward compatibility with existing links from external sources?
    • The code for this is largely done (grep for $wgExperimentalHtmlIds), but can't be enabled until at least Opera 10.10 becomes irrelevant.
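
To make the difference between the two ID styles concrete, here is a rough Python sketch. The function names are hypothetical, and the legacy encoding is only an approximation of the "dots and hex codes" style described above (percent-encode the UTF-8 bytes, then replace "%" with "."); it is not guaranteed to match MediaWiki's output byte for byte.

```python
import re
from urllib.parse import quote


def legacy_id(heading):
    """HTML4-safe anchor (approximation): percent-encode, then swap '%' for '.'."""
    encoded = quote(heading.replace(" ", "_"), safe="_-.:")
    return encoded.replace("%", ".")


def html5_id(heading):
    """HTML5-style anchor: keep any Unicode, only replace runs of whitespace."""
    return re.sub(r"\s+", "_", heading.strip())


heading = "Zürich & Genève"
print(legacy_id(heading))
print(html5_id(heading))
```

The HTML5 form stays readable, but it keeps characters such as "&" that need care in other contexts, which is exactly the compatibility question raised above.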

Medium term

Things that will take more care or effort, or will require broader browser support to be useful.

  • Remove closing tags, attribute quotes, etc. Need to be careful: breaking XML well-formedness might break bots.
    • The potential benefit here is reduced HTML output size, thus reduced bandwidth usage and faster page loads. (Hypothetically, inconsistent use of quotes might increase gzipped HTML size, but testing on a sample page gave 4061 bytes gzipped when always using quotes and 4045 when not.) Gains will be moderate, but they add up. The downside noted is that client-side tools doing UI screen scraping with an XML parser would fail; scrapers would need to use a proper HTML parser instead or move to using the API... worst case would be that they switch to regex-based scraping. :)
    • Also note that some devs have expressed strong objections to moving away from well-formed XML. This will need some arguing over.  :)
    • This is now controlled by the $wgWellFormedXml setting, which defaults to true (keep outputting well-formed XML).
  • Embed MathML and SVG inline, at least for some users. We'd have to be very careful about sanitizing this to avoid XSS ― especially in the case of a browser that doesn't support inline MathML/SVG, and so will treat the contents of the tags as HTML. (We could do this with XHTML 1 too, but only if we serve content with an XML MIME type, which we probably don't want to if we can avoid it. So it would be more convenient with HTML5.)
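
The output-size comparison mentioned for quote removal above can be repeated on any page. The sketch below is one way to do it in Python, under stated assumptions: the sample markup is made up, and strip_safe_quotes is a hypothetical helper that only unquotes attribute values built from a conservative character set. It will not reproduce the 4061 and 4045 byte figures quoted above, since those came from a different sample page.

```python
import gzip
import re

# Hypothetical sample: a repetitive list with fully quoted attributes.
ROW = '<li class="item"><a href="/wiki/Page_{n}" title="Page_{n}">Page {n}</a></li>\n'
QUOTED = "<ul>\n" + "".join(ROW.format(n=n) for n in range(200)) + "</ul>\n"


def strip_safe_quotes(html):
    """Drop quotes around attribute values that contain no whitespace,
    quotes, '=', '<', '>' or backticks (a conservative subset)."""
    return re.sub(r'="([A-Za-z0-9_./:-]+)"', r"=\1", html)


def sizes(html):
    """Return (raw size, gzipped size) in bytes for UTF-8 encoded markup."""
    data = html.encode("utf-8")
    return len(data), len(gzip.compress(data))


UNQUOTED = strip_safe_quotes(QUOTED)
print("quoted:  ", sizes(QUOTED))
print("unquoted:", sizes(UNQUOTED))
```

Comparing the gzipped figures rather than the raw ones is what matters for bandwidth, since most responses are served compressed.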

Long term

Things that we can't do without more browser support. Not much point in working too much on this; too much will depend on how browser development progresses.

  • Start using semantic HTML5 tags like <article>, <section>, etc., and allow (some of) them in user input. This doesn't work acceptably in IE right now without JavaScript hacks: the elements can't be used for styling, so are mostly pointless.
  • Use new HTML5 functional tags like <meter> (in addition to <video>/<audio>). Long term because this doesn't seem to be possible to do usefully in a backward-compatible manner without script (is it?).
    • Yes, it is: the contents of <meter> and <progress> and such are available as fallback. But it's hard to think of uses for these.

Validity issues

Once we're sure we're going with HTML5, we need to start handling validity issues. A validity checker can be used for this (although of course, like any validator, it will not catch all errors). Bugs should be filed on these, but here are some that are already known:

  • There are likely still places where the software outputs deprecated stuff like cellpadding, align, font, etc.
  • Line-initial : without a ; is usually used for indentation, and creates a <dl> without any <dt>'s. We need to make this output <div class=indent> or something, if we can't persuade people to use semantic markup instead.
    • Might be a bad idea, seeing how this is being used on talk pages. —Ms2ger 14:23, 17 July 2009 (UTC)
      • That's why we can't just make it invalid. (It's used in articles too, to indent quotations and such.) We can still change the markup generated so it's clearly presentational rather than pretending to be a definition list, while maintaining the current visual effect. —Simetrical (talk • contribs) 16:52, 17 July 2009 (UTC)
  • Users can currently still input now-invalid elements/attributes like <font>, cellpadding, etc. We could try to automatically translate these, but it would be tricky in some cases. Also, maybe it's better to give the validation errors and encourage users to use semantic HTML instead of auto-translating their presentational garbage?

Also note Manual:$wgValidateAllHtml, but is tidy ready for HTML5?

Avoid HTML named entities

The HTML that MediaWiki outputs is in general valid XML. Rather than be dependent on external DTDs to define HTML named entities such as &mdash; and &nbsp; (which would complicate using the simple HTML5 <!DOCTYPE html>), MediaWiki PHP code does not use named entities apart from four of the five predefined entities in basic XML:

  • &lt; <
  • &gt; >
  • &amp; &
  • &quot; "

When necessary to encode apostrophe ' , MediaWiki code follows the advice at w:List of XML and HTML character entity references#Entities representing special characters in XHTML and uses the numeric reference &#39;, because the named entity &apos; was never defined in legacy HTML.

Instead of using named entities in MediaWiki code, you should:

  • simply use the UTF-8 character directly (by copying and pasting the glyph, or using the editor's "Special characters" menu),
  • or use a numeric character reference
    • particularly for the non-breaking space you can use &#160;, although using runs of non-breaking spaces for layout is usually a mistake

See the "Named entity references and XML well-formedness" discussion on wikitech-l.
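
As an illustration of the rule above, here is a small Python sketch of an escaping helper that emits only the four named entities listed, plus numeric references for the apostrophe and the non-breaking space. The name escape_text is hypothetical; this is not MediaWiki's own code, just the same policy written out.

```python
# Only the four predefined XML entities listed above are used by name;
# the apostrophe and the non-breaking space get numeric references.
_ESCAPES = str.maketrans({
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    '"': "&quot;",
    "'": "&#39;",
    "\u00a0": "&#160;",
})


def escape_text(text):
    """Escape text for HTML output that must also stay well-formed XML."""
    # str.translate works in a single pass, so '&' is never double-escaped.
    return text.translate(_ESCAPES)


print(escape_text("<a href=\"x\">Tom & Jerry's</a>"))
```

Every other character, including dashes and accented letters, passes through unchanged as UTF-8.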

Compatibility issues

  • There is no transitional doctype and hence no limited quirks mode (also called almost standards mode) in HTML5. With a strict doctype, some elements (notably inline images, but also elements whose display property is inline-block or inline-table) have descender space that they don't have when they appear within table cells with the current transitional doctype. The real fix for this is img { display: block; }, but that is highly problematic for the content area because it can break things. img { vertical-align: middle; } mostly hides the issue (except for images that are smaller than the current line-height) and Monobook and Vector do this anyway, but the Modern skin breaks. --Entlinkt 18:48, 8 August 2010 (UTC)
    • Investigation shows that the difference between limited-quirks mode and no-quirks mode is actually larger: inline images, inline-blocks and inline-tables don't create line boxes in quirks mode unless there is also a text node on the same line; this is true even outside tables. I have raised this issue at dewiki and nobody seems to care as long as it's not visibly broken. Maybe we should break it sooner rather than later, so people can bring themselves to fix it. --Entlinkt 21:37, 11 August 2010 (UTC)
      • People should be able to fix it pretty quickly. I don't think there's any better transition plan than just breaking it. This is all held up on the fact that we're running months-old code on Wikimedia; trunk has been HTML5 by default for ages. There are some fixes for XML well-formedness on trunk that need to be deployed before we can switch Wikimedia. —Simetrical (talk • contribs) 23:25, 11 August 2010 (UTC)
  • Not strictly HTML5 related because HTML4 already allowed colons in IDs, but still: These need to be escaped in CSS, but only Internet Explorer >= 8 allows the nice and easy way \:. IE 6 and 7 need the more complicated \3A syntax. IE <= 5.5 does not support escaping in CSS at all. --Entlinkt 18:48, 8 August 2010 (UTC)
    • Did some more testing on problematic characters and found these so far:
| Codepoint | Character | HTML | CSS | Fragment identifier | getElementById |
|---|---|---|---|---|---|
| U+0000 through U+0020 | Control characters and whitespace | invalid | - | - | - |
| U+0023 | # | OK | OK | invalid, also confuses IE | OK |
| U+0025 | % | OK | OK | invalid | OK |
| U+0030 | 0 | OK | #\0 works in Mozilla despite being undefined, use #\30 | OK | OK |
| U+003A | : | OK | #\: broken in IE 6 and 7, use #\3A | OK | OK |
| U+005B | [ | OK | OK | invalid | OK |
| U+005D | ] | OK | OK | invalid | OK |
| U+005F | _ | OK | #_ broken in IE 6, use #\_ | OK | OK |
| U+007F through U+009F | Control characters | invalid | - | - | - |
| U+00A0 | No-break space | OK | #\ plus the literal character is broken in IE 6 and 7, use #\A0 | invalid | OK |
| General issues | | ? | IE 5.5 cannot escape anything in CSS | "invalid" means that the validator throws an error if the character is used in href. Errors go away when percent-encoding these characters, but percent-encoded fragments don't work in IE 8 and Opera 10.70. | None found |
  • Section 3.5 of RFC 3986 contains harsh restrictions about fragment identifiers. See this thread, and also another one about location.hash issues. --Entlinkt 17:07, 20 August 2010 (UTC)
    • RFC 3987 is more relevant. But the worst we have to do is URL-encode the fragment, then it's fine. (pchar includes pct-encoded.) In practice we probably don't even have to do that, browsers will handle the encoding for us if it's unambiguous. —Simetrical (talk • contribs) 17:20, 20 August 2010 (UTC)
      • Percent-encoding causes trouble. Jumping to a percent-encoded fragment works in Mozilla and Chrome, but not IE 8 and Opera (plain Unicode works in all browsers). location.hash is even more funny: It returns the decoded value in Mozilla, but the encoded one in Chrome, IE 8 and Opera. I wasn't able to find out what is supposed to be correct. Should we try to get the spec(s) changed or clarified? --Entlinkt 04:12, 21 August 2010 (UTC)
        • As long as it validates and works, I don't really care if a buggy RFC claims it has to be percent-encoded. I did test this while implementing $wgExperimentalHtmlIds, and vaguely recall something unexpected like you say. I actually filed a spec bug about location.hash on Friday, after reading the threads you linked to and finding the spec unclear. —Simetrical (talk • contribs) 16:50, 23 August 2010 (UTC)
          • Thanks. You're right about RFC 3986 vs. RFC 3987, but there's still a problem with at least 4 characters ("#" and "%" in practice plus "[" and "]" in theory). I've filed bug 24918 about them. --Entlinkt 07:23, 24 August 2010 (UTC)
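
The table and thread above boil down to this: the same ID needs different escaping depending on where it is used. The Python sketch below shows both cases under stated assumptions. The function names are hypothetical; css_escape_id uses the hex form (such as \3A) that the table recommends for the broadest browser support, and fragment_for_href percent-encodes the ID, which the thread notes did not work in IE 8 and Opera at the time.

```python
from urllib.parse import quote


def css_escape_id(identifier):
    """Build a '#id' selector, hex-escaping everything except ASCII letters
    and non-leading ASCII digits and hyphens. Each escape ends with a space
    so that a following hex digit is not read as part of the escape."""
    out = []
    for i, ch in enumerate(identifier):
        is_ascii_alpha = ch.isascii() and ch.isalpha()
        is_safe_later = i > 0 and ch.isascii() and (ch.isdigit() or ch == "-")
        if is_ascii_alpha or is_safe_later:
            out.append(ch)
        else:
            out.append("\\{:X} ".format(ord(ch)))
    return "#" + "".join(out)


def fragment_for_href(identifier):
    """Percent-encode an ID for use after '#' in a URL."""
    return "#" + quote(identifier, safe="")


for example in ("a:b", "0day", "foo_bar", "C#[1]"):
    print(example, css_escape_id(example), fragment_for_href(example))
```

Escaping the underscore is deliberate here, since the table lists an unescaped #_ as broken in IE 6.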

HTML in wikitext

What about HTML in wikitext, together with Parser tags and Extension tags (like <ref>)? Will these be treated as HTML5, rather than as XHTML 1.0? If so – and using HTML5 and HTML5-style syntax seems sensible long-term – then there are some compatibility issues, and tag extensions (like Cite) will require changes.

The most significant issue that I see is with <ref name="foo" /> and <references /> since this syntax is valid XHTML 1.0, but invalid HTML5. That is, the minimized tag syntax (like <foo />) is allowed in XHTML for both elements that never have content, like br, and elements that can have content but don’t, like an empty p. In HTML5 this syntax is only allowed for void elements (elements that can never have content), where the slash (/) is optional, and in any case is simply discarded. Thus in HTML5 <ref name="foo" /> and <references /> become bare start tags, with no end tag: <ref name="foo"> and <references> (…and thus eat the rest of the page, which is not desired!).

I’ve elaborated on this Cite issue, which seems the most significant, at Extension talk:Cite#HTML5-style_syntax?

Looking at HTML in wikitext, the only other tags that jump out as potentially incompatibly minimized are <p /> and maybe <div />, in both cases used for spacing. In HTML5 <p> has an implied end tag in many circumstances, so this should in principle be ok, but could be messy to support properly. OTOH, having <div /> become <div> could cause serious messes with layout, so it’s something to be aware of.

More generally, the issue of implied closing (such as for <p> and <li>) could cause headaches, notably on how it interacts with wikitext (do you close a <p> when you hit a blank line? etc.).

AFAICT the current parsing of HTML in wikitext is pseudo-XHTML (it allows unquoted attribute values, for example); if we move to HTML5-style parsing we’ll need to address these compatibility issues. HTH!

Nils von Barth (talk) 10:55, 23 February 2013 (UTC)
HTML5 support has nothing to do with HTML parsing in wikitext. The syntax used in wikitext is going to stay the same. And we're already outputting HTML5. Parser tag extensions are also neither HTML nor XHTML. They are parsed by their own rules, different from those of HTML and XHTML. For example, they don't support nesting. You can't nest <x><x></x></x>; it'll be treated as an "x" tag containing the text <x> and followed by the text </x>. Likewise, the contents of the tag are considered a flat blob of text. Normal wikitext rules don't even apply unless the extension does something special. Daniel Friesen (Dantman) (talk) 01:14, 8 May 2013 (UTC)
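
The non-nesting behavior described in the reply above can be modelled with a toy scanner. This Python sketch is only an illustration of "the first closing tag ends the element"; split_first_tag is a hypothetical helper and is not how MediaWiki's parser is actually implemented.

```python
import re


def split_first_tag(wikitext, name):
    """Find the first <name>...</name> pair the way a flat, non-nesting
    scanner would: the first closing tag ends the element.

    Returns (tag contents, remaining text), or None if no pair is found."""
    tag = re.escape(name)
    pattern = re.compile(r"<{0}>(.*?)</{0}>".format(tag), re.DOTALL)
    match = pattern.search(wikitext)
    if match is None:
        return None
    return match.group(1), wikitext[match.end():]


print(split_first_tag("<x><x></x></x>", "x"))
```

The contents come back as a flat string, with no further interpretation of the inner <x>.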

Deployment history

HTML5 was first enabled in MediaWiki by default in r53034 (July 10, 2009), and was enabled on Wikimedia sites on November 12 (see request on bugzilla:27478). It was then immediately disabled again because it caused problems with XMLHttpRequest not liking named entities. Yay. As of r59741, that bug should be fixed, and HTML5 was enabled again on Wikimedia wikis on February 23, 2011, then disabled again shortly thereafter. HTML5 was again enabled for all Wikimedia wikis on September 17, 2012 (bug 27478).

See also