Parsing/Notes/HTML5

This page records notes and observations about the HTML5 spec and its parsing algorithm as a quick reference. It will be filled out progressively.

Content categories
The spec defines several content categories, and an element can belong to zero or more of them. The list below gives a sense of what each category represents.
 * Flow content - nearly every element that can appear in the body
 * Metadata content - link, meta, ...
 * Heading content - h1 - h6, ...
 * Sectioning content - article, aside, nav, section
 * Embedded content - audio, video, embed, object, etc.
 * Interactive content - form controls, buttons, and the like
 * Phrasing content - all phrasing content is flow content; heading & sectioning content cannot be phrasing content (loosely speaking, this is the inline node notion from HTML4)
 * Elements that are Flow but not Phrasing: table, lists, headings, p, div, blockquote, section, figure, header, footer and other less common ones (loosely speaking, this is the block node notion from HTML4)
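
The category relationships above can be sketched as plain sets. This is only an illustration: the element lists are a small hand-picked subset, not the full lists from the spec, and the names CATEGORIES, BLOCK_LIKE and categories_of are made up for this note.

```python
# Illustrative, incomplete category table (hypothetical names, subset of elements).
CATEGORIES = {
    "metadata": {"link", "meta"},
    "heading": {"h1", "h2", "h3", "h4", "h5", "h6"},
    "sectioning": {"article", "aside", "nav", "section"},
    "embedded": {"audio", "video", "embed", "object"},
    "phrasing": {"a", "em", "span", "strong", "audio", "video", "embed", "object"},
}

# Elements that are Flow but not Phrasing (roughly the HTML4 "block" notion).
BLOCK_LIKE = {"table", "ul", "ol", "p", "div", "blockquote", "figure", "header", "footer"}

# In this sketch, flow content is everything phrasing plus the block-like elements.
CATEGORIES["flow"] = (
    CATEGORIES["phrasing"]
    | CATEGORIES["heading"]
    | CATEGORIES["sectioning"]
    | BLOCK_LIKE
)


def categories_of(tag):
    """Return the set of category names that list the given tag."""
    return {name for name, members in CATEGORIES.items() if tag in members}
```

In this sketch categories_of("video") yields embedded, phrasing and flow, while an element that appears in no list (for example "html") yields an empty set, matching the "zero or more categories" remark.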

Palpable content:
 * Elements whose content model allows flow or phrasing content should contain at least one palpable node, e.g. a non-empty text node or audio/video.
 * This effectively discourages empty elements.
 * We may not enforce this in MediaWiki and instead rely on linting tools.

Content model

 * Transparent content model: elements with this model (e.g. a, ins, del) inherit the content model of their parent, which in effect means their nearest non-transparent ancestor.
 * Nothing content model: no content can be present / nested in these elements.
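
A minimal sketch of how a transparent content model could be resolved, assuming a tiny hand-written table. TRANSPARENT, CONTENT_MODEL and effective_content_model are hypothetical names for this note, and both tables are deliberately incomplete.

```python
# Hypothetical, incomplete tables for illustration only.
TRANSPARENT = {"a", "ins", "del"}
CONTENT_MODEL = {
    "div": "flow",
    "p": "phrasing",
    "span": "phrasing",
    "br": "nothing",
}


def effective_content_model(path):
    """Resolve the content model for the last element in `path`.

    `path` lists tag names from the outermost ancestor down to the element
    itself. Transparent elements are skipped until a non-transparent one
    is found.
    """
    for tag in reversed(path):
        if tag not in TRANSPARENT:
            # Raises KeyError for tags missing from the sketch table.
            return CONTENT_MODEL[tag]
    # A transparent element without a non-transparent ancestor is treated
    # as accepting flow content.
    return "flow"
```

For example, an a nested in an ins inside a p resolves to phrasing, so a div would not be allowed inside it, whereas the same a directly inside a div resolves to flow.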

Paragraphs

 * Paragraphs in HTML5 are a structural concept, not a semantic / logical one.
 * Runs of phrasing content form paragraphs. Correspondingly, p elements can only contain phrasing content.
 * The p end tag can be omitted if the element is followed by one of a set of tags. I imagine this is just grandfathering in the HTML seen in the wild.
 * It is not required to add p tags around runs of phrasing content that form paragraphs. But it is better to add them for clarity and to avoid edge cases in rendering. We'll probably always add them in MediaWiki.
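
The "runs of phrasing content form paragraphs" idea can be sketched as a grouping pass over the children of a flow container. This is a simplification under stated assumptions: children are represented by tag name only ("#text" stands for a text node), the PHRASING set is a small subset, and the spec's finer rules (for example ignoring whitespace-only runs) are left out. The name paragraph_runs is made up for this note.

```python
from itertools import groupby

# Small, hypothetical subset of phrasing-level node names.
PHRASING = {"#text", "a", "em", "span", "strong", "br"}


def paragraph_runs(children):
    """Group consecutive phrasing-level children into implied paragraphs.

    Non-phrasing children (div, ul, ...) end the current run and are
    themselves dropped from the result.
    """
    runs = []
    for is_phrasing, group in groupby(children, key=lambda name: name in PHRASING):
        if is_phrasing:
            runs.append(list(group))
    return runs
```

Each returned run is a candidate for wrapping in explicit p tags, which is what always adding them in MediaWiki would amount to.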

Algorithm notes

 * For each element, build a map of the contexts in which it can show up and the content model it expects. There is clearly a hierarchical relation here: the content model of a node determines the contexts in which its children can show up, so these constraints should line up properly.
 * This map can probably be used to derive a set of composition rules / spec for when document fragments need to be composed into a final document.
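
A rough sketch of what such a map and a composition check might look like, assuming each element is described by the categories it can stand in for (its contexts) and a single content model. ElementRule, RULES, can_nest and invalid_roots are hypothetical names, and the table covers only a handful of elements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ElementRule:
    contexts: frozenset   # categories under which the element may appear
    content_model: str    # category its children must satisfy


# Hypothetical, incomplete rule table for illustration only.
RULES = {
    "div": ElementRule(frozenset({"flow"}), "flow"),
    "p": ElementRule(frozenset({"flow"}), "phrasing"),
    "span": ElementRule(frozenset({"flow", "phrasing"}), "phrasing"),
    "br": ElementRule(frozenset({"flow", "phrasing"}), "nothing"),
}


def can_nest(parent, child):
    """True if the child's contexts include what the parent expects."""
    return RULES[parent].content_model in RULES[child].contexts


def invalid_roots(parent, fragment_roots):
    """Return the fragment root elements that may not be placed in parent."""
    return [tag for tag in fragment_roots if not can_nest(parent, tag)]
```

With this table, composing a fragment whose roots are span and div into a p flags the div, since p expects phrasing content and div only appears in flow contexts. A "nothing" content model never matches any context, so br accepts no children.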