Parsing/Notes/HTML5

This page records some notes and observations about the HTML5 spec and parsing algorithm as a quick / easy reference and will be filled out progressively.

Content categories
The spec defines a number of content categories. An element can belong to zero or more of them. The list below should give you a sense of what the categories represent. Elements that are flow content but not phrasing content include table, lists, headings, p, div, blockquote, section, figure, header, footer and some less common ones (loosely speaking, this is the block node notion from HTML4).
 * Flow content - pretty much everything except a few elements
 * Metadata content - link, meta, ..
 * Heading content - h1 - h6, ..
 * Sectioning content - article, aside, nav, section
 * Embedded content - audio, video, embed, object, etc.
 * Interactive content - forms, buttons, and the like
 * Phrasing content - all phrasing content is flow content; heading & sectioning content cannot be phrasing content (loosely speaking, this is the inline node notion from HTML4)
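
As a rough illustration of the list above, the categories can be modelled as overlapping sets of tag names. This is only a sketch: the table is a small, hand-picked subset of the spec's lists, it ignores attribute-dependent membership (for example, video only counts as interactive when it has controls), and the names CATEGORIES, categories_of and is_block_like are made up for this note.

```python
# Hypothetical, partial category table; the spec's lists are much longer
# and some memberships depend on attributes, which this sketch ignores.
PHRASING = {"span", "em", "strong", "button", "audio", "video", "embed", "object"}

CATEGORIES = {
    "metadata": {"link", "meta"},
    "heading": {"h1", "h2", "h3", "h4", "h5", "h6"},
    "sectioning": {"article", "aside", "nav", "section"},
    "embedded": {"audio", "video", "embed", "object"},
    "interactive": {"button"},
    "phrasing": PHRASING,
}
# All phrasing, heading and sectioning content is also flow content.
CATEGORIES["flow"] = (
    PHRASING
    | CATEGORIES["heading"]
    | CATEGORIES["sectioning"]
    | {"div", "p", "table", "ul", "ol", "blockquote", "figure", "header", "footer"}
)


def categories_of(tag):
    """Return the set of category names the tag belongs to (possibly empty)."""
    return {name for name, members in CATEGORIES.items() if tag in members}


def is_block_like(tag):
    """Flow but not phrasing: roughly the HTML4 block notion."""
    cats = categories_of(tag)
    return "flow" in cats and "phrasing" not in cats
```

With this table, is_block_like("div") is true and is_block_like("em") is false, while categories_of("li") is empty, which matches the "zero or more categories" remark: li is only valid in specific contexts and belongs to no category.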

Palpable content:
 * elements in this category should provide at least one non-empty text node or audio/video.
 * effectively discourages empty elements
 * we may not enforce this in MediaWiki but rely on linting tools

Content model

 * Transparent content model: elements with this model (e.g. a, ins, del) inherit their content model from their nearest non-transparent ancestor.
 * Nothing content model: no content can be present / nested in these elements
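
The two models above can be sketched as a lookup that walks up the ancestor chain until it finds a non-transparent element. The CONTENT_MODEL table is a tiny illustrative subset and effective_model is a hypothetical helper, not something the spec defines. The fallback to flow for a transparent element with no parent follows the spec's rule for that case.

```python
# Hypothetical content-model table (tag -> model name); illustrative subset only.
CONTENT_MODEL = {
    "div": "flow",
    "p": "phrasing",
    "span": "phrasing",
    "a": "transparent",
    "ins": "transparent",
    "del": "transparent",
    "br": "nothing",
    "img": "nothing",
}


def effective_model(ancestors):
    """Resolve the content model of the last element in `ancestors`.

    `ancestors` lists tag names from the outermost element down to the
    element itself. Transparent elements defer to the nearest
    non-transparent ancestor.
    """
    for tag in reversed(ancestors):
        model = CONTENT_MODEL[tag]
        if model != "transparent":
            return model
    # Only transparent elements in the chain: treat as accepting flow content.
    return "flow"
```

For example, effective_model(["p", "ins", "a"]) resolves to "phrasing", so an a inside an ins inside a p may only contain phrasing content, while effective_model(["p", "br"]) is "nothing".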

Paragraphs

 * Paragraphs in HTML5 are a structural concept, not a semantic / logical one.
 * Runs of phrasing content form paragraphs. In other words, p-tags can only contain phrasing content.
 * The p end tag can be omitted if the p element is followed by one of a set of tags. I imagine this is just grandfathering in the HTML seen in the wild.
 * It is not required to add p tags around runs of phrasing content that form paragraphs. But it is better to add them for clarity and to avoid edge cases in rendering. We'll probably always add them in MediaWiki.
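
A minimal sketch of "always add the p tags", assuming the children of a flow container are already available as (tag, markup) pairs, with tag set to None for text nodes. The PHRASING set is partial and wrap_paragraphs is a hypothetical helper, not existing MediaWiki or Parsoid code.

```python
from itertools import groupby

# Partial list of phrasing elements, for illustration only.
PHRASING = {"span", "em", "strong", "b", "i", "a", "small", "sub", "sup", "br"}


def is_phrasing(node):
    tag, _markup = node
    return tag is None or tag in PHRASING  # None marks a text node


def wrap_paragraphs(children):
    """Wrap each run of phrasing content in an explicit p element.

    `children` is a list of (tag, markup) pairs. Whitespace-only runs are
    passed through unwrapped so they do not become empty paragraphs.
    """
    out = []
    for phrasing, run in groupby(children, key=is_phrasing):
        html = "".join(markup for _tag, markup in run)
        if phrasing and html.strip():
            out.append("<p>" + html + "</p>")
        else:
            out.append(html)
    return "".join(out)
```

Calling wrap_paragraphs on a text node, an em, a div and a trailing text node yields two paragraphs with the div left between them, since the div ends the first run of phrasing content.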

Algorithm notes

 * For each element, build a map of the contexts in which the node can show up and the content model it expects. There is clearly a hierarchical relation here: the content model of a node determines the context in which its children can show up. So, these constraints should line up properly.
 * This map can probably be used to come up with a set of composition rules / spec when document fragments need to be composed into a final document.
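
One possible shape for such a map is sketched below, under simplifying assumptions: each element has a single content model named by a string, and a context constraint is just a set of allowed parent tags. ElementRule, RULES and can_contain are hypothetical names, and the entries are a small subset chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ElementRule:
    categories: frozenset          # categories the element belongs to
    content_model: str             # what its children must be
    allowed_parents: frozenset = frozenset()  # empty: no context restriction


# Hypothetical rules for a handful of elements.
RULES = {
    "div": ElementRule(frozenset({"flow"}), "flow"),
    "p": ElementRule(frozenset({"flow"}), "phrasing"),
    "span": ElementRule(frozenset({"flow", "phrasing"}), "phrasing"),
    "ul": ElementRule(frozenset({"flow"}), "list-items"),
    "li": ElementRule(frozenset(), "flow", frozenset({"ul", "ol", "menu"})),
}


def can_contain(parent, child):
    """Check whether `child` may appear directly inside `parent`."""
    parent_rule, child_rule = RULES[parent], RULES[child]
    if child_rule.allowed_parents:
        # Context constraint on the child takes precedence.
        return parent in child_rule.allowed_parents
    # Otherwise the child must belong to the category the parent expects.
    return parent_rule.content_model in child_rule.categories
```

Here can_contain("p", "span") is true, while can_contain("p", "div") and can_contain("div", "li") are both false. The first failure comes from the parent's content model and the second from the child's context constraint, which are the two sides that need to line up.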

Composition constraints
One of the things to work out with the balanced templates RFC and the Wikitext 2.0 proposal is how to compose fragments so that they yield a well-formed, spec-conformant document. Note that since we have well-formed DOM fragments, we don't need to worry about the parts of the HTML parsing algorithm that deal with unclosed or misnested tags. We only need to worry about the content model constraints.

Looking at the table below, the following is a summary of composition constraints (partial, since it only covers a largish subset of elements). Overall, it looks like we can come up with a fairly reasonable set of fragment composition rules based on common sense notions (derived from the HTML5 spec). Within the wikitext markup spec, we might even specify exceptions / minor variations from the spec if it aids reasoning and/or eliminates edge cases.
 * Nodes that only accept phrasing content: h1 - h6, p, pretty much all the text-content elements (span, i, b, em, strong, small, sup, sub, etc. -- see section 4.5 below). You have 2 options here:
    * Strip non-phrasing tags from the content: this seems the right approach for h1 - h6 tags.
    * Split the parent node to ensure constraints are satisfied: this seems the right approach for p and text-content elements.
 * Custom exclusions / constraints: no a-tags inside a; no table-tags inside caption; no main inside nav, aside, ... ; etc.
    * The best solution here is to strip the offending tags from the fragment. So, if an a-tag is used inside another a-tag, the nested a-tag is stripped out. An alternative is to convert the a-tag to text. But, in either case, the a-tag itself is removed. This has an impact on real use cases on Wikipedias.  seems to be found on wikis which leads to broken rendering for reads and headaches for Parsoid for editing and round-tripping. The solution proposed here is a better uniform solution.
 * Constraint on insertion context: li inside ol/ul, td/th inside tr, ...; etc. There are a few possibilities here:
    * Suppress the fragment content entirely: might work for some cases, but probably not a good idea.
    * Insert the necessary required tags, i.e., insert a ul-tag or a table tag as necessary: it is unclear that this is a good solution.
    * Strip just the offending tags, i.e.  is converted to
 * Deviations from content-model and context constraints:
    * It looks like the HTML5 parser does not enforce content model constraints in some cases. Ex: try parsing . The parser allows Flow content inside the pre tag and lets the li tag be used outside a list. To be clear, the spec does say that context requirements are non-normative, so there is that.
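
The "strip the offending tags" option from the list above can be sketched on a toy tree, where an element is a (tag, children) tuple and a text node is a plain string. This covers only the strip strategy (not splitting the parent or inserting missing wrappers). The tree representation, the partial PHRASING set and the helper names are all assumptions made for this note.

```python
# Partial list of phrasing elements, for illustration only.
PHRASING = {"span", "em", "strong", "b", "i", "a", "small", "sub", "sup"}


def strip_tags(children, offending):
    """Unwrap every element for which offending(tag) is true.

    The element's tag is dropped but its (recursively cleaned) content is
    kept in place.
    """
    out = []
    for node in children:
        if isinstance(node, str):
            out.append(node)
            continue
        tag, kids = node
        kids = strip_tags(kids, offending)
        if offending(tag):
            out.extend(kids)  # drop the tag, keep what it wrapped
        else:
            out.append((tag, kids))
    return out


def compose_into_heading(fragment):
    """Headings accept only phrasing content: strip everything else."""
    return strip_tags(fragment, lambda tag: tag not in PHRASING)


def compose_into_link(fragment):
    """No a-tags inside a: strip nested a-tags from the fragment."""
    return strip_tags(fragment, lambda tag: tag == "a")
```

For instance, composing a fragment consisting of a div that wraps text and an em into a heading drops the div and keeps the text and the em. Composing a fragment that contains an a wrapping a b into a link drops the nested a and keeps the b.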