Specs/HTML

See Parsoid/HTML5 DOM with microdata for the general idea and background. This is a work in progress; feel free to suggest improvements!

The data-gen attribute
We use the data-gen attribute and RDFa attributes to mark up special structures in the DOM. Whenever a DOM subtree needs some special treatment or knowledge of the RDFa content type, we set the data-gen attribute on its root node to one of 'wrapper', 'content', or 'both'.


 * data-gen="wrapper" means that the root node itself was generated and requires special treatment depending on its type. Its children, however, are plain content, subject only to the HTML5 content model restrictions for the node type.
 * data-gen="content" means that the root node is actually a plain node that just carries RDFa data to identify the special behavior of its children.
 * data-gen="both" finally means that both the root node and its children require special treatment depending on the object type identified by RDFa.

Whenever an editor or processor encounters a node with data-gen set, it needs to check the RDFa attributes and element type to see if it matches something it knows how to handle. If those attributes don't match anything it knows, it needs to preserve the DOM subtree unmodified for round-tripping. It is still free to display the subtree in a read-only mode.
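The handling rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Parsoid's implementation: the names `Handling`, `SPECIAL_PARTS` and `classify` are hypothetical, and matching on the element name plus the RDFa `rel` and `typeof` attributes is one possible reading of "RDFa attributes and element type". Treating an unrecognized data-gen value as read-only is likewise an assumption, chosen to err on the side of preservation.

```python
from enum import Enum
from typing import Mapping, Optional, Set, Tuple

# Hypothetical lookup key: (element name, rel attribute, typeof attribute).
TypeKey = Tuple[str, Optional[str], Optional[str]]


class Handling(Enum):
    PLAIN = "plain"          # no data-gen: ordinary HTML content
    EDITABLE = "editable"    # data-gen set and the type is understood
    READ_ONLY = "read-only"  # data-gen set, type unknown: preserve unmodified


# For each data-gen value: (root needs special treatment, children do).
SPECIAL_PARTS = {
    "wrapper": (True, False),
    "content": (False, True),
    "both": (True, True),
}


def classify(tag: str, attrs: Mapping[str, str], known: Set[TypeKey]) -> Handling:
    """Decide how an editor should treat the subtree rooted at this node."""
    data_gen = attrs.get("data-gen")
    if data_gen is None:
        return Handling.PLAIN
    key = (tag, attrs.get("rel"), attrs.get("typeof"))
    if data_gen in SPECIAL_PARTS and key in known:
        return Handling.EDITABLE
    # Unknown structure (or, by assumption, an unknown data-gen value):
    # keep the subtree untouched so that it round-trips.
    return Handling.READ_ONLY


if __name__ == "__main__":
    # "example:link" is a made-up type name used only for this demonstration.
    known_types = {("a", "example:link", None)}
    print(classify("a", {"data-gen": "both", "rel": "example:link"}, known_types))
    print(classify("div", {"data-gen": "both", "typeof": "example:other"}, known_types))
```

A subtree classified as read-only may still be displayed, as long as it is serialized back unchanged.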

RDFa structures
Global prefix mappings (the latter might not be worth it):

Wiki links

 * Wiki links have data-gen="both" set if the link text is derived from the link target; otherwise, data-gen="wrapper" is set. When an editor replaces a formerly auto-generated link text with a customized one, it needs to set data-gen="wrapper".
 * : This produces a triple of type http://mediawiki.org/rdf/wikiLink from the current article to the link target.
 * We might want to add more information about this link (presence of alternate link text, namespace, etc.). It is not clear how to do this in RDFa without adding extra HTML structures.
 * The remaining information (presence of generated link content, link tail) is stored in the data-mw round-trip info. This is private to Parsoid, must not be modified, and can change without notice.
 * in data-mw indicates generated content.
 * in data-mw indicates link tail (see example below)
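The data-gen rule for links (switching from "both" to "wrapper" once the link text is customized) could look roughly like this on the editor side. This is a sketch using Python's standard `xml.etree.ElementTree`; the function name `set_custom_link_text` and the sample `href` value are assumptions made for illustration. The private data-mw attribute is deliberately left untouched.

```python
import xml.etree.ElementTree as ET


def set_custom_link_text(link: ET.Element, text: str) -> None:
    """Replace the link text with editor-supplied text and mark it as customized."""
    # Drop any child elements that made up the old link content.
    for child in list(link):
        link.remove(child)
    link.text = text
    # The text is no longer derived from the target, so only the
    # wrapper (the link element itself) remains special.
    link.set("data-gen", "wrapper")


if __name__ == "__main__":
    # Hypothetical link whose text was generated from its target.
    link = ET.fromstring('<a href="Foo" data-gen="both">Foo</a>')
    set_custom_link_text(link, "a custom label")
    print(ET.tostring(link, encoding="unicode"))
```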

Nowiki blocks
There are two options to handle nowiki editing:
 * 1) Strip the tags from the DOM and let the serializer add those that are needed after each edit
 * 2) Keep them in the DOM for more accurate round-tripping of manually created nowiki blocks, and prevent non-text content from being entered into these blocks in the editor (TODO)

We picked option 2 for now.
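For option 2, an editor needs some way to reject non-text content inside a nowiki block. A minimal check might look like the following; since this page does not define how nowiki blocks are represented in the DOM, the `span` element in the example and the function name `is_text_only` are purely hypothetical.

```python
import xml.etree.ElementTree as ET


def is_text_only(block: ET.Element) -> bool:
    """Return True if the block contains no child elements, only text."""
    # len() of an Element counts its direct child elements.
    return len(block) == 0


if __name__ == "__main__":
    ok = ET.fromstring("<span>[[not a link]]</span>")
    bad = ET.fromstring("<span>text <b>bold</b></span>")
    print(is_text_only(ok), is_text_only(bad))
```

An editor could run such a check before accepting an edit and refuse or flatten any markup entered into the block.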

TODO
The following constructs still need an RDFa markup definition. They will initially only be marked with data-gen="both" for simple read-only round-tripping.
 * Unexpanded and expanded templates
 * template parameter references
 * noinclude, onlyinclude, includeonly
 * behavior switches (only data-gen="both" currently, source-based round-tripping)
 * category links (only data-gen="both" currently, source-based round-tripping)
 * tag extensions including citations
 * redirects
 * ISBN / RFC / PMID autolinks