Parsoid/HTML5 DOM with RDFa

(An early version of this document was discussed in a wikitext-l thread in February 2012.)

Wikitext can be divided into shorthand notation for HTML elements and higher-level features like templates, media display or categories.

The shorthand portion of wikitext maps quite directly to an HTML DOM. Details such as the handling of unbalanced tags while building the DOM tree, and remembering extra whitespace or wiki vs. HTML syntax for round-tripping, need to be considered but appear quite manageable, especially if some normalization in edge cases can be tolerated. We plan to localize normalization (and thus mostly avoid dirty diffs) by serializing only modified DOM sections and reusing the original source for unmodified DOM parts. Attributes on DOM elements track their original source offsets.
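The selective serialization idea can be sketched roughly as follows. This is a minimal illustration, not the parser's actual implementation: the `Node` class, its `start`, `end` and `new_source` fields, and the `serialize` function are hypothetical names, and the sketch assumes a flat list of non-overlapping top-level nodes whose source offsets are already known.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    start: int                        # offset of the node's first character in the original source
    end: int                          # offset one past the node's last character
    new_source: Optional[str] = None  # set only when the node was modified


def serialize(original: str, nodes: List[Node]) -> str:
    """Re-serialize modified nodes only; copy everything else verbatim."""
    out = []
    pos = 0
    for node in sorted(nodes, key=lambda n: n.start):
        out.append(original[pos:node.start])  # text between nodes, copied as-is
        if node.new_source is None:
            out.append(original[node.start:node.end])  # unmodified: reuse original source
        else:
            out.append(node.new_source)                # modified: use freshly serialized text
        pos = node.end
    out.append(original[pos:])
    return "".join(out)


original = "== Title ==\nSome ''old''   text.\n"
heading_end = original.index("\n")
para_start = heading_end + 1
para_end = original.index("\n", para_start)

nodes = [
    Node(0, heading_end, new_source="== New title =="),  # edited heading
    Node(para_start, para_end),                          # untouched paragraph
]
result = serialize(original, nodes)
```

Because the paragraph was not modified, its unusual run of spaces is copied through unchanged, so a diff against the original would show only the heading change.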

Higher-level features can be represented in the HTML DOM using different extension mechanisms:


 * Introduce custom elements with specific attributes. For display or WYSIWYG editing, these elements then need to be expanded with the template contents, thumbnail HTML and so on. Unbalanced templates (table start/row/end) are very difficult to expand.


 * Expand higher-level features to their presentational DOM, but identify and annotate the result using custom attributes. This is the approach we have taken so far in the JS parser. Template arguments and similar information are stored as JSON in data attributes, which made their conversion to the JSON-based WikiDom format quite easy.
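As a rough illustration of the second approach, template arguments can be stored as JSON in a data attribute and read back later. The attribute name `data-template-args`, the JSON layout, and the helper names `annotate` and `read_annotation` are assumptions made for this sketch; they are not the names used by the JS parser.

```python
import json
import xml.etree.ElementTree as ET

ARGS_ATTR = "data-template-args"  # hypothetical attribute name


def annotate(element, template, args):
    """Attach the template name and its arguments to an element as JSON."""
    element.set(ARGS_ATTR, json.dumps({"template": template, "args": args}))


def read_annotation(element):
    """Return the decoded annotation, or None if the element has none."""
    raw = element.get(ARGS_ATTR)
    return json.loads(raw) if raw is not None else None


# Presentational DOM produced by expanding a (hypothetical) template.
div = ET.Element("div")
div.text = "The name"
annotate(div, "Template:Foo", {"name": "The name", "unused": "x"})

# Serialize and parse again to show the annotation survives a round trip.
markup = ET.tostring(div, encoding="unicode")
restored = read_annotation(ET.fromstring(markup))
```

The serializer takes care of escaping the JSON quotes inside the attribute value, so the annotation does not interfere with the rendered content of the element.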

Both are custom solutions for internal use. For an external interface, a standardized solution would be preferable. HTML5 microdata seems to fit our needs quite well.

Assuming a template that expands to a div and some content, the expansion would be represented like this:

 A static header from the template The name
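A sketch of what such annotated markup might look like, and of how a client could read the expanded arguments back out of it. The element names (`h2`, `span`), the use of the `Template:Foo` URL as the itemtype, and the function `template_arguments` are assumptions for illustration only. The markup is written in its XML serialization so that the standard library parser can read it.

```python
import xml.etree.ElementTree as ET

# Hypothetical expansion of a template into a div with a header and one argument.
MARKUP = """
<div itemscope="" itemtype="http://en.wikipedia.org/Template:Foo">
  <h2>A static header from the template</h2>
  <span itemprop="name">The name</span>
</div>
"""


def template_arguments(root):
    """Map each itemprop name to the rendered text of its element."""
    return {
        el.get("itemprop"): "".join(el.itertext()).strip()
        for el in root.iter()
        if el.get("itemprop") is not None
    }


root = ET.fromstring(MARKUP)
args = template_arguments(root)
```

The static header carries no `itemprop`, so it is not reported as an argument; only the element holding the expanded `name` argument is.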

In this case, an expanded template argument within (for example) an infobox is identified inside the template-provided HTML structure, which could enable in-place editing.

Unused arguments (which are not found in the template expansion) or unexpanded templates can be represented using non-displaying meta elements:

 A static header from the template The rendered name 
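The following sketch shows one way the distinction between displayed and non-displayed arguments could be read from such markup. In microdata, the value of a `meta` element is taken from its `content` attribute rather than from its text. The argument name `unused`, the element names, and the functions `property_value` and `split_properties` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical expansion with one rendered argument and one unused argument.
MARKUP = """
<div itemscope="" itemtype="http://en.wikipedia.org/Template:Foo">
  <h2>A static header from the template</h2>
  <span itemprop="name">The rendered name</span>
  <meta itemprop="unused" content="A value the template never displays" />
</div>
"""


def property_value(el):
    """meta elements carry their value in 'content'; others in their text."""
    if el.tag == "meta":
        return el.get("content", "")
    return "".join(el.itertext()).strip()


def split_properties(root):
    """Return (shown, hidden) dictionaries of template arguments."""
    shown, hidden = {}, {}
    for el in root.iter():
        name = el.get("itemprop")
        if name is None:
            continue
        target = hidden if el.tag == "meta" else shown
        target[name] = property_value(el)
    return shown, hidden


shown, hidden = split_properties(ET.fromstring(MARKUP))
```

An editor could offer in-place editing for the entries in `shown` and fall back to a form or dialog for those in `hidden`.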

The itemref mechanism can be used to tie together template data from a single template that does not expand to a single subtree.
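A simplified sketch of how itemref could be resolved: properties are collected from the item's own subtree and from every element whose id is listed in its `itemref` attribute. The ids, element names, property names, and the function `collect_properties` are assumptions for illustration, and the sketch ignores nested items, which the full microdata algorithm handles separately.

```python
import xml.etree.ElementTree as ET

# Hypothetical page in which one template's output is split across two subtrees.
MARKUP = """
<body>
  <div itemscope="" itemtype="http://en.wikipedia.org/Template:Foo" itemref="foo-part-2">
    <span itemprop="name">The name</span>
  </div>
  <p>Unrelated page content</p>
  <div id="foo-part-2">
    <span itemprop="caption">A caption</span>
  </div>
</body>
"""


def collect_properties(root, item):
    """Gather itemprops from the item subtree and from itemref targets."""
    by_id = {el.get("id"): el for el in root.iter() if el.get("id")}
    subtrees = [item]
    for ref in item.get("itemref", "").split():
        if ref in by_id:  # silently skip dangling references
            subtrees.append(by_id[ref])
    props = {}
    for subtree in subtrees:
        for el in subtree.iter():
            name = el.get("itemprop")
            if name is not None:
                props[name] = "".join(el.itertext()).strip()
    return props


root = ET.fromstring(MARKUP)
item = root.find(".//*[@itemscope]")
props = collect_properties(root, item)
```

Both properties end up attached to the same item even though unrelated page content sits between the two subtrees.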

The itemtype attributes in these examples all point to the template location, which normally contains plain-text documentation of the template parameters and their semantics. The most common application of microdata, however, references standardized schemas, often from http://schema.org, as those are understood by Google, Microsoft, and Yahoo!. A mapping of semi-structured template arguments to a standard schema is possible, as demonstrated by http://dbpedia.org/. It appears feasible to provide a similar mapping directly as microdata within the template documentation, which could then be used to add standard schema information to regular HTML output when rendering a page.

The visual editor could also use schema information to customize the editing experience for templates or images. Inline editing of fields in infoboxes with schema-based help is one possibility, but in other cases a popup widget might be more appropriate. Additional microdata in template documentation sections could provide layout or other UI information for these widgets.

Additional notes

 * The biggest problem with microdata for our use is that it restricts us to a single itemtype. If we use an itemtype like http://en.wikipedia.org/Template:Foo that covers all template parameters, we forgo marking up a subset of those parameters with one of the well-known and widely used vocabularies. We thus can no longer use microdata for most of the use cases it was originally envisioned for. An RDFa-in-HTML spec from the W3C is on the way, which we should consider as an alternative: RDFa supports multiple types per item, and the general DOM structure is otherwise very similar. Manu Sporny (the RDFa-in-HTML WG chair) also offered his support.
 * Notes on multiple itemtypes in microdata: absolute URLs in itemprop names and ideas for itemtypes.
 * Text-only attribute expansions will need to be wrapped in <span> elements. This might break some finicky CSS selectors, but that should be rare and in any case easily fixable.
 * The HTML DOM can be serialized to XML, although a few tweaks may be required ('--' in comments, self-closing tags, Unicode characters instead of entities, lowercase attribute names). These can definitely be automated. See http://wiki.whatwg.org/wiki/HTML_vs._XHTML for a description of the differences.
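The XML serialization tweaks listed in the last note can be sketched with Python's standard `html.parser` tokenizer. This is an illustration under simplifying assumptions, not a complete converter: the class `XmlSerializer` and the function `html_to_xml` are hypothetical names, the input is assumed to have balanced tags already (the tokenizer does not build an HTML5 tree), and raw-text elements such as script and style are not treated specially.

```python
import re
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr

VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
                 "link", "meta", "source", "track", "wbr"}


class XmlSerializer(HTMLParser):
    """Re-emit tokenized HTML using XML syntax rules."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &eacute; into Unicode.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; valueless attributes get "".
        rendered = "".join(
            f" {name}={quoteattr(value or '')}" for name, value in attrs
        )
        closer = " />" if tag in VOID_ELEMENTS else ">"
        self.parts.append(f"<{tag}{rendered}{closer}")

    def handle_endtag(self, tag):
        if tag not in VOID_ELEMENTS:  # void elements were already self-closed
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(escape(data))

    def handle_comment(self, data):
        # XML forbids "--" inside comments and a trailing "-" before "-->".
        safe = re.sub(r"-(?=-)", "- ", data)
        if safe.endswith("-"):
            safe += " "
        self.parts.append(f"<!--{safe}-->")


def html_to_xml(html):
    serializer = XmlSerializer()
    serializer.feed(html)
    serializer.close()
    return "".join(serializer.parts)


xml_text = html_to_xml('<DIV ItemScope CLASS="a">caf&eacute;<BR><!-- a -- b --></DIV>')
root = ET.fromstring(xml_text)  # raises ParseError if the output is not well-formed
```

Parsing the result with an XML parser, as in the last line, is a cheap way to check that the output is well-formed.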