Parsoid/HTML5 DOM with RDFa

Wikitext can be divided into shorthand notation for HTML elements and higher-level features like templates, media display or categories.

The shorthand portion of wikitext maps quite directly to an HTML DOM. Details such as the handling of unbalanced tags while building the DOM tree, and the preservation of insignificant whitespace or of wiki vs. HTML syntax for round-tripping, need to be considered but appear to be quite manageable. This should be especially true if some normalization, localized to the changed content, can be tolerated in edge cases. To further limit the scope of normalization (and thus avoid dirty diffs), we plan to serialize only modified DOM sections while using the original source for unmodified DOM parts. Attributes are used to track the original source offsets of DOM elements.
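The selective serialization idea can be sketched roughly as follows. This is only an illustration under simplifying assumptions: the `Node` class, its `new_wikitext` field and the flat list of top-level nodes are hypothetical and stand in for real DOM elements carrying source-offset attributes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical top-level DOM node with the source range it came from."""
    start: int                          # offset of the first source character
    end: int                            # offset one past the last source character
    new_wikitext: Optional[str] = None  # set only when the node was edited


def serialize_selectively(source: str, nodes: List[Node]) -> str:
    """Re-serialize edited nodes only; copy everything else from the source."""
    out = []
    pos = 0
    for node in sorted(nodes, key=lambda n: n.start):
        out.append(source[pos:node.start])       # text between nodes, untouched
        if node.new_wikitext is None:
            out.append(source[node.start:node.end])  # unmodified: original bytes
        else:
            out.append(node.new_wikitext)            # modified: fresh serialization
        pos = node.end
    out.append(source[pos:])
    return "".join(out)


# Three source blocks with non-normalized spacing ("  text", "*item two").
pieces = ["'''Bold'''  text\n", "* item one\n", "*item two\n"]
source = "".join(pieces)
nodes, offset = [], 0
for piece in pieces:
    nodes.append(Node(offset, offset + len(piece)))
    offset += len(piece)

nodes[1].new_wikitext = "* item 1\n"   # only the second block was edited
edited = serialize_selectively(source, nodes)
print(edited)
```

Only the edited list item changes in the output; the unusual spacing in the other two blocks is carried over from the original source, so the diff stays limited to the edit.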

Higher-level features can be represented in the HTML DOM using different extension mechanisms:
 * Introduce custom elements with specific attributes. For display or WYSIWYG editing, these elements would need to be expanded with the template contents, thumbnail HTML and so on.
 * Expand higher-level features to their presentational DOM, but identify and annotate the result using custom attributes. This is the approach we have taken so far in the JS parser. Template arguments and similar information are stored as JSON in data attributes, which made their conversion to the JSON-based WikiDom format quite easy.
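The second approach could look roughly like the sketch below. The attribute name `data-template`, the wrapper `div` and the JSON layout are assumptions made for illustration; they are not the names used by the JS parser.

```python
import json
from html import escape
from html.parser import HTMLParser


def annotate(expanded_html: str, template: str, args: dict) -> str:
    """Wrap expanded output and attach the template call as JSON.

    The attribute name "data-template" is hypothetical.
    """
    payload = json.dumps({"template": template, "args": args}, sort_keys=True)
    return '<div data-template="%s">%s</div>' % (
        escape(payload, quote=True), expanded_html)


class AnnotationReader(HTMLParser):
    """Recover the JSON payloads from annotated output."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "data-template":
                # html.parser has already unescaped the attribute value.
                self.found.append(json.loads(value))


html_out = annotate("<b>The rendered name</b>", "Template:Person",
                    {"name": "The rendered name"})
reader = AnnotationReader()
reader.feed(html_out)
print(reader.found)
```

Because the payload is plain JSON, converting it to another JSON-based structure is mostly a matter of reshaping dictionaries.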

The biggest disadvantage of these extension methods is their custom and ad-hoc nature. Microdata as defined in the HTML5 spec promises a standardized but still very flexible solution, which is otherwise very similar to custom data attributes in its support for fully expanded templates or thumbnail structures.

Assuming a template that expands to a div and some content, this would look somewhat like this:

 A static header from the template
 The rendered name

In this case, an expanded template argument within (for example) an infobox is identified inside the template-provided HTML structure, which could enable in-place editing.
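A minimal sketch of what such an annotated expansion might look like, together with a small reader that locates the expanded argument. The `itemtype` URL and the property name `name` are hypothetical; only the microdata attributes `itemscope`, `itemtype` and `itemprop` come from the HTML5 spec.

```python
from html.parser import HTMLParser

# Hypothetical markup: the itemtype URL and the property name are
# illustrative only, not taken from an actual template.
EXPANDED = """
<div itemscope itemtype="https://example.org/wiki/Template:Person">
  <h2>A static header from the template</h2>
  <span itemprop="name">The rendered name</span>
</div>
"""


class ItemPropReader(HTMLParser):
    """Collect the text of every element that carries an itemprop attribute.

    Simplification: void elements (such as <br>) inside a captured
    element are not handled.
    """

    def __init__(self):
        super().__init__()
        self.itemtype = None
        self.props = {}
        self._current = None  # itemprop currently being captured
        self._depth = 0       # nesting depth inside that element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._current is not None:
            self._depth += 1
        elif "itemprop" in attrs:
            self._current = attrs["itemprop"]
            self._depth = 1
            self.props[self._current] = ""
        if "itemscope" in attrs:
            self.itemtype = attrs.get("itemtype")

    def handle_endtag(self, tag):
        if self._current is not None:
            self._depth -= 1
            if self._depth == 0:
                self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self.props[self._current] += data


reader = ItemPropReader()
reader.feed(EXPANDED)
print(reader.itemtype)
print(reader.props)
```

The static header carries no `itemprop` and is therefore skipped, while the span holding the expanded argument is found by name. An editor could use the same lookup to decide which parts of the output are editable in place.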

Unused arguments (which are not found in the template output) could still be represented as data attributes. The microdata spec, however, proposes to use non-displaying meta elements for this purpose:

 A static header from the template
 The rendered name
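As a rough sketch, an unused argument could be carried in a `meta` element next to the visible output. The argument name `sortkey`, its value and the `itemtype` URL are hypothetical, and the markup is written as well-formed XML only so that the standard library parser can read it.

```python
import xml.etree.ElementTree as ET

# Hypothetical expansion with one argument that is not displayed.
EXPANDED = """
<div itemscope="" itemtype="https://example.org/wiki/Template:Person">
  <meta itemprop="sortkey" content="Name, Rendered" />
  <h2>A static header from the template</h2>
  <span itemprop="name">The rendered name</span>
</div>
"""


def split_arguments(markup):
    """Return (visible, hidden) dicts of itemprop values."""
    root = ET.fromstring(markup.strip())
    visible, hidden = {}, {}
    for element in root.iter():
        prop = element.get("itemprop")
        if prop is None:
            continue
        if element.tag == "meta":
            # meta elements do not render; the value lives in "content".
            hidden[prop] = element.get("content", "")
        else:
            visible[prop] = "".join(element.itertext())
    return visible, hidden


visible, hidden = split_arguments(EXPANDED)
print(visible)
print(hidden)
```

Both kinds of argument end up in the same item, so a serializer can reconstruct the full template call without consulting a separate data attribute.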

The itemref mechanism can be used to tie together template data from a single template that does not expand to a single subtree.
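A sketch of how `itemref` might be resolved, assuming a template whose output ends up in two separate places on the page. The ids, property names and `itemtype` URL are hypothetical; nested items are not handled.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: the template output is split across two subtrees.
PAGE = """
<body>
  <div itemscope="" itemtype="https://example.org/wiki/Template:Person" itemref="person-footer">
    <span itemprop="name">The rendered name</span>
  </div>
  <p>Unrelated page content</p>
  <p id="person-footer">Born in <span itemprop="birthplace">London</span></p>
</body>
"""


def collect_item(markup):
    """Gather properties from the item and from every subtree it references."""
    root = ET.fromstring(markup.strip())
    by_id = {el.get("id"): el for el in root.iter() if el.get("id")}
    item = next(el for el in root.iter() if el.get("itemscope") is not None)
    subtrees = [item]
    # itemref holds a space-separated list of element ids.
    for ref in item.get("itemref", "").split():
        if ref in by_id:
            subtrees.append(by_id[ref])
    props = {}
    for subtree in subtrees:
        for el in subtree.iter():
            prop = el.get("itemprop")
            if prop is not None:
                props[prop] = "".join(el.itertext())
    return item.get("itemtype"), props


itemtype, props = collect_item(PAGE)
print(itemtype)
print(props)
```

The unrelated paragraph between the two subtrees is ignored, while both properties are attributed to the same template item.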

The itemtype attributes in these examples all point to the template location, which normally contains plain-text documentation of the template parameters and their semantics. The most common application of microdata is, however, based on standardized schemas (e.g. http://schema.org/Photograph), which are used by search engines and data miners to extract structured information. A mapping of semi-structured template arguments to a standard schema is possible, as demonstrated by http://dbpedia.org/. It appears to be feasible to provide a similar mapping directly as microdata within the template documentation, which could then potentially be used to add standard schema information to regular HTML output when rendering a page.
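Such a mapping could be as simple as a per-template table from parameter names to schema properties. In the sketch below the template parameter names (`title`, `photographer`, `date`, `border`) are hypothetical; the targets are properties that schema.org defines for Photograph via its parent types.

```python
SCHEMA_TYPE = "http://schema.org/Photograph"

# Hypothetical template parameters mapped to schema.org property names.
PARAMETER_TO_PROPERTY = {
    "title": "name",
    "photographer": "author",
    "date": "dateCreated",
}


def to_schema(args):
    """Translate template arguments; report the ones without a mapping."""
    mapped = {
        PARAMETER_TO_PROPERTY[key]: value
        for key, value in args.items()
        if key in PARAMETER_TO_PROPERTY
    }
    unmapped = sorted(key for key in args if key not in PARAMETER_TO_PROPERTY)
    return {"type": SCHEMA_TYPE, "properties": mapped}, unmapped


item, unmapped = to_schema({
    "title": "Harbour at dusk",
    "photographer": "A. Example",
    "border": "none",
})
print(item)
print(unmapped)
```

Presentation-only parameters such as the hypothetical `border` simply stay unmapped and would not appear in the standard schema output.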

The visual editor might be able to use schema information to provide a custom editing experience for content generated from higher-level features. Inline editing of fields in infoboxes is one possibility, but in other cases a popup widget might be more appropriate. Additional microdata in template documentation sections could be used to customize the edit interface per template.

There are still several issues to solve, but I think the general direction of reusing standards as far as possible and hooking into the thriving HTML5 ecosystem should help to make Wikipedia's data more accessible. It also allows us to reuse quite a few libraries and infrastructure, and makes our own developments more useful to others.

What do you think?