Specs/HTML

See Parsoid/HTML5 DOM with microdata for the general idea and background. This is a work in progress; feel free to suggest improvements! See http://rdfa.info/ for RDFa documentation and a live parser.

RDFa structures
Global prefix mappings:
 * Convention: Capital for types, lowercase for attributes.
 * Generally use the prefix instead of vocab definitions to avoid clashes (and allow mixing) with user-supplied RDFa. User-supplied RDFa with the mw prefix is moved to a non-clashing prefix in Parsoid.

mw:Placeholder and general client behavior
An element marked with typeof="mw:Placeholder" protects its DOM structure from any editing. Clients are expected to preserve and protect subtrees marked as such. Clients are also expected to preserve any DOM subtrees marked up with RDFa in the http://mediawiki.org/rdf/ namespace that they don't understand. This decouples clients from Parsoid development and lets them concentrate on editing constructs whose special semantics they understand, without having to implement all possible content elements.
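The client rule described here can be sketched as a small predicate. This is only an illustration, not Parsoid or client code: the function name `is_protected`, the `known_types` parameter, and the example type names are assumptions, and the sketch further assumes that the `mw:` prefix is mapped to the http://mediawiki.org/rdf/ namespace.

```python
def is_protected(attrs, known_types=frozenset()):
    """Return True if a client should leave this element's subtree untouched.

    attrs: dict of the element's HTML attributes.
    known_types: the mw: terms this (hypothetical) client knows how to edit.
    """
    # Collect the whitespace-separated RDFa terms from the relevant attributes.
    tokens = []
    for name in ("typeof", "rel", "property"):
        tokens.extend(attrs.get(name, "").split())

    # Only terms in the mw: prefix are subject to the preservation rule.
    mw_tokens = [t for t in tokens if t.startswith("mw:")]

    # Placeholders are always read-only.
    if "mw:Placeholder" in mw_tokens:
        return True

    # Any mw: term the client does not understand also protects the subtree.
    return any(t not in known_types for t in mw_tokens)
```

A client that only understands one link type would call `is_protected(attrs, {"mw:WikiLink"})` before allowing an edit. The name `mw:WikiLink` is used purely as an example of a known term.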

Thumbnails
 

Simple image
 

Wiki links

 * The href attribute is UTF-8 (like everything else), with a relative link prefix that always navigates up to the top of the wiki namespace. This matters especially in subpages and pages containing slashes in the title. Example: './Foo', or (in a subpage) './../Foo'. We percent-encode percent signs and question marks in hrefs to support following links to wiki pages with question marks in their name. On the way in (when posting HTML to Parsoid) we assume href values to be urlencoded and decode them during serialization. Modified link hrefs without a ./ or ../ prefix are for now assumed to be absolute to the wiki namespace, but will soon be interpreted as relative to the page, to support relative links in other HTML content. After that change, the equivalent of an absolute wikilink to Foo would need to use href="/Foo" instead.
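The href rules above can be sketched as a pair of helpers. This is a minimal sketch under stated assumptions, not Parsoid's implementation: the function names `title_to_href` and `href_to_title` are hypothetical, the number of `../` segments is assumed to equal the number of slashes in the current page title, and other title normalization (such as spaces versus underscores) is ignored.

```python
from urllib.parse import unquote


def title_to_href(target, current_page):
    """Build a relative href for a wiki link (outgoing direction)."""
    # One "../" per slash in the current page title, so the link always
    # resolves from the top of the wiki namespace.
    prefix = "./" + "../" * current_page.count("/")
    # Encode "%" first so the "%" introduced for "?" is not encoded twice.
    encoded = target.replace("%", "%25").replace("?", "%3F")
    return prefix + encoded


def href_to_title(href):
    """Recover the link target from a posted href (incoming direction)."""
    path = href
    # Strip the relative prefix; hrefs without one are treated as absolute
    # to the wiki namespace, as the text describes for the current behavior.
    if path.startswith("./"):
        path = path[2:]
    while path.startswith("../"):
        path = path[3:]
    # Incoming hrefs are assumed to be urlencoded.
    return unquote(path)
```

For example, `title_to_href("Foo", "Bar/Baz")` gives `./../Foo`, matching the subpage example in the text, and `href_to_title` reverses the encoding of `%` and `?`.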

 alternate linked content 

 Main Page 

Link with tail:  Potatoes 

Category links
 

 

Language links
Status: In development / not yet implemented! See bug 42160.

<tt> Foo </tt>

Interwiki non-language links
Status: In development / not yet implemented! See bug 42160.

<tt> en:Foo </tt>

Autolinked URLs
<tt> http://example.com </tt>

Numbered external link
<tt> </tt>

Named external link
<tt> Link content </tt>

ISBN link
<tt> ISBN 978-1413304541 </tt>

RFC link
<tt> RFC 1945 </tt>

PMID link
<tt> PMID 20610307 </tt>

Nowiki blocks
There are two options to handle nowiki editing:
 * 1) Strip the tags from the DOM and let the serializer add those that are needed after each edit
 * 2) Keep them in the DOM for more accurate round-tripping of manually created nowiki blocks, and prevent non-text content from being entered into these blocks in the editor (TODO)

We picked option 2 for now. The nowiki content remains editable. If the content is modified in a way that makes nowiki unnecessary, Parsoid can remove the wrapper in the serializer.

<tt> foo </tt>

HTML entities
<tt> œ </tt>

Behavior switches
See Help:Magic_words. Not yet implemented, tracked in bug 37909.

<tt> </tt>

<tt> </tt>

<tt> __NEWSECTIONLINK__ </tt>

<tt> __NONEWSECTIONLINK__ </tt>

<tt> __NOGALLERY__ </tt>

<tt> __HIDDENCAT__ </tt>

<tt> __NOCONTENTCONVERT__ </tt>

<tt> __NOCC__ </tt>

<tt> __NOTITLECONVERT__ </tt>

<tt> __NOTC__ </tt>

<tt> </tt>

<tt> __NOINDEX__ </tt>

<tt> __INDEX__ </tt>

<tt> __STATICREDIRECT__ </tt>

Template content
Implementation progress tracked in bug 37911.

<tt> </tt> <meta about="#mw-t1" property="mwt0:Foo#1" content="positional"> <meta about="#mw-t1" property="mw:src" content="http://en.wikipedia.org/wiki/Template:Foo">


 * Define a global prefix for the template namespace (mwt0 in this case). Reasoning: Prefix definitions are scoped to a DOM subtree, so the prefix definition would otherwise need to be repeated for multi-rooted template output. A global prefix should also be easier to figure out, and makes semantic sense since we are talking about the same property even if it is transcluded repeatedly. The trailing colon in the namespace URL apparently needs to be urlencoded, at least to satisfy http://rdfa.info/play.
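As an illustration of the urlencoded trailing colon, the following sketch builds a prefix mapping and a parameter property name like the mwt0:Foo#1 shown above. The helper names, the wiki base URL argument, and the exact shape of the mapping string (RDFa 1.1 `prefix` attribute syntax) are assumptions for the example, not something this spec defines.

```python
from urllib.parse import quote


def template_prefix_mapping(prefix, wiki_base, namespace="Template"):
    """Return an RDFa prefix mapping string for the template namespace."""
    # safe="" forces the trailing ":" to be percent-encoded as %3A.
    return f"{prefix}: {wiki_base}{quote(namespace + ':', safe='')}"


def param_property(prefix, template, param):
    """Return a property name for one template parameter, e.g. mwt0:Foo#1."""
    return f"{prefix}:{template}#{param}"


mapping = template_prefix_mapping("mwt0", "http://en.wikipedia.org/wiki/")
prop = param_property("mwt0", "Foo", 1)
```

Because the prefix is defined once globally, the same `prop` value can be reused for every transclusion of the template without repeating the mapping on each root node of the output.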

Templates in attributes
<tt> Some text content </tt>

<tt> <div style="">... </tt>

For editing purposes, the attribute content could be stored as a serialized HTML DOM. Alternatively, we could include it directly as a sub-DOM in a div-wrapped section at the start or end of the document.

Extension content
<tt> </tt>

noinclude / includeonly / onlyinclude
Not yet implemented, tracked in bug 40305. We only care about these in the actual page context, not in transcluded pages / templates.

<tt> foo bar baz </tt>

<tt> foo bar baz </tt>

<tt> foo bar baz </tt>

TODO
The following constructs still need an RDFa markup definition. They will initially only be marked with typeof="mw:Placeholder" for simple read-only round-tripping.
 * Unexpanded and expanded templates
 * template parameter references
 * noinclude, onlyinclude, includeonly
 * behavior switches (only typeof="mw:Placeholder" currently, source-based round-tripping)
 * tag extensions including citations (partly done)
 * redirects
 * ISBN / RFC / PMID autolinks
 * galleries