Parsoid/MediaWiki DOM spec/Rich Attributes

HThis is an experimental proposal for a future revision of the MediaWiki DOM spec.

The problem
The DOM model of HTML is not orthogonal. Elements can contain elements which can contain elements, in a pleasant tree structure, but  attributes  of elements are limited to plain strings. You can not nest further structure inside an attribute, and you cannot store multiple values within an attribute (although there are hacks involving string-separated tokens). This is a fairly well-known issue with XML, with common advice given such as "If you use attributes as containers for data, you end up with documents that are difficult to read and maintain. Try to use elements to describe data. Use attributes only to provide information that is not relevant to the data." and similar advice elsewhere.

But HTML uses attributes all over the place. And in some places it is essential, for example the  attribute of an   tag. The most natural rendering of the wikitext: is something like: Ideally the same  wrapper we use for the caption to embed metadata about the transclusion (of   in this case) could be used inside the   attribute as well.

Examples of content often embedded within HTML attributes in the MediaWiki dom spec:


 * Transclusions (templates, etc)
 * Language Converter markup
 * i18n/l10n markup (system messages/ux)
 * Annotations (translation boundaries, etc)

Another related issue is "invisible HTML content", for example the invisible caption of an media file which is currently being displayed inline, the output of a suppressed language converter rule, the output for language variants which are not the current one, etc. These can not be embedded directly in the output HTML because they may break the HTML content model -- for example, block type content in a paragraph context. That shouldn't break the paragraph because the content is currently invisible, but if you just dropped it into the document with a  CSS style it would break its container. We typically "hide" this content in an attribute (currently a JSON-valued attribute) but then it complicates HTML traversal: various html2html transformations need to know enough about these special hiding places in order to recurse inside and mutate the embedded HTML.