Parsoid/MediaWiki DOM spec/Rich Attributes

This is an experimental proposal for a future revision of the MediaWiki DOM spec. Patch at gerrit 821281.

The problem
The DOM model of HTML is not orthogonal. Elements can contain elements which can contain elements, in a pleasant tree structure, but attributes of elements are limited to plain strings. You cannot nest further structure inside an attribute, and you cannot store multiple values within an attribute (although there are hacks involving string-separated tokens). This is a fairly well-known issue with XML, with common advice given such as "If you use attributes as containers for data, you end up with documents that are difficult to read and maintain. Try to use elements to describe data. Use attributes only to provide information that is not relevant to the data." and similar advice elsewhere.

But HTML uses attributes all over the place. And in some places it is essential, for example the  attribute of an   tag. The most natural rendering of the wikitext: is something like: where  is not trivially clear; ideally, the same   wrapper we use for the caption to embed metadata about the transclusion (of   in this case) could be used inside the   attribute as well.

Examples of content often embedded within HTML attributes in the MediaWiki DOM spec:


 * Transclusions (templates, etc)
 * Language Converter markup
 * i18n/l10n markup (system messages/ux)
 * Annotations (translation boundaries, etc)
 * Style and title attributes, which can have (eg) boldface or other formatting applied in wikitext
 * Generated attributes of HTML tags (with special ad-hoc markup in the DOM spec)
 * This overlaps with many of the categories above, but the following template-generated attributes are specifically called out in the MediaWiki DOM spec: attributes of literal HTML tags in wikitext,  attributes of links,  /width/height/caption/  of media

Another related issue is "invisible HTML content", for example the invisible caption of a media file which is currently being displayed inline, the output of a suppressed language converter rule, the output for language variants which are not the current one, etc. These cannot be embedded directly in the output HTML because they may break the HTML content model: for example, block-type content in a paragraph context. Such content shouldn't break the paragraph, because it is currently invisible, but if you just dropped it into the document with a  CSS style it would break its container. We typically "hide" this content in an attribute (currently a JSON-valued attribute), but this complicates HTML traversal: various html2html transformations need to know enough about these special hiding places to recurse inside and mutate the embedded HTML.

Note that we are focusing on structured data in attribute values here; although one can certainly imagine structured data in attribute keys (or element tag names!), we are explicitly keeping that out of scope. Attribute keys are like element tag names: they are identifiers, not user-generated content. (A future spec may add 'key value pairs' to the in/out types allowed for transclusion, which would be the way to support dynamic key names in our framework.)

Current solution
The generated attributes of HTML tags portion of the MediaWiki DOM Spec works out a system for recording the template-affected portions of an attribute value, as an array of "parts", stored in the  value. However, this system has so far been used only for template-affected attributes, not attributes containing (for example) language converter or i18n/l10n-related markup.

The key used in the  value does not always bear a direct relationship to the HTML attribute it is describing, for example this sample markup from the MediaWiki DOM spec for the wikitext  :

Note that template-affected attributes are  and the   tag of the internal   tag, but the attribute information is on the   tag and the names in   are   and. This may in fact be the best/only way to handle complicated situations like media, where the attribute values do not bear a one-to-one resemblance to wikitext, but for the simpler  and   a more direct correspondence would make the Parsoid HTML easier to interpret and traverse.

But you could certainly make the argument that this markup is adequate, if not exactly consistent in all cases, and that we could simply build a better traverser that was aware of  and make this existing markup easier to correctly generate for template-affected content, and to traverse and mutate in html2html passes.

The fact that DOM-valued attributes are flattened into strings during processing causes considerable headaches, however, with any mutating traversal requiring parsing the string to DOM, doing a mutation, and then reserializing the fragment to a string in order to store the string in the attribute. These issues compound if the nested DOM itself has DOM-valued attributes, and our current "data bag" framework does not handle this very well.

Proposed solution: Rich attributes
I'd like to propose a slightly more general solution, which allows Parsoid code to treat attributes as structured values with complex types, and defines a standard serialization of these values into "normal" HTML. In some sense we are already doing this for JSON-valued attributes; we're just going to extend the set of rich values to include not just JSON but also DocumentFragments (and JSON values which contain DocumentFragments). It also separates the idea of "JSON-valued attribute" from the page bundle representation: not all JSON-valued attributes are in the page bundle, nor does the page bundle necessarily contain every JSON-valued attribute. By adding consistency, it attempts to allow generic traversal of a DOM, including all document fragments stored in JSON objects or attributes.

We also attempt to preserve the "normal HTML semantics" of attributes, storing a flattened string representation of well-known attributes like,  , and   even when the full structured value is stored elsewhere. We also store the flattened string representation whenever that wouldn't lose information, in order to avoid bloat as much as possible. See Details below.

We define three basic types of attribute values: "string", (JSON) "object", and "DOM" (DocumentFragment), with API methods to match:


 * Note that the associative array returned by this method may also contain s or  s

Additional  setters and   (which sets a default value if the attribute is missing) methods are provided.

These methods work by first encoding any given value as JSON:


 * A plain string value is encoded as
 * A DocumentFragment is encoded as
 * A JSON value (associative array) is encoded by:
 * Any key with a name starting with an underscore has an additional underscore prepended to its name
 * Any array or object value is recursively encoded with this algorithm.
 * (Optionally) object values can also use  and/or a   method to customize their encoding.

So nominally,  would result in: (but see the 'details' section below), and a complex JSON object type may embed like this: Now, as desired, I can look at any attribute value and if the first character is   or   I can use this algorithm to reconstruct the rich value and traverse any embedded  s as desired. In my code I can also directly use  or   and don't have to worry about serialization/deserialization of the value. The object and the DocumentFragment are "live" and can be directly modified, and the mutated value will be properly reflected when the "Rich DOM" is next serialized. Because these values are live, any nested rich attributes within a DocumentFragment are also handled cleanly and in the obvious way.
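The encoding and decoding steps described above can be sketched roughly as follows. This is only an illustration, not the Parsoid implementation: the marker keys `_str` and `_dom` are hypothetical placeholders (the actual encodings are not shown in this text), `Fragment` is a stand-in for a DocumentFragment that just holds serialized HTML, and the sketch assumes that only nested arrays and objects are re-encoded while nested scalars pass through unchanged.

```python
import json
from dataclasses import dataclass

# Hypothetical marker keys; the real names are not given in this text.
STRING_KEY = "_str"
DOM_KEY = "_dom"


@dataclass
class Fragment:
    """Stand-in for a DocumentFragment; holds serialized HTML only."""
    html: str


def _encode_nested(value):
    if isinstance(value, Fragment):
        return {DOM_KEY: value.html}
    if isinstance(value, dict):
        # Keys starting with an underscore get one extra underscore so they
        # can never collide with the marker keys.
        return {("_" + k if k.startswith("_") else k): _encode_nested(v)
                for k, v in value.items()}
    if isinstance(value, list):
        return [_encode_nested(v) for v in value]
    return value  # nested scalars are left as they are


def encode(value):
    """Encode a string, Fragment, or JSON-like value as JSON text."""
    if isinstance(value, str):
        return json.dumps({STRING_KEY: value})
    return json.dumps(_encode_nested(value))


def _decode_nested(value):
    if isinstance(value, dict):
        if DOM_KEY in value:
            return Fragment(value[DOM_KEY])
        if STRING_KEY in value:
            return value[STRING_KEY]
        # Undo the underscore escaping applied by _encode_nested.
        return {(k[1:] if k.startswith("_") else k): _decode_nested(v)
                for k, v in value.items()}
    if isinstance(value, list):
        return [_decode_nested(v) for v in value]
    return value


def decode(text):
    """Rebuild the rich value from its JSON text."""
    return _decode_nested(json.loads(text))
```

In this sketch a value such as `{"_private": 1, "caption": Fragment("<b>hi</b>")}` survives a round trip through `encode` and `decode`, because the escaped key `__private` can never be mistaken for a marker key.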

Details
The following optimizations are done in order to reduce bloat and increase compatibility.


 * 1) If the value to be stored is a simple string (including a DocumentFragment with a single Text node) and the first character of that string value is not   or   then the string value is stored directly in the attribute. Thus, the example above where we set   to the string value   would actually be represented as but if we set the attribute to the string value   it would be serialized as so that rich attributes can always be uniquely identified by the first character of the attribute value.
 * 2) For attributes which have "special HTML semantics" (which for now we'll interpret as "any attribute whose name doesn't start with   or  "), we will always store the "flattened" version of the value under the original attribute name, and (if the value is not a simple string) the encoded version under the attribute name prefixed with  . So setting the   of an   tag to the rich DocumentFragment   will result in the serialization: Note that this still works fine when the flattened value of the title/href/etc starts with a   or , since the rich value is stored elsewhere.
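The two optimizations above could be combined along these lines. This is a hedged sketch under several assumptions, none of which come from the spec text: the marker characters are taken to be `{` and `[` (what a JSON encoding would start with), `data-rich-` is a hypothetical name prefix for the encoded copy, `data-` is a hypothetical prefix for attributes without special HTML semantics, and `_str` is a hypothetical wrapper key for strings.

```python
import json

# All four constants are assumptions made for this sketch.
RICH_FIRST_CHARS = ("{", "[")
RICH_NAME_PREFIX = "data-rich-"
PRIVATE_NAME_PREFIXES = ("data-",)
STRING_KEY = "_str"


def serialize_attribute(name, flat, rich_json=None):
    """Return the HTML attributes to emit for one rich attribute.

    flat      -- the flattened string form of the value
    rich_json -- the JSON encoding of the value, or None when the value
                 is a simple string
    """
    if not name.startswith(PRIVATE_NAME_PREFIXES):
        # Optimization 2: keep normal HTML semantics under the original
        # name and store the rich encoding under a prefixed name.
        out = {name: flat}
        if rich_json is not None:
            out[RICH_NAME_PREFIX + name] = rich_json
        return out
    if rich_json is not None:
        return {name: rich_json}
    if flat.startswith(RICH_FIRST_CHARS):
        # A plain string that looks like a rich value must be wrapped so
        # the first character still identifies rich values uniquely.
        return {name: json.dumps({STRING_KEY: flat})}
    # Optimization 1: simple strings are stored directly.
    return {name: flat}
```

Under these assumptions, a plain `title` that happens to start with `[` is emitted unchanged, because the rich form of such attributes always lives under the prefixed name.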

These optimizations make the output look "less weird" in the common cases and preserve HTML semantics for important attributes like  and. One possible drawback is that a naive implementor might get lulled into inattention by the fact that / /etc are "usually" plain text, and get caught unaware by the need to parse a   attribute value when it (unusually) appears. Similarly, those parsing the string-valued  may be caught unaware when the value starts with a   or   and the representation changes. Providing a good "rich attribute API" to clients and encouraging its consistent use should alleviate these issues.

Traversal
An important use case for rich attributes is traversal/mutation during html2html passes. It is important that embedded HTML be "visible" to post processing passes, so that (for example):


 * i18n fragments inside href are expanded
 * redlinks inside language converter markup still work
 * redlinks/language converter markup/i18n fragments inside "invisible media captions" are properly processed -- so that if the VE user toggles the media style from inline to thumb the proper i18n/language converter markup/etc should already be present in the caption.

In order to do this with the current system, we need to consistently use  for all such cases (even things like hidden captions) and write a generic traverser which always goes through   to pull out (parse to JSON, parse to DOM, mutate, serialize HTML, serialize JSON) all rich attributes.

With the rich attributes proposal, we simply look at every attribute value looking for a leading  or   and then recurse into the rich value of any such attributes we find. (For initial transition, we'll probably want an allow list or block list for attribute names we haven't yet converted to the rich attribute system, but this should be straightforward.)
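A generic traverser of this kind might look like the following read-only sketch. It makes several assumptions for illustration: embedded fragments are stored as HTML strings under a hypothetical `_dom` key, the leading marker characters are `{` and `[`, `SKIP_ATTRIBUTES` is a hypothetical block list for not-yet-converted attributes, and Python's `xml.etree.ElementTree` stands in for a real HTML parser (so fragments must be well-formed XML here).

```python
import json
import xml.etree.ElementTree as ET

# Assumptions made for this sketch, not taken from the spec.
DOM_KEY = "_dom"
RICH_FIRST_CHARS = ("{", "[")
SKIP_ATTRIBUTES = {"data-legacy"}  # hypothetical block list


def embedded_fragments(value):
    """Yield every HTML string stored under DOM_KEY in a decoded JSON value."""
    if isinstance(value, dict):
        if DOM_KEY in value:
            yield value[DOM_KEY]
        else:
            for item in value.values():
                yield from embedded_fragments(item)
    elif isinstance(value, list):
        for item in value:
            yield from embedded_fragments(item)


def walk(root):
    """Yield every element under root, including those hidden in rich attributes."""
    for element in root.iter():
        yield element
        for name, text in element.attrib.items():
            if name in SKIP_ATTRIBUTES or not text.startswith(RICH_FIRST_CHARS):
                continue
            try:
                rich = json.loads(text)
            except json.JSONDecodeError:
                continue  # looked rich but was a plain string after all
            for html in embedded_fragments(rich):
                # Wrap the fragment so several top-level nodes parse as one tree.
                wrapper = ET.fromstring(f"<body>{html}</body>")
                for child in wrapper:
                    yield from walk(child)
```

Because `walk` calls itself on each embedded fragment, rich attributes nested inside other rich attributes are visited as well. A mutating pass would additionally need to write the changed fragment back, which the live values in the proposal are meant to make unnecessary.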