Parsoid/MediaWiki DOM spec/Rich Attributes

This is an experimental proposal for a future revision of the MediaWiki DOM spec. Patch at gerrit 821281; see also T214994.

The problem
The DOM model of HTML is not orthogonal. Elements can contain elements which can contain elements, in a pleasant tree structure, but attributes of elements are limited to plain strings. You cannot nest further structure inside an attribute, and you cannot store multiple values within an attribute (although there are hacks involving string-separated tokens). This is a fairly well-known issue with XML, and common advice runs along these lines: "If you use attributes as containers for data, you end up with documents that are difficult to read and maintain. Try to use elements to describe data. Use attributes only to provide information that is not relevant to the data." Similar advice is given elsewhere.

But HTML uses attributes all over the place. And in some places it is essential, for example the  attribute of an   tag. The most natural rendering of the wikitext: is something like: where  is not trivially clear; ideally, the same   wrapper we use for the caption to embed metadata about the transclusion (of   in this case) could be used inside the   attribute as well.

Examples of content often embedded within HTML attributes in the MediaWiki DOM spec:


 * Transclusions (templates, etc)
 * Language Converter markup
 * i18n/l10n markup (system messages/ux)
 * Annotations (translation boundaries, etc)
 * Style and title attributes, which can have (eg) boldface or other formatting applied in wikitext
 * Generated attributes of HTML tags (with special ad-hoc markup in the DOM spec)
 * This overlaps with many of the categories above, but the following template-generated attributes are specifically called out in the MediaWiki DOM spec: attributes of literal HTML tags in wikitext,  attributes of links,  /width/height/caption/  of media

Another related issue is "invisible HTML content", for example the invisible caption of a media file which is currently being displayed inline, the output of a suppressed language converter rule, the output for language variants which are not the current one, etc. These cannot be embedded directly in the output HTML because they may break the HTML content model -- for example, block type content in a paragraph context. That shouldn't break the paragraph because the content is currently invisible, but if you just dropped it into the document with a  CSS style it would break its container. We typically "hide" this content in an attribute (currently a JSON-valued attribute), but this complicates HTML traversal: various html2html transformations need to know enough about these special hiding places in order to recurse inside and mutate the embedded HTML.

Note that we are focusing on structured data in attribute values here; although one can certainly imagine structured values for attribute names (or element tag names!), we are explicitly keeping that out of scope. Attribute names, like element tag names, are identifiers, not user-generated content. (A future spec may add "key value pairs" to the argument/output types allowed for transclusion, which would be the way to support dynamic key names in our framework.)

Current solutions
The generated attributes of HTML tags portion of the MediaWiki DOM Spec works out a system for recording the template-affected portions of an attribute value, as an array of "parts", stored in the  value. This mechanism works for  and for   but isn't a fully general mechanism; eg it doesn't work for.

The key used in the  value does not always bear a direct relationship to the HTML attribute it is describing, for example this sample markup from the MediaWiki DOM spec for the wikitext  :

Note that template-affected attributes are  and the   tag of the internal   tag, but the attribute information is on the   tag and the names in   are   and. This may in fact be the best/only way to handle complicated situations like media, where the attribute values do not bear a one-to-one resemblance to wikitext, but for the simpler  and   a more direct correspondence would make the Parsoid HTML easier to interpret and traverse. Even for media, you could argue that (eg) the  attribute should be markup applied to the   whereas the   markup applies it to the wrapper.

We also have a "shadow attribute" mechanism which is similar, in that it stores a "richer" value for a given attribute in a hidden   property.

Relatedly, structured values for  and   are supported via a core interface to fetch a "JSON attribute":   which returns an associative array. The implementation of this mechanism is discussed further in Parsoid/OutputTransform/HtmlHolder. Many of the features of structured-value attributes in Parsoid (such as live object representation of values) are restricted to the  and   attributes. Other attributes with structured values accessed via ::getJSONAttribute get a copy of the value which must be explicitly re-written using  after it is mutated.

There is no "built-in" support for storing document fragments in structured-value attributes; in a number of places where this is done the values are manually parsed from/serialized to strings. This does not interoperate well with the  mechanism used for   and   attributes (discussed in link above).

We currently have DOM traversal code which is aware of  and some other places where embedded markup can be stored. Because of the limited support for embedded HTML in structured-value attributes, the traversal code has to explicitly parse embedded HTML and then restore potentially-modified HTML after the traversal has completed, regardless of whether the traversal actually mutated the embedded DOM. These issues compound if the embedded DOM itself has structured-value attributes, potentially including additional embedded HTML.

Proposals
This proposal is called "rich attributes" to distinguish it from the existing "structured-value attribute" support in Parsoid. There are two main pieces to this proposal, which can be discussed and implemented separately. The first proposal is to make our existing structured-value support more general, to (a) support attributes other than  and , and (b) extend support to include   values (both at top level and embedded). The second proposal is to (a) adopt a uniform representation for structured-value attributes in "standard" HTML, including plaintext fallback values and alternative names, and (b) use a standardized marker for structured-value attributes, so that generic traversal through the extended DOM, including DocumentFragments embedded within structured values, is possible.

The first proposal is primarily targeted at internal users: to provide a cleaner mental model and API for manipulation of DOM trees containing structured data, and to better support the traversal and manipulation of document fragments embedded within attributes.

The second proposal is aimed at external users: to allow manipulation of a DOM with rich attributes independent of detailed knowledge about the specific attributes containing structured data.

Proposal 1: New API for structured-value attributes
I'd like to propose a slightly more general solution, which allows Parsoid code to treat attributes as structured values with complex types, and defines a standard serialization of these values into "normal" HTML. In some sense we are already doing this for JSON-valued attributes; we're just going to extend the set of rich values to include not just JSON but also DocumentFragments (and JSON values which contain DocumentFragments). It also separates the idea of "JSON-valued attribute" from the page bundle representation -- not all JSON-valued attributes are in the page bundle, nor does the page bundle necessarily contain every JSON-valued attribute. By adding consistency it attempts to allow generic traversal of a DOM, including all document fragments stored in JSON objects or attributes.

We also attempt to preserve the "normal HTML semantics" of attributes, storing a flattened string representation of well-known attributes like,  , and   even when the full structured value is stored elsewhere; this is similar to Parsoid's "shadow attributes". We also store the flattened string representation whenever that wouldn't lose information, in order to avoid bloat as much as possible. See Details below.

We define three basic types of attribute values: "string", (JSON) "object", and "DOM" (DocumentFragment), with API methods to match:


 * Note that the associative array returned by this method may also contain s or  s

Additional  setters and   (which sets a default value if the attribute is missing) methods are provided.

Proposed solution: Rich attributes
These methods work by first encoding any given value as JSON:


 * A plain string value is encoded as
 * A DocumentFragment is encoded as
 * A JSON value (associative array) is encoded by:
 * Any key with a name starting with an underscore has an additional underscore prepended to its name
 * Any array or object value is recursively encoded with this algorithm.
 * (Optionally) object values can also use  and/or a   method to customize their encoding.

So nominally,  would result in: (but see the 'details' section below), and a complex JSON object type may embed like this:

Now, as desired, I can look at any attribute value and if the first character is   or   I can use this algorithm to reconstruct the rich value and traverse any embedded  s as desired. In my code I can also directly use  or   and don't have to worry about serialization/deserialization of the value. The object and the DocumentFragment are "live" and can be directly modified, and the mutated value will be properly reflected when the "Rich DOM" is next serialized. Because these values are live, any nested rich attributes within a DocumentFragment are also handled cleanly and in the obvious way. (See the "Live Storage" section below.)
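The encoding rules above can be sketched roughly as follows. This is an illustration only, written in Python rather than in Parsoid's own code: the marker keys `_s` and `_h`, the `Fragment` stand-in for a DocumentFragment, and all function names are assumptions made for the sketch, since the exact encoded forms are not reproduced in this text. It also assumes that scalars nested inside a JSON value are left unwrapped.

```python
import json
from dataclasses import dataclass

# Hypothetical marker keys, chosen only for this sketch.
STRING_KEY = "_s"
FRAGMENT_KEY = "_h"


@dataclass
class Fragment:
    """Stand-in for a DocumentFragment; holds serialized markup."""
    markup: str


def _escape_key(key):
    # Keys starting with an underscore get one more underscore prepended,
    # so user keys can never collide with the marker keys.
    return "_" + key if key.startswith("_") else key


def _encode(value):
    if isinstance(value, Fragment):
        return {FRAGMENT_KEY: value.markup}
    if isinstance(value, dict):
        return {_escape_key(k): _encode(v) for k, v in value.items()}
    if isinstance(value, list):
        return [_encode(v) for v in value]
    return value  # assumption: nested scalars are stored as-is


def encode_rich(value):
    """Encode a string, Fragment, or JSON-like value as a JSON string."""
    if isinstance(value, str):
        return json.dumps({STRING_KEY: value})
    return json.dumps(_encode(value))


def _decode(value):
    if isinstance(value, list):
        return [_decode(v) for v in value]
    if isinstance(value, dict):
        if set(value) == {FRAGMENT_KEY}:
            return Fragment(value[FRAGMENT_KEY])
        if set(value) == {STRING_KEY}:
            return value[STRING_KEY]
        # Undo the key escaping applied by _escape_key.
        return {(k[1:] if k.startswith("_") else k): _decode(v)
                for k, v in value.items()}
    return value


def decode_rich(text):
    """Rebuild the rich value from its JSON string form."""
    return _decode(json.loads(text))
```

Because escaped user keys always begin with at least two underscores, a single-underscore key in the encoded form can only be a marker, which is what lets the decoder recover the original type without extra bookkeeping.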

Details
The following optimizations are done in order to reduce bloat and increase compatibility.


 * 1) If the value to be stored is a simple string (including a DocumentFragment with a single Text node) and the first character of that string value is not   or   then the string value is stored directly in the attribute. Thus, the example above where we set   to the string value   would actually be represented as but if we set the attribute to the string value   it would be serialized as so that rich attributes can always be deserialized to the correct type by looking at the first character of the attribute value.
 * 2) For attributes which have "special HTML semantics" (which for now we'll interpret as "any attribute whose name doesn't start with   or  "), we will always store the "flattened" version of the value under the original attribute name, and (if the value is not a simple string) the encoded version under the attribute name prefixed with  .  So setting the   of an   tag to the rich DocumentFragment   will result in the serialization: Note that this still works fine when the flattened value of the title/href/etc starts with a   or , since the rich value is stored elsewhere.

These optimizations make the output look "less weird" in the common cases and preserve HTML semantics for important attributes like  and. One possible drawback is that a naive implementor might get lulled into inattention by the fact that / /etc are "usually" plain text, and get caught unaware by the need to parse a   attribute value when it (unusually) appears. Similarly, those parsing the string-valued  may be caught unaware when the value starts with a   or   and the representation changes. Providing a good "rich attribute API" to clients and encouraging its consistent use should alleviate these issues.
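The two optimizations can be sketched as a single serialization function. Everything concrete here is an assumption for illustration: the marker characters `{` and `[`, the `data-rich-` prefix, the rule that "special" means "name does not start with `data-`", the `_s`/`_h` keys, and the crude tag-stripping used to flatten a fragment. The sketch handles only string and fragment values.

```python
import html
import json
import re
from dataclasses import dataclass

MARKERS = ("{", "[")         # assumed leading characters of an encoded value
RICH_PREFIX = "data-rich-"   # hypothetical prefix for the encoded copy
STRING_KEY, FRAGMENT_KEY = "_s", "_h"  # hypothetical marker keys


@dataclass
class Fragment:
    """Stand-in for a DocumentFragment; holds serialized markup."""
    markup: str


def flatten(value):
    """Plain-text form of a value (crude: strips tags with a regex)."""
    if isinstance(value, str):
        return value
    return html.unescape(re.sub(r"<[^>]*>", "", value.markup))


def is_simple_string(value):
    # A fragment without any tags stands in for "a single Text node".
    return isinstance(value, str) or "<" not in value.markup


def has_special_semantics(name):
    # Assumption for this sketch only.
    return not name.startswith("data-")


def encode(value):
    if isinstance(value, str):
        return json.dumps({STRING_KEY: value})
    return json.dumps({FRAGMENT_KEY: value.markup})


def serialize_attribute(name, value):
    """Return the HTML attributes (name -> string) that represent value."""
    flat = flatten(value)
    if has_special_semantics(name):
        # Rule 2: flattened text under the original name, rich copy alongside.
        out = {name: flat}
        if not is_simple_string(value):
            out[RICH_PREFIX + name] = encode(value)
        return out
    # Rule 1: store simple strings directly unless they look like an encoding.
    if is_simple_string(value) and not flat.startswith(MARKERS):
        return {name: flat}
    return {name: encode(value)}
```

For example, under these assumptions `serialize_attribute("title", Fragment("<b>bold</b> title"))` yields a plain `title` of `bold title` plus a `data-rich-title` attribute carrying the encoded fragment, while a `title` whose text merely starts with `{` is stored unchanged.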

Traversal
An important use case for rich attributes is traversal/mutation during html2html passes. It is important that embedded HTML be "visible" to post processing passes, so that (for example):


 * i18n fragments inside href are expanded
 * redlinks inside language converter markup still work
 * redlinks/language converter markup/i18n fragments inside "invisible media captions" are properly processed -- so that if the VE user toggles the media style from inline to thumb the proper i18n/language converter markup/etc should already be present in the caption.

In order to do this with the current system, we need to consistently use  for all such cases (even things like hidden captions) and write a generic traverser which always goes through   to pull out (parse to JSON, parse to DOM, mutate, serialize HTML, serialize JSON) all rich attributes.

With the rich attributes proposal, we simply look at every attribute value looking for a leading  or   and then recurse into the rich value of any such attributes we find. (Note that the DataBag may already have cached the rich value as a live object; see next section.)

For initial transition, we'll probably want an allow list or block list for attribute names we haven't yet converted to the rich attribute system, but this should be straightforward.
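A generic traverser along these lines might look like the following sketch. It assumes the marker characters are `{` and `[`, that an embedded fragment is encoded under a hypothetical `_h` key, and that embedded markup is well-formed enough for Python's `xml.dom.minidom` to parse; a real implementation would use an HTML parser and the live values described in the next section rather than re-parsing. The `skip_attrs` parameter plays the role of the transitional block list.

```python
import json
from xml.dom import minidom

MARKERS = ("{", "[")   # assumed leading characters of a rich value
FRAGMENT_KEY = "_h"    # hypothetical marker key for embedded markup


def parse_fragment(markup):
    # minidom needs a single root element and well-formed XML.
    return minidom.parseString("<fragment>" + markup + "</fragment>").documentElement


def embedded_fragments(value):
    """Yield the markup of every fragment nested in a decoded JSON value."""
    if isinstance(value, dict):
        if set(value) == {FRAGMENT_KEY} and isinstance(value[FRAGMENT_KEY], str):
            yield value[FRAGMENT_KEY]
        else:
            for item in value.values():
                yield from embedded_fragments(item)
    elif isinstance(value, list):
        for item in value:
            yield from embedded_fragments(item)


def walk(node, skip_attrs=frozenset()):
    """Yield every element under node, including those inside rich attributes."""
    for child in node.childNodes:
        if child.nodeType != child.ELEMENT_NODE:
            continue
        yield child
        for name, raw in child.attributes.items():
            if name in skip_attrs or not raw.startswith(MARKERS):
                continue
            try:
                value = json.loads(raw)
            except ValueError:
                continue  # looked like a rich value but was not valid JSON
            for markup in embedded_fragments(value):
                yield from walk(parse_fragment(markup), skip_attrs)
        yield from walk(child, skip_attrs)


root = parse_fragment("<p><a>link</a></p>")
link = root.getElementsByTagName("a")[0]
link.setAttribute("data-caption", json.dumps({"caption": {"_h": "<b>bold</b>"}}))
visited = [el.tagName for el in walk(root)]
```

Here `visited` includes the `b` element that exists only inside the attribute value; passing `skip_attrs={"data-caption"}` leaves it out.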

Live Storage
Just as we currently "load" JSON-valued attributes and store them as live objects in a DataBag attached to the DOM element, rich attributes are also stored live, with any HTML-valued components stored as DocumentFragments. This allows efficient traversal (we don't have to repeatedly parse and serialize HTML trees) as well as allowing code to (eg) keep persistent handles to certain elements within the DocumentFragment for marking or other purposes.

A rich attribute named  is stored in the   associated with the Element as a prefixed dynamic property named. This avoids conflict with the existing  and   properties of   (used for   and  ), and the   prefix ensures that other dynamic properties added to   (eg  ) don't inadvertently get serialized as rich attributes. Rich attributes are serialized using a hook in.
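The idea of loading a rich value once and keeping it live under a prefixed property can be sketched as below. The `rich_` prefix, the `DataBag` and `Element` stand-ins, and the function names are hypothetical; the sketch treats every rich value as plain JSON and ignores embedded fragments.

```python
import json

PROP_PREFIX = "rich_"  # hypothetical prefix marking rich-attribute properties


class DataBag:
    """Stand-in for the per-element bag of live data."""


class Element:
    """Minimal element: string attributes plus an attached DataBag."""

    def __init__(self, attrs):
        self.attrs = dict(attrs)
        self.bag = DataBag()


def get_rich(element, name):
    """Return the live value for an attribute, parsing it at most once."""
    prop = PROP_PREFIX + name
    if not hasattr(element.bag, prop):
        setattr(element.bag, prop, json.loads(element.attrs[name]))
    return getattr(element.bag, prop)


def serialize(element):
    """Write live rich values back to string attributes."""
    out = dict(element.attrs)
    for prop, value in vars(element.bag).items():
        # Only prefixed properties are rich attributes; other dynamic
        # properties on the bag are left alone.
        if prop.startswith(PROP_PREFIX):
            out[prop[len(PROP_PREFIX):]] = json.dumps(value)
    return out


el = Element({"data-x": '{"n": 1}'})
live = get_rich(el, "data-x")
live["n"] = 2                # mutate the live object in place
el.bag.scratch = "not rich"  # unrelated dynamic property
result = serialize(el)
```

The mutation is picked up at serialization time without an explicit write-back call, and the unprefixed `scratch` property is not serialized.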

Work in Progress

 * For transition purposes, we need to handle existing attributes which may happen to start with  or   but shouldn't be treated as rich attributes. The easiest solution is to add these to the list of "attributes with special HTML semantics", which requires the affirmative presence of a matching   attribute before a value is treated as rich.
 * As mentioned above, the existing  markup serves many of the same purposes as rich attributes; the "shadow attributes" inside data-parsoid are also very similar. Neither of these is explicitly used in client code (as far as I am aware); however, we should support existing markup using   or shadow attributes which may come from the cache for html2wt. One approach would be to augment the rich attribute "loader" to transparently treat   or a shadow attribute as an equivalent rich attribute. Another would be to write a simple html2html preprocessing pass which does this remapping in the html2wt direction.