Parsoid/MediaWiki DOM spec/Rich Attributes

This is an experimental proposal for a future revision of the MediaWiki DOM spec. Patch at gerrit 821281; see also T214994.

The problem
The DOM model of HTML is not orthogonal. Elements can contain elements which can contain elements, in a pleasant tree structure, but  attributes  of elements are limited to plain strings. You cannot nest further structure inside an attribute, and you cannot store multiple values within an attribute (although there are hacks involving string-separated tokens). This is a fairly well-known issue with XML, with common advice given such as "If you use attributes as containers for data, you end up with documents that are difficult to read and maintain. Try to use elements to describe data. Use attributes only to provide information that is not relevant to the data." and similar advice elsewhere.

But HTML uses attributes all over the place. And in some places it is essential, for example the  attribute of an   tag. The most natural rendering of the wikitext: is something like: where  is not trivially clear; ideally, the same   wrapper we use for the caption to embed metadata about the transclusion (of   in this case) could be used inside the   attribute as well.

Examples of content often embedded within HTML attributes in the MediaWiki DOM spec:


 * Transclusions (templates, etc)
 * Language Converter markup
 * i18n/l10n markup (system messages/ux)
 * Annotations (translation boundaries, etc)
 * Style and title attributes, which can have (eg) boldface or other formatting applied in wikitext
 * Generated attributes of HTML tags (with special ad-hoc markup in the DOM spec)
 * This overlaps with many of the categories above, but the following template-generated attributes are specifically called out in the MediaWiki DOM spec: attributes of literal HTML tags in wikitext,  attributes of links,  /width/height/caption/  of media

Another related issue is "invisible HTML content", for example the invisible caption of a media file which is currently being displayed inline, the output of a suppressed language converter rule, the output for language variants which are not the current one, etc. These cannot be embedded directly in the output HTML because they may break the HTML content model -- for example, block type content in a paragraph context. That shouldn't break the paragraph because the content is currently invisible, but if you just dropped it into the document with a  CSS style it would break its container. We typically "hide" this content in an attribute (currently a JSON-valued attribute) but then it complicates HTML traversal: various html2html transformations need to know enough about these special hiding places in order to recurse inside and mutate the embedded HTML.

Note that we are focusing on structured data in attribute values here; although one can certainly imagine structured values for attribute names (or element tag names !), we are explicitly keeping that out-of-scope. Attribute names are like element tag names and are identifiers, not user-generated content. (A future spec may add "key value pairs" to the argument/output types allowed for transclusion, which would be the way to support dynamic attribute names in our framework.)

Current solutions
The generated attributes of HTML tags portion of the MediaWiki DOM Spec works out a system for recording the template-affected portions of an attribute value, as an array of "parts", stored in the  value. This mechanism works for  and for   but isn't a fully-general mechanism; eg it doesn't work for.

The key used in the  value does not always bear a direct relationship to the HTML attribute it is describing, for example this sample markup from the MediaWiki DOM spec for the wikitext  :

Note that template-affected attributes are  and the   tag of the internal   tag, but the attribute information is on the   tag and the names in   are   and. This may in fact be the best/only way to handle complicated situations like media, where the attribute values do not bear a one-to-one resemblance to wikitext, but for the simpler  and   a more direct correspondence would make the Parsoid HTML easier to interpret and traverse. Even for media, you could argue that (eg) the  attribute should be markup applied to the   whereas the   markup applies it to the wrapper.

Note that  in the extended attributes mechanism is an array of elements with the nominal structure: where a plain string can be used instead of the pair object in cases where the flattened value and the document fragment are identical. This allows for HTML-valued attribute names as well as values.

We also have a "shadow attribute" mechanism which is similar, in that it stores a "richer" value for a given attribute in a hidden   property.

Relatedly, structured values for  and   are supported via a core interface to fetch a "JSON attribute":   which returns an associative array. The implementation of this mechanism is discussed further in Parsoid/OutputTransform/HtmlHolder. Many of the features of structured-value attributes in Parsoid (such as live object representation of values) are restricted to the  and   attributes. Other attributes with structured values accessed via  get a copy of the value which must be explicitly re-written using   after it is mutated.

There is no "built-in" support for storing document fragments in structured-value attributes; in a number of places where this is done the values are manually parsed from/serialized to strings. This does not interoperate well with the  mechanism used for   and   attributes (discussed in link above).

We currently have DOM traversal code which is aware of  and some other places where embedded markup can be stored. Because of the limited support for embedded HTML in structured-value attributes, the traversal code has to explicitly parse embedded HTML and then restore potentially-modified HTML after the traversal has completed, regardless of whether the traversal actually mutated the embedded DOM. The issues compound if the embedded DOM itself has structured-value attributes, potentially including additional embedded HTML.

Proposals
This proposal is called "rich attributes" to distinguish it from the existing "structured-value attribute" support in Parsoid. There are three main pieces to this proposal, which can be discussed and implemented separately. The first phase makes our existing structured-value support more general, to support attributes other than  and , and to extend support to include   values (both at top level and embedded). The second phase adopts a uniform representation for structured-value attributes in "standard" HTML, including plaintext fallback values and alternative names. This aims to make our MediaWiki DOM spec more internally consistent. The third phase introduces a standardized marker for structured-value attributes to make possible generic traversal through the extended DOM, including s embedded within structured values.

The first phase is primarily targeted at internal users: it provides a cleaner mental model and API for manipulation of DOM trees containing structured data, and better supports traversal and manipulation of document fragments embedded within attributes in Parsoid and core code. It need not require any externally-visible changes to generated HTML.

The second and third phases are aimed at external users: they allow manipulation of a DOM with rich attributes independent of detailed knowledge about the specific attributes containing structured data. This allows the specification and creation of a general purpose "structured-value attribute" or "rich attribute" DOM library without hard-coded details of specific Parsoid attributes and uses. By cleaning up the specification and form of rich attributes the proposals help third-party consumers of HTML conforming to the MediaWiki DOM Spec to understand how to parse and properly manipulate structured-valued attributes.

At this time there is general consensus among the Content Transform Team on proceeding with phase 1 of this proposal, including exploring the "template bank" representation for embedded s.  There is not yet consensus on proceeding with phases 2 and 3 until the Parsoid Read Views project is further along, as these may include changes to the generated HTML which are not backward-compatible with third-party clients.

Phase 1: New API for structured-value attributes
First we propose a general API to allow Parsoid code to treat attributes as structured values with complex types. We are already doing this for some particular JSON-valued attributes; this is an extension first to arbitrary attributes and second to extend the set of rich values to include not just JSON-encodable arrays but also DocumentFragments, and JSON-encodable arrays which contain s.  Fundamental is a separation of the idea of "structured-valued attribute" from the Parsoid "page bundle" representation: not all attributes with structured values are in the page bundle. See HtmlHolder#Private attributes for more discussion of out-of-band representations for private attributes.

This phase of the proposal does not include a standard serialization of these values, nor does it allow generic traversal of a DOM that also visits every structured value. At this stage the proposed API can be implemented with ad hoc serialization strategies to remain consistent with the current MediaWiki DOM Spec.

In the API attributes can have three basic types of value: "string", "object" (json-encodable array), and "DOM". The two main methods provided are:



Support for these two methods can be split into pieces corresponding to the individual methods. The  API is phase 1a, and adding   is phase 1b. Two additional methods will be discussed later, in the context of phase 3:



Corresponding  setters and   methods (which set a default value if the attribute is missing) are provided for each of these four primary methods. We will call these the "setters" of the primary method in subsequent sections.

Phase 1a: Uniform live representation of structured values in DataBag
In the first phase of the work, we implement  and its setters, storing the live value in the DataBag. This is a simple generalization/refactor of the existing,  ,    , etc methods, but allowing an arbitrary attribute name and with corresponding generic support in   and. Like the existing, the structured values are stored inline as JSON in the attributes when the document is serialized, not hoisted into the page bundle.

WLOG, a rich attribute named  is stored in the   associated with the   as a prefixed dynamic property named. This avoids conflict with the existing  and   properties of   (used for   and  ), and the   prefix ensures that other dynamic properties added to   (eg  ) don't inadvertently get serialized as rich attributes. Rich attributes are loaded on-demand and serialized using a hook in.

Ideally, we would use something like JsonCodec to serialize/deserialize the JSON-encoded values, so that :: can return a fully-classed object type (like   or  ) instead of  just a. This is trivial if we could embed type information into the JSON object, as JsonCodec does; for example: However, the HTML bloat caused by the   properties required in such a scheme would be prohibitive. It is also unfortunate that the name embedded in the serialization is for a specific implementation in a specific Wikimedia namespace, although there is precedent in RDF and XML schemas for using explicit namespaces of this sort.

To avoid bloat, we would prefer that the expected type information be provided by the caller of the deserialization code; that is:



This works against our later proposal to make our rich media representation self-describing; we would prefer that the information about the proper class type to use for a given attribute is not external to the document itself. We'll discuss this further under phase 3. Self-description is an issue only for deserialization; for serialization we can assume that we have fully-classed objects and that the objects themselves know how to properly serialize their contents to a JSON string.

Phase 1b: Supporting live DocumentFragments in structured values
In this phase we add the implementation of  and its setters. This is a straightforward extension of the previous work; we primarily need to teach the serializer/deserializer in  and   how to respectively serialize/deserialize   values. We need to recurse into  values in order to   on an embedded fragment before that fragment is serialized into an array value, and similarly recurse into a   generated by   in order to similarly load the embedded fragment.

Just as object-valued attributes are stored live in a  attached to the DOM element, HTML-valued attributes are also stored live as  s.  This allows efficient traversal (we don't have to repeatedly parse and serialize HTML trees) as well as allowing code to (eg) keep persistent handles to certain elements within the   for marking or other purposes.
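The combination of live values and a recursive serialization hook can be sketched roughly as follows. This is only an illustration in Python, not Parsoid's implementation: `Fragment`, `Element`, the `rich` side table, `store_rich_attributes`, and the attribute name `data-example` are all hypothetical stand-ins, and fragments are serialized as HTML strings (the first of the serialization options discussed in this section).

```python
import json
from dataclasses import dataclass, field


@dataclass
class Fragment:
    """Hypothetical stand-in for a live DocumentFragment."""
    children: list = field(default_factory=list)

    def to_html(self) -> str:
        return "".join(self.children)


class Element:
    """Minimal element: string attributes plus a side table of live values."""

    def __init__(self):
        self.attributes = {}  # what ends up in the serialized HTML
        self.rich = {}        # live values (stand-in for the per-node data bag)


def _encode(value):
    """Recursively turn live Fragments into HTML strings for JSON encoding."""
    if isinstance(value, Fragment):
        return value.to_html()
    if isinstance(value, dict):
        return {k: _encode(v) for k, v in value.items()}
    if isinstance(value, list):
        return [_encode(v) for v in value]
    return value


def store_rich_attributes(el: Element) -> None:
    """Serialization hook: write each live value back as a string attribute."""
    for name, value in el.rich.items():
        if isinstance(value, Fragment):
            el.attributes[name] = value.to_html()
        elif isinstance(value, str):
            el.attributes[name] = value
        else:
            el.attributes[name] = json.dumps(_encode(value), separators=(",", ":"))


el = Element()
caption = Fragment(["<b>bold</b> caption"])
el.rich["data-example"] = {"caption": caption, "width": 100}
caption.children.append("!")  # mutate through a persistent handle, no re-parse
store_rich_attributes(el)
print(el.attributes["data-example"])
```

The point of the sketch is that the mutation happens on the live fragment, and HTML is produced only once, when the document is serialized.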

The most obvious way to serialize/deserialize  values is as an HTML string, like so: Another alternative (briefly discussed in Parsoid/OutputTransform/HtmlHolder) is to store  s in the main document tree itself, as   nodes in the  ).   In this case, the serialization of   values might just be an ID computed as a content hash or assigned to it, and then used to index the "template bank" in the .  For example: This simplifies the   phase since the HTML fragments are already/always present in the main document and don't require additional serialization into attribute values.  Object-valued attributes still need to be  'ed as JSON, but that can be done by traversing the nodes enumerated by the query.

Note that the associative array returned by  may now also contain embedded  s.  The   used should be able to identify and deserialize these to live  s of the owner document. In proposal 3 we introduce marker values to recognize and properly deserialize these values automatically, but at this phase we require that the proper  be provided to   and put the burden on the deserializer for the named class to locate and deserialize embedded  s.  For example:

At the end of phase 1, we have live objects and live  s representing attribute values, and our own code can access these uniformly with a simple rich-attribute API, but the serialization and deserialization of these rich values to HTML is ad hoc and potentially inconsistent. A third-party user must carefully implement bespoke serialization and deserialization logic for every rich attribute it uses. Traversal requires the implementation of attribute deserializers which will properly construct live objects from the varied contents of HTML attributes, but once the document is fully parsed traversal can enumerate the contents of the NodeData object at a node to locate additional  or structured object values to recurse into.

Backward compatibility
Assuming s are serialized as HTML strings, nothing in proposal 1 requires a backward-compatibility break with generated HTML, although the burden is on the object serialization code to maintain compatible formatting. The "template bank" representation can even be implemented internally (ie, when a HTML string is parsed into a, that fragment is hung off a new   element in the  ) without affecting the serialization (the  s are removed and converted back to HTML embedded in JSON or attribute values in  .)

If a gradual shift to a template bank representation for external users is desired, it ought to be possible to use a template bank representation for selected attributes by adding special cases to the serializer. Similarly, it should be possible to use a template bank for internal HTML storage while converting to inline HTML in attributes for external clients, although because structured values are not self-describing (cf proposal 3) the conversion will require detailed knowledge of every place where HTML fragments could be embedded in attribute or object values.

Phase 2: Uniform HTML representation for structured-value attributes
The first step toward a uniform serialized HTML representation for structured-value attributes is to introduce a naming and location convention for them, so that, for example, "the structured value of the  attribute of this   element" can be located.

Let's first define "an attribute with special HTML semantics". This is an attribute whose stripped "non-rich" value is semantically meaningful for HTML. For example, the  attribute of an   tag is the URL which the browser will load when you click on the link. For now we'll say that "any attribute whose name doesn't start with " is an attribute with special HTML semantics.

For attributes without special HTML semantics we store the structured value directly in the named attribute as a JSON-encoded string (if an object) or an HTML string (if a ).

For attributes with special HTML semantics we will store a "flattened" version of the value directly under the named attribute. For a DOM value this is the textContent of the. For an object value this is defined by the object class type. For a string value the flattened value is the string itself. The structured value is stored under the attribute name prefixed with, encoded as for attributes without special HTML semantics. (Obviously, attribute names beginning with  are reserved.)
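As a rough sketch of the flattening rule just described, in Python rather than Parsoid's own code: the reserved prefix, the test for "special HTML semantics", and the object flattener are all assumptions here. `RICH_PREFIX` is a placeholder, not the prefix the proposal reserves; `SPECIAL` is an explicit illustrative set rather than the name-based rule given earlier; and `HtmlFragment` stands in for a DocumentFragment stored as an HTML string.

```python
import json
from dataclasses import dataclass
from html.parser import HTMLParser

RICH_PREFIX = "x-rich-"      # hypothetical placeholder for the reserved prefix
SPECIAL = {"href", "title"}  # illustrative set of "special semantics" names


@dataclass
class HtmlFragment:
    """Hypothetical stand-in for a DocumentFragment, held as markup."""
    markup: str


class _TextCollector(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def text_content(markup: str) -> str:
    """Approximate DOM textContent by concatenating all text nodes."""
    parser = _TextCollector()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


def serialize_attribute(name, value, special_names, flatten_object):
    """Return the string attributes to emit for one rich value."""
    if isinstance(value, HtmlFragment):
        encoded, flat = value.markup, text_content(value.markup)
    elif isinstance(value, dict):
        encoded, flat = json.dumps(value), flatten_object(value)
    else:
        encoded, flat = value, value
    if name in special_names:
        # Flattened value under the real name, structured value under the prefix.
        return {name: flat, RICH_PREFIX + name: encoded}
    return {name: encoded}


print(serialize_attribute("title", HtmlFragment("<b>Main</b> Page"),
                          SPECIAL, flatten_object=lambda obj: ""))
```

For object values the flattened form depends on the object's class, which is why the sketch takes `flatten_object` as a caller-supplied function.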

For illustrative purposes we'll assume s aren't stored in the but are instead stored as HTML strings. Setting the  of an   tag to the rich     will then result in the serialization:

Alternatively, we can use the   serialization for attributes with special HTML semantics. In this case instead of storing the value under the name prefixed with  it is stored as the   property under the structured-value   attribute, as an array of   pairs. For the rich  attribute in the previous example, the serialization would be:

This has some advantages in terms of initial migration and compatibility with existing markup, and should be supported for deserialization, at least initially. It has the advantage of requiring only two attribute names to be reserved ( and  ) and all elements with rich attributes can be easily located with. On the other hand, the  attribute is not indexed by the DOM so that query must touch every node in the   anyway, and for practical use the   content of   must coexist and be merged with the "other" structured values stored in   by the MediaWiki DOM spec, making   an unusual corner case. (For example, note that  is going to rewrite the contents of the   attributes as it serializes structured-value attributes, in a departure from current code, and ensure that it does so before hoisting   in one of our alternate representations.)

One possible drawback of both these representations is that a naive implementor might get lulled into inattention by the fact that / /etc are "usually" plain text, and get caught unaware by the need to parse a   or   value when it (unusually) appears.

The primary work in this phase is migrating corner cases to this uniform representation. Above we identified several attributes in the media representation where the  information was misplaced; there are also bespoke structured-value fields which don't use the standard representation, like the   attributes used for language conversion or hidden inline captions for media. Initially we'll use an allow list or block list for attribute names we haven't yet converted to a uniform representation, along with various hacks in /  to accommodate them. These exceptions should all be migrated to use the standard form and the hacks gradually removed. Of course, any new Parsoid features should use structured value attributes in the standard form.

Backward compatibility
The changes in phase 2 can be rolled out piece by piece. If consensus is reached on this proposal as the long term direction for the MediaWiki DOM Spec, the first step is probably just to locate and mark as deprecated/subject to change any usage of "misplaced" structured-value attributes in the existing MediaWiki DOM Spec, aka attributes where the /  is on a different element than the attribute itself, or where the name in   doesn't correspond to an attribute on the element. Then those misplaced attributes can either be rolled out in small steps one-by-one after clients are located, notified, and updated; or else the changes can be bundled together into a single "big" breaking change, with a post-processing downgrade pass available to "move them back" for backward compatibility. The  and   representations ought to be semantically equivalent, so a postprocessing step could be written to convert back-and-forth between them as well. Transitional code could also read from both but write only the preferred or compatible version.

Phase 3: Uniform marking of structured-value attributes
Up to this point, processing a document containing rich attributes requires an external schema giving the data type of each attribute. Given  a client can't tell whether the value of   is supposed to be the literal string   or the   resulting from parsing that value as HTML, and given   the client can't distinguish between the literal value   and the empty object value. If we are encoding HTML by reference to a template bank in the, we can't tell the difference between the literal string  and the use of "abcd" as a fragment reference in the template bank.

This is a particular problem if we want to traverse the document to process embedded HTML, since without a schema for the document we don't know which attribute values we should check. There are two mitigating factors:


 * If we use the  representation for structured value attributes then every attribute containing embedded HTML should be marked with that  .  This helps us find a subset of attributes that require further attention, but we still can't distinguish an object-valued attribute from an HTML-valued attribute, nor can we identify HTML within a property of an object-valued attribute.
 * If we use the  bank representation for embedded HTML, then we can traverse all embedded   elements to be certain of discovering all embedded HTML, although we can't determine for certain where those fragments are actually embedded.  It is certainly some help to be able to traverse the entire document, but because we don't know exactly how the template bank is embedded we are still unable to traverse a portion or subtree of it.

Traversal/mutation during html2html passes is an important use case for the MediaWiki DOM Spec. It is important that embedded HTML be "visible" to post processing passes, so that (for example):
 * i18n fragments inside href are expanded
 * redlinks inside language converter markup still work
 * redlinks/language converter markup/i18n fragments inside "invisible media captions" are properly processed -- so that if the VE user toggles the media style from inline to thumb the proper i18n/language converter markup/etc should already be present in the caption.

In order to do this with the current system, we need to write a specialized traverser with special knowledge of  and   and a number of other internal Parsoid features. As new embedded HTML is added inside internal data structures, the traversal must continue to be extended to handle each place embedded HTML may be found, and before proposal 1a is implemented the traverser must additionally parse from JSON, parse to DOM, mutate, serialize HTML, and then serialize JSON at each place.

In phase 3a, we simply look at every attribute value looking for a leading  or   and then recurse into the rich value of any such attributes we find. In most cases the attributes will already have been loaded into the, so we need only enumerate the   properties of the DataBag.

An alternative in proposal 3b embeds a schema in the document, rather than in the attribute values. This is potentially more rigid: a given attribute may only have a single type, not a union of types, but it potentially provides a more powerful type system than the simple three-type system of proposal 3a.

We can also decide that the "enumerate the entire document" capability provided by the  bank is sufficient, and no additional action be taken under proposal 3.

Phase 3 Alternative 1: Value marking
In proposal 2 we already attempted to preserve the "normal HTML semantics" of attributes by storing a flattened string representation of well-known attributes like,  , and   even when the full structured value is stored elsewhere; this is similar to Parsoid's "shadow attributes".

In this proposal we tweak the encoding of object/array, string, and DocumentFragment values to make them uniquely identifiable:

 * A plain string value is encoded as
 * A DocumentFragment is encoded as  or
 * An associative array or object value is encoded by:
 * Any key/property with a name starting with an underscore has an additional underscore prepended to its name, otherwise the JSON encoding is used, but
 * Any array or object value is recursively encoded with this algorithm.
 * (Optionally) object values can also use  and/or a   method to customize their encoding.
 * (Optionally, for compatibility with current MediaWiki DOM spec) a property named "html" has the value parsed as a DocumentFragment; any non-DOM property named "html" is renamed "_html". (A "real" property named "_html" would already be renamed "__html" by the above.)

So nominally,  would result in: (but see the 'optimization' section below), and a complex JSON object type may embed like this:

To reduce bloat and increase compatibility, if the value to be stored is a simple string (including a DocumentFragment with a single Text node) and the first character of that string value is not   or   then the string value is stored directly in the attribute. Thus, the example above where we set  to the string value   would actually be represented as but if we set the attribute to the string value   it would be serialized as so that the first character of the attribute value can be used to indicate type. If the attribute has "special HTML semantics" (see proposal 2) and the value is a simple string, then the value can be stored directly under the attribute name, without the need for an additional  attribute (which would have the identical value). This optimization makes the output look "less weird" in the common cases and preserves HTML semantics for important attributes like  and. As with proposal 2, one drawback may be that those parsing string-valued  attributes may be caught unaware when the value starts with a   or   and the representation changes.

Now by looking at any attribute value to see if the first character is  or   we can determine whether a structured value is stored in a given attribute, and for structured values we can identify any embedded DocumentFragments and properly restore or traverse them. We can implement  which uses the value marking to return the proper value for a union-typed attribute. On the other hand, we also need to introduce  and all code must use it instead of the standard DOM   if the attribute value could possibly start with a   or    in order to ensure those are appropriately escaped when necessary.

During the transition period we can use an allow or block list to mark attributes which have been ported to be self-describing in this way (ie, to protect attributes whose values may happen to start with  or   but which shouldn't be parsed as a rich attribute), and/or require the affirmative presence of a marker attribute before values are parsed as rich.
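The key-escaping and recursive-encoding rules in the list above can be illustrated with a small Python sketch. It covers only object and array values, and it assumes a marker property for embedded fragments: `FRAGMENT_KEY` is a hypothetical name chosen for this example, not the marker the proposal defines, and `HtmlFragment` is a stand-in for a DocumentFragment. The top-level string and fragment markers are left out.

```python
from dataclasses import dataclass

FRAGMENT_KEY = "_h"  # hypothetical marker property for an embedded fragment


@dataclass
class HtmlFragment:
    """Hypothetical stand-in for a DocumentFragment, held as markup."""
    markup: str


def encode(value):
    """Escape underscore-prefixed keys and mark fragments, recursively."""
    if isinstance(value, HtmlFragment):
        return {FRAGMENT_KEY: value.markup}
    if isinstance(value, dict):
        return {("_" + k if k.startswith("_") else k): encode(v)
                for k, v in value.items()}
    if isinstance(value, list):
        return [encode(v) for v in value]
    return value


def decode(value):
    """Inverse of encode(): restore fragments and strip one escape underscore."""
    if isinstance(value, dict):
        if set(value) == {FRAGMENT_KEY}:
            return HtmlFragment(value[FRAGMENT_KEY])
        return {(k[1:] if k.startswith("_") else k): decode(v)
                for k, v in value.items()}
    if isinstance(value, list):
        return [decode(v) for v in value]
    return value


original = {
    "_private": 1,
    "caption": HtmlFragment("<i>x</i>"),
    "items": [{"_h": "a literal property, not a fragment"}],
}
encoded = encode(original)
print(encoded)
```

Because every real key that starts with an underscore gains a second one, a single-underscore key in the encoded form can only be a marker, so the literal `_h` property in `items` survives the round trip without being mistaken for a fragment.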

Another variation here is to combine this value marking with a shift from JSON encoding to a more efficient representation which is more easily or compactly embedded in HTML, for example, base85-encoded CBOR, with an even-less-common marker prefix ( or   or a deprecated unicode character like U+0149 ŉ or U+0673 ٳ in the two-byte region of UTF-8) to distinguish encoded from "plain string" values.

Using an example from Proposal 1b above: becomes the following using type marking: and using a CBOR encoding followed by Ascii85 encoding, and using a leading ŉ to indicate the presence of a structured value: It is likely that a more complex example may save more bytes, but it isn't obvious there are significant wins to be had here.

Phase 3 Alternative 2: Type marking with a type dictionary
Instead of using the first character of an attribute value as a type tag, we can also embed an explicit type tag. One encoding option reduces bloat to 6-7 characters per structured attribute value by compressing the inline type information down to a single-character property name and a short numeric identifier and then embedding a map from the numeric type ids to complete type names in the. This might look like the following: To the extent that object types are correlated with attribute names, a gzip encoding of the HTML would be expected to combine the tag and the type prefix into a single dictionary entry (eg ) making the type information "free" from a bandwidth perspective. Client browsers would still need to store the extra characters in their DOM, however. This proposal is backward compatible with existing markup, as it simply adds a new type property to existing attributes without changing the values of existing properties.
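A minimal Python sketch of the type-dictionary idea follows. The single-character property name `t`, the class `TypeDictionary`, and the type names are assumptions made for illustration only; how the table would be embedded in the document is not modelled.

```python
import json

TYPE_KEY = "t"  # hypothetical single-character type property


class TypeDictionary:
    """Assigns short numeric ids to type names and tags object values."""

    def __init__(self):
        self._ids = {}

    def tag(self, type_name: str, value: dict) -> dict:
        """Return a copy of value carrying the numeric id of its type."""
        type_id = self._ids.setdefault(type_name, len(self._ids) + 1)
        return {TYPE_KEY: type_id, **value}

    def table(self) -> dict:
        """Map numeric ids back to full type names (stored once per document)."""
        return {type_id: name for name, type_id in self._ids.items()}


types = TypeDictionary()
first = types.tag("ExampleCaption", {"x": 1})
link = types.tag("ExampleLink", {})
second = types.tag("ExampleCaption", {"y": 2})

compact = {"separators": (",", ":")}
overhead = len(json.dumps(first, **compact)) - len(json.dumps({"x": 1}, **compact))
print(first, types.table(), overhead)
```

With a one-digit id the added `"t":1,` costs six characters per value in this sketch, and seven with a two-digit id, which is in line with the estimate given above.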

The schema presented in the above example hard-codes classes from the current PHP implementation. It may be preferable to use an abstract type system corresponding more closely to the  type system from alternative 1, where a numbered type could be of the form   to indicate "an object type that contains as a member named   a value of   type". Any fields not recursively containing DOM fragments could be elided from the type specification, since the primary purpose is not to give a full semantic type to the value but only to guide traversal of embedded DOM fragments. Other example types could include,  ,   etc.

Phase 3 Alternative 3: Embedded schema
Another option might be to create a schema document, either embedded in the  or as a standalone document which would be provided to the Rich HTML library when a document was to be parsed which would give the rich type of any attribute. The schema could use CSS-like or xpath specifications, like: or else it could be provided in executable form as a callable mapping an Element and attribute name to the proper type. As above, instead of naming PHP class names from our current implementation, the types could be given as abstract specifications sufficient to locate s

Backward compatibility
Alternative 1 was designed to minimize the changes required to existing serialization. Strings with the most common values are stored unmodified, and object-valued attributes are stored with a leading  as is the current practice. The two main differences are (a) alternate encoding of string values which start with  or , although that can be mitigated using a separate marker attribute, and (b) marking of   values inside object-valued attributes. These are currently stored directly as the value of an arbitrary property in the JSON, often but not always named  or , like so: Marking the   moves the HTML string one level down, like so: Since these two alternatives can be distinguished by shape (a string in the first case becomes an object with an   property in the second) it is likely that a custom deserializer could recognize both alternatives during a transitional period. On the other hand, any third party clients would need to be aware of the change in representation.

Alternative 2 may actually be easiest from a backward-compatibility standpoint, as it simply adds a new type marker property to existing JSON output. External clients can be expected to simply ignore both the extra property and the type map in the. However, marking the type of s still requires moving from a string to an object in the representation, just like alternative 1. Continuing with the example above, the output in proposal 3b would be:

Alternative 3 would require no compatibility break, but it would require us to come up with a declarative specification of some sort giving the desired type mapping for the MediaWiki DOM Spec, and every third party user of HTML complying with the MediaWiki DOM Spec would be required to provide a local copy of that specification to their rich attribute library in order to properly parse our output.

Work in Progress

 * Investigate parsing context of  elements to ensure that we can round trip any.
 * STATUS: complete, seems okay.
 * For transition purposes, we need to handle existing attributes which may happen to start with  or   but shouldn't be treated as a rich attribute. The easiest solution is to require the affirmative presence of a matching   attribute before a value is treated as rich, and some similar affirmative marker for attributes which don't use "special HTML semantics".
 * As mentioned above, the existing  markup serves many of the same purposes as rich attributes; the "shadow attributes" inside data-parsoid are also very similar.  Neither of these is explicitly used in client code (as far as I am aware) however we should support existing markup using   or shadow attributes which may come from the cache for html2wt, even if we want to shift the "canonical" representation of rich attributes.  One approach to this would be to augment the rich attribute "loader" to transparently treat   or a shadow attribute as an equivalent rich attribute.  Another would be to write a simple html2html preprocessing pass which does this remapping in the html2wt direction. STATUS: using   natively as the serialization format in present patches.
 * allows for HTML-valued attribute names as well as values . It seems that the same flattening mechanism we use for "attributes with special HTML semantics" can be used to serialize these using a flattened attribute name. Some outstanding questions:
 * How to deal with conflicting flattened names (for example:,  , and  ) which presumably have the same flattened serialized attribute.
 * What the API should look like for getting/setting values on a -valued attribute name.  Perhaps object identity is sufficient for the name lookup? (But that implies we might also have multiple attributes with identical names, but differing   instances.)