Parsoid/MediaWiki DOM spec/Rich Attributes

From mediawiki.org

This is an experimental proposal for a future revision of the MediaWiki DOM spec. Patch at gerrit 821281; see also T214994.

The problem

The DOM model of HTML is not orthogonal. Elements can contain elements which can contain elements, in a pleasant tree structure, but attributes of elements are limited to plain strings. You cannot nest further structure inside an attribute, and you cannot store multiple values within an attribute (although there are hacks involving string-separated tokens). This is a fairly well-known issue with XML, with common advice given such as "If you use attributes as containers for data, you end up with documents that are difficult to read and maintain. Try to use elements to describe data. Use attributes only to provide information that is not relevant to the data." and similar advice elsewhere.

But HTML uses attributes all over the place. And in some places it is essential, for example the href attribute of an <a> tag. The most natural rendering of the wikitext:

[http://example.com/{{1x|foo/bar}} {{1x|caption}}]

is something like:

<a href="http://example.com/WHAT GOES HERE"><span...>caption</span></a>

where WHAT GOES HERE is not trivially clear; ideally, the same <span> wrapper we use for the caption to embed metadata about the transclusion (of Template:1x in this case) could be used inside the href attribute as well.

Examples of content often embedded within HTML attributes in the MediaWiki DOM spec:

  • Transclusions (templates, etc)
  • Language Converter markup
  • i18n/l10n markup (system messages/ux)
  • Annotations (translation boundaries, etc)
  • Style and title attributes, which can have (eg) boldface or other formatting applied in wikitext
  • Generated attributes of HTML tags (with special ad-hoc markup in the DOM spec)
    • This overlaps with many of the categories above, but the following template-generated attributes are specifically called out in the MediaWiki DOM spec: attributes of literal HTML tags in wikitext, href attributes of links, style/width/height/caption/alt of media

Another related issue is "invisible HTML content", for example the invisible caption of a media file which is currently being displayed inline, the output of a suppressed language converter rule, the output for language variants which are not the current one, etc. These cannot be embedded directly in the output HTML because they may break the HTML content model -- for example, block type content in a paragraph context. That shouldn't break the paragraph because the content is currently invisible, but if you just dropped it into the document with a display: none CSS style it would break its container. We typically "hide" this content in an attribute (currently a JSON-valued attribute) but then it complicates HTML traversal: various html2html transformations need to know enough about these special hiding places in order to recurse inside and mutate the embedded HTML.

Note that we are focusing on structured data in attribute values here; although one can certainly imagine structured values for attribute names (or element tag names!), we are explicitly keeping that out-of-scope. Attribute names are like element tag names and are identifiers, not user-generated content. (A future spec may add "key value pairs" to the argument/output types allowed for transclusion, which would be the way to support dynamic attribute names in our framework.)

Current solutions

The generated attributes of HTML tags portion of the MediaWiki DOM Spec works out a system for recording the template-affected portions of an attribute value, as an array of "parts", stored in the data-mw.attribs value. This mechanism works for <span class="-{foo}-"> and for [[{{1x|Foo}}]] but isn't a fully-general mechanism; eg it doesn't work for <poem class="-{foo}-">.

The key used in the attribs value does not always bear a direct relationship to the HTML attribute it is describing, for example this sample markup from the MediaWiki DOM spec for the wikitext [[File:foo.jpg|{{1x|thumb}}|{{1x|160px}}]]:

<figure
  typeof="mw:File/Thumb mw:ExpandedAttrs"
  about="#mwt3"
  data-mw='{
    "attribs": [
      ["thumbnail",
        {"html":"&lt;span about=\"#mwt1\" typeof=\"mw:Transclusion\"
          data-mw=&apos;{\"parts\":[{\"template\":{\"target\":{\"wt\":\"1x\",\"href\":\"./Template:1x\"},\"params\":{\"1\":{\"wt\":\"thumb\"}},\"i\":0}}]}&apos;>thumb&lt;/span>"}
      ],
      ["width",
        {"html":"&lt;span about=\"#mwt2\" typeof=\"mw:Transclusion\" data-mw=&apos;{\"parts\":[{\"template\":{\"target\":{\"wt\":\"1x\",\"href\":\"./Template:1x\"},\"params\":{\"1\":{\"wt\":\"160px\"}},\"i\":0}}]}&apos;>160px&lt;/span>"}
      ]
    ]
  }'>
    ... Rest of image HTML here ...
</figure>

Note that the template-affected attributes are typeof and the width attribute of the internal <img> tag, but the attribute information is on the <figure> tag and the names in data-mw.attribs are thumbnail and width. This may in fact be the best/only way to handle complicated situations like media, where the attribute values do not bear a one-to-one resemblance to wikitext, but for the simpler <a href> and <div style="{{1x|....}}"> a more direct correspondence would make the Parsoid HTML easier to interpret and traverse. Even for media, you could argue that (eg) the alt attribute should be markup applied to the <img> whereas the mw:ExpandedAttrs markup applies it to the wrapper.

Note that data-mw.attribs in the expanded attributes (mw:ExpandedAttrs) mechanism is an array of [name, value] pairs, where the name and the value each have the nominal structure:

{"txt":"<flattened value>","html":"<document fragment>"}

and a plain string can be used instead of the pair object in cases where the flattened value and the document fragment are identical. Applying the same structure to both members of each pair allows for HTML-valued attribute names as well as values.
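To illustrate the shorthand, here is a minimal Python sketch of how a consumer might normalize data-mw.attribs so that names and values are always in pair-object form. The function names and the sample data-mw value are hypothetical and simplified; they are not taken from Parsoid.

```python
import json


def normalize_part(part):
    """Expand the plain-string shorthand into the {"txt", "html"} pair form."""
    if isinstance(part, str):
        return {"txt": part, "html": part}
    return dict(part)


def read_attribs(data_mw_json):
    """Return a list of (name, value) pairs, both sides in pair-object form."""
    data_mw = json.loads(data_mw_json)
    return [
        (normalize_part(name), normalize_part(value))
        for name, value in data_mw.get("attribs", [])
    ]


# Simplified, made-up data-mw value for illustration only.
example = '{"attribs":[["width",{"txt":"160px","html":"<span>160px</span>"}]]}'
pairs = read_attribs(example)
```

Here the plain-string name "width" is expanded to a pair object, while the value keeps its separate flattened and HTML forms.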

We also have a "shadow attribute" mechanism (WTSUtils::getAttributeShadowInfo) which is similar, in that it stores a "richer" value for a given attribute in a hidden data-parsoid property.

Relatedly, structured values for data-parsoid and data-mw are supported via a core interface to fetch a "JSON attribute": DOMDataUtils::getJSONAttribute() which returns an associative array. The implementation of this mechanism is discussed further in Parsoid/OutputTransform/HtmlHolder#Structured-value attributes and the DataBag. Many of the features of structured-value attributes in Parsoid (such as live object representation of values) are restricted to the data-parsoid and data-mw attributes. Other attributes with structured values accessed via ::getJSONAttribute() get a copy of the value which must be explicitly re-written using ::setJSONAttribute() after it is mutated.

There is no "built-in" support for storing document fragments in structured-value attributes; in a number of places where this is done the values are manually parsed from/serialized to strings. This does not interoperate well with the DataBag mechanism used for data-parsoid and data-mw attributes (discussed in link above).

We currently have DOM traversal code which is aware of mw:ExpandedAttrs and some other places where embedded markup can be stored. Because of the limited support for embedded HTML in structured-value attributes, the traversal code has to explicitly parse embedded HTML and then restore potentially-modified HTML after the traversal has completed, regardless of whether the traversal actually mutated the embedded DOM. The issues compound if the embedded DOM itself has structured-value attributes, potentially including additional embedded HTML.

Proposals

This proposal is called "rich attributes" to distinguish it from the existing "structured-value attribute" support in Parsoid. There are three main pieces to this proposal, which can be discussed and implemented separately. The first phase makes our existing structured-value support more general, to support attributes other than data-parsoid and data-mw, and to extend support to include DocumentFragment values (both at top level and embedded). The second phase adopts a uniform representation for structured-value attributes in "standard" HTML, including plaintext fallback values and alternative names. This aims to make our MediaWiki DOM spec more internally consistent. The third phase introduces a standardized marker for structured-value attributes to make possible generic traversal through the extended DOM, including DocumentFragments embedded within structured values.

The first phase is primarily targeted at internal users: it provides a cleaner mental model and API for manipulation of DOM trees containing structured data, and better supports traversal and manipulation of document fragments embedded within attributes in Parsoid and core code. It need not require any externally-visible changes to generated HTML.

The second and third phases are aimed at external users: they allow manipulation of a DOM with rich attributes independent of detailed knowledge about the specific attributes containing structured data. This allows the specification and creation of a general purpose "structured-value attribute" or "rich attribute" DOM library without hard-coded details of specific Parsoid attributes and uses. By cleaning up the specification and form of rich attributes the proposals help third-party consumers of HTML conforming to the MediaWiki DOM Spec to understand how to parse and properly manipulate structured-valued attributes.

At this time there is general consensus among the Content Transform Team on proceeding with phase 1 of this proposal, including exploring the "template bank" representation for embedded DocumentFragments. There is not yet consensus on proceeding with phases 2 and 3 until the Parsoid Read Views project is further along, as these may include changes to the generated HTML which are not backward-compatible with third-party clients.

Phase 1: New API for structured-value attributes

First we propose a general API to allow Parsoid code to treat attributes as structured values with complex types. We are already doing this for some particular JSON-valued attributes; this proposal extends that support first to arbitrary attributes, and second to a larger set of rich values which includes not just JSON-encodable arrays but also DocumentFragments, and JSON-encodable arrays which contain DocumentFragments. Fundamental is a separation of the idea of "structured-valued attribute" from the Parsoid "page bundle" representation: not all attributes with structured values are in the page bundle. See HtmlHolder#Private attributes for more discussion of out-of-band representations for private attributes.

This phase of the proposal does not include a standard serialization of these values, nor does it allow generic traversal of a DOM that includes traversing all structured values. At this stage the proposed API can be implemented with ad-hoc serialization strategies to remain consistent with the current MediaWiki DOM Spec.

In the API attributes can have three basic types of value: "string", "object" (json-encodable array), and "DOM" (DocumentFragment). The two main methods provided are:

  • Element::getAttributeObject(string $name): ?object
  • Element::getAttributeDOM(string $name): ?DocumentFragment

Support for these two methods can be split into pieces corresponding to the individual methods. The ::getAttributeObject() API is phase 1a, and adding ::getAttributeDOM() is phase 1b. Two additional methods will be discussed later, in the context of phase 3:

  • Element::getAttributeString(string $name): ?string
  • Element::getAttributeMixed(string $name): object|DocumentFragment|string|null

Corresponding setAttribute* setters and getAttribute*Default methods (which set a default value if the attribute is missing) are provided for each of these four primary methods. We will call these the "setters" of the primary method in subsequent sections.

Phase 1a: Uniform live representation of structured values in DataBag

In the first phase of the work, we implement ::getAttributeObject() and its setters, storing the live value in the DataBag. This is a simple generalization/refactor of the existing ::getDataParsoid(), ::getDataMw(), ::getDataI18n(), ::getDataParsoidDiff(), etc methods, but allowing an arbitrary attribute name and with corresponding generic support in ::loadDataAttribs() and ::storeDataAttribs(). Like the existing DataI18n, the structured values are stored inline as JSON in the attributes when the document is serialized, not hoisted into the page bundle.

WLOG, a rich attribute named foo is stored in the NodeData associated with the Element as a prefixed dynamic property named rich-foo. This avoids conflict with the existing parsoid and mw properties of NodeData (used for data-parsoid and data-mw), and the rich- prefix ensures that other dynamic properties added to NodeData (eg tmp) don't inadvertently get serialized as rich attributes. Rich attributes are loaded on-demand and serialized using a hook in DOMDataUtils::storeDataAttribs.
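The storage convention above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Element and NodeData classes here are toy stand-ins rather than Parsoid's classes, and the method names get_attribute_object, set_attribute_object and store_data_attribs are hypothetical analogues of the PHP methods discussed in the text.

```python
import json

RICH_PREFIX = "rich-"


class NodeData:
    """Stand-in for the per-element data bag; holds arbitrary dynamic properties."""

    def __init__(self):
        self.props = {}


class Element:
    """Toy element: plain string attributes plus an attached NodeData."""

    def __init__(self, attrs=None):
        self.attrs = dict(attrs or {})
        self.node_data = NodeData()

    def get_attribute_object(self, name):
        key = RICH_PREFIX + name
        if key not in self.node_data.props:
            raw = self.attrs.get(name)
            if raw is None:
                return None
            # Load on demand: parse once, then keep the live value.
            self.node_data.props[key] = json.loads(raw)
        return self.node_data.props[key]

    def set_attribute_object(self, name, value):
        self.node_data.props[RICH_PREFIX + name] = value

    def store_data_attribs(self):
        """Write live rich values back into string attributes."""
        for key, value in self.node_data.props.items():
            if key.startswith(RICH_PREFIX):
                attr_name = key[len(RICH_PREFIX):]
                self.attrs[attr_name] = json.dumps(value, separators=(",", ":"))


el = Element({"data-foo": '{"count":1}'})
el.node_data.props["tmp"] = {"scratch": True}  # no rich- prefix, never serialized
el.get_attribute_object("data-foo")["count"] += 1  # mutate the live value in place
el.store_data_attribs()
```

Because the getter returns the live value, the in-place mutation is picked up at serialization time without an explicit setter call, and the unprefixed tmp property is left out of the attributes.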

Ideally, we would use something like JsonCodec to serialize/deserialize the JSON-encoded values, so that ::getAttributeObject() can return a fully-classed object type (like DataMw or DataI18n) instead of just a stdClass. This would be trivial if we could embed type information into the JSON object, as JsonCodec does; for example:

{
    "_type_": "Wikimedia\\Parsoid\\NodeData\\DataI18n",
    "/": "...",
    "href": "..."
}

However, the HTML bloat caused by the _type_ properties required in such a scheme would be prohibitive. It is also unfortunate that the name embedded in the serialization is for a specific implementation in a specific Wikimedia namespace, although there is precedent in RDF and XML schemas for using explicit namespaces of this sort.

To avoid bloat, we would prefer that the expected type information be provided by the caller of the deserialization code; that is:

  • Element::getAttributeObject(string $attributeName, ?string $className = null): ?object

This works against our later proposal to make our rich attribute representation self-describing; we would prefer that the information about the proper class type to use for a given attribute is not external to the document itself. We'll discuss this further under phase 3. Self-description is an issue only for deserialization; for serialization we can assume that we have fully-classed objects and that the objects themselves know how to properly serialize their contents to a JSON string.
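As a rough illustration of caller-supplied type information, the following Python sketch decodes a JSON-valued attribute either to a plain dictionary or, when the caller passes a class hint, to a classed object. The I18nValue class, its key and params fields, and the data-mw-i18n sample value are invented for this example and do not describe the real DataI18n format.

```python
import json
from dataclasses import dataclass, field


@dataclass
class I18nValue:
    """Hypothetical classed rich value; the field names are illustrative only."""

    key: str
    params: list = field(default_factory=list)

    @classmethod
    def from_json_dict(cls, data):
        return cls(key=data["key"], params=data.get("params", []))


def get_attribute_object(attrs, name, class_hint=None):
    """Decode a JSON-valued attribute, using the caller's class hint if given."""
    raw = attrs.get(name)
    if raw is None:
        return None
    decoded = json.loads(raw)
    if class_hint is None:
        return decoded  # untyped fallback, comparable to a stdClass
    return class_hint.from_json_dict(decoded)


attrs = {"data-mw-i18n": '{"key":"red-link-title","params":["Foo"]}'}
typed = get_attribute_object(attrs, "data-mw-i18n", I18nValue)
untyped = get_attribute_object(attrs, "data-mw-i18n")
```

No type name appears in the serialized value, which avoids the bloat, but the document alone no longer tells a reader which class to use.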

Phase 1b: Supporting live DocumentFragments in structured values

In this phase we add the implementation of Element::getAttributeDOM(string $name) and its setters. This is a straightforward extension of the previous work; we primarily need to teach the serializer/deserializer in ::storeDataAttribs() and ::loadDataAttribs() how to respectively serialize/deserialize DocumentFragment values. We need to recurse into DocumentFragment values in order to call ::storeDataAttribs() on an embedded fragment before that fragment is serialized into an array value, and similarly recurse into a DocumentFragment generated by ::loadDataAttribs() in order to load the attributes of the embedded fragment.

Just as object-valued attributes are stored live in a DataBag attached to the DOM element, HTML-valued attributes are also stored live as DocumentFragments. This allows efficient traversal (we don't have to repeatedly parse and serialize HTML trees) as well as allowing code to (eg) keep persistent handles to certain elements within the DocumentFragment for marking or other purposes.

The most obvious way to serialize/deserialize DocumentFragment values is as an HTML string, like so:

<span data-outer="&lt;span data-inner=&quot;&amp;lt;span>hello,&amp;lt;/span>&quot;> world&lt;/span>">!</span>
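The nesting in this example follows mechanically from escaping each fragment once per level of embedding. A minimal Python sketch, assuming a hypothetical escape_attr helper that escapes only the characters a double-quoted attribute value requires:

```python
import html


def escape_attr(value):
    """Escape only what a double-quoted attribute value needs: &, < and "."""
    return value.replace("&", "&amp;").replace("<", "&lt;").replace('"', "&quot;")


inner = "<span>hello,</span>"
middle = f'<span data-inner="{escape_attr(inner)}"> world</span>'
outer = f'<span data-outer="{escape_attr(middle)}">!</span>'

# Decoding one attribute level recovers the fragment one level down.
recovered_middle = html.unescape(escape_attr(middle))
```

Each additional level of embedding re-escapes every ampersand from the level below, which is why deeply nested fragments grow quickly in this representation.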

Another alternative (briefly discussed in Parsoid/OutputTransform/HtmlHolder#Design decisions 2) is to store DocumentFragments in the main document tree itself, as <template> nodes in the <head>. In this case, the serialization of a DocumentFragment value might just be an ID, either computed as a content hash or assigned to it, which is then used to index the "template bank" in the <head>. For example:

<head>
    <template id="beefbeef"><span>hello</span></template>
    <template id="cafecafe"><span data-inner="beefbeef"> world</span></template>
</head>

<span data-outer="cafecafe">!</span>

This simplifies the ::storeDataAttribs() phase since the HTML fragments are already/always present in the main document and don't require additional serialization into attribute values. Object-valued attributes still need to be storeDataAttrib'ed as JSON, but that can be done by traversing the nodes enumerated by the query head > template. Note that the associative array returned by ::getAttributeObject() may now also contain embedded DocumentFragments. The JsonCodec used should be able to identify and deserialize these to live DocumentFragments of the owner document. In proposal 3 we introduce marker values to recognize and properly deserialize these values automatically, but at this phase we require that the proper $className be provided to ::getAttributeObject($attrName, $className) and put the burden on the deserializer for the named class to locate and deserialize embedded DocumentFragments. For example:

<template id="b0bacafe">A <i>tasty</i> caption</template>
<span data-mw='{"caption":"b0bacafe"}'></span>
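A small Python sketch of this arrangement follows. It makes several assumptions that are not in the text: the bank is modelled as a dictionary rather than <template> elements, ids are the first eight hex digits of a SHA-256 content hash, and CaptionData is a hypothetical class whose deserializer knows that its caption property is a fragment reference.

```python
import hashlib
import json


class TemplateBank:
    """Toy stand-in for the <template> elements in <head>, keyed by id."""

    def __init__(self):
        self.fragments = {}

    def add(self, fragment_html):
        # Assumption: ids are a short content hash; assigned ids would also work.
        frag_id = hashlib.sha256(fragment_html.encode("utf-8")).hexdigest()[:8]
        self.fragments[frag_id] = fragment_html
        return frag_id

    def resolve(self, frag_id):
        return self.fragments[frag_id]


class CaptionData:
    """Hypothetical classed value that owns the knowledge of where HTML is embedded."""

    def __init__(self, caption_html):
        self.caption_html = caption_html

    @classmethod
    def from_json(cls, raw, bank):
        return cls(bank.resolve(json.loads(raw)["caption"]))

    def to_json(self, bank):
        return json.dumps({"caption": bank.add(self.caption_html)}, separators=(",", ":"))


bank = TemplateBank()
data = CaptionData("A <i>tasty</i> caption")
serialized = data.to_json(bank)
restored = CaptionData.from_json(serialized, bank)
```

The attribute value carries only the short id, and the burden of finding the embedded fragment rests on the class-specific deserializer, as described above.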

At the end of phase 1, we have live objects and live DocumentFragments representing attribute values, and our own code can access these uniformly with a simple rich-attribute API, but the serialization and deserialization of these rich values to HTML is ad hoc and potentially inconsistent. A third party user must carefully implement bespoke serialization and deserialization logic for every rich attribute it uses. Traversal requires the implementation of attribute deserializers which will properly construct live objects from the varied contents of HTML attributes, but once the document is fully parsed traversal can enumerate the contents of the NodeData object at a node to locate additional DocumentFragment or structured object values to recurse into.

Backward compatibility

Assuming DocumentFragments are serialized as HTML strings, nothing in proposal 1 requires a backward-compatibility break with generated HTML, although the burden is on the object serialization code to maintain compatible formatting. The "template bank" representation can even be implemented internally (ie, when an HTML string is parsed into a DocumentFragment, that fragment is hung off a new <template> element in the <head>) without affecting the serialization (the <template>s are removed and converted back to HTML embedded in JSON or attribute values in ::storeDataAttribs()).

If a gradual shift to a template bank representation for external users is desired, it ought to be possible to use a template bank representation for selected attributes by adding special cases to the serializer. Similarly, it should be possible to use a template bank for internal HTML storage while converting to inline HTML in attributes for external clients, although because structured values are not self-describing (cf proposal 3) the conversion will require detailed knowledge of every place where HTML fragments could be embedded in attribute or object values.

Phase 2: Uniform HTML representation for structured-value attributes

The first step toward a uniform serialized HTML representation for structured-value attributes is to introduce a naming and location convention for them, so that, for example, "the structured value of the href attribute of this a element" can be located.

Let's first define "an attribute with special HTML semantics". This is an attribute whose stripped "non-rich" value is semantically meaningful for HTML. For example, the href attribute of an a tag is the URL which the browser will load when you click on the link. For now we'll say that "any attribute whose name doesn't start with data-" is an attribute with special HTML semantics.

For attributes without special HTML semantics we store the structured value directly in the named attribute as a JSON-encoded string (if an object) or an HTML string (if a DocumentFragment).

For attributes with special HTML semantics we will store a "flattened" version of the value directly under the named attribute. For a DOM value this is the textContent of the DocumentFragment. For an object value this is defined by the object class type. For a string value the flattened value is the string itself. The structured value is stored under the attribute name prefixed with data-mw-attr-, encoded as for attributes without special HTML semantics. (Obviously, attribute names beginning with data-mw-attr- are reserved.)

For illustrative purposes we'll assume DocumentFragments aren't stored in the <head> but are instead stored as HTML strings. Setting the title of a <p> tag to the rich DocumentFragment <b>bold</b> will then result in the serialization:

<p title="bold" data-mw-attr-title='&lt;b>bold&lt;/b>'>
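The naming and flattening rules above can be sketched as a small Python function that returns the attributes to write for one rich value. This is an illustration only: Fragment is a stand-in for DocumentFragment that holds an HTML string, the flatten callback stands in for the class-defined flattening of object values, plain strings are assumed to be stored as ordinary attributes, and HTML attribute escaping is left out.

```python
import json
from html.parser import HTMLParser

RICH_PREFIX = "data-mw-attr-"


class Fragment:
    """Stand-in for a DocumentFragment, holding its HTML as a string."""

    def __init__(self, html):
        self.html = html


class _TextCollector(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def text_content(fragment_html):
    """Approximate textContent by concatenating the text nodes."""
    collector = _TextCollector()
    collector.feed(fragment_html)
    collector.close()
    return "".join(collector.chunks)


def serialize_rich(name, value, flatten=str):
    """Return the {attribute name: attribute value} pairs for one rich value."""
    if isinstance(value, str):
        return {name: value}
    if isinstance(value, Fragment):
        encoded, flat = value.html, text_content(value.html)
    else:
        encoded, flat = json.dumps(value, separators=(",", ":")), None
    if name.startswith("data-"):
        # No special HTML semantics: store the structured value directly.
        return {name: encoded}
    if flat is None:
        flat = flatten(value)  # object values define their own flattening
    return {name: flat, RICH_PREFIX + name: encoded}


title_attrs = serialize_rich("title", Fragment("<b>bold</b>"))
data_attrs = serialize_rich("data-foo", {"a": 1})
```

For the title example this yields a flattened title of bold alongside a data-mw-attr-title attribute holding the HTML, while data-foo receives its JSON directly.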

Alternatively, we can use the mw:ExpandedAttrs serialization for attributes with special HTML semantics. In this case instead of storing the value under the name prefixed with data-mw-attr- it is stored as the attribs property under the structured-value data-mw attribute, as an array of <attribute name, structured attribute value> pairs. For the rich title attribute in the previous example, the serialization would be:

<p title="bold" typeof="mw:ExpandedAttrs" data-mw='{"attribs":[["title","&lt;b>bold&lt;/b>"]]}'>

This has some advantages in terms of initial migration and compatibility with existing markup, and should be supported for deserialization at least initially. It has the advantage of requiring only two attribute names to be reserved (typeof and data-mw) and all elements with rich attributes can be easily located with Document::querySelectorAll('[typeof~="mw:ExpandedAttrs"]') (the ~= operator is needed because typeof is a space-separated list of types). On the other hand, the typeof attribute is not indexed by the DOM so that query must touch every node in the Document anyway, and for practical use the attribs content of data-mw must coexist and be merged with the "other" structured values stored in data-mw by the MediaWiki DOM spec, making data-mw an unusual corner case. (For example, note that ::storeDataAttribs() is going to rewrite the contents of the data-mw attributes as it serializes structured-value attributes, in a departure from current code, and must ensure that it does so before hoisting data-mw in one of our alternate representations.)

One possible drawback of both these representations is that a naive implementor might get lulled into inattention by the fact that href/title/etc are "usually" plain text, and get caught unaware by the need to parse a data-mw-attr-* or data-mw.attribs value when it (unusually) appears.

The primary work in this phase is migrating corner cases to this uniform representation. Above we identified several attributes in the media representation where the mw:ExpandedAttrs information was misplaced; there are also bespoke structured-value fields which don't use the standard representation, like the data-* attributes used for language conversion or hidden inline captions for media. Initially we'll use an allow list or block list for attribute names we haven't yet converted to a uniform representation, along with various hacks in ::loadDataAttribs()/::storeDataAttribs() to accommodate them. These exceptions should all be migrated to use the standard and the hacks gradually removed. Of course, any new Parsoid features should use structured value attributes in the standard form.

Backward compatibility

The changes in phase 2 can be rolled out piece by piece. If consensus is reached on this proposal as the long term direction for the MediaWiki DOM Spec, the first step is probably just to locate and mark as deprecated/subject to change any usage of "misplaced" structured-value attributes in the existing MediaWiki DOM Spec, that is, attributes where the mw:ExpandedAttrs/data-mw.attribs is on a different element than the attribute itself, or where the name in data-mw.attribs doesn't correspond to an attribute on the element. Then those misplaced attributes can either be migrated in small steps one-by-one after clients are located, notified, and updated; or else the changes can be bundled together into a single "big" breaking change, with a post-processing downgrade pass available to "move them back" for backward compatibility. The mw:ExpandedAttrs and data-mw-attr-* representations ought to be semantically equivalent, so a postprocessing step could be written to convert back-and-forth between them as well. Transitional code could also read from both but write only the preferred or compatible version.
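A conversion pass between the two representations might look like the following Python sketch. It operates on a plain dictionary of one element's attributes and assumes that names and values in data-mw.attribs are plain strings; the function names are hypothetical.

```python
import json

PREFIX = "data-mw-attr-"
MARKER = "mw:ExpandedAttrs"


def to_expanded_attrs(attrs):
    """Move data-mw-attr-* values into data-mw.attribs; returns a new dict."""
    out = {k: v for k, v in attrs.items() if not k.startswith(PREFIX)}
    pairs = [[k[len(PREFIX):], v] for k, v in attrs.items() if k.startswith(PREFIX)]
    if not pairs:
        return out
    data_mw = json.loads(out.get("data-mw", "{}"))
    data_mw.setdefault("attribs", []).extend(pairs)
    out["data-mw"] = json.dumps(data_mw, separators=(",", ":"))
    types = out.get("typeof", "").split()
    if MARKER not in types:
        types.append(MARKER)
    out["typeof"] = " ".join(types)
    return out


def to_prefixed_attrs(attrs):
    """Inverse direction: move data-mw.attribs entries to data-mw-attr-*."""
    out = dict(attrs)
    data_mw = json.loads(out.get("data-mw", "{}"))
    for name, value in data_mw.pop("attribs", []):
        out[PREFIX + name] = value
    if data_mw:
        out["data-mw"] = json.dumps(data_mw, separators=(",", ":"))
    else:
        out.pop("data-mw", None)
    types = [t for t in out.get("typeof", "").split() if t != MARKER]
    if types:
        out["typeof"] = " ".join(types)
    else:
        out.pop("typeof", None)
    return out


original = {"title": "bold", "data-mw-attr-title": "<b>bold</b>"}
expanded = to_expanded_attrs(original)
round_tripped = to_prefixed_attrs(expanded)
```

Other data-mw properties and other typeof values are preserved in both directions, which is the merging concern raised for the data-mw representation.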

Phase 3: Uniform marking of structured-value attributes

Up to this point, processing a document containing rich attributes requires an external schema giving the data type of each attribute. Given <span data-foo="&lt;b>x&lt;/b>"> a client can't tell whether the value of data-foo is supposed to be the literal string <b>x</b> or the DocumentFragment resulting from parsing that value as HTML, and given <span data-foo='{}'> the client can't distinguish between the literal value {} and the empty object value. If we are encoding HTML by reference to a template bank in the <head>, we can't tell the difference between the literal string abcd and the use of "abcd" as a fragment reference in the template bank.

This is a particular problem if we want to traverse the document to process embedded HTML, since without a schema for the document we don't know which attribute values we should check. There are two mitigating factors:

  • If we use the typeof=mw:ExpandedAttrs representation for structured value attributes then every attribute containing embedded HTML should be marked with that typeof. This helps us find a subset of attributes that require further attention, but we still can't distinguish an object-valued attribute from an HTML-valued attribute, nor can we identify HTML within a property of an object-valued attribute.
  • If we use the <template> bank representation for embedded HTML, then we can traverse all embedded <template> elements to be certain of discovering all embedded HTML, although we can't determine for certain where those fragments are actually embedded. It is certainly some help to be able to traverse the entire document, but because we don't know exactly how the template bank is embedded we are still unable to traverse a portion or subtree of it.

Traversal/mutation during html2html passes is an important use case for the MediaWiki DOM Spec. It is important that embedded HTML be "visible" to post processing passes, so that (for example):

  • i18n fragments inside href are expanded
  • redlinks inside language converter markup still work
  • redlinks/language converter markup/i18n fragments inside "invisible media captions" are properly processed -- so that if the VE user toggles the media style from inline to thumb the proper i18n/language converter markup/etc should already be present in the caption.

In order to do this with the current system, we need to write a specialized traverser with special knowledge of mw:ExpandedAttrs and data-mw.attribs and a number of other internal Parsoid features. As new embedded HTML is added inside internal data structures, the traversal must continue to be extended to handle each place embedded HTML may be found, and before proposal 1a is implemented the traverser must additionally parse from JSON, parse to DOM, mutate, serialize HTML, and then serialize JSON at each place.

In proposal 3a, we simply examine every attribute value for a leading { or [ and then recurse into the rich value of any such attributes we find. In most cases the attributes will already have been loaded into the DataBag, so we need only enumerate the rich-* properties of the DataBag.

An alternative in proposal 3b embeds a schema in the document, rather than in the attribute values. This is potentially more rigid: a given attribute may only have a single type, not a union of types, but it potentially provides a more powerful type system than the simple three-type system of proposal 3a.

We can also decide that the "enumerate the entire document" capability provided by the <template> bank is sufficient, and no additional action be taken under proposal 3.

Phase 3 Alternative 1: Value marking

In proposal 2 we already attempted to preserve the "normal HTML semantics" of attributes by storing a flattened string representation of well-known attributes like class, href, and alt even when the full structured value is stored elsewhere; this is similar to Parsoid's "shadow attributes".

In this proposal we tweak the encoding of object/array, string, and DocumentFragment values to make them uniquely identifiable:

  • A plain string value is encoded as { "_s": <value> }
  • A DocumentFragment is encoded as { "_h": "<DocumentFragment innerHTML>" } or { "_h": "<template bank id>" }
  • An associative array or object value is encoded by:
    • Any key/property whose name starts with an underscore has an additional underscore prepended to its name; otherwise the ordinary JSON encoding is used, except that
    • Any array or object value is recursively encoded with this algorithm.
  • (Optionally) object values can also use JsonCodec and/or a ::flatten() method to customize their encoding.
  • (Optionally, for compatibility with current MediaWiki DOM spec) a property named "html" has its value parsed as a DocumentFragment, and any non-DOM property named "html" is renamed "_html". (A "real" property named "_html" would already be renamed "__html" by the above.)

So nominally, $p->setAttributeString("data-mw-foo", "hello, world") would result in:

<p data-mw-foo='{"_s":"hello, world"}'>...</p>

(but see the optimization described below), and a complex JSON object type may embed like this:

<p data-mw-foo='{"name":"bar","html":{"_h":"<span>xyz</span>"}}'></p>

To reduce bloat and increase compatibility, if the value to be stored is a simple string (including a DocumentFragment with a single Text node) and the first character of that string value is not { or [ then the string value is stored directly in the attribute. Thus, the example above where we set data-mw-foo to the string value hello, world would actually be represented as

<p data-mw-foo="hello, world">

but if we set the attribute to the string value {hello} it would be serialized as

<p data-mw-foo='{"_s":"{hello}"}'>

so that the first character of the attribute value can be used to indicate type. If the attribute has "special HTML semantics" (see proposal 2) and the value is a simple string, then the value can be stored directly under the attribute name, without the need for an additional data-mw-attr-* attribute (which would have the identical value). This optimization makes the output look "less weird" in the common cases and preserves HTML semantics for important attributes like title and href. As with proposal 2, one drawback may be that those parsing string-valued data-mw-foo attributes may be caught unaware when the value starts with a { or [ and the representation changes.

Now by looking at any attribute value to see if the first character is { or [ we can determine whether a structured value is stored in a given attribute, and for structured values we can identify any embedded DocumentFragments and properly restore or traverse them. We can implement ::getAttributeMixed() which uses the value marking to return the proper value for a union-typed attribute. On the other hand, we also need to introduce ::getAttributeString() and all code must use it instead of the standard DOM ::getAttribute() if the attribute value could possibly start with a { or [ in order to ensure those are appropriately escaped when necessary. During the transition period we can use an allow or block list to mark attributes which have been ported to be self-describing in this way (ie, to protect attributes whose values may happen to start with { or [ but which shouldn't be parsed as a rich attribute), and/or require the affirmative presence of a marker attribute before values are parsed as rich.
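The marking rules and the plain-string optimization can be sketched in Python as follows. This is a simplified illustration: Fragment is a stand-in for DocumentFragment that stores innerHTML as a string, encode_attribute and decode_attribute are hypothetical names (decode_attribute plays the role of ::getAttributeMixed()), and the optional "html" compatibility rule and the single-Text-node optimization are not implemented.

```python
import json


class Fragment:
    """Stand-in for a DocumentFragment, holding its innerHTML as a string."""

    def __init__(self, html):
        self.html = html

    def __eq__(self, other):
        return isinstance(other, Fragment) and other.html == self.html


def _mark(value):
    if isinstance(value, Fragment):
        return {"_h": value.html}
    if isinstance(value, dict):
        # Keys that already start with "_" get one more underscore prepended.
        return {("_" + k if k.startswith("_") else k): _mark(v) for k, v in value.items()}
    if isinstance(value, list):
        return [_mark(v) for v in value]
    return value


def _unmark(value):
    if isinstance(value, dict):
        if set(value) == {"_h"}:
            return Fragment(value["_h"])
        if set(value) == {"_s"}:
            return value["_s"]
        return {(k[1:] if k.startswith("_") else k): _unmark(v) for k, v in value.items()}
    if isinstance(value, list):
        return [_unmark(v) for v in value]
    return value


def encode_attribute(value):
    if isinstance(value, str):
        if value[:1] in ("{", "["):
            return json.dumps({"_s": value}, separators=(",", ":"))
        return value  # optimization: most plain strings are stored directly
    return json.dumps(_mark(value), separators=(",", ":"))


def decode_attribute(raw):
    """The first character of the attribute value selects the decoding."""
    if raw[:1] in ("{", "["):
        return _unmark(json.loads(raw))
    return raw


rich = {"name": "bar", "html": Fragment("<span>xyz</span>"), "_private": 1}
encoded = encode_attribute(rich)
decoded = decode_attribute(encoded)
```

Because user keys beginning with an underscore are always escaped, a decoded object containing exactly the key _h or _s can only be a marker, never user data.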

Another variation here is to combine this value marking with a shift from JSON encoding to a more efficient representation which is more easily or compactly embedded in HTML, for example, base85-encoded CBOR, with an even-less-common marker prefix (\t or \f or a deprecated Unicode character like U+0149 ʼn or U+0673 ٳ in the two-byte region of UTF-8) to distinguish encoded from "plain string" values.

Using an example from Proposal 1b above:

<span data-outer="&lt;span data-inner=&#39;&amp;lt;span>hello,&amp;lt;/span>&#39;> world&lt;/span>">!</span>

becomes the following using type marking:

<span data-outer='{"_h":"&lt;span data-inner=&#39;{\"_h\":\"&amp;lt;span>hello,&amp;lt;/span>\"}&#39;> world&lt;/span>"}'>!</span>

and using a CBOR encoding followed by Ascii85 encoding, and using a leading ʼn to indicate the presence of a structured value:

<span data-outer='ʼnTjhABGX4H5E+*W,A79Rg/ST*?ATBp]`JIQ/BL+sS,V)Sr5W`416:"O[0dJD.7oLs-.k4Um-U&YsDfTZ)4>1bp@;\\7'>!</span>

It is likely that a more complex example may save more bytes, but it isn't obvious there are significant wins to be had here.

Phase 3 Alternative 2: Type marking with a type dictionary[edit]

Instead of using the first character of an attribute value as a type tag, we can also embed an explicit type tag. One encoding option reduces bloat to 6-7 characters per structured attribute value by compressing the inline type information down to a single-character property name and a short numeric identifier ({"@":5,...}) and then embedding a map from the numeric type ids to complete type names in the <head>. This might look like the following:

<head>
    <script type="rich-schema">
        {
            "0": "DocumentFragment",
            "1": "Wikimedia\\Parsoid\\NodeData\\DataMw",
            "2": "Wikimedia\\Parsoid\\NodeData\\DataI18n",
            ...
        }
    </script>
    <template id="richurl">http://example.org/<span>foo</span></template>
    <template id="cafebad">Some <b>HTML</b>!</template>
</head>

<a href="http://example.org/foo" data-mw-attr-href='{"@":0,"_":"richurl"}'>Foo</a>

<span data-mw='{"@":1,"caption":{"@":0,"_":"cafebad"}}'>...</span>

<a data-mw-i18n='{"@":2,"title":{"lang":"x-page","key":"red-link-title","params":["Non existing page"]}}'>...</a>

To the extent that object types are correlated with attribute names, a gzip encoding of the HTML would be expected to combine the tag and the type prefix into a single dictionary entry (eg data-mw='{"@":1,") making the type information "free" from a bandwidth perspective. Client browsers would still need to store the extra characters in their DOM, however. This proposal is backward compatible with existing markup, as it simply adds a new type property to existing attributes without changing the values of existing properties.
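As a rough sketch of how a client might use the type dictionary, the Python below walks a decoded attribute value and reports every nested object whose "@" id maps to DocumentFragment. The TYPE_MAP literal stands in for the parsed contents of the rich-schema script in the <head>, and fragment_refs is a hypothetical helper name, not part of any existing library.

```python
import json

# Stand-in for the parsed <script type="rich-schema"> map from the <head>.
TYPE_MAP = {
    "0": "DocumentFragment",
    "1": r"Wikimedia\Parsoid\NodeData\DataMw",
    "2": r"Wikimedia\Parsoid\NodeData\DataI18n",
}

def fragment_refs(value, type_map, path=()):
    """Yield (path, template_id) for each embedded DocumentFragment marker."""
    if isinstance(value, dict):
        if type_map.get(str(value.get("@"))) == "DocumentFragment":
            yield path, value["_"]
            return
        for key, child in value.items():
            if key != "@":  # the type tag itself is not data
                yield from fragment_refs(child, type_map, path + (key,))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from fragment_refs(child, type_map, path + (index,))

data_mw = json.loads('{"@":1,"caption":{"@":0,"_":"cafebad"}}')
refs = list(fragment_refs(data_mw, TYPE_MAP))
```

For the data-mw example above, refs contains a single entry pointing at the caption property and the cafebad template; the data-mw-i18n example yields nothing, since it embeds no fragments.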

The schema presented in the above example hard-codes classes from the current PHP implementation. It may be preferable to use an abstract type system corresponding more closely to the string|fragment|object type system from alternative 1, where a numbered type could be of the form {foo:DOM} to indicate "an object type that contains as a member named foo a value of DocumentFragment type". Any fields not recursively containing DOM fragments could be elided from the type specification, since the primary purpose is not to give a full semantic type to the value but only to guide traversal of embedded DOM fragments. Other example types could include DOM, {foo:{bar:DOM}}, {foo:[DOM]} etc.
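To make the abstract type idea concrete, here is a small Python sketch of a traversal guided by such a specification. The representation is an assumption made for illustration: the string "DOM" marks a fragment, a dict maps member names to sub-specifications, and a one-element list describes an array of that element type. Fields absent from the specification are never visited.

```python
DOM = "DOM"  # marker for "a value of DocumentFragment type"

def dom_paths(spec, value, path=()):
    """Yield (path, fragment) pairs located by following `spec` into `value`."""
    if spec == DOM:
        yield path, value
    elif isinstance(spec, dict):
        # {foo: spec}: descend only into the named members that are present.
        for key, sub_spec in spec.items():
            if isinstance(value, dict) and key in value:
                yield from dom_paths(sub_spec, value[key], path + (key,))
    elif isinstance(spec, list):
        # [spec]: every array element has the same type.
        (sub_spec,) = spec
        if isinstance(value, list):
            for index, item in enumerate(value):
                yield from dom_paths(sub_spec, item, path + (index,))

nested = list(dom_paths({"foo": {"bar": DOM}},
                        {"foo": {"bar": "a<b>c</b>"}, "baz": "{not html}"}))
listed = list(dom_paths({"foo": [DOM]}, {"foo": ["x", "<i>y</i>"], "n": 1}))
```

Note that the baz and n members are skipped entirely, which is the point of eliding fields that contain no fragments: the specification only has to describe where the DOM lives, not the full shape of the value.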

Phase 3 Alternative 3: Embedded schema[edit]

Another option might be to create a schema document giving the rich type of any attribute. The schema could be embedded in the <head>, or supplied as a standalone document to the Rich HTML library whenever a document is to be parsed. It could use CSS-like or XPath specifications, like:

/*/@data-mw-i18n  Wikimedia\Parsoid\NodeData\DataI18n
/*/@data-mw Wikimedia\Parsoid\NodeData\DataMw
/*/@href DocumentFragment

or else it could be provided in executable form as a callable mapping an Element and attribute name to the proper type. As above, instead of naming PHP class names from our current implementation, the types could be given as abstract specifications sufficient to locate DocumentFragments.
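The executable form might look something like the following Python sketch, which uses the standard library's xml.etree.ElementTree on a well-formed snippet. It assumes the schema is keyed on attribute name alone and applies to attributes on any element; rich_type and typed_attributes are hypothetical names chosen for this example.

```python
import xml.etree.ElementTree as ET

# Declarative part: attribute name -> rich type (taken from the listing above).
SCHEMA = {
    "data-mw-i18n": r"Wikimedia\Parsoid\NodeData\DataI18n",
    "data-mw": r"Wikimedia\Parsoid\NodeData\DataMw",
    "href": "DocumentFragment",
}

def rich_type(element, attr_name):
    """Callable form of the schema: (Element, attribute name) -> type or None."""
    return SCHEMA.get(attr_name)

def typed_attributes(root, type_of=rich_type):
    """Yield (tag, attribute, type) for every attribute the schema covers."""
    for element in root.iter():
        for name in element.attrib:
            found = type_of(element, name)
            if found is not None:
                yield element.tag, name, found

root = ET.fromstring('<p><a href="x" title="t">Foo</a><span data-mw="{}">y</span></p>')
found = list(typed_attributes(root))
```

Because the callable receives the Element as well as the attribute name, a more selective schema (for example, treating href as rich only on certain elements) could be substituted without changing the traversal.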

Backward compatibility[edit]

Alternative 1 was designed to minimize the changes required to existing serialization. Strings with the most common values are stored unmodified, and object-valued attributes are stored with a leading { as is the current practice. The two main differences are (a) an alternate encoding of string values which start with { or [, although that can be mitigated using a separate marker attribute, and (b) marking of DocumentFragment values inside object-valued attributes. Currently, such a DocumentFragment is stored directly as an HTML string, as the value of an arbitrary property in the JSON, often but not always named t or html, like so:

<span typeof="mw:LanguageVariant" data-mw-variant='{"disabled":{"t":"bar<b>baz</b>"}}'></span>

Marking the DocumentFragment moves the HTML string one level down, like so:

<span typeof="mw:LanguageVariant" data-mw-variant='{"disabled":{"t":{"_h":"bar<b>baz</b>"}}}'></span>

Since these two alternatives can be distinguished by shape (a string in the first case becomes an object with an _h property in the second), it is likely that a custom deserializer could recognize both alternatives during a transitional period. On the other hand, any third party clients would need to be aware of the change in representation. Alternative 2 may actually be easiest from a backward-compatibility standpoint, as it simply adds a new type marker property to existing JSON output. External clients can be expected to simply ignore both the extra property and the type map in the <head>. However, marking the type of DocumentFragments still requires moving from a string to an object in the representation, just like alternative 1. Continuing with the example above, the output in proposal 3b would be:

<span typeof="mw:LanguageVariant" data-mw-variant='{"disabled":{"t":{"@":9,"_":"bar<b>baz</b>"}}}'></span>
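A transitional deserializer that distinguishes these representations by shape could be sketched as follows. This is an illustration only: fragment_html is a hypothetical helper, the caller is assumed to already know that the property in question (here t) holds a fragment, and the set of numeric ids that mean DocumentFragment is passed in rather than read from a type map in the <head>.

```python
import json

def fragment_html(value, fragment_type_ids=frozenset()):
    """Return the HTML string for a fragment-valued property, or None.

    Accepts the legacy bare string, the {"_h": ...} marking from
    alternative 1, and the {"@": id, "_": ...} marking from alternative 2.
    """
    if isinstance(value, str):
        return value  # legacy: HTML stored directly as a string
    if isinstance(value, dict):
        if isinstance(value.get("_h"), str):
            return value["_h"]
        type_id = value.get("@")
        if (isinstance(type_id, int) and type_id in fragment_type_ids
                and isinstance(value.get("_"), str)):
            return value["_"]
    return None

legacy = json.loads('{"disabled":{"t":"bar<b>baz</b>"}}')
marked = json.loads('{"disabled":{"t":{"_h":"bar<b>baz</b>"}}}')
typed = json.loads('{"disabled":{"t":{"@":9,"_":"bar<b>baz</b>"}}}')
results = [
    fragment_html(legacy["disabled"]["t"]),
    fragment_html(marked["disabled"]["t"]),
    fragment_html(typed["disabled"]["t"], fragment_type_ids={9}),
]
```

All three inputs yield the same HTML string, which is what would let cached legacy markup and newly serialized markup coexist during a transition.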

Alternative 3 would require no compatibility break, but it would require us to come up with a declarative specification of some sort giving the desired type mapping for the MediaWiki DOM Spec, and every third party user of HTML complying with the MediaWiki DOM Spec would be required to provide a local copy of that specification to their rich attribute library in order to properly parse our output.

Work in Progress[edit]

  • Investigate parsing context of <template> elements to ensure that we can round trip any DocumentFragment.
    • STATUS: complete, seems okay.
  • For transition purposes, we need to handle existing attributes which may happen to start with { or [ but shouldn't be treated as a rich attribute. The easiest solution is to require the affirmative presence of a matching data-mw-attr-* attribute before a value is treated as rich, and some similar affirmative marker for attributes which don't use "special HTML semantics".
  • As mentioned above, the existing mw:ExpandedAttrs markup serves many of the same purposes as rich attributes; the "shadow attributes" inside data-parsoid are also very similar. Neither of these is explicitly used in client code (as far as I am aware); however, we should support existing markup using mw:ExpandedAttrs or shadow attributes which may come from the cache for html2wt, even if we want to shift the "canonical" representation of rich attributes. One approach to this would be to augment the rich attribute "loader" to transparently treat mw:ExpandedAttrs or a shadow attribute as an equivalent rich attribute. Another would be to write a simple html2html preprocessing pass which does this remapping in the html2wt direction. STATUS: using mw:ExpandedAttrs natively as the serialization format in present patches.
  • mw:ExpandedAttrs allows for HTML-valued attribute names as well as values. It seems that the same flattening mechanism we use for "attributes with special HTML semantics" can be used to serialize these using a flattened attribute name. Some outstanding questions:
    • How to deal with conflicting flattened names (for example: href, hr<b>ef</b>, and <b>href</b>) which presumably have the same flattened serialized attribute.
    • What the API should look like for getting/setting values on a DocumentFragment-valued attribute name. Perhaps object identity is sufficient for the name lookup? (But that implies we might also have multiple attributes with identical names, but differing DocumentFragment instances.)
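The name-conflict question can be illustrated with a short Python sketch. It assumes, purely for illustration, that "flattening" an HTML-valued attribute name means taking its text content; the actual flattening mechanism may differ. flatten_name is a hypothetical helper built on the standard library's html.parser.

```python
from html.parser import HTMLParser

class _TextOnly(HTMLParser):
    """Collect only the character data of an HTML string, dropping tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def flatten_name(html_name):
    """Flatten an HTML-valued attribute name to its text content (assumed rule)."""
    parser = _TextOnly()
    parser.feed(html_name)
    parser.close()
    return "".join(parser.parts)

names = ["href", "hr<b>ef</b>", "<b>href</b>"]
flattened = {flatten_name(name) for name in names}
```

Under this assumption all three names collapse to the same flattened string, so a lookup keyed only on the flattened name cannot tell them apart; some additional key (such as fragment identity, as suggested above) would be needed.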