Parsoid/OutputTransform/HtmlHolder

When Parsoid (and Parsoid-aware transforms) hold a DOM object model, there are two important features/extensions which an  interface in core needs to be aware of. The first is structured-value and private attributes within the Document, and the other is the representation of standalone document fragments. This document describes these features and presents consequent design decisions relating to the  abstraction.

Structured-value attributes and the DataBag
The MediaWiki DOM Spec contains a large number of "JSON-valued" attributes, to express structured values in HTML attributes in a compact and bandwidth-friendly way. Parsoid implements support for this primarily in the  class, based on the   method which returns a structured value: in the PHP implementation an associative array; formerly in the JavaScript implementation a JS Object. This value is nominally present in "plain old HTML" or via Element::getAttribute as the JSON-encoded value of the array/object, but is stored "live": that is, it is not parsed and re-serialized to a string attribute value every time the attribute is read or modified, but instead is kept as a live array or object value attached to the DOM Node. References to the live value can be kept, and any mutation of it is immediately visible to every other holder of a reference to the value.

The actual implementation is a bit baroque -- in addition to a proposal ("Rich attributes") to extend this basic mechanism to include DOM DocumentFragment values, there are multiple different serialization formats for these structured values. The nominal "as a JSON-encoded string" version we call "inline attributes". It suffers from a perceived "ugliness" problem, since inside a quoted HTML attribute value all quotes must be escaped, and JSON-encoded values contain a large number of quotation marks. This is mitigated by the use of single-quotes around the attribute value in a minor departure from standard HTML serialization, but if the structured value contains HTML markup escaping becomes inevitable as (a) both available quotation marks have been used, and (b)  and   are additionally required to be escaped. There's a separate but orthogonal issue with the exposure of "private" attributes in this naive serialization, discussed below. For these two reasons, Parsoid has historically supported two additional alternative encodings of structured attributes. By adding a unique id attribute to every node, the values of structured attributes can be hoisted out of the HTML and stored as a mapping from ID to attribute value. In one encoding this map is kept as a separate JSON-encoded blob alongside the HTML; the combination of JSON blob and HTML is called a PageBundle (page bundles have further uses described below). In another representation the combination is kept as a single HTML document, but the JSON-encoded map is stored in a  element in the   of the Document. This reduces the bloat caused by encoding all the quotation marks in the structured attributes, but adds additional bandwidth to record ID attributes on every node and additionally to include those ID values in the key portion of the map in the.
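To make the trade-off concrete, here is a minimal Python sketch of the inline encoding next to the hoisted ID-to-value map. It is illustrative only: Parsoid itself is PHP, none of these function names are Parsoid APIs, and the attribute name `data-example`, the `n0`-style IDs, and the exact set of escaped characters are assumptions made for the example.

```python
import json
from html import escape


def inline_attribute(name, value):
    """Encode one structured value as a single-quoted inline JSON attribute."""
    encoded = json.dumps(value, separators=(",", ":"))
    # Assumption: escape &, <, > and the single quote used as the delimiter.
    escaped = escape(encoded, quote=False).replace("'", "&#39;")
    return f"{name}='{escaped}'"


def hoist(nodes):
    """Move structured values out of the markup into an ID-keyed map.

    `nodes` is a list of (tag, {attribute name: structured value}) pairs.
    Returns the HTML string and the separate map.
    """
    html_parts = []
    id_map = {}
    for index, (tag, attrs) in enumerate(nodes):
        node_id = f"n{index}"  # hypothetical ID scheme
        id_map[node_id] = attrs
        html_parts.append(f'<{tag} id="{node_id}"></{tag}>')
    return "".join(html_parts), id_map


value = {"html": "<b>it's</b>"}
inline = inline_attribute("data-example", value)
html, id_map = hoist([("p", {"data-example": value})])
bundle = json.dumps(id_map, separators=(",", ":"))
```

In the inline form the markup inside the value has to be entity-escaped; in the hoisted form the HTML carries only an ID and the JSON blob holds the value unescaped, at the cost of repeating the ID in both places.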

This id-to-attribute map is also used internally to the implementation: instead of hanging the rich attribute values directly off of the DOM Node, in the PHP implementation the ID-to-value map is stored in a  which is attached to the root   object. This is because the existing PHP implementation of the DOM uses ephemeral PHP objects to wrap the "actual" representation of the  implemented by the   library. Those ephemeral PHP wrapper objects are created and destroyed every time a reference to the Node goes into or out of scope in PHP. When the ephemeral PHP wrapper goes out of scope, any data attached to the Node is destroyed, even if a reference to the Node is still present in the native document model. By keeping a persistent reference to the (wrapper of the) main  object in Parsoid's   class, which is kept alive for the duration of the parse, we can prevent the   from being destroyed. (We could also just keep an explicit reference to the  in the   which would avoid the use of dynamic properties in PHP.)

Parsoid contains a "load" mechanism that runs after DOM parsing which loads structured-valued attributes into the, implemented in  , and a corresponding "store" mechanism in. In our implementation changes made to the live object stored in the  are not reflected in the raw attribute value visible via  until a "store" is done. Similarly, several Parsoid helper methods based on structured attributes, like, will not work correctly until a "load" is done. The "eager" loading mechanism could be replaced by a "lazy" loader which didn't locate and load structured values (whether from inline attributes or a map) until requested. Lazy loading could eliminate the need for an explicit "load" step, but since the values are live and can be mutated without notification to the DOM layer, an explicit "save" step will always be necessary to ensure the serialized DOM reflects the latest values for structured-value attributes.
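The load/store behaviour described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Parsoid's implementation: the `Element` and `Document` classes, the `bag` dictionary, and the attribute name `data-example` are all hypothetical stand-ins.

```python
import json


class Element:
    """Toy element that only knows about raw string attributes."""

    def __init__(self, attributes=None):
        self.attributes = dict(attributes or {})

    def get_attribute(self, name):
        return self.attributes.get(name)


class Document:
    """Toy document owning its elements and a bag of live values."""

    def __init__(self, elements):
        self.elements = list(elements)
        self.bag = {}  # (element index, attribute name) -> live value


def load(doc, names):
    """Eagerly parse the named JSON attributes into the bag."""
    for index, element in enumerate(doc.elements):
        for name in names:
            raw = element.get_attribute(name)
            if raw is not None:
                doc.bag[(index, name)] = json.loads(raw)


def store(doc):
    """Write every live value back to its raw attribute."""
    for (index, name), value in doc.bag.items():
        encoded = json.dumps(value, separators=(",", ":"))
        doc.elements[index].attributes[name] = encoded


element = Element({"data-example": '{"a":1}'})
doc = Document([element])
load(doc, ["data-example"])

live = doc.bag[(0, "data-example")]
live["a"] = 2  # mutation happens without notifying the element
before_store = element.get_attribute("data-example")
store(doc)
after_store = element.get_attribute("data-example")
```

Because the mutation of `live` never passes through the element, the raw attribute stays stale until `store` runs, which is why a lazy loader could remove the explicit load step but not the explicit save step.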

Private attributes
The implementation and encoding of structured-value attributes in Parsoid was also influenced by an API decision that the contents of  attributes were to be considered implementation-private. This was enforced at an API level by stripping the  attributes in HTML provided to most clients, and then re-inserting the attributes from separate storage when necessary, keyed by a render ID assigned to the parse. In addition to strictly enforcing the abstraction boundary, this also saved bandwidth on API responses.

Special treatment was also extended to  attributes, which were used by convention to store information "needed by editing clients but not for readers". The idea was that  attributes would also be stripped in content served for read views or for reader clients to save additional bandwidth.

In this context, an additional benefit to storing the structured attributes outside the  (or in a separate element in the  ) was that it allowed API code to efficiently implement these attribute-stripping strategies without requiring node-by-node traversal. In practice this benefit was undercut by the fact that Parsoid's principal client, VisualEditor, used the contents of  attributes, requiring the   attributes to be explicitly reloaded from the separate storage before the HTML was usable.

Since the implementation of separate storage was tied to abstraction boundary design goals for  and   specifically, the   and load/store mechanism was initially implemented only for structured values of these two attributes. Other structured value attributes used the uncached storage mechanism of  and the values returned were not live but had to be explicitly saved with. The Rich Attribute proposal would extend live storage to all structured-value attributes and separate out the policy decision regarding the precise set of structured attributes to be encoded in separate storage (as opposed to inline).

Design decisions
For an  interface in core, two views of the document are provided: an HTML string and a DOM object model.

We have decided that the DOM representation will contain structured data that has been appropriately "loaded" -- that is, operations provided to core that operate on structured values will work immediately on the DOM returned without requiring an explicit load step. An equivalent to  will be provided in core (or more likely, in an HTML library which may also contain parts of Parsoid's   library) which will work on the DOM as returned by. This is consistent with either an eager "load" step occurring after string-form HTML is parsed, or with a lazy load step integrated with the implementation of the structured value API provided to core.

The HTML string provided by  will be the "naive" inline-attribute serialization of the document, not one of the alternate encodings. When converting from DOM to a string, an appropriate "store" step will be performed to serialize the current live values of structured attributes. Private attributes like  will not be stripped from the HTML string. will therefore need to know about "structured value" HTML (again, as an abstraction provided by the HTML library used), but will not need to specially handle  or any other Parsoid-internal attributes.

Serialization to ParserCache
Note that the actual representation stored in the  (ie, the serialized version of the  ) does not need to be the same as the string form of the HTML returned by. Optimized encodings could be utilized to reduce the "lots of escaped double-quotes and angle brackets" issue with the naive inline-attribute representation. The primary performance requirement is that, if policy decides that read view HTML is to be cached for a specific page, that read view HTML be able to be rendered directly from the serialized value stored in the  with minimal additional processing. But read view HTML is not expected to have many (if any) structured-value attributes in it. So long as optimized encodings do not touch the set of attributes present in read views, then read views ought to still be able to be served directly from the  representation. (Perhaps the optimized serialization can include a flag explicitly indicating when the optimized serialization is suitable for directly serving to clients, based on the absence of structured value attributes found.)
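As a rough illustration of the flag idea in the parenthetical above, the following Python sketch stores a marker alongside the cached HTML. The record layout, the key names (`html`, `structured`, `direct`), and both function names are hypothetical; nothing here reflects an existing cache format.

```python
import json


def serialize_for_cache(html, structured_values):
    """Build a cache record; `direct` is set when no structured values exist."""
    record = {
        "html": html,
        "structured": structured_values,
        "direct": not structured_values,
    }
    return json.dumps(record)


def read_view_html(serialized):
    """Return HTML that can be served as-is, or None if more work is needed."""
    record = json.loads(serialized)
    if record["direct"]:
        return record["html"]
    return None  # caller must take the fuller deserialization path


plain = serialize_for_cache("<p>Hello</p>", {})
rich = serialize_for_cache('<p id="n0">Hello</p>', {"n0": {"k": 1}})
```

A read-view request would check the flag first and fall back to full deserialization only when it is not set.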

For edit views, the "inline-attribute" representation matches what the VisualEditor client expects, although currently  is stripped by the API. The visual editor API which provides access to edit-mode HTML can choose to reimplement  stripping for performance/bandwidth reasons, but it is not required. We are already serving content with inline  to some VE clients, so the presence or absence of   should not cause issues.

The precise details of the  serialization of   should as far as possible be hidden from clients, and changes made to the serialization format for performance or efficiency reasons should not affect the DOM model or HTML strings provided to callers.

It is worth noting that the JSON serializer used by  is currently implemented in MediaWiki core. Although probably not strictly required, a JSON codec implemented in an external library would be helpful in ensuring that  is deserialized as an object of the correct type: T346829.

Enumeration of fragments and metadata
In addition to an HTML, wikitext parsing results in a collection of metadata. Historically that metadata was stored in the PageBundle and returned to API clients as JSON, although some portions of the metadata were also returned as HTTP headers in the REST response. The integration of Parsoid with core has eliminated the need for a REST API-focused  structure, and made available the much richer   object to hold metadata generated by parsing. For compatibility with existing calling conventions and the REST API, methods in core exist to convert metadata stored in  objects to "extension data" stored in , and the   interface in Parsoid exists to allow Parsoid to directly write metadata to the   object held by core. We currently accommodate the encoding of structured attributes as a standalone map by embedding that map in the PageBundle, which is then reflected into the  extension data key when the   is stored in a.

The richer variety of metadata represented by  and newly-implemented by Parsoid introduced another issue: instead of one   representing the entire result of the parse, certain pieces of metadata were "HTML strings" and thus logically separate  s generated by the parse. Many of these fragments were stripped HTML of one sort or another (page title, TOC entries) but, for example, the "page indicator" mechanism in core represented an entire wikitext fragment that certainly requires post-processing (localization) and likely requires appropriate representation of structured attributes within the fragment as well. Extension implementations seem to want to store Parsoid-generated document fragments in 's extension data mechanism as well, for later use in a final composition step.

This raises two related questions:


 * Should short HTML fragments of this sort be represented by individual  objects?  If the   objects are separate, is the "owner document" for each fragment unique as well, or are all fragments conceptually part of a single Document?
 * For post-processing passes which want to operate on all Parsoid-generated HTML (for example, user-specific localization), how can all such fragments be located within the  (and its extension data) and enumerated so they can be appropriately transformed?

It's worth noting that similar questions arose in the Parsoid implementation regarding the "owner document" of fragments created internally during parse and that after much work most fragments in Parsoid now share the same owner document (although an awkward Remex API means many of these fragments are created as separate documents that then have to be adopted by the main owner). Unifying the owner documents is not a complete solution to the enumeration question, however, since there exists no DOM API for enumerating all child fragments of a given owner document (and to do so would seem to require weak references at least).

Design decisions
The  data structure will be removed from Parsoid and moved from core's   namespace into the REST API implementation, as a feature of the REST interface design but not a core Parsoid abstraction. The metadata written by Parsoid which is not already reflected by appropriate  properties, such as specific content headers needed by the REST API, will be written by Parsoid directly to   extension data using Parsoid's ContentMetadataCollection interface, either using the existing   key or new keys specific to the particular metadata. The main Parsoid entrypoints will use +  as result types rather than page bundle; this will also avoid a serialization step and allow Parsoid to return its DOM result (with live structured attributes) directly to core. The  code can also be removed from Parsoid, since Parsoid's APIs will now be DOM-based. plus support code for structured attributes will likely be moved to a library, which is probably also a good home for serialization code like.

(Tentatively:) An API will be provided to store and fetch  by ID from   elements in the Document. "Child"  instances will serialize themselves as simply the appropriate ID key, and they will fetch the appropriate   from the parent based on ID when necessary. This will allow storage of s (held by the  ) in extension data or in   fields. Live manipulation of structured data contained within these fragments will then be appropriately loaded and stored by the parent  (held by its own  ). Since these child fragments are part of the main document tree, they can be enumerated and mutated in-place by post-processing passes without explicit knowledge, and structured data attributes within the fragment will be transparently held by the  or other mechanism used by the parent. Inside the HTML library, an API will allow easy creation of a new empty /  tied to the owner document, as well as (for legacy compatibility) creating a new   /  tied to the owner document from an HTML string. Enumerating all fragments for post-processing can be done with ; this can also be exposed as an API helper method.
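The tentative parent/child arrangement can be sketched as follows. This is a Python toy under stated assumptions: `ParentHolder`, `ChildHolder`, the `frag0`-style IDs, and all method names are hypothetical, and plain HTML strings stand in for DOM fragments held in the parent document.

```python
class ParentHolder:
    """Owns every fragment; stands in for the parent holder and its Document."""

    def __init__(self):
        self._fragments = {}  # fragment ID -> fragment HTML (stand-in for DOM)

    def create_fragment(self, html=""):
        """Create a fragment tied to this parent and return a child holder."""
        fragment_id = f"frag{len(self._fragments)}"  # hypothetical ID scheme
        self._fragments[fragment_id] = html
        return ChildHolder(self, fragment_id)

    def fetch(self, fragment_id):
        return self._fragments[fragment_id]

    def transform_all(self, func):
        """Post-processing pass: visit every fragment without knowing its users."""
        for fragment_id in list(self._fragments):
            self._fragments[fragment_id] = func(self._fragments[fragment_id])


class ChildHolder:
    """Holds only an ID; content is fetched from the parent on demand."""

    def __init__(self, parent, fragment_id):
        self._parent = parent
        self.fragment_id = fragment_id

    def serialize(self):
        # The child serializes as nothing more than its ID key.
        return self.fragment_id

    def get_html(self):
        return self._parent.fetch(self.fragment_id)


parent = ParentHolder()
indicator = parent.create_fragment("<span>hello</span>")
title = parent.create_fragment("<i>title</i>")
parent.transform_all(str.upper)  # stand-in for e.g. a localization pass
```

The pass over the parent reaches both fragments even though it has no knowledge of where the child holders are stored, and each child sees the transformed content the next time it fetches.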

The JSON codec for child holders will need to use the codec context to ensure that child  objects are properly relinked to the parent on deserialization. The Slack discussion on T346829 seemed to get hung up on whether these sorts of stateful deserializers should be "discouraged but possible" or whether the JSON codec wanted to explicitly prohibit anything but value objects. If serialization is restricted to simple value objects, then the  and   methods need to include a parent object (ParserOutput, parent HtmlHolder, etc.) as an explicit parameter (or as an explicit parameter of a similar-but-not-identical   class) so that the child holder can be relinked to the parent Document after deserialization. (Perhaps even  return a "child" , with the contents of the   tag a special case, and the full parent document is stored elsewhere.  This makes all  s "children" with references to a special  ; i.e., the parent is the exception, not the child.)
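A minimal Python sketch of the "explicit parent parameter" variant follows. The class names, the `fragmentId` key, and the `to_json` / `from_json` method names are hypothetical and chosen only for illustration; they are not the methods of core's JSON codec.

```python
import json


class Parent:
    """Stand-in for the parent holder that owns all fragment content."""

    def __init__(self, fragments):
        self.fragments = dict(fragments)


class Child:
    """Value-like child holder: its serialized form is just an ID."""

    def __init__(self, parent, fragment_id):
        self.parent = parent
        self.fragment_id = fragment_id

    def to_json(self):
        # No reference to the parent is serialized.
        return {"fragmentId": self.fragment_id}

    @classmethod
    def from_json(cls, data, parent):
        # The parent is supplied explicitly so the child can be relinked.
        return cls(parent, data["fragmentId"])

    def get_html(self):
        return self.parent.fragments[self.fragment_id]


parent = Parent({"frag0": "<span>indicator</span>"})
wire = json.dumps(Child(parent, "frag0").to_json())
restored = Child.from_json(json.loads(wire), parent)
```

The serialized form stays a plain value object, and relinking is the caller's responsibility; a codec-context design would instead supply `parent` implicitly during deserialization.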

Note that the initial steps in the Rich Attributes proposal (before proposal 3) require the caller to provide the context type for the deserializer, which is a different design than the JsonCodec used in core, which uses the more typical design where the serialized object contains its own type marker.

I believe that the parsing model for inside  allows serialization of fragments in an appropriate way. and  tags are processed using the "in head" insertion mode, but this seems to match how they are processed "in body". If there is some issue it may be necessary to add another wrapper element inside the  (like a   tag) to reset the parsing mode so we're not "in template" and   etc tags are parsed properly.