Parsoid/OutputTransform/HtmlHolder

When Parsoid (and Parsoid-aware transforms) hold a DOM, there are two important features/extensions which an "HtmlHolder" interface in core needs to be aware of. The first is structured and private attributes within the Document, and the second is the representation of standalone document fragments.

Structured attributes and the DataBag
The MediaWiki DOM Spec contains a large number of "JSON-valued" attributes, to express structured values in HTML attributes in a compact and bandwidth-friendly way. Parsoid implements support for this primarily in the  class, with the   method, which returns a structured value: in the PHP implementation an associative array; formerly in the JavaScript implementation a JS Object. This value is nominally serialized in "plain old HTML" as the JSON-encoded value of the array/object, but is stored "live": that is, it is not parsed and re-serialized to a string attribute value every time the attribute is read or modified, but instead is kept as a live array value attached to the DOM Node. References can be kept to the live value, and when it is mutated the change is immediately visible to anyone else who holds a reference to the value.
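The difference between a "live" value and its serialized string form can be illustrated with a minimal Python sketch. This is an analogy only, not Parsoid's code: the `Node` class, the `get_json_attribute` helper, and the attribute payload are all hypothetical names and data chosen for illustration.

```python
import json

class Node:
    """Toy element: string attributes plus a side table of live structured values."""
    def __init__(self):
        self.attrs = {}   # serialized (string) attribute values
        self.live = {}    # parsed values, handed out by reference

def get_json_attribute(node, name):
    """Hypothetical accessor: parse the attribute once, then return the same live object."""
    if name not in node.live:
        node.live[name] = json.loads(node.attrs.get(name, "{}"))
    return node.live[name]

node = Node()
node.attrs["data-parsoid"] = '{"k": 1}'   # made-up payload

a = get_json_attribute(node, "data-parsoid")
b = get_json_attribute(node, "data-parsoid")
a["extra"] = "x"                     # mutate through one reference

shared = b["extra"]                  # the other reference sees the change at once
stale = node.attrs["data-parsoid"]   # the string attribute has not been re-serialized
```

Here `a` and `b` are the same object, so a mutation through either is visible through both, while the string attribute keeps its old text until something explicitly re-serializes it.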

The actual implementation is a bit baroque -- in addition to a proposal ("Rich attributes") to extend this basic mechanism to include DOM Fragments, there are multiple different serialization formats for these structured values. The naive "as a JSON-encoded string" version we call "inline attributes". It suffers from a perceived "ugliness" problem, since inside a quoted HTML attribute value all quotes must be escaped, and JSON-encoded values contain a large number of quotation marks. There's a separate but orthogonal issue with exposing "private" attributes in this naive serialization, discussed below. Due to these two deficiencies, Parsoid has historically supported two additional alternative encodings of structured attributes. By adding a unique id attribute to every node, the values of structured attributes can be hoisted out of the HTML and stored as a mapping from ID to attribute value. In one encoding this map is kept as a separate JSON-encoded blob alongside the HTML; the combination of JSON blob and HTML is a PageBundle (page bundles have further uses described below). In another representation the combination is kept as a single HTML document, but the JSON-encoded map is stored in a  element in the   of the Document. This reduces the bloat caused by encoding all the quotation marks in the structured attributes, but adds additional bandwidth to record ID attributes on every node and additionally to include those ID values in the key portion of the map in the.
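As a rough illustration of the trade-off between the inline encoding and the hoisted encodings, the following Python sketch builds both forms for one node. The structured value, the node ID `n0`, and the exact layout of the ID-to-value map are assumptions made for the example; they do not reproduce Parsoid's real formats.

```python
import html
import json

value = {"a": "x", "b": ["y", "z"]}   # made-up structured value

# Inline encoding: the JSON text is placed in a quoted HTML attribute,
# so every '"' in the JSON has to be written as '&quot;'.
inline = '<p data-parsoid="%s">hi</p>' % html.escape(json.dumps(value), quote=True)

# Hoisted encoding: the node carries only an id, and the value moves into
# an id-to-value map that is serialized as ordinary JSON (no HTML escaping).
hoisted_html = '<p id="n0">hi</p>'
id_map = {"n0": {"data-parsoid": value}}   # assumed map layout
blob = json.dumps(id_map)

escaped_quotes = inline.count("&quot;")    # one per quotation mark in the JSON
```

The inline form pays six characters for each quotation mark, while the hoisted form pays for an `id` attribute on the node plus the same ID repeated as a key in the map.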

This id-to-attribute map is also used internally to the implementation: instead of hanging the rich attribute values directly off of the DOM Node, in the PHP implementation the ID-to-value map is stored in a DataBag which is attached to the root Document object. This is because the existing PHP implementation of the DOM uses ephemeral PHP objects to wrap the "actual" representation of the Node implemented by the libxml library. Those ephemeral PHP objects are created and destroyed every time a reference to the Node goes into or out of scope. When the ephemeral PHP wrapper goes out of scope, any data attached to the Node is destroyed. By keeping a persistent reference to the (wrapper of the) main Document object in Parsoid's Env class, we can prevent the DataBag from being destroyed. (We could also just keep an explicit reference to the DataBag.)
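The reason for keeping values in a bag on the Document rather than on the node wrapper can be sketched as follows. This is a Python analogy of the PHP/libxml behaviour described above, with hypothetical class and function names; it only models the idea that a fresh wrapper is produced for every lookup.

```python
class Document:
    def __init__(self):
        self.raw_nodes = {}   # stands in for the underlying node storage
        self.bag = {}         # the "DataBag": node id -> live structured value

class NodeWrapper:
    """Ephemeral wrapper: a new one is created for every lookup."""
    def __init__(self, doc, node_id):
        self.doc = doc
        self.node_id = node_id

def get_node(doc, node_id):
    # Always returns a fresh wrapper around the same underlying node.
    return NodeWrapper(doc, node_id)

doc = Document()
doc.raw_nodes["n0"] = {}

w1 = get_node(doc, "n0")
w1.scratch = {"lost": True}             # attached to the wrapper only
doc.bag[w1.node_id] = {"kept": True}    # stored in the bag on the document
del w1                                  # the wrapper goes out of scope

w2 = get_node(doc, "n0")
wrapper_data_survived = hasattr(w2, "scratch")
bag_data = doc.bag[w2.node_id]
```

Data hung off the wrapper disappears with the wrapper, while data keyed by node ID in the bag survives for as long as the document itself is referenced.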

Parsoid contains a "load" mechanism that runs after DOM parsing which loads structured-valued attributes into the DataBag, implemented in, and a corresponding "store" mechanism in. In our implementation, changes made to the live object stored in the DataBag are not reflected in the raw attribute value visible via Element::getAttribute until a "store" is done, and similarly several methods based on structured attributes, like, will not work correctly until a "load" is done. The "eager" loading mechanism could be replaced by a "lazy" loader which does not locate and load structured values (whether from inline attributes or a map) until they are requested. This could eliminate the need for an explicit "load" step, but since the values are live and can be mutated without notification to the DOM layer, an explicit "store" step will always be necessary to ensure the serialized DOM reflects the latest values for structured-value attributes.
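The load/store cycle can be sketched in a few lines of Python. The dictionary-based "DOM", the `load` and `store` functions, and the `STRUCTURED` tuple are assumptions made for the example and are not Parsoid's actual implementation.

```python
import json

STRUCTURED = ("data-parsoid",)   # assumed set of JSON-valued attribute names

def load(nodes, bag):
    """Eager load: parse every structured attribute into the bag."""
    for node_id, attrs in nodes.items():
        for name in STRUCTURED:
            if name in attrs:
                bag[(node_id, name)] = json.loads(attrs[name])

def store(nodes, bag):
    """Store: write every live value back as a JSON string attribute."""
    for (node_id, name), value in bag.items():
        nodes[node_id][name] = json.dumps(value, separators=(",", ":"))

nodes = {"n0": {"data-parsoid": '{"k":1}'}}   # toy document: node id -> attributes
bag = {}

load(nodes, bag)
bag[("n0", "data-parsoid")]["k"] = 2      # mutate the live value
before = nodes["n0"]["data-parsoid"]      # raw attribute still holds the old text
store(nodes, bag)
after = nodes["n0"]["data-parsoid"]       # raw attribute now reflects the mutation
```

A lazy variant would move the `json.loads` call into the accessor, but `store` would still be needed because nothing notifies the DOM layer when a live value changes.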

Private attributes
The implementation and encoding of structured-value attributes in Parsoid was also influenced by an API decision that the contents of  attributes were to be considered implementation-private. This was enforced at an API level by stripping the  attributes in values provided to most clients, and then re-inserting them from separate storage (keyed by a render ID) when necessary. In addition to strictly enforcing the abstraction boundary, this also saved bandwidth on API responses.
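A minimal Python sketch of the strip-and-reinsert pattern follows, assuming nodes are addressable by ID and that `data-parsoid` is the private attribute. The function names, the `storage` dictionary, and the render ID string are hypothetical.

```python
import copy

PRIVATE = "data-parsoid"

def strip_private(nodes, storage, render_id):
    """Return a copy without the private attribute; park removed values under render_id."""
    public = copy.deepcopy(nodes)
    removed = {}
    for node_id, attrs in public.items():
        if PRIVATE in attrs:
            removed[node_id] = attrs.pop(PRIVATE)
    storage[render_id] = removed
    return public

def restore_private(nodes, storage, render_id):
    """Re-insert the private values recorded for this render."""
    for node_id, value in storage[render_id].items():
        nodes[node_id][PRIVATE] = value
    return nodes

storage = {}
original = {
    "n0": {"class": "a", "data-parsoid": '{"k":1}'},
    "n1": {"class": "b"},
}
public = strip_private(original, storage, "render-1")
roundtrip = restore_private(copy.deepcopy(public), storage, "render-1")
```

The client only ever sees `public`; the private values can be merged back later as long as the node IDs and the render ID are still known.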

The naming convention extended to  attributes, which were used to store information "needed by editing clients but not for readers". The idea was that  attributes would also be stripped in content served for read views or for reader clients to save additional bandwidth.

In this context, an additional benefit to storing the structured attributes outside the Document (or in a separate element in the ) was that it allowed API code to efficiently implement these attribute-stripping strategies by dropping the elements. In practice this benefit was undercut by the fact that Parsoid's principal client, VisualEditor, wanted  attributes, so the   attributes needed to be explicitly reloaded from the separate storage before the HTML was usable by VE.

Since the implementation of separate storage was tied to abstraction boundary design goals for  and   specifically, the DataBag mechanism and load/store mechanism were initially implemented only for structured values of these two attributes. Other structured value attributes used the uncached storage mechanism of  and the values returned were not live but had to be explicitly saved with. The Rich Attribute proposal would extend live storage to all structured attributes, while keeping orthogonal the design decision regarding the precise set of structured attributes which would be encoded in separate storage (as opposed to inline).

Design decisions
For an  interface in core, two views of the document are provided: an HTML string and a DOM.

We have decided that the DOM representation will contain structured data that has been appropriately "loaded" -- that is, operations provided to core that operate on structured values will work immediately on the DOM returned without requiring an explicit load step. An equivalent to  will be provided in core (or more likely, in a "Rich HTML" library which may also contain parts of Parsoid's DOMCompat library) which will work on the DOM as returned by HtmlHolder. This is consistent with either an eager "load" step occurring after string-form HTML is parsed, or with a lazy load step integrated with the implementation of the structured value API provided to core.

The HTML string provided by  will be the "naive" inline-attribute serialization of the document, not one of the alternate encodings. When converting from DOM to a string, an appropriate "store" step will be performed to serialize the current live values of structured attributes. Private attributes like  will not be stripped from the HTML string.

Serialization to ParserCache
Note that the actual representation stored in the ParserCache (i.e., the serialized version of the HtmlHolder) does not need to be the same as the string form of the HTML returned by HtmlHolder. Optimized encodings could be utilized to reduce the "lots of escaped double-quotes" issue with the naive inline-attribute representation. The primary performance requirement is that, if read view HTML is cached, it can be rendered directly from the value stored in the ParserCache with minimal additional processing. But read view HTML is not expected to have many (if any) structured-valued attributes in it. So long as optimized encodings do not touch the set of attributes used by read views, read views ought to still be able to be served directly from the ParserCache representation.

For edit views, the "inline-attribute" representation matches what the VisualEditor client expects, although currently data-parsoid is stripped by the API. The VisualEditor API which provides access to edit-mode HTML can choose to reimplement data-parsoid stripping for performance/bandwidth reasons, but it is not required.

The precise details of the ParserCache serialization should as far as possible be hidden from clients, and changes made to the serialization format for performance or efficiency reasons should not affect the DOM or HTML strings provided to callers.

Enumeration of fragments and metadata
In addition to an HTML Document, wikitext parsing results in a collection of metadata. Historically that metadata was stored in the PageBundle and returned to API clients as JSON, although some portions of the metadata were also returned as HTTP headers in the REST response. The integration of Parsoid with core has eliminated the need for a REST API-focused PageBundle structure, and made available the much richer ParserOutput objects to hold metadata generated by parsing. For compatibility with existing calling conventions and the REST API, methods in core exist to convert metadata stored in PageBundles to "extension data" stored in ParserOutput, and the ContentMetadataCollector interface in Parsoid exists to allow Parsoid to directly write metadata to the ParserOutput object held by core. We currently accommodate storage of structured attributes in the PageBundle, reflecting that into extension data attributes as well.

The richer variety of metadata stored in the ParserOutput and newly-implemented by Parsoid introduced another issue: instead of one HTML Document representing the entire result of the parse, certain pieces of metadata were "HTML strings" and thus logically separate DocumentFragments. Many of these were stripped HTML of one sort or another (page title, TOC entries), but the "page indicator" mechanism in core represented an entire wikitext fragment that certainly requires postprocessing (localization) and likely requires appropriate representation of structured attributes in the fragment as well. Extension implementations seem to want to store Parsoid-generated HTML fragments in the ParserOutput's extension data mechanism as well.

This raises two related questions:


 * Should short HTML fragments of this sort be represented by individual HtmlHolder objects? If the HtmlHolder objects are separate, is the "owner document" for each fragment unique as well, or are they conceptually part of a single Document?
 * For postprocessing passes which want to operate on all Parsoid-generated HTML (for example, user-specific localization), how can such fragments be located within the ParserOutput (and extension data) and enumerated so they can be appropriately transformed?

It's worth noting that similar questions arose in the Parsoid implementation regarding the "owner document" of fragments created internally during parse, and that, after considerable work, most fragments in Parsoid now share the same owner document (although an awkward Remex API means many of these fragments are created as separate documents that then have to be adopted by the main owner). This is not a complete solution to the enumeration question, however, since there exists no DOM API for enumerating all child fragments of a given owner document (and to do so would seem to require weak references at least).
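To make the weak-reference remark concrete, here is a hypothetical Python sketch of an owner document that can enumerate its detached fragments without keeping them alive. No such DOM API exists; the `OwnerDocument` and `Fragment` classes are invented purely to show why weak references would be needed.

```python
import gc
import weakref

class OwnerDocument:
    def __init__(self):
        # Weak references only: the registry must not keep dead fragments alive.
        self._fragments = weakref.WeakSet()

    def live_fragments(self):
        """Labels of all fragments that are still referenced somewhere."""
        return sorted(f.label for f in self._fragments)

class Fragment:
    def __init__(self, owner, label):
        self.owner = owner
        self.label = label
        owner._fragments.add(self)   # register with the owner document

doc = OwnerDocument()
title = Fragment(doc, "title")
indicator = Fragment(doc, "indicator")
both = doc.live_fragments()

del indicator    # drop the last strong reference to this fragment
gc.collect()     # force collection so the example is deterministic
remaining = doc.live_fragments()
```

With strong references in the registry, every fragment ever created would be retained for the lifetime of the document, which is the cost the weak set avoids.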

Design decisions
The PageBundle data structure will be removed from Parsoid and core's parser namespace and moved to the REST API implementation, as a feature of the REST interface design but not a core Parsoid abstraction. The metadata written by Parsoid which is not already reflected by appropriate ParserOutput properties, such as specific content headers needed by the REST API, will be written by Parsoid directly to ParserOutput extension data using Parsoid's ContentMetadataCollector interface. The main Parsoid entrypoints will use HtmlHolder+ContentMetadataCollector rather than PageBundle; this will also avoid a serialization step and allow Parsoid to return its DOM result (with live structured attributes) directly to core. The XmlSerializer code can also be removed from Parsoid, since Parsoid's APIs will now be DOM-based. (HtmlHolder plus support code for structured attributes will likely be moved to a library, which is probably also a good home for serialization code.)

(Tentatively:) A library will be provided to store and fetch  by ID from   elements in the Document. "Child" HtmlHolder instances will serialize themselves as simply the 'ID' key, and fetch the appropriate DocumentFragment from the parent based on ID where necessary. This will allow storage of DocumentFragments (held by the HtmlHolder) in extension data or in ParserOutput fields, with live manipulation of structured data contained within them which is appropriately loaded and stored by the parent Document (held by its own HtmlHolder). Since these child fragments are part of the main document tree, they can be enumerated and mutated in-place by post-processing passes without explicit knowledge, and structured data attributes within the fragment will be transparently held by the DataBag or other mechanism used by the parent.
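The tentative parent/child arrangement might look roughly like the Python sketch below. All names (`ParentHolder`, `ChildHolder`, `store_fragment`, `fetch_fragment`) and the list-of-dicts fragment representation are hypothetical; the only details taken from the text are that a child serializes as its `ID` key and fetches its fragment from the parent.

```python
import json

class ParentHolder:
    """Toy parent holder: owns the main document and its stored fragments."""
    def __init__(self):
        self.fragments = {}   # fragment id -> fragment content (a list of toy nodes)

    def store_fragment(self, frag_id, fragment):
        self.fragments[frag_id] = fragment
        return ChildHolder(self, frag_id)

    def fetch_fragment(self, frag_id):
        return self.fragments[frag_id]

class ChildHolder:
    """Child holder: carries only an ID and resolves it against the parent."""
    def __init__(self, parent, frag_id):
        self.parent = parent
        self.frag_id = frag_id

    def fragment(self):
        return self.parent.fetch_fragment(self.frag_id)

    def to_json(self):
        return json.dumps({"ID": self.frag_id})

    @classmethod
    def from_json(cls, parent, text):
        return cls(parent, json.loads(text)["ID"])

parent = ParentHolder()
child = parent.store_fragment("frag1", [{"tag": "span", "text": "hi"}])

serialized = child.to_json()                        # what extension data would store
revived = ChildHolder.from_json(parent, serialized)

# A post-processing pass walks every fragment via the parent, with no
# knowledge of which ParserOutput field or extension refers to it.
for frag in parent.fragments.values():
    for node in frag:
        node["text"] = node["text"].upper()

seen_by_child = revived.fragment()[0]["text"]
```

Because the fragment lives in the parent, the change made by the pass is visible through the revived child without any separate load or store of the child itself.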