Parsoid/OutputTransform/HtmlHolder

When Parsoid (and Parsoid-aware transforms) hold a DOM, there are three important features/extensions to be aware of. An "HtmlHolder" interface in core needs to account for each of them.

Structured / JSON-valued / Rich attributes and the DataBag
The MediaWiki DOM Spec contains a large number of JSON-valued attributes, which express structured values in HTML attributes in a compact and bandwidth-friendly way. Parsoid implements support for this primarily in the DOMDataUtils class, with the getJSONAttribute method, which returns a structured value: an associative array in the PHP implementation, and formerly a JS Object in the JavaScript implementation. In "plain old HTML" this value is serialized as the JSON encoding of the array/object, but it is stored "live": it is not parsed and reserialized to a string attribute value every time the attribute is read or modified, but is instead kept as a live array value attached to the DOM Node.
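As a rough illustration of the difference between the live value and its "plain old HTML" form, the following sketch serializes a structured value into a quoted attribute and parses it back. It is written in Python for brevity rather than in Parsoid's PHP; the helper names and the sample value are hypothetical and are not part of Parsoid.

```python
import html
import json

# Hypothetical structured value; the keys are illustrative only.
live_value = {"parts": [{"template": {"target": {"wt": "echo"}}}]}


def to_inline_attribute(name, value):
    """Serialize a live value as a quoted HTML attribute holding JSON."""
    encoded = json.dumps(value, separators=(",", ":"))
    # Every double quote in the JSON must be escaped inside the attribute.
    return f'{name}="{html.escape(encoded, quote=True)}"'


def from_inline_attribute(raw):
    """Parse a raw (still escaped) attribute value back into a live value."""
    return json.loads(html.unescape(raw))


attr = to_inline_attribute("data-mw", live_value)
raw_value = attr[len('data-mw="'):-1]
round_tripped = from_inline_attribute(raw_value)
```

Keeping the value live means this parse and escape cycle happens once on load and once on save, not on every read or write of the attribute.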

The actual implementation is a bit baroque: in addition to a proposal to extend this basic mechanism to include DOM Fragments, there are multiple different serialization formats for these values. The naive "as a JSON-encoded string" version we call "inline attributes". It suffers from a perceived "ugliness" problem, since inside a quoted HTML attribute value all quotes must be escaped, and JSON-encoded values contain a large number of quotation marks. There's a separate, orthogonal issue with "private" attributes, discussed below.

As a result, Parsoid has historically supported two different encodings of these structured attributes. By adding a unique id attribute to every node, the values of these attributes can be hoisted out of the HTML and stored as a mapping from ID to attribute value. In one encoding this map is kept as a separate JSON-encoded blob alongside the HTML; the combination of JSON blob and HTML is a "Page Bundle" (but see below). In the other encoding the combination is kept as a single document, with the JSON-encoded map stored in an element inside the Document itself. This trades the bloat of escaping all the quotation marks in the structured attributes against the added bandwidth needed to add ID attributes to every node and to repeat those ID values in the key portion of the map.
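The hoisting step can be sketched as follows. This is a minimal Python illustration, not Parsoid's code: the "mw" plus counter id scheme, the function name, and the shape of the resulting bundle dictionary are all assumptions, and xml.dom.minidom stands in for a real HTML parser (so the input must be well-formed XML). For brevity it only assigns ids to elements that actually carry the attribute.

```python
import json
from xml.dom import minidom


def hoist_attributes(doc, attr_name):
    """Move JSON-valued attr_name off each element into an id-keyed map."""
    elements = doc.getElementsByTagName("*")
    used_ids = {el.getAttribute("id") for el in elements if el.hasAttribute("id")}
    hoisted = {}
    counter = 0
    for el in elements:
        if not el.hasAttribute(attr_name):
            continue
        if not el.hasAttribute("id"):
            # Hypothetical id scheme: "mw" plus a counter, skipping ids in use.
            while f"mw{counter}" in used_ids:
                counter += 1
            el.setAttribute("id", f"mw{counter}")
            used_ids.add(f"mw{counter}")
        hoisted[el.getAttribute("id")] = json.loads(el.getAttribute(attr_name))
        el.removeAttribute(attr_name)
    return hoisted


source = (
    "<div>"
    "<p data-parsoid='{\"dsr\":[0,5]}'>Hello</p>"
    "<p id=\"intro\" data-parsoid='{\"dsr\":[6,11]}'>World</p>"
    "</div>"
)
doc = minidom.parseString(source)
id_map = hoist_attributes(doc, "data-parsoid")

# Hypothetical bundle shape: the stripped HTML plus the JSON-encoded map.
bundle = {"html": doc.documentElement.toxml(), "map": json.dumps(id_map)}
```

The map no longer needs any quote escaping, but every hoisted node now pays for an id attribute in the HTML and for the same id again as a key in the map.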

This id-to-attribute map is also used internally by the implementation: instead of hanging the rich attribute values directly off of the DOM Node, the PHP implementation stores the ID-to-value map in a DataBag which is attached to the root Document object. This is because the existing PHP implementation of the DOM uses ephemeral PHP objects to wrap the "actual" representation of the Node implemented by the libxml library. Those ephemeral PHP objects are created and destroyed every time a reference to the Node goes into or out of scope. When the ephemeral PHP wrapper goes out of scope, any data attached to it is destroyed along with it. By keeping a persistent reference to the (wrapper of the) main Document object in Parsoid's Env class, we can prevent the DataBag from being destroyed. (Or maybe we just need to keep a reference to the DataBag.)
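The pattern can be sketched like this: the node carries only a plain string attribute holding an id, while the live value sits in a bag owned by a long-lived object. This is a Python illustration of the idea only. The class shapes, the data-bag-id attribute name, and the helper functions are hypothetical, and Python's minidom nodes are not ephemeral the way PHP's wrappers are, so the sketch shows the indirection rather than the lifetime problem itself.

```python
from xml.dom import minidom

BAG_ATTR = "data-bag-id"  # hypothetical attribute name linking a node to the bag


class DataBag:
    """Id-to-value store owned by the document's holder, not by any node."""

    def __init__(self):
        self._values = {}
        self._next_id = 0

    def stash(self, value):
        bag_id = self._next_id
        self._next_id += 1
        self._values[bag_id] = value
        return bag_id

    def get(self, bag_id):
        return self._values[bag_id]


class Env:
    """Hypothetical stand-in for the long-lived owner of document and bag."""

    def __init__(self, doc):
        self.doc = doc
        self.bag = DataBag()


def set_node_data(env, node, value):
    # Only the id string lives on the node, so it survives wrapper churn.
    node.setAttribute(BAG_ATTR, str(env.bag.stash(value)))


def get_node_data(env, node):
    return env.bag.get(int(node.getAttribute(BAG_ATTR)))


env = Env(minidom.parseString("<p>Hello</p>"))
set_node_data(env, env.doc.documentElement, {"dsr": [0, 5]})

# A later lookup reaches the same live object through the id on the node.
node = env.doc.getElementsByTagName("p")[0]
get_node_data(env, node)["dsr"][1] = 7
```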

Parsoid contains a "load" mechanism that runs after DOM parsing and loads structured-valued attributes into the DataBag.

For an HtmlHolder interface in core, two views of the document are provided: an HTML string and a DOM. We have decided that the HTML string provided will be the "naive" inline-attribute version of the document, and that the DOM representation will contain structured data that has been appropriately "loaded". An equivalent to DOMDataUtils::getJSONAttribute will be provided in core (or more likely, in a "Rich HTML" library which may also contain parts of Parsoid's DOMCompat library) which will work on the DOM as returned by HtmlHolder without any explicit need to further "load" the document. Similarly, serialization to an HTML string will perform the necessary "save" step to ensure that the live values held for structured attributes are appropriately serialized as inline attributes.
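To make the two views concrete, here is a minimal sketch of what such a holder could look like. It is an assumption-laden illustration in Python, not the proposed core interface: the class and method names, the data-live-id marker attribute, and the choice of rich attribute names are hypothetical, and xml.dom.minidom stands in for a real HTML parser.

```python
import json
from xml.dom import minidom

RICH_ATTRS = ("data-mw", "data-parsoid")  # assumed set of JSON-valued attributes
MARKER = "data-live-id"  # hypothetical marker attribute, never serialized


class HtmlHolder:
    """Hypothetical two-view holder: inline-attribute string and loaded DOM."""

    def __init__(self, html):
        self._html = html
        self._doc = None
        self._live = {}  # marker id -> {attribute name: live value}

    def get_dom(self):
        """Parse lazily and "load" rich attributes into live values."""
        if self._doc is None:
            self._doc = minidom.parseString(self._html)
            for n, el in enumerate(self._doc.getElementsByTagName("*")):
                values = {}
                for name in RICH_ATTRS:
                    if el.hasAttribute(name):
                        values[name] = json.loads(el.getAttribute(name))
                        el.removeAttribute(name)
                if values:
                    el.setAttribute(MARKER, str(n))
                    self._live[str(n)] = values
        return self._doc

    def get_json_attribute(self, el, name):
        """Return the live value, or None if the element has no such attribute."""
        return self._live.get(el.getAttribute(MARKER), {}).get(name)

    def get_html(self):
        """Perform the "save" step on a copy of the tree, then serialize it."""
        if self._doc is None:
            return self._html
        root = self._doc.documentElement.cloneNode(True)
        for el in [root, *root.getElementsByTagName("*")]:
            if el.hasAttribute(MARKER):
                for name, value in self._live[el.getAttribute(MARKER)].items():
                    el.setAttribute(name, json.dumps(value, separators=(",", ":")))
                el.removeAttribute(MARKER)
        return root.toxml()


holder = HtmlHolder('<p data-mw=\'{"parts":["a"]}\'>Hello</p>')
p = holder.get_dom().documentElement
holder.get_json_attribute(p, "data-mw")["parts"].append("b")
html_out = holder.get_html()
```

The caller never invokes "load" or "save" directly: asking for the DOM triggers the load, and asking for the string triggers the save, so an in-place change to a live value shows up in the next serialization.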

Note that the actual representation stored in the ParserCache (i.e., the serialized version of the HtmlHolder) does not need to be the same as the string form of the HTML returned by HtmlHolder. (TODO: describe tradeoffs.)

Private attributes
data-parsoid is private. That is enforced at the API level by stripping the data-parsoid attribute. One of the benefits of storing structured data outside the document as a separate map is that clients can simply be given the stripped document without the map, so the data-parsoid values are never exposed to them.
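For the inline-attribute form, the stripping step amounts to something like the sketch below. It is a hypothetical Python illustration, not the API's actual code: the function name is invented, and xml.dom.minidom stands in for a real HTML parser.

```python
from xml.dom import minidom

PRIVATE_ATTRS = ("data-parsoid",)


def strip_private(html):
    """Return the markup with private attributes removed from every element."""
    doc = minidom.parseString(html)
    for el in doc.getElementsByTagName("*"):
        for name in PRIVATE_ATTRS:
            if el.hasAttribute(name):
                el.removeAttribute(name)
    return doc.documentElement.toxml()


public_html = strip_private(
    "<p data-parsoid='{\"dsr\":[0,5]}' data-mw='{\"parts\":[]}'>Hi</p>"
)
```

With the hoisted encoding no such walk over the tree is needed, because the private values were never in the HTML to begin with.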