Parsoid/MediaWiki DOM spec/Element IDs

See bug 52936 for the implementation status.

Problem statement

 * We want to associate arbitrary external metadata with elements in a document. Examples: data-parsoid, data-mw, authorship maps, annotations.
 * We want to preserve this metadata across revisions
 * We want to preserve this metadata when copy & pasting:
   * Within a page in VE
   * Between different VE instances in the same or different wikis
   * Between a read-only view and VE on the same or different wikis

Add a page-unique id attribute to each node
User-supplied IDs will be used directly as long as they are unique. IDs with the mw- or mwe- prefix, as well as a blacklist of UI ids, are disallowed. These ids will be stable across revisions and can be used as fragment identifiers.

Rationale:
 * document.getElementById is very efficient
 * also supported in SVG and MathML
 * stable ids simplify move detection in the HTML diff algorithm
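The assignment rules above can be sketched as follows. This is a minimal illustration: the helper name, the generated-id format, and the contents of the UI-id blacklist are assumptions, not part of the spec.

```python
import itertools

RESERVED_PREFIXES = ("mw-", "mwe-")
UI_ID_BLACKLIST = {"content", "bodyContent", "firstHeading"}  # hypothetical examples

def assign_id(user_id, used_ids):
    """Return a page-unique id: keep the user-supplied id when it is
    unique and not reserved, otherwise generate a fresh one."""
    if (user_id
            and user_id not in used_ids
            and user_id not in UI_ID_BLACKLIST
            and not user_id.startswith(RESERVED_PREFIXES)):
        used_ids.add(user_id)
        return user_id
    # Fall back to a generated id in the reserved mw- namespace, which
    # by the rules above can never clash with a valid user-supplied id.
    for n in itertools.count():
        candidate = "mw-node-%d" % n
        if candidate not in used_ids:
            used_ids.add(candidate)
            return candidate
```

Generating fallback ids inside the reserved prefix space is what makes the "user ids win when unique" rule safe: the two namespaces cannot collide.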

On copy or view page load, add a revision URL attribute to at least one element
Basically, make sure that copied content has at least one element carrying the revision URL. On paste into an editor, the specified revision URL can be used to request additional metadata from the API, which returns a JSON object keyed on id.

Adding the revision URL can happen on the server (it will compress well), but ideally we'd add it only when needed on copy to keep the DOM clean.
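A paste handler along these lines could use the revision URL. Everything concrete here is an assumption for illustration: the `data-mw-revision` attribute name, the `/metadata` endpoint shape, and the injected `get_json` helper are not defined by the spec.

```python
import re

# Hypothetical attribute carrying the source revision URL on copied content.
REV_URL_ATTR = "data-mw-revision"

def extract_revision_url(clipboard_html):
    """Find the first revision URL in pasted HTML. A regex sketch;
    a real implementation would walk the parsed DOM instead."""
    m = re.search(r'%s="([^"]+)"' % REV_URL_ATTR, clipboard_html)
    return m.group(1) if m else None

def fetch_metadata(revision_url, ids, get_json):
    """Request metadata for the given element ids; per the spec, the
    API returns a JSON object keyed on id."""
    return get_json(revision_url + "/metadata", params={"ids": ",".join(ids)})
```

Injecting `get_json` keeps the sketch transport-agnostic; a browser client would use `fetch`/XHR against the wiki's API instead.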

Preventing incorrect metadata being returned on copy-paste
Consider a page P on a wiki, and an element E on it with id x. Now create a non-wiki page P' containing an element E' with the same id x; E' and E have the same node name, but their subtrees are entirely different. Copy E' off page P' and paste it into P. The client notices id x and issues a metadata request to the API. Without any additional information, the server is none the wiser and returns the metadata. However, that metadata should not be used, since E' and E are not the same subtree. Using it can corrupt serialization, which then parses to different HTML than what was saved.

To prevent this scenario, the client could send a signature (an md5 hash of the outerHTML, for example) of the DOM subtree of the element for which it is requesting metadata. The server could then verify that the element for which the metadata is being returned is identical to the one for which it was generated and saved. Obviously, this signature is unnecessary in the common case where the client requests metadata for elements it trusts (for example, when the HTML came from a trusted server where such corruption is not expected, as is true for initial HTML loaded from a Parsoid server). But all bets are off when content is copied from arbitrary locations.
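The signature check above can be sketched as follows; the metadata store layout and function names are assumptions, and a real implementation would normalize the serialization before hashing.

```python
import hashlib

def subtree_signature(outer_html):
    """md5 hash of an element's outerHTML, as suggested above."""
    return hashlib.md5(outer_html.encode("utf-8")).hexdigest()

def metadata_for(element_id, client_signature, store):
    """Server side: return stored metadata only when the client's
    subtree signature matches the one recorded at save time."""
    entry = store.get(element_id)
    if entry is None or entry["signature"] != client_signature:
        return None  # unknown id, or a different subtree reusing the id
    return entry["metadata"]
```

With this check, the pasted E' from the scenario above hashes differently from E, so the server withholds E's metadata instead of silently corrupting the paste.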

Handling ID conflicts / changes on edit
When an existing id is changed in the editor (the user sets a new id, or content is duplicated), a data-previd attribute is added that records the node's previous id. The storage backend can then use it to update associated metadata by renaming or duplicating existing entries.
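The backend update driven by data-previd might look like the following sketch; modeling the metadata store as a plain dict keyed on id is an assumption for illustration.

```python
def apply_previd(metadata, new_id, prev_id, duplicated):
    """Update a metadata map keyed on id after an element's id changed.
    If the content was duplicated, copy the entry so both elements keep
    their metadata; otherwise it is a plain rename."""
    if prev_id not in metadata:
        return  # nothing recorded for the old id
    if duplicated:
        metadata[new_id] = dict(metadata[prev_id])  # keep both copies
    else:
        metadata[new_id] = metadata.pop(prev_id)    # plain rename
```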

Implementation issues
Keeping the ID assignment stable across wikitext edits will be difficult. We don't need to solve this before HTML storage is implemented.

Idea:
 * Re-parse modified wikitext, and DOM-diff the resulting DOM while ignoring data-parsoid.
 * For each DOM node that did not differ (significantly), transfer the old IDs to the new DOM. Update data-parsoid, and any other element-associated metadata that needs updates (authorship maps for example).
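The transfer step in the idea above can be sketched as follows. The dict-based node model and the `(old_index, new_index)` pairing format produced by the DOM diff are assumptions; the actual diff algorithm is out of scope here.

```python
def transfer_ids(old_nodes, new_nodes, unchanged_pairs):
    """For node pairs the DOM diff reported as unchanged (ignoring
    data-parsoid), carry the old id and element-associated metadata
    over to the corresponding node of the re-parsed DOM."""
    for old_idx, new_idx in unchanged_pairs:
        old, new = old_nodes[old_idx], new_nodes[new_idx]
        new["id"] = old["id"]
        new["data-parsoid"] = old.get("data-parsoid")
```

Nodes that the diff reports as changed keep their freshly generated ids, so only genuinely stable content retains stable identifiers.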