Parsoid/MediaWiki DOM spec/Element IDs

From mediawiki.org

See bug 52936 for the implementation status.

Problem statement

  • We want to associate arbitrary external metadata with elements in a document. Examples: data-parsoid, data-mw, authorship maps, annotations.
  • We want to preserve this metadata across revisions.
  • We want to preserve this metadata when copying and pasting:
    • Within a page in VE
    • Between different VE instances in the same or different wikis
    • Between a read-only view and VE on the same or different wikis

Solution

Add a page-unique id attribute to each node

User-supplied IDs are used directly as long as they are unique. IDs with the mw- or mwe- prefix, and IDs on a blacklist of UI ids, are disallowed. The assigned ids are stable across revisions and can be used as fragment identifiers.

Rationale:

  • document.getElementById is very efficient
  • also supported in SVG and MathML
  • stable ids simplify move detection in the HTML diff algorithm
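The acceptance rule for user-supplied ids described above could be sketched roughly as follows. This is only an illustration: the function name is_usable_user_id is hypothetical, and the entries in RESERVED_UI_IDS are placeholders, since the actual blacklist of UI ids is not listed here.

```python
# Prefixes named in the text as disallowed for user-supplied ids.
RESERVED_PREFIXES = ("mw-", "mwe-")

# Placeholder entries only: the real blacklist of UI ids is not given in the text.
RESERVED_UI_IDS = {"content", "footer"}


def is_usable_user_id(candidate: str, ids_in_page: set) -> bool:
    """Return True if a user-supplied id could be kept as-is (hypothetical helper)."""
    if not candidate:
        return False
    if candidate.startswith(RESERVED_PREFIXES):
        return False  # reserved prefix
    if candidate in RESERVED_UI_IDS:
        return False  # would collide with a UI element
    return candidate not in ids_in_page  # must be unique within the page
```

An id that fails this check would presumably be replaced with a generated one; how that replacement is recorded is covered under "Handling ID conflicts / changes on edit".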

On copy or view page load, add a revision URL attribute to at least one element

Example clipboard content:

<div id="mwMQ" data-rev="http://wiki.com/?oldid=12345">the content</div>

Copied content must contain at least one element that carries the revision URL. On paste into an editor, the specified revision URL can be used to fetch additional metadata. Example: http://wiki.com/?oldid=12345/pageprops, which returns a JSON object keyed on id.

Adding the revision URL can happen on the server (it will compress well), but ideally we'd add it only when needed on copy to keep the DOM clean.
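As a rough sketch of the paste side, the following shows how an editor might locate the revision URL in a pasted fragment, derive the metadata URL, and pick out the entries for the pasted ids. It assumes the fragment is well-formed XML with a single root element (real clipboard HTML would need an HTML parser), that the metadata URL is formed by appending /pageprops as in the example above, and that the response has already been fetched and decoded. All function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def find_revision_url(fragment: str):
    """Return the first data-rev value found in the fragment, or None."""
    for el in ET.fromstring(fragment).iter():
        rev = el.get("data-rev")
        if rev:
            return rev
    return None


def collect_ids(fragment: str) -> list:
    """Return the id attributes of all elements in document order."""
    return [el.get("id") for el in ET.fromstring(fragment).iter() if el.get("id")]


def metadata_url(rev_url: str) -> str:
    """Assumption: metadata lives at <revision URL>/pageprops, as in the example."""
    return rev_url + "/pageprops"


def select_metadata(response: dict, ids: list) -> dict:
    """Keep only the entries of the id-keyed JSON object that match pasted ids."""
    return {i: response[i] for i in ids if i in response}


pasted = '<div id="mwMQ" data-rev="http://wiki.com/?oldid=12345">the content</div>'
rev = find_revision_url(pasted)
url = metadata_url(rev)  # "http://wiki.com/?oldid=12345/pageprops"
```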

Preventing incorrect metadata being returned on copy-paste

A pasted fragment with non-conflicting data-rev and id attributes might come from an arbitrary source, so its elements might not actually correspond to the elements expected in the linked revision. If the ids nevertheless match ids in that revision, the metadata could end up associated with unrelated elements, which might result in corruption and other ill effects.

To prevent this, the client can hash each pasted subtree and send tuples of (id, subtree-hash) to the server when requesting the metadata. The server can then verify that each subtree indeed matches the pasted source, and return an error if it does not.
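A minimal sketch of this check, under several assumptions: fragments are well-formed XML, the hash is SHA-256 over a simple ad hoc serialization (not a standard canonical form), and data-rev is left out of the hash because it is added on copy and would not be present in the stored revision. The function names are hypothetical.

```python
import hashlib
import xml.etree.ElementTree as ET

# Assumption: attributes added on copy are excluded so client and server agree.
UNHASHED_ATTRS = {"data-rev"}


def _serialize(el) -> str:
    """Ad hoc serialization with sorted attributes (no escaping; sketch only)."""
    attrs = "".join(
        f' {k}="{v}"' for k, v in sorted(el.attrib.items()) if k not in UNHASHED_ATTRS
    )
    children = "".join(_serialize(c) + (c.tail or "") for c in el)
    return f"<{el.tag}{attrs}>{el.text or ''}{children}</{el.tag}>"


def subtree_hash(el) -> str:
    return hashlib.sha256(_serialize(el).encode("utf-8")).hexdigest()


def id_hash_tuples(fragment: str) -> list:
    """Client side: one (id, subtree-hash) tuple per element that has an id."""
    root = ET.fromstring(fragment)
    return [(el.get("id"), subtree_hash(el)) for el in root.iter() if el.get("id")]


def verify(tuples: list, stored_fragment: str) -> bool:
    """Server side: every submitted tuple must match the stored revision."""
    stored = dict(id_hash_tuples(stored_fragment))
    return all(stored.get(i) == h for i, h in tuples)
```

With this scheme, a fragment whose ids match the linked revision but whose content differs fails verification, so the server can refuse to return metadata for it.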

Handling ID conflicts / changes on edit

When an existing id is changed in the editor (the user sets a new id, or content is duplicated), or when new content is pasted from another document or wiki, data-mw.{previd,revuri} attributes are added that record the previous id and the API revision URI of this node. For content pasted from another wiki or revision, the revuri is also supplied. The storage backend can then use this information to update the associated metadata by fetching, duplicating or renaming existing metadata.
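The bookkeeping described above might look roughly like this. The sketch assumes that data-mw is stored as a JSON object in an attribute and that previd and revuri are keys in that object; the helper name reassign_id and the plain-dict representation of a node's attributes are hypothetical.

```python
import json
from typing import Optional


def reassign_id(attrs: dict, new_id: str, rev_uri: Optional[str] = None) -> dict:
    """Return a copy of a node's attributes with a new id, recording the old one.

    attrs maps attribute names to string values. rev_uri is only passed for
    content pasted from another wiki or revision (assumption based on the text).
    """
    updated = dict(attrs)
    old_id = updated.get("id")
    if old_id is not None and old_id != new_id:
        data_mw = json.loads(updated.get("data-mw", "{}"))
        data_mw["previd"] = old_id
        if rev_uri is not None:
            data_mw["revuri"] = rev_uri
        updated["data-mw"] = json.dumps(data_mw, sort_keys=True)
    updated["id"] = new_id
    return updated
```

A storage backend could then read previd (and revuri, if present) to decide whether to rename, duplicate or fetch the metadata for the node.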

Implementation issues

Keeping the ID assignment stable across wikitext edits will be difficult. We don't need to solve this before HTML storage is implemented, but should start working on it as it would also enable switching between HTML and wikitext in VE.

Idea:

  • Re-parse modified wikitext, and DOM-diff the resulting DOM while ignoring data-parsoid.
  • For each DOM node that did not differ (significantly), transfer the old IDs to the new DOM. Update data-parsoid, and any other element-associated metadata that needs updates (authorship maps for example).
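The idea above could be prototyped along these lines. This is a simplified stand-in for a real DOM diff: nodes count as unchanged only if their tag, their attributes (ignoring id and data-parsoid) and their leading text are exactly equal, and children are aligned with difflib.SequenceMatcher. The "did not differ (significantly)" threshold and the update of data-parsoid and other metadata are left out. All names are hypothetical.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher

# Attributes ignored when comparing old and new nodes.
IGNORED_ATTRS = {"id", "data-parsoid"}


def _signature(el) -> tuple:
    """Shallow, hashable description of a node used for comparison."""
    attrs = tuple(sorted((k, v) for k, v in el.attrib.items() if k not in IGNORED_ATTRS))
    return (el.tag, attrs, (el.text or "").strip())


def transfer_ids(old, new) -> int:
    """Copy ids from old to new for matching nodes; return the number copied."""
    transferred = 0
    if _signature(old) == _signature(new) and old.get("id"):
        new.set("id", old.get("id"))
        transferred += 1
    old_kids, new_kids = list(old), list(new)
    matcher = SequenceMatcher(
        a=[_signature(k) for k in old_kids],
        b=[_signature(k) for k in new_kids],
        autojunk=False,
    )
    # Recurse only into children that the sequence alignment paired up.
    for i, j, size in matcher.get_matching_blocks():
        for k in range(size):
            transferred += transfer_ids(old_kids[i + k], new_kids[j + k])
    return transferred


old = ET.fromstring(
    '<body id="mwAA"><p id="mwAQ" data-parsoid="{}">one</p><p id="mwAg">two</p></body>'
)
new = ET.fromstring("<body><p>zero</p><p>one</p><p>two</p></body>")
count = transfer_ids(old, new)
```

In this example the inserted paragraph gets no id, while the two unchanged paragraphs keep theirs even though their positions shifted.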

Discussion: http://etherpad.wikimedia.org/p/Parsoid_stable_id_brainstorming

ID prefixing / pattern

After some discussions with Krinkle we settled on the pattern mw<base64-encoded counter>. We can keep the maximum id per page in data-parsoid, so that ids are not reused quickly. This avoids links falsely pointing to an unrelated fragment when the id is reused.
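A sketch of one way to generate ids in this pattern. How the counter is turned into bytes before base64 encoding is not specified here; the sketch assumes the decimal string of the counter is encoded, which reproduces the id mwMQ for counter 1 seen in the clipboard example. The URL-safe alphabet and the stripped padding are also assumptions, as are the function names.

```python
import base64


def make_id(counter: int) -> str:
    """Build an id of the form mw<base64-encoded counter> (encoding is assumed)."""
    if counter < 0:
        raise ValueError("counter must be non-negative")
    encoded = base64.urlsafe_b64encode(str(counter).encode("ascii")).decode("ascii")
    return "mw" + encoded.rstrip("=")  # drop padding to keep ids short


def next_id(max_counter: int) -> tuple:
    """Allocate a fresh id from the stored per-page maximum.

    Returns (new_id, new_max_counter). Because the maximum only grows,
    ids of deleted nodes are not handed out again.
    """
    new_max = max_counter + 1
    return make_id(new_max), new_max
```

The caller would persist the returned maximum (for example in data-parsoid, as suggested above) so that later edits continue from it.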