Parsoid/DOM notes

From mediawiki.org

This document discusses well-formedness and balance requirements for template output, in the context of efficient editability in VE and efficient reparsing of pages in Parsoid.

Goals

Editability

  • Edit transclusions in VE. Use HTML for parameter editing whenever possible
  • Correct and efficient preview of updated or inserted transclusions in VE after edits

Performance

  • Re-render only modified transclusions
  • ESI or client-side JS updates of dynamic content

Compatibility

  • The semantics of existing content should not change
  • Rendering should be consistent between Parsoid and the PHP parser while both are used in parallel

Issues

  1. wikitext templates are text-based and don't necessarily produce well-formed DOM trees (TODO: Provide examples)
  2. the PHP parser textually expands all transclusions in place, so the parsing of a transclusion can be affected by arbitrary syntactic contexts. The most common context dependency (and probably the only one we might support) is start-of-line position.
  3. even when template output is well-formed, the HTML5 content model could force restructuring of the target page beyond the transclusion node (TODO: Provide examples)
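
As a hypothetical illustration of issue 1, consider a table that is split across a "table start" and a "table end" template: each template emits a fragment such as `<table><tr><td>` or `</td></tr></table>` that is not balanced on its own. The sketch below is a minimal tag-balance check built on Python's standard `html.parser`; the names `BalanceChecker` and `is_balanced` are assumptions made for this example and are not part of Parsoid. It only checks that start and end tags pair up in order. It does not model the HTML5 content model described in issue 3.

```python
from html.parser import HTMLParser

# Void elements never take an end tag, so the balance check ignores them.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
                 "link", "meta", "source", "track", "wbr"}


class BalanceChecker(HTMLParser):
    """Tracks open tags on a stack and records the first imbalance seen."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.error = None

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_ELEMENTS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS or self.error:
            return
        if not self.stack or self.stack[-1] != tag:
            self.error = "unexpected </%s>" % tag
        else:
            self.stack.pop()


def is_balanced(fragment):
    """Return True if every start tag in `fragment` is closed, in order."""
    checker = BalanceChecker()
    checker.feed(fragment)
    checker.close()
    return checker.error is None and not checker.stack


# Hypothetical template outputs: neither half of a split table is balanced.
print(is_balanced("<table><tr><td>cell"))   # False
print(is_balanced("</td></tr></table>"))    # False
print(is_balanced("<div><b>x</b></div>"))   # True
```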

Implications

Editability

  • Without information about which parameters can be parsed to DOM, parameters currently need to be edited as wikitext (see bug 50587)
  • Page content that interacts with a transclusion (wrapped in unbalanced templates, rendered depending on a template expansion) can only be edited as wikitext. This content can in turn be balanced or unbalanced.
  • When transclusion args are changed, the HTML5 content model can cause the changes to leak out of the transclusion's DOM structure to sibling and ancestor nodes. In the worst case, the full page might need to be re-rendered, which is not feasible (TODO: explain why). The current re-rendering implementation in VE can present incorrect previews.

Performance

  • In certain scenarios, these issues can force full-page reparsing when transclusions or extensions are re-expanded.
  • They make dynamic ESI or client-side updates impossible.

Considerations

  • Acceptability of solution by editors
  • What do we do about old revisions, which we cannot edit, fix, or wrap in <domtree> extension tags?

How to fix it in the longer term

Define well-formed content blocks and enforce well-formedness

  • use TemplateData to identify HTML-compatible template parameters
  • wrap existing unbalanced content (a mix of page content, transclusions, and extensions) in a <domtree> tag, which enforces well-formedness for the entire block. The PHP parser implementation would be an extension tag.
    • Edit templates and use bots to fix uses where possible to minimize cases that require <domtree> wrapping.
  • use existing marking of template/extension-affected content for efficient and correct updates (partly done already)
  • default to well-formed transclusion parsing

The last point might be too hard to implement in the PHP parser.

Steps beyond PHP parser compatibility

Enforce content model constraints

In addition to simple well-formedness, we need to enforce some content model constraints to bound and possibly narrow down the scope of template-affected content. We currently (partly) enforce these constraints:

  • paragraphs cannot be nested: we suppress paragraph generation in nested content
  • links cannot be nested: we currently break up the outer link (not ideal, but it matches MW behavior)

To further reduce the scope of what can be affected by template re-rendering, we could consider also enforcing some of the following constraints:

  • inline content
  • table elements in a table (avoid foster-parenting)

We could further expand this to also include MediaWiki-specific syntactic constraints:

  • wikitext lists depend on uninterrupted wikitext list items
  • trailing (or leading) templated newlines, as in {{echo|foo\n\n}}bar, affect the syntactic context and thus the parsing of the following content. In this example, the trailing newlines cause 'bar' to be wrapped in its own paragraph.
  • autolinks/magic links can either (a) not have any template-affected parts, or (b) have all template-affected parts completely contained.
    For example, with (b) this would be valid: RFC {{echo|1234}}, but this would be invalid: RFC {{echo|1234 trailing content}}
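
The trailing-newline case in the list above can be reproduced with a toy model. The sketch below assumes an `echo` template that returns its argument verbatim and a paragraph wrapper that starts a new paragraph at every blank line. `expand_echo` and `paragraphs` are hypothetical helpers written for this example, they do not handle nested templates, and they are not how the PHP parser or Parsoid is implemented.

```python
import re


def expand_echo(wikitext):
    """Textually expand {{echo|...}} by replacing the call with its argument."""
    return re.sub(r"\{\{echo\|(.*?)\}\}", r"\1", wikitext, flags=re.DOTALL)


def paragraphs(wikitext):
    """Toy paragraph wrapper: a blank line starts a new paragraph."""
    blocks = re.split(r"\n\s*\n", wikitext)
    return ["<p>%s</p>" % block.strip() for block in blocks if block.strip()]


# The templated trailing newlines push 'bar' into its own paragraph.
print(paragraphs(expand_echo("{{echo|foo\n\n}}bar")))
# Without them, 'foo' and 'bar' share one paragraph.
print(paragraphs(expand_echo("{{echo|foo}}bar")))
```

Because the expansion is purely textual, the paragraph boundary is decided by content that lives inside the template argument, which is why the surrounding page content cannot be parsed independently of the transclusion.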

In general, constraints can also guide the VE in the selection of possible templates to insert in a specific context.

Our current marking of template-affected content does not fully take content model issues into account. For correct dependency tracking, we have a choice between 1) enforcing constraints and 2) marking large amounts of content as transclusion-affected. Enforcing constraints is clearly preferable from both the performance and the editing perspective. The downside of enforcing constraints is that we risk moving further away from the PHP parser's interpretation of the same content. There might be ways to implement similar behavior in the PHP parser. However, if we move fast enough to make Parsoid page rendering the default, this might no longer be an issue.

TODO:

  • consider use-site constraints vs. global per-template constraints and their interaction with old revisions

SEE ALSO:

Develop DOM-based templating alternatives

Separate data from presentation. For example, large tables contain a lot of repetitive wikitext that serves no purpose except to introduce syntax errors, foster-parentable content, etc. See our roadmap.

Explanatory Notes

This section collects notes that elaborate on the problems and solutions referenced in earlier sections.

Efficient re-rendering on edits

For this section and the rest of this document, let us assume that the output of a template is always a well-formed DOM fragment (a forest of adjacent DOM trees).

Given a page P, let F be a DOM fragment that corresponds to the transclusion of a template T. Let N be the container node within which F gets inserted. There are three edit scenarios that we have to consider:

  1. Page P is edited to P'. Output of F is unchanged. How do we reuse F from P when parsing P'? This is the common workflow for Parsoid on page edits.
  2. Page P is unchanged. Template T, which produces F, is changed and now produces F'. How do we now re-render P to incorporate F'?
  3. Page P is edited in VE. Parameters to T are modified in VE which changes F to F'. How does VE re-render P to incorporate F'?

In Parsoid, when reparsing P, F is currently converted to representative wrapper tokens, which then participate in various transformations (indent-pre wrapping, list creation, p-wrapping, etc.). During post-processing of the DOM, F is unwrapped and inserted into N. This technique will let us handle scenarios 1 and 2. But without additional guarantees or constraints on F' and the container node N, VE won't be able to just take F' and drop it in place on the client side. In the worst case, it will require a serialize + reparse to get HTML nesting constraints (as implemented in the HTML5 parser) exactly right.
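
The wrapper-token idea can be sketched as follows. This is a deliberately simplified model and not Parsoid's token pipeline: `Placeholder`, `BLANK`, `p_wrap`, and `unwrap` are hypothetical names, and p-wrapping is the only transformation modelled. A placeholder stands in for F while paragraphs are formed, and the fragment is substituted only in a final step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placeholder:
    """Opaque stand-in for a transclusion fragment F (hypothetical type)."""
    key: str


BLANK = object()  # hypothetical token marking a paragraph break


def p_wrap(tokens):
    """Group tokens into paragraphs, treating placeholders as opaque content."""
    result, current = [], []
    for tok in tokens:
        if tok is BLANK:
            if current:
                result.append(current)
            current = []
        else:
            current.append(tok)
    if current:
        result.append(current)
    return result


def unwrap(wrapped, fragments):
    """Post-processing step: substitute each placeholder with its fragment."""
    html = []
    for para in wrapped:
        parts = [fragments[t.key] if isinstance(t, Placeholder) else t
                 for t in para]
        html.append("<p>%s</p>" % "".join(parts))
    return "".join(html)


tokens = ["Hello ", Placeholder("T1"), BLANK, "bye"]
layout = p_wrap(tokens)  # computed once, independent of the fragment content
print(unwrap(layout, {"T1": "<b>old</b>"}))
print(unwrap(layout, {"T1": "<b>new</b>"}))
```

In this model the paragraph layout is reused when the fragment changes from `<b>old</b>` to `<b>new</b>`. The substitution is only safe because both fragments are inline content. If the new fragment were a block element such as a `<div>`, the result would no longer match what an HTML5 parser produces, which is the limitation described above.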

In general, the acceptability criterion is whether P' == HTML5.parse((P' = P.replace(F, F')).outerHTML). If yes, then F' can be dropped into N (both in Parsoid and VE) without any additional analysis or transformations. But this check, while sufficient, is quite expensive and unrealistic to do on every template edit. We can improve on this by exploiting DOM scopes (wikitext sections, for example), each of which can be processed completely independently, and by enforcing additional constraints either on template output (F above: global per-template constraints) or on the nodes where templates can be used (N above: use-site constraints).
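
The round-trip check can be illustrated with a toy stand-in for HTML5.parse. Python's standard library has no HTML5 tree builder, so the sketch below assumes a tiny `ToyReparser` that applies a single content-model rule: a block-level start tag closes an open `<p>`. All names here are hypothetical, and a real check would need a spec-compliant HTML5 parser, which also handles cases this toy ignores (for example, reopening formatting elements).

```python
from html.parser import HTMLParser

# Assumption: a small subset of the start tags that implicitly close a <p>.
CLOSES_P = {"p", "div", "table", "ul", "ol"}
VOID = {"br", "hr", "img"}


class ToyReparser(HTMLParser):
    """Re-serializes HTML while applying one rule: block tags close an open <p>."""

    def __init__(self):
        super().__init__()
        self.out = []
        self.stack = []

    def _close_through(self, tag):
        # Pop and close open elements up to and including `tag`.
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append("</%s>" % open_tag)
            if open_tag == tag:
                break

    def handle_starttag(self, tag, attrs):
        if tag in CLOSES_P and "p" in self.stack:
            self._close_through("p")
        if tag not in VOID:
            self.stack.append(tag)
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.stack:
            self._close_through(tag)
        elif tag == "p":
            # HTML5 turns a stray </p> into an empty paragraph.
            self.out.append("<p></p>")

    def handle_data(self, data):
        self.out.append(data)


def reparse(html):
    """Toy stand-in for serialize + HTML5.parse."""
    parser = ToyReparser()
    parser.feed(html)
    parser.close()
    while parser.stack:  # close anything still open at end of input
        parser.out.append("</%s>" % parser.stack.pop())
    return "".join(parser.out)


def can_drop_in(page_html, old_fragment, new_fragment):
    """Acceptability check: is P' unchanged by a serialize + reparse round trip?"""
    candidate = page_html.replace(old_fragment, new_fragment, 1)
    return reparse(candidate) == candidate


page = "<p>Intro <b>old</b> outro</p>"
print(can_drop_in(page, "<b>old</b>", "<i>new</i>"))      # True
print(can_drop_in(page, "<b>old</b>", "<div>new</div>"))  # False
```

In the second call the `<div>` closes the surrounding paragraph, so the reparsed page differs from the naive substitution and the change leaks out of N. A use-site constraint such as "N accepts only inline content" would rule this case out without running the round trip at all.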