Parsoid/DOM notes

This document discusses requirements for well-formed and well-balanced template output in the context of efficient editability within VE and efficient reparsing of pages in Parsoid.

Editability

 * Edit transclusions in VE. Use HTML for parameter editing whenever possible
 * Correct and efficient preview of updated / inserted transclusions in VE after edits

Performance

 * Re-render only modified transclusions
 * ESI or client-side JS updates of dynamic content

Compatibility

 * The semantics of existing content should not change
 * Rendering should be consistent between Parsoid and the PHP parser while both are used in parallel

Issues

 * 1) wikitext templates are text-based and don't necessarily produce well-formed DOM trees (TODO: Provide examples)
 * 2) even when well-formed, HTML5 content model could force restructuring of target page beyond the transclusion node (TODO: Provide examples)
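
As a rough illustration of issue 1, the sketch below checks whether an HTML fragment is balanced using a simple tag stack. The two unbalanced fragments stand for the output of a hypothetical pair of "table start" / "table end" templates; they are assumptions for illustration, not taken from a real wiki. The check is stricter than HTML5 parsing (which allows some end tags to be omitted) and is not how Parsoid detects unbalanced output.

```python
from html.parser import HTMLParser

# Void elements never take an end tag, so they are not tracked on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class BalanceChecker(HTMLParser):
    """Tracks open elements and flags stray or misnested end tags."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.balanced = True

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            # End tag without a matching innermost open element.
            self.balanced = False


def is_balanced(fragment):
    """Return True if every start tag is closed in order within the fragment."""
    checker = BalanceChecker()
    checker.feed(fragment)
    checker.close()
    return checker.balanced and not checker.stack


# Hypothetical template outputs: each half is unbalanced on its own.
table_start = "<table><tr><td>"
table_end = "</td></tr></table>"
print(is_balanced(table_start))                       # opens without closing
print(is_balanced(table_end))                         # closes without opening
print(is_balanced(table_start + "cell" + table_end))  # balanced only together
```

Only the combination of both template outputs and the page content between them forms a balanced block, which is why such content cannot be edited or re-rendered one transclusion at a time.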

Editability

 * Without information about which parameters can be parsed to DOM, parameters currently need to be edited as wikitext (see bug 50587)
 * Page content that interacts with a transclusion (wrapped in unbalanced templates, rendered depending on a template expansion) can only be edited as wikitext. This content can in turn be balanced or unbalanced.
 * When transclusion args are changed, the HTML5 content model can cause the changes to leak out of the DOM structure to sibling and ancestor nodes. In the worst case, the full page might need to be re-rendered, which is not feasible (TODO: explain why). The current re-rendering implementation in VE can present incorrect previews.

Performance

 * In certain scenarios, unbalanced or leaking template output can force full-page reparsing when transclusions or extensions are re-expanded.
 * This makes dynamic ESI or client-side updates impossible.

Considerations

 * Acceptability of solution by editors
 * What do we do about all the old revisions which we cannot go in and edit / fix / wrap in extension tags?

Define well-formed content blocks and enforce well-formedness

 * use TemplateData to identify HTML-compatible template parameters
 * wrap existing unbalanced content (mix of page content, transclusions, extensions) in a tag that enforces well-formedness for the entire block. The PHP parser implementation would be an extension tag.
 * Edit templates and use bots to fix uses where possible to minimize cases that require wrapping.
 * use existing marking of template/extension-affected content for efficient and correct updates (partly done already)
 * default to well-formed transclusion parsing

The last point might be too hard to implement in the PHP parser.
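
What "enforcing well-formedness for a block" could mean is sketched below: unclosed elements are closed at the end of the block and stray end tags are dropped, so nothing leaks past the block boundary. This is a minimal sketch built on Python's standard `html.parser`; the `balance` function and its behaviour are assumptions for illustration. A real implementation would use an HTML5 tree builder, and neither Parsoid nor a PHP extension tag is claimed to work this way.

```python
from html import escape
from html.parser import HTMLParser

VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class Balancer(HTMLParser):
    """Re-serializes a fragment while keeping a stack of open elements."""

    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.stack = []
        self.out = []

    def handle_starttag(self, tag, attrs):
        rendered = "".join(
            f" {name}" if value is None
            else f' {name}="{escape(value, quote=True)}"'
            for name, value in attrs
        )
        self.out.append(f"<{tag}{rendered}>")
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray end tag: drop it so it cannot affect outer content
        # Close any elements still open inside the one being closed.
        while True:
            open_tag = self.stack.pop()
            self.out.append(f"</{open_tag}>")
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def balance(fragment):
    """Return the fragment with all elements closed inside the block."""
    balancer = Balancer()
    balancer.feed(fragment)
    balancer.close()
    while balancer.stack:
        balancer.out.append(f"</{balancer.stack.pop()}>")
    return "".join(balancer.out)


print(balance("<table><tr><td>cell"))
print(balance("text</div>"))
```

Because the output of `balance` is always self-contained, a change inside the block can be re-rendered without touching sibling or ancestor nodes. Comments are dropped by this sketch for brevity.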

Enforce content model constraints
In addition to simple well-formedness, we need to enforce some content model constraints to bound and possibly narrow down the scope of template-affected content. We currently (partly) enforce these constraints:
 * paragraphs cannot be nested: suppress paragraph generation in nested content
 * links cannot be nested: currently break up the outer link (not ideal, but MW behavior)

To further reduce the scope of what can be affected by template re-rendering, we could consider also enforcing some of the following constraints:
 * inline content
 * table elements in a table (avoid foster-parenting)

We could further expand this to also include MediaWiki-specific syntactic constraints:
 * wikitext lists depend on uninterrupted wikitext list items
 * trailing (or leading) templated newlines affect the syntactic context and thus the parsing of following content. For example, when a template's output ends in newlines and is directly followed by 'bar', the trailing newlines cause 'bar' to be wrapped in its own paragraph.
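
The effect of templated trailing newlines can be illustrated with a toy paragraph model in which blocks separated by blank lines become paragraphs. This is a deliberately simplified assumption; the actual wikitext paragraph rules are more involved, and `wrap_paragraphs` and `expand` are hypothetical helper names.

```python
import re


def wrap_paragraphs(wikitext):
    """Toy model: text blocks separated by blank lines become <p> elements."""
    blocks = re.split(r"\n{2,}", wikitext.strip("\n"))
    return "".join(f"<p>{block}</p>" for block in blocks if block)


def expand(template_output, following):
    """Concatenate template output with the page content that follows it."""
    return wrap_paragraphs(template_output + following)


# Without trailing newlines, 'bar' joins the template's paragraph.
print(expand("foo", "bar"))      # <p>foobar</p>
# With trailing newlines, 'bar' ends up in its own paragraph.
print(expand("foo\n\n", "bar"))  # <p>foo</p><p>bar</p>
```

The rendering of 'bar' depends on the template's output even though 'bar' is ordinary page content, which is what makes such content template-affected.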

In general, constraints can also guide the VE in the selection of possible templates to insert in a specific context.

Our current marking of template-affected content does not fully take content model issues into account. For correct dependency tracking, we have a choice between 1) enforcing constraints and 2) marking large amounts of content as transclusion-affected. Enforcing constraints is clearly preferable from both a performance and an editing perspective. The downside of enforcing constraints is that we risk moving further away from the PHP parser's interpretation of the same content. There might be ways to implement similar behavior in the PHP parser. However, if we move fast enough to make Parsoid page rendering the default by that point, this might no longer be an issue.

TODO:
 * consider use-site constraints vs. global per-template constraints and their interaction with old revisions

Develop DOM-based templating alternatives
Separate data from presentation. For example, large tables contain a lot of repetitive wikitext that serves no purpose except to introduce syntax errors, foster-parentable content, etc. See our roadmap.
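
A minimal sketch of the idea, assuming the table data is available as plain rows: the presentation is built as a DOM tree, so the output is well-formed by construction and cell values cannot introduce markup. The `render_table` function and the sample data are hypothetical and do not represent an existing Parsoid or MediaWiki API.

```python
import xml.etree.ElementTree as ET


def render_table(headers, rows):
    """Build a table as a DOM tree from data and serialize it as HTML."""
    table = ET.Element("table")
    head = ET.SubElement(table, "tr")
    for header in headers:
        ET.SubElement(head, "th").text = header
    for row in rows:
        tr = ET.SubElement(table, "tr")
        for cell in row:
            # Cell values are text nodes; markup characters are escaped
            # on serialization instead of being parsed.
            ET.SubElement(tr, "td").text = str(cell)
    return ET.tostring(table, encoding="unicode", method="html")


# Made-up sample data, separate from any presentation markup.
rows = [["Alpha & Beta", 1], ["<b>Gamma</b>", 2]]
print(render_table(["Name", "Rank"], rows))
```

Since rows and cells are created as elements rather than concatenated as text, there is no way for the data to produce unbalanced tags or content that would be foster-parented out of the table.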