Status: A tweaked version of the algorithm below is now implemented, and seems to work fine so far.
Tracked in bugzilla:37911.
Generally we would like to avoid any changes to the default HTML5 tree builder algorithm, since an unmodified algorithm lets us use the built-in HTML parser in modern browsers, or unchanged libraries. There are, however, some tasks that seem hard to solve otherwise and that require only small changes to the tree builder. These tasks all involve unbalanced token soup, which should be confined to the server.
- Spurious end tags are ignored by the tree builder, while some of them are displayed as text in current MediaWiki. Text display is helpful for authors. The change to the HTML tree builder needed to replicate this would be small, but is not possible if a browser's built-in parser is used. The visual editor hopefully reduces the need for this kind of debugging aid in the medium term.
- Propagate attribute information for end tag tokens (especially source information for tokens originating from templates) to the matching start tag, to make sure that the full scope of template-affected subtrees is captured.
This is hard to do without modifying the tree builder, but the modification is relatively simple and only needs to be performed on the server side.
- Idea for a possible solution:
Text and self-closing tags such as meta are never stripped, but might be subject to foster-parenting. The relative order between text and self-closing tags is stable, and attributes on self-closing tags are preserved.
 document.body.innerHTML = '<table _t=1><td _t=2></td>foo<meta _t=3 _tref=2><meta _t=4></table><meta _t=5 _tref=4>';
 console.log( document.body.innerHTML);
 -> foo<meta _t="3" _tref="2"><meta _t="4"><table _t="1"><tbody><tr><td _t="2"></td></tr></tbody></table><meta _t="5" _tref="4">

Before tree building, for each template:
 generate uid for template transclusion (simple counter)
 if first token is an element:
   add tplstart=uid attribute to element
 else:
   add meta element with tplstart=uid attribute
 add tplcontent=uid to each table start tag in the content of the template
 add meta element with tplend=uid attribute after end of template.
 If last token is text, also insert a meta with tplcontent=uid before trailing text tokens.

On DOM:
 build list of nodes with tplstart / tplcontent / tplend attribute set
 depth-first traverse the DOM
 when encountering an element with tplstart set:
   meta element:
     if tplend for this uid was already found:
       template ended somewhere in the next sibling table element and was parent-fostered
     else:
       following content until corresponding tplend was produced by template
   non-meta element:
     element and possibly following siblings were produced by a template: look for tplend
 when encountering a (meta) element with tplcontent or tplend set:
   if there is a following sibling table node:
     if that has the corresponding tplstart or tplcontent set:
       DOM until that table was template-affected (foster-parented meta)
   else if tplstart was already found:
     match all sibling DOM trees between tplstart and tplend node (walk up the tree until common parent is reached)
   else:
     there must be a sibling table node with tplstart set. Match all content up to it.

Matching sibling DOM nodes:
* wrap text nodes in span with attributes
* set meta info on first element
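
The "before tree building" marking step could be sketched in JavaScript roughly as follows. This is only an illustration, not the implemented code: the token shape (`type`, `name`, `attribs`, `value`) and the names `markTemplateTokens`, `metaToken` and `nextUid` are assumptions made up for this example. It takes the token list produced by a single template expansion and returns a marked copy.

```javascript
// Hypothetical token shape (an assumption for this sketch):
//   { type: 'start' | 'end' | 'selfclosing' | 'text', name, attribs, value }

// Simple counter used to generate a uid per template transclusion.
let nextUid = 0;

// Build a marker meta token carrying the given attributes.
function metaToken(attribs) {
  return { type: 'selfclosing', name: 'meta', attribs: attribs };
}

// Mark the tokens produced by one template expansion.
function markTemplateTokens(tokens) {
  const uid = String(++nextUid);

  // Work on shallow copies so the caller's tokens are not modified.
  const out = tokens.map(function (t) {
    return t.type === 'text' ? { ...t } : { ...t, attribs: { ...t.attribs } };
  });

  // Add tplcontent=uid to each table start tag in the template content.
  for (const t of out) {
    if (t.type === 'start' && t.name === 'table') {
      t.attribs.tplcontent = uid;
    }
  }

  // If the first token is an element, mark it directly;
  // otherwise prepend a meta element with tplstart=uid.
  const first = out[0];
  if (first && (first.type === 'start' || first.type === 'selfclosing')) {
    first.attribs.tplstart = uid;
  } else {
    out.unshift(metaToken({ tplstart: uid }));
  }

  // If the template ends in text, insert a meta with tplcontent=uid
  // before the run of trailing text tokens.
  let i = out.length;
  while (i > 0 && out[i - 1].type === 'text') {
    i--;
  }
  if (i < out.length) {
    out.splice(i, 0, metaToken({ tplcontent: uid }));
  }

  // Always close with a meta element carrying tplend=uid.
  out.push(metaToken({ tplend: uid }));
  return out;
}
```

For a template that expands to text, an empty table, and more text, this sketch prepends a tplstart meta, marks the table start tag with tplcontent, inserts a tplcontent meta before the trailing text, and appends the tplend meta. Since meta elements and text keep their relative order through the tree builder, these markers should remain usable for the DOM pass even when they are foster-parented out of a table.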