
Parsoid/Incremental re-parsing after wikitext edit

From mediawiki.org

Incrementally re-parsing a page after a wikitext edit has to deal with two classes of issues:

  • Wikitext syntax that changes the semantics of neighboring content: <nowiki>, unbalanced {{ / }}, etc.
  • Unbalanced tags or block-level changes that alter the DOM structure or merge with the DOM of neighboring content: for example, somebody inserting an unclosed <div>, or an edit turning a paragraph that sits between lists into a list item.

The first issue can be treated conservatively by matching all diffs against a regexp containing potentially problematic wikitext syntax, and reverting to a full re-parse if any diff matches.
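
The conservative check described above can be sketched as follows. This is only an illustration: the pattern `RISKY_WIKITEXT`, the function `needs_full_reparse`, and the representation of a diff as a `(removed, inserted)` pair are all assumptions made for the example, and the set of constructs in the pattern is a starting point rather than a complete list. Both sides of each diff are checked, on the assumption that deleting a `<nowiki>` or a `}}` can affect neighboring content just as much as inserting one.

```python
import re

# Hypothetical pattern: wikitext constructs whose effect can reach beyond
# the edited span. Only <nowiki> and unbalanced braces come from the text
# above; <pre> and comment delimiters are added here as an assumption.
RISKY_WIKITEXT = re.compile(
    r"""
      </?(?:nowiki|pre)\b    # opening or closing <nowiki> / <pre>
    | <!-- | -->             # comment delimiters
    | \{\{ | \}\}            # template braces
    """,
    re.IGNORECASE | re.VERBOSE,
)


def needs_full_reparse(diffs):
    """Return True if any diff removes or inserts risky syntax.

    `diffs` is an iterable of (removed_text, inserted_text) pairs.
    """
    return any(
        RISKY_WIKITEXT.search(removed) or RISKY_WIKITEXT.search(inserted)
        for removed, inserted in diffs
    )


# A typo fix or a plain link insertion stays on the incremental path;
# touching <nowiki> or braces falls back to a full re-parse.
print(needs_full_reparse([("teh", "the")]))
print(needs_full_reparse([("", "<nowiki>")]))
```

Matching on the diff text alone is deliberately coarse: a balanced `{{template}}` inserted in one edit also triggers the fallback, which costs a full parse but never produces a wrong result.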

The second issue can be handled as follows. If 's' is the offset in the original wikitext where the diff starts, we can use the DSR (DOM source range) information in the DOM to find B, the top-level block (a direct child of the body) in the original DOM that covers offset 's'. We then re-parse the new wikitext corresponding to B (the original wikitext of B with the diffs applied to it), which gives us a new set of top-level nodes. If any of the following is true, the range needs to be expanded to include neighboring top-level nodes or (conservatively) the entire document:

  • the new set has more than one top-level node (e.g. P changes to P, P), or
  • the name of the single top-level node differs from B's name (e.g. P changes to UL), or
  • the single top-level node has end tags that were auto-inserted by the tree builder
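
The lookup and the three checks can be sketched roughly as below. The `Node` class is a simplified stand-in for a real DOM node, not Parsoid's actual data structure: `dsr` is reduced to a `(start, end)` pair and `auto_inserted_end` to a boolean. The function names are hypothetical. Two choices here are assumptions beyond what the text states: an empty result (the block was deleted) is also treated as a reason to expand, and the auto-inserted-end check covers the whole subtree of the new node rather than only the node itself.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """Simplified, hypothetical stand-in for a DOM node with source offsets."""
    name: str
    dsr: Tuple[int, int]              # (start, end) offsets in the wikitext
    auto_inserted_end: bool = False   # end tag was supplied by the tree builder
    children: List["Node"] = field(default_factory=list)


def find_top_level_block(body_children: List[Node], s: int) -> Optional[Node]:
    """Return the direct child of <body> whose source range covers offset s."""
    for node in body_children:
        start, end = node.dsr
        if start <= s < end:
            return node
    return None  # offset falls between blocks: caller should re-parse fully


def has_auto_inserted_end(node: Node) -> bool:
    """True if the node or any descendant had its end tag auto-inserted."""
    return node.auto_inserted_end or any(
        has_auto_inserted_end(child) for child in node.children
    )


def must_expand_range(old_block: Node, new_nodes: List[Node]) -> bool:
    """Apply the three checks to the result of re-parsing B's new wikitext."""
    if len(new_nodes) != 1:           # P -> P, P (or, as an assumption, nothing)
        return True
    new_block = new_nodes[0]
    if new_block.name != old_block.name:   # P -> UL
        return True
    return has_auto_inserted_end(new_block)


body = [Node("p", (0, 12)), Node("ul", (13, 30)), Node("p", (31, 50))]
b = find_top_level_block(body, 15)    # the UL covering offset 15
print(must_expand_range(b, [Node("ul", (13, 34))]))
print(must_expand_range(b, [Node("p", (13, 20)), Node("p", (21, 34))]))
```

When `must_expand_range` returns False, the new node can replace B in place; the DSR offsets of the blocks after B would then have to be shifted by the change in wikitext length, which this sketch leaves out.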

Since most edits are simple text changes, link insertions or other simple inline edits, this simple and conservative approach should still provide a major capacity improvement. If it turns out to be worth it, we can later refine the handling of some cases, moving from the conservative "re-parse everything" fallback to something more intelligent.