User:SSastry (WMF)/Notes:Whitespace

Notes about handling whitespace
This entire note is predicated on two assumptions:


 * That treating all whitespace as significant complicates algorithms and code, and that we could simplify the code if we did not have to.
 * That it is actually better to normalize wikitext to canonical forms progressively, by introducing the resulting dirty diffs across a large number of edits. (SSS: To resolve: what is the difference between canonical and normal? They almost seem like synonyms in my mind.)

Possible strategy for dealing with whitespace:


 * 1) Parsoid decorates the DOM with wikitext source positions (s, e), i.e. given the original wikitext string S, if a DOM node N has positions (s, e), the HTML represented by N corresponds to the wikitext substring S(s,e).  The question to ask here is whether such an (s,e) assignment is possible for every node N in the DOM.  If not, what properties do S and N have to satisfy for such an assignment to be possible?
 * 2) Parsoid can then, if the visual editor finds it useful, completely canonicalize all whitespace so that all non-significant whitespace is reduced to its minimal form.  The same could possibly be done for other nodes (e.g. the inline node minimization routine minimize_inline_tags in mediawiki.DOMPostProcessor.js).  But, in the interest of minimizing Parsoid overhead, if the VE can handle it, we may decide not to run this normalization/canonicalization before passing the DOM to the visual editor.
 * 3) The visual editor completely ignores non-significant whitespace when building the linear model from the DOM.  What is critical is that the VE preserve the (s,e) offsets on the DOM tree.  In addition, the VE would have to track which parts of the DOM were modified by the user, and mark the dirty DOM subtrees.
 * 4) When Parsoid gets the new DOM, as part of serialization, it propagates a new flag (say, has-modified-child) up ancestor paths all the way to the root.  Then, for every maximal DOM subtree that is unmodified, it uses the (s,e) values on that subtree's root to emit the original wikitext.  For every modified subtree, it applies the existing wikitext serialization algorithm, but also uses the opportunity to aggressively canonicalize whitespace and get rid of non-significant whitespace.  Example: given a subtree N-[A,B] where A and B are children of N, A is unmodified, and B is modified, B's wikitext is serialized (with canonicalized whitespace), A's wikitext is extracted from the original wikitext string S, and N's wikitext is composed from N's element type and the output of A and B.
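
Step 4 can be sketched roughly as follows. This is only a minimal illustration under assumptions, not Parsoid's actual serializer: the Node class, its field names, mark_dirty, serialize, the toy compose rules, and the canonicalize stand-in (which just collapses runs of spaces and tabs) are all hypothetical names made up for this example.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """Hypothetical DOM node; field names are illustrative, not Parsoid's."""
    tag: str
    span: Optional[Tuple[int, int]] = None  # (s, e) offsets into the original wikitext S
    modified: bool = False                  # set by the editor on nodes the user changed
    text: str = ""                          # content, for "#text" nodes only
    children: List["Node"] = field(default_factory=list)
    has_modified_child: bool = False        # computed by mark_dirty()


def mark_dirty(node: Node) -> bool:
    """Set has-modified-child bottom-up; return True if the subtree is dirty."""
    child_dirty = False
    for child in node.children:
        if mark_dirty(child):
            child_dirty = True
    node.has_modified_child = child_dirty
    return node.modified or child_dirty


def canonicalize(text: str) -> str:
    """Stand-in for whitespace canonicalization: collapse runs of spaces/tabs."""
    return re.sub(r"[ \t]+", " ", text)


def compose(tag: str, inner: str) -> str:
    """Toy composition rules for two element types (assumed for this example)."""
    wrappers = {"i": "''", "b": "'''"}
    quote = wrappers.get(tag, "")
    return quote + inner + quote


def serialize(node: Node, source: str) -> str:
    clean = not node.modified and not node.has_modified_child
    if clean and node.span is not None:
        s, e = node.span
        return source[s:e]  # maximal unmodified subtree: reuse the original wikitext
    if node.tag == "#text":
        return canonicalize(node.text)
    inner = "".join(serialize(child, source) for child in node.children)
    return compose(node.tag, inner)


# N-[A,B]: A is unmodified (its odd spacing survives), B was edited by the user.
S = "''a   b''   tail"
A = Node("i", span=(0, 9), children=[Node("#text", text="a   b")])
B = Node("#text", span=(9, 16), modified=True, text=" new    tail")
N = Node("p", span=(0, 16), children=[A, B])
mark_dirty(N)
print(serialize(N, S))  # ''a   b'' new tail
```

Note that the dirty whitespace diff is confined to B: the three spaces inside A are copied verbatim from S, while only the edited text has its whitespace collapsed.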

This strategy has a few potential benefits:


 * The visual editor can be whitespace-insensitive for the most part.
 * Parsoid only has to serialize the modified pieces of the DOM; unmodified subtrees are copied from the original wikitext.
 * The document can be progressively normalized, i.e. the dirty diffs that a one-time normalization would introduce are now spread out across many more edits, making the impact of dirty diffs on any one edit much smaller.
 * Parsoid and the VE are not hampered by having to accurately preserve whitespace under all circumstances, which is ultimately unnecessary. What is more important is that dirty diffs not overwhelm any single edit.

Potential pitfalls:


 * Of course, the success of this strategy hinges on the ability of Parsoid to assign accurate (s,e) offsets to DOM nodes, and the ability of the visual editor to ignore non-significant whitespace without much trouble.
 * As you can imagine, templates could be a problem and complicate this strategy:
 * Let us first consider templates that output well-formed HTML. For proper roundtripping, the generated DOM will have to track source template information.  So, by treating (s,e) offsets as relative to the file/template attribute of the nearest ancestor carrying such an attribute, we will be able to correlate (s,e) offsets with the correct wikitext stream (file/template).
 * Let us now consider templates that don't output well-formed HTML. For example oo ==> DOM ==>  t1    pool    t2  .  While it is harder to assign offsets to individual nodes in this DOM (we can actually assign them to a subset of the nodes), we can still assign DOM offsets to the root of this subtree and still canonicalize whitespace here.  As long as the user does not modify the content of this subtree, we can serialize it correctly by extracting the wikitext string oo .  However, if the user does modify pieces of the subtree, we will be introducing dirty whitespace diffs on a larger section of the subtree.  How big of a problem is this going to be?
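
For the well-formed template case, the offset lookup could look something like the sketch below. Everything here is an assumption made for illustration: the SourcedNode class, the source attribute (standing in for the file/template attribute), the streams dictionary, and the page and template contents. It also assumes that a node's own (s,e) is interpreted against its nearest strict ancestor carrying the attribute, with the root falling back to its own attribute; the notes above do not settle that detail.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class SourcedNode:
    """Hypothetical DOM node that may carry a file/template attribute."""
    tag: str
    span: Optional[Tuple[int, int]] = None  # (s, e) within some wikitext stream
    source: Optional[str] = None            # hypothetical file/template attribute
    parent: Optional["SourcedNode"] = None


def nearest_source(node: SourcedNode) -> Optional[str]:
    """Name of the stream that node.span refers to (assumed lookup rule)."""
    # Start at the parent; the root has no parent, so it uses its own attribute.
    cur = node.parent if node.parent is not None else node
    while cur is not None:
        if cur.source is not None:
            return cur.source
        cur = cur.parent
    return None


def original_wikitext(node: SourcedNode, streams: Dict[str, str]) -> Optional[str]:
    """Extract the wikitext for node from the stream its offsets refer to."""
    name = nearest_source(node)
    if name is None or node.span is None:
        return None
    s, e = node.span
    return streams[name][s:e]


# Made-up page and template contents, keyed by stream name.
streams = {
    "page": "intro {{t1}} outro",
    "Template:t1": "'''hello'''  world",
}
root = SourcedNode("body", span=(0, 18), source="page")
wrapper = SourcedNode("span", span=(6, 12), source="Template:t1", parent=root)
bold = SourcedNode("b", span=(0, 11), parent=wrapper)

print(original_wikitext(wrapper, streams))  # {{t1}}
print(original_wikitext(bold, streams))     # '''hello'''
```

Here the wrapper's offsets point into the page, so an untouched template use round-trips as the original transclusion, while the offsets of nodes inside the wrapper point into the template's own wikitext.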