User:SSastry (WMF)/Notes:Whitespace


Notes about handling whitespace

This entire note is predicated on two assumptions:

  • That treating all white-space as significant complicates algorithms and code, and that we can actually simplify code if we don't have to.
  • That it is actually better to progressively normalize wikitext to canonical forms by introducing dirty diffs across a large number of edits. (SSS: To resolve: what is the difference between canonical and normal -- they almost seem like synonyms in my mind.)

Possible strategy for dealing with whitespace:

  1. Parsoid decorates the DOM with wikitext source positions (s, e), i.e. given the original wikitext string S, if a DOM node N has positions (s, e), this means that the HTML represented by N corresponds to wikitext S(s,e). The question to ask here is whether such (s,e) assignment is possible at all nodes N in the DOM. If not, what properties do S/N have to satisfy for such assignment to be possible?
  2. Parsoid can then, if found useful by the visual editor, completely canonicalize all white-space so that all non-significant whitespace is reduced to its minimal form. The same could possibly be done for other nodes (ex: the inline node minimization routine minimize_inline_tags in mediawiki.DOMPostProcessor.js). But, in the interest of minimizing Parsoid overhead, if VE can handle it, we may decide not to run this normalization/canonicalization before passing the DOM to the visual editor.
  3. The visual editor completely ignores non-significant white-space when building the linear model from the DOM. What is critical is that the VE preserve the (s,e) offsets on the DOM tree. In addition, the VE would have to track what parts of the DOM were modified by the user and mark dirty DOM subtrees.
  4. When Parsoid gets the new DOM, as part of serialization, it propagates a new flag (say, has-modified-child) up ancestor paths all the way to the root. Then, for every maximal DOM subtree that is unmodified, it uses the (s,e) values on that subtree's root to render the original wikitext. For every modified subtree, it applies the existing wikitext serialization algorithm, but also uses the opportunity to aggressively canonicalize whitespace and get rid of non-significant whitespace. Ex: given a subtree N-[A,B] where A,B are children of N, and where A is unmodified and B is modified, B's wikitext is serialized (with canonicalized whitespace), A's wikitext is extracted from the original wikitext string S, and N's wikitext is composed based on N's element type and the output of A and B.
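
The four steps above can be sketched as follows. This is a minimal illustration using a hypothetical node shape (plain objects with `dsr` offsets, a `modified` flag, and `wtOpen`/`wtClose` strings) -- none of these names are Parsoid's actual data structures:

```javascript
// Sketch of selective serialization: unmodified subtrees reuse the
// original wikitext via their (s, e) offsets; modified subtrees are
// re-serialized with aggressively canonicalized whitespace.

function hasModifiedDescendant(node) {
  // Propagate a has-modified-child flag up from dirty subtrees.
  if (node.modified) return true;
  return (node.children || []).some(hasModifiedDescendant);
}

function serialize(node, source) {
  // Unmodified subtree with known offsets: reuse the original wikitext.
  if (!hasModifiedDescendant(node) && node.dsr) {
    const [s, e] = node.dsr;
    return source.slice(s, e);
  }
  // Modified subtree: re-serialize children, canonicalizing whitespace.
  if (node.type === 'text') {
    return node.value.replace(/\s+/g, ' ').trim();
  }
  const inner = (node.children || []).map(c => serialize(c, source)).join('');
  return (node.wtOpen || '') + inner + (node.wtClose || '');
}

// Original wikitext: a bold span containing two words.
const src = "'''foo   bar'''";
const tree = {
  type: 'b', dsr: [0, 15], wtOpen: "'''", wtClose: "'''",
  children: [{ type: 'text', dsr: [3, 12], value: 'foo   bar', modified: false }],
};

console.log(serialize(tree, src)); // unmodified: original wikitext reused verbatim
tree.children[0].value = 'foo   baz';
tree.children[0].modified = true;
console.log(serialize(tree, src)); // modified: whitespace is canonicalized
```

Note how an edit anywhere in a subtree dirties that whole subtree: this is what progressively spreads whitespace normalization across many edits instead of one.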

This strategy has a few potential benefits:

  • The visual editor can be white-space insensitive for the most part.
  • Parsoid only has to serialize the smallest pieces of the DOM.
  • The document can be progressively normalized -- i.e. dirty diffs that might be introduced by a 1-time normalization are now spread out across a lot more edits, making the impact of dirty diffs much smaller.
  • Parsoid and VE are not hampered by having to accurately preserve white space under all circumstances, which is ultimately unnecessary. What is more important is that dirty diffs not overwhelm any single edit.
  • This loosens the tight dependence between Parsoid and VE (i.e. the two can be developed a little bit more independently, at least in the white-space realm!). The same applies to any other tools that might work with Parsoid/VE -- they need not worry about significant/semantic whitespace.

Potential pitfalls:

  • Of course, the success of this strategy hinges on the ability of Parsoid to assign accurate (s,e) offsets to DOM nodes, and the ability of the visual editor to ignore non-significant whitespace without much trouble.
  • As you can imagine, templates could be a problem and complicate this strategy:
    • Let us first consider templates that output well-formed HTML. For proper roundtripping, the generated DOM will have to track source template information. So, by treating (s,e) offsets as relative to a file/template attribute of the nearest ancestor carrying such an attribute, we will be able to correlate (s,e) offsets with the correct wikitext stream (file/template).
    • Let us now consider templates that don't output well-formed HTML. For example {{t1}}oo{{t2}} ==> DOM ==> <b><i> t1 </i> oo <i> t2 </i></b>. While it is harder to assign offsets to individual nodes in this DOM (we can actually assign them to a subset of the nodes), we can still assign DOM offsets to the root of this subtree and still canonicalize white space here. As long as the user does not modify the content of this subtree, we can correctly serialize it by extracting the wikitext string {{t1}}oo{{t2}}. However, if the user does modify pieces of the subtree, we are now going to be introducing dirty whitespace diffs on a larger section of the subtree. How big of a problem is this going to be?
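
The well-formed-template case above can be sketched as follows. The node shape, the `stream` attribute name, and both helper functions are hypothetical illustrations, not Parsoid's actual API -- the point is only that offsets resolve against the wikitext stream of the nearest source-carrying ancestor:

```javascript
// Resolve a node's (s, e) offsets against the correct wikitext stream
// (top-level page or template) by walking up to the nearest ancestor
// that records where its subtree's wikitext came from.

function sourceOf(node) {
  for (let n = node; n; n = n.parent) {
    if (n.stream) return n.stream;
  }
  return 'page'; // default: top-level page content
}

function wikitextFor(node, streams) {
  const stream = sourceOf(node);
  const [s, e] = node.dsr; // offsets are relative to that stream's source
  return streams[stream].slice(s, e);
}

// Example: a node generated by template t1, with offsets into t1's own
// source rather than into the page's wikitext.
const streams = { page: '{{t1}}oo{{t2}}', t1: "'''''t1'''''" };
const tmplRoot = { stream: 't1', parent: null };
const child = { parent: tmplRoot, dsr: [5, 7] };

console.log(wikitextFor(child, streams)); // → "t1"
```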

How it is handled currently

Currently, all whitespace tokens are accurately represented and are significant. Parsoid outputs newlines during wikitext parsing, VE preserves them, and Parsoid uses the newlines to properly serialize the output DOM back to wikitext. This requires VE to treat most whitespace as significant, and VE and Parsoid to agree on whitespace semantics. IMPORTANT NOTE: This also works because the user cannot manipulate HTML directly, only wikitext -- this took me a little while to figure out because I was confusing this with rich-text editors where you can switch between presentation and HTML and manipulate the HTML. However, in this case the switch is between presentation and wikitext, not HTML, which means the user cannot manipulate the DOM directly. I am adding it here so I don't get confused again.

In any case, this creates a tighter dependency between VE and Parsoid, and any tool that wants to work with Parsoid/VE would have to respect the same white-space semantics. By treating whitespace as non-significant, we get the added robustness that changes to VE, Parsoid, or any other component that works with them will not have to worry about whitespace.

So, for example, right now, if we treat whitespace as significant, only the following HTML output by VE can be parsed properly; the other forms below will generate incorrect wikitext. So, the VE would have to output only this HTML (or some other canonical form that handles list nesting scenarios properly).

<ul><li>foo
</li><li>bar
</li></ul>

The following 4 equivalent HTML forms that render identically in a browser cannot be serialized back to wikitext properly because the newline positions are significant (determined by the initial wikitext --> DOM parse, in turn determined by how the current PHP parser handles it).

<ul><li>
foo</li><li>
bar</li></ul>


<ul><li>foo</li><li>bar</li></ul>


<ul>
<li>foo</li>
<li>bar</li>
</ul>


<ul>
<li>
foo
</li>
<li>
bar
</li>
</ul>
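
Under the proposed strategy, all of these forms could collapse to one canonical shape before serialization. A naive sketch of that idea (not Parsoid's actual normalization, and deliberately too crude for general HTML -- it would also eat significant spaces adjacent to inline tags like <b>) is to drop whitespace at tag boundaries:

```javascript
// Naive whitespace canonicalization for the list examples above:
// strip whitespace immediately after '>' and immediately before '<',
// so all five renderings collapse to the same canonical string.

function canonicalizeListHtml(html) {
  return html
    .replace(/>\s+/g, '>')  // whitespace after a tag
    .replace(/\s+</g, '<'); // whitespace before a tag
}

const forms = [
  '<ul><li>foo\n</li><li>bar\n</li></ul>',
  '<ul><li>\nfoo</li><li>\nbar</li></ul>',
  '<ul><li>foo</li><li>bar</li></ul>',
  '<ul>\n<li>foo</li>\n<li>bar</li>\n</ul>',
  '<ul>\n<li>\nfoo\n</li>\n<li>\nbar\n</li>\n</ul>',
];

const canonical = forms.map(canonicalizeListHtml);
console.log(new Set(canonical).size); // → 1: every form normalizes identically
```

With such a canonical form, VE would be free to emit whichever of the equivalent renderings is convenient, and the serializer would no longer depend on newline positions chosen by the initial wikitext --> DOM parse.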