User:PerfektesChaos/WikidiffLX/suggestions

Further details on the suggestions: Notes on implementation and methods.

The coding as such is detailed here.

Improve modified consecutive lines
Objective: The line-based algorithm makes minor corrections in consecutive lines appear as a dramatic change.

Example: results currently in:

This shall be made more readable by an improved algorithm:

Background: Line comparison algorithms were developed in the 1970s. Due to the size of a punched card, no program line was longer than 80 characters. With lines that short, line comparison gives meaningful results when detecting modified and unchanged sections.

However, wikitext lines (I would like to call them ‘paragraphs’ here) may consist of 1000 characters and more, containing several sentences in human language. Any slight change, even a single space, marks the entire paragraph as ‘changed’. The line comparison algorithm has to skip over it and look for the next recovery point with absolute identity.

Suggestion: The following rules illustrate how splitting is to be performed. Figures like 100 and 250 merely illustrate the envisioned size; they are set by #define and may be chosen as desired. The choice of period+space as a break point is guided by the following assumption: A long paragraph is quite likely written in human language. If the author rephrases something, the entire sentence may be modified, while the following sentence is hopefully left unchanged. A common separator in human languages is a period followed by a space; the period might have another meaning (e.g. an abbreviation), but that doesn’t matter. There are many other ways to terminate a sentence in various human languages, but we won’t look for them, for the sake of simplicity and speed of the program.
 * 1) Split long lines by insertion of 'virtual' breaks.
 * 2) Run the diff engine as is (cheat by virtual lines).
 * 3) Analyze the result and merge adjacent virtual lines.
 * 4) Display the differences based on the original paragraphs.
 * If a line is longer than 250 bytes
 ** search a possible position for breaking
 *** start at byte 100
 *** look for ". " (period+space, \x2E\x20)
 *** if remaining length is longer than 100 bytes
 **** insert virtual break
 **** if remaining length is longer than 250 bytes
 ***** start at byte 100 of the remainder as above
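The splitting rule above can be sketched as follows. This is only an illustration under assumptions: the function name <tt>findVirtualBreaks</tt>, the macro names and the choice to return start offsets are hypothetical and not taken from the suggested code. The break is placed directly after the period+space, so both characters stay with the preceding virtual line.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Illustrative sizes only; macro names are hypothetical.
#define VIRTUAL_MIN_PART 100u   // minimum size of a virtual line
#define VIRTUAL_MAX_LINE 250u   // only longer lines are split

// Returns the byte offsets at which each virtual line starts.
// The first entry is always 0 (the start of the paragraph).
std::vector<std::size_t> findVirtualBreaks(const std::string& line)
{
    std::vector<std::size_t> starts;
    starts.push_back(0);
    std::size_t begin = 0;
    // Keep splitting while the current remainder is longer than 250 bytes.
    while (line.size() - begin > VIRTUAL_MAX_LINE) {
        // Start searching at byte 100 of the current remainder.
        std::size_t hit = line.find(". ", begin + VIRTUAL_MIN_PART);
        if (hit == std::string::npos) {
            break;                      // no candidate position left
        }
        std::size_t next = hit + 2;     // first byte after period+space
        if (line.size() - next <= VIRTUAL_MIN_PART) {
            break;                      // remainder would be too short
        }
        starts.push_back(next);         // insert virtual break
        begin = next;
    }
    return starts;
}
```

A single <tt>find</tt> call per virtual line keeps the cost low, which matches the stated goal of simplicity and speed.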

Splitting yields:

The result of the diff engine is an amalgamated sequence of parts, each containing the original from and to as well as an op code (details here). Note that from and to lines with the same op code are already merged into line arrays by the diff engine.

The example above results in the traditional diff engine output:

Following the suggested insertion of virtual line breaks, the recovery points are found:

This leads to the following rules for merging of virtual line breaks, showing any possible combination of op codes:

Example for merging hard and virtual line break:

Coding
Suggestions for code are made
 * by an extension of explodeLines for detection of virtual breaks.
 * by various modifications in the aftermath when re-adjusting the diff engine result.

Open Issue
Once the method has been implemented successfully, it might be extended to U+3002 (ideographic full stop) in CJK as well as to the detection of exclamation or question marks. This would be similar to a regexp and might affect performance, since it needs more than a simple find call.
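To give an impression of the extra work, here is a minimal sketch of such an extended search as a hand-written byte scan instead of a regexp. The function name <tt>findSentenceEnd</tt> is hypothetical. It assumes UTF-8 input, where U+3002 is encoded as the bytes E3 80 82, and it assumes that no following space is required after the CJK full stop.

```cpp
#include <cstddef>
#include <string>

// Returns the offset of the first byte after a sentence terminator found
// at or after 'from', or std::string::npos if there is none.
// Terminators: ". ", "! ", "? " and U+3002 (UTF-8: E3 80 82).
std::size_t findSentenceEnd(const std::string& text, std::size_t from)
{
    for (std::size_t i = from; i < text.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(text[i]);
        // ASCII terminator followed by a space
        if ((c == '.' || c == '!' || c == '?')
            && i + 1 < text.size() && text[i + 1] == ' ') {
            return i + 2;
        }
        // U+3002 IDEOGRAPHIC FULL STOP, three bytes in UTF-8
        if (c == 0xE3 && i + 2 < text.size()
            && static_cast<unsigned char>(text[i + 1]) == 0x80
            && static_cast<unsigned char>(text[i + 2]) == 0x82) {
            return i + 3;
        }
    }
    return std::string::npos;
}
```

Unlike a single <tt>find</tt> call, this inspects every byte with several comparisons, which is the performance concern raised above.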

Avoid confusion by empty lines
Objective: If an “empty line” (maybe with some invisible content) has been inserted or removed by the author and the adjacent paragraphs are modified in some way, the presentation of the result is currently disturbed.

Currently:

This could be remedied as:

Solution: The method is already in effect for Word objects:

As currently performed for Word strings, compare only the visible body of a Line in the DiffEngine. Administrate separately:
 * a suffix: spaces between the last visible character and the line termination (if a virtual break was identified by period+space, there is a suffix of at least one space)
 * trailing lines: after a hard break, any further paragraph without any visible content (ASCII spaces and \t for the moment; might be extended later to invisible Unicode)

Postprocess the Line objects for reconstruction of the original paragraphs, if there are any changes.
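A minimal sketch of this idea, assuming a simplified stand-in for the Line object: the struct layout, the method <tt>sameInvisible</tt> and the function <tt>splitIntoLines</tt> are hypothetical names for illustration, not the suggested Line.cpp. Only the visible body takes part in comparison; the suffix and the trailing invisible lines are stored for later reconstruction.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified Line: body is compared, the rest is only stored.
struct Line {
    std::string body;                   // visible part, used by the diff
    std::string suffix;                 // spaces/tabs between body and line end
    std::vector<std::string> trailing;  // following lines without visible content

    // The diff engine sees only the visible body.
    bool operator==(const Line& other) const { return body == other.body; }

    // Used in postprocessing: did only invisible parts change?
    bool sameInvisible(const Line& other) const {
        return suffix == other.suffix && trailing == other.trailing;
    }
};

// Splits text at '\n' and attaches blank lines to the preceding Line.
std::vector<Line> splitIntoLines(const std::string& text)
{
    std::vector<Line> lines;
    std::size_t pos = 0;
    while (pos <= text.size()) {
        std::size_t end = text.find('\n', pos);
        if (end == std::string::npos) {
            end = text.size();
        }
        std::string raw = text.substr(pos, end - pos);
        std::size_t last = raw.find_last_not_of(" \t");
        if (last == std::string::npos && !lines.empty()) {
            // No visible content: hide it behind the previous line.
            lines.back().trailing.push_back(raw);
        } else {
            Line line;
            std::size_t bodyLen = (last == std::string::npos) ? 0 : last + 1;
            line.body = raw.substr(0, bodyLen);
            line.suffix = raw.substr(bodyLen);
            lines.push_back(line);
        }
        pos = end + 1;
    }
    return lines;
}
```

With this arrangement an inserted empty line no longer produces a separate line object in the diff; it shows up only when <tt>sameInvisible</tt> is checked during reconstruction.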

Coding
Suggestions for code are made
 * by an extension of <tt>explodeLines</tt> for detection of empty lines.
 * by introduction of Line.cpp for storing trailing invisible lines.
 * by various modifications in the various <tt>printLines</tt> functions when presenting the diff engine result.

Recovering the hidden invisible lines and invisible trailing whitespace changes needs quite a lot of simple decisions but won’t impair performance. Hiding lines, on the other hand, makes comparison faster and requires fewer objects for invisible and empty lines. The outcome will pay off.

Improve visualization of context lines
Objectives: Currently any kind of line is displayed when two lines preceding or following a changed block are meant to give an impression of the unchanged context:
 * 1) If one or both of these lines are empty, they are put into the HTML source but are invisible and not informative.
 * 2) If these lines are very long (sometimes 1000 bytes each and more), both paragraphs are displayed anyway, making the output lengthy and hard to survey.
The suggested code has a slightly different behaviour:
 * 1) The nearest non-empty (visible) lines will be shown.
 * 2) Not two paragraphs but the next two virtual lines (each expected to be at least 100 bytes, if not a full paragraph) will be displayed.
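The first rule, choosing the nearest visible lines, can be sketched as below for the context preceding a changed block. The function name <tt>contextBefore</tt> and its parameters are hypothetical; the sketch treats a line as empty if it contains only ASCII spaces and tabs, and it does not cover the virtual-line rule.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns the indices of up to 'wanted' nearest non-empty lines that
// precede the changed block starting at index 'changeStart'.
// Precondition: changeStart <= lines.size().
std::vector<std::size_t> contextBefore(const std::vector<std::string>& lines,
                                       std::size_t changeStart,
                                       std::size_t wanted)
{
    std::vector<std::size_t> picked;
    std::size_t i = changeStart;
    while (i > 0 && picked.size() < wanted) {
        --i;
        // Skip lines that have no visible content.
        if (lines[i].find_first_not_of(" \t") != std::string::npos) {
            picked.insert(picked.begin(), i);   // keep document order
        }
    }
    return picked;
}
```

The context following a changed block would be collected the same way, walking forward instead of backward.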


 * Another idea: context \n could be represented as &lt;br />
 * Are line numbers still meaningful? They are common when comparing lines of source code, but the wiki author hasn’t a clue where line 57 may be located, and some lines have 3000 characters, others just 15 (scrollbar position doesn’t help).

Visualize space-only differences
Objective: The current standard diff function doesn’t show space-only differences. Readers find identical black text and may guess that the reason is a superfluous space character somewhere, or perhaps a period changed into a comma?

Example for visualized space difference:
 old: The old lady looks  confused.
 new: The young girl looks confused.
 * Make space-only differences visible, including leading/trailing space.
 * Enable the reader to count the number of spacing characters.
 * Don’t confuse the reader with space-▯ if there are visible changes: If one of the adjacent words is already red, the space difference is negligible.
 * Show not only different number of spacing characters, but also varying types (currently only: ASCII Space U+0020 and HorTab U+0009, but there are many more spaces like U+2004-200A).

Coding
The suggested code enables wikidiff2 to make this visible. Performance is not noticeably affected, since the diff algorithm is not touched. Only where a difference has already been found is the display of the result improved.

Changes:
 * Word is extended by two methods equals_suffix and get_suffixlength
 * WikidiffLX is extended by printWordDiffSideBlack
 * printWordDiffSide needs to be modified by lastBlack flag in case of copy
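The two Word methods named above could look roughly like this. This is a sketch under assumptions: the Word here is a simplified stand-in holding two strings, the method signatures are guesses, and <tt>markSuffix</tt> is a hypothetical helper showing one countable marker per spacing character (▯ for a space, → for a tab).

```cpp
#include <cstddef>
#include <string>

// Hypothetical, simplified Word: visible body plus trailing whitespace.
struct Word {
    std::string body;
    std::string suffix;

    // The diff algorithm compares bodies only, so it stays untouched.
    bool operator==(const Word& w) const { return body == w.body; }

    // True if the trailing whitespace is identical in number and type.
    bool equals_suffix(const Word& w) const { return suffix == w.suffix; }

    // Number of bytes in the trailing whitespace.
    std::size_t get_suffixlength() const { return suffix.size(); }
};

// Renders the suffix with one marker per spacing character,
// so the reader can count them.
std::string markSuffix(const Word& w)
{
    const char* const spaceMark = "\xE2\x96\xAF";  // U+25AF
    const char* const tabMark   = "\xE2\x86\x92";  // U+2192
    std::string out;
    for (char c : w.suffix) {
        out += (c == '\t') ? tabMark : spaceMark;
    }
    return out;
}
```

For two words reported as a copy, a display routine would call <tt>equals_suffix</tt> and emit the marked suffix only when it returns false.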

Open Issue
Leading whitespace is currently not present in <tt>worddiff[]</tt>. If the improvement above is adopted in principle, there are two ways to integrate this feature:
 * 1) Modify explodeWords so that it doesn't skip a leading break. This requires a Word with bodyStart=bodyEnd=0. That might confuse the operators and String, and the diff algorithm could be disturbed.
 * 2) Provide printWordDiffSide with both text1 and text2. If worddiff[0].op==DiffOp::copy, visualize leading spaces, if there are any and they differ, and continue with the loop.

Visualize non-ASCII spaces
If visualization of space-only differences is adopted, the methodology might be extended to other types of space.

Objective: Other spaces shall be treated like ASCII space. This goes for both word-splitting and displaying of modification.

Affected unicodes:
 * <tt>2002;EN SPACE</tt>
 * <tt>2003;EM SPACE</tt>
 * <tt>2004;THREE-PER-EM SPACE</tt>
 * <tt>2005;FOUR-PER-EM SPACE</tt>
 * <tt>2006;SIX-PER-EM SPACE</tt>
 * <tt>2007;FIGURE SPACE</tt>
 * <tt>2008;PUNCTUATION SPACE</tt>
 * <tt>2009;THIN SPACE</tt>
 * <tt>200A;HAIR SPACE</tt>
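In UTF-8 the code points U+2002 to U+200A are the three-byte sequences E2 80 82 to E2 80 8A, so they can be recognised with a few byte comparisons. The following is a sketch only; the function name <tt>spaceLengthAt</tt> is hypothetical.

```cpp
#include <cstddef>
#include <string>

// Returns the byte length of the space character starting at s[i]:
// 1 for ASCII space or tab, 3 for U+2002..U+200A, 0 if s[i] is no space.
std::size_t spaceLengthAt(const std::string& s, std::size_t i)
{
    if (i >= s.size()) {
        return 0;
    }
    unsigned char c = static_cast<unsigned char>(s[i]);
    if (c == ' ' || c == '\t') {
        return 1;
    }
    // Multi-byte check is reached only for the lead byte 0xE2.
    if (c == 0xE2 && i + 2 < s.size()
        && static_cast<unsigned char>(s[i + 1]) == 0x80) {
        unsigned char c2 = static_cast<unsigned char>(s[i + 2]);
        if (c2 >= 0x82 && c2 <= 0x8A) {
            return 3;
        }
    }
    return 0;
}
```

Returning the byte length lets the same call serve both word splitting (skip that many bytes) and display (one marker per space).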

Coding
Looking into the existing code, I found no suitable place to fit this feature into the procedure. Moreover, I was quite confused and got the impression that there has been a long history of amendments. The resulting state did not seem very efficient. Therefore I decided to rewrite the entire <tt>explodeWords</tt> business, aiming for clearer and faster execution.

I think the procedure is much more clearly arranged now, enabling further changes and reducing the future effort of extending it. The code is also faster now, since:
 * The loop over all characters is run just once.
 * Thai sequence is investigated only if there are really Thai characters in text; more than 99 % of edits won't contain them.
 * Even if there is a Thai string only that particular sequence is broken into words. (See below for Thai issues)
 * UTF-8 analysis is started if UTF-8 encoding really starts, not for every plain English ASCII letter.
 * "inline" functions of just one line integrated for the moment, nowhere else used. May be extracted later if conditions can be shared between method units.

Additional benefit:
 * CJK recognition is extended.
 * Non-ASCII spaces are used for word separation.

Visualize zero-width differences
Objective: There are modified words where the modification remains invisible, since a zero-width character has been added or removed. The user encounters two red words without any visible difference.

Affected unicodes:
 * <tt>00AD;SOFT HYPHEN</tt>  &amp;shy;
 * <tt>200B;ZERO WIDTH SPACE</tt>
 * <tt>200C;ZERO WIDTH NON-JOINER</tt>   &amp;zwnj;
 * <tt>200D;ZERO WIDTH JOINER</tt>  &amp;zwj;
 * <tt>200E;LEFT-TO-RIGHT MARK</tt>  &amp;lrm;
 * <tt>200F;RIGHT-TO-LEFT MARK</tt>  &amp;rlm;
 * <tt>202A;LEFT-TO-RIGHT EMBEDDING</tt>
 * <tt>202B;RIGHT-TO-LEFT EMBEDDING</tt>
 * <tt>202C;POP DIRECTIONAL FORMATTING</tt>
 * <tt>202D;LEFT-TO-RIGHT OVERRIDE</tt>
 * <tt>202E;RIGHT-TO-LEFT OVERRIDE</tt>

Example (invisible characters shown as HTML entities):
 old: Meaning&amp;shy;less to &amp;rlm;change&amp;lrm; direction.
 new: Meaningless to change direction.
The users have no clue why these words are red, so give ’em one.

Any red word is affected. Deleted and added lines are not considered.

Coding
Replace appropriate function calls for <tt>printText</tt> by new function <tt>printTextRed</tt> which displays any invisible character in text as replacement symbol.
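The replacement step could be sketched as follows. This is not the actual <tt>printTextRed</tt>; the helper name <tt>markInvisible</tt> and the choice of ▯ (U+25AF) as replacement symbol are assumptions. It relies on the UTF-8 encodings of the characters listed above: U+00AD is C2 AD, U+200B to U+200F are E2 80 8B to E2 80 8F, and U+202A to U+202E are E2 80 AA to E2 80 AE.

```cpp
#include <cstddef>
#include <string>

// Returns a copy of 'text' in which every zero-width or directional
// formatting character is replaced by a visible marker.
std::string markInvisible(const std::string& text)
{
    static const char marker[] = "\xE2\x96\xAF";   // U+25AF, assumed symbol
    std::string out;
    std::size_t i = 0;
    while (i < text.size()) {
        unsigned char c = static_cast<unsigned char>(text[i]);
        // U+00AD SOFT HYPHEN: C2 AD
        if (c == 0xC2 && i + 1 < text.size()
            && static_cast<unsigned char>(text[i + 1]) == 0xAD) {
            out += marker;
            i += 2;
            continue;
        }
        // U+200B..U+200F and U+202A..U+202E: E2 80 xx
        if (c == 0xE2 && i + 2 < text.size()
            && static_cast<unsigned char>(text[i + 1]) == 0x80) {
            unsigned char c2 = static_cast<unsigned char>(text[i + 2]);
            if ((c2 >= 0x8B && c2 <= 0x8F) || (c2 >= 0xAA && c2 <= 0xAE)) {
                out += marker;
                i += 3;
                continue;
            }
        }
        out += text[i];     // everything else is copied byte by byte
        ++i;
    }
    return out;
}
```

Since this would run only on words already marked as changed, the extra comparisons do not touch the diff algorithm itself.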

Open Issue
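The same story goes for the control characters U+007F-009F. They result mainly from Windows code pages (like CP-1250/1252) and won’t be displayed at all by many browsers, or at best as a replacement character. They can be captured easily in the same place: at the start of a character, any byte above 0x7E is bad unless it is a UTF-8 lead byte, and valid lead bytes start at 0xC2; properly encoded U+0080-009F appear as 0xC2 followed by 0x80-0x9F.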

Thai
The wikidiff2 implementation always invokes Thai handling, for any diff result being displayed. In the rewrite this was moved out of <tt>explodeWords</tt> into a separate function that is called only if a Thai character is actually detected; more than 99 % of edits won’t contain any. wikidiff2 had to run twice over all characters, while there is now a single loop over all characters. Therefore the code is faster now.

Coding
<tt>explodeWordsThai</tt> examines a substring consisting of Thai characters only. Separation is added to the Word segmentation.
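How the main loop might detect such a substring can be sketched as below. The helper names <tt>isThaiAt</tt> and <tt>thaiRunLength</tt> are hypothetical, and the actual word segmentation inside the Thai substring is not shown. The sketch assumes UTF-8 input, where the Thai block U+0E00 to U+0E7F is encoded as E0 B8 80 to E0 B9 BF.

```cpp
#include <cstddef>
#include <string>

// True if a three-byte UTF-8 sequence from the Thai block starts at s[i].
static bool isThaiAt(const std::string& s, std::size_t i)
{
    if (i + 2 >= s.size()) {
        return false;
    }
    unsigned char b0 = static_cast<unsigned char>(s[i]);
    unsigned char b1 = static_cast<unsigned char>(s[i + 1]);
    unsigned char b2 = static_cast<unsigned char>(s[i + 2]);
    return b0 == 0xE0
        && (b1 == 0xB8 || b1 == 0xB9)
        && (b2 & 0xC0) == 0x80;         // valid continuation byte
}

// Returns the byte length of the run of Thai characters starting at s[i],
// or 0 if s[i] does not start a Thai character.
std::size_t thaiRunLength(const std::string& s, std::size_t i)
{
    std::size_t start = i;
    while (isThaiAt(s, i)) {
        i += 3;
    }
    return i - start;
}
```

The single loop over all characters would call <tt>thaiRunLength</tt> only when it meets the lead byte 0xE0, hand the run to the Thai segmentation, and then continue after it.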

CJK extension
<tt>explodeWords</tt> may be improved for detecting CJK punctuation like U+3000 and U+3002 in the same manner Thai characters are handled.