User:PerfektesChaos/WikidiffLX

From mediawiki.org

The current standard diff (Wikidiff2) suffers from some limitations. The implementation can be improved somewhat without a major decrease in performance. A million users worldwide should not be flabbergasted by an entirely new appearance. The basic algorithm (a line-based diff engine) is quite fast and should be kept. However, preparing the input and displaying the results could be enhanced.

Six improvements are suggested.

Performance is not expected to be impaired. The suggested code is longer, but it saves the existing and untouched diff engine a lot of work, e.g. by hiding empty lines. Showing the differences in a better way should justify some slight additional resources.

The implementation for all suggestions has been done recently. For some simple cases the new code has been tested locally already. More sophisticated examples need to be designed now in order to detect obstacles. A statement on performance can't be made at this stage.


Improve in 2011

Avoid confusion by empty lines

Objective: If an “empty line” (maybe with some invisible content) has been inserted or removed by the author and the adjacent paragraphs are modified in some way, the presentation of the result is currently disturbed.

Example:

previous line        | previous line
small modification   |
change little        |
                     | minor modification
                     |
                     | change slightly
following line       | following line

Could be presented as:

previous line        | previous line
small modification   | minor modification
                     |
change little        | change slightly
following line       | following line
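
The idea of hiding empty lines from the diff engine can be sketched roughly as follows. This is only an illustration in Python, not the WikidiffLX code: `hide_empty_lines` is a hypothetical helper, and `difflib.SequenceMatcher` merely stands in for the line-based engine. Lines containing only zero-width characters would need extra handling, since `str.strip()` removes whitespace only.

```python
from difflib import SequenceMatcher


def hide_empty_lines(lines):
    """Return (visible_lines, original_indices), dropping whitespace-only lines.

    The indices allow the hidden lines to be put back when rendering.
    """
    kept = [(i, line) for i, line in enumerate(lines) if line.strip()]
    return [line for _, line in kept], [i for i, _ in kept]


old = ["previous line", "small modification", "change little", "following line"]
new = ["previous line", "minor modification", "", "change slightly", "following line"]

old_visible, old_index = hide_empty_lines(old)
new_visible, new_index = hide_empty_lines(new)

# Stand-in for the line-based engine. With the empty line hidden, the changed
# block has two lines on each side, so they can be paired row by row.
opcodes = SequenceMatcher(None, old_visible, new_visible, autojunk=False).get_opcodes()
```

With the empty line hidden, the replaced block covers the same number of lines on both sides, which is what allows the side-by-side pairing shown above.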


Improve modified consecutive lines

Objective: The line based algorithm makes minor corrections in consecutive lines appear as a dramatic change.

Example:

old: previous line
     small modification …whaffle… …blah…
     …Blah… …Whaffle… change little
     following line
new: previous line
     minor modification …whaffle… …blah…
     added line
     …Blah… …Whaffle… change slightly
     following line

currently results in:

previous line                        | previous line
small modification …whaffle… …blah…  |
…Blah… …Whaffle… change little       |
                                     | minor modification …whaffle… …blah…
                                     | added line
                                     | …Blah… …Whaffle… change slightly
following line                       | following line

This shall be made more readable:

previous line                        | previous line
small modification …whaffle… …blah…  | minor modification …whaffle… …blah…
                                     | added line
…Blah… …Whaffle… change little       | …Blah… …Whaffle… change slightly
following line                       | following line
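
One possible way to obtain such a pairing is to align the lines of a replaced block by similarity. The following Python sketch is an assumption for illustration only: `pair_block` and its `threshold` are hypothetical, the greedy in-order matching is a simplification, and `difflib.SequenceMatcher.ratio()` is used as the similarity measure rather than anything from Wikidiff2.

```python
from difflib import SequenceMatcher


def pair_block(old_lines, new_lines, threshold=0.6):
    """Pair similar lines of a replaced block, keeping their order.

    Returns rows (old, new); None marks an empty cell on that side.
    """
    rows = []
    j = 0  # first new line not yet placed in a row
    for old in old_lines:
        match = None
        for k in range(j, len(new_lines)):
            if SequenceMatcher(None, old, new_lines[k]).ratio() >= threshold:
                match = k
                break
        if match is None:
            rows.append((old, None))  # removed line without counterpart
            continue
        # New lines skipped on the way to the match are pure additions.
        rows.extend((None, new_lines[k]) for k in range(j, match))
        rows.append((old, new_lines[match]))
        j = match + 1
    rows.extend((None, new_lines[k]) for k in range(j, len(new_lines)))
    return rows


old_block = ["small modification whaffle blah", "Blah Whaffle change little"]
new_block = ["minor modification whaffle blah", "added line", "Blah Whaffle change slightly"]
rows = pair_block(old_block, new_block)
```

Here `rows` holds three entries: the first lines paired, "added line" on the new side only, and the last lines paired.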


Improve visualization of context lines

Objectives: Currently the two lines preceding and following a changed block are displayed as unchanged context, whatever kind of line they are.

  1. If one or both of these lines are empty, they are put into the HTML source but remain invisible and uninformative.
  2. If these lines are very long (sometimes 1000 bytes or more each), both paragraphs are displayed anyway, making the output lengthy and hard to survey.

The suggested code behaves slightly differently:

  1. The nearest non-empty (visible) lines are shown.
  2. Instead of two paragraphs, the next two unchanged virtual lines are displayed (each expected to be at least 100 bytes long, unless it is a full paragraph).
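
A rough Python sketch of this selection for the context preceding a changed block follows. The exact definition of a virtual line is not given here, so the sketch assumes a chunk of at least `min_len` characters (not bytes) that is cut at the next space; `split_virtual` and `context_before` are hypothetical names.

```python
def split_virtual(text, min_len=100):
    """Cut a paragraph into chunks of at least min_len characters, at spaces.

    The last chunk may be shorter. This chunking rule is an assumption.
    """
    chunks = []
    start = 0
    while start < len(text):
        cut = text.find(" ", start + min_len)
        if cut == -1:
            chunks.append(text[start:])
            break
        chunks.append(text[start:cut])
        start = cut + 1
    return chunks


def context_before(lines, block_start, count=2, min_len=100):
    """Collect up to `count` visible virtual lines just before a changed block."""
    result = []
    i = block_start - 1
    while i >= 0 and len(result) < count:
        if lines[i].strip():  # skip empty and whitespace-only lines
            for chunk in reversed(split_virtual(lines[i], min_len)):
                result.append(chunk)
                if len(result) == count:
                    break
        i -= 1
    result.reverse()  # restore reading order
    return result
```

For `["alpha", "", "beta", "   ", "CHANGED"]` with the changed block at index 4, this yields `["alpha", "beta"]`; for one very long paragraph, only its last two chunks are returned.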

Visualize space-only differences

Objective: The current function doesn’t show space-only differences. Readers see identical black text and can only guess whether the reason is a superfluous space character somewhere, or perhaps a period changed into a comma.

Example:

old: The old  lady looks  confused.
new: The young girl looks confused.
The old lady looks▯▯confused. | The young girl looks▯confused.
  • Make space-only differences visible, including leading/trailing space.
  • Enable the reader to count the number of spacing characters.
  • Don’t confuse the reader with space-▯ if there are visible changes: if one of the adjacent words is already red, the space difference is negligible.
  • Show not only a different number of spacing characters, but also varying types (currently only ASCII space U+0020 and horizontal tab U+0009, but there are many more spaces, such as U+2004 to U+200A).
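
The first three rules could be sketched as below. This is a simplified Python illustration under stated assumptions: `mark_space_diffs` is a hypothetical name, both lines must split into the same number of tokens (so a space that exists on one side only is not handled), and the type of spacing character is not distinguished.

```python
import re

MARK = "\u25AF"  # the ▯ placeholder used in the examples


def tokens(line):
    """Split a line into alternating runs of whitespace and non-whitespace."""
    return re.findall(r"\s+|\S+", line)


def _neighbour_changed(a, b, i):
    """True if the token before or after position i differs between a and b."""
    return any(0 <= k < len(a) and a[k] != b[k] for k in (i - 1, i + 1))


def mark_space_diffs(old, new):
    """Replace differing whitespace runs by one MARK per spacing character."""
    a, b = tokens(old), tokens(new)
    if len(a) != len(b):
        return old, new  # structure differs: outside the scope of this sketch
    out_a, out_b = [], []
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y and x.isspace() and y.isspace() and not _neighbour_changed(a, b, i):
            x, y = MARK * len(x), MARK * len(y)
        out_a.append(x)
        out_b.append(y)
    return "".join(out_a), "".join(out_b)


shown_old, shown_new = mark_space_diffs(
    "The old  lady looks  confused.", "The young girl looks confused."
)
```

The double space after "old" stays unmarked because the adjacent words differ anyway, while the run between "looks" and "confused." is marked once per spacing character on each side.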


Visualize non-ASCII spaces

Objective: Other spaces shall be treated like ASCII space. This goes for both word-splitting and displaying of modification.

Affected Unicode characters:

2002;EN SPACE
2003;EM SPACE
2004;THREE-PER-EM SPACE
2005;FOUR-PER-EM SPACE
2006;SIX-PER-EM SPACE
2007;FIGURE SPACE
2008;PUNCTUATION SPACE
2009;THIN SPACE
200A;HAIR SPACE
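
For word splitting, treating these code points like ASCII space might look as follows in Python. `split_words` and `space_names` are hypothetical helpers for illustration; the character class covers ASCII space, horizontal tab and U+2002 to U+200A as listed above.

```python
import re
import unicodedata

# ASCII space, horizontal tab and the spaces U+2002..U+200A listed above.
SPACE_RUN = re.compile(r"([ \t\u2002-\u200A]+)")


def split_words(text):
    """Split text into words and spacing runs, keeping the runs as tokens."""
    return [piece for piece in SPACE_RUN.split(text) if piece]


def space_names(run):
    """Name each spacing character, e.g. to tell the reader which type changed."""
    return [unicodedata.name(ch, "U+%04X" % ord(ch)) for ch in run]


pieces = split_words("one\u2003two\u2009 three")
```

Here `pieces` is `["one", "\u2003", "two", "\u2009 ", "three"]`, and `space_names(pieces[3])` gives `["THIN SPACE", "SPACE"]`.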


Visualize zero-width differences

Objective: There are modified words where the modification remains invisible because a zero-width character has been added or removed. The user encounters two red words without any visible difference.

Affected Unicode characters:
  • 00AD;SOFT HYPHEN   ­
  • 200B;ZERO WIDTH SPACE
  • 200C;ZERO WIDTH NON-JOINER   ‌
  • 200D;ZERO WIDTH JOINER   ‍
  • 200E;LEFT-TO-RIGHT MARK   ‎
  • 200F;RIGHT-TO-LEFT MARK   ‏
  • 202A;LEFT-TO-RIGHT EMBEDDING
  • 202B;RIGHT-TO-LEFT EMBEDDING
  • 202C;POP DIRECTIONAL FORMATTING
  • 202D;LEFT-TO-RIGHT OVERRIDE
  • 202E;RIGHT-TO-LEFT OVERRIDE

Example (invisible characters shown as HTML entities):

old: Meaning&shy;less to &rlm;change&lrm; direction.
new: Meaningless to change direction.
Meaningless to change direction. | Meaningless to change direction.

Users have no clue why these words are red, so give them one:

Meaning▯less to ▯change▯ direction. | Meaningless to change direction.

Any red word is affected. Deleted and added lines are not considered.
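
A minimal Python sketch of this marking, assuming the changed words have already been paired by the diff engine (`reveal` and `mark_changed_words` are hypothetical names):

```python
MARK = "\u25AF"  # the ▯ placeholder used in the examples

# Code points from the list above.
ZERO_WIDTH = {0x00AD} | set(range(0x200B, 0x2010)) | set(range(0x202A, 0x202F))


def reveal(word):
    """Replace every zero-width character in a word by a visible mark."""
    return "".join(MARK if ord(ch) in ZERO_WIDTH else ch for ch in word)


def mark_changed_words(old_words, new_words):
    """Reveal zero-width characters only in word pairs that differ ("red" words)."""
    old_out, new_out = [], []
    for old, new in zip(old_words, new_words):
        if old != new:
            old, new = reveal(old), reveal(new)
        old_out.append(old)
        new_out.append(new)
    return old_out, new_out


old_shown, new_shown = mark_changed_words(
    ["Meaning\u00ADless", "to", "\u200Fchange\u200E", "direction."],
    ["Meaningless", "to", "change", "direction."],
)
```

Unchanged words are left untouched, so text that is not red never shows a mark.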

Further ideas

Punctuation, pipe

The word separation algorithm might limit the extent of red-marked changed regions by splitting at more than just whitespace separators.

  • General punctuation characters such as Latin ,.;!?'" should be considered. Recently there have been complaints about compilation of spring 2011 changes in fall 2011.
  • Wikisyntax characters should delimit modified pieces (“words”), especially [|]{}<>=:/_ since a keyword change within a template should not highlight the entire transclusion.
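
The effect of such separators can be illustrated with a small Python sketch. It is not the Wikidiff2 word splitter: `split_pieces` and `changed_pieces` are hypothetical, and `difflib.SequenceMatcher` is used only to compare the resulting pieces.

```python
import re
from difflib import SequenceMatcher

PUNCTUATION = ",.;!?'\""
WIKISYNTAX = "[|]{}<>=:/_"
SEPARATOR = re.compile(r"(\s+|[" + re.escape(PUNCTUATION + WIKISYNTAX) + r"])")


def split_pieces(text):
    """Split at whitespace runs and at every single separator character."""
    return [piece for piece in SEPARATOR.split(text) if piece]


def changed_pieces(old, new):
    """Return (old_pieces, new_pieces) for every region that differs."""
    a, b = split_pieces(old), split_pieces(new)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [(a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes() if tag != "equal"]


changes = changed_pieces("{{Infobox|name=Old}}", "{{Infobox|name=New}}")
```

With whitespace as the only separator the whole transclusion would count as one changed word; with the additional separators only `Old` and `New` differ.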

Letter based differences

A request might be heard that differences should be shown on a letter-by-letter basis.

However, I don't think that this is a good idea in general. If a text is changed entirely, some characters might still be identical by coincidence. The word “or” is half the same as the word “on”, which causes both to be displayed half red, half black. This might be quite confusing if the complete sentence was replaced and now has a different meaning.

If ever implemented, the number of changed words per line (= paragraph) should be counted; if there are more than about two changed words, the interior of words should not be inspected. If only about two words differ, and if the word lengths deviate by no more than one character (as a first condition), a further investigation might be done: if only one letter (or punctuation mark) is changed, inserted or removed, that particular character might be displayed in red.
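
The conditions of the last paragraph can be written down as a small Python sketch. All names (`single_char_change`, `letters_to_highlight`, `max_words`) are hypothetical, and the limit of two changed words is taken literally here although the text only says "about two".

```python
def single_char_change(old_word, new_word):
    """Return (position, old_char, new_char) if exactly one character was
    changed, inserted or removed; otherwise None. An empty string stands
    for "no character" on that side."""
    if abs(len(old_word) - len(new_word)) > 1:
        return None  # first condition: lengths deviate by at most one
    shortest = min(len(old_word), len(new_word))
    prefix = 0
    while prefix < shortest and old_word[prefix] == new_word[prefix]:
        prefix += 1
    suffix = 0
    while (suffix < shortest - prefix
           and old_word[-1 - suffix] == new_word[-1 - suffix]):
        suffix += 1
    old_mid = old_word[prefix:len(old_word) - suffix]
    new_mid = new_word[prefix:len(new_word) - suffix]
    if len(old_mid) <= 1 and len(new_mid) <= 1 and (old_mid or new_mid):
        return prefix, old_mid, new_mid
    return None


def letters_to_highlight(changed_pairs, max_words=2):
    """Inspect word interiors only if few words changed in the paragraph."""
    if len(changed_pairs) > max_words:
        return []
    return [single_char_change(old, new) for old, new in changed_pairs]
```

For example, `single_char_change("colour", "color")` yields `(4, "u", "")`, whereas `single_char_change("cat", "dog")` yields `None`, so the whole word stays red.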

Anyway, if this results in a single red period within a paragraph of over 1000 chars, users will complain that they cannot see this little red dot. That might have been the motivation why the adjacent word is highlighted together with the punctuation.

Switch off line numbers

In common line difference tools, line numbers are displayed. Wikidiff2 shows this traditional behaviour, too. However, this is not very meaningful in wiki context:

  • There is no GOTO LINE 145 in wiki edit tools. The author has no idea whether line 145 is at the end or the beginning, or how to find it.
  • The common tools are used to compare lines of source code of software. Line length is limited to some 80 or 130 characters, and old programmers have a certain idea in which 5000 line range the module is located when they intend to change lines of code. Current line number is displayed permanently.

With a wiki page, some line lengths may be over 1000, many lines (table separators with two chars only) can be very short and the editors have absolutely no clue which line they currently see.

Caching

Just a thought; no idea whether feasible, or perhaps already implemented?

  • The most recent diff requests (some 10,000) could be cached.

I have no idea whether dynamic diff pages are already cached. As far as I know, only entire page content and template expansions are stored so far.

Within the first hours and days after a change, several users might be interested in what happened and ask for a diff page. The core diff result might be kept in store for a while. Older diffs will be called less often; a change between two unpredictable versions in the history of three years ago is not expected to be available from the cache. If an issue is of particular interest right now, several diff calls may be triggered. The savings in computation need to be much higher than the work of maintaining a cache infrastructure, all the more so if only two or three users are interested in one specific modification.
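
Just to make the thought concrete, a least-recently-used cache for core diff results could look like this in Python. It is purely illustrative: `DiffCache`, its `compute` callback and the revision-ID key are assumptions and say nothing about how MediaWiki actually caches.

```python
from collections import OrderedDict


class DiffCache:
    """Keep the most recently requested diff results, evicting the oldest."""

    def __init__(self, compute, capacity=10_000):
        self._compute = compute      # hypothetical function (old_rev, new_rev) -> diff
        self._capacity = capacity
        self._store = OrderedDict()  # least recently used entry comes first

    def get(self, old_rev, new_rev):
        key = (old_rev, new_rev)
        if key in self._store:
            self._store.move_to_end(key)  # mark as recently used
            return self._store[key]
        result = self._compute(old_rev, new_rev)
        self._store[key] = result
        if len(self._store) > self._capacity:
            self._store.popitem(last=False)  # drop the least recently used
        return result
```

Frequently requested recent diffs stay in the cache, while a rarely requested diff from three years ago simply falls out again after a while.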

General remarks on diff presentation

There are two basic approaches for presenting differences to the user:

  1. Show older and newer version side by side.
  2. Show the changes within one text and mark up the deviating previous and current text inline.
    This was chosen by WikEdDiff and previously intended by Visual diffs/2008 project. (See also Visual diffs.)

Both methods have advantages and disadvantages, depending on the amount and type of changes. Experienced users might find it easy to understand minor modifications with the inline method, even though the context and grammar of a sentence in human language are lost if it is interrupted several times by insertions from the other sentence. Larger block movements and rearranged sequences of paragraphs are always a problem.

For new users dealing with archeology or football it is reasonable to present both versions undisturbed. Therefore it is a good choice to provide Wikidiff2/WikidiffLX as standard method.

For those who are familiar with interpreting diff results, a change of the presentation type is welcome, to be configured as default or on a case-by-case basis. Using WikEd, the inline presentation can be requested if helpful at the moment.

The identifier chosen for this tiny project derives from Wikidiff Line approach eXtended.


See also