Wikidiff2/How we improved Wikidiff2

In 2017/2018 the developer team around WMDE's Technical Wishes Project worked on improvements to the diff algorithm used in MediaWiki, named Wikidiff2. The work originated from a wish by the German speaking community that lead to a new feature in the Wikidiff2 extension: It is now easier to find moved paragraphs and the changes inside them.

In this text we want to give an overview of the approaches we tried and talk about what worked out well and what didn't while working on Wikidiff2.

Where did we come from?
When editors on Wikipedia and other wikis check what a previous edit was about, they look at the "diff": a two-column comparison of two versions of the same page, with the differences highlighted. Removed text appears in yellow on the left, added text in blue on the right. If text is moved, it appears removed on the left, and re-added on the right. Spotting additional changes in moved paragraphs can be quite difficult. The rough idea to solve this was to somehow visualize paragraphs in the diff that are basically the same and that just got moved around.

So in a first step we applied code that compared all paragraphs that are considered additions and deletions with each other, to see if two of them would match a basic similarity check. Based on word level diffs of paragraphs, we calculated the ratio of the changed and unchanged parts for each pair. If that ratio is above a certain threshold we assume that the pair is basically the same paragraph that just got moved from one position to the other. Changes in these pairs were then made visible by applying a word level comparison – a step that would have been skipped before. As last step we improved the formatted output by adding symbols to visualize moves and made them explorable with links and anchors.

Where did that lead us?
We implemented that approach and tested it using a bigger dataset with random text. There we realized that many moved paragraphs could not be detected, because they were not considered additions and deletions in the first place. Instead they were considered as text changes and were paired with unrelated text from accidentally aligned paragraphs:



What did we do?
The above observation made us rethink the original algorithm. In some cases it made sense to have a second look at paragraphs that are already considered a change. We applied the similarity check used before with a slightly different threshold to see if these changes could better be categorized as individual additions and deletions. The check for moved paragraphs was then applied after these reevaluations.

Our test data showed, that we really had better detection and matching. As a result we ended up with two different patches, improving the general algorithm first (356582) and as second step matching moved paragraphs (319866).

What did we run into?
Being quite confident about the improvements in our code, the new Wikidiff2 version got deployed on Beta and Test wikis and eventually on a first set of servers in production.

While users were using the new version we got reports that in some cases the diffs look weird and less intuitive than before (T180259). For some of these we could confirm this just being a feeling (T180259), but in others it seemed, that the similarity check used to determine if a change could be considered an add and deletion produced false positives (T180259).

What went wrong?
Our original approach worked well with large texts, but was too inaccurate when it came to smaller changes. Also the threshold used was just a wild guess that worked well with the quite homogenous test set but did not in other cases. Last but not least we had not enough tests with realistic sets taken from a real wiki.

How we fixed all the things!*
*) almost

Fix the tests
We realized that the tests used were obviously not good enough. So before working on the algorithms, we needed to make sure that our next changes could be tested better.

A first part of that was a PHPUnit test that takes the essential issues of some edge cases as input, paired with an expected good output as result of the diff (404294). Since these tests were just touching edge cases and implementing a huge test set using that method is quite costly, we created a custom test system. There we imported a big set of random changes from the live Wikipedia and ran a script on these to compare the diffs between the original and the new implementations. The resulting set of diffs in which something changed between the different Wikidiff2 versions was then manually checked if things changed to the worse or got better. This method was then used to adjust thresholds and test the changes in the algorithms in general.

Fix the algorithms
We took a closer look at the examples with strange looking diffs to identify the edge cases potentially causing problems. Some of them really were false positives produced by our changes, others looked bad but seemed to follow some other schema. So while fixing our regression issues we also tried to improve other problems in the diff.

The two main things identified were:

1) Changes where the ratio alone did not work well to classify similarity, e.g. with short text: T180259

2) Changes by adding or removing lines that change the alignment of the diff

Fixing issues with similarity checks using character runs
To fix getting a lot of false positives we tested using a ratio of changed character runs instead of just considering only characters when computing similarity checks (404293/15). As it turned out, this approach did not improve the output, so it was discarded.

Fixing issues with added and removed empty lines
Many issues unrelated to our first set of changes seem to be related to wrong alignment of the lines in the old and new version of text. This problem seems especially confusing when empty lines get added or removed. A first very simple approach solving this was adding the empty lines to the paragraphs above before running the diff algorithm (405700). While this seemed to work for relevant cases (T184531), it did make things worse for others (T184531).

A second approach into that direction also brought no improvements.

Improvements that finally worked
At the end a combination of three things finally did the job:

To avoid false positives and improve general matching and alignment we added a check, which looks for paragraphs that seemed to be moved to the position immediately following. These are then correctly shown again as a side-by-side change, if they happen to be similar enough. In addition to that we split diff operations containing changes as well as added/deleted lines into separate diff operations to further improve matching of moved paragraphs. Last but not least the thresholds were tweaked by iterating over our tests further improving the diff to a point were we felt confident with the results (404293).

What we learned & other stuff

 * Have meaningful tests, with real life data.
 * Thresholds need to be tweaked with care.
 * Try to figure edge cases beforehand.
 * Theory can quickly collide with reality.
 * Prune the byte code cache (T177891)!

Feedback
Feedback and questions are very welcome! Please use the talk page for that.