Extension:Annotation

An annotation of a page is an analysis of the wikitext which determines who authored each part of the text, and in which revision. Source code revision control systems such as Subversion offer a command which provides a line-by-line walkthrough of who last edited what. This feature is useful in determining which programmer "broke" the program with an errant line change. In the context of collaborative wiki editing for encyclopedia articles, similar functionality on a word-by-word basis may be useful, hence Bug 639.

There are quite a few implementation details to be hammered out for such a feature, and thus the presence of this meta page. User:Ambush Commander has assigned himself to the project, and has finished the base annotation code.



Base code
The base code consists of several classes. It also has a set of SimpleTest unit tests for most of the functions on the primary ones.

Using the code
The code requires MediaWiki's difference engine package. If you wish to use the code outside the context of a MediaWiki installation, you will need to get a copy of that package from the latest MediaWiki version, remove its includes, and define the MediaWiki functions it still calls. Further changes do not seem to be necessary.

The code uses a private class from that package, and thus is not necessarily implemented in the cleanest manner. Separating the diff code into its own package, with a clearly defined public interface, may be desirable. The package's top-level class is too coarse and integrated for our needs, and is not necessary for the functioning of the code. However, it is worth noting that it has this conditional code:

 if ( $wgUseExternalDiffEngine ) {
     # For historical reasons, external diff engine expects
     # input text to be HTML-escaped already
     $otext = str_replace( "\r\n", "\n", htmlspecialchars ( $otext ) );
     $ntext = str_replace( "\r\n", "\n", htmlspecialchars ( $ntext ) );
     if( !function_exists( 'wikidiff_do_diff' ) ) {
         dl('php_wikidiff.so');
     }
     $difftext = wikidiff_do_diff( $otext, $ntext, 2 );
 } else {
     $ota = explode( "\n", str_replace( "\r\n", "\n", $otext ) );
     $nta = explode( "\n", str_replace( "\r\n", "\n", $ntext ) );
     $diffs =& new Diff( $ota, $nta );
     $formatter =& new TableDiffFormatter;
     $difftext = $formatter->format( $diffs );
 }

Here, wikidiff_do_diff is an optimized function, loaded from php_wikidiff.so, that outputs HTML for displaying the diff. In its current state we cannot use this optimization, because the annotation code needs the $diffs object rather than rendered HTML. If the external diff engine offered an interface from which $diffs could be created, using it would be possible. The two engines, however, would have to produce exactly the same format.

When copy is not copy
WordLevelDiff has a surprising quirk: copied text does not always stay the same. A good example is the diff from text "Lalala" to "Lalala is cool." The WordLevelDiff incorrectly reports that "Lalala " (note the space) was copied in "Lalala is cool." I have not investigated further, but there is some code in the copy class to handle this deficiency. As I find out more quirks, further compatibility may be necessary.
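
One possible way to tolerate this quirk is sketched below, in Python rather than the extension's PHP for brevity. It is not the actual code in the copy class. It assumes a copy operation exposes the words of the new side, and that the old annotation is a list of (word, author) pairs; the function name `apply_copy` and both data shapes are hypothetical. The idea is to keep the attribution from the old side while taking the text from the new side, so a stray trailing space cannot corrupt the output.

```python
def apply_copy(old_annotated, closing_words, start):
    """Carry attribution across a 'copy' whose two sides may differ slightly.

    old_annotated -- list of (word, author) pairs for the previous revision
    closing_words -- the words on the new side of the copy operation
    start         -- index in old_annotated where the copied run begins
    """
    carried = []
    for offset, new_word in enumerate(closing_words):
        # Attribution comes from the old side, text from the new side.
        _, author = old_annotated[start + offset]
        carried.append((new_word, author))
    return carried
```

With the example above, the old word "Lalala" by one author and the new-side word "Lalala " yield the pair ("Lalala ", same author).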

Annotation
The standard way to call the annotation code is to pass it an array of ExtendedStrings, one per revision, each holding the full text of the revision and the information attached to that revision. Because this is not the format MediaWiki uses, some transformation into this format may be necessary.
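
To make the input and output shapes concrete, here is a minimal sketch of word-level annotation over an ordered list of revisions. It is written in Python rather than PHP and uses the standard library's difflib instead of MediaWiki's WordLevelDiff, so it illustrates the idea only. The `Revision` class is a hypothetical stand-in for an ExtendedString, and `tokenize` and `annotate` are names invented for this example.

```python
import difflib
import re
from dataclasses import dataclass


@dataclass
class Revision:
    """Hypothetical stand-in for an ExtendedString: text plus metadata."""
    rev_id: int
    author: str
    text: str


def tokenize(text):
    # Split into words and whitespace runs so the text can be rebuilt exactly.
    return re.findall(r"\s+|\S+", text)


def annotate(revisions):
    """Return (token, rev_id, author) triples for the last revision.

    revisions must be ordered oldest first.
    """
    annotated = []
    for rev in revisions:
        new_tokens = tokenize(rev.text)
        old_tokens = [tok for tok, _, _ in annotated]
        matcher = difflib.SequenceMatcher(None, old_tokens, new_tokens,
                                          autojunk=False)
        result = []
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                # Unchanged tokens keep their earlier attribution.
                result.extend(annotated[i1:i2])
            elif tag in ("replace", "insert"):
                # New or changed tokens are credited to this revision.
                result.extend((tok, rev.rev_id, rev.author)
                              for tok in new_tokens[j1:j2])
            # "delete": the tokens disappear, nothing to carry over.
        annotated = result
    return annotated
```

Joining the tokens of the result reproduces the text of the latest revision, which is a cheap sanity check for any implementation of this kind.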

Reverts
Currently, the code assumes that all contributors act in good faith and that nothing like vandalism or revert warring goes on. Sadly, in the real world, this is not the case. Somehow, the annotation code has to detect when a revert happens and then roll back to the annotation of the earlier version, instead of replacing the whole document with itself and assigning it completely to the person who reverted. It must do so without keeping a copy of every single iteration in memory or incurring too much of a performance hit.

Furthermore, if we are incrementally updating the Annotation, the update would not know about any past revisions except the one immediately before the change. This implies that some sort of metadata must be preserved in the Annotation, at the cost of higher storage.
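
One way to approach both concerns is sketched here in Python. It is an assumption about a possible design, not something the base code does: store a digest of each revision's text next to a snapshot of its annotation, and cap the number of snapshots kept. The class name `RevertTracker`, its methods, and the limit of 50 are all hypothetical.

```python
import hashlib
from collections import OrderedDict


class RevertTracker:
    """Remember recent revisions by text digest to recognize exact reverts."""

    def __init__(self, max_snapshots=50):
        self.max_snapshots = max_snapshots
        self._snapshots = OrderedDict()  # digest -> annotation snapshot

    @staticmethod
    def digest(text):
        return hashlib.sha1(text.encode("utf-8")).hexdigest()

    def lookup(self, text):
        """Return the stored annotation if text equals a remembered revision."""
        return self._snapshots.get(self.digest(text))

    def remember(self, text, annotation):
        """Store a snapshot, discarding the oldest ones beyond the limit."""
        key = self.digest(text)
        self._snapshots[key] = list(annotation)
        self._snapshots.move_to_end(key)
        while len(self._snapshots) > self.max_snapshots:
            self._snapshots.popitem(last=False)
```

Before diffing a new revision, the caller would try `lookup` first and reuse the stored annotation on a hit. This only catches exact reverts within the remembered window; partial reverts would still go through the normal diff.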

Printing the annotation
Printing a word level annotation presents interesting problems. How do we preserve readability of the wikitext while packing as much metadata in as we can? How much JavaScript do we want to use? Do we offer "topographic" style annotations, that color code based on how old the piece of text is? This requires discussion.

Color spans
What gradients of colors are still readable on white backgrounds? Which colors should be definable via user stylesheet, and which colors should be dynamically inserted (and thus immutable)? Should the form allow for the colors to be changed? Should the text be colored, or the background be colored?
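
As a starting point for that discussion, the following Python sketch colors the background rather than the text, fading from pale yellow for the newest revision to white for the oldest so that black text stays readable. The color endpoints, the function names, and the (token, rev_id, author) input shape are assumptions made for this example, not decisions that have been taken.

```python
import html


def age_to_background(age, max_age):
    """Map age 0 (newest) .. max_age (oldest) to a color from #ffff99 to #ffffff."""
    if max_age <= 0:
        fraction = 0.0
    else:
        fraction = min(max(age / max_age, 0.0), 1.0)
    # Only the blue channel changes; red and green stay at full intensity.
    blue = round(0x99 + (0xFF - 0x99) * fraction)
    return "#ffff{:02x}".format(blue)


def render(annotated, newest_rev, oldest_rev):
    """Wrap each (token, rev_id, author) triple in a colored span."""
    span = newest_rev - oldest_rev
    parts = []
    for token, rev_id, author in annotated:
        color = age_to_background(newest_rev - rev_id, span)
        title = html.escape("r{} by {}".format(rev_id, author), quote=True)
        parts.append('<span style="background:{}" title="{}">{}</span>'.format(
            color, title, html.escape(token)))
    return "".join(parts)
```

Inline styles like these are the "dynamically inserted" option; emitting a class name per age bucket instead would leave the colors to the user stylesheet.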

Performance
It appears that pulling old revisions is a very costly operation. This means that, even with a maximum depth level for revisions, annotations cannot be built at run time when a page is requested. This requires a few extra changes:


1. Creation of a table to store compiled annotations
2. Creation of a maintenance script to populate this table
3. Creation of a page save hook that recompiles the annotation after a page is edited

This will probably require the work of more experienced developers.
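
The storage and save-hook parts of that plan can be sketched as follows, using Python and SQLite purely for illustration. The table name `annotation_cache`, its columns, and the function names are hypothetical; a real implementation would be PHP against MediaWiki's own database layer and hook system.

```python
import json
import sqlite3


def create_table(conn):
    # One row per page: the latest compiled annotation and its revision.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS annotation_cache ("
        " page_id INTEGER PRIMARY KEY,"
        " rev_id INTEGER NOT NULL,"
        " data TEXT NOT NULL)"
    )


def store_annotation(conn, page_id, rev_id, annotation):
    conn.execute(
        "INSERT OR REPLACE INTO annotation_cache (page_id, rev_id, data)"
        " VALUES (?, ?, ?)",
        (page_id, rev_id, json.dumps(annotation)),
    )
    conn.commit()


def load_annotation(conn, page_id):
    """Return (rev_id, annotation) for a page, or None if nothing is cached."""
    row = conn.execute(
        "SELECT rev_id, data FROM annotation_cache WHERE page_id = ?",
        (page_id,),
    ).fetchone()
    if row is None:
        return None
    return row[0], json.loads(row[1])


def on_page_save(conn, page_id, rev_id, compile_annotation):
    """Save hook: recompile the annotation and overwrite the cached copy."""
    store_annotation(conn, page_id, rev_id,
                     compile_annotation(page_id, rev_id))
```

A maintenance script would then loop over all pages and call `on_page_save` for each one to populate the table initially.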