MediaWiki extensions manual

Release status: experimental

Implementation: Special page, Database, User activity
Description: Analysis of the wikitext history to determine who authored what text and when.
Author(s): Edward Z. Yang
Latest version: n/a (2006-03-03)
MediaWiki: n/a
License: GNU General Public License 2.0 or later
Download: No link

An annotation of a page is an analysis of the wikitext that determines who authored each part of the text, and in which revision. Source code revision control systems offer an annotate command (called blame in Subversion) that provides a line-by-line walkthrough of who last edited each line. This feature is useful for determining which programmer "broke" a program with an errant line change. In the context of collaborative wiki editing for encyclopedia articles, similar functionality on a word-by-word basis may be useful; this is the subject of Bug 639.

There are quite a few implementation details to be hammered out for such a feature, hence this meta page. User:Ambush Commander has assigned himself to the project and has finished the base annotation code.

User:Dcoetzee's vision of annotation

Base code

The base code consists of several classes, primarily Annotation, ExtendedString, PhantomString and RevisionInfo. It also has a set of SimpleTest unit tests for most of the functions on Annotation and ExtendedString.

Using the code

The code requires the difference engine package in DifferenceEngine.php. If you wish to use the code outside a MediaWiki installation, you will need to take a copy of that file from the latest MediaWiki version, remove its includes, and define the functions wfProfileIn() and wfProfileOut(). Further changes do not seem to be necessary.
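The two profiling functions can be stubbed out with no-ops. The following is a minimal sketch, assuming that the copied diff classes only call these functions for profiling and ignore their return values; the parameter names are illustrative and may not match the MediaWiki originals exactly.

```php
<?php
// No-op stand-ins for MediaWiki's profiling functions, so that code copied
// out of DifferenceEngine.php can run without the rest of MediaWiki.
// Assumption: callers pass a function name and ignore the return value.
if ( !function_exists( 'wfProfileIn' ) ) {
	function wfProfileIn( $functionName ) {
		// Intentionally empty: profiling is skipped outside MediaWiki.
	}
}

if ( !function_exists( 'wfProfileOut' ) ) {
	function wfProfileOut( $functionName = 'missing' ) {
		// Intentionally empty: profiling is skipped outside MediaWiki.
	}
}
```

The function_exists() guards keep the stubs from clashing with the real functions if the same file is later loaded inside MediaWiki.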

The code uses the private class WordLevelDiff, and thus is not necessarily implemented in the cleanest manner. Separating the code into its own package with a clearly defined, public interface may be desirable. DifferenceEngine itself is too coarse and integrated for our needs, and is not required for the code to function. However, it is worth noting that it contains this conditional code:

if ( $wgUseExternalDiffEngine ) {
	# For historical reasons, external diff engine expects
	# input text to be HTML-escaped already
	$otext = str_replace( "\r\n", "\n", htmlspecialchars ( $otext ) );
	$ntext = str_replace( "\r\n", "\n", htmlspecialchars ( $ntext ) );
	# Load the extension on demand if it is not already available
	if( !function_exists( 'wikidiff_do_diff' ) ) {
		dl( 'php_wikidiff.so' );
	}
	$difftext = wikidiff_do_diff( $otext, $ntext, 2 );
} else {
	# Built-in PHP diff: split both texts into lines and format as a table
	$ota = explode( "\n", str_replace( "\r\n", "\n", $otext ) );
	$nta = explode( "\n", str_replace( "\r\n", "\n", $ntext ) );
	$diffs =& new Diff( $ota, $nta );
	$formatter =& new TableDiffFormatter();
	$difftext = $formatter->format( $diffs );
}

Here, wikidiff_do_diff() is an optimized function defined by php_wikidiff.so that outputs HTML for displaying the diff. In its current state, we cannot use this optimization. If the external diff engine exposed an interface that allowed $diffs to be created from it, using it would be possible; the two, however, would have to produce exactly the same format.

When copy is not copy

WordLevelDiff has a surprising quirk: copied text does not always stay the same. A good example is the diff from the text "Lalala" to "Lalala is cool." WordLevelDiff reports that "Lalala " (note the trailing space) was copied into "Lalala is cool.", even though the original text has no trailing space. I have not investigated further, but there is some code in the copy() class to handle this deficiency. As I find more quirks, further compatibility code may be necessary.


The standard style is to call Annotation::newFromRevisions($revisions), where $revisions is an array of ExtendedStrings, each holding the full text of a revision together with the information attached to that revision. Because this is not the style MediaWiki uses, some transformation into this format may be necessary.
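To make the input shape concrete, here is a minimal, self-contained sketch of word-level attribution over an ordered list of revisions. It is not the extension's Annotation class: annotateWords(), lcsPairs() and the plain array format ('author', 'text') are hypothetical stand-ins for Annotation::newFromRevisions() and ExtendedString, and a simple longest-common-subsequence comparison stands in for WordLevelDiff.

```php
<?php
// Hypothetical sketch only; names and data layout are assumptions.

/**
 * Return [oldIndex, newIndex] pairs for the words kept between two word
 * lists, using a plain longest-common-subsequence (LCS) table.
 */
function lcsPairs( array $old, array $new ): array {
	$n = count( $old );
	$m = count( $new );
	// $len[$i][$j] = LCS length of $old[$i..] and $new[$j..]
	$len = array_fill( 0, $n + 1, array_fill( 0, $m + 1, 0 ) );
	for ( $i = $n - 1; $i >= 0; $i-- ) {
		for ( $j = $m - 1; $j >= 0; $j-- ) {
			$len[$i][$j] = ( $old[$i] === $new[$j] )
				? $len[$i + 1][$j + 1] + 1
				: max( $len[$i + 1][$j], $len[$i][$j + 1] );
		}
	}
	// Walk the table from the start to recover the matching word pairs.
	$pairs = [];
	$i = 0;
	$j = 0;
	while ( $i < $n && $j < $m ) {
		if ( $old[$i] === $new[$j] ) {
			$pairs[] = [ $i, $j ];
			$i++;
			$j++;
		} elseif ( $len[$i + 1][$j] >= $len[$i][$j + 1] ) {
			$i++;
		} else {
			$j++;
		}
	}
	return $pairs;
}

/**
 * Attribute each word of the latest revision to the author who first
 * introduced it. $revisions is ordered oldest first; each entry is
 * [ 'author' => string, 'text' => string ].
 */
function annotateWords( array $revisions ): array {
	$words = [];
	$authors = [];
	foreach ( $revisions as $rev ) {
		$newWords = preg_split( '/\s+/', trim( $rev['text'] ), -1, PREG_SPLIT_NO_EMPTY );
		// Every word starts out attributed to the current author...
		$newAuthors = array_fill( 0, count( $newWords ), $rev['author'] );
		// ...and words carried over from the previous revision keep theirs.
		foreach ( lcsPairs( $words, $newWords ) as [ $oldIdx, $newIdx ] ) {
			$newAuthors[$newIdx] = $authors[$oldIdx];
		}
		$words = $newWords;
		$authors = $newAuthors;
	}
	$result = [];
	foreach ( $words as $k => $word ) {
		$result[] = [ $word, $authors[$k] ];
	}
	return $result;
}
```

Given "Lalala is cool" by Alice followed by "Lalala is very cool" by Bob, only the word "very" is attributed to Bob. Splitting on whitespace discards the spacing itself, which sidesteps the trailing-space quirk but would not be enough to reproduce the wikitext exactly.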


Currently, the code assumes that all contributors act in good faith and that nothing like vandalism or revert warring goes on. Sadly, in the real world, this is not the case. Somehow, the annotation code has to detect when a revert happens and then roll back to an earlier version, instead of replacing the whole document with itself and attributing it entirely to the person who reverted. It must do so without keeping a copy of every single iteration in memory or incurring too much of a performance hit.

Furthermore, if we update the Annotation incrementally, it will not know about any past revisions except the one immediately before the change. This implies that some sort of metadata must be preserved in the Annotation, at the cost of higher storage.
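One possible form for that metadata is a hash of each past revision's text, which is far smaller than the text itself. The sketch below is an assumption about how revert detection could work, not part of the existing code: findReverts() is a hypothetical helper that only detects exact reverts (identical text), and it does not by itself restore the earlier attribution.

```php
<?php
/**
 * Hypothetical helper: for each revision (oldest first), return the index
 * of the earliest revision with identical text, or null if the text has
 * not been seen before. Only hashes are kept, not the full texts.
 */
function findReverts( array $texts ): array {
	$seen = [];       // sha1 of text => index of first revision with that text
	$revertsTo = [];
	foreach ( $texts as $i => $text ) {
		$hash = sha1( $text );
		if ( isset( $seen[$hash] ) ) {
			// Exact match with an earlier revision: treat as a revert.
			$revertsTo[$i] = $seen[$hash];
		} else {
			$revertsTo[$i] = null;
			$seen[$hash] = $i;
		}
	}
	return $revertsTo;
}
```

For the texts "A", "A vandal", "A", the third revision is reported as a revert to index 0, so its text could keep the attribution it had there rather than being assigned to the reverting user. Partial reverts and null edits would need additional handling.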

Printing the annotation

Printing a word-level annotation presents interesting problems. How do we preserve readability of the wikitext while packing in as much metadata as we can? How much JavaScript do we want to use? Do we offer "topographic" style annotations that color-code text based on how old it is? This requires discussion.

Color spans

What gradients of colors are still readable on white backgrounds? Which colors should be definable via user stylesheet, and which colors should be dynamically inserted (and thus immutable)? Should the form allow for the colors to be changed? Should the text be colored, or the background be colored?
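As a starting point for that discussion, here is a minimal sketch of one option: coloring the background rather than the text, with a gradient from light yellow (newest text) to white (oldest text). The function names ageToBackground() and colorSpan(), the specific colors, and the use of inline styles are all assumptions made for illustration; none of them are settled by the existing code.

```php
<?php
/**
 * Hypothetical helper: map a revision's position in the history to a CSS
 * background color. Index 0 is the oldest revision, $latestIndex the newest.
 */
function ageToBackground( int $revisionIndex, int $latestIndex ): string {
	// Fraction in [0, 1]: 0 = oldest text, 1 = newest text.
	$fraction = $latestIndex > 0 ? $revisionIndex / $latestIndex : 1.0;
	$fraction = max( 0.0, min( 1.0, $fraction ) );
	// Blue channel runs from 0xff (white) down to 0x99 (light yellow).
	$blue = 255 - (int)round( 102 * $fraction );
	return sprintf( '#ffff%02x', $blue );
}

/**
 * Hypothetical helper: wrap a piece of text in a colored span.
 */
function colorSpan( string $text, int $revisionIndex, int $latestIndex ): string {
	$color = ageToBackground( $revisionIndex, $latestIndex );
	return '<span style="background-color: ' . $color . '">'
		. htmlspecialchars( $text ) . '</span>';
}
```

Inline styles like these cannot be overridden from a user stylesheet as easily as class names can, which is one reason the question of stylesheet-definable versus dynamically inserted colors remains open.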


It appears that pulling old revisions is a very costly operation. This means that, even with a maximum depth for revisions, annotations cannot be built at run time. This requires a few extra changes:

  1. Creation of an `annotation` table to store compiled annotations
  2. Creation of a maintenance script to populate this table
  3. Creation of a page save hook that recompiles the annotation after a page is edited

This will probably require the work of more experienced developers.