User:Jeblad/Contributors

Contributors is an extension that provides a special page listing the main contributors to a page. Several licensing schemes have a clause requiring that the main contributor be credited. In a collaboratively written wiki text this can be difficult to satisfy, as it is extremely time-consuming to wade through long page histories. This extension automates the process.

There are several configurable methods of measuring contributions, and each gives slightly different values. It is important to note that there is no single correct method for measuring contributions.

During analysis several techniques are used to short-circuit otherwise heavy calculations, but the calculations could still be too heavy for heavily loaded servers. Care must be taken to verify that limits are set and that they give an acceptable server load.

Background
It is important to note that there is no single "correct" way to calculate user contributions, but there are a number of approximations that can be used to figure out who the most likely main contributors are. An example illustrates the problem. Imagine a user adding a few characters at random positions in a text, and then the same number of characters as one continuous string. The number of keys she pressed can be similar, if we disregard navigation in the text, yet the overall change to the meaning of the content can be very different. Adding a few scattered characters can change the entropy of at least the same number of words, while adding a continuous string will influence fewer words. Neither counting continuous changes nor counting changes to single characters is wrong; they just give different answers.
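The difference between scattered and continuous changes can be made concrete with a small sketch. The function `words_touched` and its list-of-offsets input are hypothetical and only illustrate the argument above; this is not one of the extension's measures.

```python
def words_touched(text, positions):
    """Count the distinct words in `text` hit by at least one changed character.

    `positions` is a list of character offsets into `text` (an assumed,
    simplified way to describe an edit).
    """
    # Map every character offset to the index of the word it belongs to.
    word_at = []
    word_index = -1
    in_word = False
    for ch in text:
        if ch.isspace():
            in_word = False
            word_at.append(None)  # whitespace belongs to no word
        else:
            if not in_word:
                word_index += 1
                in_word = True
            word_at.append(word_index)
    touched = {word_at[pos] for pos in positions if word_at[pos] is not None}
    return len(touched)


text = "the quick brown fox jumps over the lazy dog"
scattered = [1, 5, 11, 17, 21]   # five single characters, spread over the text
contiguous = [4, 5, 6, 7, 8]     # five characters typed as one run

print(words_touched(text, scattered))   # 5
print(words_touched(text, contiguous))  # 1
```

Both edits involve five characters, yet a per-word measure sees five changed words in one case and a single changed word in the other, while a per-character measure would rate them equally.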

There is also the load problem. An extension like this can be set up to scan through the complete history or a subset of the history. It can also preprocess the history so it can be accessed more efficiently, possibly storing the result in separate tables. Many such optimizations trade precision for speed, increasing the error slightly in exchange for faster responses.

This extension splits the processing into two passes. In the first pass likely candidates are identified, reverted revisions are removed, and continuous and similar edits are lumped together. This greatly compresses the history so the later processing is more manageable. In the second pass specific revisions from the first pass are analyzed. The analysis for each of the weight methods has an inner kernel operator, and it is extremely important that this is as efficient as possible. The kernel is wrapped in a number of loops, which should terminate as fast as possible, and there should be as few nested loops as possible. Methods used in the extension favor constant-time solutions over linear-time ones wherever possible, even if some of the methods are rather heavy.
 * Algorithm
   * Run from oldest to newest
   * Perhaps a step to download and identify the most important revisions
   * Ask for more data if necessary before processing
   * Accumulate fingerprints over all users
   * Recurse when identical entries are found
   * Resync to similar entries if large changes
   * Normalize over total path
   * Always show new users
   * Remove users dropping under a certain level
 * Visualization
   * Show each user on a line with a horizontal bar with relative contribution
   * Estimate a mean load time and animate the transitions
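A minimal sketch of what a first pass of this kind could look like, assuming each revision carries a content hash so that a revert shows up as a repeated hash. The `Revision` tuple and `compress_history` are hypothetical names for illustration and are not taken from the extension's code; "similar" edits are simplified here to consecutive edits by the same user.

```python
from collections import namedtuple

# Hypothetical, simplified revision record: id, user name, content hash.
Revision = namedtuple("Revision", ["rev_id", "user", "sha1"])


def compress_history(revisions):
    """Reduce a history (oldest to newest) before the heavier analysis.

    - A revision whose content hash was seen before is treated as a revert:
      everything after the restored revision is dropped, including the
      revert itself.
    - Consecutive edits by the same user are lumped into the latest one.
    """
    kept = []
    seen = {}  # content hash -> index in `kept`
    for rev in revisions:
        if rev.sha1 in seen:
            cut = seen[rev.sha1] + 1
            for dropped in kept[cut:]:
                seen.pop(dropped.sha1, None)
            del kept[cut:]
            continue
        if kept and kept[-1].user == rev.user:
            # Same user as the previous kept revision: keep only the latest.
            seen.pop(kept[-1].sha1, None)
            kept[-1] = rev
        else:
            kept.append(rev)
        seen[rev.sha1] = len(kept) - 1
    return kept


history = [
    Revision(1, "Alice", "a"),
    Revision(2, "Alice", "b"),   # lumped together with revision 1
    Revision(3, "Bob", "c"),
    Revision(4, "Vandal", "x"),  # reverted by revision 5
    Revision(5, "Bob", "c"),     # restores the content of revision 3
    Revision(6, "Carol", "d"),
]
print([rev.rev_id for rev in compress_history(history)])  # [2, 3, 6]
```

The loop is a single pass with dictionary lookups, in line with the preference for few nested loops. One simplification to be aware of: once edits are lumped, a later revert to an intermediate state of the lumped run is no longer detected.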

Installation
As is standard with other MediaWiki extensions, you may install this extension by extracting the extension somewhere (usually the  folder), and adding to.

In addition you may add your own configuration of specific weight calculations, configure additional ones, or remove existing ones.


 * entropy : Constructs hashes by calculating entropy for phrases used in the final revision, and accumulates those values for individual users
 * triplets : Constructs vectors in a 256-dimensional space by hashing each revision, and accumulates the difference between the new and parent revision for individual users.
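The triplets idea can be sketched roughly as follows. The details are assumptions made for illustration: character triplets, `zlib.crc32` as the hash, and crediting only the positive part of the difference between a revision and its parent. The extension's actual analyzer may differ on all three points.

```python
import zlib
from collections import defaultdict

DIMENSIONS = 256  # size of the vector space mentioned above
TUPLE_SIZE = 3    # triplets


def fingerprint(text, size=TUPLE_SIZE):
    """Hash every overlapping character tuple into one of 256 buckets."""
    vec = [0] * DIMENSIONS
    for i in range(len(text) - size + 1):
        chunk = text[i:i + size].encode("utf-8")
        vec[zlib.crc32(chunk) % DIMENSIONS] += 1
    return vec


def score_users(revisions):
    """Accumulate per-user scores from (user, text) pairs, oldest to newest."""
    scores = defaultdict(int)
    parent = [0] * DIMENSIONS
    for user, text in revisions:
        current = fingerprint(text)
        # Assumption: only added material is credited, so negative
        # components (removed text) are ignored.
        scores[user] += sum(max(c - p, 0) for c, p in zip(current, parent))
        parent = current
    return dict(scores)


revisions = [
    ("Alice", "hello world"),
    ("Bob", "hello world!"),
]
print(score_users(revisions))  # {'Alice': 9, 'Bob': 1}
```

Each revision is reduced to a fixed-size vector, so comparing a revision with its parent costs the same regardless of text length. The price is that different triplets can collide in one bucket, which is one of the precision-for-speed trades described in the background.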


 * class: \Contributors\Analyzer\TupletAnalyzer
 * size: 3