Thread:Talk:Requests for comment/Database field for checksum of page text/Similarity measure/reply (4)

Not sure how useful it is to start describing use cases; it will only be the few I know about&hellip;

A revision A has some position in a vector space, and this position is what you can call the fingerprint of the revision. Another revision B has a different position. Between those two positions there is a distance, and this distance approximates the work (edit distance) necessary to bring the article from revision A to revision B. For the distance to be meaningful the vector space must be continuous and so forth, basically a linear vector space. The transformation from the text to the vector (fingerprint) must also approximate a linear transform; that is, it must be some kind of locality-sensitive hashing that is fine-grained enough that meaningful distances can be calculated.
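A minimal sketch of the idea, using character-trigram counts as a crude stand-in for a proper locality-sensitive fingerprint and L1 distance between the vectors (both are assumptions for illustration, not a concrete proposal):

```python
from collections import Counter

def fingerprint(text, n=3):
    # Crude stand-in for a locality-sensitive fingerprint:
    # character n-gram counts. Small edits change only a few
    # n-grams, so nearby texts map to nearby vectors.
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def distance(fp_a, fp_b):
    # L1 distance between the vectors; roughly proportional to
    # the amount of text changed between the two revisions.
    return sum(abs(fp_a[k] - fp_b[k]) for k in set(fp_a) | set(fp_b))

rev_a = fingerprint("The quick brown fox jumps over the lazy dog")
rev_b = fingerprint("The quick brown fox leaps over the lazy dog")
rev_c = fingerprint("Completely unrelated text about something else")
# A one-word edit gives a small distance; unrelated text a large one.
```

The point is only that the fingerprint can stand in for the text when estimating edit distance; a real scheme would need a fixed-size vector and a much more careful choice of hash.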

A measure of the work involved is necessary in several situations, like identifying an article's main authors, important site contributors, social crediting, trust modeling, etc. Some types of analysis need additional preprocessing, some need postprocessing. Postprocessing can be done in the client browser, while preprocessing implies either that the whole text must be downloaded or that some script must run on the server. The latter poses a real problem.

Many of the possible use cases are more efficient if the similarity measure is available, because then the implementations don't have to transfer the text of every revision in the history to the client browser. They don't even need the text (or the fingerprint) of every revision; they just need the fingerprint of the last edit a user made before another user saved their version. That is, consecutive edits should be collapsed, unless a revision is detected as reused later in the page history.
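The collapsing step could look something like this (the `(user, rev_id)` pair shape is hypothetical; revert/reuse detection is deliberately left out):

```python
def collapse_consecutive(revisions):
    # revisions: list of (user, rev_id) pairs in page-history order
    # (a hypothetical shape; the real data would come from the API).
    # Keep only the last revision of each run of edits by one user.
    collapsed = []
    for user, rev_id in revisions:
        if collapsed and collapsed[-1][0] == user:
            collapsed[-1] = (user, rev_id)  # supersedes the earlier edit
        else:
            collapsed.append((user, rev_id))
    return collapsed

history = [("Alice", 1), ("Alice", 2), ("Bob", 3), ("Alice", 4)]
print(collapse_consecutive(history))
# → [('Alice', 2), ('Bob', 3), ('Alice', 4)]
```

A full implementation would also keep a collapsed revision alive if its hash reappears later in the history, as noted above.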

At the Norwegian (bokmål) Wikipedia there is a gadget that calculates users' real contributions to an article. The gadget is somewhat simplistic, as destructive changes are accumulated as if they were constructive contributions. It downloads the text of every revision unconditionally, but if more suitable services were available, only the revision hash and the fingerprints would be necessary.
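In terms of the sketch above, accumulating per-user work from fingerprints alone might look like this (again using trigram counts and L1 distance as assumed stand-ins, and a hypothetical `(user, text)` history shape):

```python
from collections import Counter, defaultdict

def fingerprint(text, n=3):
    # Character n-gram counts as a stand-in fingerprint.
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def contributions(revisions):
    # revisions: list of (user, text) pairs in history order
    # (hypothetical shape). Credits each user with the L1 distance
    # between consecutive fingerprints, starting from the empty page.
    # Like the gadget described, this counts destructive changes
    # as if they were constructive work.
    totals = defaultdict(int)
    prev = Counter()
    for user, text in revisions:
        fp = fingerprint(text)
        totals[user] += sum(abs(prev[k] - fp[k]) for k in set(prev) | set(fp))
        prev = fp
    return dict(totals)
```

With server-side fingerprints, the `text` argument would disappear entirely and only the precomputed vectors would travel to the client.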

A variation of this is to normalize the contributions against the fingerprint of the final revision. This gives emphasis to constructive contributions. This variation implies postprocessing of the calculated fingerprints.

A second variation is to use information entropy as a scaling weight during construction of the fingerprints. This gives a slight emphasis to those who add high-value content. This variation implies preprocessing of the data before the fingerprints are generated.
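One way to read this is to weight each token by its surprisal over some reference corpus before building the vector; both the choice of corpus and the use of surprisal as the weight are assumptions here:

```python
import math
from collections import Counter

def entropy_weights(corpus_tokens):
    # Surprisal -log2(p) of each token in a reference corpus; rare
    # (high-information) tokens get large weights.
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    return {tok: -math.log2(c / total) for tok, c in counts.items()}

def weighted_fingerprint(tokens, weights):
    # Scale each token count by its information content before the
    # fingerprint vector is built (the preprocessing step).
    return {tok: cnt * weights.get(tok, 0.0)
            for tok, cnt in Counter(tokens).items()}

weights = entropy_weights(["the"] * 9 + ["chiaroscuro"])
# "chiaroscuro" carries far more weight than "the".
```

Distances between such weighted vectors then emphasize changes to rare, high-value content over changes to filler words.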

If this is adjusted so that the information entropy fades out after first use, contributors of high-value content get a strong boost. This variation implies postprocessing of the calculated fingerprints.

A third variation is to use the normalized previous and following fingerprints from the revision history, estimate how a fingerprint should relate to those, and accumulate the difference over all contributions from a specific user. This gives a measure of trust in the user, as the number will be low if the user gets reverted a lot. This variation implies both pre- and postprocessing of the fingerprints.
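One possible reading of this estimate (an assumption, not the definitive scheme) is to expect each revision to sit near the midpoint of its neighbours and measure how far off it is; a reverted edit sits far from two near-identical neighbours, so its deviation is large, and accumulating the negated deviations per user yields a number that is low for frequently reverted users:

```python
from collections import Counter

def fingerprint(text, n=3):
    # Character n-gram counts as a stand-in fingerprint.
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def deviation(prev_fp, rev_fp, next_fp):
    # How far the revision is from the midpoint of its neighbours.
    # If the edit was reverted, prev == next and the revision
    # deviates by its full edit distance from them.
    keys = set(prev_fp) | set(rev_fp) | set(next_fp)
    return sum(abs(rev_fp[k] - (prev_fp[k] + next_fp[k]) / 2) for k in keys)

base = fingerprint("hello world")
vandal = fingerprint("VANDALISM")          # reverted: neighbours identical
extended = fingerprint("hello world and more")  # kept: next equals it
```

Here `deviation(base, vandal, base)` is much larger than `deviation(base, extended, extended)`, which is the signal the trust measure would accumulate.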


 * See also
 * Bug 21860 - Add checksum field to database table; expose it in API
 * [//no.wikipedia.org/wiki/Special:Gadgets/export/page-authors-simple?uselang=en Norwegian (bokmål) Wikipedia: Export of "Page-authors-simple"] (js, css)
 * w:no:Help:Forfattere av sider (Google Translate) – Help:Page authors