Thread:Talk:Requests for comment/Database field for checksum of page text/Similarity measure/reply (4)

Not sure how useful it is to start describing use cases; it will only be the few I know about…

Revision A has some position in a vector space, and this position is what you can call the fingerprint of the revision. Another revision B has another position. Between those two revisions there is a distance, and this distance approximates the work (edit distance) needed to bring the article from revision A to revision B. For the distance to be meaningful the vector space must be continuous and so forth; basically it must be a linear vector space. The transformation from the text to the vector (fingerprint) must also approximate a linear transform, that is, it must be some kind of locality-sensitive hashing that is fine-grained enough to make it possible to calculate distances.
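A minimal sketch of what such a fingerprint could look like, assuming word shingles hashed into a fixed number of buckets and an L1 distance between the resulting vectors. The names <code>fingerprint</code>, <code>distance</code> and <code>DIMENSIONS</code> are hypothetical and chosen only for illustration; this is not a proposal for the actual field format.

```python
import zlib

DIMENSIONS = 64  # hypothetical fingerprint size, chosen only for the example


def fingerprint(text, n=3, dims=DIMENSIONS):
    """Hash every word n-gram (shingle) into one of `dims` buckets and count hits."""
    words = text.split()
    vector = [0] * dims
    # Texts shorter than n words contribute a single, shorter shingle.
    count = max(len(words) - n + 1, 1) if words else 0
    for i in range(count):
        shingle = " ".join(words[i:i + n])
        # crc32 is deterministic across runs, unlike Python's built-in hash().
        vector[zlib.crc32(shingle.encode("utf-8")) % dims] += 1
    return vector


def distance(a, b):
    """Manhattan (L1) distance between two fingerprints of equal length."""
    return sum(abs(x - y) for x, y in zip(a, b))
```

Replacing a single word changes at most <code>n</code> shingles on each side, so the distance between the two fingerprints is bounded by <code>2 * n</code>. In that limited sense a small edit gives a small distance, which is the property asked for above.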

A measure of the work involved is necessary in several situations, like identifying the main authors of an article, important site contributors, social crediting, trust modeling, etc. Some types of analyses need additional preprocessing, some need postprocessing. Postprocessing can be done in the client browser, while preprocessing implies either that the whole text must be downloaded or that some script must run on the server. The latter poses a real problem.

Many of the possible use cases are more efficient if the similarity measure is available, because then the implementations don't have to transfer the text of every revision in the history to the client browser. They don't even need the text (or the fingerprint) of every revision; they just need the fingerprint of the last edit a user made before another user saved their version. That is, consecutive edits by the same user should be collapsed unless a revision is detected as reused later in the page history.
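The collapsing rule can be sketched roughly as follows, assuming the history is available as a list of (user, revision hash) pairs, oldest first. The function name <code>collapse_history</code> and the input shape are assumptions made for the example, not an existing API.

```python
def collapse_history(revisions):
    """Return the indices of the revisions whose fingerprints are worth fetching.

    `revisions` is a list of (user, revision_hash) tuples, oldest first.
    A revision is kept if it is the last one in a run of edits by the same
    user, or if its hash shows up again later in the history (reuse, e.g. a
    revert back to that revision).
    """
    # Remember the last position at which each hash occurs.
    last_seen = {}
    for index, (_user, rev_hash) in enumerate(revisions):
        last_seen[rev_hash] = index

    keep = []
    for index, (user, rev_hash) in enumerate(revisions):
        is_last = index + 1 == len(revisions)
        ends_run = is_last or revisions[index + 1][0] != user
        reused_later = last_seen[rev_hash] > index
        if ends_run or reused_later:
            keep.append(index)
    return keep
```

For a history such as <code>[("A", "h1"), ("A", "h2"), ("A", "h3"), ("B", "h4"), ("C", "h1")]</code> only the second revision is dropped: the first one is kept because its hash is reused by the last edit, and the others each end a run.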

At Norwegian (bokmål) Wikipedia there is a gadget to calculate a user's real contributions to an article. The gadget is somewhat simplistic, as destructive changes are counted as constructive contributions. The gadget downloads the text of every revision unconditionally, but if more suitable services were available, only the revision hashes and the fingerprints would be necessary.
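One possible way to avoid counting destructive changes as contributions, given only fingerprints, is to credit an edit with the amount it moves the page closer to the current revision. This is a rough sketch under that assumption and is not how the existing gadget works; <code>credit_by_user</code> and the list-of-numbers fingerprint format are hypothetical.

```python
def l1(a, b):
    """Manhattan (L1) distance between two fingerprints of equal length."""
    return sum(abs(x - y) for x, y in zip(a, b))


def credit_by_user(history):
    """Credit each user with the distance their edits close towards the current text.

    `history` is a list of (user, fingerprint) tuples, oldest first; the last
    entry is taken to be the current revision. Edits that move the page away
    from the current revision (a negative gain) earn nothing.
    """
    if not history:
        return {}
    final = history[-1][1]
    previous = [0] * len(final)  # the empty page before the first edit
    credits = {}
    for user, fp in history:
        gain = l1(previous, final) - l1(fp, final)
        if gain > 0:
            credits[user] = credits.get(user, 0) + gain
        previous = fp
    return credits
```

Note that in this naive version a user who merely reverts a blanking is credited for the restored text, which is one reason the revision hash is still needed to detect reused revisions.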

I'll add some more links later.
 * Bug 21860 - Add checksum field to database table; expose it in API
 * [//no.wikipedia.org/wiki/Special:Gadgets/export/page-authors-simple?uselang=en Norwegian (bokmål) Wikipedia: Export of "Page-authors-simple"] (js, css)
 * w:no:Help:Forfattere av sider (Google Translate)