Topic on Talk:Requests for comment/Database field for checksum of page text

Jeblad (talkcontribs)

There are a number of analyses that would be a lot easier to do if some kind of similarity measure were available in the database or on the server. It's a bit difficult to say exactly what form it should take, but a vector acting as some kind of fingerprint seems like the most interesting solution. I'm not sure whether this should be implemented as part of the checksum solution, but the checksum solution will have some impact on how fingerprinting should be done.

I believe the fingerprinting of revisions should be done on the fly according to parameters supplied through the API, such as the number of bins for the hashing scheme. The values should not be stored in the database but delivered to the client, preferably without the accompanying text unless the client requests it. If later revisions in the same run have the same checksum, the previous fingerprint can (and should) be reused.
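
A rough sketch of what such an on-the-fly service could look like (the function names and the word-level hashing scheme below are placeholders for illustration, not an existing MediaWiki API):

```python
import hashlib


def fingerprint(text, bins=64):
    """Hash word tokens into a fixed number of bins (one possible scheme)."""
    vec = [0] * bins
    for token in text.split():
        h = int(hashlib.sha1(token.encode("utf-8")).hexdigest(), 16)
        vec[h % bins] += 1
    return vec


def fingerprints_for_run(revisions, bins=64):
    """revisions: list of (checksum, text) pairs for one request.
    Fingerprints are computed on the fly and reused when checksums repeat."""
    cache = {}
    result = []
    for checksum, text in revisions:
        if checksum not in cache:
            cache[checksum] = fingerprint(text, bins)
        result.append((checksum, cache[checksum]))
    return result
```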

In my opinion it's better to do the processing on the server and deliver only the fingerprint to the client, as there will be less data to transfer. Jeblad 04:45, 15 November 2011 (UTC)

Drdee (talkcontribs)

Dear Jeblad,

I think a quicker way of getting this is to create a Lucene index and perhaps use Mahout to find similar documents. Given the size of the Wikipedia corpus, I am not sure we can do this in (almost) real time. What kind of use cases do you have in mind?

Drdee 04:51, 15 November 2011 (UTC)

Jeblad (talkcontribs)

I'm not sure a Lucene index can be used for this at all; it's about measuring similarity between revisions, that is, how much two versions have changed. Right now, every project that needs this kind of data downloads the revision text in order to calculate the fingerprints, which is a very dumb approach, as it ties up the server for a long time. It is better to calculate this on the server and transfer just the fingerprints. Calculating a fingerprint for a revision is comparable to calculating a complex digest.

Drdee (talkcontribs)

Hi Jeblad,

Can you expand a bit more on the use case? Why is this useful? Drdee 03:50, 16 November 2011 (UTC)

Jeblad (talkcontribs)
Page authors with their actual amount of contribution
Page history visualized as a waterfall, showing only character-count changes

I'm not sure how useful it is to start describing use cases; it will only be the few I know about.

A revision A has some position in a vector space, and this position is what you can call the fingerprint of the revision. Another revision B has another position. Between those two revisions there is a distance, and this distance approximates the work (edit distance) necessary to bring the article from revision A to revision B. For the distance to be meaningful, the vector space must be continuous and so forth, basically a linear vector space. The transformation from the text to the vector (fingerprint) must also approximate a linear transform, that is, it must be some kind of w:en:Locality-sensitive hashing that is fine-grained enough that it is possible to calculate distances.
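
As a rough illustration of the idea only (the character n-gram hashing and the L1 distance below are just one possible scheme, not a concrete proposal):

```python
import hashlib


def lsh_vector(text, bins=128, n=3):
    """Map a revision text to a point in a `bins`-dimensional space by
    hashing overlapping character n-grams into counters."""
    vec = [0] * bins
    for i in range(max(len(text) - n + 1, 1)):
        gram = text[i:i + n]
        h = int(hashlib.md5(gram.encode("utf-8")).hexdigest(), 16)
        vec[h % bins] += 1
    return vec


def l1_distance(a, b):
    """Distance between two fingerprints; a rough proxy for edit distance."""
    return sum(abs(x - y) for x, y in zip(a, b))


rev_a = "The quick brown fox jumps over the lazy dog."
rev_b = "The quick brown fox jumped over a very lazy dog."
print(l1_distance(lsh_vector(rev_a), lsh_vector(rev_b)))
```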

A measure of the work involved is needed in several situations, such as identifying an article's main authors, identifying important site contributors, social crediting, trust modeling, etc. Some types of analyses need additional preprocessing, some need postprocessing. Postprocessing can be done in the client browser, while preprocessing implies either that the whole text must be downloaded or that some script must run on the server. The latter poses a real problem.

Many of the possible use cases become more efficient if the similarity measure is available, because then the implementations don't have to transfer the revision text of every revision in the history to the client browser. They don't even need the text (or the fingerprint) of every revision; they just need the fingerprint of the last edit a user made before another user saved their version. That is, consecutive edits should be collapsed, unless a revision is detected as being reused later in the page history.
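
The collapsing rule could look roughly like this, working only on (user, checksum) pairs (the function name is illustrative):

```python
def collapse_history(history):
    """history: list of (user, checksum) pairs in chronological order.
    Returns the indexes of the revisions whose fingerprints are needed:
    the last revision of each consecutive run by one user, plus any
    revision whose checksum shows up again later (e.g. after a revert)."""
    reused_later = set()
    seen = set()
    for _, checksum in reversed(history):
        if checksum in seen:
            reused_later.add(checksum)
        seen.add(checksum)

    keep = []
    for i, (user, checksum) in enumerate(history):
        last_of_run = i + 1 == len(history) or history[i + 1][0] != user
        if last_of_run or checksum in reused_later:
            keep.append(i)
    return keep
```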

On the Norwegian (Bokmål) Wikipedia there is a gadget that calculates a user's real contributions to an article. The gadget is somewhat simplistic, as destructive changes are accumulated as if they were constructive contributions. The gadget downloads the text of every revision unconditionally, but if more suitable services were available, only the revision hash and the fingerprints would be necessary.
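
If only checksums and fingerprints were transferred, the gadget's core calculation could be reduced to something like this sketch (the per-bin absolute difference is just an example measure):

```python
from collections import defaultdict


def contributions(history):
    """history: list of (user, fingerprint) pairs in chronological order.
    Credits each author with the distance between the previous fingerprint
    and their own, so destructive and constructive changes both count,
    just like the current gadget."""
    totals = defaultdict(float)
    prev = None
    for user, fp in history:
        if prev is None:
            totals[user] += sum(fp)  # first revision: credit all content
        else:
            totals[user] += sum(abs(a - b) for a, b in zip(prev, fp))
        prev = fp
    return dict(totals)
```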

A variation of this is to normalize the contributions against the fingerprint of the final revision. This gives emphasis to constructive contributions. This variation implies postprocessing of the calculated fingerprints.
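
One possible reading of this normalization, sketched as postprocessing on the fingerprints alone:

```python
from collections import defaultdict


def normalised_contributions(history):
    """history: list of (user, fingerprint) pairs in chronological order.
    Credit added content in a bin only to the extent that the bin is still
    populated in the final revision, so reverted additions count for little."""
    final = history[-1][1]
    totals = defaultdict(float)
    prev = [0] * len(final)
    for user, fp in history:
        for before, now, end in zip(prev, fp, final):
            added = max(now - before, 0)
            if added and end:
                totals[user] += min(added, end)
        prev = fp
    return dict(totals)
```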

A second variation is to use information entropy as a scaling weight during construction of the fingerprints. This gives a slight emphasis to those who add high-value content. This variation implies preprocessing of the data before generating the fingerprints.

If this is adjusted so that the information entropy fades out after first use, the contributors of high-value content get a strong boost. This variation implies postprocessing of the calculated fingerprints.
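
A simplified sketch of this and the previous entropy variation. For brevity it implements both weightings at preprocessing time on the revision texts (the fade-out variant is framed above as postprocessing on the fingerprints); token-level weights are just one possible choice:

```python
import math
from collections import Counter, defaultdict


def entropy_weights(revision_texts):
    """Estimate a per-token weight from how rare the token is in the history."""
    counts = Counter(tok for text in revision_texts for tok in text.split())
    total = sum(counts.values())
    return {tok: -math.log2(count / total) for tok, count in counts.items()}


def weighted_additions(history, weights, fade_out=False):
    """history: list of (user, text) pairs in chronological order.
    Credit users for newly added tokens, scaled by the entropy weight;
    with fade_out=True a token only carries weight the first time it appears."""
    totals = defaultdict(float)
    seen = set()
    prev_tokens = Counter()
    for user, text in history:
        tokens = Counter(text.split())
        for tok, n in (tokens - prev_tokens).items():
            weight = 0.0 if fade_out and tok in seen else weights.get(tok, 0.0)
            totals[user] += n * weight
            seen.add(tok)
        prev_tokens = tokens
    return dict(totals)
```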

A third variation is to take normalized previous and following fingerprints from the revision history, estimate how a fingerprint should relate to those, and accumulate the difference over all contributions from a specific user. This gives a measure of trust for the user, as the number will be low if the user gets reverted a lot. This variation implies both pre- and postprocessing of the fingerprints.
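
One possible reading of this trust measure, sketched on fingerprints only (the relation to the neighbouring fingerprints is simplified to a per-bin comparison):

```python
from collections import defaultdict


def trust_scores(history):
    """history: list of (user, fingerprint) pairs in chronological order.
    For every revision with a neighbour on both sides, compare the change the
    user made (prev -> own) with what persisted into the next revision
    (prev -> next); reverted edits then add almost nothing to the score."""
    scores = defaultdict(float)
    for i in range(1, len(history) - 1):
        user, own = history[i]
        prev = history[i - 1][1]
        nxt = history[i + 1][1]
        for p, o, n in zip(prev, own, nxt):
            change = o - p        # what the user changed in this bin
            persisted = n - p     # how much of it remains one revision later
            if change * persisted > 0:
                scores[user] += min(abs(change), abs(persisted))
    return dict(scores)
```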

Drdee (talkcontribs)

We need a convincing use case before considering this :) And I really think that Lucene, Mahout, and the XML dump files are the way to go. Lucene has built-in similarity functionality, and Mahout has it as well. You can create an index using Lucene and provide it to Mahout as input. Then you can calculate similarity scores for different measures. Drdee 18:29, 16 November 2011 (UTC)

Jeblad (talkcontribs)

Sorry, but Lucene and Mahout aren't about this at all. Pointing to the XML dumps is just saying "we don't want to consider making any analysis available from any Wikimedia site". At least take the time to read the bug report; just reiterating the same question is a waste of time.

Drdee (talkcontribs)

I am trying to understand the use case. You yourself state that "I'm not sure how useful it is to start describing use cases; it will only be the few I know about…", so I understand what you want to do, but I don't understand why you want to do it. And I am suggesting an alternative way using Lucene, Mahout, and the XML dump files. The code for the checksums (bug 21860) has been checked in and should be part of MW 1.19. If you are interested in the real contributions of editors, then have a look at a project I have been working on called DiffDB (https://github.com/whym/diffindexer). Best, Diederik Drdee 21:53, 16 November 2011 (UTC)
