There are a number of analyses that would be a lot easier to do if there were some kind of similarity measure in the database or server. It's a bit difficult to say what it should look like, but a vector holding some kind of fingerprint seems like the most interesting solution. I'm not sure whether this should be implemented as part of the checksum solution, but the checksum solution will have some impact on how fingerprinting should be done.
I believe the fingerprinting of revisions should be done on the fly according to parameters supplied through the API, like the number of bins for the hashing scheme. The values should not be stored in the database but delivered to the client, preferably without the accompanying text unless the client requests it. If later revisions in the same run have the same checksum, then the previous fingerprint can (and should) be reused.
In my opinion it's better to do the processing on the server and only deliver the fingerprint to the client, as there will be less data to transfer. Jeblad 04:45, 15 November 2011 (UTC)
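A minimal sketch of the reuse described above, assuming each revision in a run arrives as a (checksum, text) pair. The names `fingerprint` and `fingerprints_for_run`, the word-level hashing and the default of 32 bins are all hypothetical and only there to make the caching step concrete; none of this is an existing MediaWiki API.

```python
import hashlib

def fingerprint(text, bins):
    """Hypothetical fingerprint: count words per hash bucket."""
    vector = [0] * bins
    for word in text.split():
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        vector[int.from_bytes(digest[:4], "big") % bins] += 1
    return vector

def fingerprints_for_run(revisions, bins=32):
    """revisions: list of (checksum, text) pairs, oldest first.

    Returns the (checksum, fingerprint) pairs and the number of
    fingerprints that actually had to be computed.
    """
    cache = {}      # checksum -> fingerprint, valid for this run only
    computed = 0
    result = []
    for checksum, text in revisions:
        if checksum not in cache:
            cache[checksum] = fingerprint(text, bins)
            computed += 1
        result.append((checksum, cache[checksum]))
    return result, computed

run = [("c1", "foo bar"), ("c2", "foo bar baz"), ("c1", "foo bar")]
result, computed = fingerprints_for_run(run)
print(computed)  # 2: the third revision reuses the first fingerprint
```

Only the (checksum, fingerprint) pairs would go to the client; the text never leaves the server in this sketch.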
I think a quicker way of getting this is to create a Lucene index and maybe use Mahout to find similar documents. Given the size of the Wikipedia corpus, I am not sure we can do this in (almost) real time. What kind of use cases do you have in mind?
Not sure if a Lucene index can be used at all for this; it's about measuring similarity between revisions, that is, how much two versions differ. Right now all projects that need this kind of data download the revision text to be able to calculate the fingerprints, which is a very dumb approach as it locks up the server for a long time. It is better to calculate this on the server and just transfer the fingerprints. Calculating a fingerprint for a revision is comparable to calculating a complex digest.
Not sure how useful it is to start to describe use cases, it will only be a few I know about…
A revision A has some position in a vector space, and this position is what you can call the fingerprint of the revision. Another revision B has another position. Between those two revisions there is a distance, and this distance approximates the necessary work (edit distance) to bring the article from revision A to revision B. For the distance to be meaningful the vector space must be continuous, and so forth; basically it must be a linear vector space. The transformation from the text to the vector (fingerprint) must also approximate a linear transform, that is, it must be some kind of w:en:Locality-sensitive hashing that is fine-grained enough that it is possible to calculate distances.
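To make the idea concrete, here is a rough sketch of one possible scheme: hash every word into one of a fixed number of bins and count. This is plain feature hashing rather than a tuned locality-sensitive hash, and the function names, the word-level tokenisation, the 64 bins and the L1 distance are all assumptions chosen for illustration, not a proposal for the actual implementation.

```python
import hashlib

def fingerprint(text, bins=64):
    """Hash each word into one of `bins` buckets and count occurrences."""
    vector = [0] * bins
    for word in text.split():
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % bins
        vector[index] += 1
    return vector

def distance(a, b):
    """Manhattan (L1) distance between two fingerprints of equal length."""
    return sum(abs(x - y) for x, y in zip(a, b))

rev_a = "the quick brown fox jumps over the lazy dog"
rev_b = "the quick brown fox jumps over the lazy cat"   # one word replaced
rev_c = "an entirely different article about something else"

fp_a, fp_b, fp_c = fingerprint(rev_a), fingerprint(rev_b), fingerprint(rev_c)
print(distance(fp_a, fp_b))  # small: at most two bins change
print(distance(fp_a, fp_c))  # typically much larger
```

The counts are additive: the fingerprint of two concatenated texts is the sum of their fingerprints, which is the sense in which the transform is linear. Bin collisions make the distance an approximation of the amount of changed text, not an exact edit distance.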
A measure of the work involved is necessary in several situations, like identifying an article's main authors, important site contributors, social crediting, trust modeling, etc. Some types of analyses need additional preprocessing, some need postprocessing. Postprocessing can be done in the client browser, while preprocessing implies either that the whole text must be downloaded or that some script must run on the server. The latter poses a real problem.
Many of the possible use cases are more efficient if the similarity measure is available, because then the implementations don't have to transfer the text of every revision in the history to the client browser. They don't even need the text (or the fingerprint) of every revision; they just need the fingerprint of the last edit a user made before another user saved their version. That is, consecutive edits by the same user should be collapsed unless a revision is detected as reused later in the page history.
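The collapsing rule can be sketched as follows, assuming the history is available as a list of (user, checksum) pairs, oldest first. The function name `collapse` is hypothetical, and "reused" is interpreted here as "the same checksum occurs more than once in the history", which is one possible reading.

```python
from collections import Counter

def collapse(history):
    """Return the indexes of revisions that still need a fingerprint.

    history: list of (user, checksum) pairs, oldest first.
    A revision is kept if it is the last one in a run of consecutive
    edits by the same user, or if its checksum occurs more than once.
    """
    counts = Counter(checksum for _, checksum in history)
    kept = []
    for i, (user, checksum) in enumerate(history):
        last_in_run = i + 1 == len(history) or history[i + 1][0] != user
        reused = counts[checksum] > 1
        if last_in_run or reused:
            kept.append(i)
    return kept

history = [
    ("A", "h1"),  # kept: h1 is restored by C later on
    ("A", "h2"),  # dropped: A keeps editing
    ("A", "h3"),  # kept: last edit by A before B saves
    ("B", "h4"),  # kept
    ("C", "h1"),  # kept: restores A's first revision
]
print(collapse(history))  # [0, 2, 3, 4]
```

Only the checksums are needed to decide which revisions to fingerprint, so this step can run before any text is touched.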
At Norwegian (bokmål) Wikipedia there is a gadget to calculate the users' real contributions to an article. The gadget is somewhat simplistic, as destructive changes are accumulated as constructive contributions. The gadget downloads the text of every revision unconditionally, but if more suitable services were available then only the revision hash and the fingerprints would be necessary.
A variation of this is to normalize the contributions against the fingerprint of the final revision. This will give emphasis to constructive contributions. This variation implies postprocessing of the calculated fingerprints.
A second variation is to use information entropy as a scaling weight during construction of the fingerprints. This will give a slight emphasis to those who add high-value content. This variation implies preprocessing of the data before generating the fingerprints.
If this is adjusted so that the information entropy fades out after first use, the contributors of high-value content will get a strong boost. This variation implies postprocessing of the calculated fingerprints.
A third variation is to use the normalized previous and following fingerprints from the revision history, estimate how a fingerprint should relate to those, and accumulate the difference over all contributions from a specific user. This will give a measure of trust for the user, as the number will be low if the user gets reverted a lot. This variation implies pre- and postprocessing of the fingerprints.
We need a convincing use case before considering this :) And I really think that Lucene, Mahout and the XML dump files are the way to go. Lucene has built-in similarity functionality, and otherwise Mahout has it as well. You can create an index using Lucene and provide this to Mahout as input. Then you can calculate similarity scores for different measures. Drdee 18:29, 16 November 2011 (UTC)
Sorry, but Lucene and Mahout aren't about this at all. Pointing to the XML dump is just saying "we don't want to consider making any analysis available from any Wikimedia site". At least take the time to read the bug report; just reiterating the same question is a waste of time.
I am trying to understand the use case. You yourself state that "Not sure how useful it is to start to describe use cases, it will only be a few I know about…", so I understand what you want to do but I don't understand why you want to do it. And I am suggesting an alternative way using Lucene, Mahout and the XML dump files. The code for the checksums (bug 21860) has been checked in and should be part of MW 1.19. If you are interested in the real contributions of editors then have a look at a project I have been working on called the DiffDB (https://github.com/whym/diffindexer). Best, Diederik Drdee 21:53, 16 November 2011 (UTC)
Right now the proposal title and the committed patch implement this in the revision table. Why is that, though? In my opinion it makes more sense in the text table (which the introduction paragraph of the proposal mentions as the target table as well).
It's the hash of the text, not of the revision metadata. There can (and should) be multiple revisions with the same hash of the revision text. Right now MediaWiki only re-uses a text-table row if a revision is a direct revert of an earlier revision (using the "rollback" feature). If a normal undo takes place, or if there were multiple editors after the vandalism so that the user had to dig back manually and save an old revision, then MediaWiki stores a second copy of the text.
Anyway, just to bring this up. Do we want it in the text table?
No, some of the use cases for this are primarily aimed at tasks which involve looking at a large number of revisions and comparing them for equality. In essence, things for which right now you'd have to extract the full text of every one of those revisions. Our 'text table' is really only the default location for text; an external store can be used instead, so iirc we don't join on the text table. Putting the checksum alongside the text sounds like it would mean that we would end up right back at making hundreds of requests to the text storage.
I'm using something like this in a number of gadgets, and for it to be efficient you should not have to query anything that implies processing of text, either on the server or at the client; you only want the hash of the text. The hash should, however, be calculated only from the text, not from any other metadata.
Yes, if you want to query by hash then you would obviously need an index, but I haven't yet heard a use case where we would really want to query the hash column often. In addition, the checksum will not always be unique across different pages. If two different pages have been blanked then they would have the same hash. So we might need a compound index in that case, but I would like to hear more use cases first before we decide on including an index.
I'm not aware of any analyses that need to query for a specific hash value; all the ones I've seen need a list of hashes through the history of an article. I can imagine one situation where it is interesting, and that is identifying an undo or a save of an old version as an identity revert. Such reverts are not that uncommon, and I think they should be identified and tagged if possible.
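Identity reverts can be found from the list of hashes alone, without a query by hash value. A minimal sketch, assuming the hashes for one page are given oldest first; the function name `identity_reverts` is hypothetical, and a revision with the same hash as its immediate predecessor is treated as a null edit rather than a revert.

```python
def identity_reverts(hashes):
    """Return (index, restored_index) pairs for identity reverts.

    hashes: text hashes of one page's revisions, oldest first.
    restored_index is the earliest revision with the same hash.
    """
    first_seen = {}
    reverts = []
    for i, h in enumerate(hashes):
        # Seen before, and not merely a repeat of the previous revision.
        if h in first_seen and hashes[i - 1] != h:
            reverts.append((i, first_seen[h]))
        first_seen.setdefault(h, i)
    return reverts

print(identity_reverts(["a", "b", "c", "b", "d", "a"]))  # [(3, 1), (5, 0)]
```

This is a single pass over the history of one page, so the point about blanked pages sharing a hash across pages does not affect it.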