Thread:Talk:Requests for comment/Database field for checksum of page text/Similarity measure

There are a number of analyses that would be a lot easier to do if there were some kind of similarity measure in the database or on the server. It's a bit difficult to say what form it should take, but a vector holding some kind of fingerprint seems like the most interesting solution. I'm not sure whether this should be implemented as part of the checksum solution, but the checksum solution will have some impact on how fingerprinting should be done.

I believe the fingerprinting of revisions should be done on the fly according to parameters supplied through the API, such as the number of bins for the hashing scheme. The values should not be stored in the database but delivered to the client, preferably without the accompanying text unless the client requests it. If later revisions in the same run have the same checksum, then the previous fingerprint can (and should) be reused.
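To make the idea a bit more concrete, here is a rough sketch of what such on-the-fly fingerprinting could look like. It is only an illustration under my own assumptions: the function names (`fingerprint`, `fingerprint_run`, `cosine`), the use of word pairs as features, SHA-1 as both the bin hash and the revision checksum, and cosine similarity as the comparison are all hypothetical choices, not part of any existing API or of the checksum proposal itself.

```python
import hashlib
import math
from typing import Dict, Iterable, List, Tuple


def fingerprint(text: str, num_bins: int = 16, shingle: int = 2) -> List[int]:
    """Count word shingles per hash bin (a simple feature-hashing vector).

    num_bins and shingle stand in for parameters a client might supply.
    """
    words = text.split()
    bins = [0] * num_bins
    for i in range(max(len(words) - shingle + 1, 0)):
        token = " ".join(words[i:i + shingle])
        digest = hashlib.sha1(token.encode("utf-8")).digest()
        # Use the first 8 bytes of the digest as a stable integer hash.
        bins[int.from_bytes(digest[:8], "big") % num_bins] += 1
    return bins


def cosine(a: List[int], b: List[int]) -> float:
    """Cosine similarity of two fingerprints; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def fingerprint_run(
    revisions: Iterable[str], num_bins: int = 16
) -> List[Tuple[str, List[int]]]:
    """Fingerprint a run of revision texts, reusing results by checksum.

    Nothing is stored beyond this call: the cache lives only for the run.
    Only (checksum, fingerprint) pairs are returned, not the text.
    """
    cache: Dict[str, List[int]] = {}
    out: List[Tuple[str, List[int]]] = []
    for text in revisions:
        checksum = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if checksum not in cache:
            cache[checksum] = fingerprint(text, num_bins)
        out.append((checksum, cache[checksum]))
    return out
```

With something like this, a revert to an earlier text shows up as a repeated checksum within the run and the fingerprint is not recomputed, and the client only receives `num_bins` integers per revision rather than the full text.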

In my opinion it's better to do the processing on the server and deliver only the fingerprint to the client, as that means less data to transfer. Jeblad 04:45, 15 November 2011 (UTC)