Requests for comment/Database field for checksum of page text

= Proposal Adding checksum field to revision table =

Purpose
Add a checksum field to the revision table that reflects the contents of the "old_text" column.

Reasoning:
 * A checksum field will allow to verify the validity of the XML dump files.
 * This column could then be checked against at each insertion into the text table in order to avoid saving the same text multiple times. This could reduce the necessary disc footprint required to store an amount of revisions.
 * Exposing this checksum field via the API would allow developers to know when it is necessary to request the text of a revision. They may already have it a copy of it that was previously requested and can simply use that.
 * The checksum field would provide a quick mechanism for detecting identity reverts. By identity revert, I mean a reversion of a page to an *exact* previous state such that both revisions would contain text of the same checksum.

Choice of checksum
I will consider the following hash functions to calculate the proposed checksum: MD5, SHA1 and different flavors of the Murmur hash. Both MD5 and SHA1 are cryptographic hashes while Murmur hash is not.

Performance Benchmark
-- -- -- This is for relative performance and not absolute performance comparison, benchmarks were generated using a 2.8Ghz Intel Core 2 Duo 4Gb Macbook Pro and conducted in Python. The SHA1 and MD5 benchmarks in PHP are slightly slower but I guess they use the same underlying C library.
 * Hash           | # Iterations   | # Random words   | time (secs)  |
 * MD5            | 10,000         | 5,000            | 0.86         |
 * SHA1           | 10,000         | 5,000            | 1.22         |
 * Murmur3_x64_128 | 10,000        | 5,000            | 0.19         |
 * Murmur3_x64_64 | 10,000         | 5,000            | 0.19         |
 * Murmur3_x86_128 | 10,000        | 5,000            | 0.22         |
 * Murmur3_x86_64 | 10,000         | 5,000            | 0.21         |

Rob Lanpier suggested that SHA1 might ease integration with Github, see https://bugzilla.wikimedia.org/show_bug.cgi?id=25312#c7

Murmur hash is developed by Austin Appleyby, see http://code.google.com/p/smhasher/ Murmur hash is a non-cryptographic hash but it's very fast, public domain, and has very good distribution, collision, and performance properties (see http://code.google.com/p/smhasher/wiki/SMHasher)

Field type

 * Hash           | # Min Size      | # Field type                         |  Total time | Total space |
 * | (bytes)        |                                      |  (hours)    | (GB)        |


 * MD5            | 16              | BINARY(16) / CHAR(32)                |  4:23:43    | 6.69        |
 * SHA1           | 20              | BINARY(20) / CHAR(40)                |  6:29:12    | 8.35        |
 * Murmur3_x64_128 | 8              | DOUBLE / VARBINARY(20) / VARCHAR(39) |  1:00:04    | 3.34        |
 * Murmur3_x64_64 | 8               | BIGINT / VARBINARY(10) / VARCHAR(20) |  0:59:57    | 3.34        |
 * Murmur3_x86_128 | 8              | DOUBLE / VARBINARY(20) / VARCHAR(39) |  1:07:18    | 3.34        |
 * Murmur3_x86_64 | 8               | BIGINT / VARBINARY(10) / VARCHAR(20) |  1:07:28    | 3.34        |

Total space requirements is an approximation based on 411,257,397 revisions. Total time is an approximation of how time it would take to calculate the checksums. This is based on the fact that the sum of rev_len for all revisions is 7599309679117 bytes. It seems that the length of a murmur3 hash varies.

Two field types are possible for the MD5 and SHA1 hashes: CHAR and BINARY. The murmur hash can be stored either as a BIGINT, DOUBLE a VARBINARY or a VARCHAR. Binary seems to give a minor performance gain and takes less space to store. From the manual this is the main difference: The BINARY and VARBINARY types are similar to CHAR and VARCHAR, except that they contain binary strings rather than non-binary strings. That is, they contain byte strings rather than character strings. This means that they have no character set, and sorting and comparison are based on the numeric values of the bytes in the values.

Index
No index is required for this field. I haven't heard of a use case that would frequently require an index.

= Deployment = We can choose from two general strategies to generate the checksums: 1) Generate bulk checksums from a recent XML dump file and

Disk space
An ALTER TABLE query on Innodb databases makes a full copy of the table. According to Asher Feldman, the current revision table is 79Gb. The current number of rows in revision is around 412M. We would need at least 90Gb on each server to run the ALTER TABLE query and add the new checksum field.

Time required
I do not yet have the required information to make an estimate for this.

Target release
Most of the patches to include this functionality are already available and it seems that MW 1.19 is a feasible target version.

Current patch
I suggest to change the VARBINARY in the current patch to a BINARY field as SHA1 checksum are constant in length. This will save 1 byte per revision. The current patch is available at http://www.mediawiki.org/wiki/Special:Code/MediaWiki/94289

Relevant bug reports
The following bug reports are relevant for this proposal:
 * MD5 or SHA1 checksum in stub dumps
 * Add checksum field to database table; expose it in API