Compression

This page is about data compression as it relates to MediaWiki.

Database dumps
The Wikimedia databases are quite large, so Wikimedia compresses the database dumps using bzip2.
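
A dump can also be read as a stream, so the whole multi-gigabyte file never has to be unpacked to disk. The sketch below uses PHP's compress.bzip2:// stream wrapper; the filename is only a placeholder for whichever dump file has been downloaded.

 <?php
 // Count <page> elements while streaming the compressed dump.
 $fh = fopen( 'compress.bzip2://pages-articles.xml.bz2', 'r' );
 $pages = 0;
 while ( ( $line = fgets( $fh ) ) !== false ) {
     if ( strpos( $line, '<page>' ) !== false ) {
         $pages++;
     }
 }
 fclose( $fh );
 echo "$pages pages in the dump\n";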

Output
HTTP allows individual pages to be served compressed. Both the browser and the server must support it; support is negotiated per request, with an uncompressed version available as a fallback. This is on by default if PHP has zlib support enabled (no Apache modules are required). The CPU time spent compressing on the server is negligible, dwarfed by things like loading the PHP scripts, and the bandwidth savings are considerable.
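
One way to see what the negotiation amounts to in plain PHP (an illustration, not necessarily the exact call MediaWiki makes): ob_gzhandler inspects the request's Accept-Encoding header and compresses the output buffer only if the browser advertised gzip or deflate support.

 <?php
 // Compress the buffered page output only for clients that accept it.
 ob_start( 'ob_gzhandler' );
 echo "<html><body>...page body...</body></html>\n";
 // The buffer is sent (compressed or not) when the script ends.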

Articles
Anthony DiPierro is looking into the feasibility of using Huffman coding. A preliminary character count over the article space produced Huffman codes that would give about 35% compression. Gzipping individual articles outperforms this (55% compression, with the compressed article space taking up about half a gigabyte as of December 2004), but Huffman coding based on words instead of characters (Huffword) might provide approximately 75% compression (see Adding Compression to a Full-text Retrieval System). Additionally, the dictionary could be used for history compression, with the Huffman codes reassigned on a per-article-title basis. This could perhaps approach other proposed methods of history compression while still retaining random access. The dictionary could also be reused (with extensions) for full-text search.
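
For reference, the character-based Huffman coding measured above can be sketched in a few lines. The script below is a standalone illustration, not MediaWiki code; the helper name and input filename are made up. It computes per-byte code lengths and reports the resulting coded size.

 <?php
 // Build character-level Huffman code lengths and estimate the coded size.
 function huffmanCodeLengths( $text ) {
     $freq = count_chars( $text, 1 );        // byte value => occurrence count
     $nodes = [];
     $lengths = [];
     foreach ( $freq as $byte => $count ) {
         $nodes[] = [ 'count' => $count, 'bytes' => [ $byte ] ];
         $lengths[$byte] = 0;
     }
     // Repeatedly merge the two least frequent subtrees; every symbol in a
     // merged subtree gains one bit of code length.
     while ( count( $nodes ) > 1 ) {
         usort( $nodes, function ( $a, $b ) { return $a['count'] - $b['count']; } );
         $a = array_shift( $nodes );
         $b = array_shift( $nodes );
         $merged = array_merge( $a['bytes'], $b['bytes'] );
         foreach ( $merged as $byte ) {
             $lengths[$byte]++;
         }
         $nodes[] = [ 'count' => $a['count'] + $b['count'], 'bytes' => $merged ];
     }
     return $lengths;
 }

 $text = file_get_contents( 'article.txt' );  // placeholder input
 $lengths = huffmanCodeLengths( $text );
 $bits = 0;
 foreach ( count_chars( $text, 1 ) as $byte => $count ) {
     $bits += $count * $lengths[$byte];
 }
 printf( "Huffman: %d bytes, original: %d bytes\n", ceil( $bits / 8 ), strlen( $text ) );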

On or about 2004-02-20 the old and archive tables were changed to allow some of the revision text stored there to be compressed. Old entries marked with old_flags="gzip" have their old_text compressed with zlib's deflate algorithm, with no header bytes. PHP's gzinflate() accepts this text as-is; in Perl and other languages, set the window bits to -MAX_WBITS to disable the header bytes.
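
A minimal sketch of reading such a row follows. The table, column, and flag names match the schema just described; the connection details and the row ID are placeholders.

 <?php
 // Fetch one old revision and inflate it if the gzip flag is set.
 $db = new mysqli( 'localhost', 'wikiuser', 'secret', 'wikidb' );   // placeholder credentials
 $row = $db->query( "SELECT old_flags, old_text FROM old WHERE old_id = 12345" )->fetch_assoc();
 if ( strpos( $row['old_flags'], 'gzip' ) !== false ) {
     // old_text holds raw deflate data with no header, which gzinflate() reads directly.
     $text = gzinflate( $row['old_text'] );
 } else {
     $text = $row['old_text'];
 }
 echo $text;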

Page histories
It is also possible to compress the history table in a way that exploits the similarity between successive versions of a page, for example with reverse diffs as used in version control systems. See History compression for some actual numbers.
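
Reverse diffs are one way to do this. A cruder demonstration of why exploiting inter-revision similarity pays is to deflate two near-identical revisions separately and then concatenated; the revision text below is made up for illustration.

 <?php
 // Compare deflating two similar revisions separately vs. together.
 $lines = [];
 for ( $i = 0; $i < 300; $i++ ) {
     $lines[] = "Line $i of the article body, with wording that varies a little.";
 }
 $rev1 = implode( "\n", $lines );
 $lines[150] = "Line 150 was reworded by an editor in the second revision.";
 $lines[] = "A brand-new closing sentence.";
 $rev2 = implode( "\n", $lines );

 $separate = strlen( gzdeflate( $rev1 ) ) + strlen( gzdeflate( $rev2 ) );
 $together = strlen( gzdeflate( $rev1 . $rev2 ) );
 printf( "separate: %d bytes, concatenated: %d bytes\n", $separate, $together );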

Cache compression
File cache talks about compression of the cached copies of pages. Now that the Wikimedia projects use Squid caches, it is unclear how much of this is obsolete.