dbzip2 is a utility program currently under development to help speed up working with the large data dumps we produce for Wikipedia.
Current disks and networks can toss data around 10-100x faster than current CPUs can run bzip2 compression (a measly 2-3 megabytes per second of input data processed). This shows a clear opportunity to parallelize the task; a full order-of-magnitude speedup should be quite possible with sufficient resources.
As of September 29, 2008...
|Remote networked threads||no||no||yes (MPI)||no||no||no||yes|
|Bit-identical with bzip2||yes||no||no||yes||no||no||yes|
|Compatible with bzip2||yes||yes||yes||yes||yes||yes||yes|
|Compatible with libbzip2 apps||yes||no||no||yes||no||yes|
- pbzip2 can only parallel-decompress its own funky output files. Regular bzip2 streams must be processed on a single thread.
- same limitations as with pbzip2.
- lbzip2 decompresses traditional, sequential access bzip2 streams on multiple threads. There is a remote possibility for a bzip2 stream to be both valid and uncompressible by multiple-workers lbzip2, since lbzip2 splits the compressed stream into blocks at bit sequences that look like block delimiters. However, these bit-sequences can occur (with a tiny possibility) within one compressed block. When lbzip2 finds such a bit-sequence, it exits with an error. Furthermore, the multiple-workers decompressor may reject a bz2 file created by manual concatenation if it contains too many adjacent empty bzip2 streams.
- Consistent output is not guaranteed in heterogeneous environments; different versions of the bzip2 library may produce slightly different compressed output for each block.
rzip and xdelta3
Revision histories contain a lot of repeats of large sections of the article, long distances apart. gzip and bzip2 are designed to compress redundancy over smaller distances (in a 32kb sliding window for gzip, within 100-900kb blocks for bzip2). You can improve compression ratio or speed (or both) by feeding a full-history dump to something designed to compress out long-range redundancy.
(Some of the below is summarized from Randall's April-May '13 posts to xmldatadumps-l.)
rzip and xdelta3 are a couple of programs that can act as long-range-redundancy compressors. For the particular case of Wikipedia dumps (not arbitrary binary files), just very basic preprocessing in a Python script can also help you out.
In a test, rzip compressed 15GB of dump to 44MB in three minutes, essentially the same ratio as 7zip, 20x as fast. xdelta3 got the 15G sample down to 141M in 2.5 min, and (as xdelta -9 | xz -9) down to 91M in 4 min.
Downsides: rzip won't stream input/output; it needs real, seekable, on-filesystem files. And it can eat lots of RAM for large inputs. One mitigation is to write a wrapper script that chops the many-GB input stream into chunks of a few MB and (de)compresses each separately. A proof-of-concept Python script for that approach exists. It's somewhat slower and doesn't compress as well, but it might be usable where rzip isn't.
My (Randall's) overall impression is that while rzip helps in a narrow technical sense (it makes smallish archives quickly-ish) it still isn't ideal compared to some sort of purpose-built delta-coded dump format, and probably isn't practical to deploy just because we can't ask everyone who wants to use history dumps to install a new decompressor and wrapper script. Interested to see what comes out of the incremental-dumps GSoC project.
A preprocessor can remove some long-range redundancy (e.g., chunks of article repeated across revisions) and help put other redundant content (sections with a few edits) nearer to each other so that the short-range compression of gzip or bzip2 can be more effective. For example, this procedure only takes a few lines of Python:
- Create a sorted list of all the distinct lines that occur in the input.
- Instead of saving the full text, just save indices into that list (the "line numbers").
- Compress both the list of distinct lines and the line numbers with gzip.
Then you don't have to store many identical copies of the same line; also, sorting the list of lines tends to place different revisions of the same paragraph next to each other, where gzip can compress out the shared parts. Running that on 4.1G of history compressed it to 63M in 2m45s. (Using the same approach, but using bzip instead of gzip on the sorted line list, got the same file down to 56M in 3m40s.) Projecting from a 50M sample of the file, bzip might take about 27m and produce a 131M file.
There are lots of possible optimizations; maybe the point here is, again, that almost any effort to compress long-range redundancy before using a general-purpose compressor helps.
Testing compression of a 100,000,000-byte extract of a Wikipedia dump file on a 2.0 GHz dual-core G5 running Mac OS X 10.4.6. Times are seconds of real time as reported by 'time', best of three runs. iTunes, irc running in background.
2x bzip2smp: 16.469 1.72x 5x dbzip2 lan: 17.522 1.62x 2x pbzip2: 18.716 1.51x 2x dbzip2 local: 20.214 1.40x 2x dbzip2 remote: 22.218 1.27x 1x bzip2smp: 27.031 1.05x 1x bzip2: 28.300 1.00x (baseline) 1x pbzip2: 31.742 0.89x 1x dbzip2 local: 32.388 0.87x 1x dbzip2 remote: 36.542 0.77x
The '5x dbzip2 lan' configuration includes remote threads on a 2.0 GHz Athlon XP (over 100 MBps ethernet), a 1.5 GHz Intel Core Solo (via wireless), and a 1.0 GHz G4 (via wireless). See earlier post about performance breakdown on this setup.
dbzip2 currently includes a number of inefficiencies, such as doing RLE encoding twice and picking apart bitstreams, so I'm pretty happy that it does as well as it does so far. bzip2smp does very well on local threads and sets a good standard to aim for, though the bitstream shifting requirements probably mean we can't beat it (unless we cheat!)
The real shine, though, comes from making use of a cluster of multiple fast machines with a fast network. The following figures were made with the same data extract, running remote threads on 1 to 12 of Wikimedia'a database servers.
The servers were under (relatively light nighttime/morning) load, but they usually have lots of spare CPU cycles and gigabit ethernet leaves plenty of bandwidth.
CPUs Time MB/sec input Linear approximation 1 30.363 3.29348219872872 2.6 2 19.203 5.20751965838671 5.2 3 11.692 8.55285665412248 7.8 4 8.584 11.6495806150979 10.4 5 6.775 14.760147601476 13 6 6.006 16.6500166500166 15.6 7 5.227 19.1314329443275 18.2 8 4.78 20.9205020920502 20.8 9 5.324 18.7828700225394 23.4 10 4.155 24.0673886883273 26 11 4.211 23.7473284255521 28.6 12 4.396 22.7479526842584 31.2
It appears to scale pretty well up to 8 threads, then starts trailing off a bit; the local processing on the hub takes up over half the runtime by the end. Throughput peaked around 24 megabytes per second, which is pretty satisfying and a nice big leap over the single-threaded throughput.
dbzip2 is not ready for public use yet. The current prototype is written in Python and C, using the standard bzip2 library to do the heavy lifting.
proof of concept local threads, pbzip2-style proof of concept remote threads, pbzip2-style proof of concept bitstream merging for library-compatible output reimplement bitstream in C for performance break up input into proper-sized blocks to produce identical output and avoid overruns on RLE worst cases
- Note this can use a lot of memory buffering very long RLE runs with the current system. These should be relatively rare in our cases, though.
- parallelize decompression by scanning bitstream for block boundaries
automatic recovery from remote node failures
- auto-detect number of local processors
- pipelining to optimize network throughput
- config file for remote node list
- zeroconf for remote nodes?
- automatic recovery from false-positive decompression block boundaries
- Reimplement in C or C++, with heavily-modified bzip2 library code instead of hacking bitstreams
- Try to do similar for 7zip