Dbzip2
From MediaWiki.org
dbzip2 is a utility program currently under development to help speed up working with the large data dumps we produce for Wikipedia.
Current disks and networks can toss data around 10-100x faster than current CPUs can run bzip2 compression (a measly 2-3 megabytes per second of input data processed). This shows a clear opportunity to parallelize the task; a full order-of-magnitude speedup should be quite possible with sufficient resources.
Contents |
[edit] Feature comparison
As of September 29, 2008...
| bzip2 | pbzip2 | mpibzip2 | bzip2smp | smpbzip2 | lbzip2 | dbzip2 | |
|---|---|---|---|---|---|---|---|
| Parallel compression | no | yes | yes | yes | yes | yes | yes |
| Parallel decompression | no | yes[1] | yes[2] | no | no | yes[3] | no |
| Remote networked threads | no | no | yes (MPI) | no | no | no | yes |
| Bit-identical with bzip2 | yes | no | no | yes | no | no | yes[4] |
| Compatible with bzip2 | yes | yes | yes | yes | yes | yes | yes |
| Compatible with libbzip2 apps | yes | no | no | yes | no | yes |
- ↑ pbzip2 can only parallel-decompress its own funky output files. Regular bzip2 streams must be processed on a single thread.
- ↑ same limitations as with pbzip2.
- ↑ lbzip2 decompresses traditional, sequential access bzip2 streams on multiple threads. There is a remote possibility for a bzip2 stream to be both valid and uncompressible by multiple-workers lbzip2, since lbzip2 splits the compressed stream into blocks at bit sequences that look like block delimiters. However, these bit-sequences can occur (with a tiny possibility) within one compressed block. When lbzip2 finds such a bit-sequence, it exits with an error. Furthermore, the multiple-workers decompressor may reject a bz2 file created by manual concatenation if it contains too many adjacent empty bzip2 streams.
- ↑ Consistent output is not guaranteed in heterogeneous environments; different versions of the bzip2 library may produce slightly different compressed output for each block.
[edit] Benchmarks
[edit] Local
Testing compression of a 100,000,000-byte extract of a Wikipedia dump file on a 2.0 GHz dual-core G5 running Mac OS X 10.4.6. Times are seconds of real time as reported by 'time', best of three runs. iTunes, irc running in background.
2x bzip2smp: 16.469 1.72x 5x dbzip2 lan: 17.522 1.62x 2x pbzip2: 18.716 1.51x 2x dbzip2 local: 20.214 1.40x 2x dbzip2 remote: 22.218 1.27x 1x bzip2smp: 27.031 1.05x 1x bzip2: 28.300 1.00x (baseline) 1x pbzip2: 31.742 0.89x 1x dbzip2 local: 32.388 0.87x 1x dbzip2 remote: 36.542 0.77x
The '5x dbzip2 lan' configuration includes remote threads on a 2.0 GHz Athlon XP (over 100 MBps ethernet), a 1.5 GHz Intel Core Solo (via wireless), and a 1.0 GHz G4 (via wireless). See earlier post about performance breakdown on this setup.
dbzip2 currently includes a number of inefficiencies, such as doing RLE encoding twice and picking apart bitstreams, so I'm pretty happy that it does as well as it does so far. bzip2smp does very well on local threads and sets a good standard to aim for, though the bitstream shifting requirements probably mean we can't beat it (unless we cheat!)
[edit] Cluster
The real shine, though, comes from making use of a cluster of multiple fast machines with a fast network. The following figures were made with the same data extract, running remote threads on 1 to 12 of Wikimedia'a database servers.
The servers were under (relatively light nighttime/morning) load, but they usually have lots of spare CPU cycles and gigabit ethernet leaves plenty of bandwidth.
CPUs Time MB/sec input Linear approximation 1 30.363 3.29348219872872 2.6 2 19.203 5.20751965838671 5.2 3 11.692 8.55285665412248 7.8 4 8.584 11.6495806150979 10.4 5 6.775 14.760147601476 13 6 6.006 16.6500166500166 15.6 7 5.227 19.1314329443275 18.2 8 4.78 20.9205020920502 20.8 9 5.324 18.7828700225394 23.4 10 4.155 24.0673886883273 26 11 4.211 23.7473284255521 28.6 12 4.396 22.7479526842584 31.2
It appears to scale pretty well up to 8 threads, then starts trailing off a bit; the local processing on the hub takes up over half the runtime by the end. Throughput peaked around 24 megabytes per second, which is pretty satisfying and a nice big leap over the single-threaded throughput.
[edit] Development status
dbzip2 is not ready for public use yet. The current prototype is written in Python and C, using the standard bzip2 library to do the heavy lifting.
Checklist:
proof of concept local threads, pbzip2-styleproof of concept remote threads, pbzip2-styleproof of concept bitstream merging for library-compatible outputreimplement bitstream in C for performancebreak up input into proper-sized blocks to produce identical output and avoid overruns on RLE worst cases- Note this can use a lot of memory buffering very long RLE runs with the current system. These should be relatively rare in our cases, though.
- parallelize decompression by scanning bitstream for block boundaries
automatic recovery from remote node failures
Additional goodies:
- auto-detect number of local processors
- pipelining to optimize network throughput
- config file for remote node list
- zeroconf for remote nodes?
- automatic recovery from false-positive decompression block boundaries
Possible future:
- Reimplement in C or C++, with heavily-modified bzip2 library code instead of hacking bitstreams
- Java-friendly version for at least decompression on local threads, for mwdumper
- Try to do similar for 7zip