Dbzip2

dbzip2 is a utility program currently under development to help speed up working with the large data dumps we produce for Wikipedia.

Current disks and networks can toss data around 10-100x faster than current CPUs can run bzip2 compression (a measly 2-3 megabytes per second of input data processed). This shows a clear opportunity to parallelize the task; a full order-of-magnitude speedup should be quite possible with sufficient resources.

Feature comparison
As of May 30, 2006...

Local
Testing compression of a 100,000,000-byte extract of a Wikipedia dump file on a 2.0 GHz dual-core G5 running Mac OS X 10.4.6. Times are seconds of real time as reported by 'time', best of three runs. iTunes, irc running in background.

2x bzip2smp:     16.469 1.72x 5x dbzip2 lan:   17.522 1.62x 2x pbzip2:       18.716 1.51x 2x dbzip2 local: 20.214 1.40x 2x dbzip2 remote: 22.218 1.27x 1x bzip2smp:     27.031 1.05x 1x bzip2:        28.300 1.00x (baseline) 1x pbzip2:       31.742 0.89x 1x dbzip2 local: 32.388 0.87x 1x dbzip2 remote: 36.542 0.77x

The '5x dbzip2 lan' configuration includes remote threads on a 2.0 GHz Athlon XP (over 100 MBps ethernet), a 1.5 GHz Intel Core Solo (via wireless), and a 1.0 GHz G4 (via wireless). See earlier post about performance breakdown on this setup.

dbzip2 currently includes a number of inefficiencies, such as doing RLE encoding twice and picking apart bitstreams, so I'm pretty happy that it does as well as it does so far. bzip2smp does very well on local threads and sets a good standard to aim for, though the bitstream shifting requirements probably mean we can't beat it (unless we cheat!)

Cluster
The real shine, though, comes from making use of a cluster of multiple fast machines with a fast network. The following figures were made with the same data extract, running remote threads on 1 to 12 of Wikimedia'a database servers.

The servers were under (relatively light nighttime/morning) load, but they usually have lots of spare CPU cycles and gigabit ethernet leaves plenty of bandwidth.



CPUs	Time	MB/sec input	Linear approximation 1	30.363	3.29348219872872	2.6 2	19.203	5.20751965838671	5.2 3	11.692	8.55285665412248	7.8 4	8.584	11.6495806150979	10.4 5	6.775	14.760147601476	13 6	6.006	16.6500166500166	15.6 7	5.227	19.1314329443275	18.2 8	4.78	20.9205020920502	20.8 9	5.324	18.7828700225394	23.4 10	4.155	24.0673886883273	26 11	4.211	23.7473284255521	28.6 12	4.396	22.7479526842584	31.2

It appears to scale pretty well up to 8 threads, then starts trailing off a bit; the local processing on the hub takes up over half the runtime by the end. Throughput peaked around 24 megabytes per second, which is pretty satisfying and a nice big leap over the single-threaded throughput.

Development status
dbzip2 is not ready for public use yet. The current prototype is written in Python and C, using the standard bzip2 library to do the heavy lifting.


 * first notes 2006-05-12
 * update 2006-05-22
 * update 2006-05-31

Checklist:
 * proof of concept local threads, pbzip2-style
 * proof of concept remote threads, pbzip2-style
 * proof of concept bitstream merging for library-compatible output
 * reimplement bitstream in C for performance
 * break up input into proper-sized blocks to produce identical output and avoid overruns on RLE worst cases
 * Note this can use a lot of memory buffering very long RLE runs with the current system. These should be relatively rare in our cases, though.
 * parallelize decompression by scanning bitstream for block boundaries
 * automatic recovery from remote node failures

Additional goodies:
 * auto-detect number of local processors
 * pipelining to optimize network throughput
 * config file for remote node list
 * zeroconf for remote nodes?
 * automatic recovery from false-positive decompression block boundaries

Possible future:
 * Reimplement in C or C++, with heavily-modified bzip2 library code instead of hacking bitstreams
 * Java-friendly version for at least decompression on local threads, for mwdumper
 * Try to do similar for 7zip