Dbzip2

From MediaWiki.org

Jump to: navigation, search

dbzip2 is a utility program currently under development to help speed up working with the large data dumps we produce for Wikipedia.

Current disks and networks can toss data around 10-100x faster than current CPUs can run bzip2 compression (a measly 2-3 megabytes per second of input data processed). This shows a clear opportunity to parallelize the task; a full order-of-magnitude speedup should be quite possible with sufficient resources.

Contents

[edit] Feature comparison

As of September 29, 2008...

bzip2 pbzip2 bzip2smp lbzip2 dbzip2
Parallel compression no yes yes yes yes
Parallel decompression no yes[1] no yes[2] no
Remote networked threads no no no no yes
Bit-identical with bzip2 yes no yes no yes[3]
Compatible with bzip2 yes yes yes yes yes
Compatible with libbzip2 apps yes no yes no yes
  1. pbzip2 can only parallel-decompress its own funky output files. Regular bzip2 streams must be processed on a single thread.
  2. lbzip2 decompresses traditional, sequential access bzip2 streams on multiple threads. There is a remote possibility for a bzip2 stream to be both valid and uncompressible by multiple-workers lbzip2, since lbzip2 splits the compressed stream into blocks at bit sequences that look like block delimiters. However, these bit-sequences can occur (with a tiny possibility) within one compressed block. When lbzip2 finds such a bit-sequence, it exits with an error. Furthermore, the multiple-workers decompressor may reject a bz2 file created by manual concatenation if it contains too many adjacent empty bzip2 streams.
  3. Consistent output is not guaranteed in heterogeneous environments; different versions of the bzip2 library may produce slightly different compressed output for each block.

[edit] Benchmarks

[edit] Local

Testing compression of a 100,000,000-byte extract of a Wikipedia dump file on a 2.0 GHz dual-core G5 running Mac OS X 10.4.6. Times are seconds of real time as reported by 'time', best of three runs. iTunes, irc running in background.

2x bzip2smp:      16.469 1.72x
5x dbzip2 lan:    17.522 1.62x
2x pbzip2:        18.716 1.51x
2x dbzip2 local:  20.214 1.40x
2x dbzip2 remote: 22.218 1.27x
1x bzip2smp:      27.031 1.05x
1x bzip2:         28.300 1.00x (baseline)
1x pbzip2:        31.742 0.89x
1x dbzip2 local:  32.388 0.87x
1x dbzip2 remote: 36.542 0.77x

The '5x dbzip2 lan' configuration includes remote threads on a 2.0 GHz Athlon XP (over 100 MBps ethernet), a 1.5 GHz Intel Core Solo (via wireless), and a 1.0 GHz G4 (via wireless). See earlier post about performance breakdown on this setup.

dbzip2 currently includes a number of inefficiencies, such as doing RLE encoding twice and picking apart bitstreams, so I'm pretty happy that it does as well as it does so far. bzip2smp does very well on local threads and sets a good standard to aim for, though the bitstream shifting requirements probably mean we can't beat it (unless we cheat!)

[edit] Cluster

The real shine, though, comes from making use of a cluster of multiple fast machines with a fast network. The following figures were made with the same data extract, running remote threads on 1 to 12 of Wikimedia'a database servers.

The servers were under (relatively light nighttime/morning) load, but they usually have lots of spare CPU cycles and gigabit ethernet leaves plenty of bandwidth.

CPUs	Time	MB/sec input	Linear approximation
1	30.363	3.29348219872872	2.6
2	19.203	5.20751965838671	5.2
3	11.692	8.55285665412248	7.8
4	8.584	11.6495806150979	10.4
5	6.775	14.760147601476	13
6	6.006	16.6500166500166	15.6
7	5.227	19.1314329443275	18.2
8	4.78	20.9205020920502	20.8
9	5.324	18.7828700225394	23.4
10	4.155	24.0673886883273	26
11	4.211	23.7473284255521	28.6
12	4.396	22.7479526842584	31.2

It appears to scale pretty well up to 8 threads, then starts trailing off a bit; the local processing on the hub takes up over half the runtime by the end. Throughput peaked around 24 megabytes per second, which is pretty satisfying and a nice big leap over the single-threaded throughput.

[edit] Development status

dbzip2 is not ready for public use yet. The current prototype is written in Python and C, using the standard bzip2 library to do the heavy lifting.

Checklist:

  • proof of concept local threads, pbzip2-style
  • proof of concept remote threads, pbzip2-style
  • proof of concept bitstream merging for library-compatible output
  • reimplement bitstream in C for performance
  • break up input into proper-sized blocks to produce identical output and avoid overruns on RLE worst cases
    • Note this can use a lot of memory buffering very long RLE runs with the current system. These should be relatively rare in our cases, though.
  • parallelize decompression by scanning bitstream for block boundaries
  • automatic recovery from remote node failures

Additional goodies:

  • auto-detect number of local processors
  • pipelining to optimize network throughput
  • config file for remote node list
  • zeroconf for remote nodes?
  • automatic recovery from false-positive decompression block boundaries

Possible future:

  • Reimplement in C or C++, with heavily-modified bzip2 library code instead of hacking bitstreams
  • Java-friendly version for at least decompression on local threads, for mwdumper
  • Try to do similar for 7zip
Personal tools