Dbzip2

dbzip2 is a utility program currently under development to help speed up working with the large data dumps we produce for Wikipedia.

Current disks and networks can toss data around 10-100x faster than current CPUs can run bzip2 compression (a measly 2-3 megabytes per second of input data processed). This shows a clear opportunity to parallelize the task; a full order-of-magnitude speedup should be quite possible with sufficient resources.

Feature comparison

As of September 29, 2008...

	bzip2	pbzip2	mpibzip2	bzip2smp	smpbzip2	lbzip2	dbzip2
Parallel compression	no	yes	yes	yes	yes	yes	yes
Parallel decompression	no	yes[1]	yes[2]	no	no	yes[3]	no
Remote networked threads	no	no	yes (MPI)	no	no	no	yes
Bit-identical with bzip2	yes	no	no	yes	no	no	yes[4]
Compatible with bzip2	yes	yes	yes	yes	yes	yes	yes
Compatible with libbzip2 apps	yes	no	no	yes	no	yes	
  1. pbzip2 can only parallel-decompress its own funky output files. Regular bzip2 streams must be processed on a single thread.
  2. Same limitations as with pbzip2.
  3. lbzip2 decompresses traditional, sequential-access bzip2 streams on multiple threads. There is a remote possibility that a bzip2 stream is valid yet cannot be decompressed by lbzip2's multiple-worker decompressor: lbzip2 splits the compressed stream into blocks at bit sequences that look like block delimiters, and such bit sequences can occur (with tiny probability) inside a single compressed block. When lbzip2 finds such a bit sequence, it exits with an error. Furthermore, the multiple-worker decompressor may reject a bz2 file created by manual concatenation if it contains too many adjacent empty bzip2 streams.
  4. Consistent output is not guaranteed in heterogeneous environments; different versions of the bzip2 library may produce slightly different compressed output for each block.

rzip and xdelta3

Revision histories contain many repeats of large sections of an article, long distances apart. gzip and bzip2 are designed to compress redundancy over much shorter distances (a 32 KB sliding window for gzip; 100-900 KB blocks for bzip2). You can improve compression ratio or speed (or both) by feeding a full-history dump to something designed to compress out long-range redundancy.

(Some of the below is summarized from Randall's April-May '13 posts to xmldatadumps-l.)

rzip and xdelta3 are a couple of programs that can act as long-range-redundancy compressors. For the particular case of Wikipedia dumps (not arbitrary binary files), just very basic preprocessing in a Python script can also help you out.

In a test, rzip compressed a 15 GB dump sample to 44 MB in three minutes, essentially the same ratio as 7zip but 20x as fast. xdelta3 got the same 15 GB sample down to 141 MB in 2.5 minutes, and (as xdelta3 -9 | xz -9) down to 91 MB in 4 minutes.

Downsides: rzip won't stream input/output; it needs real, seekable, on-filesystem files. And it can eat lots of RAM for large inputs. One mitigation is to write a wrapper script that chops the many-GB input stream into chunks of a few MB and (de)compresses each separately. A proof-of-concept Python script for that approach exists. It's somewhat slower and doesn't compress as well, but it might be usable where rzip isn't.
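
A minimal sketch of that kind of wrapper, in the spirit of the proof-of-concept script (the chunk size, file naming, and rzip invocation here are illustrative assumptions, not the actual script):

  # Hypothetical chunking wrapper: read stdin in fixed-size chunks, write each
  # chunk to a real on-disk temporary file (rzip needs seekable files, not
  # pipes), and compress each chunk separately so memory use stays bounded.
  import os
  import shutil
  import subprocess
  import sys
  import tempfile

  CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB per chunk: bigger chunks compress better but need more RAM

  def compress_stream(out_prefix):
      index = 0
      while True:
          chunk = sys.stdin.buffer.read(CHUNK_SIZE)
          if not chunk:
              break
          with tempfile.NamedTemporaryFile(delete=False) as tmp:
              tmp.write(chunk)
              tmp_name = tmp.name
          # rzip compresses tmp_name to tmp_name + ".rz" (and normally removes the input)
          subprocess.check_call(["rzip", tmp_name])
          shutil.move(tmp_name + ".rz", "%s.%05d.rz" % (out_prefix, index))
          if os.path.exists(tmp_name):
              os.unlink(tmp_name)
          index += 1

  if __name__ == "__main__":
      compress_stream(sys.argv[1])

Decompression just reverses this chunk by chunk and concatenates the results; splitting into independent chunks is also where the small loss in compression ratio comes from.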

My (Randall's) overall impression is that while rzip helps in a narrow technical sense (it makes smallish archives quickly-ish), it still isn't ideal compared to some sort of purpose-built delta-coded dump format, and it probably isn't practical to deploy simply because we can't ask everyone who wants to use history dumps to install a new decompressor and wrapper script. I'm interested to see what comes out of the incremental-dumps GSoC project.

Preprocessing

A preprocessor can remove some long-range redundancy (e.g., chunks of an article repeated across revisions) and put other redundant content (sections with only a few edits) nearer to each other, so that the short-range compression of gzip or bzip2 can be more effective. For example, this procedure takes only a few lines of Python:

  1. Create a sorted list of all the distinct lines that occur in the input.
  2. Instead of saving the full text, just save indices into that list (the "line numbers").
  3. Compress both the list of distinct lines and the line numbers with gzip.

Then you don't have to store many identical copies of the same line; also, sorting the list of lines tends to place different revisions of the same paragraph next to each other, where gzip can compress out the shared parts. Running that on 4.1 GB of history compressed it to 63 MB in 2m45s. (The same approach, but with bzip2 instead of gzip on the sorted line list, got the same file down to 56 MB in 3m40s.) For comparison, projecting from a 50 MB sample of the file, plain bzip2 without any preprocessing might take about 27 minutes and produce a 131 MB file.
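
A minimal sketch of that procedure (the file layout is illustrative, and it assumes the input fits in memory):

  # Minimal sketch of the line-numbering preprocessor: store each distinct line
  # once (sorted), replace the text with indices into that list, and gzip both.
  import gzip

  def preprocess(in_path, lines_path, indices_path):
      with open(in_path, "rb") as f:
          text = f.readlines()

      # 1. Sorted list of all the distinct lines that occur in the input.
      distinct = sorted(set(text))
      index_of = {line: i for i, line in enumerate(distinct)}

      # 2. Replace the full text with indices ("line numbers") into that list.
      indices = [index_of[line] for line in text]

      # 3. Compress both the line list and the line numbers with gzip.
      with gzip.open(lines_path, "wb") as out:
          out.writelines(distinct)
      with gzip.open(indices_path, "wb") as out:
          out.write(("\n".join(str(i) for i in indices) + "\n").encode("ascii"))

Undoing it is just the reverse lookup: read the line list back in and emit the stored line for each index.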

There are lots of possible optimizations; maybe the point here is, again, that almost any effort to compress long-range redundancy before using a general-purpose compressor helps.

Benchmarks

Local

Testing compression of a 100,000,000-byte extract of a Wikipedia dump file on a 2.0 GHz dual-core G5 running Mac OS X 10.4.6. Times are seconds of real time as reported by 'time', best of three runs, with iTunes and irc running in the background.

2x bzip2smp:      16.469 1.72x
5x dbzip2 lan:    17.522 1.62x
2x pbzip2:        18.716 1.51x
2x dbzip2 local:  20.214 1.40x
2x dbzip2 remote: 22.218 1.27x
1x bzip2smp:      27.031 1.05x
1x bzip2:         28.300 1.00x (baseline)
1x pbzip2:        31.742 0.89x
1x dbzip2 local:  32.388 0.87x
1x dbzip2 remote: 36.542 0.77x

The '5x dbzip2 lan' configuration includes remote threads on a 2.0 GHz Athlon XP (over 100 Mbps Ethernet), a 1.5 GHz Intel Core Solo (via wireless), and a 1.0 GHz G4 (via wireless). See the earlier post about the performance breakdown on this setup.

dbzip2 currently includes a number of inefficiencies, such as doing RLE encoding twice and picking apart bitstreams, so I'm pretty happy that it does as well as it does so far. bzip2smp does very well on local threads and sets a good standard to aim for, though the bitstream-shifting requirements probably mean we can't beat it (unless we cheat!).

Cluster

The real shine, though, comes from making use of a cluster of multiple fast machines with a fast network. The following figures were made with the same data extract, running remote threads on 1 to 12 of Wikimedia's database servers.

The servers were under (relatively light nighttime/morning) load, but they usually have lots of spare CPU cycles, and gigabit Ethernet leaves plenty of bandwidth.

CPUs	Time (s)	Throughput (MB/s of input)	Linear approximation (2.6 MB/s per CPU)
1	30.363	3.29	2.6
2	19.203	5.21	5.2
3	11.692	8.55	7.8
4	8.584	11.65	10.4
5	6.775	14.76	13.0
6	6.006	16.65	15.6
7	5.227	19.13	18.2
8	4.780	20.92	20.8
9	5.324	18.78	23.4
10	4.155	24.07	26.0
11	4.211	23.75	28.6
12	4.396	22.75	31.2

It appears to scale pretty well up to 8 threads, then starts trailing off a bit; the local processing on the hub takes up over half the runtime by the end. Throughput peaked around 24 megabytes per second, which is pretty satisfying and a nice big leap over the single-threaded throughput.

Development status

dbzip2 is not ready for public use yet. The current prototype is written in Python and C, using the standard bzip2 library to do the heavy lifting.

Checklist:

  • proof of concept local threads, pbzip2-style
  • proof of concept remote threads, pbzip2-style
  • proof of concept bitstream merging for library-compatible output
  • reimplement bitstream in C for performance
  • break up input into proper-sized blocks to produce identical output and avoid overruns on RLE worst cases
    • Note this can use a lot of memory buffering very long RLE runs with the current system. These should be relatively rare in our cases, though.
  • parallelize decompression by scanning the bitstream for block boundaries (see the sketch after this list)
  • automatic recovery from remote node failures
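
For the decompression item above, a minimal sketch of the boundary scan (illustrative only): every bzip2 compressed block starts with the 48-bit magic 0x314159265359, which can fall at any bit offset, so the scanner has to shift through the stream bit by bit, and, as with lbzip2 (footnote 3 above), a match can be a false positive inside a block and still has to be verified.

  # Hypothetical sketch: find candidate bzip2 block boundaries by scanning for
  # the 48-bit block-header magic at arbitrary bit offsets. Candidates may be
  # false positives inside a block, so each one still needs to be verified by
  # actually trying to decompress from it.
  BLOCK_MAGIC = 0x314159265359  # bzip2 compressed-block header magic
  MAGIC_BITS = 48
  MAGIC_MASK = (1 << MAGIC_BITS) - 1

  def candidate_block_offsets(data):
      """Yield the bit offsets in `data` (bytes) where the block magic begins."""
      window = 0
      bits_seen = 0
      for byte in data:
          for shift in range(7, -1, -1):
              window = ((window << 1) | ((byte >> shift) & 1)) & MAGIC_MASK
              bits_seen += 1
              if bits_seen >= MAGIC_BITS and window == BLOCK_MAGIC:
                  yield bits_seen - MAGIC_BITS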

Additional goodies:

  • auto-detect number of local processors
  • pipelining to optimize network throughput
  • config file for remote node list
  • zeroconf for remote nodes?
  • automatic recovery from false-positive decompression block boundaries

Possible future:

  • Reimplement in C or C++, with heavily-modified bzip2 library code instead of hacking bitstreams
  • Try to do similar for 7zip