User:GWicke/Notes/Storage

See bug 48483.

Cassandra
Distributed storage with support for indexes, CAS and clustering / transparent compression. Avoids hot spots for IO (problem in ExternalStore sharding scheme).

Idea: Use this for revision storage, with a simple node storage service front-end. Easier to implement than trying to build a frontend for ExternalStore, provides testing for possible wider use.

CREATE TABLE revisions (
    id uuid,
    key text,
    ts timestamp,
    value blob,
    PRIMARY KEY (id, key, ts)
) WITH CLUSTERING ORDER BY (ts DESC);
 * helenus Node.js bindings

// sstable2json output demonstrating resulting clustering by column for delta compression
[
  {
    "key": "550e8400e29b41d4a716446655440000",
    "columns": [
      ["html:1969-12-31 16\\:00\\:00-0800:", "", 1379879308010000],
      ["html:1969-12-31 16\\:00\\:00-0800:value", "666f6f", 1379879308010000],
      ["html:1969-12-31 16\\:00\\:00-0800:", "", 1379879315744000],
      ["html:1969-12-31 16\\:00\\:00-0800:value", "666f6f626172", 1379879315744000],
      ["wikitext:1969-12-31 16\\:00\\:00-0800:", "", 1379879325607000],
      ["wikitext:1969-12-31 16\\:00\\:00-0800:value", "666f6f", 1379879325607000],
      ["wikitext:1969-12-31 16\\:16\\:40-0800:", "", 1379879583462000],
      ["wikitext:1969-12-31 16\\:16\\:40-0800:value", "61207265616c6c79206c6f6e6720737472696e67", 1379879583462000]
    ]
  }
]
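To read the dump above, it can help to split each composite column name and decode the hex-encoded blob values. The following is a minimal Python sketch, assuming the column names are the clustering columns plus the CQL column name joined by `:`, with literal colons escaped as `\:`. The helper names `split_composite` and `decode_value` are made up for illustration and are not part of any Cassandra tooling.

```python
import re


def split_composite(name):
    """Split a composite column name on unescaped colons and unescape the parts."""
    # Assumption: components are joined by ':' and literal colons appear as '\:'.
    parts = re.split(r"(?<!\\):", name)
    return [part.replace("\\:", ":") for part in parts]


def decode_value(hex_value):
    """Decode a hex-encoded blob value as printed in the dump."""
    return bytes.fromhex(hex_value)


if __name__ == "__main__":
    # Two entries copied from the dump above: (name, hex value, write timestamp).
    columns = [
        ["html:1969-12-31 16\\:00\\:00-0800:value", "666f6f626172",
         1379879315744000],
        ["wikitext:1969-12-31 16\\:16\\:40-0800:value",
         "61207265616c6c79206c6f6e6720737472696e67", 1379879583462000],
    ]
    for name, hex_value, _written_at in columns:
        key, ts, column = split_composite(name)
        print(key, ts, column, decode_value(hex_value))
```

Read this way, the entries for the same key (html or wikitext) sit next to each other within the row, which is the adjacency the compression relies on.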

History compression
History compression used to be more efficient when pages on Wikipedia were still smaller than the compression algorithm's window size (typically 64k): meta:History_compression.

-rw-r--r-- 1 gabriel gabriel 143K Sep 23 14:00 /tmp/Atheism.txt
-rw-r--r-- 1 gabriel gabriel  14M Sep 23 14:01 /tmp/Atheism-100.txt
-rw-r--r-- 1 gabriel gabriel 7.8M Sep 23 14:29 /tmp/Atheism-100.txt.lz4
-rw-r--r-- 1 gabriel gabriel 5.0M Sep 23 14:02 /tmp/Atheism-100.txt.gzip9
-rw-r--r-- 1 gabriel gabriel 1.3M Sep 23 14:01 /tmp/Atheism-100.txt.bz2
-rw-r--r-- 1 gabriel gabriel  49K Sep 23 14:05 /tmp/Atheism-100.txt.lzma
 * 1) -100 is 100 concatenations of the single file.
 * 2) First a page larger than the typical 64k compression window.
 * 3) Only lzma fully picks up the repetition with its large window.

-rw-r--r-- 1 gabriel gabriel 7.0K Sep 23 14:16 /tmp/Storage.html
-rw-r--r-- 1 gabriel gabriel 699K Sep 23 14:16 /tmp/Storage-100.html
-rw-r--r-- 1 gabriel gabriel 6.8K Sep 23 14:17 /tmp/Storage-100.html.gz
-rw-r--r-- 1 gabriel gabriel 5.7K Sep 23 14:29 /tmp/Storage-100.html.lz4
-rw-r--r-- 1 gabriel gabriel 4.9K Sep 23 14:16 /tmp/Storage-100.html.bz2
-rw-r--r-- 1 gabriel gabriel 2.2K Sep 23 14:18 /tmp/Storage-100.html.lzma
 * 1) Now a small (more typical) 7k page, this time as HTML.
 * 2) Compression works well using all algorithms.
 * 3) LZ4 (fast and default in Cassandra) outperforms gzip -9.
 * Size stats enwiki: 99.9% of all articles are < 64k
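
The two concatenation experiments above can be approximated with Python's standard library. This is a rough sketch, not the procedure used for the listings: the page content is synthetic pseudo-text produced by a hypothetical `make_page` helper, LZ4 is left out because it is not in the standard library, and the absolute sizes will differ from the numbers above. Only the trend should carry over: a small page repeats within every compressor's window, while a page much larger than Deflate's window only deduplicates under lzma.

```python
import bz2
import lzma
import random
import zlib


def make_page(size, seed=0):
    """Build deterministic pseudo-text of exactly `size` bytes (stand-in for a page)."""
    rng = random.Random(seed)
    letters = "abcdefghijklmnopqrstuvwxyz"
    vocab = ["".join(rng.choice(letters) for _ in range(rng.randint(2, 10)))
             for _ in range(5000)]
    words, length = [], 0
    while length <= size:
        word = rng.choice(vocab)
        words.append(word)
        length += len(word) + 1
    return " ".join(words).encode("ascii")[:size]


def compressed_sizes(page, copies):
    """Concatenate `copies` copies of `page` and compress the result three ways."""
    data = page * copies
    return {
        "raw": len(data),
        "zlib-9": len(zlib.compress(data, 9)),  # Deflate, 32k window
        "bz2": len(bz2.compress(data)),         # block-sorting, up to 900k blocks
        "lzma": len(lzma.compress(data)),       # dictionary of several MB by default
    }


if __name__ == "__main__":
    # Page sizes mirror the two listings above; fewer copies of the large page
    # are used here to keep the run short.
    for label, size, copies in [("large", 143_000, 20), ("small", 7_000, 100)]:
        print(label, compressed_sizes(make_page(size), copies))
```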

Cassandra compression
I benchmarked different compression algorithms and compression block sizes in Cassandra. Deflate (gzip) does best, as it recognizes repetitions within its 32k sliding window. Many copies of the same article therefore compress really well as long as the article is smaller than about 30k. LZ4 and Snappy both process fixed 64k blocks at a time, so they don't find many repetitions in typical article sizes.

20 * 5k articles any size (93M*20), Deflate, 256k block: 488MB (26%)
20 * 5k articles < 30k (39M*20), Deflate, 256k block: 48MB (6.2%)
20 * 10k articles < 10k (23M*20), Deflate, 256k block: 26MB (5.6%)

Tests were done by inserting the first 5k/10k articles from an enwiki dump into a Cassandra table with this layout:

CREATE TABLE revisions (
    name text,
    prop text,
    id timeuuid,
    value blob,
    PRIMARY KEY (name, prop, id)
) WITH compression = { 'sstable_compression' : 'DeflateCompressor', 'chunk_length_kb' : 256 };
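
The effect of article size on Deflate with a 256k chunk can be illustrated outside Cassandra. The sketch below assumes that each chunk of `chunk_length_kb` is compressed independently and that 20 revisions of the same article end up stored back to back. It uses synthetic pseudo-text instead of the enwiki dump, and `make_page`, `chunked_deflate_size` and `ratio` are hypothetical helpers, so only the relative difference between small and large articles is meaningful.

```python
import random
import zlib


def make_page(size, seed=0):
    """Build deterministic pseudo-text of exactly `size` bytes (stand-in for an article)."""
    rng = random.Random(seed)
    letters = "abcdefghijklmnopqrstuvwxyz"
    vocab = ["".join(rng.choice(letters) for _ in range(rng.randint(2, 10)))
             for _ in range(5000)]
    words, length = [], 0
    while length <= size:
        word = rng.choice(vocab)
        words.append(word)
        length += len(word) + 1
    return " ".join(words).encode("ascii")[:size]


def chunked_deflate_size(data, chunk_kb=256):
    """Total size when each fixed-size chunk is Deflate-compressed on its own."""
    chunk = chunk_kb * 1024
    return sum(len(zlib.compress(data[i:i + chunk], 9))
               for i in range(0, len(data), chunk))


def ratio(article_size, copies=20, chunk_kb=256):
    """Compressed/raw ratio for `copies` identical revisions stored back to back."""
    data = make_page(article_size) * copies
    return chunked_deflate_size(data, chunk_kb) / len(data)


if __name__ == "__main__":
    # A 10k article repeats within Deflate's 32k window; a 100k article does not.
    for size in (10_000, 100_000):
        print(size, round(ratio(size), 3))
```

Real revisions differ slightly from each other, so this overstates the gain for small articles, but it shows why the results above improve so sharply once articles over 30k are excluded.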

Alternatives

 * Swift: A bit hacky conceptually. Lacks clustering / compression features. Was not the most reliable when used for thumbnails.
 * Riak: Similar to Cassandra. Does not offer clustering and compression. Reportedly less mature and slower. Smaller community. No cross-datacenter replication in open source edition.

Related REST storage interfaces

 * Amazon S3
 * Swift
 * CouchDB - underscore prefix for private resources