User:GWicke/Notes/Storage

See bug 48483.

Plan:
 * Create REST revision storage web service and use it (initially) for Parsoid HTML/metadata storage
 * Each revision has multiple timestamped parts
 * html and JSON page properties, one timestamped version per re-render. Lets us retrieve a page as it looked like at time X.
 * single wikitext per revision
 * JSON parsoid round-trip info
 * arbitrary metadata added by extensions (blame maps, annotations etc)

Strawman backend API: /*/jobs/ /*/user/ /en/page/Main_Page/latest/html -- returns latest html /en/page/Main_Page/latest/ -- list properties of latest revision /en/page/Main_Page/12345/ -- list properties of specific revision /en/page/Main_Page/12345/html -- returns latest HTML /en/page/Main_Page/2013-02-23t22:23:24z/html find revision as it was at time X, not cacheable, redirects to /en/page/Main_Page/8f545ba0-2601-11e3-885c-4160918f0fb9/html stable revision snapshot identified by Type 1 UUID. Immutable apart from HTML spec updates.

This can be made publicly available as: /wiki/Main_Page?prop=latest/html -- returns latest html, purged on new revision / re-render

Cassandra backend
Distributed storage with support for indexes, CAS and clustering / transparent compression. Avoids hot spots for IO (problem in ExternalStore sharding scheme).

Idea: Use this for revision storage, with a simple node storage service front-end. Easier to implement than trying to build a frontend for ExternalStore, provides testing for possible wider use.


 * helenus Nodejs bindings

History compression
It used to be more efficient when pages on Wikipedia were still smaller than the (typically 64k) compression algorithm window size: meta:History_compression.

-rw-r--r-- 1 gabriel gabriel 143K Sep 23 14:00 /tmp/Atheism.txt -rw-r--r-- 1 gabriel gabriel 14M Sep 23 14:01 /tmp/Atheism-100.txt -rw-r--r-- 1 gabriel gabriel 7.8M Sep 23 14:29 /tmp/Atheism-100.txt.lz4 -rw-r--r-- 1 gabriel gabriel 5.0M Sep 23 14:02 /tmp/Atheism-100.txt.gzip9 -rw-r--r-- 1 gabriel gabriel 1.3M Sep 23 14:01 /tmp/Atheism-100.txt.bz2 -rw-r--r-- 1 gabriel gabriel 49K Sep 23 14:05 /tmp/Atheism-100.txt.lzma
 * 1) -100 is 100 concatenations of the single file.
 * 2) First a page larger than the typical 64k compression window.
 * 3) Only lzma fully picks up the repetition with its large window.

-rw-r--r-- 1 gabriel gabriel 7.0K Sep 23 14:16 /tmp/Storage.html -rw-r--r-- 1 gabriel gabriel 699K Sep 23 14:16 /tmp/Storage-100.html -rw-r--r-- 1 gabriel gabriel 6.8K Sep 23 14:17 /tmp/Storage-100.html.gz -rw-r--r-- 1 gabriel gabriel 5.7K Sep 23 14:29 /tmp/Storage-100.html.lz4 -rw-r--r-- 1 gabriel gabriel 4.9K Sep 23 14:16 /tmp/Storage-100.html.bz2 -rw-r--r-- 1 gabriel gabriel 2.2K Sep 23 14:18 /tmp/Storage-100.html.lzma
 * 1) Now a small (more typical) 7k page, this time as HTML.
 * 2) Compression works well using all algorithms.
 * 3) LZ4 (fast and default in Cassandra) outperforms gzip -9.
 * Size stats enwiki: 99.9% of all articles are < 64k

Cassandra compression
Cassandra stores compound primary key data as so-called wide rows. With the right schema this means that the text column of a given article will be compressed and stored sequentially. I benchmarked different compression algorithms and compression input block sizes. Of the compression algorithms available in Cassandra, Deflate (gzip) does best as it recognizes repetitions within its 32k sliding window, which means that many copies of the same article compress really well as long as it is smaller than about 32k. LZ4 and Snappy both process fixed 64k blocks at a time, so don't find many repetitions for typical (19k in this sample) article sizes. 20 * 5k articles any size (93M*20), Deflate, 256k block: 488MB (26%) 20 * 5k articles < 30k (39M*20), Deflate, 256k block: 48MB (6.2%) 20 * 10k articles < 10k (23M*20), Deflate, 256k block: 26MB (5.6%)

Adding LZMA compression would result in much better results for large articles and HTML at the price of (~5x) slower compression. For highly compressible content like article revisions, decompression will likely be about as fast or even faster than deflate as the compressed input is much smaller. Slower compression might also be fine as our write rates are low and the CPU load would be evenly spread across all Cassandra boxes. With deflate on my puny dual-core laptop insertion rates are routinely in the 500 rev/second range, mostly CPU-bound in the node client. We currently have less than 50 edits / re-renders per second in the cluster.

Tests were done by inserting the first 5k/10k articles from an enwiki dump into a cassandra table with this layout: CREATE TABLE revisions (      name text,       prop text,       id timeuuid,       value blob,       PRIMARY KEY (name, prop, id) ) WITH compression = { 'sstable_compression' : 'DeflateCompressor', 'chunk_length_kb' : 256 };

The Cassandra db file size was then summed up using.

Alternatives
The stateless storage API frontend will abstract the backend, so we are free to swap that out. Some backend contenders to consider:


 * HBase: DB layer on top of HDFS. Strongly consistent, less available. HDFS is hierarchical (naming nodes) rather than a flat DHT. Possibly the strongest alternative to Cassandra.
 * Swift: A bit hacky conceptually. Lacks clustering / compression features. Had some issues in the thumb storage application.
 * Riak: Similar to Cassandra. Does not offer clustering and compression. Reportedly less mature and slower. Smaller community. No cross-datacenter replication in open source edition.

Related REST storage interfaces

 * Amazon S3
 * Swift
 * couchDB - underscore prefix for private resources