User:GWicke/Notes/Storage

See bug 48483.

A lot of our quarterly goals depend on storage of metadata. Additionally, mobile and others threaten to actually use our content, which makes it important to avoid cold caches.

Idea:
 * Create REST revision storage web service and use it (initially) for Parsoid HTML/metadata storage
 * Each revision has multiple timestamped parts:
   * html and JSON page properties ('meta'), one timestamped version per re-render. Lets us retrieve a page as it looked at time X.
   * a single wikitext per revision
   * JSON Parsoid round-trip info
   * arbitrary metadata added by extensions (blame maps, annotations etc.)

Strawman backend API:

 * GET /enwiki/page/Main_Page?rev/latest -- redirects to /enwiki/latest/html; cached
 * GET /enwiki/page/Main_Page?rev/latest/html -- returns latest html; cached & purged
 * GET /enwiki/page/Main_Page?rev/latest/ -- lists properties of the latest revision; cached & purged
 * GET /enwiki/page/Main_Page?rev/ -- lists revisions, latest first
 * GET /enwiki/page/Main_Page?rev/12345/ -- lists properties with timeuuids of a specific revision
 * GET /enwiki/page/Main_Page?rev/12345/html -- redirects to the latest html timeuuid URL
 * GET /enwiki/page/Main_Page?rev/2013-02-23T22:23:24Z/html -- finds the revision as it was at time X; not cacheable; redirects to a timeuuid URL like the next entry
 * GET /enwiki/page/Main_Page?rev/8f545ba0-2601-11e3-885c-4160918f0fb9/html -- stable revision snapshot identified by a Type 1 UUID; immutable apart from HTML spec updates

Assumptions:
 * A separate table is used to record a mapping from key name to content-type, update policy etc. If a key is not defined there, conservative defaults like application/octet-stream will be used.
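The snapshot URLs use Type 1 (time-based) UUIDs, which embed their creation time; a "revision as of time X" lookup can therefore be resolved by comparing the requested timestamp against stored timeuuids. A minimal Python sketch of the timestamp extraction (the helper name is hypothetical):

```python
import uuid
from datetime import datetime, timezone

# Offset between the UUID epoch (1582-10-15) and the Unix epoch,
# expressed in 100-nanosecond intervals.
UUID_EPOCH_OFFSET = 0x01b21dd213814000

def timeuuid_to_datetime(u):
    """Extract the embedded timestamp from a Type 1 (time-based) UUID."""
    assert u.version == 1, "not a time-based UUID"
    unix_100ns = u.time - UUID_EPOCH_OFFSET
    return datetime.fromtimestamp(unix_100ns / 1e7, tz=timezone.utc)

# A freshly generated timeuuid embeds the current time:
u = uuid.uuid1()
print(u, timeuuid_to_datetime(u).isoformat())
```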

Editing:
 * Atomically create a new revision with several properties.
 * Required post vars with example values:
 * _parent=Main_Page?rev/12344 : The parent revision. Returned as an x-parent header with regular GET requests, and as part of the JSON history info returned at /enwiki/page/Main_Page?rev/.
 * _rev=12345 : The new revision. Returned as a header with regular GET requests. Also part of the history info.
 * Optional post vars:
 * _timestamp=2013-09-25T09:43:09Z : Timestamp to use in timeuuid generation. Needed to import old revisions; should require special rights. Normal updates should use the current time.
 * Typical property post vars:
 * html, wikitext : The html and wikitext of this revision
 * meta : The page metadata, JSON-encoded. Language links, categories etc. Divided into static (in page content) and dynamic parts (template-generated, can change on re-expansion).
 * Returns : JSON status with new timeuuid on success, JSON error message otherwise. Implicitly purges caches.
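A hedged client-side sketch of the atomic "create revision" POST described above, using only the Python stdlib. The endpoint URL is illustrative; the post var names (_parent, _rev, html, wikitext, meta) follow these notes. The request is only constructed, not sent:

```python
import json
from urllib import request, parse

# Post vars for an atomic revision creation (example values from these notes).
fields = {
    '_parent': 'Main_Page?rev/12344',   # parent revision
    '_rev': '12345',                    # new revision
    'html': '<html>...</html>',
    'wikitext': "'''Main Page'''",
    'meta': json.dumps({'categories': [], 'langlinks': []}),
}

req = request.Request(
    'http://storage.svc.local/enwiki/page/Main_Page',  # hypothetical endpoint
    data=parse.urlencode(fields).encode('utf-8'),
    method='POST',
)
# On success the service would reply with a JSON status containing the
# new timeuuid; on failure, a JSON error message.
print(req.get_method(), req.full_url)
```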


 * Update an existing revision with new properties. A new timeuuid will be generated.
 * Typical property post vars:
 * html, wikitext : The html and wikitext of this revision
 * meta : The page metadata, JSON-encoded. Language links, categories etc. Divided into static (in page content) and dynamic parts (template-generated, can change on re-expansion).
 * Returns : JSON status with new timeuuid on success, JSON error message otherwise. Implicitly purges caches.


 * Update or insert properties with a given timeuuid. This timeuuid has to exist already. An error is returned otherwise.
 * Typical property post vars:
 * html, wikitext : The html and wikitext of this revision
 * meta : The page metadata, JSON-encoded. Language links, categories etc. Divided into static (in page content) and dynamic parts (template-generated, can change on re-expansion).
 * Returns : JSON status indicating update / creation for keys or JSON error message when trying to edit immutable props. Implicitly purges caches.

The read-only content API can be made publicly available as:

 * GET /wiki/Main_Page?rev/latest/html -- returns latest html, purged on new revision / re-render

Cassandra backend
Distributed storage with support for indexes, CAS, clustering and transparent compression. Avoids IO hot spots (a problem in the ExternalStore sharding scheme).

Idea: Use this for revision storage, with a simple node storage service front-end. Easier to implement than trying to build a frontend for ExternalStore, provides testing for possible wider use.


 * helenus Nodejs bindings
   * escaping issues (got a local patch)

History compression
History compression used to be more efficient when pages on Wikipedia were still smaller than the compression algorithm's window size (typically 64k): see meta:History_compression.

 -rw-r--r-- 1 gabriel gabriel 143K Sep 23 14:00 /tmp/Atheism.txt
 -rw-r--r-- 1 gabriel gabriel  14M Sep 23 14:01 /tmp/Atheism-100.txt
 -rw-r--r-- 1 gabriel gabriel 7.8M Sep 23 14:29 /tmp/Atheism-100.txt.lz4
 -rw-r--r-- 1 gabriel gabriel 5.0M Sep 23 14:02 /tmp/Atheism-100.txt.gzip9
 -rw-r--r-- 1 gabriel gabriel 1.3M Sep 23 14:01 /tmp/Atheism-100.txt.bz2
 -rw-r--r-- 1 gabriel gabriel  49K Sep 23 14:05 /tmp/Atheism-100.txt.lzma
 * -100 is 100 concatenations of the single file.
 * This first example is a page larger than the typical 64k compression window.
 * Only lzma, with its large window, fully picks up the repetition.
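The window-size effect can be reproduced with Python's stdlib compressors. This is a synthetic stand-in for the Atheism example: incompressible random data is used so that only cross-copy repetition matters, with copies spaced further apart than deflate's 32k window but well within lzma's multi-megabyte dictionary:

```python
import random
import zlib
import lzma

# 50 concatenated copies of the same ~50k blob. Deflate's 32k sliding
# window cannot reach back to the previous copy; lzma's much larger
# dictionary can, so it deduplicates the repetition almost entirely.
random.seed(42)
base = bytes(random.getrandbits(8) for _ in range(50_000))
data = base * 50

deflated = zlib.compress(data, 9)
lzma_out = lzma.compress(data)
print(len(data), len(deflated), len(lzma_out))
```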

 -rw-r--r-- 1 gabriel gabriel 7.0K Sep 23 14:16 /tmp/Storage.html
 -rw-r--r-- 1 gabriel gabriel 699K Sep 23 14:16 /tmp/Storage-100.html
 -rw-r--r-- 1 gabriel gabriel 6.8K Sep 23 14:17 /tmp/Storage-100.html.gz
 -rw-r--r-- 1 gabriel gabriel 5.7K Sep 23 14:29 /tmp/Storage-100.html.lz4
 -rw-r--r-- 1 gabriel gabriel 4.9K Sep 23 14:16 /tmp/Storage-100.html.bz2
 -rw-r--r-- 1 gabriel gabriel 2.2K Sep 23 14:18 /tmp/Storage-100.html.lzma
 * This is a small (more typical) 7k page, this time as HTML.
 * Compression works well with all algorithms.
 * LZ4 (fast, and the default in Cassandra) outperforms gzip -9.
 * Size stats for enwiki: 99.9% of all articles are < 64k.

Cassandra compression
Cassandra stores compound primary key data as so-called wide rows. With the right schema this means that the text column of a given article will be compressed and stored sequentially. I benchmarked different compression algorithms and compression input block sizes. Of the compression algorithms available in Cassandra, Deflate (gzip) does best, as it recognizes repetitions within its 32k sliding window; many copies of the same article compress really well as long as the article is smaller than about 32k. LZ4 and Snappy both process fixed 64k blocks at a time, so they don't find many repetitions at typical (19k in this sample) article sizes.

 * 20 * 5k articles, any size (93M*20), Deflate, 256k block: 488MB (26%)
 * 20 * 5k articles < 30k (39M*20), Deflate, 256k block: 48MB (6.2%)
 * 20 * 10k articles < 10k (23M*20), Deflate, 256k block: 26MB (5.6%)

Adding LZMA compression would yield much better results for large articles and HTML at the price of ~5x slower compression. For highly compressible content like article revisions, decompression will likely be about as fast as deflate or even faster, as the compressed input is much smaller. Slower compression might also be fine, as our write rates are low and the CPU load would be evenly spread across all Cassandra boxes. With deflate, on my puny dual-core laptop, insertion rates are routinely in the 500 rev/second range, mostly CPU-bound in the node client. We currently have less than 50 edits / re-renders per second in the cluster.

Tests were done by inserting the first 5k/10k articles from an enwiki dump into a Cassandra table with this layout:

 CREATE TABLE revisions (
     name text,
     prop text,
     id timeuuid,
     value blob,
     PRIMARY KEY (name, prop, id)
 ) WITH compression = { 'sstable_compression' : 'DeflateCompressor', 'chunk_length_kb' : 256 };

The Cassandra db file sizes were then summed up.

Alternatives
The stateless storage API frontend will abstract the backend, so we are free to swap that out. Some backend contenders to consider:


 * HBase: DB layer on top of HDFS. Strongly consistent, less available. HDFS is hierarchical (naming nodes) rather than a flat DHT. Possibly the strongest alternative to Cassandra.
 * Swift: A bit hacky conceptually. Lacks clustering / compression features. Had some issues in the thumb storage application.
 * Riak: Similar to Cassandra. Does not offer clustering or compression. Reportedly less mature and slower. Smaller community. No cross-datacenter replication in the open source edition.

Related REST storage interfaces

 * Amazon S3
 * Swift
 * CouchDB - underscore prefix for private resources