User:GWicke/Notes/Storage

See bug 48483.

A lot of our quarterly goals depend on storage of metadata. Additionally, mobile and others threaten to actually use our HTML at a larger scale which makes it important to avoid cold caches.

Goals

 * Storage backend abstraction -- let storage specialists optimize storage, free others from having to deal with it.
 * External content API -- provide an efficient API to retrieve content from the mobile app, ESI, bots etc.
 * Caching -- important for performance and scaleability of a content API. No random query parameter URLs that cannot be purged.
 * Consistency -- use the same URL schema externally and internally. Return the same content internally and externally (relative links).
 * Extensibility -- provide extension points for additional content-like entry points like listings.
 * Simple rewriting and handler registration -- use URL patterns that can be used for handler selection both in PHP and for directly proxying an entry point to a backend service in Varnish.

Idea

 * Create REST revision storage web service and use it (initially) for Parsoid HTML/metadata storage
 * Each revision has multiple timestamped parts
 * html and JSON page properties ('meta'), one timestamped version per re-render. Lets us retrieve a page as it looked like at time X.
 * single wikitext per revision
 * JSON parsoid round-trip info
 * arbitrary metadata added by extensions (blame maps, annotations etc)

Strawman backend API
GET /enwiki/page/Main_Page?rev/latest -- redirect to /enwiki/latest/html, cached GET /enwiki/page/Main_Page?rev/latest/html -- returns latest html, cached & purged GET /enwiki/page/Main_Page?rev/latest/ -- list properties of latest revision, cached & purged GET /enwiki/page/Main_Page?rev/ -- list revisions, latest first GET /enwiki/page/Main_Page?rev/12345/ -- list properties with timeuuids of specific revision GET /enwiki/page/Main_Page?rev/12345/html -- redirects to latest html timeuuid URL GET /enwiki/page/Main_Page?rev/2013-02-23T22:23:24Z/html find revision as it was at time X, not cacheable, redirects to GET /enwiki/page/Main_Page?rev/8f545ba0-2601-11e3-885c-4160918f0fb9/html stable revision snapshot identified by Type 1 UUID. Immutable apart from HTML spec updates. Assumptions:
 * A separate table is used to record a mapping from key name to content-type, update policy etc. If a key is not defined there, conservative defaults like application/octet-stream will be used.

Editing:
 * Atomically create a new revision with several properties.
 * Required post vars with example values:
 * _parent=Main_Page?rev/12344 : The parent revision. Returned as x-parent header with regular GET requests, and part of JSON returned at /enwiki/page/Main_Page?rev/ history info.
 * _rev=12345 : The new revision. Returned as  header with regular GET requests. Also part of history info.
 * Optional post vars:
 * _timestamp= 2013-09-25T09:43:09Z : Timestamp to use in timeuuid generation. Needed to import old revisions. Should require special rights. Normal updates should use the current time.
 * Typical property post vars:
 * html, wikitext : The html and wikitext of this revision
 * meta : The page metadata, JSON-encoded. Language links, categories etc. Divided into static (in page content) and dynamic parts (template-generated, can change on re-expansion).
 * Returns : JSON status with new timeuuid on success, JSON error message otherwise. Implicitly purges caches.
 * Returns : JSON status with new timeuuid on success, JSON error message otherwise. Implicitly purges caches.


 * Update an existing revision with new properties. Example: re-expanded transclusions or images in HTML. A new timeuuid will be generated.
 * Typical property post vars:
 * html, wikitext : The html and wikitext of this revision
 * meta : The page metadata, JSON-encoded. Language links, categories etc. Divided into static (in page content) and dynamic parts (template-generated, can change on re-expansion).
 * Returns : JSON status with new timeuuid on success, JSON error message otherwise. Implicitly purges caches.
 * Returns : JSON status with new timeuuid on success, JSON error message otherwise. Implicitly purges caches.


 * Update or insert properties with a given timeuuid. This timeuuid has to exist already. An error is returned otherwise.
 * Typical property post vars:
 * html, wikitext : The html and wikitext of this revision
 * meta : The page metadata, JSON-encoded. Language links, categories etc. Divided into static (in page content) and dynamic parts (template-generated, can change on re-expansion).
 * Returns : JSON status indicating update / creation for keys or JSON error message when trying to edit immutable props. Implicitly purges caches.
 * Returns : JSON status indicating update / creation for keys or JSON error message when trying to edit immutable props. Implicitly purges caches.

Strawman front-end API
Following the goal of using the same URL schema internally and externally, the content API can be made publicly available as: GET /wiki/Main_Page?rev/latest/html -- returns latest html, purged on new revision / re-render

Relative links in JSON listings and HTML will work independent of the prefix, and existing page URLs are used as the base for subresources.

Storage backend
We are looking for a solution that
 * is scaleable and reliable: replicated, good load balancing
 * compresses consecutive revisions of the same data. Storing HTML after each re-render would be very expensive otherwise.
 * can answer simple queries like 'page X at time Y' efficiently (clustering, limited range queries)
 * is easy to work with and bind to
 * causes little work for ops

Cassandra
Distributed storage with support for indexes, CAS and clustering / transparent compression. Avoids hot spots for IO (problem in ExternalStore sharding scheme).

Idea: Use this for revision storage, with a simple node storage service front-end. Easier to implement than trying to build a frontend for ExternalStore, provides testing for possible wider use.


 * helenus Nodejs bindings
 * escaping issues (got a local patch)

Current schema: CREATE KEYSPACE "testdb"  WITH REPLICATION = {'class' : 'SimpleStrategy',  'replication_factor' : 3 }; use testdb; CREATE TABLE revisions (      name text,       prop text,       tid timeuuid,       revid text,    parentid text,    vversion int,    value blob,       PRIMARY KEY (name, prop, tid) ) WITH compression = { 'sstable_compression' : 'DeflateCompressor', 'chunk_length_kb': 256 } and CLUSTERING ORDER BY (prop asc, tid DESC);

// indexes create index on revisions(tid); create index on revisions(revid); create index on revisions(vversion);

History compression
It used to be more efficient when pages on Wikipedia were still smaller than the (typically 64k) compression algorithm window size: meta:History_compression.

-rw-r--r-- 1 gabriel gabriel 143K Sep 23 14:00 /tmp/Atheism.txt -rw-r--r-- 1 gabriel gabriel 14M Sep 23 14:01 /tmp/Atheism-100.txt -rw-r--r-- 1 gabriel gabriel 7.8M Sep 23 14:29 /tmp/Atheism-100.txt.lz4 -rw-r--r-- 1 gabriel gabriel 5.0M Sep 23 14:02 /tmp/Atheism-100.txt.gzip9 -rw-r--r-- 1 gabriel gabriel 1.3M Sep 23 14:01 /tmp/Atheism-100.txt.bz2 -rw-r--r-- 1 gabriel gabriel 49K Sep 23 14:05 /tmp/Atheism-100.txt.lzma
 * 1) -100 is 100 concatenations of the single file.
 * 2) First a page larger than the typical 64k compression window.
 * 3) Only lzma fully picks up the repetition with its large window.

-rw-r--r-- 1 gabriel gabriel 7.0K Sep 23 14:16 /tmp/Storage.html -rw-r--r-- 1 gabriel gabriel 699K Sep 23 14:16 /tmp/Storage-100.html -rw-r--r-- 1 gabriel gabriel 6.8K Sep 23 14:17 /tmp/Storage-100.html.gz -rw-r--r-- 1 gabriel gabriel 5.7K Sep 23 14:29 /tmp/Storage-100.html.lz4 -rw-r--r-- 1 gabriel gabriel 4.9K Sep 23 14:16 /tmp/Storage-100.html.bz2 -rw-r--r-- 1 gabriel gabriel 2.2K Sep 23 14:18 /tmp/Storage-100.html.lzma
 * 1) Now a small (more typical) 7k page, this time as HTML.
 * 2) Compression works well using all algorithms.
 * 3) LZ4 (fast and default in Cassandra) outperforms gzip -9.
 * Size stats enwiki: 99.9% of all articles are < 64k

Cassandra compression
Cassandra stores compound primary key data as so-called wide rows. With the right schema this means that the text column of a given article will be compressed and stored sequentially. I benchmarked different compression algorithms and compression input block sizes. Of the compression algorithms available in Cassandra, Deflate (gzip) does best as it recognizes repetitions within its 32k sliding window, which means that many copies of the same article compress really well as long as it is smaller than about 32k. LZ4 and Snappy both process fixed 64k blocks at a time, so don't find many repetitions for typical (19k in this sample) article sizes. 20 * 5k articles any size (93M*20), Deflate, 256k block: 488MB (26%) 20 * 5k articles < 30k (39M*20), Deflate, 256k block: 48MB (6.2%) 20 * 10k articles < 10k (23M*20), Deflate, 256k block: 26MB (5.6%)

Adding LZMA compression would result in much better results for large articles and HTML at the price of (~5x ) slower compression. For highly compressible content like article revisions, decompression will likely be about as fast or even faster than deflate as the compressed input is much smaller. Slower compression might also be fine as our write rates are low and the CPU load would be evenly spread across all Cassandra boxes. With deflate on my puny dual-core laptop insertion rates are routinely in the 500 rev/second range, mostly CPU-bound in the node client. We currently have less than 50 edits / re-renders per second in the cluster.

Tests were done by inserting the first 5k/10k articles from an enwiki dump into a cassandra table with this layout: CREATE TABLE revisions (      name text,       prop text,       id timeuuid,       value blob,       PRIMARY KEY (name, prop, id) ) WITH compression = { 'sstable_compression' : 'DeflateCompressor', 'chunk_length_kb' : 256 };

The Cassandra db file size was then summed up using.

Alternatives
The stateless storage API frontend will abstract the backend, so we are free to swap that out. Some backend contenders to consider:


 * HBase: DB layer on top of HDFS. Strongly consistent, less available. No per-request consistency selection. The underlying HDFS is hierarchical (naming nodes) rather than a flat DHT. Possibly the strongest alternative to Cassandra.
 * Swift: A bit hacky conceptually. Lacks clustering / compression features. Had some issues in the thumb storage application.
 * Riak: Similar to Cassandra. Does not offer clustering and compression. Reportedly less mature and slower. Smaller community. No cross-datacenter replication in open source edition.

Related REST storage interfaces

 * Amazon S3
 * Swift
 * couchDB - underscore prefix for private resources