User:GWicke/Notes/Storage

See bug 48483 and the prototype implementation at https://github.com/gwicke/storoid.

A lot of our quarterly goals depend on storage of metadata. Additionally, mobile and others threaten to actually use our HTML at a larger scale which makes it important to avoid cold caches.

Goals

 * Storage backend abstraction -- let storage specialists optimize storage, free others from having to deal with it.
 * External content API -- provide an efficient API to retrieve content from the mobile app, ESI, bots etc.
 * Caching -- important for performance and scaleability of a content API. No random query parameter URLs that cannot be purged.
 * Consistency -- use the same URL schema externally and internally. Return the same content internally and externally. Consistently use relative links.
 * Extensibility -- provide extension points for additional content-like entry points like listings.
 * Simple rewriting and handler registration -- use URL patterns that can be used for handler selection both in PHP and for directly proxying an entry point to a backend service in Varnish.

Idea: Storage web service

 * Create REST revision storage web service and use it (initially) for Parsoid HTML/metadata storage
 * Each revision has multiple timestamped parts
 * html and JSON page properties ('meta'), one timestamped version per re-render. Lets us retrieve a page as it looked like at time X.
 * single wikitext per revision
 * JSON parsoid round-trip info
 * arbitrary metadata added by extensions (blame maps, annotations etc)

Resource / URL layout considerations

 * Our page names have established URLs . The content part of our API defines sub-resources, which should be reflected in the URL layout.
 * Relative links should work. Links are relative to the perceived path and page names can contain slashes, so relative links in a page called  will be prefixed with ../../ . Relative links should also work both internally and externally (with differing path prefixes), and both in a sub-resource and the page itself. This basically means that we need to use a query string.
 * Query strings should be deterministic so we can purge URLs. This means that there should be exactly one query parameter. Options considered are:
 * : Sounds odd, as the key does not really match the sub-resource on the right. Only the 'latest' part is really a revision, html is the part of it.  would avoid that, but is much longer.
 * : Looks more path-y, but is longer and more noisy.
 * : Short and does not induce strange meaning like key=value. The path is a bit more broken up than the second option, but looks natural and less noisy for people used to query strings.

Strawman backend API
GET /enwiki/page/Main_Page?rev/latest -- redirect to /enwiki/latest/html, cached GET /enwiki/page/Main_Page?rev/latest/html -- returns latest html, cached & purged GET /enwiki/page/Main_Page?rev/latest/ -- list properties of latest revision, cached & purged GET /enwiki/page/Main_Page?rev/ -- list revisions, latest first GET /enwiki/page/Main_Page?rev/12345/ -- list properties with timeuuids of specific revision GET /enwiki/page/Main_Page?rev/12345/html -- redirects to latest html timeuuid URL GET /enwiki/page/Main_Page?rev/2013-02-23T22:23:24Z/html find revision as it was at time X, not cacheable, redirects to GET /enwiki/page/Main_Page?rev/8f545ba0-2601-11e3-885c-4160918f0fb9/html stable revision snapshot identified by Type 1 UUID. Immutable apart from HTML spec updates. Assumptions:
 * A separate table is used to record a mapping from key name to content-type, update policy etc. If a key is not defined there, conservative defaults like application/octet-stream will be used.

See this background on UUIDs. Type 1 UUIDs ('timeuuid') encode a timestamp plus some random bits. Cassandra stores them in timestamp order, which lets us query these UUIDs by time. Other backends have similar time-based UUID representations.

Editing:
 * Atomically create a new revision with several properties.
 * Required post vars with example values:
 * _parent=Main_Page?rev/12344 : The parent revision. Returned as x-parent header with regular GET requests, and part of JSON returned at /enwiki/page/Main_Page?rev/ history info.
 * _rev=12345 : The new revision. Returned as  header with regular GET requests. Also part of history info.
 * Optional post vars:
 * _timestamp= 2013-09-25T09:43:09Z : Timestamp to use in timeuuid generation. Needed to import old revisions. Should require special rights. Normal updates should use the current time.
 * Typical property post vars:
 * html, wikitext : The html and wikitext of this revision
 * meta : The page metadata, JSON-encoded. Language links, categories etc. Divided into static (in page content) and dynamic parts (template-generated, can change on re-expansion).
 * Returns : JSON status with new timeuuid on success, JSON error message otherwise. Implicitly purges caches.
 * Returns : JSON status with new timeuuid on success, JSON error message otherwise. Implicitly purges caches.


 * Insert new (versions of) properties for the given timeuuid base revision. A new timeuuid will be generated.
 * Typical property post vars:
 * html, wikitext : The html and wikitext of this revision
 * meta : The page metadata, JSON-encoded. Language links, categories etc. Divided into static (in page content) and dynamic parts (template-generated, can change on re-expansion).
 * Returns : JSON status with new timeuuid on success or JSON error message otherwise. Implicitly purges caches.
 * Returns : JSON status with new timeuuid on success or JSON error message otherwise. Implicitly purges caches.


 * Insert new (versions of) properties for the given revid base revision. Alternative form for the timeuuid-based update above. A new timeuuid will be generated.
 * Typical property post vars:
 * html, wikitext : The html and wikitext of this revision
 * meta : The page metadata, JSON-encoded. Language links, categories etc. Divided into static (in page content) and dynamic parts (template-generated, can change on re-expansion).
 * Returns : JSON status with new timeuuid on success, JSON error message otherwise. Implicitly purges caches.
 * Returns : JSON status with new timeuuid on success, JSON error message otherwise. Implicitly purges caches.


 * Destructively update a versioned property. This property needs to exist. Example use case: update HTML to the latest DOM spec. Requires elevated rights.
 * Returns : JSON status on success, JSON error message otherwise. Implicitly purges caches.
 * Returns : JSON status on success, JSON error message otherwise. Implicitly purges caches.

Strawman front-end API
Following the goal of using the same URL schema internally and externally, the content API can be made publicly available as: GET /wiki/Main_Page?rev/latest/html -- returns latest html, purged on new revision / re-render

Relative links in JSON listings and HTML will work independent of the prefix, and existing page URLs are used as the base for subresources.

Key/value store without versioning
This addresses a similar use case as the DataStore RFC. It is not terribly important for the immediate Parsoid needs, but can easily be added in the storage front-end.

// Use bucket prefixes so that each extension has its own namespace: /math-png and /views in this example. PUT /enwiki/blob/math-png/96d719730559f4399cf1ddc2ba973bbd.png GET /enwiki/blob/math-png/96d719730559f4399cf1ddc2ba973bbd.png GET /enwiki/blob/math-png/ -- returns a JSON list of 50 or so entries in random order, plus a paging URL

// ordered bucket type with efficient range / inexact match queries PUT /enwiki/ordered-blob/timeseries/2012-03-12T22:30:23.56Z-something // get a list of entries matching an inequality GET /enwiki/ordered-blob/timeseries/?lt=2012-04&limit=1 // range query GET /enwiki/ordered-blob/timeseries/?gt=2012-02&lt=2012-04&limit=50

// several backends handle counters specially GET /enwiki/counter/views/pages POST /enwiki/counter/views/pages // with post parameter increment=1

Notes:
 * Access rights and content-types can be configured by bucket. Entries in public buckets are directly accessible to users able to read regular page content through the public web api:
 * Paging through all keys in a bucket is possible with most backends, but is not terribly efficient.
 * The ordered-blob type can be implemented with secondary indexes or backend-maintained extra index tables.

A generic web service PHP API can use URLs directly:


 * The generic store has a registry of prefixes to storage backends. Example for :

Backends can be HTTP-based, but can also be transparently mapped to local PHP classes or other protocols. The path-based namespace presents a simple and unified API and avoids the need for many different classes. New backend features or backends can be added without a need to upgrade the client bindings. Bulk operations can be performed across several backends.

Post request encoding

 * default for forms
 * inefficient for binary data (30% base64 overhead)


 * supported in all browsers
 * efficient for binary data
 * no per-part headers apart from the value name


 * not natively supported for form submissions
 * efficient for binary
 * can use content headers per part, full mime

Likely best: Support all three, and use default values when content-type etc headers are absent.

Format selection / API versioning via Accept header
For now we are mainly interested in API versioning, as we only plan to support JSON for listings and other dynamic content. Other resources have a single content-type too. The issue with using the Accept header is caching. We'll have to vary on Accept, but would also like to avoid fragmentation for browser requests.

Possible solution:
 * strip / normalize Accept headers that don't contain a single type that matches vnd.org.mediawiki
 * these will receive the default type and latest API version
 * Forward all other Accept values to the backend

Storage backend
We are looking for a solution that
 * is scaleable and reliable: replicated, good load balancing
 * compresses consecutive revisions of the same data. Storing HTML after each re-render would be very expensive otherwise.
 * can answer simple queries like 'page X at time Y' efficiently (clustering, limited range queries)
 * is easy to work with and bind to
 * causes little work for ops

Cassandra
Distributed storage with support for indexes, CAS and clustering / transparent compression. Avoids hot spots for IO (problem in ExternalStore sharding scheme).

Idea: Use this for revision storage, with a simple node storage service front-end. Easier to implement than trying to build a frontend for ExternalStore, provides testing for possible wider use.


 * Prototype API implementation: https://github.com/gwicke/storoid
 * Current schema: cassandra-revisions.cql

History compression
It used to be more efficient when pages on Wikipedia were still smaller than the (typically 64k) compression algorithm window size: meta:History_compression.

-rw-r--r-- 1 gabriel gabriel 143K Sep 23 14:00 /tmp/Atheism.txt -rw-r--r-- 1 gabriel gabriel 14M Sep 23 14:01 /tmp/Atheism-100.txt -rw-r--r-- 1 gabriel gabriel 7.8M Sep 23 14:29 /tmp/Atheism-100.txt.lz4 -rw-r--r-- 1 gabriel gabriel 5.0M Sep 23 14:02 /tmp/Atheism-100.txt.gzip9 -rw-r--r-- 1 gabriel gabriel 1.3M Sep 23 14:01 /tmp/Atheism-100.txt.bz2 -rw-r--r-- 1 gabriel gabriel 49K Sep 23 14:05 /tmp/Atheism-100.txt.lzma
 * 1) -100 is 100 concatenations of the single file.
 * 2) First a page larger than the typical 64k compression window.
 * 3) Only lzma fully picks up the repetition with its large window.

-rw-r--r-- 1 gabriel gabriel 7.0K Sep 23 14:16 /tmp/Storage.html -rw-r--r-- 1 gabriel gabriel 699K Sep 23 14:16 /tmp/Storage-100.html -rw-r--r-- 1 gabriel gabriel 6.8K Sep 23 14:17 /tmp/Storage-100.html.gz -rw-r--r-- 1 gabriel gabriel 5.7K Sep 23 14:29 /tmp/Storage-100.html.lz4 -rw-r--r-- 1 gabriel gabriel 4.9K Sep 23 14:16 /tmp/Storage-100.html.bz2 -rw-r--r-- 1 gabriel gabriel 2.2K Sep 23 14:18 /tmp/Storage-100.html.lzma
 * 1) Now a small (more typical) 7k page, this time as HTML.
 * 2) Compression works well using all algorithms.
 * 3) LZ4 (fast and default in Cassandra) outperforms gzip -9.
 * Size stats enwiki: 99.9% of all articles are < 64k

Cassandra compression
Cassandra stores compound primary key data as so-called wide rows. With the right schema this means that the text column of a given article will be compressed and stored sequentially. I benchmarked different compression algorithms and compression input block sizes. Of the compression algorithms available in Cassandra, Deflate (gzip) does best as it recognizes repetitions within its 32k sliding window, which means that many copies of the same article compress really well as long as it is smaller than about 32k. LZ4 and Snappy both process fixed 64k blocks at a time, so don't find many repetitions for typical (19k in this sample) article sizes. 20 * 5k articles any size (93M*20), Deflate, 256k block: 488MB (26%) 20 * 5k articles < 30k (39M*20), Deflate, 256k block: 48MB (6.2%) 20 * 10k articles < 10k (23M*20), Deflate, 256k block: 26MB (5.6%)

Adding LZMA compression would result in much better results for large articles and HTML at the price of (~5x ) slower compression. For highly compressible content like article revisions, decompression will likely be about as fast or even faster than deflate as the compressed input is much smaller. Slower compression might also be fine as our write rates are low and the CPU load would be evenly spread across all Cassandra boxes. With deflate on my puny dual-core laptop insertion rates are routinely in the 500 rev/second range, mostly CPU-bound in the node client. We currently have less than 50 edits / re-renders per second in the cluster.

Tests were done by inserting the first 5k/10k articles from an enwiki dump into a cassandra table with this layout: CREATE TABLE revisions (      name text,       prop text,       id timeuuid,       value blob,       PRIMARY KEY (name, prop, id) ) WITH compression = { 'sstable_compression' : 'DeflateCompressor', 'chunk_length_kb' : 256 };

The Cassandra db file size was then summed up using.

See also User:GWicke/Notes/Cassandra.

Alternatives
The stateless storage API frontend will abstract the backend, so we are free to swap that out. Some backend contenders to consider:


 * HBase: DB layer on top of HDFS. Strongly consistent, less available. No per-request consistency selection. The underlying HDFS is hierarchical (naming nodes) rather than a flat DHT. Possibly the strongest alternative to Cassandra.
 * Swift: A bit hacky, not symmetrical. Lacks clustering / compression features. Had some issues in the thumb storage application.
 * Riak: Similar to Cassandra. Does not offer clustering and compression. Reportedly less mature and slower. Smaller community. No cross-datacenter replication in open source edition.

Related REST storage interfaces

 * Amazon S3
 * Swift
 * couchDB - underscore prefix for private resources
 * Sun cloud API
 * Google Apps
 * Google drive
 * Facebook APIs

Authentication

 * MediaWiki OAuth 1 implementation
 * OAuth2 / OpenID connect with JWT bearer tokens. Self-describing signed claims, cheap to verify per request without DB access.
 * Google authentication -- also supports SAML
 * PayPal
 * Most of the OAuth2 providers in en:OAuth -- includes Microsoft, Facebook etc