Requests for comment/Storage service

Problem statement
In the short term, the Parsoid team needs a way to store revisioned HTML, wikitext and metadata efficiently and reliably. This information needs to be accessible through a (partly public) web API for efficient retrieval of revision data. The generality of a web interface is desirable to make this storage solution available to both node.js code like Parsoid and PHP code in MediaWiki.

As we argue in the service RFC it also seems be desirable to have more generalized storage interfaces available for MediaWiki in general. The revision storage interface needed for Parsoid can serve as a first example of how such a more general storage interface might look.

Goals

 * Storage backend abstraction -- let storage specialists optimize storage, free others from having to deal with it.
 * External content API -- provide an efficient API to retrieve content from the mobile app, ESI, bots etc.
 * API versioning -- enable evolution of APIs without breaking users immediately
 * Caching -- important for performance and scalability of a content API. No random query parameter URLs that cannot be purged.
 * Consistency -- use the same URL schema externally and internally. Return the same content internally and externally, and make links in content work in both contexts without rewriting.
 * Extensibility -- provide extension points for content handlers like Parsoid, on-demand metadata generation etc.
 * Support rewriting -- use URL patterns that support URL-based rewriting in something like Varnish.

Storage service interface

 * Create REST revision storage web service and use it (initially) for Parsoid HTML/metadata storage
 * Each revision has multiple timestamped parts
 * html and JSON page properties ('meta'), one timestamped version per re-render. Lets us retrieve a page as it looked like at time X.
 * single wikitext per revision
 * JSON parsoid round-trip info
 * arbitrary metadata added by extensions (blame maps, annotations etc)

Resource / URL layout considerations

 * Our page names have established URLs . The content part of our API defines sub-resources, which should be reflected in the URL layout.
 * Relative links should work. Links are relative to the perceived path and page names can contain slashes, so relative links in a page called  will be prefixed with ../../ . Relative links should also work both internally and externally (with differing path prefixes), and both in a sub-resource and the page itself. This basically means that we need to use a query string.
 * Query strings should be deterministic so we can purge URLs. This means that there should be exactly one query parameter. Options considered are:
 * : Sounds odd, as the key does not really match the sub-resource on the right. Only the 'latest' part is really a revision, html is the part of it.  would avoid that, but is much longer.
 * : Looks more path-y, but is longer and more noisy.
 * : Short and does not induce strange meaning like key=value. The path is a bit more broken up than the second option, but looks natural and less noisy for people used to query strings.

Strawman Rashomon API
GET /v1/enwiki/page/Main_Page?rev/latest -- redirect to /enwiki/latest/html, cached GET /v1/enwiki/page/Main_Page?rev/latest/html -- returns latest html, cached & purged GET /v1/enwiki/page/Main_Page?rev/latest/ -- list properties of latest revision, cached & purged GET /v1/enwiki/page/Main_Page?rev/ -- list revisions, latest first GET /v1/enwiki/page/Main_Page?rev/12345/ -- list properties with timeuuids of specific revision GET /v1/enwiki/page/Main_Page?rev/12345/html -- redirects to latest html timeuuid URL GET /v1/enwiki/page/Main_Page?rev/2013-02-23T22:23:24Z/html find revision as it was at time X, not cacheable, redirects to GET /v1/enwiki/page/Main_Page?rev/8f545ba0-2601-11e3-885c-4160918f0fb9/html stable revision snapshot identified by Type 1 UUID. Immutable apart from HTML spec updates. Assumptions:
 * A separate table is used to record a mapping from key name to content-type, update policy etc. If a key is not defined there, conservative defaults like application/octet-stream will be used.

See this background on UUIDs. Type 1 UUIDs ('timeuuid') encode a timestamp plus some random bits. Cassandra stores them in timestamp order, which lets us query these UUIDs by time. Other backends have similar time-based UUID representations.

Editing:
 * Atomically create a new revision with several properties.
 * Required post vars with example values:
 * _parent=Main_Page?rev/12344 : The parent revision. Returned as x-parent header with regular GET requests, and part of JSON returned at /enwiki/page/Main_Page?rev/ history info.
 * _rev=12345 : The new revision. Returned as  header with regular GET requests. Also part of history info.
 * Optional post vars:
 * _timestamp= 2013-09-25T09:43:09Z : Timestamp to use in timeuuid generation. Needed to import old revisions. Should require special rights. Normal updates should use the current time.
 * Typical property post vars:
 * html, wikitext, parsoid : The html, wikitext or parsoid information of this revision. All entries that are passed in are stored atomically.
 * meta : The page metadata, JSON-encoded. Language links, categories etc. Divided into static (in page content) and dynamic parts (template-generated, can change on re-expansion).
 * Returns : JSON status with new timeuuid on success, JSON error message otherwise. Implicitly purges caches.
 * Returns : JSON status with new timeuuid on success, JSON error message otherwise. Implicitly purges caches.


 * Insert new (versions of) properties for the given timeuuid base revision. A new timeuuid will be generated.
 * Typical property post vars:
 * html, wikitext : The html and wikitext of this revision
 * meta : The page metadata, JSON-encoded. Language links, categories etc. Divided into static (in page content) and dynamic parts (template-generated, can change on re-expansion).
 * Returns : JSON status with new timeuuid on success or JSON error message otherwise. Implicitly purges caches.
 * Returns : JSON status with new timeuuid on success or JSON error message otherwise. Implicitly purges caches.


 * Insert new (versions of) properties for the given revid base revision. Alternative form for the timeuuid-based update above. A new timeuuid will be generated.
 * Typical property post vars:
 * html, wikitext : The html and wikitext of this revision
 * meta : The page metadata, JSON-encoded. Language links, categories etc. Divided into static (in page content) and dynamic parts (template-generated, can change on re-expansion).
 * Returns : JSON status with new timeuuid on success, JSON error message otherwise. Implicitly purges caches.
 * Returns : JSON status with new timeuuid on success, JSON error message otherwise. Implicitly purges caches.


 * Destructively update a versioned property. This property needs to exist. Example use case: update HTML to the latest DOM spec. Requires elevated rights.
 * Returns : JSON status on success, JSON error message otherwise. Implicitly purges caches.
 * Returns : JSON status on success, JSON error message otherwise. Implicitly purges caches.

Strawman front-end API tunneled to Rashomon
Following the goal of using the same URL schema internally and externally, the content API can be made publicly available as: GET /wiki/Main_Page?rev/latest/html -- returns latest html, purged on new revision / re-render

Relative links in JSON listings and HTML will work independent of the prefix, and existing page URLs are used as the base for subresources.

Key/value store without versioning
This addresses a similar use case as the DataStore RFC. It is not terribly important for the immediate Parsoid needs, but can easily be added in the storage front-end. Each bucket has a name so that each extension can have its own namespace.

Create a simple blob bucket: PUT /v1/enwiki/math-png Content-type: application/vnd.org.mediawiki.bucket.1.0+json

{'type': 'blob'}

Get Bucket properties GET /v1/enwiki/math-png

Content-type: application/vnd.org.mediawiki.bucket.1.0+json

{'type': 'blob'}

Add an entry to a bucket: PUT /v1/enwiki/math-png/96d719730559f4399cf1ddc2ba973bbd.png Content-type: image/png

Fetch the image back: GET /v1/enwiki/math-png/96d719730559f4399cf1ddc2ba973bbd.png

List bucket contents: GET /v1/enwiki/math-png/ -- returns a JSON list of 50 or so entries in random order, plus a paging URL => Content-type: application/vnd.org.mediawiki.bucketlisting.1.0+json

{ .. }

Similarly, other bucket types can be created. Example for a bucket that supports efficient range / inexact match queries on byte string keys and a counter: // Create an ordered blob bucket PUT /v1/enwiki/timeseries Content-Type: application/vnd.org.mediawiki.bucket.1.0+json

{ 'type': 'ordered-blob' }

// Add an entry PUT /v1/enwiki/timeseries/2012-03-12T22:30:23.56Z-something

// get a list of entries matching an inequality GET /v1/enwiki/timeseries/?lt=2012-04&limit=1

// range query GET /v1/enwiki/timeseries/?gt=2012-02&lt=2012-04&limit=50

Another example, this time using a counter bucket: // Create an ordered blob bucket PUT /v1/enwiki/timeseries Content-Type: application/vnd.org.mediawiki.bucket.1.0+json

{ 'type': 'counter' }

// Read the current count GET /v1/enwiki/views/pages

// Increment the counter, optionally with an increment parameter POST /v1/enwiki/views/pages

Notes:
 * Access rights and content-types can be configured by bucket. Entries in public buckets are directly accessible to users able to read regular page content through the public web api:
 * Paging through all keys in a bucket is possible with most backends, but is not terribly efficient.
 * The ordered-blob type can be implemented with secondary indexes or backend-maintained extra index tables.

Front-end: Rashomon
We implemented a simple Node.js-based HTTP service called Rashomon. This stateless server runs on each storage node and load balances backend requests across storage backend servers. The node server processes use little CPU and can sustain thousands of requests per second. Clients can connect to any rashomon server they know about, which avoids making these servers a single point of failure or a bottleneck.

The current implementation is fairly basic and does not yet provide desirable features like authentication. It does however provide the basic revision storage functionality and lets us start storing HTML and metadata soon.

First supported backend: Cassandra
In the MediaWiki setup at the Wikimedia foundation, the wikitext of revisions is stored in ExternalStore, a blob store based on MySQL. As a pure key-value store ExternalStore relies on external data structures to capture revision information, for example in the MySQL revision table. This complicates storage management tasks like the grouped compression of consecutive revisions. The use of MySQL makes it relatively difficult to make both indexing and ExternalStore highly available without a single point of failure. Reads and writes of current revisions are not evenly spread across machines in the cluster, which is not ideal for performance.

After considering Riak and HBase, we investigated and tested Cassandra as an alternative backend storage solution with good results. Features of Cassandra include:


 * Symmetric DHT architecture based on Dynamo paper without single point of failure
 * Local storage based on journaling with log structured merge trees similar to LevelDB or BigTable with compression support. An import of an enwiki wikitext dump compresses to approximately 16% of the input text size on disk, including all index structures.
 * Scalable by adding more boxes, automatically distributes load and uses all machines for reads/writes
 * Replication support with consistency configurable per query; rack and datacenter awareness

Performance in write tests using three misc servers and spinning disks was around 900 revisions per second, which is well beyond production requirements. Stability of the new Cassandra 2.0 branch is on track for production use in January. Overall this let us choose Cassandra as the first storage backend we support. The storage service interface makes it straightforward to add or switch to different backends in the future without clients having to know about it.