Requests for comment/Storage service

Problem statement
In the short term, the Parsoid team needs a way to store revisioned HTML, wikitext and metadata efficiently and reliably. This information needs to be accessible through a web API for efficient retrieval of revision data. The generality of a web interface can make this storage solution available to both node.js code like Parsoid and PHP code in MediaWiki. A part of this storage interface can also be used as a public content API, which is discussed in a separate RFC.

As we argue in the service RFC it is also desirable to have more generalized storage interfaces available for MediaWiki in general. The revision storage service needed for Parsoid can serve as a first example of how such a more general storage interface might look.

Goals

 * Storage backend abstraction -- let storage specialists optimize storage, free others from having to deal with it.
 * Share storage implementations -- reuse the implementation of different storage solutions across applications
 * Extensibility -- provide extension points for content handlers like Parsoid, on-demand metadata generation etc.
 * Scalability -- easily add more boxes to handle growing load
 * Reliability -- no Single Point of Failure (SPOF), cross-datacenter replication
 * API versioning -- enable evolution of APIs without breaking users unnecessarily
 * Consistency -- use essentially the same URL scheme externally and internally. Return the same content internally and externally, and make links in content work in both contexts without rewriting.

Revision storage service as a first step

 * Create REST revision storage web service and use it (initially) for Parsoid HTML/metadata storage
 * Each revision has multiple timestamped parts
 * html and JSON page properties ('meta'), one timestamped version per re-render. Lets us retrieve a page as it looked like at time X.
 * single wikitext per revision
 * JSON parsoid round-trip info
 * arbitrary metadata added by extensions (blame maps, annotations etc) (See the Element ID page.)
 * Expose substantially the same interface as a public content API

Deterministic URIs for caching
URIs should be deterministic so we can cache requests. REST-style paths are generally solving this problem well. Wiki page names can contain slashes though, which are used in an unencoded form externally. Page-related sub-resources can still be defined with a query string. The easiest method to make those deterministic is to use only a single key-value pair, so that ordering of query parameters cannot introduce non-determinism.

Options for sub-resource encoding considered are:
 * : Sounds odd, as the key does not really match the sub-resource on the right.
 * : Would require query string key order normalization (alphabetic ordering) in caches as many client libraries don't let users control the order of parameters. Requests with missing mandatory parameters or invalid combinations are rejected. Unclear how listings would be modeled in a pure key-value model. With paths those naturally fall out of incomplete paths and the trailing slash. Harder to discover valid parameter combinations; with a path any path prefix is valid.
 * : Looks more path-y, but is longer and more noisy.
 * : Short and does not induce strange meaning like key=value. The path is a bit more broken up than the second option, but looks natural and less noisy for people used to query strings. Current favorite.

Relative links in content
We would like to use relative links in stored content wherever possible. Page names containing slashes complicate this a bit, as normal browser behavior is to interpret relative links relative to the page name.

The current solution used by Parsoid is to prefix relative links in a page called  with ../../. Sadly this does not work so well when content fragments from several pages are combined in one output page, for example in Flow timelines. All links in the content would need to be rewritten so that they work with a different page name. Similar issues occur when pages are renamed.

A promising alternative is to make all links relative to the wiki root, and make this work even for pages containing slashes by setting  in the skin. This also avoids issues with alternate path-less entry points like. Setting base href is much cheaper than rewriting all hrefs in content, and allows the combination of content fragments even where that is not easily possible (ESI).

Rashomon API
GET /v1/enwiki/page/Main_Page?rev/latest -- redirect to /enwiki/latest/html, cached GET /v1/enwiki/page/Main_Page?rev/latest/html -- returns latest html, cached & purged GET /v1/enwiki/page/Main_Page?rev/latest/ -- list properties of latest revision, cached & purged GET /v1/enwiki/page/Main_Page?rev/ -- list revisions, latest first GET /v1/enwiki/page/Main_Page?rev/12345/ -- list properties with timeuuids of specific revision GET /v1/enwiki/page/Main_Page?rev/12345/html -- redirects to latest html timeuuid URL GET /v1/enwiki/page/Main_Page?rev/2013-02-23T22:23:24Z/html find revision as it was at time X, not cacheable, redirects to GET /v1/enwiki/page/Main_Page?rev/8f545ba0-2601-11e3-885c-4160918f0fb9/html stable revision snapshot identified by Type 1 UUID. Immutable apart from HTML spec updates. Assumptions:
 * A separate table is used to record a mapping from key name to content-type, update policy etc. If a key is not defined there, conservative defaults like application/octet-stream will be used.

See this background on UUIDs. Type 1 UUIDs ('timeuuid') encode a timestamp plus some random bits. Cassandra stores them in timestamp order, which lets us query these UUIDs by time. Other backends have similar time-based UUID representations.

Editing:
 * Atomically create a new revision with several properties.
 * Required post vars with example values:
 * _parent=Main_Page?rev/12344 : The parent revision. Returned as x-parent header with regular GET requests, and part of JSON returned at /enwiki/page/Main_Page?rev/ history info.
 * _rev=12345 : The new revision. Returned as  header with regular GET requests. Also part of history info.
 * Optional post vars:
 * _timestamp= 2013-09-25T09:43:09Z : Timestamp to use in timeuuid generation. Needed to import old revisions. Should require special rights. Normal updates should use the current time.
 * Typical property post vars:
 * html, wikitext, parsoid : The html, wikitext or parsoid information of this revision. All entries that are passed in are stored atomically.
 * meta : The page metadata, JSON-encoded. Language links, categories etc. Divided into static (in page content) and dynamic parts (template-generated, can change on re-expansion).
 * Returns : JSON status with new timeuuid on success, JSON error message otherwise. Implicitly purges caches.
 * Returns : JSON status with new timeuuid on success, JSON error message otherwise. Implicitly purges caches.


 * Insert new (versions of) properties for the given timeuuid base revision. A new timeuuid will be generated.
 * Typical property post vars:
 * html, wikitext : The html and wikitext of this revision
 * meta : The page metadata, JSON-encoded. Language links, categories etc. Divided into static (in page content) and dynamic parts (template-generated, can change on re-expansion).
 * Returns : JSON status with new timeuuid on success or JSON error message otherwise. Implicitly purges caches.
 * Returns : JSON status with new timeuuid on success or JSON error message otherwise. Implicitly purges caches.


 * Insert new (versions of) properties for the given revid base revision. Alternative form for the timeuuid-based update above. A new timeuuid will be generated.
 * Typical property post vars:
 * html, wikitext : The html and wikitext of this revision
 * meta : The page metadata, JSON-encoded. Language links, categories etc. Divided into static (in page content) and dynamic parts (template-generated, can change on re-expansion).
 * Returns : JSON status with new timeuuid on success, JSON error message otherwise. Implicitly purges caches.
 * Returns : JSON status with new timeuuid on success, JSON error message otherwise. Implicitly purges caches.


 * Destructively update a versioned property. This property needs to exist. Example use case: update HTML to the latest DOM spec. Requires elevated rights.
 * Returns : JSON status on success, JSON error message otherwise. Implicitly purges caches.
 * Returns : JSON status on success, JSON error message otherwise. Implicitly purges caches.

Front-end: Rashomon
We implemented a simple Node.js-based HTTP service called Rashomon. This stateless server runs on each storage node and load balances backend requests across storage backend servers. The node server processes use little CPU and can sustain thousands of requests per second. Clients can connect to any rashomon server they know about, which avoids making these servers a single point of failure or a bottleneck.

The current implementation is fairly basic and does not yet provide desirable features like authentication. It does however provide the basic revision storage functionality and lets us start storing HTML and metadata soon.

First supported backend: Cassandra
In the MediaWiki setup at the Wikimedia foundation, the wikitext of revisions is stored in ExternalStore, a blob store based on MySQL. As a pure key-value store ExternalStore relies on external data structures to capture revision information, for example in the MySQL revision table. This complicates storage management tasks like the grouped compression of consecutive revisions. The use of MySQL makes it relatively difficult to make both indexing and ExternalStore highly available without a single point of failure. Reads and writes of current revisions are not evenly spread across machines in the cluster, which is not ideal for performance.

After considering Riak and HBase, we investigated and tested Cassandra as an alternative backend storage solution with good results. Features of Cassandra include:


 * Symmetric DHT architecture based on Dynamo paper without single point of failure
 * Local storage based on journaling with log structured merge trees similar to LevelDB or BigTable with compression support. An import of an enwiki wikitext dump compresses to approximately 16% of the input text size on disk, including all index structures.
 * Scalable by adding more boxes, automatically distributes load and uses all machines for reads/writes
 * Replication support with consistency configurable per query; rack and datacenter awareness

Performance in write tests using three misc servers and spinning disks was around 900 revisions per second, which is well beyond production requirements. Stability of the new Cassandra 2.0 branch is on track for production use in January. Overall this let us choose Cassandra as the first storage backend we support. The storage service interface makes it straightforward to add or switch to different backends in the future without clients having to know about it.

Generalizing the storage service with other bucket types
In addition to a revision storage bucket as implemented by rashomon, other types with different characteristics and features can be added. One of those types is a simple key-value store without versioning, very similar to what is proposed in the DataStore RFC. Other specialized bucket types could include a key-value store with support for range queries, counters, queues or time series.

Create a simple blob bucket: PUT /v1/enwiki/math-png Content-type: application/vnd.org.mediawiki.bucket.1.0+json

{'type': 'blob'}

Get Bucket properties GET /v1/enwiki/math-png

Content-type: application/vnd.org.mediawiki.bucket.1.0+json

{'type': 'blob'}

Add an entry to a bucket: PUT /v1/enwiki/math-png/96d719730559f4399cf1ddc2ba973bbd.png Content-type: image/png

Fetch the image back: GET /v1/enwiki/math-png/96d719730559f4399cf1ddc2ba973bbd.png

List bucket contents: GET /v1/enwiki/math-png/ -- returns a JSON list of 50 or so entries in random order, plus a paging URL => Content-type: application/vnd.org.mediawiki.bucketlisting.1.0+json

{ .. }

Similarly, other bucket types can be created. Example for a bucket that supports efficient range / inexact match queries on byte string keys and a counter: // Create an ordered blob bucket PUT /v1/enwiki/timeseries Content-Type: application/vnd.org.mediawiki.bucket.1.0+json

{ 'type': 'ordered-blob' }

// Add an entry PUT /v1/enwiki/timeseries/2012-03-12T22:30:23.56Z-something

// get a list of entries matching an inequality GET /v1/enwiki/timeseries/?lt=2012-04&limit=1

// range query GET /v1/enwiki/timeseries/?gt=2012-02&lt=2012-04&limit=50

Another example, this time using a counter bucket: // Create an ordered blob bucket PUT /v1/enwiki/timeseries Content-Type: application/vnd.org.mediawiki.bucket.1.0+json

{ 'type': 'counter' }

// Read the current count GET /v1/enwiki/views/pages

// Increment the counter, optionally with an increment parameter POST /v1/enwiki/views/pages

Notes:
 * Access rights and content-types can be configured by bucket. Entries in public buckets are directly accessible to users able to read regular page content through the public web api:
 * Paging through all keys in a bucket is possible with most backends, but is not terribly efficient.
 * The ordered-blob type can be implemented with secondary indexes or backend-maintained extra index tables.