RESTBase/StorageDesign

Current (low latency access)

 * 1) Storage of current revisions (most up to date render of most current revision);
 * 2) Resilient in the face of non-linearized writes; Precedence defined by revision ID and render time-stamp, not write time
 * 3) Storage of past revisions for a TTL period (at least) after they have been superseded by something newer (aka recent)
 * 4) Storage of arbitrarily old revisions (on request), for a TTL period (at least), from the time of the request (aka historical)
 * 5) 50p read latencies of 5ms, 99p of <100ms

Archive

 * 1) Read latencies on the order of 10x that of current.

Recency
One of the requirements is for a window of recently superseded values; values must be preserved for a predefined period after they have been replaced by something newer. This sliding window of recent history is needed to support application concurrency (see: multiversion concurrency control, MVCC). An example use-case is Visual Editor: A user begins an edit by retrieving the HTML of the most recent revision of a document, Rn. While they are working on their edit, another user commits a change, making the most recent revision Rn+1. The first user then attempts to commit their change, which requires the Parsoid meta-data for Rn, even though it has now been superseded by Rn+1.

An open question remains regarding the latency requirements for recent data. For example: Should access be comparable to that of current? Is the aggregate of current, and a secondary lookup of comparable latency acceptable (2x current)? Is the aggregate of current and a secondary lookup of archive storage acceptable (current + (10x current))?

Retention policies using application-enforced TTLs
This approach uses a schema identical to that of the current storage model, one that utilizes wide rows to model a one-to-many relationship between a title and its revisions, and a one-to-many relationship between each revision and its corresponding renders. It differs only in how it approaches retention.

Retention
Culling of obsolete data is accomplished using range deletes. For example: In order to obtain the predicates used in these range deletes, revisions and renders must be indexed by a timestamp that represents when they were superseded, or replaced, by a newer revision and/or render respectively. Records in the index tables are a compound key of the domain and title. Updates can be performed probabilistically, if necessary. TTLs can be applied to prevent unbounded partition growth.
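The idea of indexing by superseded time and deriving a range delete predicate from it can be sketched in a few lines of Python. This is only an in-memory illustration under stated assumptions: the `Timeline` class, its method names, and the `rev <= bound` form of the delete are hypothetical and do not come from RESTBase or from any Cassandra driver.

```python
import bisect
from collections import defaultdict


class Timeline:
    """Hypothetical in-memory stand-in for the revision time-line index.

    Entries are keyed by (domain, title) and record the time at which
    each revision was superseded by a newer one.
    """

    def __init__(self):
        # (domain, title) -> sorted list of (superseded_at, revision)
        self._entries = defaultdict(list)

    def record_superseded(self, domain, title, revision, superseded_at):
        bisect.insort(self._entries[(domain, title)], (superseded_at, revision))

    def delete_bound(self, domain, title, now, ttl):
        """Return the newest revision superseded at least `ttl` seconds
        before `now`, or None if nothing is old enough to cull."""
        entries = self._entries.get((domain, title), [])
        # Position of the first entry superseded after the cutoff.
        idx = bisect.bisect_right(entries, (now - ttl, float("inf")))
        if idx == 0:
            return None
        return max(rev for _, rev in entries[:idx])


def range_delete(revisions, bound):
    """Emulate a range delete of every revision <= bound (assumed form).

    `revisions` maps revision ID -> value; the survivors are returned.
    """
    return {rev: val for rev, val in revisions.items() if rev > bound}
```

Because the predicate is an upper bound rather than a list of keys, a single delete also covers revisions that were never indexed, which is what makes probabilistic index updates tolerable.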

Index/Time-line storage
Only a single index for each (revisions and renders) is needed for all logical tables (e.g. parsoid html, data, and section offsets, mobileapps, etc), so to eliminate update duplication, these indices are separately maintained by change-propagation.

Properties
The distribution of edit frequencies across Wikimedia projects is quite extreme, ranging from approximately 1/day, to nearly 200K/day. Without sampling, the lowest edit frequencies are sufficient to manage retentions of not less than 24 hours efficiently. The highest frequencies (again, without sampling) could place an unnecessary burden on storage in exchange for a resolution that vastly exceeds what is needed. Sampling applied to all time-line updates to correct for high frequency edits would render indexing of domains with lower edit frequencies less efficient. Ideally, rate-limiting by domain can be employed to sample writes from the few high edit frequency projects without affecting those with lower edit frequencies.
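One way to picture per-domain rate-limiting is a sampler that admits at most one time-line write per domain per interval. The sketch below is an assumption-laden illustration, not the change-propagation implementation: the class name, the `min_interval` parameter, and the example domain names are all hypothetical.

```python
class TimelineSampler:
    """Hypothetical per-domain rate limiter for time-line index writes.

    At most one index write per domain is admitted every `min_interval`
    seconds. Domains that edit less often than that are never throttled.
    """

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last_indexed = {}  # domain -> time of the last admitted write

    def should_index(self, domain, now):
        last = self._last_indexed.get(domain)
        if last is not None and now - last < self.min_interval:
            return False  # too soon for this domain; skip the index write
        self._last_indexed[domain] = now
        return True
```

With this scheme the resolution of the time-line for a busy domain degrades to roughly `min_interval`, while a domain that sees one edit per day keeps every entry.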

Implementation
The examples above demonstrate storage of the time-line in Cassandra, but there is no requirement to do so. Redis, for example, would likely prove adequate for this use case. The contents of the index/time-line need not be perfectly durable: a catastrophic loss of all entries would merely delay the culling of past entries, and only for a period equal to the configured retention. The index can be replicated to remote data-centers, but this is not a requirement; it could, for example, be generated independently in each without impacting correctness.

Secondary Storage
This proposal in its native form does not address the 4th requirement (storage of arbitrarily old revisions for a TTL from the time of request/generation). Ultimately, it may be possible to address this requirement by falling back to a lookup of archival storage, but as that is a longer-term goal, a near-term solution is needed.

Option: Use pre-existing tables
Existing RESTBase key-rev-value tables are utilized as secondary storage. Updates to these tables are disabled, and revision retention policies are not used. When a request against  results in a miss, these existing storage tables are consulted. On a secondary storage miss, the content is generated in-line and persisted where it will live in perpetuity (or until the completion of archival storage replaces this as an intermediate solution). Since  misses are presumed to be exceptional, the amount of accumulated data should be manageable.

Option: Dedicated table with default TTL
An additional table of similar schema is used, this table utilizes a default Cassandra TTL. When a request against  results in a miss, secondary storage is consulted. If a secondary storage request results in a miss, the content is generated in-line and persisted where it will live for a period of (at least)  seconds.

Writes (updates)
NOTE: Steps 2 through 4 can be performed asynchronously from step 1; Failure to perform the range delete does not affect correctness
 * 1) Append updated value to the   table
 * 2) If the update created a new revision, query the   table for a revision that was superseded TTL seconds or longer from the current time
 * 3) Otherwise, if the update created a new render for an existing revision, query the   table for a render that was superseded TTL seconds or longer from the current time
 * 4) Perform a range delete of either revisions or renders, using the information obtained in #2 or #3 above

Reads

 * 1) The   table is consulted
 * 2) The   table is consulted
 * 3) On a miss, the content is generated in-line and written to the   table.
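The read path above can be sketched as a simple fall-through. This is a minimal illustration under assumptions: plain dicts stand in for the two tables, the names `primary` and `secondary` are hypothetical placeholders for them, and persisting generated content to secondary storage with a TTL follows the dedicated-table option described earlier.

```python
def read(key, primary, secondary, generate, now, ttl):
    """Return (value, source) for `key`, falling through on misses.

    primary:   dict of key -> value (hypothetical stand-in for the main table)
    secondary: dict of key -> (value, expires_at) (hypothetical TTL table)
    generate:  callable that produces the content in-line on a full miss
    """
    if key in primary:
        return primary[key], "primary"

    entry = secondary.get(key)
    if entry is not None:
        value, expires_at = entry
        if expires_at > now:  # emulate TTL expiry
            return value, "secondary"

    # Full miss: generate in-line and persist for (at least) `ttl` seconds.
    value = generate(key)
    secondary[key] = (value, now + ttl)
    return value, "generated"
```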

Pros

 * Easy to implement (low code delta)
 * Least risk; Inherits correctness from current (well tested) implementation
 * Minimal read / write amplification
 * Low latency access to values in recency window (identical to that of current)

Cons

 * Creates a hard-dependency on Cassandra 3.x (needed to create range tombstones using inequality operators)
 * Requires the indexing of revisions and renders by the time they were superseded by newer values
 * Corner case: A fully qualified lookup against  is a hit, but is subsequently removed by policy in less than TTL from the time of the request.  In other words, a request that corresponds with requirement #4, but is incidentally recent enough to be found in   at the time of the request.
 * Unclear how this will generalize, and which changes to the table storage interface would be needed.

Ideas for generalizing the storage module
Currently we have 3 layers of abstraction: application level, bucket level (provides a set of generic key-value and key-rev-value interfaces to put/get/list content) and table level (abstracts away the particular storage technology and provides a generic API to describe table schemas, including retention policies, put/get and list operations). The proposal is to keep the 3-layer approach, but shuffle the responsibilities of each layer. Application layer responsibilities are unchanged.

The table layer would lose the retention policies and secondary indexing support, but in exchange its API would expand to support deletes (including range deletes) and batching of statements. An example is below. It is pseudocode meant only to illustrate the idea and should not be taken literally. The responsibilities of the bucket layer would expand to include creating the proper tables and performing retention.

For Parsoid, there will be a special bucket,  It will expose parsed-specific endpoints to store and retrieve content while internally it will create all the necessary tables, maintain indexes and manage range deletes. The parsed bucket is, obviously, not useful in any other use-case, because of the specific requirements we have for the parsoid storage.

Along with it we can design other types of buckets, useful in general, or just replicate the functionality of the key_value and key_rev_value buckets with certain retention policies. For example, for mobile apps, we could use the key_rev_value bucket with the `latest_hash` policy, or even design a special `mobile_bucket` if we want to batch together the lead and remaining section updates.

Two-table: Range delete-maintained latest, TTL-maintained history
This approach uses two tables with schema identical to that of the current storage model, utilizing wide rows to model a one-to-many relationship between a title and its revisions, and a one-to-many relationship between each revision and its corresponding renders. It differs though in how it approaches retention. The first of the two tables uses (probabilistic) deletes of all previous revisions and/or renders on update in order to maintain a view of current versions. The second table uses Cassandra TTLs to automatically expire records and stores recently superseded values, along with any historical values that had to be generated in-line.

Writes (updates)

 * 1) Read the latest render from the   table
 * 2) Write the value read above to the   table
 * 3) Write the updated render to the   table
 * 4) Write the updated render to the   table
 * 5) Apply any range deletes for previous renders of the revision, (and for previous revisions if the   policy is used)
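The update sequence above can be sketched as follows. This is one possible reading, stated as an assumption: the latest render is read from the table holding current values, copied to the TTL table, and the new render is then written to both, followed by a probabilistic range delete. The names `current` and `history`, the integer `tid` standing in for a time-UUID, and the injectable `rng` are all hypothetical.

```python
import random


def update(current, history, title, rev, tid, value, now, ttl,
           delete_probability=1.0, rng=random.random):
    """Apply one update in the (assumed) order read, 3x write, range delete.

    current: dict of title -> {(rev, tid): value}, a wide row per title
    history: dict of (title, rev, tid) -> (value, expires_at), the TTL table
    """
    rows = current.setdefault(title, {})

    # 1-2) Read the latest render and copy it to the TTL table.
    if rows:
        latest_key = max(rows)
        history[(title,) + latest_key] = (rows[latest_key], now + ttl)

    # 3-4) Write the updated render to both tables.
    history[(title, rev, tid)] = (value, now + ttl)
    rows[(rev, tid)] = value

    # 5) Probabilistically delete everything that sorts before the newest
    #    (rev, tid), so precedence never depends on write order.
    if rng() < delete_probability:
        newest = max(rows)
        for key in [k for k in rows if k < newest]:
            del rows[key]
```

Selecting the survivor with `max` over `(rev, tid)` rather than keeping whatever was written last means a late-arriving write for an older revision is copied to history but never displaces the current value.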

Reads

 * 1) The   table is consulted
 * 2) On a miss, the   table is consulted
 * 3) On a hit, the TTL may be refreshed if there is insufficient time remaining
 * 4) On a miss, the content is generated in-line and written to both   and

Longer-term alternatives to the table
Unlike the application-enforced TTL proposal above, this proposal's use of a TTL table doubles as storage of both recently superseded data, and historical records that were (re)generated in-line (requirements #3 and #4 above respectively). Nevertheless, it has been suggested that, once in place, archival storage could be used in place of this table. This would have the advantage of reducing transactional complexity, eliminating the need for the extra read and writes on update. It has the disadvantage of higher read latency for all requests in the recent window. Additionally, it creates the requirement that archival storage contain the full history in a window of recency according to the semantics dictated by requirement #3.

Pros

 * Avoids the need to index revisions and renders by the time of their replacement

Cons

 * Transactional complexity (read → 3x write → range delete)
 * Read latency; Misses on  seem unexceptional, and make latency within the recency window the aggregate of two requests, instead of one
 * Aggregate of current and archive (current + (10 x current)) if archive is used as fall-back for recent history instead of a dedicated TTL table
 * Corner case: A fully qualified lookup against  is a hit despite the values copied to   having since expired. After the successful read, a probabilistically applied range delete removes the record.  The likelihood of this happening can be reduced by increasing the range delete probability (at the expense of generating more tombstones).  The possibility cannot be entirely eliminated if the range delete probability is < 1.0.
 * If archive is used as a fall-back for values in the recency window, this pushes an additional requirement on archive (namely that it maintain the window of recency)

Table-per-query
This approach materializes views of results using distinct tables, each corresponding to a query.

Queries

 * The most current render of the most current revision (table: )
 * The most current render of a specific revision (table: )
 * A specific render of a specific revision (table: )

Algorithm
Data in the  table must be durable, but the contents of   and   can be ephemeral (should be, to prevent unbounded growth), lasting only for a time-to-live after the corresponding value in   has been superseded by something more recent. There are three ways of accomplishing this:

a) idempotent writes; write-through to all tables on update

b) copying the values on a read from, or

c) copying them on update, prior to replacing a value in.

With non-VE use-cases, copy-on-read is problematic due to the write-amplification it creates (think: HTML dumps). Additionally, in order to fulfill the VE contract, the copy must be done in-line to ensure the values are there for the forthcoming save, introducing additional transaction complexity, and latency. Copy-on-update over-commits by default, copying from  for every new render, regardless of the probability it will be edited, but happens asynchronously without impacting user requests, and can be done reliably. This proposal uses the copy-on-update approach.

Update logic pseudo-code:

Option A
Precedence is first by revision, then by render; The  table must always return the latest render for the latest revision, even in the face of out-of-order writes. This presents a challenge for a table modeled as strictly key-value, since Cassandra is last write wins. As a work around, this option proposes to use a constant for write-time, effectively disabling the database's in-built conflict resolution. Since Cassandra falls back to a lexical comparison of values when encountering identical timestamps, a binary value encoded first with the revision, and then with a type-1 UUID is used to satisfy precedence requirements.
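To make the encoding concrete, here is a minimal Python sketch of a value whose byte-wise ordering matches the precedence rules. It rests on assumptions that are not in the proposal: the exact layout (8-byte big-endian revision, then the UUID's 60-bit timestamp, then the raw UUID bytes), the helper names, and modelling the tie-break as "the larger value wins". The timestamp is prepended because the raw bytes of a type-1 UUID start with the low bits of the timestamp and so do not sort chronologically.

```python
import struct
import uuid


def make_tid(timestamp, node=0):
    """Build a type-1 UUID from a 60-bit timestamp (helper for illustration)."""
    time_low = timestamp & 0xFFFFFFFF
    time_mid = (timestamp >> 32) & 0xFFFF
    time_hi_version = ((timestamp >> 48) & 0x0FFF) | 0x1000  # version 1
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version, 0x80, 0, node))


def encode_value(revision, tid, payload):
    """Prefix `payload` so that byte-wise order is (revision, render time)."""
    if tid.version != 1:
        raise ValueError("a type-1 (time-based) UUID is required")
    # Big-endian integers compare byte-wise in the same order as numerically.
    return struct.pack(">QQ", revision, tid.time) + tid.bytes + payload


def resolve(a, b):
    """Model of the tie-break on identical write times (assumed: larger wins)."""
    return max(a, b)
```

Under these assumptions the outcome of `resolve` is independent of the order in which the two writes arrive, which is the property the constant write-time is meant to buy.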

Option B
Identical to the A proposal above, with the exception of how the  table is implemented; In this approach,   is modeled as "wide rows", utilizing a revision-based clustering key. For any given, re-renders result in the   and   attributes being overwritten each time. To prevent unbounded growth of revisions, range deletes are batched with the.

Strawman Cassandra schema:

Example: Batched INSERT+DELETE

Pros

 * Expiration using Cassandra TTL mechanism

Cons

 * Write amplification (4 additional writes for the copy scheme, 2 for the write-through)
 * Read-on-write (for copy schemes)
 * Race conditions (copy schemes)
 * Semantics of write-through scheme do not result in expiration after a value has been superseded (the clock on expiration starts at the time of the update)

Option A

 * Breaks  semantics (without timestamps tombstones do not have precedence)


 * Defeats a read optimization designed to exclude SSTables from reads (optimization relies on timestamps)
 * Defeats a compaction optimization meant to eliminate overlaps for tombstone GC (optimization relies on timestamps)
 * Is an abuse of the tie-breaker mechanism
 * Lexical value comparison only meant as a fall-back for something considered a rare occurrence (coincidentally identical timestamps)
 * Lexical value comparison is not part of the contract, could change in the future without warning (has changed in the past without warning)
 * Cassandra semantics are explicitly last write wins; This pattern is a violation of intended use/best-practice, and is isolating in nature

Option B

 * Introduces a dependency on Cassandra 3.x (option B only)

Cassandra 3.x
At the time of this writing, the production cluster is running Cassandra 2.2.6, so any of the solutions above that rely on feature(s) in Cassandra 3.x call this out as a drawback. However, there are compelling reasons to move to Cassandra 3.x beyond just the capabilities that enable the proposals cited above:
 * Proper support for JBOD configurations (CASSANDRA-6696) allows us to solve the blast radius that having a single large RAID-0 creates
 * A side-effect of how CASSANDRA-6696 was implemented enables us to partition the compaction workload, improving key locality, and reducing read latency
 * Changes to how row indexing is handled drastically reduce partition overhead on the heap, making wider partitions possible
 * Storage in versions >= Cassandra 3.0.0 is more compact on disk (often more compact without compression than older versions with).

Decision
After much deliberation, a decision was made to move forward with the option entitled Retention policies using application-enforced TTLs (aka "app-enforced TTLs"), over the option entitled Two-table: Range delete-maintained latest, TTL-maintained history (or "two-table"). A high-level summary of the reasoning follows:

Algorithmically, this option is the most iterative; From the read/write perspective, it inherits the correctness properties of the current system. It differs only in how retention is managed (the aspect of the current system that is intractable). Like the current system, retention can be decoupled and opportunistic; A failure in applying a delete for retention does not affect correctness (it only delays the culling of obsolete data), nor does it need to fail the entire request.

When compared to the two-table approach, app-enforced TTLs has less transactional complexity, requiring fewer reads to satisfy requests, fewer writes for updates, and less complexity in the sequencing of operations, to satisfy correctness.

The primary point of contention here related to generalization. For use-cases with multiple, referential tables, the app-enforced TTL approach requires deletes of affected tables to be performed in batches. This violates encapsulation when using the current interfaces. However, during discussion, the consensus was that use-cases with tables needing referential integrity would be better served by a different abstraction anyway, one that can encapsulate the batching and/or sequencing of interdependent writes, and where retention can act upon the sum of affected tables.