RESTBase/StorageDesign

Current (low latency access)

 * 1) Storage of current revisions (most up to date render of most current revision)
 * 2) Resilient in the face of non-linearized writes; precedence is defined by revision ID and render time-stamp, not write time
 * 3) Storage of past revisions for a TTL period (at least) after they have been superseded by something newer (aka recent)
 * 4) Storage of arbitrarily old revisions (on request), for a TTL period (at least), from the time of the request (aka historical)

Archive
TBD

Retention policies using application-enforced TTLs
This approach uses a schema identical to that of the current storage model, one that utilizes wide rows to model a one-to-many relationship between a title and its revisions, and a one-to-many relationship between each revision and its corresponding renders. It differs only in how it approaches retention.

Retention
Culling of obsolete data is accomplished using range deletes. To obtain the predicates used in these range deletes, revisions and renders must be indexed by a timestamp that represents when they were superseded (replaced) by a newer revision or render, respectively. Records in the index tables use a compound key of the domain and title. Updates can be performed probabilistically, if necessary. TTLs can be applied to prevent unbounded partition growth.
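A minimal sketch of how an index of superseded times could yield the bound for a range delete. It is an in-memory stand-in, not Cassandra code; the class and method names (`SupersededIndex`, `record`, `delete_bound`) are hypothetical, and the assumption is that every revision at or below the returned bound may be culled in one delete.

```python
from bisect import bisect_right, insort
from collections import defaultdict


class SupersededIndex:
    """In-memory stand-in for the revision time-line (hypothetical structure)."""

    def __init__(self):
        # (domain, title) -> sorted list of (superseded_at, revision)
        self._entries = defaultdict(list)

    def record(self, domain, title, revision, superseded_at):
        """Note that `revision` was replaced by something newer at `superseded_at`."""
        insort(self._entries[(domain, title)], (superseded_at, revision))

    def delete_bound(self, domain, title, now, retention):
        """Return the highest revision superseded at least `retention` seconds ago.

        Returns None when nothing is old enough to cull yet.
        """
        entries = self._entries[(domain, title)]
        cutoff = now - retention
        # Index of the first entry superseded strictly after the cutoff.
        idx = bisect_right(entries, (cutoff, float("inf")))
        if idx == 0:
            return None
        return max(rev for _, rev in entries[:idx])
```

The returned value would feed the inequality predicate of the range delete (for example, "all revisions less than or equal to the bound" for the given domain and title).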

Index/Time-line storage
Only a single index each for revisions and renders is needed for all logical tables (e.g. parsoid html, data, and section offsets, mobileapps, etc.). To eliminate update duplication, these indices are maintained separately by change-propagation.

Properties
The distribution of edit frequencies across Wikimedia projects is quite extreme, ranging from approximately 1/day to nearly 200K/day. Without sampling, the lowest edit frequencies are sufficient to manage retentions of not less than 24 hours efficiently. The highest frequencies (again, without sampling) could place an unnecessary burden on storage in exchange for a resolution that vastly exceeds what is needed. Sampling applied to all time-line updates to correct for high frequency edits would render indexing of domains with lower edit frequencies less efficient. Ideally, rate-limiting by domain can be employed to sample writes from the few high edit frequency projects without affecting those with lower edit frequencies.
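One way to picture per-domain rate-limiting is a limiter that admits at most one time-line update per domain per interval. This is only a sketch: the class name `DomainRateLimiter` and the fixed `min_interval` parameter are assumptions, not part of the proposal.

```python
class DomainRateLimiter:
    """Admit at most one time-line update per `min_interval` seconds per domain."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last_write = {}  # domain -> time of the last admitted update

    def admit(self, domain, now):
        """Return True if an update for `domain` at time `now` should be indexed."""
        last = self._last_write.get(domain)
        if last is not None and now - last < self.min_interval:
            return False  # sampled out: too soon after the previous update
        self._last_write[domain] = now
        return True
```

With a 60-second interval, a domain edited once per second contributes one index entry per minute, while a domain edited once per day is never sampled out.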

Implementation
The examples above demonstrate storage of the time-line in Cassandra, but there is no requirement to do so; Redis, for example, would likely prove adequate for this use case. The contents of the index/time-line need not be perfectly durable: a catastrophic loss of all entries would merely delay the culling of past entries, and only for a period equal to the configured retention. The index can be replicated to remote data-centers, but this is not a requirement; it could, for example, be generated independently in each without impacting correctness.

Secondary Storage
This proposal in its native form does not address the 4th requirement (storage of arbitrarily old revisions for a TTL from the time of request/generation). Ultimately, it may be possible to address this requirement by falling back to a lookup of archival storage, but since that is a longer-term goal, a near-term solution is needed.

Option: Use pre-existing tables
Existing RESTBase key-rev-value tables are utilized as secondary storage. Updates to these tables are disabled, and revision retention policies are not used. When a request against  results in a miss, these existing storage tables are consulted. On a secondary storage miss, the content is generated in-line and persisted where it will live in perpetuity (or until the completion of archival storage replaces this as an intermediate solution). Since  misses are presumed to be exceptional, the amount of accumulated data should be manageable.

Option: Dedicated table with default TTL
An additional table of similar schema is used; this table utilizes a default Cassandra TTL. When a request against  results in a miss, secondary storage is consulted. If a secondary storage request results in a miss, the content is generated in-line and persisted where it will live for a period of (at least)  seconds.

Pros

 * Easy to implement (low code delta)
 * Least risk; Inherits correctness from current (well tested) implementation
 * Minimal read / write amplification

Cons

 * Creates a hard-dependency on Cassandra 3.x (needed to create range tombstones using inequality operators)
 * Requires the indexing of revisions and renders by the time they were superseded by newer values
 * Corner case: A fully qualified lookup against  is a hit, but is subsequently removed by policy in less than TTL from the time of the request.  In other words, a request that corresponds with requirement #4, but is incidentally recent enough to be found in   at the time of the request.

Two-table: Range delete-maintained latest, TTL-maintained history
This approach uses two tables with schema identical to that of the current storage model, utilizing wide rows to model a one-to-many relationship between a title and its revisions, and a one-to-many relationship between each revision and its corresponding renders. It differs though in how it approaches retention.

Writes (updates)
The first of the two tables uses (probabilistic) deletes of all previous revisions and/or renders on update in order to maintain a view of current versions. The second table uses Cassandra TTLs to expire records.

On update, the following algorithm is applied:
 * 1) Read the latest render from the   table
 * 2) Write the value read above to the   table
 * 3) Write the updated render to the   table
 * 4) Write the updated render to the   table
 * 5) Apply any range deletes for previous renders of the revision, (and for previous revisions if the   policy is used)
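The five update steps above can be sketched with toy in-memory tables. The table names `latest` and `history` are borrowed from the section heading and are assumptions, as are the `Table` class, the integer render identifiers (standing in for time-based UUIDs), and the choice to delete previous revisions as well as previous renders in step 5.

```python
import random


class Table:
    """Toy wide-row table: title -> {(revision, render): (value, expires_at)}."""

    def __init__(self):
        self.rows = {}

    def write(self, title, rev, render, value, expires_at=None):
        self.rows.setdefault(title, {})[(rev, render)] = (value, expires_at)

    def latest(self, title):
        """Return ((rev, render), value) for the highest key, or None."""
        row = self.rows.get(title)
        if not row:
            return None
        key = max(row)
        return key, row[key][0]

    def delete_older_than(self, title, rev, render):
        """Stand-in for a range delete of everything below (rev, render)."""
        row = self.rows.get(title, {})
        for key in [k for k in row if k < (rev, render)]:
            del row[key]


def update(latest, history, title, rev, render, value, now, ttl,
           delete_probability=1.0, rng=random.random):
    previous = latest.latest(title)                                 # step 1
    if previous is not None:
        (prev_rev, prev_render), prev_value = previous
        history.write(title, prev_rev, prev_render, prev_value,
                      expires_at=now + ttl)                         # step 2
    latest.write(title, rev, render, value)                         # step 3
    history.write(title, rev, render, value, expires_at=now + ttl)  # step 4
    if rng() < delete_probability:                                  # step 5
        latest.delete_older_than(title, rev, render)
```

Lowering `delete_probability` reduces the number of tombstones generated, at the cost of superseded entries lingering longer in the first table.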

Reads

 * 1) The   table is consulted
 * 2) On a miss, the   table is consulted
 * 3) On a hit, the TTL may be refreshed if there is insufficient time remaining
 * 4) On a miss, the content is generated in-line and written to both   and
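The read path above can be sketched as a single function over two dictionaries. The names `latest`, `history`, `min_remaining`, and `generate` are hypothetical; the sketch assumes that "insufficient time remaining" means fewer than `min_remaining` seconds before expiry, and that generated content is written to both tables.

```python
def read(latest, history, key, now, ttl, min_remaining, generate):
    """Look up `key`, falling back from `latest` to `history` to generation.

    latest:  dict mapping key -> value
    history: dict mapping key -> (value, expires_at)
    """
    if key in latest:                                # step 1
        return latest[key]
    entry = history.get(key)                         # step 2
    if entry is not None and entry[1] > now:
        value, expires_at = entry
        if expires_at - now < min_remaining:         # step 3: refresh the TTL
            history[key] = (value, now + ttl)
        return value
    value = generate(key)                            # step 4: generate in-line
    latest[key] = value
    history[key] = (value, now + ttl)
    return value
```

A miss therefore costs two lookups plus a render, which is the latency concern raised in the cons below.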

Pros

 * Avoids the need to index revisions and renders by the time of their replacement

Cons

 * Transactional complexity (read → 3x write → range delete)
 * Read latency; Misses on  seem unexceptional, and make client latency the aggregate of two requests, instead of one
 * Corner case: A fully qualified lookup against  is a hit despite the values copied to   having since expired. After the successful read, a probabilistically applied range delete removes the record.  The likelihood of this happening can be reduced by increasing the range delete probability (at the expense of generating more tombstones, obviously).  The possibility of this occurring cannot be entirely eliminated if range delete probability is < 1.0.

Table-per-query
This approach materializes views of results using distinct tables, each corresponding to a query.

Queries

 * The most current render of the most current revision (table: )
 * The most current render of a specific revision (table: )
 * A specific render of a specific revision (table: )

Algorithm
Data in the  table must be durable, but the contents of   and   can be ephemeral (should be, to prevent unbounded growth), lasting only for a time-to-live after the corresponding value in   has been superseded by something more recent. There are three ways of accomplishing this:

a) idempotent writes; write-through to all tables on update

b) copying the values on a read from, or

c) copying them on update, prior to replacing a value in.

With non-VE use-cases, copy-on-read is problematic due to the write-amplification it creates (think: HTML dumps). Additionally, in order to fulfill the VE contract, the copy must be done in-line to ensure the values are there for the forthcoming save, introducing additional transaction complexity, and latency. Copy-on-update over-commits by default, copying from  for every new render, regardless of the probability it will be edited, but happens asynchronously without impacting user requests, and can be done reliably. This proposal uses the copy-on-update approach.

Update logic pseudo-code:
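A minimal Python sketch of the copy-on-update approach, using dictionaries in place of tables. The names `current`, `by_rev`, and `by_render` are hypothetical labels for the three query tables, and conflict resolution for out-of-order writes is deliberately left out.

```python
def copy_on_update(current, by_rev, by_render, title, rev, render, value, now, ttl):
    """Copy the superseded value to the ephemeral tables, then replace it.

    current:   title -> (rev, render, value)                    durable
    by_rev:    (title, rev) -> (render, value, expires_at)      ephemeral
    by_render: (title, rev, render) -> (value, expires_at)      ephemeral
    """
    old = current.get(title)  # read-on-write
    if old is not None:
        old_rev, old_render, old_value = old
        expires_at = now + ttl  # the TTL clock starts when the value is superseded
        by_rev[(title, old_rev)] = (old_render, old_value, expires_at)
        by_render[(title, old_rev, old_render)] = (old_value, expires_at)
    current[title] = (rev, render, value)
```

Because the copy happens only when a value is replaced, the expiry time is measured from the moment of supersession rather than from the original write.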

Option A
Precedence is first by revision, then by render; the  table must always return the latest render for the latest revision, even in the face of out-of-order writes. This presents a challenge for a table modeled as strictly key-value, since Cassandra is last-write-wins. As a workaround, this option proposes to use a constant for write-time, effectively disabling the database's built-in conflict resolution. Since Cassandra falls back to a lexical comparison of values when encountering identical timestamps, a binary value encoded first with the revision, and then with a type-1 UUID is used to satisfy precedence requirements.
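To illustrate how a lexically comparable value could encode this precedence, the sketch below packs the revision and the render time as big-endian integers. Both function names are hypothetical. Note one assumption that departs from a literal reading of the text: the UUID's 60-bit timestamp is encoded rather than its raw bytes, because the raw type-1 layout stores the low timestamp bits first and so does not sort by time.

```python
import struct
import uuid


def uuid1_from_timestamp(ts, node=0, clock_seq=0):
    """Build a type-1 UUID carrying the 60-bit timestamp `ts` (illustration only)."""
    return uuid.UUID(
        fields=(
            ts & 0xFFFFFFFF,          # time_low
            (ts >> 32) & 0xFFFF,      # time_mid
            (ts >> 48) & 0x0FFF,      # time_hi (version bits set via version=1)
            (clock_seq >> 8) & 0x3F,  # clock_seq_hi (variant bits set via version=1)
            clock_seq & 0xFF,         # clock_seq_low
            node,
        ),
        version=1,
    )


def precedence_prefix(revision, render_tid):
    """Encode revision, then render time, so byte order equals precedence order."""
    return struct.pack(">QQ", revision, render_tid.time)
```

Comparing two prefixes as byte strings then orders values by revision first and render time second, which is the ordering the tie-breaker would need to see.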

Option B
Identical to the A proposal above, with the exception of how the  table is implemented; in this approach,   is modeled as "wide rows", utilizing a revision-based clustering key. For any given, re-renders result in the   and   attributes being overwritten each time. To prevent unbounded growth of revisions, range deletes are batched with the.

Strawman Cassandra schema:

Example: Batched INSERT+DELETE

Pros

 * Expiration using Cassandra TTL mechanism

Cons

 * Write amplification (4 additional writes for the copy scheme, 2 for the write-through)
 * Read-on-write (for copy schemes)
 * Race conditions (copy schemes)
 * Semantics of write-through scheme do not result in expiration after a value has been superseded (the clock on expiration starts at the time of the update)

Option A

 * Breaks  semantics (without timestamps tombstones do not have precedence)
 * Defeats a read optimization designed to exclude SSTables from reads (optimization relies on timestamps)
 * Defeats a compaction optimization meant to eliminate overlaps for tombstone GC (optimization relies on timestamps)
 * Is an abuse of the tie-breaker mechanism
 * Lexical value comparison only meant as a fall-back for something considered a rare occurrence (coincidentally identical timestamps)
 * Lexical value comparison is not part of the contract, could change in the future without warning (has changed in the past without warning)
 * Cassandra semantics are explicitly last write wins; This pattern is a violation of intended use/best-practice, and is isolating in nature

Option B

 * Introduces a dependency on Cassandra 3.x (option B only)

Cassandra 3.x
At the time of this writing, the production cluster is running Cassandra 2.2.6, so any of the solutions above that rely on feature(s) in Cassandra 3.x call this out as a drawback. However, there are compelling reasons to move to Cassandra 3.x beyond just the capabilities that enable the proposals cited above:
 * Proper support for JBOD configurations (CASSANDRA-6696) allows us to reduce the blast radius that having a single large RAID-0 creates
 * A side-effect of how CASSANDRA-6696 was implemented enables us to partition the compaction workload, improving key locality, and reducing read latency
 * Changes to how row indexing is handled drastically reduce partition overhead on the heap, making wider partitions possible
 * Storage in versions >= Cassandra 3.0.0 is more compact on disk (often more compact without compression than older versions are with it).