RESTBase/StorageDesign/Short longterm

Requirements

Current (low latency access)

  1. Storage of current revisions (the most up-to-date render of the most current revision)
  2. Resilience in the face of non-linearized writes; precedence is defined by revision ID and render timestamp, not write time
  3. Storage of past revisions for (at least) a TTL period after they have been superseded by something newer (aka recent)
  4. Storage of arbitrarily old revisions (on request), for (at least) a TTL period from the time of the request (aka historical)[1]
  5. 50p read latencies of 5ms, 99p of <100ms

Archive

  1. 99p read latencies < 200ms.

Recency

One of the requirements is a window of recently superseded values: values must be preserved for a predefined period after they have been replaced by something newer. This sliding window of recent history is needed to support application concurrency (see: multiversion concurrency control, MVCC). An example use-case is Visual Editor: a user begins an edit by retrieving the HTML of the most recent revision of a document, Rn. While they are working on their edit, another user commits a change, making the most recent revision Rn+1. The first user then attempts to commit their change, which requires the Parsoid metadata for Rn, despite it now having been superseded by Rn+1.
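The Visual Editor scenario can be sketched with a small in-memory model (Python; purely illustrative, with a plain dict standing in for Cassandra storage and all names invented for the example):

```python
# Illustrative in-memory model of the recency window (MVCC).
# Renders are keyed by (rev, tid); a newer render never overwrites an
# older one, so an in-flight edit can still resolve the value it started from.

store = {}  # (rev, tid) -> rendered value; stand-in for Cassandra storage

def save_render(rev, tid, value):
    # Superseded values are retained; retention (TTL / range deletes)
    # happens later, out of band.
    store[(rev, tid)] = value

def get_render(rev, tid):
    return store.get((rev, tid))

# User A begins an edit against the current revision Rn.
save_render(10, "tid-a", "html of R10")
seen_by_a = (10, "tid-a")

# Meanwhile, user B commits a change, making Rn+1 current.
save_render(11, "tid-b", "html of R11")

# User A's save still needs the Rn data it started from,
# despite R10 having been superseded by R11.
assert get_render(*seen_by_a) == "html of R10"
```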

An open question remains regarding the latency requirements for recent data. For example: Should access be comparable to that of current? Is the aggregate of a current lookup and a secondary lookup of comparable latency acceptable (2x current)? Is the aggregate of a current lookup and a secondary lookup of archive storage acceptable (current + 10x current)?

Option 1: Timeline storage, current_and_recent + historical

Short term

In the short term, this approach uses two tables per logical table, plus two shared timeline tables. Both per-logical tables are identical to the current schema (and can optionally reuse the existing current table); they differ only in that the TTL table has a default time to live configured.

-- Strawman Cassandra schema
CREATE TABLE current_and_recent (
  "_domain" text,
  title text,
  rev int,
  tid timeuuid,
  value blob,
  PRIMARY KEY (("_domain", title), rev, tid)
) WITH CLUSTERING ORDER BY (rev DESC, tid DESC);

CREATE TABLE historical (
  "_domain" text,
  title text,
  rev int,
  tid timeuuid,
  value blob,
  PRIMARY KEY (("_domain", title), rev, tid)
) WITH CLUSTERING ORDER BY (rev DESC, tid DESC)
  AND default_time_to_live = 86400;  -- 1 day

Shared timeline tables:

CREATE TABLE revision_timeline (
  "_domain" text,
  title text,
  ts timestamp,
  rev int,
  PRIMARY KEY (("_domain", title), ts)
) WITH CLUSTERING ORDER BY (ts DESC)
  AND compaction = {'class': 'TimeWindowCompactionStrategy'}
  AND default_time_to_live = 2592000;  -- 30 days(?)

CREATE TABLE render_timeline (
  "_domain" text,
  title text,
  ts timestamp,
  rev int,
  tid timeuuid,
  PRIMARY KEY (("_domain", title), rev, ts)
) WITH CLUSTERING ORDER BY (rev DESC, ts DESC)
  AND compaction = {'class': 'TimeWindowCompactionStrategy'}
  AND default_time_to_live = 2592000;  -- 30 days(?)
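The retention check these timeline tables enable can be sketched as follows (Python; an illustrative model, not production code: the newest-first row order mirrors the descending clustering order, and the concrete timestamps and window are invented for the example):

```python
# Sketch of the retention check driven by the revision_timeline table.
# Each (ts, rev) row records when `rev` became current; a revision is
# therefore superseded at the ts of the row that follows it in time.

def expired_revisions(timeline, now, window):
    """timeline: (ts, rev) rows, newest first (mirroring the DESC clustering).
    Returns revisions whose superseding entry is older than the window,
    i.e. candidates for a range delete from current_and_recent."""
    expired = []
    for (superseded_at, _), (_, rev) in zip(timeline, timeline[1:]):
        if now - superseded_at > window:
            expired.append(rev)
    return expired

# rev 12 became current at t=1000, rev 11 at t=900, rev 10 at t=100.
timeline = [(1000, 12), (900, 11), (100, 10)]
# With a 250s window at t=1200: rev 10 (superseded at t=900) has expired,
# rev 11 (superseded at t=1000) is still within the window, rev 12 is current.
assert expired_revisions(timeline, now=1200, window=250) == [10]
```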

Writes (updates)

Read the previous value

  1. Read old latest value from current_and_recent

Once per render / revision:

  1. Update the revision timeline table
  2. Update the render timeline table

Once per logical table write:

  1. Write new value to current_and_recent
  2. If the old latest value has a revision ID greater than the new revision ID: write the new value to the TTL table (unlikely to happen)

Asynchronous (and potentially probabilistic) retention policy update. Not all of the steps are required each time: renders can only be deleted when old_rev == new_rev, and revisions only otherwise, so at most two queries are needed.

  1. Select expired revision from revision_timeline table
  2. Select expired render from render_timeline table
  3. Range delete revisions from current_and_recent
  4. Range delete renders from current_and_recent
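The write steps above can be sketched as follows (Python; plain dicts stand in for the Cassandra tables, and all names, timestamps, and tids are illustrative):

```python
# Illustrative Option 1 write path. `tables` maps table names to plain
# dicts standing in for the Cassandra tables; `tid` and `now` are passed
# in rather than generated, to keep the sketch deterministic.

def write_render(tables, domain, title, rev, tid, now, value):
    key = (domain, title)
    renders = tables["current_and_recent"].setdefault(key, {})
    # 1. Read the old latest value; "latest" is the highest (rev, tid),
    #    mirroring the (rev DESC, tid DESC) clustering order.
    old_latest = max(renders) if renders else None

    # 2./3. Record when this entry took over, so the retention job can
    #       later range-delete whatever it superseded.
    tables["revision_timeline"].setdefault(key, []).insert(0, (now, rev))
    tables["render_timeline"].setdefault((key, rev), []).insert(0, (now, tid))

    # 4. Write the new value to current_and_recent.
    renders[(rev, tid)] = value
    # 5. Unlikely out-of-order case: the write is for an older revision
    #    than the current latest, so it also goes to the TTL'd table.
    if old_latest is not None and old_latest[0] > rev:
        tables["historical"][(key, rev, tid)] = value

tables = {"current_and_recent": {}, "historical": {},
          "revision_timeline": {}, "render_timeline": {}}
write_render(tables, "en.wiki", "Foo", 10, 1, 100, "render of r10")
write_render(tables, "en.wiki", "Foo", 11, 2, 200, "render of r11")
# Late write for the already-superseded revision 10:
write_render(tables, "en.wiki", "Foo", 10, 3, 300, "re-render of r10")
assert (("en.wiki", "Foo"), 10, 3) in tables["historical"]
```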

Reads

  1. The current_and_recent table is consulted
  2. On a miss for a specific revision / render request, the historical table is consulted
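This read fallback can be sketched as (Python; plain dicts stand in for the Cassandra tables, and the concrete keys are invented for the example):

```python
# Illustrative Option 1 read path over dict stand-ins for the tables.

def read_render(current_and_recent, historical, domain, title, rev=None, tid=None):
    key = (domain, title)
    renders = current_and_recent.get(key, {})
    # Latest render: the highest (rev, tid), per the DESC clustering order.
    if rev is None:
        return renders[max(renders)] if renders else None
    # 1. Try current_and_recent first.
    hit = renders.get((rev, tid))
    if hit is not None:
        return hit
    # 2. Miss on a specific revision / render: fall back to the TTL'd table.
    return historical.get((key, rev, tid))

current_and_recent = {("en.wiki", "Foo"): {(11, 2): "render of r11"}}
historical = {(("en.wiki", "Foo"), 10, 1): "render of r10"}

assert read_render(current_and_recent, historical, "en.wiki", "Foo") == "render of r11"
assert read_render(current_and_recent, historical, "en.wiki", "Foo", 10, 1) == "render of r10"
```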

Pros

  • Slightly lower latency for recent render cache misses: Requests for recent renders can always be handled by the current_and_recent table, and don't fall back to the TTL table.
  • One fewer write per logical table than Option 2 (but 1-2 additional writes per Parsoid render; this load can be decreased with probabilistic updates, at the cost of slower GC).

Cons

  • Corner case: A by-revision or fully qualified lookup against current_and_recent can be a hit despite the record being eligible for deletion; after the successful read, a probabilistically applied range delete removes it. The likelihood of this happening can be reduced by increasing the range delete probability (at the expense of generating more tombstones). The possibility cannot be entirely eliminated while the range delete probability is < 1.0.
    • Possible solutions:
      • Do not check the current_and_recent table for fully qualified renders?
      • Have a static column in current_and_recent indicating the current latest revision; if the received value is not the latest, stash it in the TTL table?
  • Timeline indexing is specific to Parsoid revisions, and does not readily generalize to arbitrary table schemas. We would lose the retention policy abstraction.

Long term

Long term, archival storage will replace the historical table.

Writes (updates)

Once per render / revision:

  • Update the revision and render timeline tables

Once per write to a logical table:

  • Write new value to current_and_recent
  • Write new value to archival storage

Asynchronous (and potentially probabilistic) retention policy update

  • Select expired revision from revision_timeline table
  • Select expired render from render_timeline table
  • Range delete revisions from current_and_recent
  • Range delete renders from current_and_recent

Reads

  1. The current_and_recent table is consulted
  2. On a miss for a specific revision / render request, archival storage is consulted

Pros

  • Lower latency for recent render cache misses: Requests for recent renders can always be handled by Cassandra storage, and don't fall back to archival storage.

Cons

  • Corner case: A by-revision or fully qualified lookup against current_and_recent can be a hit despite the record being eligible for deletion; after the successful read, a probabilistically applied range delete removes it. The likelihood of this happening can be reduced by increasing the range delete probability (at the expense of generating more tombstones). The possibility cannot be entirely eliminated while the range delete probability is < 1.0.
  • Timeline indexing is specific to Parsoid revisions, and does not readily generalize to arbitrary table schemas. We would lose the retention policy abstraction.
  • Archival solution is expected to handle thinning & compression.

Option 2: Current + archival

Short term

In the short term, this approach uses two tables per logical table. Both tables are identical to the current schema (and can optionally reuse the existing current table); they differ only in that the TTL table has a default time to live configured.

-- Strawman Cassandra schema
CREATE TABLE current (
  "_domain" text,
  title text,
  rev int,
  tid timeuuid,
  value blob,
  PRIMARY KEY (("_domain", title), rev, tid)
) WITH CLUSTERING ORDER BY (rev DESC, tid DESC);

CREATE TABLE recent_and_historical (
  "_domain" text,
  title text,
  rev int,
  tid timeuuid,
  value blob,
  PRIMARY KEY (("_domain", title), rev, tid)
) WITH CLUSTERING ORDER BY (rev DESC, tid DESC)
  AND default_time_to_live = 864000;  -- 10 days

Writes (updates)

  1. Read the latest render from the current table
  2. Batch:
    1. Write the value read above to the recent_and_historical table
    2. Write the updated render to the recent_and_historical table[2]
    3. Write the updated render to the current table

Asynchronously, and potentially probabilistically:

  • Apply range delete for previous renders of the revision, (and for previous revisions if the latest_hash policy is used)
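The batched write above can be sketched as follows (Python; plain dicts stand in for the Cassandra tables, the batch's atomicity is only simulated by grouping the writes, and all names are illustrative):

```python
# Illustrative Option 2 write path. Copying the displaced latest render
# into the TTL'd recent_and_historical table in the same batch as the new
# write is what prevents concurrent updates from losing it (note [2]).

def write_render(current, recent_and_historical, domain, title, rev, tid, value):
    key = (domain, title)
    renders = current.setdefault(key, {})
    # 1. Read the latest render from the current table.
    latest = max(renders) if renders else None
    # 2. Batch (a logged batch in Cassandra; sequential here):
    if latest is not None:
        recent_and_historical[(key,) + latest] = renders[latest]  # 2.1
    recent_and_historical[(key, rev, tid)] = value                # 2.2
    renders[(rev, tid)] = value                                   # 2.3

current, recent_and_historical = {}, {}
write_render(current, recent_and_historical, "en.wiki", "Foo", 10, 1, "r10")
write_render(current, recent_and_historical, "en.wiki", "Foo", 11, 2, "r11")
# The displaced r10 render was copied into the TTL'd table by the batch.
assert recent_and_historical[(("en.wiki", "Foo"), 10, 1)] == "r10"
```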

Reads

  1. The current table is consulted
  2. On a miss for a specific revision / render request, the recent_and_historical table is consulted

Pros

  • Simplicity.
  • Avoids the need to index revisions and renders by the time of their replacement
  • Avoids race conditions between timeline updates & actual render updates.
  • Generalizes to arbitrary tables and retention policies. Example: "latest" retention policy can omit archival parts.

Cons

  • One more content write than Option 1 in the common case (latest revision update).
  • Corner case: A fully qualified lookup against current can be a hit despite the values copied to recent_and_historical having since expired; after the successful read, a probabilistically applied range delete removes the record. The likelihood of this happening can be reduced by increasing the range delete probability (at the expense of generating more tombstones). The possibility cannot be entirely eliminated while the range delete probability is < 1.0.

Long term

As in Option 1, the recent_and_historical table is replaced by archival storage.

Writes (updates)

  1. Write the updated render to the current table
  2. Write the updated render to the archival table

Asynchronously, and potentially probabilistically:

  • Apply range delete for previous renders of the revision, (and for previous revisions if the latest_hash policy is used)

Reads

  1. The current table is consulted
  2. On a miss for a specific revision / render request, the archival table is consulted

Note: Archival storage is expected to perform asynchronous thinning & compression activities in the background. These are expected to only affect renders older than 24 hours or so.

Pros

  • Simplicity.
  • Avoids the need to index revisions and renders by the time of their replacement.
  • Avoids race conditions between timeline updates & actual render updates.
  • Generalizes to arbitrary tables and retention policies. Example: "latest" retention policy can omit archival parts.

Cons

  • Archival solution has a requirement to implement asynchronous thinning & compression.
  • Read latency: Requests for old revisions / renders that are not hits in Varnish will hit archival storage, which is likely to have higher latency than current revision storage.
  • Corner case: A fully qualified lookup against current can be a hit despite the values copied to recent_and_historical having since expired; after the successful read, a probabilistically applied range delete removes the record. The likelihood of this happening can be reduced by increasing the range delete probability (at the expense of generating more tombstones). The possibility cannot be entirely eliminated while the range delete probability is < 1.0.

Notes

  1. ↑ This is technically not current; a stop-gap until archival storage is in place.
  2. ↑ This is necessary to avoid a race condition between concurrent updates resulting in lost writes.