Talk:RESTBase/StorageDesign

Update algorithms for multi-table approaches

GWicke

I was a bit surprised to see a discussion of read-before-write approaches in the multi-table section. The original intention with the multi-table designs was to avoid wide rows, avoid race conditions, and preserve eventual consistency. Read-before-write strategies that do not use idempotent writes (explicit TIMESTAMP or value ordering) sacrifice eventual consistency across datacenters. Since such updates are DC-local and not idempotent, network partitions between datacenters can easily lead to the wrong version being permanently considered "latest".

I re-added idempotent writes as an update strategy option, but am wondering whether it is worth considering read-before-write strategies at all.

EEvans (WMF)

To the best of my knowledge, the semantics we are interested in require that we retain historic revisions for a period of up to a specified TTL after they have been superseded by something newer. How do we conform to these semantics if all we are doing is writing through on update?

For example: revision 1 is written to all 3 tables on 2018-01-01T00:00:00 with a TTL of 24 hours. A user begins an edit at 2018-01-01T23:58:00 and attempts to save at 2018-01-02T00:05:00, after the records have expired.
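
The timing in this example can be checked with a few lines of Python. This is only an illustrative sketch: the variable names are hypothetical, and it assumes the TTL clock starts at write time, as the example describes.

```python
from datetime import datetime, timedelta

# Revision 1 is written once, with a 24-hour TTL that starts at write time.
written_at = datetime(2018, 1, 1, 0, 0, 0)
ttl = timedelta(hours=24)
expires_at = written_at + ttl

# The edit session from the example.
edit_started = datetime(2018, 1, 1, 23, 58, 0)
save_attempted = datetime(2018, 1, 2, 0, 5, 0)

# The record is readable when the edit starts, but gone by the time of the save.
available_at_start = edit_started < expires_at
available_at_save = save_attempted < expires_at
```

The record expires at 2018-01-02T00:00:00, two minutes after the edit begins and five minutes before the save.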

GWicke

Idempotent writes use explicit TIMESTAMP or byte ordering to let the latest revision win in an eventually consistent manner.
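
As a rough illustration of why such writes converge, here is a minimal last-write-wins sketch in Python. It is not RESTBase or Cassandra code: the `Write` type, the `merge` function, and the rule of breaking timestamp ties by comparing value bytes are assumptions made for the example.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import Iterable, Optional


@dataclass(frozen=True)
class Write:
    timestamp: int  # explicit write timestamp chosen by the writer (hypothetical unit)
    value: bytes


def merge(current: Optional[Write], incoming: Write) -> Write:
    """Keep the write with the greater (timestamp, value) pair."""
    if current is None:
        return incoming
    return max(current, incoming, key=lambda w: (w.timestamp, w.value))


def apply_all(writes: Iterable[Write]) -> Optional[Write]:
    state = None
    for w in writes:
        state = merge(state, w)
    return state


# The same writes, including one duplicate delivery, applied in every order.
writes = [
    Write(100, b"rev-1"),
    Write(200, b"rev-2"),
    Write(200, b"rev-2"),
    Write(150, b"rev-1b"),
]
results = {apply_all(order) for order in permutations(writes)}
```

Every delivery order, with or without the duplicate, ends in the same state, so replicas that receive the writes late or out of order after a partition still agree on which revision is latest.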

EEvans (WMF)

This doesn't answer my question. If writes to the TTL tables only occur on update (when the corresponding value is also written to the current table), then those records will only last for the TTL period after they were created. Any access after that is subject to a miss. In other words, if the semantics are such that we keep past versions around for a period of TTL after they were superseded (which is what the current semantics are), then this will fail. If this is intentional, then what are you proposing we do on such a miss? Perform an in-line request to Parsoid to re-generate and re-store the content?

GWicke

There are three cases we need to consider:

  1. Old revision is found in storage
    1. Remaining TTL is sufficient to finish typical tasks like VE editing: Do nothing, return content.
    2. Remaining TTL is not sufficient to finish typical tasks like VE editing: Rewrite data associated with render to extend TTL (proposal in original discussion).
  2. Old revision is not found in storage: Render on demand; TTL will be sufficient.
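
The cases above can be sketched as a small decision function. This is a hedged illustration only: `Action`, `choose_action`, and the `min_ttl_s` threshold are hypothetical names, and how the remaining TTL is obtained from storage is left out.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    RETURN_CONTENT = "return stored content"
    EXTEND_TTL = "rewrite render data to extend TTL, then return content"
    RENDER_ON_DEMAND = "render on demand and store with a fresh TTL"


def choose_action(remaining_ttl_s: Optional[int], min_ttl_s: int) -> Action:
    """Pick the read-path action for an old revision.

    remaining_ttl_s is None when the revision is not found in storage.
    min_ttl_s is the assumed time needed to finish a typical task such as a VE edit.
    """
    if remaining_ttl_s is None:
        return Action.RENDER_ON_DEMAND
    if remaining_ttl_s >= min_ttl_s:
        return Action.RETURN_CONTENT
    return Action.EXTEND_TTL
```

For example, with `min_ttl_s=3600`, a revision with 120 seconds left would be rewritten to extend its TTL, while a missing revision would be rendered on demand.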

Timeline scheme: Missing support for latest_hash with grace_ttl?

GWicke

As described, the timeline scheme does not seem to offer guarantees for the availability of old revisions. In terms of the schema, this corresponds to a "latest_hash" revision retention policy with grace_ttl.

Use case: A user edits an older revision using VisualEditor.

Issue: As described, renders of old revisions can be deleted at any time, irrespective of when they were actually rendered last. This means that the edit in the use case can fail, as original HTML or metadata needed for successful editing might not be available any more.

EEvans (WMF)

I don't think I understand your concern. Past revisions (and renders) will exist for a period of at least the TTL specified after they have been superseded by something newer. This is the reason for the timeline structure, to index revisions by timestamp so that we can choose a revision ID to use in the range delete, on the basis of its age.
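
A rough sketch of that lookup, in Python: given a timeline of (timestamp, revision ID) entries, find the newest revision that was superseded at least TTL ago, which could then serve as the upper bound of a range delete. The function name, the in-memory list, and the assumption that revision IDs increase with timestamp are all illustrative and not taken from the design document.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

# (timestamp_seconds, revision_id), sorted ascending by timestamp.
Timeline = List[Tuple[int, int]]


def range_delete_bound(timeline: Timeline, now: int, ttl: int) -> Optional[int]:
    """Return the newest revision ID superseded at least `ttl` seconds ago.

    Revisions up to and including the returned ID are old enough to delete.
    Returns None when no revision qualifies. The latest revision is never returned.
    """
    cutoff = now - ttl
    timestamps = [ts for ts, _ in timeline]
    # Entries [0, idx) were written at or before the cutoff.
    idx = bisect_right(timestamps, cutoff)
    # Entry idx-1 superseded entry idx-2 at or before the cutoff.
    if idx < 2:
        return None
    return timeline[idx - 2][1]


timeline = [(0, 1), (100, 2), (200, 3)]
bound = range_delete_bound(timeline, now=350, ttl=200)
```

Here revision 1 was superseded at t=100, which is 250 seconds before `now`, so it qualifies. Revision 2 was superseded only 150 seconds ago and is kept.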

If you're talking about a user attempting to edit an arbitrarily older revision (which I understand is an actual use case), neither this nor the other strategy will support it; this is storage of current revisions only.

GWicke

The requirement is to store current revisions long term, and old revisions for at least TTL after they were requested. In schema terms, the requirement is to support latest_hash with TTL. The prototype multi-table implementation supports this using TTLs and separate tables. What it does not implement yet is TIMESTAMP or byte order precedence for latest updates.

The timeline algorithm does not seem to take renders for old revisions into account, as the timeline is purely based on edits, and not on renders.
