Flow/Architecture/Memcache

Flow is written to take full advantage of a memcache caching layer. To this end we have implemented something very roughly equivalent to queryable indexes in memcache. We have considered moving this to Redis, but for now it is using the memcache infrastructure.

What is being cached?
All of the Flow-specific data models. The reverse-proxy cluster that serves most wiki content to visitors does not help editors, who bypass it. To give editors the responsiveness they deserve we are caching aggressively within the application. We decided to cache at the data model level, rather than options like caching view fragments, to simplify cache invalidation and to handle the variance in what different editors see based on their roles.

What is actually written in the cache?
The Flow "indexes" are always nested arrays of arrays, e.g. array( array( /* first data model */ ), array( /* second data model */ ), ... ). Each index is defined with a memcache key prefix (e.g. 'flow_definition:pk'). This is combined with an equality condition like array( 'definition_id' => 12345 ) to generate a cache key such as 'flow_definition:pk:12345'. When the index is defined it is told which keys within the data model are used for indexing, so that it can always build cache keys with the properties in the same order.
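As a sketch of the idea (illustrative only, not Flow's actual FeatureIndex code), key construction might look like:

 // Combine an index's key prefix with an equality condition to build a cache
 // key. The indexed columns are walked in their defined order, never in the
 // order the caller happened to write the condition, so the same query
 // always maps to the same key.
 function buildCacheKey( $prefix, array $indexedColumns, array $equalityCondition ) {
     $pieces = array();
     foreach ( $indexedColumns as $column ) {
         $pieces[] = $equalityCondition[$column];
     }
     return $prefix . ':' . implode( ':', $pieces );
 }

 // buildCacheKey( 'flow_definition:pk', array( 'definition_id' ),
 //     array( 'definition_id' => 12345 ) ) returns 'flow_definition:pk:12345'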

Currently we have two main index implementations, both of which build off a common parent (FeatureIndex). The UniqueFeatureIndex is mostly used for the primary index of each data model and holds exactly one data model matching a unique set of features. The TopKIndex stores the top K items matching a defined group of features (properties), sorted by a pre-defined feature of the data model. In addition to these two there are a few indexes that extend TopKIndex and join additional related data to the model, allowing models to be indexed on data that is not explicitly part of the data model.

For every data model there is exactly one primary index. This primary index generally uses the same keys as the primary key in MySQL. The primary index should be (although this is not specifically enforced) the only index that contains the full data models. For consistency the primary index still stores its value as a nested array of arrays, but there is always exactly one data model stored in it.

Most data models then use supplementary secondary (i.e. not primary) indexes. All secondary indexes limit their storage to the piece of the data model necessary to look up the full data model in the primary index, along with whatever values are needed to sort the index. These secondary indexes are always pre-sorted and limited in length.
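To make the distinction concrete, here is a hypothetical payload comparison (all key and field names invented for illustration):

 // 'flow_post:pk:12345' -- primary index: exactly one full data model,
 // still wrapped in a nested array for consistency
 $primary = array(
     array( 'post_id' => 12345, 'topic_id' => 678, 'user_id' => 9, 'content' => '...' ),
 );

 // 'flow_post:by_topic:678' -- secondary index: shortened rows, pre-sorted
 // by timestamp and truncated to K entries; post_id is enough to look up
 // the full model in the primary index
 $secondary = array(
     array( 'post_id' => 12347, 'timestamp' => '20130902120000' ),
     array( 'post_id' => 12346, 'timestamp' => '20130902110000' ),
     array( 'post_id' => 12345, 'timestamp' => '20130902100000' ),
 );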

Will this affect other users of memcache, like the parser cache?
Yes. Depending on whether (and how long) timeouts are set on the keys, and how much data is cached, this could push parser cache entries out of memcache and cause more parser cache requests to fall back to the database layer. We are not sure how to measure this, or determine what level of impact it will have.

How long is it cached?
We are undecided on cache time. Currently we are not setting any kind of timeout, allowing memcache to perform LRU eviction. We are not tied to this setting; valid arguments are being accepted for what it should be. Initial review suggests a few possibilities:
 * Use a different memcache server (possibly on the same machines) for Flow, so that it still uses LRU eviction but has a pre-defined and expected limit to its memory usage, preventing unexpected effects on the parser cache.
 * Set a timeout on the keys so that they get evicted over time.
 * Add other options - this means you

Memcache usage estimates?
We have not yet estimated how much memory Flow will use within memcache.

Reads
All reads in Flow hit the memcache infrastructure first. Queries to storage are almost entirely expressed as equality conditions (along with options that specify details such as the preferred sort order), so we can use those equality conditions as a cache key.
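A minimal sketch of that read path, assuming a pecl/memcached client and the buildCacheKey() helper sketched earlier (queryDatabase() is an invented stand-in for the storage layer, not Flow's actual API):

 // Try the cache key derived from the equality condition first; on a miss,
 // fall back to the database and backfill the cache for the next reader.
 function find( Memcached $cache, $prefix, array $indexedColumns, array $equalityCondition ) {
     $key = buildCacheKey( $prefix, $indexedColumns, $equalityCondition );
     $rows = $cache->get( $key );
     if ( $rows !== false ) {
         return $rows; // cache hit: a nested array of arrays, as above
     }
     $rows = queryDatabase( $equalityCondition ); // hypothetical DB helper
     $cache->set( $key, $rows );
     return $rows;
 }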

Flow limits itself to queries that can be answered by a key/value store. By providing that guarantee we can utilize CAS transactions within memcache to read/verify/write data back to memcache, ensuring sequential access. This scheme does not fully protect against race conditions involving multiple keys. In scenarios where that is a problem we can utilize Redis along with Lua scripts to atomically adjust multiple keys (although that only works against a single Redis instance, so the keys must be sharded to the same instance; the current production configuration involves 16 servers each running memcache and Redis).
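A rough sketch of the read/verify/write cycle, assuming the pecl/memcached GET_EXTENDED flag for fetching CAS tokens (appendToIndex() is an invented helper standing in for whatever mutation the index needs):

 function casUpdate( Memcached $cache, $key, $newRow, $maxAttempts = 5 ) {
     for ( $i = 0; $i < $maxAttempts; $i++ ) {
         $result = $cache->get( $key, null, Memcached::GET_EXTENDED );
         if ( $result === false ) {
             // Key missing: try to create it; add() fails if someone beats us to it
             if ( $cache->add( $key, array( $newRow ) ) ) {
                 return true;
             }
             continue;
         }
         $updated = appendToIndex( $result['value'], $newRow ); // hypothetical
         if ( $cache->cas( $result['cas'], $key, $updated ) ) {
             return true;
         }
         // CAS token mismatch: another writer changed the value, so retry
     }
     return false;
 }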

There must still be one source of truth within the memcache cluster. Basically that means that per database id there should be one matching key in memcache containing its row content. Other query answers should be stored as an id, a list of ids, or perhaps some other structure that still contains only ids. There are likely exceptions to this.

We must strive to fetch multiple keys in single requests. Heavy usage of memcached could mean quite a few round trips; we should minimize that where possible when it doesn't overly complicate the code. In that vein, ops has recently enabled twemproxy on the foundation web servers. All memcached requests go through a twemproxy, which proxies multiple client connections onto one or a few server connections. This setup makes it ideal for pipelining requests and responses, saving on round trip time.
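For example, fetching a batch of primary-index keys in one round trip (again a pecl/memcached sketch; $cache is a connected Memcached instance and the keys are illustrative):

 $keys = array(
     'flow_post:pk:12345',
     'flow_post:pk:12346',
     'flow_post:pk:12347',
 );
 // One getMulti() round trip instead of three get()s; with twemproxy in
 // front, keys hashed to different servers are still pipelined.
 $results = $cache->getMulti( $keys );
 $missing = array_diff( $keys, array_keys( $results ) );
 // fall back to the database for $missing, then backfill the cache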

Writes
For the most part wikis don't delete anything. Even things that are deleted are really only hidden from view, except in specific cases. This greatly simplifies our task of caching: we mostly need to cache data that is guaranteed not to change, so we must tailor our data model ideas to a write-once scenario. For the small bits of data that can be changed we need to use CAS to read the value from memcache, update it as necessary, and write it back.

We must write to memcached before we write to the database. Writes to memcache are fast, and we are already reading everything from memcache, so writing to the database first would leave more time for stale data to be read and for user actions to be performed against that stale data. By writing to memcache not only when a key is requested and not found, but pre-filling it with all appropriate data, we minimize the time stale data exists and provide editors with a very responsive backend.
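An order-of-operations sketch under those assumptions (getIndexesFor(), cacheInsert(), and persistToDatabase() are invented names, not Flow's actual API):

 function insert( Memcached $cache, array $row ) {
     // Pre-fill every cache index first; readers always hit memcache
     // first, so they see the new row immediately...
     foreach ( getIndexesFor( $row ) as $index ) {
         $index->cacheInsert( $cache, $row );
     }
     // ...then persist to MySQL.
     persistToDatabase( $row );
 }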

Off the top of my head, topic splits seem like the most difficult thing to get correct here.

So how does this actually work?
Yeah, I'm not sure either. There are many difficult questions glossed over here, which I'm sure you're thinking of right now. We'll tackle them as we see them.

Redis Sorted Sets
The foundation has recently added Redis to the set of servers powering the WMF. Redis has a structure it calls a sorted set. Each member of a sorted set has a value and a score. Scores are double-precision floating point numbers; possible scores include a timestamp or the row number within a query result. The value can be any string; ids are useful.

The set can be queried in a variety of ways. For example, a query that only displays 20 items at a time could fetch 100 from the database and store them in the sorted set. The next page can then be retrieved by issuing 'ZRANGE myzset 20 39', which has favorable big-O complexity (O(log(N)+M)) and is fairly memory efficient.
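A pagination sketch using the phpredis client ($redis is a connected Redis instance; the key name and row shape are invented):

 // Cache the top 100 rows once, scored by timestamp, valued by id.
 foreach ( $rows as $row ) {
     $redis->zAdd( 'myzset', $row['timestamp'], $row['id'] );
 }
 $page1 = $redis->zRange( 'myzset', 0, 19 );  // first 20 ids
 $page2 = $redis->zRange( 'myzset', 20, 39 ); // i.e. ZRANGE myzset 20 39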

In a sample test I created 100k sorted sets, each with 100 members. The scores were 64-bit integers, and the stored values were also 64-bit integers. This used 240M VIRT and 198M RSS, which gives us a density of perhaps 400k users, each with 100 Flow UUIDs sorted by time, in 1GB of memory.

Redis Lua
Among its many features, Redis embeds a Lua interpreter on the server side. It is possible that we could store text representations of the discussion trees in Redis and perform manipulations like adding a reply in Lua, providing atomicity for updates without retries. This may also be a premature optimization and will not be initially used.
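If we ever did go this route, a speculative sketch via phpredis eval() might look like the following (key name and reply encoding are invented; the point is that the whole read-modify-write runs inside Redis, so no CAS retry loop is needed):

 $script =
     "local tree = redis.call('GET', KEYS[1]) or ''\n" .
     "redis.call('SET', KEYS[1], tree .. ARGV[1])\n" .
     "return 1";
 // eval() runs the script atomically; Redis does not interleave other
 // commands while it executes. ARGV[1] is the serialized reply to append.
 $redis->eval( $script, array( 'topic:678:tree', $serializedReply ), 1 );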