Flow/Architecture/NoSQL

Abstract
For completeness we should review existing distributed database solutions to see whether any of them can meet our use case. We are not the first website to have scaling problems. Building on (and possibly contributing back to) existing OSS solutions should be preferred over implementing our own distributed database layer for MySQL in PHP.

Use Case
If we could pull a distributed database out of thin air, what features would it ideally have?


 * Persistent. Must not be a pure in-memory database
 * Small fish in a big pond: We want to be a 'mid-sized' installation of this database, meaning other people have already run into most of the same issues we will.
 * Add new nodes and re-distribute data to new or existing nodes in the background without downtime to prevent hot spots.
 * Ability to lose a server to maintenance or failure without hiccups.
 * Supported by the creators for many years into the future
 * Free (as in speech).
 * Supported PHP client libraries. Python and node.js client libraries a plus.
 * Existing high-quality abstractions over the database's feature set, rather than raw RPC.
 * Can support tagging/subscriptions/etc by users without complex client-side logic
 * Doesn't fall over when one user has 50k (or 500k, or whatever) subscriptions, or when one object has 50k users subscribed to it.
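
The "add new nodes and re-distribute data without downtime" requirement above is usually met with consistent hashing: each node owns an arc of the key space, and adding a node only moves the keys that fall into its new arc, leaving everything else in place. A minimal sketch in Python (the node names, hash choice, and single-token-per-node layout are illustrative, not how any particular database implements it):

```python
import bisect
import hashlib

class HashRing:
    """Toy consistent-hash ring: each node owns the arc of key space
    between its predecessor's position and its own position."""

    def __init__(self, nodes):
        self.ring = sorted((self._hash(n), n) for n in nodes)

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def node_for(self, key):
        h = self._hash(key)
        positions = [pos for pos, _ in self.ring]
        # First node clockwise from the key's position (wrapping around).
        i = bisect.bisect(positions, h) % len(self.ring)
        return self.ring[i][1]

    def add_node(self, node):
        bisect.insort(self.ring, (self._hash(node), node))

keys = ["user:%d" % i for i in range(1000)]
ring = HashRing(["node-a", "node-b", "node-c"])
before = {k: ring.node_for(k) for k in keys}
ring.add_node("node-d")
after = {k: ring.node_for(k) for k in keys}
# Only keys landing on node-d's new arc move; all other keys stay put.
moved = sum(1 for k in keys if before[k] != after[k])
print(moved, "of", len(keys), "keys moved")
```

A naive `hash(key) % num_nodes` scheme would instead reshuffle nearly every key when a node is added, which is exactly the downtime/hot-spot problem the requirement is about.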

Limitations
What limitations are we willing to accept from a distributed database?


 * Eventual Consistency?

What limitations are we not willing to accept from a distributed database?

Cassandra

 * Gracefully handles cluster expansion, failover, and balancing between nodes.
 * Architected as BASE rather than ACID
 * Is Eventual Consistency reasonable for our use case?
 * Atomicity is limited to operations on a single row key
 * Like most NoSQL solutions, data modeling is about the queries not the relationships
 * Insert and update are the same operation; every write is an upsert.
 * No compare-and-swap (CAS)
 * Data Model should be designed for idempotent updates
 * Fairly easy to integrate into php development
 * Uses a custom query language called CQL which is superficially similar to SQL
 * PHP interface to Cassandra is CQL through a PDO driver
 * CQL doesn't cover 100% of use cases; there are things you can only do through Thrift.
 * Last driver commit was July 2012 (but not much has changed that would require updates, either)
 * Cannot perform scans across a column family due to sharding
 * Can scan across a single row in a column family though
 * Can use an order-preserving partitioner, but it provides worse balancing characteristics
 * Storage model can be thought of as Hash<RowKey, SortedMap<ColumnKey, Value>>
 * Fast lookups for table[key], table[key][keyA], and fast scans for table[key][keyA] through table[key][keyB]
 * Supports composite keys. Rows with a composite primary key are sharded by the first component of the key, not the full composite.
 * No joins. In its simplest form, you could say an RDBMS pays for joins at query time, while Cassandra pays for joins at design time (and in more writes)
 * Application server clocks must be mostly consistent across servers, as Cassandra performs 'last write wins' conflict resolution, with 'last' determined by the client-supplied timestamp
 * Design the data model such that operations are idempotent. Or, make sure that your use case can live with inaccuracies or that inaccuracies can be corrected eventually.
 * Because of Cassandra’s fully distributed (and multi-master) nature, write failure does not guarantee that data is not written, unlike the behavior of relational databases.
 * No rollback or cleanup is performed on partially written data. Thus, a perceived write failure can result in a successful write eventually. So, retries on write failure can yield unexpected results if your model isn’t update idempotent.
 * Backwards compatibility is not a primary concern of the developers - full releases will have beta features that will change, like CQL3 syntax
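
The storage-model and composite-key bullets above can be sketched with plain dictionaries: a column family behaves like Hash<RowKey, SortedMap<ColumnKey, Value>>, so lookups by row key and ordered range scans within one row are cheap, while a scan across row keys would have to touch every shard. A toy model in Python (the sharding-by-`hash()` and the table/key names are illustrative, not Cassandra's actual implementation):

```python
from collections import defaultdict

class ToyColumnFamily:
    """Toy model of the storage layout described above:
    Hash<row_key, SortedMap<column_key, value>>.
    Rows are spread across shards by hashing the row key,
    so there is no cheap scan across *all* row keys."""

    def __init__(self, num_shards=4):
        self.shards = [defaultdict(dict) for _ in range(num_shards)]

    def _shard(self, row_key):
        # With a composite primary key like (user_id, post_id), only the
        # first component (the row key) picks the shard.
        return self.shards[hash(row_key) % len(self.shards)]

    def upsert(self, row_key, column_key, value):
        # Insert and update are the same operation (upsert).
        self._shard(row_key)[row_key][column_key] = value

    def get(self, row_key, column_key):
        return self._shard(row_key)[row_key].get(column_key)

    def scan_row(self, row_key, start, end):
        """Fast ordered scan within one row: table[key][a]..table[key][b]."""
        row = self._shard(row_key)[row_key]
        return [(c, row[c]) for c in sorted(row) if start <= c <= end]

cf = ToyColumnFamily()
cf.upsert("user:42", "post:001", "hello")
cf.upsert("user:42", "post:002", "world")
cf.upsert("user:42", "post:009", "later")
# All of user:42's columns live on one shard, so the range scan is local.
print(cf.scan_row("user:42", "post:001", "post:005"))
# → [('post:001', 'hello'), ('post:002', 'world')]
```

This is also why data modeling is "about the queries": you lay out row and column keys so each query becomes a single-row lookup or in-row range scan, duplicating data across column families where necessary instead of joining.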
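
The last-write-wins and idempotency bullets can be made concrete with a small simulation: conflict resolution keeps whichever write carries the highest client timestamp, which is why a server with a slow clock can silently lose writes, and why set-style idempotent updates survive retries where counter increments do not. A hedged Python sketch (nothing here is the real replication protocol; the names are made up for illustration):

```python
class LWWCell:
    """One replicated cell with last-write-wins resolution:
    the write carrying the highest client timestamp survives."""

    def __init__(self):
        self.timestamp = 0
        self.value = None

    def write(self, value, client_timestamp):
        if client_timestamp > self.timestamp:
            self.timestamp = client_timestamp
            self.value = value

# A server whose clock runs behind loses, even though it wrote second.
cell = LWWCell()
cell.write("from server A", client_timestamp=1000)
cell.write("from server B (clock 5s slow)", client_timestamp=995)
print(cell.value)  # → from server A  (B's write is silently dropped)

# Idempotent vs. non-idempotent updates under retry: a perceived write
# failure may still have succeeded, so the client retries.
subscribers = set()
def subscribe(user):     # idempotent: a retry does not change the result
    subscribers.add(user)

count = 0
def increment():         # not idempotent: a retry double-counts
    global count
    count += 1

for _ in range(2):       # simulate a retry after a perceived failure
    subscribe("alice")
    increment()
print(len(subscribers), count)  # → 1 2
```

This is the concrete argument for designing the data model around set membership (subscriptions, tags) rather than read-modify-write counters wherever the use case allows it.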