Flow/Architecture/NoSQL

Abstract
For completeness we should review existing distributed database solutions to see if anything currently existing can meet our use case. We are not the first website to have scaling problems. Building upon and possibly contributing back to existing OSS solutions should be preferred over implementing our own distributed database layer for mysql in php.

Use Case
If we could pull a distributed database out of thin air, what features would it ideally have?


 * Persistent. Must not be a pure in-memory database
 * Small fish in a big pond: We want to be a 'mid-sized' installation of this database, meaning other people have already run into most of the same issues we will.
 * Add new nodes and re-distribute data to new or existing nodes in the background without downtime to prevent hot spots.
 * Ability to lose a servers to maintenance or failure without hiccups.
 * Supported by the creators for many years into the future
 * Free (as in speech).
 * Supported PHP client libraries. Python and node.js client libraries a plus.
 * Existing high quality abstractions over the databases feature set, rather than RPC.
 * Can support tagging/subscriptions/etc by users without complex client-side logic
 * Doesn't fall over when 1 user has 50k(or 500k, or whatever) subscriptions or when 1 object has 50k users subscribed to it.

Limitations
What limitations are we willing to accept from a distributed database?


 * Eventual Consistency?

What limitations are we not willing to accept from a distributed database?

Cassandra

 * Gracefully handles cluster expansion, failover, and balancing between nodes.
 * Architected as BASE rather than ACID
 * Cassandra consistency calculator: http://www.ecyrd.com/cassandracalculator/
 * Is Eventual Consistency reasonable for our use case?
 * Atomicity is limited to operations on a single row key
 * Like most NoSQL solutions, data modeling is about the queries not the relationships
 * You could call it query modeling, instead of data modeling
 * Insert and Update are the same. Its always Upsert. Also does not act like sql.
 * UPDATE foo SET bar='baz' WHERE zomg='hi' will create a row if it doesnt exist
 * UPDATE foo SET zomg='zomg' WHERE zomg='hi'
 * Read-before-Write operations will trigger inconsistency issues unless your consistency for read/write is R+W>N
 * In the general case, updating and then reading your own data is a cassandra anti-pattern.
 * No CAS in released server, code exists in cassandra 2.0 dev branch
 * Data Model should be designed for idempotent update
 * Also avoid data models that utilize delete. You can delete things, but it shouldn't be the go-to solution.
 * Existing PHP ecosphere for cassandra is fairly plain
 * PHP interface to Cassandra is CQL through a PDO driver from datastax, the company behind cassandra.
 * Last driver commit was July 2012, it doesn't support CQL3
 * Fork of the driver at https://github.com/Orange-OpenSource/YACassandraPDO supports CQL3 and has active commit historyll
 * Could potentially leave us in a situation where we have to maintain the PDO driver
 * UUID data types are not currently round-tripable. UUID is returned from cassandra as binary, needs to be unpacked and formatted into uuid.
 * Cannot perform scans across a column family due to sharding
 * Can scan across a single row in a column family though
 * Can use an order preserving partioner, but its provides worse balancing characteristics
 * Storage model can be thought of as Hash>.  PartitionKey and RowKey can be composite.
 * Fast lookups for table[key][keyA], and fast scans for table[key][keyA] through table[key][keyB]
 * No joins. In its simplest form, you could say RDBMS pays for joins at query time, Cassandra pays for joins at design time(and in more writes, which it supports well)
 * Application server time must be mostly consistent between servers as cassandra performs 'last write wins' conflict resolution, with last write determined by client timestamp
 * Design the data model such that operations are idempotent. Or, make sure that your use case can live with inaccuracies or that inaccuracies can be corrected eventually.
 * Because of Cassandra’s fully distributed (and no-master) nature, write failure does not guarantee that data is not written, unlike the behavior of relational databases.  This means a failed write is typically an eventual success.
 * No rollback or cleanup is performed on partially written data. Thus, a perceived write failure can result in a successful write eventually. So, retries on write failure can yield unexpected results if your model isn’t update idempotent.
 * Backwards compatability is not a primary concern of the developers - Full releases will have beta features that will change, like CQL3 syntax
 * Using cassandra as a queue is an anti-pattern
 * Avoid repeatedly writing data to the same row over very long spans of time
 * This is directly opposite of our needs - we need one row to represent a discussion and keep adding to it forever.

So, should we use it?
I have some concerns about the ecosphere of software existing arround cassandra for use with php. CQL is different enough from SQL that re-using existing SQL abstractions built ontop of PDO would not be exceptionally useful. We would be writing CQL and executing directly against the PDO instance. The primary use case for abstractions would be to conveniently generate IN conditions, but we can just write a quick helper for that.

Even though CQL looks like SQL, it has very different characteristics. All developers need to understand that difference between a CQL UPDATE and an SQL UPDATE, they are very different. All developers will also need to internalize the 'idempotent update' methodology and that you will not reliably read your own writes. The whole notion of writing data to cassandra and immediately reading it back is considered an anti-pattern.

The cassandra storage model is fairly easy to understand. Performing the mental map from CQL to cassandra storage is reasonably simple. That doesn't mean using it properly is easy though. One particular issue is the difference between a table where partition key === primary key (narrow rows) and a table where the primary key contains more than the partition key(wide row). Sorting can only be done against wide rows(the row is all cql3 rows within the same partition key). Concerns revolving around a single row being split between SSTables (cassandra on-disk storage) is greatly expanded when considering how often a particular "wide-row" will receive updates. If it receives constant updates(new inserts to existing partition key) then the SSTable compaction will never fully catch up and the row will always be spread out. There is a special compaction method that can be used, but it is much more IO heavy than the standard compaction(but less so than constantly reading rows from 10 different SSTables).

From an operations standpoint cassandra looks like a solid win versus custom mysql sharding, although like any sufficiently complex program it likely has many corner cases we just don't know about.

CQL has the same potential security concerns as SQL in respect to user parameters changing the meaning of a query. The PDO drivers for php perform incredibly basic escaping. Basically it loops through the string looking for single quotes and doubles them up. Probably not sufficient(and labeled as such in the driver code).

In the end, I think it is possible cassandras storage model could work for us. I'm not sure that cassandra has enough support within the PHP world to be used by mediawiki, unless we want to take on issues like supporting the PDO driver ourselves. The PDO driver is not a large piece of code, but essential enough that I worry it is an 'omen of things to come' in regards to utilizing cassandra from php.