RESTBase/Table storage backend options


This page was created in November 2014 to collect details on possible backends for the RESTBase table storage layer. As a result of the evaluation, Cassandra was deployed to the Wikimedia infrastructure. See wikitech:Cassandra for more information about Wikimedia's current deployment, and see Phab:T76157 for discussion about the evaluation.

Exclusion criteria

These criteria need to be satisfied by candidates:

  • No single point of failure
  • Horizontal read scaling: can add nodes at any time; can increase the replication factor
  • Horizontal write scaling: can add nodes at any time
  • Has been hardened by production use at scale

Candidates

The five candidates compared below are Cassandra, CouchBase, Riak, HBase, and HyperDex.

Opinion
Pros
  • Cassandra (Gabriel): Mature, mainstream option with good support for large data sets; used in the first backend implementation
  • CouchBase (Gabriel): Strong read performance on in-memory data sets
  • Riak (Gabriel): Causal contexts more thorough than pure timestamps
  • HyperDex (Gabriel): Interesting indexing & consistency concepts

Cons
  • Cassandra (Gabriel): Timestamp-based conflict resolution not as thorough as Spanner or Riak
  • CouchBase (Gabriel): Makes compromises on durability, read scaling through replication, and support for large data sets
  • Riak (Gabriel): Cross-DC replication only in the enterprise version; no rack awareness
  • HBase (Gabriel): Complex operations, relatively weak performance
  • HyperDex (Gabriel): Has not really seen production use; fairly complex; no read scaling through replication

Champions for practical evaluation (please sign)
Legal
License
  • Cassandra: Apache, no CLA
  • CouchBase: Apache, requires CLA; binary distributions under a more restrictive license
  • Riak: Apache
  • HBase: Apache
  • HyperDex: BSD

Community
  • Cassandra: Datastax largest contributor, but working open source community
  • HyperDex: 19 contributors; academic / open source

Technical
Architecture
  • Cassandra: Symmetric DHT
  • CouchBase: Sharding with a master per shard; automatic shard migration with a distributed cluster manager state machine
  • Riak: Symmetric DHT
  • HBase: HDFS-based storage nodes; coordinator nodes with master/slave fail-over
  • HyperDex: Innovative hyperspace hashing scheme allows efficient queries on all attributes; a replicated state machine (on special nodes) handles load balancing / assignment of data to nodes

CAP type
  • Cassandra: AP by default, CP per request
  • CouchBase: CP
  • Riak: AP by default
  • HBase: CP
  • HyperDex: CP

Storage backends / persistence
  • Cassandra: Memtable backed by immutable log-structured merge trees (SSTables); persistent journal for atomic batch operations; writes go to all replicas in parallel
  • CouchBase: Memory-based with writes to a single master per bucket; asynchronous persistence; chance of data loss if the master goes down before the write reaches disk
  • Riak: Bitcask, LevelDB (recommended for large data sets), Memory; writes go to all replicas in parallel
  • HBase: HDFS
  • HyperDex: 'HyperDisk' log-structured storage with sequential scan; unclear whether there is compaction

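As background for the Cassandra and HyperDex entries above, the log-structured write path can be illustrated with a toy, self-contained Python sketch (not any engine's actual code): writes land in an in-memory memtable that is periodically flushed to immutable sorted runs, and reads check the memtable first and then the runs from newest to oldest.

    import bisect

    class TinyLSM:
        """Toy log-structured store: memtable flushed to immutable sorted runs."""

        def __init__(self, memtable_limit=4):
            self.memtable = {}
            self.sstables = []              # newest first; each is a sorted list of (key, value)
            self.memtable_limit = memtable_limit

        def put(self, key, value):
            self.memtable[key] = value      # in-memory write; flushed sequentially later
            if len(self.memtable) >= self.memtable_limit:
                self.flush()

        def flush(self):
            # An immutable "SSTable": sorted once, never modified in place.
            self.sstables.insert(0, sorted(self.memtable.items()))
            self.memtable = {}

        def get(self, key):
            if key in self.memtable:
                return self.memtable[key]
            for run in self.sstables:       # newest run wins, like per-cell timestamps
                i = bisect.bisect_left(run, (key,))
                if i < len(run) and run[i][0] == key:
                    return run[i][1]
            return None

    store = TinyLSM()
    for i in range(10):
        store.put('k%d' % i, i)
    print(store.get('k3'))                  # -> 3, served from an older flushed run
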
Implementation
  • Cassandra: Java
  • CouchBase: Erlang, C++, C
  • Riak: Erlang
  • HBase: Java
  • HyperDex: C++

Primary keys
  • Cassandra: Composite hash key to distribute load around the cluster, plus hierarchical range keys
  • CouchBase: String hash key
  • Riak: String hash key
  • HBase: Hash & range keys
  • HyperDex: Range keys

Schemas
  • Cassandra: Enforced schemas, fairly rich data types; column storage makes adding / removing columns cheap
  • CouchBase: Schemaless JSON objects
  • Riak: Schemaless JSON objects
  • HBase: Enforced
  • HyperDex: Enforced

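To make the Cassandra "primary keys" and "schemas" rows concrete, here is a minimal sketch assuming the DataStax Python driver (cassandra-driver) and a single-node test cluster on localhost; the demo keyspace and revisions table are made-up names for illustration, not RESTBase's actual schema.

    from cassandra.cluster import Cluster

    session = Cluster(['127.0.0.1']).connect()   # assumed local test node
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS demo
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
    """)
    session.execute("""
        CREATE TABLE IF NOT EXISTS demo.revisions (
            title text,   -- partition (hash) key: distributes load around the cluster
            rev   int,    -- clustering (range) key: rows kept sorted within a partition
            body  text,   -- typed column; the schema is enforced on write
            PRIMARY KEY ((title), rev)
        )
    """)

Adding or dropping a column later is a cheap metadata change (ALTER TABLE), which is what the "adding / removing columns cheap" remark above refers to.
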
Indexing support
  • Cassandra: Per-node secondary hash indexing (limited use cases); Apache Solr integration in the enterprise version
  • CouchBase: Per-node secondary hash and range indexing (limited use cases); Elasticsearch integration
  • Riak: Per-node secondary indexes (limited use cases); Apache Solr integration
  • HyperDex: Rich range query support on arbitrary combinations of attributes

Compression
  • Cassandra: deflate, lz4, snappy; configurable block size, so repetition between revisions can be exploited with the right data layout
  • CouchBase: snappy, per document
  • Riak: snappy with the LevelDB backend
  • HBase: deflate, lzo, lz4, snappy
  • HyperDex: snappy

TTL / expiry
  • Cassandra: Yes, per attribute
  • CouchBase: Yes
  • Riak: Likely
  • HyperDex: No

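For the Cassandra entry, expiry is requested per write. A short sketch, again assuming the DataStax Python driver and the hypothetical demo.revisions table from the schema example above:

    from cassandra.cluster import Cluster

    session = Cluster(['127.0.0.1']).connect()
    # The TTL (in seconds) applies only to the cells written by this statement.
    session.execute(
        "INSERT INTO demo.revisions (title, rev, body) "
        "VALUES (%s, %s, %s) USING TTL 86400",
        ('Main_Page', 2, 'a render that should expire after one day'))
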
Causality / conflict resolution
  • Cassandra: Nanosecond wall clock timestamps, last write wins; relies on NTP for time sync
  • Riak: Causal contexts; sibling version resolution
  • HyperDex: Value-dependent chaining

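The contrast between the Cassandra and Riak entries can be shown with a small self-contained Python sketch: last-write-wins keeps only the value with the highest timestamp, while a (heavily simplified) causal-context merge detects concurrent updates and surfaces both as siblings instead of silently dropping one.

    import time

    # Last-write-wins (timestamp-based): the highest timestamp silently wins,
    # so clock skew or ties can drop a concurrent write.
    def lww_merge(a, b):
        """a and b are (timestamp, value) pairs."""
        return a if a[0] >= b[0] else b

    # Causal contexts, simplified here to per-actor version vectors.
    def dominates(vc_a, vc_b):
        return all(vc_a.get(k, 0) >= v for k, v in vc_b.items())

    def causal_merge(a, b):
        """a and b are (version_vector, value) pairs."""
        if dominates(a[0], b[0]):
            return [a]
        if dominates(b[0], a[0]):
            return [b]
        return [a, b]                       # concurrent: keep both siblings

    print(lww_merge((time.time(), 'x'), (time.time(), 'y')))
    print(causal_merge(({'n1': 2}, 'x'), ({'n2': 1}, 'y')))   # -> two siblings
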
Single-item CAS
  • Cassandra: Supported on any table using built-in Paxos
  • CouchBase: Supported
  • Riak: Supported when configured for the bucket
  • HyperDex: Guarantees linearizability for operations on keys

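Cassandra expresses single-item CAS as a conditional statement ("lightweight transaction") that runs a Paxos round among the replicas. A minimal sketch, again assuming the hypothetical demo.revisions table:

    from cassandra.cluster import Cluster

    session = Cluster(['127.0.0.1']).connect()
    result = session.execute(
        "UPDATE demo.revisions SET body = %s "
        "WHERE title = %s AND rev = %s IF body = %s",
        ('new text', 'Main_Page', 1, 'old text'))
    # The first column of the returned row is the [applied] flag: True only if
    # the precondition still held when the Paxos round committed.
    print(result.one()[0])
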
Multi-item transactions
  • Cassandra: Only manually via CAS
  • CouchBase: Only manually via CAS
  • Riak: Only manually via CAS
  • HBase: Only manually via CAS
  • HyperDex: In the commercial version

Multi-item journaled batches
  • Cassandra: Yes
  • CouchBase: No?

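Cassandra's "Yes" refers to logged batches: the batch is journaled on the coordinator before execution, so all of its statements eventually apply even if the coordinator fails midway (atomic in the "eventually all or nothing" sense, not isolated). A sketch with the same hypothetical table:

    from cassandra.cluster import Cluster

    session = Cluster(['127.0.0.1']).connect()
    session.execute("""
        BEGIN BATCH
            INSERT INTO demo.revisions (title, rev, body) VALUES ('A', 1, 'x');
            INSERT INTO demo.revisions (title, rev, body) VALUES ('A', 2, 'y');
        APPLY BATCH
    """)
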
Per-request consistency trade-offs
  • Cassandra: one, localQuorum, all, CAS
  • CouchBase: No
  • HBase: No

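The Cassandra consistency levels listed above are chosen per statement by the client. A sketch using the DataStax Python driver; names are the same hypothetical ones as before:

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    session = Cluster(['127.0.0.1']).connect()
    # LOCAL_QUORUM trades some availability for stronger reads; ONE or ALL
    # (and SERIAL for CAS reads) can be chosen per request instead.
    read = SimpleStatement(
        "SELECT body FROM demo.revisions WHERE title = %s AND rev = %s",
        consistency_level=ConsistencyLevel.LOCAL_QUORUM)
    row = session.execute(read, ('Main_Page', 1)).one()
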
Balancing / bootstrap load distribution
  • Cassandra: DHT + virtual nodes (default 256 per physical node, configurable per node to account for capacity differences); a new node streams data evenly from most other nodes and takes up key space proportional to its number of virtual nodes relative to the entire cluster
  • CouchBase: Virtual buckets (typically 1024 per cluster); semi-manual rebalancing by moving entire vBuckets, documented as a fairly expensive operation
  • Riak: DHT + virtual nodes
  • HBase: Coordinator nodes manage balancing
  • HyperDex: Coordinator nodes manage balancing

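The behaviour described for Cassandra and Riak, where a joining node takes a roughly even slice of the key space from most existing nodes, falls out of consistent hashing with virtual nodes. A self-contained toy Python ring; MD5 stands in for the real partitioner and the node names are made up:

    import bisect
    import hashlib

    def token(s):
        """Hash a string onto the ring."""
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    class Ring:
        def __init__(self, vnodes_per_node=256):
            self.vnodes_per_node = vnodes_per_node
            self.tokens = []                # sorted virtual-node tokens
            self.owner = {}                 # token -> physical node

        def add_node(self, name):
            for i in range(self.vnodes_per_node):
                t = token('%s-%d' % (name, i))
                bisect.insort(self.tokens, t)
                self.owner[t] = name

        def node_for(self, key):
            # A key belongs to the next virtual node clockwise on the ring.
            i = bisect.bisect(self.tokens, token(key)) % len(self.tokens)
            return self.owner[self.tokens[i]]

    ring = Ring()
    for n in ('node-a', 'node-b', 'node-c'):
        ring.add_node(n)
    keys = ['key-%d' % i for i in range(10000)]
    before = {k: ring.node_for(k) for k in keys}
    ring.add_node('node-d')
    moved = sum(1 for k in keys if ring.node_for(k) != before[k])
    print('keys moved to the new node: %.1f%%' % (100.0 * moved / len(keys)))

With three nodes and a fourth joining, roughly a quarter of the keys move, and they come in small slices from all existing nodes rather than from a single neighbour.
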
Rack-aware replica placement
  • Cassandra: Yes
  • CouchBase: In the enterprise edition or a custom build
  • Riak: No
  • HBase: Yes
  • HyperDex: No

Multi-cluster / DC replication
  • Cassandra: Mature and widely used in large deployments; TLS supported; choice of synchronous vs. asynchronous per query
  • CouchBase: In the enterprise edition or a custom build
  • Riak: Only in the enterprise version; possibly a custom build from source?
  • HyperDex: No

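For Cassandra, both rack-aware placement and multi-datacenter replication (the two rows above) are configured per keyspace with NetworkTopologyStrategy, and raising a per-DC replication factor is also how reads are spread over more replicas (next row). A hedged sketch; the datacenter names are placeholders and must match what the cluster's snitch reports:

    from cassandra.cluster import Cluster

    session = Cluster(['127.0.0.1']).connect()
    # Three replicas in each named datacenter; within a DC the strategy
    # places replicas on distinct racks where the topology allows it.
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS demo_multidc
        WITH replication = {'class': 'NetworkTopologyStrategy',
                            'dc_primary': 3, 'dc_secondary': 3}
    """)
    # Raising a factor later distributes reads across more replicas
    # (a repair is then needed to stream data to the new replicas).
    session.execute("""
        ALTER KEYSPACE demo_multidc
        WITH replication = {'class': 'NetworkTopologyStrategy',
                            'dc_primary': 5, 'dc_secondary': 3}
    """)
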
Read hot spot scaling
  • Cassandra: Increase the replication factor; reads are distributed across replicas
  • CouchBase: A single node is responsible for all reads for a vBucket
  • Riak: Increase the replication factor; reads are distributed across replicas
  • HBase: A single region server is responsible for all reads to a region

Performance characterization
  • Cassandra: Very high write throughput with low latency impact on reads; good average read performance; somewhat limited per-instance scaling due to JVM GC limits
  • CouchBase: Performs well on small in-memory data sets; seems to perform worse than competitors when configured for more frequent writes for durability; support for large data sets (>> RAM) is questionable
  • HBase: Most benchmarks show worse performance than competitors like Cassandra
  • HyperDex: Claims high performance [1]

Operations
Large-scale production use
  • Cassandra: Yes, since 2010
  • CouchBase: Yes
  • Riak: Yes
  • HBase: Yes, since at least 2010
  • HyperDex: No

Support / community
  • Cassandra: Fairly active community; Datastax offering help
  • CouchBase: Fairly small & young community; commercial fork

Setup complexity
  • Cassandra: Low (symmetric DHT, YAML configuration)
  • CouchBase: Low to medium: package-based install (community edition), REST configuration
  • Riak: Low
  • HBase: High (HDFS, designated coordinator nodes)
  • HyperDex: High: designated coordinator nodes, still much development churn

Packaging
  • Cassandra: Good Apache Debian packages; apt-get install gets you a fully working one-node cluster
  • CouchBase: Ubuntu packages under a restrictive license / with reduced functionality
  • Riak: Official packages with a restrictive license
  • HBase: Comes with Debian unstable
  • HyperDex: Debian packages

Monitoring
  • Cassandra: Fairly rich data via JMX; pluggable metrics reporter support (incl. Graphite & Ganglia)
  • CouchBase: Graphite collector
  • Riak: Graphite collector
  • HBase: JMX, Graphite collector

Backup & restore
  • Cassandra: Immutable SSTables can be snapshotted at a point in time (nodetool snapshot) and backed up as files (documentation)
  • CouchBase: mbbackup tool, FS-level backups
  • Riak: FS-level backups
  • HBase: Recommended to use a second cluster
  • HyperDex: FS-level backups

Automatic rebalancing after node failure
  • Cassandra: Yes; fine-grained and implicit in the DHT
  • CouchBase: Manually triggered; moves entire vBuckets, documented as expensive
  • Riak: Yes; fine-grained and implicit in the DHT
  • HBase: In the balancer background process; moves entire regions