User:GWicke/Notes/Storage/Cassandra testing

Testing Cassandra as a backend for the Rashomon storage service. See also User:GWicke/Notes/Storage.

Hosts:
 * cerium 10.64.16.147
 * praseodymium 10.64.16.149
 * xenon 10.64.0.200

Cassandra docs (we are testing 2.0.1 2.0.2 (latest changes)):
 * Cassandra 2.0
 * CQL 3.1

Cassandra node setup
apt-get install cassandra openjdk-7-jdk libjna-java libjemalloc1

Set up /etc/cassandra/cassandra.yaml according to the docs. Main things to change:
 * listen_address, rpc_address : set to external IP of this node
 * seed_provider / seeds : set to list of other cluster node IPs: "10.64.16.147,10.64.16.149,10.64.0.200"

(Re)start cassandra:. The command

nodetool status

should return information and show your node (and the other nodes) as being up. Example output: root@xenon:~# nodetool status Datacenter: datacenter1

=
========== Status=Up/Down -- Address       Load       Tokens  Owns   Host ID                               Rack UN 10.64.16.149  91.4 KB    256     33.4%  c72025f6-8ad8-4ab6-b989-1ce2f4b8f665  rack1 UN 10.64.0.200   30.94 KB   256     32.8%  48821b0f-f378-41a7-90b1-b5cfb358addb  rack1 UN 10.64.16.147  58.75 KB   256     33.8%  a9b2ac1c-c09b-4f46-95f9-4cb639bb9eca  rack1
 * / State=Normal/Leaving/Joining/Moving

Rashomon setup
The cassandra bindings used need node 0.10. For Ubuntu precise LTS, we need to do some extra work : apt-get install python-software-properties python g++ make add-apt-repository ppa:chris-lea/node.js apt-get update apt-get install build-essential nodejs # this ubuntu package also includes npm and nodejs-dev On Debian unstable, we'd just do  and get the latest node including security fixes rather than the old Ubuntu PPA package.

Now onwards to the actual rashomon setup: npm config set https-proxy http://brewster.wikimedia.org:8080 npm config set proxy http://brewster.wikimedia.org:8080 cd /var/lib https_proxy=brewster.wikimedia.org:8080 git clone https://github.com/gwicke/rashomon.git cd rashomon npm install cp contrib/upstart/rashomon.conf /etc/init/rashomon.conf adduser --system --no-create-home rashomon service rashomon start
 * 1) temporary proxy setup for testing
 * 1) will package node_modules later

Create the revision tables (on one node only): cqlsh < cassandra-revisions.cql

Cassandra issues

 * With the default settings and without working jna (see install instructions above), cassandra on one node ran out of heap space during a large compaction. The resulting state was inconsistent enough that it would not restart cleanly. The quick fix was wiping the data on that replica and re-joining the cluster.
 * Increased heap from quarter of the RAM (4G in this case) to 7G and installed an up-to-date jna
 * This might actually be related to missing jna and (less likely on linux) subprocesses as explained in . Should check using the default heap size with JNA enabled.
 * Stopping and restarting the cassandra service with  did not work. Faidon tracked this down to a missing '$' in the init script: . Fixed in 2.0.2.
 * Compaction was fairly slow for a write benchmark. Changed  to   in cassandra.yaml. Compaction is also niced and single-threaded, so during high load it will use less disk bandwidth than this upper limit. See  for background.
 * Not relevant for our current use case, but good to double-check if we wanted to start using CAS: bugs in 2.0.0 Paxos implementation. The relevant bugs seem to be fixed in 2.0.1.

Dump import, 600 writers
Six writer processes working on one of these dumps with up to 100 concurrent requests each. Rashomon uses write consistency level quorum for these writes, so 2 nodes out of three need to ack. The Cassandra commit log is placed on an SSD, data files on rotating metal RAID1.

6537159 revisions in 42130s (155/s); total size 85081864773 6375223 revisions in 42040s (151/s); total size 84317436542 6679729 revisions in 39042s (171/s); total size 87759806169 5666555 revisions in 32704s (173/s); total size 79429599007 5407901 revisions in 32832s (164/s); total size 72518858048 6375236 revisions in 37758s (168/s); total size 84318152281

=
================================================= 37041803 revisions total, 493425716820 total bytes (459.5G) 879/s, 11.1MB/s du -sS on revisions table, right after test: 162 / 153 / 120 G (avg 31.5% of raw text) du -sS on revisions table, after some compaction activity: 85G (18.4% of raw text) du -sS on revisions table, after full compaction: 73.7G (16% of raw text)


 * clients, rashomon and cassandra on the same machine
 * clients and cassandra CPU-bound, rashomon using little CPU
 * basically no IO wait time despite data on spinning disks. Compaction too throttled for heavy writes, but low wait even with a higher max compaction bandwidth cap. In a pure write workload all reads and writes are sequential. Cassandra also uses posix_fadvise for read-ahead and page cache optimization.

Write test 2: Dump import, 300 writers
Basically the same setup, except:
 * clients on separate machine
 * Cassandra 2.0.2
 * additional revision index maintained, which allows revision retrieval by oldid
 * better error handling in Rashomon
 * client connects to random Rashomon, and Rashomon uses set of Cassandra backends (instead of just localhost)

With default heap size, one node ran out of he heap about 2/3 through the test. Eventually a second node suffered the same, which let all remaining saves fail as there was no more quorum.

A similar failure happened before in preliminary testing. Setting the heap size to 1/2 of the RAM (7G instead of 4G on these machines) fixed this in the follow-up test, the same way it did before the first write test run.

Write test 3: Dump import, 300 writers
Same setup as in Write test 2, except heap limit increased from 4G default to 7G.



5407950 revisions in 46007s (117.5/s); Total size: 72521960786; 36847 retries 5666609 revisions in 52532s (107.8/s); Total size: 79431029856; 41791 retries 6375283 revisions in 67059s (95.0/s); Total size: 84318276123; 38453 retries 6537194 revisions in 64481s (101.3/s); Total size: 85084097888; 41694 retries 6679780 revisions in 60408s (110.5/s); Total size: 87759962590; 43422 retries 6008332 revisions in 50715s (118.4/s); Total size: 64537467290; 39078 retries

=
============================ 648/s, 7.5MB/s

0.65% requests needed to be retried after timeout 441.12G total

After test: 85G on-disk for revisions table (19.3%) 2.2G on-disk for idx_revisions_by_revid table

With improved error reporting and -handling in the client these numbers should be more reliable than the first test. The secondary index adds another query for each eventually consistent batch action, which slows down the number of revision inserts per second slightly. The higher compaction throughput also performs more of the compaction work upfront during the test, and results in significantly smaller disk usage right after the test.



Write test 4-6: Heap size vs. timeouts, GC time and out-of-heap errors
I repeated the write tests a few more times and got one more out-of heap error on a node. Increasing the heap to 8G had the effect of increasing the number of timeouts to about 90k per writer. The Cassandra documentation mentions 8G as the upper limit for reasonable GC pause times, so it seems that the bulk of those timouts are related to GC.

In a follow-up test, I reset the heap to the default value (4.3G on these machines) and lowered  from the 1/3 heap default to 1200M to avoid out-of-heap errors under heavy write load despite a relatively small heap. Cassandra will flush the largest memtable when this much memory is used.


 * : out of heap space
 * : better, but still many timeouts
 * default,   and   : fast, but not tested on full run
 * ,  and   : out of heap space
 * ,  and   : no crash, but slowdown once db gets larger without much IO
 * ,  and   : larger flush queue seems to counteract slowdown, flushing more often should reduce heap pressure even further. Slowed down with heavy GC activity, very likely related to compaction and the pure-JS deflate implementation.
 * reduced  from the default (number of cores) to 4 : ran out of heap
 * increased heap to 10G, reduced  to 2, disabled GC threshold so that collection starts not just when the heap is 75% full : Lowering the compactor thresholds is good, but removing the GC threshold actually made it worse. Full collection happens only close to max, which makes it more likely to run out of memory.

Flush memory tables often and quickly to avoid timeouts and memory pressure

 * use multiple flusher threads: ; See
 * increase the queue length to absorb burst:
 * can play with queue length for load shedding
 * seems to be sensitive to number of tables: more tables -> more likely to hit limit

GC tuning for short pauses and OOM error avoidance despite heavy write load
After a lot of trial and error, the following strategy seems to work well:


 * during a heavy write test that has been running for a few hours, follow the OU column in  output. Check memory usage after a major collection. This is the base heap usage.
 * in heavy write tests with small memtables (600M) between 1 and 2G
 * size MAX_HEAP_SIZE ~5x base heap usage for headroom, but not more to preserve ram for page cache
 * set CMSInitiatingOccupancyFraction to something around 45 (75% default is to close to the max -> OOM with heavy writes). This marks the high point where a full collection is triggered. It should be ~2-3x the size after a major collection. Starting too late makes it hard to really collect all garbage, as CMS is inexact in the name of low pause times and misses some garbage. Starting early keeps the heap to traverse small, which limits pause times.
 * size HEAP_NEWSIZE to ~2/3 the base heap use
 * for more thorough young generation collection, increase MaxTenuringThreshold from 1 to something between 1 and 15, for example 10. This means that something in the young generation (HEAP_NEWSIZE) needs to survive 10 minor collections to make it into the old generation.

All these settings are in /etc/cassandra/cassandra-env.sh.

Write test 12 (or so)
Settings in : MAX_HEAP_SIZE = "10G" # can be closer to 6 HEAP_NEWSIZE = "1000M" CMSInitiatingOccupancyFraction = 30 MaxTenuringThreshold = 10
 * 1) about 3.2G, needs to be adjusted to keep 3.2G point
 * 2) similar when MAX_HEAP_SIZE changes

Results: 5407948 revisions in 29838s (181.24/s); 72521789775 bytes; 263 retries 5666608 revisions in 31051s (182.49/s); 79431082810 bytes; 295 retries 6375272 revisions in 33578s (189.86/s); 84318196084 bytes; 281 retries 6537196 revisions in 34220s (191.03/s); 85084340595 bytes; 295 retries 6679784 revisions in 34597s (193.07/s); 87759936033 bytes; 284 retries 6008332 revisions in 30350s (197.96/s); 64537469847 bytes; 288 retries

=
=============== 1133 revisions/s


 * This also includes a successful concurrent repair at high load after changing settings on all nodes not too long after the test start
 * The retries were all triggered by 'Socket reset' on the test client, there don't seem to be any server timeouts any more. Need to investigate the reason for these.

Random reads
Goal: simulate read-heavy workload for revisions (similar to ExternalStore), and verify writes from the previous test.


 * Access random title, but most likely the newest revision
 * verify md5

The random read workload will be much IO-heavier. There should be noticeable differences between data on SSD vs. rotating disks.

Mix of a few writes and random reads
Perform about 50 revision writes / second, and see how many concurrent reads can still be sustained at acceptable latency. Closest approximation to actual production workload. Mainly looking for impact of writes on read latency.