User:GWicke/Notes/Storage/Cassandra testing

Testing Cassandra as a backend for the Rashomon storage service. See also User:GWicke/Notes/Storage.

Hosts:
 * cerium 10.64.16.147
 * praseodymium 10.64.16.149
 * xenon 10.64.0.200

Cassandra docs (we are testing 2.0.1):
 * Cassandra 2.0
 * CQL 3.1

Cassandra node setup
apt-get install cassandra openjdk-7-jdk libjna-java libjemalloc1

On older Ubuntu versions, and until the corresponding bug is fixed, upgrade jna manually:
cd /tmp
https_proxy=brewster.wikimedia.org:8080 wget https://raw.github.com/twall/jna/master/dist/jna.jar
cp jna.jar /usr/share/java/jna.jar
ln -s /usr/share/java/jna.jar /usr/share/cassandra/lib/

'jna.jar' should be listed in the jvm parameters when you start cassandra.

On Debian/Ubuntu, open /etc/cassandra/cassandra-env.sh and uncomment/edit this line (localhost is key here):

JVM_OPTS="$JVM_OPTS -Djava.rmi.server.hostname=localhost"

Set up /etc/cassandra/cassandra.yaml according to the docs. Main things to change:
 * listen_address : set to external IP of this node
 * seed_provider / seeds : set to list of other cluster node IPs: "10.64.16.147,10.64.16.149,10.64.0.200"
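
As a rough sketch of the two settings above, the following helper prints a cassandra.yaml fragment for one node so it can be compared against the installed file. The function name yaml_fragment is made up for illustration, and the seed provider class is assumed to be the stock SimpleSeedProvider from the default configuration. It only prints text and does not modify /etc/cassandra/cassandra.yaml.

```shell
#!/bin/sh
# Print the cassandra.yaml settings discussed above for one node.
# Usage: yaml_fragment LISTEN_IP SEED_IP [SEED_IP...]
# (yaml_fragment is a hypothetical helper, not part of Cassandra.)
yaml_fragment() {
    listen_ip=$1
    shift
    # Join the remaining arguments with commas inside a subshell,
    # so the caller's IFS is left untouched.
    seeds=$(IFS=,; echo "$*")
    cat <<EOF
listen_address: $listen_ip
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "$seeds"
EOF
}

# Example for xenon, using the three test hosts as seeds.
yaml_fragment 10.64.0.200 10.64.16.147 10.64.16.149 10.64.0.200
```

Each node would use its own address as the first argument and the same seed list.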

(Re)start cassandra. Right after install it does not seem to be running by default, so simply starting the service should be enough. If it is already running, the restart might involve using kill, as the init scripts in 2.0.1 have a bug (see Cassandra issues below). Afterwards, the command

nodetool status

should return information and show your node (and the other nodes) as being up. Example output:

root@xenon:~# nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens  Owns   Host ID                               Rack
UN  10.64.16.149  91.4 KB    256     33.4%  c72025f6-8ad8-4ab6-b989-1ce2f4b8f665  rack1
UN  10.64.0.200   30.94 KB   256     32.8%  48821b0f-f378-41a7-90b1-b5cfb358addb  rack1
UN  10.64.16.147  58.75 KB   256     33.8%  a9b2ac1c-c09b-4f46-95f9-4cb639bb9eca  rack1
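
The check against this output can be scripted. The sketch below counts the rows whose first column is UN (Up/Normal) and compares the count with the expected cluster size. The function names are hypothetical, and the parsing assumes the column layout shown in the example output.

```shell
# Count nodes reported as Up/Normal ("UN") in `nodetool status`
# output read from stdin.
count_up_normal() {
    awk '$1 == "UN" { n++ } END { print n + 0 }'
}

# Succeed only if at least EXPECTED nodes are Up/Normal.
cluster_healthy() {
    expected=$1
    [ "$(count_up_normal)" -ge "$expected" ]
}
```

On a live node this would be used as `nodetool status | cluster_healthy 3`, with 3 being the number of test hosts.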

Rashomon setup
The cassandra bindings used need node 0.10. For Ubuntu precise LTS, we need to do some extra work:
apt-get install python-software-properties python g++ make
add-apt-repository ppa:chris-lea/node.js
apt-get update
apt-get install build-essential nodejs # this ubuntu package also includes npm and nodejs-dev

On Debian unstable, we'd just install node from the distribution packages and get the latest node including security fixes rather than the old Ubuntu PPA package.

Now onwards to the actual rashomon setup:
# temporary proxy setup for testing
npm config set https-proxy http://brewster.wikimedia.org:8080
npm config set proxy http://brewster.wikimedia.org:8080
cd /var/lib
https_proxy=brewster.wikimedia.org:8080 git clone https://github.com/gwicke/rashomon.git
cd rashomon
# will package node_modules later
npm install
cp contrib/upstart/rashomon.conf /etc/init/rashomon.conf
adduser --system --no-create-home rashomon
service rashomon start

Create the revision tables (on one node only):
cqlsh < cassandra-revisions.cql

Note re nodejs version: The PPA listed above is not quite up to date with security fixes etc. Maybe we should try to build the Debian unstable source package on Ubuntu Precise and use that if successful.

Cassandra issues

 * With the default settings and without working jna (see install instructions above), cassandra on one node ran out of heap space during a large compaction. The resulting state was inconsistent enough that it would not restart cleanly. The quick fix was wiping the data on that replica and re-joining the cluster.
 * Increased heap from quarter of the RAM (4G in this case) to 7G and installed an up-to-date jna
 * This might actually be related to missing jna and (less likely on Linux) subprocesses. Should check using the default heap size with JNA enabled.
 * Stopping and restarting the cassandra service through the init script did not work. Faidon tracked this down to a missing '$' in the init script.
 * Compaction was fairly slow for a write benchmark. Raised the compaction throughput limit in cassandra.yaml. Compaction is also niced and single-threaded, so during high load it will use less disk bandwidth than this upper limit.
 * Not relevant for our current use case, but good to double-check if we wanted to start using CAS: bugs in 2.0.0 Paxos implementation. The relevant bugs seem to be fixed in 2.0.1 which we are using.

Dump import, 600 writers
Six writer processes working on one of these dumps with up to 100 concurrent requests each. Rashomon uses write consistency level quorum for these writes, so 2 nodes out of three need to ack. The Cassandra commit log is placed on an SSD, data files on rotating metal RAID1.
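
The "2 out of three" figure follows from the usual quorum rule, floor(RF / 2) + 1, assuming a replication factor of 3 for the revisions keyspace (the replication factor is not stated above, so this is an assumption). A minimal sketch of the arithmetic, with a hypothetical helper name:

```shell
# Quorum size for a given replication factor: floor(RF / 2) + 1.
# Shell integer division already rounds down.
quorum() {
    echo $(( $1 / 2 + 1 ))
}

quorum 3   # prints 2: two of three replicas must acknowledge a write
```

With this rule one replica can be down or slow without failing quorum writes.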

6537159 revisions in 42130s (155/s); total size 85081864773
6375223 revisions in 42040s (151/s); total size 84317436542
6679729 revisions in 39042s (171/s); total size 87759806169
5666555 revisions in 32704s (173/s); total size 79429599007
5407901 revisions in 32832s (164/s); total size 72518858048
6375236 revisions in 37758s (168/s); total size 84318152281
==================================================
37041803 revisions total, 493425716820 total bytes (459.5G)
879/s, 11.1MB/s
du -sS on revisions table, right after test: 162 / 153 / 120 G (avg 31.5% of raw text)
du -sS on revisions table, after some compaction activity: 85G (18.4% of raw text)
du -sS on revisions table, after full compaction: 73.7G (16% of raw text)
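
The totals can be re-derived from the per-writer lines. The sketch below sums revisions and bytes and divides by the longest writer runtime (42130s), which is assumed to be how the aggregate rate was computed. The summarize function is illustrative only. The byte rate is printed in MiB/s with two decimals.

```shell
# Sum per-writer result lines of the form
#   "<revisions> revisions in <seconds>s (<rate>/s); total size <bytes>"
# read from stdin, and derive the aggregate rates from the
# longest-running writer. Other lines are ignored.
summarize() {
    awk '
        $2 == "revisions" && $3 == "in" {
            revs += $1
            bytes += $NF
            secs = $4 + 0            # "42130s" converts to 42130
            if (secs > wall) wall = secs
        }
        END {
            if (wall == 0) exit 1    # no matching input lines
            gib = bytes / (1024 * 1024 * 1024)
            printf "%.0f revisions, %.0f bytes (%.1fG)\n", revs, bytes, gib
            printf "%.0f/s, %.2fMiB/s\n", revs / wall, bytes / wall / (1024 * 1024)
        }'
}
```

Feeding the six writer lines to `summarize` reproduces the 37041803 revisions and 493425716820 bytes totals.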


 * clients, rashomon and cassandra on the same machine
 * clients and cassandra CPU-bound, rashomon using little CPU
 * basically no IO wait time despite data on spinning disks. Compaction was throttled too much to keep up with heavy writes, but IO wait stayed low even with a higher max compaction bandwidth cap. In a pure write workload all reads and writes are sequential. Cassandra also uses posix_fadvise for read-ahead and page cache optimization.

Random reads
Goal: simulate read-heavy workload for revisions (similar to ExternalStore), and verify writes from the previous test.


 * Access random title, but most likely the newest revision
 * verify md5
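
The md5 verification step could look roughly like the sketch below. It assumes the expected digest is known from the dump and that the fetched revision text is piped in on stdin. How the revision is fetched from Rashomon is deliberately left out, and verify_md5 is a hypothetical helper name.

```shell
# Succeed if the MD5 of the data on stdin equals the expected hex digest.
# Relies on md5sum from GNU coreutils.
verify_md5() {
    expected=$1
    actual=$(md5sum | cut -d ' ' -f 1)
    [ "$actual" = "$expected" ]
}
```

A read client would count a non-zero exit status as a verification failure for that revision.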

The random read workload will be much IO-heavier. There should be noticeable differences between data on SSD vs. rotating disks.

Mix of a few writes and random reads
Perform about 50 revision writes / second, and see how many concurrent reads can still be sustained at acceptable latency. Closest approximation to actual production workload. Mainly looking for impact of writes on read latency.