Parsoid/Round-trip testing/Cassandra

The Parsoid code includes a round-trip testing system that is used to test code changes on a collection of 160k Wikipedia articles from 16 languages. The system consists of a server that hands out tasks and presents results, and clients that do the testing and report back to the server. Each client converts a wiki article to HTML, converts that HTML back to wikitext, and finally classifies any differences as either semantic or purely syntactic. This is essentially a map/reduce-style workflow: distributing the work across about 40 cores in different VMs lets us finish a round-trip run over all 160k pages in around 4 hours.
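The per-page test can be sketched roughly as follows. This is a minimal illustration, not the actual client code: the two toy converters are hypothetical stand-ins for the real parser and serializer, and comparing the rendered HTML is only one plausible way to separate syntactic from semantic differences.

```javascript
// Toy converters, hypothetical stand-ins for the real parser and serializer.
// They only know '''bold''' and collapse runs of whitespace.
function toyToHtml(wikitext) {
    return wikitext
        .replace(/\s+/g, ' ')
        .replace(/'''(.+?)'''/g, '<b>$1</b>');
}

function toyToWikitext(html) {
    return html.replace(/<b>(.+?)<\/b>/g, "'''$1'''");
}

// Run one round trip (wikitext -> HTML -> wikitext) and classify the outcome.
function roundTrip(wikitext, toHtml, toWikitext) {
    var html = toHtml(wikitext);
    var result = toWikitext(html);
    if (result === wikitext) {
        return { result: result, verdict: 'identical' };
    }
    // The wikitext changed. If both versions still render to the same HTML,
    // treat the change as purely syntactic; otherwise treat it as semantic.
    var verdict = toHtml(result) === html ? 'syntactic' : 'semantic';
    return { result: result, verdict: verdict };
}

// Extra whitespace is lost on the way back, but the rendering is unchanged.
console.log(roundTrip("'''Hi'''   there", toyToHtml, toyToWikitext).verdict);
```

A serializer that drops the bold markup entirely would change the rendered HTML, so the same input would be classified as a semantic difference.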

Problem statement / introduction
The round-trip server has become a bottleneck in the round-trip test system: it prevents us from scaling up the system with more clients to process more pages. We are currently using a MySQL backend (after migrating from SQLite earlier). A few months' worth of round-trip results take up 31 GB on disk, and queries on this data slow down considerably as the database grows.

We recently saw very good results when testing Cassandra as a backend for a revision store for page content. Apart from the obvious benefits of replication, automatic fail-over and write scalability, we were impressed by the compression ratios (~19% of input text, including indexes and overhead) achieved when storing many revisions of a wiki page in consecutive blocks on disk. Test results for a given wiki page have similar characteristics (typically small changes between revisions), so they should compress similarly well.

A challenge for data modeling with Cassandra is its relatively limited ability to query the database. Cassandra specializes in queries that can be processed efficiently by reading a contiguous chunk of storage on one of the replicas. There is no (efficient) support for range queries on the primary 'partition' key, as this key is used to map an entry to a node in the DHT. There are also no joins, and only very limited support for filtering a non-contiguous result set. For more complex queries (a list of regressions, for example) this means that information often needs to be pre-computed and denormalized. Compared to relational systems, data modeling is driven very heavily by the main queries expected. Overall, this means that moving from the relational MySQL schema to Cassandra will require a full redesign of the data model.
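To make the denormalization point concrete, here is a minimal sketch of computing a regression list at write time rather than with a join at read time. The table names, column names and the "higher score is worse" convention are all assumptions made for illustration; they are not the project's schema.

```javascript
// Hypothetical tables (illustrative only):
//   results_by_page:       PRIMARY KEY (page, commit_time)  -- history of one page
//   regressions_by_commit: PRIMARY KEY (commit_hash, page)  -- precomputed list
// Both reads below touch a single partition, which is the kind of query
// Cassandra handles efficiently.
var READ_PAGE_HISTORY =
    'SELECT commit_time, score FROM results_by_page WHERE page = ?';
var READ_REGRESSIONS =
    'SELECT page, old_score, new_score FROM regressions_by_commit WHERE commit_hash = ?';

// Build the list of writes for one incoming test result.
// previousScore is the page's score for the preceding commit, or null.
// Assumption: a higher score means more round-trip errors.
function writesForResult(result, previousScore) {
    var writes = [{
        query: 'INSERT INTO results_by_page (page, commit_time, commit_hash, score) VALUES (?, ?, ?, ?)',
        params: [result.page, result.commitTime, result.commitHash, result.score]
    }];
    // The comparison a relational schema would do with a join at read time
    // is done here, once, and its outcome is stored in a second table.
    if (previousScore !== null && result.score > previousScore) {
        writes.push({
            query: 'INSERT INTO regressions_by_commit (commit_hash, page, old_score, new_score) VALUES (?, ?, ?, ?)',
            params: [result.commitHash, result.page, previousScore, result.score]
        });
    }
    return writes;
}
```

The trade-off is more writes and some duplicated data in exchange for reads that never need a join or a scan across partitions.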

Cassandra bindings
We have tested Cassandra and node.js in the Rashomon revision store prototype. The node-cassandra-cql bindings used there worked well, and can also be used to hook up Cassandra to the round-trip server.

Getting started
git clone https://github.com/gwicke/testreduce.git

Quick start on Debian
If you are running Debian / Ubuntu, try adding this to /etc/apt/sources.list:

deb http://parsoid.wmflabs.org:8080/debian wmf-production/

Now install testreduce:

apt-get update
apt-get install testreduce

If everything went well you should have a test server running at http://localhost:8001/

General install
You need node.js 0.10 and MySQL, which are available in most current Linux distros ( on Debian) and for OS X. It might also work on Windows (we have heard positive reports), but we don't really support Windows. The main developers all use Debian or Ubuntu Linux.

cd testreduce
npm install

To try the MySQL version of the server, you also need to install MySQL and create a database and a user.

In mysql:

create user testreduce;
create database testreduce;
GRANT ALL ON testreduce.* TO 'testreduce'@'localhost';
flush privileges;

Create the db:

mysql -u testreduce testreduce < sql/create_everything.mysql

Now copy server.settings.js.example to server.settings.js and change the following settings:

 * user: testreduce
 * database: testreduce
 * password: "" (empty string)

Now start the server, which will listen at http://localhost:8001/:

node server

Cassandra setup
See the Cassandra download page for most systems. On Debian, simply add the Cassandra apt repository line to /etc/apt/sources.list, and then do

apt-get update
apt-get install cassandra openjdk-7-jdk libjna-java libjemalloc1

In /etc/cassandra/cassandra-env.sh, change this line (near the end) to point to localhost:

JVM_OPTS="$JVM_OPTS -Djava.rmi.server.hostname=localhost"

(Re)start Cassandra. The command

nodetool status

should return information and show your node (and the other nodes) as being up. Example output:

root@xenon:~# nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens  Owns    Host ID                               Rack
UN  127.0.0.1  336.9 KB   256     100.0%  1d4b5052-63db-428b-8a62-0c8b15fdae10  rack1
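If you want to check the node state from a script instead of by eye, the output can be parsed with a few lines of node.js. This is only a sketch based on the output format shown above; the function names are made up for this example.

```javascript
// Parse the node lines of `nodetool status` output into objects.
// Node lines start with a two-letter code: U or D (up/down) followed by
// N, L, J or M (normal/leaving/joining/moving), then the node address.
function parseNodetoolStatus(output) {
    var nodes = [];
    output.split('\n').forEach(function (line) {
        var m = /^([UD])([NLJM])\s+(\S+)/.exec(line.trim());
        if (m) {
            nodes.push({ up: m[1] === 'U', state: m[2], address: m[3] });
        }
    });
    return nodes;
}

// True only if at least one node is listed and all of them are Up/Normal.
function allNodesUpAndNormal(output) {
    var nodes = parseNodetoolStatus(output);
    return nodes.length > 0 && nodes.every(function (n) {
        return n.up && n.state === 'N';
    });
}
```

The output string itself could be obtained with node's child_process.execFile('nodetool', ['status'], callback), for example.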

Now you can start playing with Cassandra using its command-line interface. See the Cassandra 2.0 and CQL 3.1 docs for more Cassandra background.

Next steps

 * Check out the current round-trip server code and familiarize yourself with node.js
 * Read up on and play with Cassandra. You can also play with the Rashomon storage service prototype as a relatively simple example of node + cassandra.
 * Read up on eventually consistent systems and idempotence. Some starting points (feel free to edit):
 * Eventually Consistent
 * Dynamo
 * Building on Quicksand
 * Spanner as an example of a different trade-off with a good use of logical time that tracks GPS time
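As a small illustration of why idempotence matters here: a client that is unsure whether its report arrived will usually retry, so applying the same report twice should leave the same state as applying it once. The sketch below uses in-memory stand-ins with made-up names; in Cassandra itself, an INSERT with the same primary key overwrites the existing row, which gives the first pattern naturally.

```javascript
// Idempotent store: each result is keyed by (page, commit), so reporting
// the same result again overwrites it with an identical value.
function ResultStore() {
    this.rows = {};
}

ResultStore.prototype.report = function (page, commit, score) {
    // A NUL separator keeps different (page, commit) pairs from colliding.
    this.rows[page + '\u0000' + commit] = { page: page, commit: commit, score: score };
};

ResultStore.prototype.count = function () {
    return Object.keys(this.rows).length;
};

// Non-idempotent counterpart for contrast: every retry is counted again.
function CounterStore() {
    this.total = 0;
}

CounterStore.prototype.report = function () {
    this.total += 1;
};

var store = new ResultStore();
var counter = new CounterStore();
// Simulate a client retrying the same report after a timeout.
[1, 2].forEach(function () {
    store.report('en:Foo', 'abc123', 5);
    counter.report('en:Foo', 'abc123', 5);
});
console.log(store.count(), counter.total); // prints "1 2"
```

Counters and other "add one" style updates are where retries cause trouble, so a design based on overwriting keyed rows is usually the easier one to reason about.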

Contacting us

 * IRC: we are hanging out in #mediawiki-parsoid on freenode (you can also use the web chat). Nicks are gwicke (Gabriel, San Francisco) and marcoil (Marc, Spain)
 * Mail: {gwicke,marcoil,aschulz}@wikimedia.org