Parsoid/Round-trip testing/Cassandra

The Parsoid code includes a round-trip testing system that is used to test code changes on a collection of 160k Wikipedia articles from 16 languages. The system is composed of a server that hands out tasks and presents results, and clients that do the testing and report back to the server. The test performed on the client converts a wiki article to HTML, converts that HTML back to wikitext, and finally classifies any differences as either semantic or purely syntactic. This is essentially a map/reduce style workflow: distributing the work across about 40 cores in different VMs lets us finish a round-trip run over all 160k pages in around 4 hours.
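The per-page test described above can be sketched roughly as follows. This is a minimal illustration and not the actual client code: `toHtml`, `toWikitext`, `normalizeHtml` and the rule used to tell syntactic from semantic differences are all assumptions made for the example, and the toy converters only understand bold markup.

```javascript
'use strict';

// Hypothetical toy converter: '''bold''' becomes <b>bold</b>.
// Parsoid's real wikitext -> HTML conversion is far more involved.
function toHtml(wikitext) {
  return wikitext.replace(/'''(.+?)'''/g, '<b>$1</b>');
}

// Hypothetical toy serializer. It collapses runs of spaces, which is a
// typical source of purely syntactic differences.
function toWikitext(html) {
  return html.replace(/<b>(.+?)<\/b>/g, "'''$1'''").replace(/ {2,}/g, ' ');
}

// Whitespace runs do not change how HTML renders, so ignore them.
function normalizeHtml(html) {
  return html.replace(/\s+/g, ' ').trim();
}

// Round-trip one page and classify the outcome.
// Assumption: a difference counts as "syntactic" when the original and the
// round-tripped wikitext still produce equivalent HTML, else "semantic".
function roundTrip(wikitext, parse, serialize) {
  var html = parse(wikitext);
  var roundTripped = serialize(html);
  if (roundTripped === wikitext) {
    return 'no-diff';
  }
  return normalizeHtml(parse(roundTripped)) === normalizeHtml(html)
    ? 'syntactic'
    : 'semantic';
}
```

Passing the converters in as arguments keeps the classification step separate from the conversion code, so a deliberately lossy serializer can be swapped in to see a semantic difference reported.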

Problem statement / introduction
The round-trip server has become a bottleneck in the round-trip test system: it prevents us from scaling up the system with more clients to process more pages. We are currently using a MySQL backend (after migrating from SQLite earlier). A few months' worth of round-trip results take up 31 GB on disk, and queries on this data slow down considerably as the database grows.

We recently saw very good results when testing Cassandra as a backend for a page-content revision store. Apart from the obvious benefits of replication, automatic fail-over and write scalability, we were impressed by the compression ratios (~19% of input text, including indexes and overhead) achieved when storing many revisions of a wiki page in consecutive blocks on disk. Test results for a given wiki page typically also change only slightly between revisions, so they should compress similarly well.

A challenge for data modeling with Cassandra is its relatively limited ability to query the database. Cassandra specializes in queries that can be processed efficiently by reading a contiguous chunk of storage on one of the replicas. There is no (efficient) support for range queries on the primary 'partition' key, as this key is used to map an entry to a node in the DHT. There are also no joins, and only very limited support for filtering a non-contiguous result set. For more complex queries (a list of regressions, for example) this means that information often needs to be pre-computed and denormalized. Compared to relational systems, data modeling is driven very heavily by the main queries expected. Overall, this means that moving from the relational MySQL schema to Cassandra will require a full redesign of the data model.
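To make the query-driven modeling concrete, here is a small in-memory sketch of the idea. It is not Cassandra code and not a proposed schema: `Table`, `createStore`, the two tables and the run identifiers are hypothetical. It only illustrates that reads are limited to one partition in clustering order, and that a regression list therefore has to be pre-computed at write time instead of being derived with a join.

```javascript
'use strict';

// Toy in-memory model of a Cassandra-style table (not real Cassandra code):
// rows are grouped by partition key and kept sorted by clustering key.
function Table() {
  this.partitions = {};
}

// Insert the row (partitionKey, clusteringKey), overwriting any existing one.
Table.prototype.put = function (partitionKey, clusteringKey, value) {
  var rows = (this.partitions[partitionKey] || []).filter(function (row) {
    return row.key !== clusteringKey;
  });
  rows.push({ key: clusteringKey, value: value });
  rows.sort(function (a, b) {
    return a.key < b.key ? -1 : (a.key > b.key ? 1 : 0);
  });
  this.partitions[partitionKey] = rows;
};

// The cheap read: all rows of a single partition, in clustering order.
Table.prototype.read = function (partitionKey) {
  return (this.partitions[partitionKey] || []).slice();
};

function createStore() {
  var resultsByPage = new Table();    // partition: page, clustering: run
  var regressionsByRun = new Table(); // partition: run,  clustering: page

  return {
    // Assumption: run ids sort lexicographically in chronological order.
    recordResult: function (page, run, semanticErrors) {
      var earlier = resultsByPage.read(page).filter(function (row) {
        return row.key < run;
      });
      var previous = earlier[earlier.length - 1];
      resultsByPage.put(page, run, { semanticErrors: semanticErrors });
      // Denormalize: write the regression entry now, since it cannot be
      // computed later by joining or scanning across partitions.
      if (previous && semanticErrors > previous.value.semanticErrors) {
        regressionsByRun.put(run, page, {
          before: previous.value.semanticErrors,
          after: semanticErrors
        });
      }
    },
    regressionsForRun: function (run) {
      return regressionsByRun.read(run);
    }
  };
}
```

With this layout, "history of one page" and "regressions in one run" are each a read of a single partition, at the cost of writing some information twice.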

Cassandra bindings
We have tested Cassandra and node.js in the Rashomon revision store prototype. The node-cassandra-cql bindings used there worked well, and can also be used to hook up Cassandra to the round-trip server.
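As a rough sketch of how the round-trip server might write a result through such bindings, the function below takes the client as a parameter. The table name, column names and the shape of `result` are hypothetical, and the only thing assumed about the client is that it offers an `execute(query, params, callback)` method in the style of a node-cassandra-cql client; check the bindings' README for the exact signature and options.

```javascript
'use strict';

// Hypothetical table and column names; the real schema is still to be designed.
var INSERT_RESULT =
  'INSERT INTO results (page, run, semantic_errors, syntactic_errors) ' +
  'VALUES (?, ?, ?, ?)';

// Store one round-trip result.
// Assumption: `client` exposes execute(query, params, callback), where the
// callback receives an error as its first argument.
function storeResult(client, result, callback) {
  var params = [
    result.page,
    result.run,
    result.semanticErrors,
    result.syntacticErrors
  ];
  client.execute(INSERT_RESULT, params, function (err) {
    callback(err || null);
  });
}
```

Injecting the client keeps the storage code independent of how the connection is configured, and lets the server tests substitute a stub that records the queries it receives.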

Getting started
The code is in parsoid/tests/server and parsoid/tests/client.

We are using the Gerrit code review system at the Wikimedia Foundation. To set up an account, please follow these first steps:
 * Install and configure Git
 * Create an SSH key
 * Create a developer account
 * Log in and add your public key to Gerrit and to wikitech.
 * Install git-review

Now get the code:

git clone ssh://USERNAME@gerrit.wikimedia.org:29418/mediawiki/services/parsoid

You need node.js 0.10 and MySQL, which are available in most current Linux distros ( on Debian) and for OSX. It might also work on Windows (we have heard positive reports), but we don't really support Windows. The main developers all use Debian or Ubuntu Linux.

To verify that your node install was successful:

node --version
v0.10.24

Now install the dependencies for the server in the parsoid repository:

cd parsoid/tests/server
npm install

To try the MySQL version of the server, you also need to install MySQL, create a db and user and init it with   and set up server.settings.js based on the example file provided. You will need to adjust the user, password and database name. Now you should be able to start the server with 'node server', which should give you a web interface at http://localhost:8001/.

Cassandra setup
See the Cassandra download page for most systems. On Debian, simply add  to /etc/apt/sources.list, and then do

apt-get update
apt-get install cassandra openjdk-7-jdk libjna-java libjemalloc1

Set up /etc/cassandra/cassandra.yaml according to the docs. Main things to change:
 * listen_address, rpc_address : set to external IP of this node
 * seed_provider / seeds : set to list of other cluster node IPs: "10.64.16.147,10.64.16.149,10.64.0.200"

(Re)start Cassandra. The command

nodetool status

should return information and show your node (and the other nodes) as being up. Example output:

root@xenon:~# nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens  Owns   Host ID                               Rack
UN  10.64.16.149  91.4 KB    256     33.4%  c72025f6-8ad8-4ab6-b989-1ce2f4b8f665  rack1
UN  10.64.0.200   30.94 KB   256     32.8%  48821b0f-f378-41a7-90b1-b5cfb358addb  rack1
UN  10.64.16.147  58.75 KB   256     33.8%  a9b2ac1c-c09b-4f46-95f9-4cb639bb9eca  rack1

Now you can start playing with Cassandra using the  cli interface. See the Cassandra 2.0 and CQL 3.1 docs for more Cassandra background.

Next steps

 * Check out the current round-trip server code and familiarize yourself with node.js
 * Read up on and play with Cassandra. You can also play with the Rashomon storage service prototype as a relatively simple example of node + cassandra.
 * Read up on eventually consistent systems and idempotence. Some starting points (feel free to edit):
 * Eventually Consistent
 * Dynamo
 * Building on Quicksand
 * Spanner as an example of a different trade-off with a good use of logical time that tracks GPS time
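Idempotence matters here because a client may report the same test result more than once, for example after a timeout and retry. The sketch below contrasts a non-idempotent update with an idempotent one. The function names, the report shape and the `page@run` key format are made up for this illustration and are not part of the round-trip server.

```javascript
'use strict';

// Non-idempotent: every delivery of the same report bumps the counter again.
function applyIncrement(state, report) {
  state.total = (state.total || 0) + report.errors;
  return state;
}

// Idempotent: the report is stored under a key derived from what it
// describes, so a redelivered report simply overwrites itself.
function applyKeyed(state, report) {
  state.byKey = state.byKey || {};
  state.byKey[report.page + '@' + report.run] = report.errors;
  return state;
}

// Derive the total from the keyed entries when it is needed.
function totalErrors(state) {
  var byKey = state.byKey || {};
  return Object.keys(byKey).reduce(function (sum, key) {
    return sum + byKey[key];
  }, 0);
}
```

Delivering the same report twice doubles the count in the first variant but leaves the second unchanged, which is the property that makes retries safe in an eventually consistent store.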

Contacting us

 * IRC: we are hanging out in #mediawiki-parsoid on freenode (you can also use the web chat). Nicks are gwicke (Gabriel, San Francisco) and marcoil (Marc, Spain)
 * Mail: {gwicke,marcoil,aschulz}@wikimedia.org