Wikibase/Indexing/Prototype

Source
https://github.com/smalyshev/wikidata-gremlin/tree/titan_flat

Usage

 * 1) Build with mvn install.
 * 2) Copy runit.sh and runit.groovy to the Titan directory.
 * 3) Start with sh runit.sh 1 (change the number for each instance run in parallel).

Loading data
The processed line count is reported in file processed.dump.1.0 (the actual filename depends on the parameters).
 * 1) Extract the property list: gunzip -c dump | grep '{"id":"P' > props.json. The result should be around 1318 lines.
 * 2) Start the console as described above, then run propLoader.file("props.json").load(10000) to load the properties.
 * 3) Load data with dataLoader.gzipFile("dump").load(1000000) - this loads 1M lines from the dump.
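
Step 1 above can be wrapped in a small helper that also reports the line count, so the result can be compared with the expected figure of around 1318. This is only a sketch: the function name extract_props and its two arguments are made up for illustration and are not part of the repository.

```shell
#!/bin/sh
# extract_props DUMP OUT (hypothetical helper, not part of wikidata-gremlin):
# copy every line of the gzipped dump that contains {"id":"P to OUT,
# then print how many lines were written.
extract_props() {
    dump=$1
    out=$2
    # Same pipeline as step 1. grep exits non-zero when nothing matches,
    # so that case is tolerated and simply yields an empty file.
    gunzip -c "$dump" | grep '{"id":"P' > "$out" || true
    # Print the bare line count (tr strips padding some wc versions add).
    wc -l < "$out" | tr -d ' '
}
```

Called as extract_props dump props.json, it writes props.json and prints the number of property lines found.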

Loader API
The loader class has the following useful methods. All of them can be chained except load.
 * file/gzipFile(String) - set source file
 * setNum(int) - set the start line in the dump
 * failOnError(bool) - if true, loading fails immediately on an exception; otherwise it continues and the failing line is written to the rejects file.
 * load(int) - load the number of lines specified as the argument

Benchmarking
Benchmarking can be done with w.benchmark { closure }, which reports the raw running time in ms, and w.measure(times) { closure }, which runs the closure the given number of times for 5 sessions and calculates the average.

Changes/modifications on einsteinium
These are the changes I had to make on einsteinium relative to the default configuration:

sudo apt-get install unzip

sudo apt-get install openjdk-7-jdk

sudo apt-get install traceroute

sudo apt-get install groovy

Set write_request_timeout_in_ms in cassandra.yaml to 5 s (5000 ms).

grape install org.codehaus.groovy groovy-backports-compat23 2.3.7

Set up proxy:

export HTTP_PROXY=http://webproxy.eqiad.wmnet:8080

export HTTPS_PROXY=http://webproxy.eqiad.wmnet:8080
 * copy groovy-backports-compat23 to titan/lib
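
The write_request_timeout_in_ms change listed above can be scripted rather than edited by hand. The following is a minimal sketch, assuming the setting appears uncommented at the start of a line in cassandra.yaml; the function name set_write_timeout is hypothetical, and the location of cassandra.yaml depends on the installation.

```shell
#!/bin/sh
# set_write_timeout FILE MS (hypothetical helper): rewrite the
# write_request_timeout_in_ms line of a cassandra.yaml-style file.
set_write_timeout() {
    file=$1
    ms=$2
    tmp="$file.tmp"
    # Replace the whole setting line; every other line passes through unchanged.
    # Writing to a temporary file first avoids relying on sed -i.
    sed "s/^write_request_timeout_in_ms:.*/write_request_timeout_in_ms: $ms/" "$file" > "$tmp" &&
        mv "$tmp" "$file"
}
```

For the 5 s value used here, the call would be set_write_timeout cassandra.yaml 5000, followed by a Cassandra restart so the new value takes effect.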