Wikibase/Indexing/Prototype

Source
https://git.wikimedia.org/summary/wikidata%2Fgremlin

Usage

 * 1) Build with mvn install.
 * 2) Copy runit.sh and runit.groovy to the Titan directory.
 * 3) Prepare config.properties with the Titan configuration.
 * 4) Start with sh runit.sh.

Loading data
During loading, the number of processed lines is written to a file such as processed.dump.1.0 (the actual filename depends on the parameters).
 * 1) Call dataLoader.preload to initialize sitelink/language properties.
 * 2) Extract the property list: gunzip -c dump | grep '{"id":"P' > props.json. The result should be around 1318 lines.
 * 3) Set storage.batch-loading = false in config.properties. This is necessary because properties near the beginning of the file reference ones that appear later. TBD: we may want to automate this one day.
 * 4) Start the console as described above, then run propLoader.file("props.json").load(10000) to load the properties.
 * 5) Set storage.batch-loading = true in config.properties.
 * 6) Load data with dataLoader.gzipFile("dump").load(1000000); this loads 1M lines from the dump.
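
Steps 2, 3 and 5 can be scripted. The following is a minimal sketch, not part of the repo: the function names extract_props and set_batch_loading are made up for illustration, and it assumes config.properties holds one key = value pair per line.

```shell
#!/bin/sh
# Hypothetical helpers for the loading steps above; names are illustrative only.

# extract_props DUMP OUT
# Copy the property entities (ids starting with "P") from a gzipped JSON dump
# into OUT and print the number of lines written.
extract_props() {
    gunzip -c "$1" | grep '{"id":"P' > "$2"
    wc -l < "$2" | tr -d ' '
}

# set_batch_loading FILE VALUE
# Set storage.batch-loading in a properties file, appending the key
# if the file does not contain it yet.
set_batch_loading() {
    if grep -q '^storage\.batch-loading' "$1" 2>/dev/null; then
        sed 's/^storage\.batch-loading.*/storage.batch-loading = '"$2"'/' "$1" > "$1.tmp" &&
            mv "$1.tmp" "$1"
    else
        echo "storage.batch-loading = $2" >> "$1"
    fi
}
```

Typical use would be extract_props dump props.json, then set_batch_loading config.properties false before loading the properties and set_batch_loading config.properties true before loading the data.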

The steps that set batch-loading may be eliminated in the future.

Loader API
The loader class has the following useful methods. All methods except load can be chained.
 * file/gzipFile(String) - set source file
 * setNum(int) - set the start line in the dump
 * failOnError(bool) - if true, loading fails immediately on an exception; otherwise it proceeds and the failing line is written to the rejects file.
 * recover - reset the start line to the last one recorded in the processed file for this run (use the same setNum parameter!)
 * load(int) - load the number of lines given as the argument

Benchmarking
Benchmarking can be done with w.benchmark { closure }, which reports the raw running time in ms, and w.measure(times) { closure }, which runs the closure the given number of times for 5 sessions and calculates the average.

Rexster Setup (works for 0.5)

 * 1) Copy rexster-wikidata.xml from the repo to $TITAN/conf.
 * 2) Copy rexster-init.groovy from the repo to $TITAN/rexhome.
 * 3) Edit rexster-wikidata.xml to match your Cassandra/ES setup and parameters.
 * 4) Copy titan-wikidata.sh from repo to $TITAN/bin/.
 * 5) Copy or link wikidata-gremlin-0.0.1-SNAPSHOT.jar to $TITAN/ext directory.
 * 6) Run   to start the server.
 * 7) Run $TITAN/bin/rexster-console.sh to connect to the console.
 * 8) The logs are in $TITAN/log/rexstitan.log.
 * 9) The graph can be instantiated as   or just.
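
The copy steps (1, 2, 4 and 5) can be scripted. This is a rough sketch under assumptions that the steps above do not confirm: the three config/script files are taken from the top level of the repo checkout, and the jar is taken from Maven's default target/ output directory. The function name install_rexster_files is hypothetical. Step 3 (editing rexster-wikidata.xml) still has to be done by hand.

```shell
#!/bin/sh
# install_rexster_files REPO TITAN
# Hypothetical helper: copy the Rexster files from a repo checkout into a
# Titan installation. Returns non-zero as soon as any copy fails.
install_rexster_files() {
    repo=$1
    titan=$2
    # Assumption: the jar was built with mvn install and sits in Maven's
    # default output directory.
    jar="$repo/target/wikidata-gremlin-0.0.1-SNAPSHOT.jar"

    mkdir -p "$titan/conf" "$titan/rexhome" "$titan/bin" "$titan/ext" || return 1
    cp "$repo/rexster-wikidata.xml" "$titan/conf/"    || return 1  # step 1
    cp "$repo/rexster-init.groovy"  "$titan/rexhome/" || return 1  # step 2
    cp "$repo/titan-wikidata.sh"    "$titan/bin/"     || return 1  # step 4
    cp "$jar"                       "$titan/ext/"     || return 1  # step 5
}
```

It would be called as install_rexster_files /path/to/repo "$TITAN".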

Changes/modifications on einsteinium
These are the changes I had to make to the default config on einsteinium:

sudo apt-get install unzip

sudo apt-get install openjdk-7-jdk

sudo apt-get install traceroute

sudo apt-get install groovy

Set write_request_timeout_in_ms in cassandra.yaml to 5000 (5 s)

grape install org.codehaus.groovy groovy-backports-compat23 2.3.7

Set up proxy:

export HTTP_PROXY=http://webproxy.eqiad.wmnet:8080

export HTTPS_PROXY=http://webproxy.eqiad.wmnet:8080
 * copy groovy-backports-compat23 to titan/lib