Wikibase/Indexing/Prototype

From mediawiki.org

Source

https://git.wikimedia.org/summary/wikidata%2Fgremlin

Usage

The instructions below may be out of date for Titan 0.9 and will be updated soon.

  1. Build with mvn install.
  2. Copy runit.sh and runit.groovy to the Titan directory.
  3. Prepare config.properties with the Titan configuration.
  4. Start with sh runit.sh.
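The copy step above can be sketched as a small shell function. This is only an illustration: copy_runit is a hypothetical helper name, and it assumes runit.sh and runit.groovy sit at the top level of the checkout, which this page does not state.

```shell
#!/bin/sh
# Sketch of step 2 above (copy the launcher files into the Titan directory).
# Assumption: runit.sh and runit.groovy are at the top level of the checkout.
# Usage: copy_runit <repo-dir> <titan-dir>
copy_runit() {
    for f in runit.sh runit.groovy; do
        # Stop at the first file that cannot be copied.
        cp "$1/$f" "$2/" || return 1
    done
}
```

After mvn install, something like copy_runit . "$TITAN" followed by sh runit.sh from the Titan directory would cover steps 2 and 4. Whether the script has to be started from that directory is an assumption.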

Loading data

  1. Call dataLoader.preload() to initialize the sitelink/language properties.
  2. Extract the property list from the dump: gunzip -c dump | grep '{"id":"P' > props.json. The result should be around 1380 lines.
  3. Start the console as described above, then run propLoader.file("props.json").load(10000) to load the properties. The argument of load() should be greater than the number of lines in props.json.
  4. Load the data with dataLoader.gzipFile("dump").load(1000000), which loads 1M lines from the dump.
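Since the argument of load() has to exceed the number of lines in props.json, it helps to count them right after extraction. A minimal sketch, using the same gunzip/grep pipeline as step 2; the function names extract_props and count_lines are hypothetical helpers, not part of the repo.

```shell
#!/bin/sh
# Extract property entities (lines starting an entity with id "P...") from a
# gzipped JSON dump, using the same filter as step 2 above.
# Usage: extract_props <dump> <output-file>
extract_props() {
    gunzip -c "$1" | grep '{"id":"P' > "$2"
}

# Print the number of lines in a file, without the filename or padding.
# Usage: count_lines <file>
count_lines() {
    wc -l < "$1" | tr -d ' '
}
```

For example, extract_props dump props.json && count_lines props.json prints the line count, and any load() argument above that number loads every property.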

The processed line count is reported in the file processed.dump.1.0 (the actual filename depends on the parameters).

The batch-loading setup step may be eliminated in the future.

Loader API

The loader class has the following useful methods:

  • file(String) / gzipFile(String) - set the source file (plain or gzipped)
  • setNum(int) - set the start line in the dump
  • failOnError(bool) - if true, loading fails immediately on an exception; otherwise it proceeds and the failing line is written to the rejects file
  • recover() - reset the start line to the last one recorded in the processed file for this run (use the same setNum parameter!)
  • load(int) - load the number of lines given as the argument

All methods except load() can be chained, so load() comes last, as in propLoader.file("props.json").load(10000).

Benchmarking

Benchmarking can be done with w.benchmark { closure }, which reports the raw running time in ms, and with w.measure(times) { closure }, which runs the closure the given number of times for 5 sessions and calculates the average.

Rexster Setup (works for 0.5)

  1. Copy rexster-wikidata.xml from the repo to $TITAN/conf.
  2. Copy rexster-init.groovy from the repo to $TITAN/rexhome.
  3. Edit rexster-wikidata.xml to match your Cassandra/ES setup and parameters.
  4. Copy titan-wikidata.sh from the repo to $TITAN/bin/.
  5. Copy or link wikidata-gremlin-0.0.1-SNAPSHOT.jar to the $TITAN/ext directory.
  6. Run $TITAN/bin/titan-wikidata.sh start to start the server.
  7. Run $TITAN/bin/rexster-console.sh to connect to the console.
  8. The logs are in $TITAN/log/rexstitan.log.
  9. The graph can be instantiated as rexster.getGraph('wikidata') or just gg().
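The file-copying steps above (1, 2, 4 and 5) could be scripted roughly as follows. This is a sketch under assumptions: install_rexster_files is a hypothetical helper, the three repo files are assumed to sit at the top level of the checkout, and the jar path is passed in explicitly because this page does not say where the build puts it.

```shell
#!/bin/sh
# Sketch of steps 1, 2, 4 and 5 above.
# Assumption: rexster-wikidata.xml, rexster-init.groovy and titan-wikidata.sh
# are at the top level of the repo checkout; adjust if yours differs.
# Usage: install_rexster_files <repo-dir> <titan-dir> <path-to-jar>
install_rexster_files() {
    repo=$1; titan=$2; jar=$3
    # The target directories normally exist in a Titan install; -p is harmless.
    mkdir -p "$titan/conf" "$titan/rexhome" "$titan/bin" "$titan/ext" || return 1
    cp "$repo/rexster-wikidata.xml" "$titan/conf/" || return 1
    cp "$repo/rexster-init.groovy" "$titan/rexhome/" || return 1
    cp "$repo/titan-wikidata.sh" "$titan/bin/" || return 1
    # Make sure the start script can be run directly (step 6).
    chmod +x "$titan/bin/titan-wikidata.sh" || return 1
    cp "$jar" "$titan/ext/" || return 1
}
```

Step 3 (editing rexster-wikidata.xml) still has to be done by hand, since it depends on the local Cassandra/ES settings.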

Changes/modifications on einsteinium

These are the changes I had to make to the default configuration on einsteinium:

sudo apt-get install unzip

sudo apt-get install openjdk-7-jdk

sudo apt-get install traceroute

sudo apt-get install groovy

Set write_request_timeout_in_ms in cassandra.yaml to 5 s (5000 ms).

grape install org.codehaus.groovy groovy-backports-compat23 2.3.7

  • copy groovy-backports-compat23 to titan/lib

Set up proxy:

export HTTP_PROXY=http://webproxy.eqiad.wmnet:8080

export HTTPS_PROXY=http://webproxy.eqiad.wmnet:8080
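The write_request_timeout_in_ms change listed above can also be scripted. A minimal sketch, assuming the key appears uncommented at the start of a line in cassandra.yaml; set_write_timeout is a hypothetical helper name, and the value is given in milliseconds.

```shell
#!/bin/sh
# Rewrite the write_request_timeout_in_ms line in a cassandra.yaml file.
# Assumption: the key is present, uncommented, at the start of a line.
# Usage: set_write_timeout <cassandra.yaml> <milliseconds>
set_write_timeout() {
    # Write to a temporary file first, then copy the content back so the
    # original file keeps its ownership and permissions.
    sed "s/^write_request_timeout_in_ms:.*/write_request_timeout_in_ms: $2/" "$1" > "$1.tmp" &&
        cat "$1.tmp" > "$1" &&
        rm "$1.tmp"
}
```

For the 5 s value used here the call would be set_write_timeout /path/to/cassandra.yaml 5000, with the path depending on the local Cassandra install.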