Wikidata Toolkit/Client

From mediawiki.org

The Wikidata Toolkit Client, or WDTK Client, is a stand-alone application provided as part of Wikidata Toolkit. It can be used to download and process Wikidata content without developing your own software.

Download and installation

The WDTK Client is distributed as a jar file that can be downloaded from the releases page on GitHub (it is attached to the most recent release). For example, the client for WDTK v0.7.0 is found at

https://github.com/Wikidata/Wikidata-Toolkit/releases/download/v0.7.0/wdtk-client-0.7.0.jar

You can download this file to any location on your computer (and, optionally, verify its digital signature). To run the client, you need to have Java 1.7 or above installed on your machine, which provides the "java" command. You can run the client with the java command as follows:

java -jar wdtk-client-0.7.0.jar

Doing this will produce a help message that gives you an overview of the available command-line parameters. No further installation is necessary, but you could also write a shell script (most operating systems) or batch file (Microsoft Windows) that runs the above command for you, so that you do not need to type the full command each time. For example, on macOS, Linux, and Unix, you might create a file wdtk-client.sh with the following content:

#!/bin/bash
java -jar /full/path/to/the/client/wdtk-client-0.7.0.jar "$@"

On Windows, the batch file should be named wdtk-client.bat, with the following content:

java -jar \full\path\to\the\client\wdtk-client-0.7.0.jar %*

You could also specify additional Java parameters in this way. The script could then be marked executable and placed into a directory on your path, so that you only need to call wdtk-client.sh to start the client. In the rest of this page, we use this as a command, but you can also call java with the -jar parameter each time if preferred.
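As a sketch of how additional Java parameters could be passed through such a script, the variant below reads them from an environment variable. The names WDTK_JAR, WDTK_JAVA_OPTS, WDTK_DRY_RUN and run_wdtk, as well as the default heap size, are assumptions made for this illustration and are not defined by WDTK; -Xmx is the usual JVM option for the maximum heap size.

```shell
#!/bin/bash
# Sketch of a wrapper; WDTK_JAR, WDTK_JAVA_OPTS and WDTK_DRY_RUN are
# hypothetical variable names chosen for this example, not WDTK settings.
WDTK_JAR="${WDTK_JAR:-/full/path/to/the/client/wdtk-client-0.7.0.jar}"
WDTK_JAVA_OPTS="${WDTK_JAVA_OPTS:--Xmx4g}"   # assumed default: 4 GB maximum heap

run_wdtk() {
  # WDTK_JAVA_OPTS is left unquoted on purpose, so that several
  # space-separated JVM options become separate arguments.
  if [ "${WDTK_DRY_RUN:-0}" = "1" ]; then
    # Dry run: only print the command that would be executed.
    echo java $WDTK_JAVA_OPTS -jar "$WDTK_JAR" "$@"
  else
    java $WDTK_JAVA_OPTS -jar "$WDTK_JAR" "$@"
  fi
}
```

Adding a final line run_wdtk "$@" would turn this into a complete replacement for wdtk-client.sh; setting WDTK_DRY_RUN=1 first makes it easy to check the resulting command line before starting a long run.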

Basic usage

The WDTK Client can execute several actions. Typically, an action runs an operation on the latest Wikidata content dump. If necessary, this dump is downloaded first, which can take a while. Once this is done, the file remains stored locally so that it can be reused in later runs. The place where dumps are stored can be configured (by default it is a directory called "dumpfiles" in the working directory of the client). It is also possible to run the client in offline mode, which restricts processing to previously downloaded dumps rather than trying to get the most recent one. Moreover, one can specify a particular dump file on the local file system instead of relying on downloaded dumps (see option --input).

Dumps are usually compressed, and they are not decompressed to disk, so you need no disk space beyond the size of the downloaded dump file (plus space for any output that the client creates).
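To see which dump files are already stored locally and how much space they take, a small helper like the following could be used. This is only a sketch: the function name list_dumps is made up for this example, and nothing is assumed about the internal layout of the dump directory beyond the documented default name "dumpfiles".

```shell
# Print the size (in kilobytes) and path of every file below the dump
# directory. The directory defaults to "dumpfiles", the documented default.
list_dumps() {
  dump_dir="${1:-dumpfiles}"
  if [ ! -d "$dump_dir" ]; then
    echo "no dump directory at $dump_dir" >&2
    return 1
  fi
  find "$dump_dir" -type f -exec du -k {} +
}
```

Calling list_dumps without an argument inspects "dumpfiles" in the current directory; pass the path given to --dumps if the location was changed.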

As of WDTK 0.7.0, the following actions are available:

  • json: converts Wikidata dumps to JSON format
  • rdf: converts Wikidata dumps to RDF format in a variety of ways (this is used to generate the Wikidata RDF dumps)
  • sqid: creates JSON files with statistical information about the use of classes, properties, and other overall statistics as used in the SQID Wikidata Browser. See the SQID file format.md for documentation of the content and structure of these files.

An action is selected with the parameter --action (or -a). For example, to run the JSON conversion, one could do

wdtk-client.sh --action json

Note that the official dumps are already in JSON format, so this action would merely recreate the JSON dump with largely the same data (possibly filtering out some errors). The JSON action makes more sense when used with other parameters, e.g., if you want to create a JSON dump that contains only English language labels.
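The English-only case mentioned above could be sketched as follows, using the --fLang, --output and --compression parameters documented on this page. WDTK_CLIENT and json_english_only are hypothetical names introduced for the example; the variable merely points at the wrapper script so that it can be swapped out.

```shell
# WDTK_CLIENT is a hypothetical variable naming the command that starts the
# client; it defaults to the wrapper script wdtk-client.sh.
WDTK_CLIENT="${WDTK_CLIENT:-wdtk-client.sh}"

json_english_only() {
  # $1: name of the output file. --fLang en keeps only English terms;
  # the output is compressed with bz2 because dumps can become very big.
  "$WDTK_CLIENT" --action json --fLang en --output "$1" --compression bz2
}
```

For example, json_english_only wikidata-en.json would start the conversion with these settings.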

Some actions may require further parameters to be set in order to specify what exactly should be done. For example, "rdf" requires a parameter --rdftasks to be set. Usually, an action should provide you with documentation on missing parameters if you just run it. For example, the documentation for --rdftasks is shown when running

wdtk-client.sh --action rdf

Different actions might require different parameters, but the documentation should always be provided if no parameters are given.

General parameters

There are a number of further parameters that apply to most actions. For example:

wdtk-client.sh --action json --offline --output my-output.json -z bz2 --fLang en,de --fSite -

This would create a JSON file in offline mode (no new downloads) and write the output to a file my-output.json, compressed with BZ2 (so the actual output file is my-output.json.bz2). The output includes only terms in English and German and no site links.
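To check such a result without unpacking it to disk, the first lines of the output file can be streamed through the standard gzip or bzip2 tools. The helper name peek_output is an assumption for this sketch, and lines in a dump may be long, so a small line count is advisable.

```shell
# Print the first N lines (default 5) of an output file, decompressing on
# the fly according to the file suffix (.bz2, .gz, or uncompressed).
peek_output() {
  file="$1"
  lines="${2:-5}"
  case "$file" in
    *.bz2) bzip2 -dc "$file" ;;
    *.gz)  gzip -dc "$file" ;;
    *)     cat "$file" ;;
  esac | head -n "$lines"
}
```

With the example above, peek_output my-output.json.bz2 1 would show the first line of the generated file.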

The documentation for the parameters is as follows:

parameter documentation
-a,--action <action> define the action that should be performed (use --help to show all currently available actions)
-h,--help print help message about all general parameters currently supported
-d,--dumps <path> set the location of the dump files; by default this will be the directory "dumpfiles" in the working directory; to reuse previously downloaded dumps, this should be a single location, so if you run the client from varying directories, you could set this parameter within the shell script
-i,--input <path> select a dump file for processing; if omitted, then the latest dump from Wikidata will be used (and possibly downloaded)
-n,--offline execute all operations in offline mode, using only previously downloaded dumps
-o,--output <path> place the output into the file at <path>; this can be a relative path (such as a file name)
-z,--compression <type> use a certain compression format for the output; possible values are gz and bz2; it is strongly recommended to use output compression, since the files can otherwise become very big
-s,--stdout write output to stdout; if this is given, the option -o is ignored
-q,--quiet perform all actions quietly, without printing status messages to the console; errors/warnings are still printed to stderr
--rdftasks <task> specify which data to include in RDF dump (use with action "rdf"); run with options "-a rdf -n" for help
-c,--config <file> set a config file; use this to define multiple actions for a single run (see below)
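The --stdout and --quiet parameters are useful when the output should be piped into another program instead of being stored. The sketch below counts the bytes the client writes to stdout; WDTK_CLIENT and json_output_bytes are hypothetical names, and it is an assumption here that --quiet is needed to keep status messages out of the pipe.

```shell
# WDTK_CLIENT is a hypothetical variable naming the command that starts the
# client; it defaults to the wrapper script wdtk-client.sh.
WDTK_CLIENT="${WDTK_CLIENT:-wdtk-client.sh}"

# Count the bytes the JSON action writes to stdout without storing them.
# Extra arguments (for example filter parameters) are passed through.
json_output_bytes() {
  "$WDTK_CLIENT" --action json --stdout --quiet "$@" | wc -c | tr -d ' '
}
```

For instance, json_output_bytes --offline --fLang en would report the size of an English-only dump built from a previously downloaded file.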

In addition, there are several parameters that can be used to filter some parts of the data, i.e., to remove some data before it is further processed. This can be useful, e.g., to create dump files that contain only the data for a particular language.

parameter documentation
--fLang <languages> specifies a list of language codes; if given, only terms in languages in this list will be processed; the value "-" denotes the empty list (no terms are processed)
--fProp <ids> specifies a list of property ids; if given, only statements for properties in this list will be processed; the value "-" denotes the empty list (no statements are processed)
--fSite <sites> specifies a list of site keys; if given, only site links to sites in this list will be processed; the value "-" denotes the empty list (no site links are processed)
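Since all three filter parameters take a comma-separated list, with "-" standing for the empty list, their values can be built by a small helper when the lists come from elsewhere in a script. The function name filter_value is made up for this sketch.

```shell
# Join the arguments into the comma-separated form used by --fLang, --fProp
# and --fSite; with no arguments, print "-" (the documented empty list).
filter_value() {
  if [ "$#" -eq 0 ]; then
    printf '%s\n' "-"
    return
  fi
  # In a subshell, so that the changed IFS does not leak into the caller.
  (IFS=,; printf '%s\n' "$*")
}

# Possible use (not executed here):
#   wdtk-client.sh --action json --fLang "$(filter_value en de)" \
#     --fProp "$(filter_value P31)" --fSite "$(filter_value)"
```

The commented call would keep English and German terms and statements for P31, and drop all site links.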

Configuration files: running multiple actions at once

Using the command-line options, only a single action can be configured. However, basic tasks such as decompressing and parsing the JSON file are required by many actions, and it is not efficient to run them three times if three actions need to be performed. For this reason, WDTK Client supports the configuration of multiple actions that will all be performed together in a single run. This is achieved by writing a simple configuration file.

The configuration file is a simple "ini" file. Here is an example:

[general]
offline = true
dumps = /my/dump/location
fSite = -
fProp = P31
fLang = fr,zh

[rdf-statement-dump]
action = rdf
compression = gz
rdftasks = items,statements
output = /tmp/wikidata-item-statements.nt

[rdf-label-dump]
action = rdf
compression = gz
rdftasks = items,labels
output = /tmp/wikidata-item-labels.nt

[json-dump]
action=json
compression=bz2
output = /tmp/wikidata-dump.json

This file configures three actions: two RDF dumps and one JSON dump. In addition it defines some general parameters that are applied to all actions at the top of the file. The parameters that are used are exactly the same as the long names of the command-line parameters. To use this configuration file, the parameter --config is used, e.g.,

wdtk-client.sh --config myconfig.ini

You could use additional command-line parameters to set some options on the command line rather than in the configuration file. You can also configure another action through the command-line parameters if you think this is useful.

Some parameters can only be used as general parameters (they will always refer to all actions), while others can only be used for individual actions. The example shows this: action configuration, output, and compression are specific to each action, whereas dump location, offline mode, and general filters apply to all actions.
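Before starting a long run, it can help to list which actions a configuration file defines. The following is a minimal sketch using awk; the function name list_config_actions is hypothetical, and the parsing only covers the simple "key = value" layout shown in the example above, not every possible ini syntax.

```shell
# Print "section<TAB>action" for every section of an ini file that sets an
# action. Sections without an action (such as [general]) are skipped.
list_config_actions() {
  awk '
    /^[ \t]*\[.*\][ \t]*$/ {            # section header such as [json-dump]
      section = $0
      gsub(/^[ \t]*\[|\][ \t]*$/, "", section)
      next
    }
    /^[ \t]*action[ \t]*=/ {            # matches "action = rdf" and "action=json"
      value = $0
      sub(/^[^=]*=[ \t]*/, "", value)   # drop the key and the equals sign
      sub(/[ \t]+$/, "", value)         # drop trailing whitespace
      print section "\t" value
    }
  ' "$1"
}
```

Run on the example file, list_config_actions myconfig.ini would print one line for each of the two RDF sections and one for the JSON section.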