Wikidata Toolkit/Client

The Wikidata Toolkit Client, or WDTK Client, is a stand-alone application provided as part of Wikidata Toolkit. It can be used to download and process Wikidata content without developing your own software.

Download and installation
The WDTK Client is distributed as a jar file that can be downloaded from the releases page on GitHub (it is attached to the most recent release). For example, the client for WDTK v0.4.0 is found at

https://github.com/Wikidata/Wikidata-Toolkit/releases/download/v0.4.0/wdtk-client-0.4.0.jar

You can download this file to any location on your computer. To run the client, you need Java 1.7 or above installed on your machine, which provides the "java" command. To start the client, run the command "java -jar wdtk-client-0.4.0.jar" from the directory that contains the jar file.

Doing this will produce a help message that gives you an overview of the available command-line parameters. No further installation is necessary, but you could also write a shell script (most operating systems) or batch file (Microsoft Windows) that runs the above command for you, so that you do not need to type the java parameters each time. For example, on macOS, Linux, and Unix, you might create a file wdtk-client.sh that calls java with the -jar parameter and the location of the jar file, and that passes on all further command-line arguments.
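A minimal sketch of such a wrapper for POSIX shells follows. It writes the script with a here-document and marks it executable. The location /path/to is a placeholder for wherever you saved the jar, and the -Xmx4g option mentioned in the comment only illustrates where an additional Java parameter would go.

```shell
# Write the wrapper script. /path/to is a placeholder: replace it with the
# directory that holds the downloaded jar file.
cat > wdtk-client.sh <<'EOF'
#!/bin/sh
# Pass all command-line arguments through to the WDTK Client.
# Extra Java parameters (for example -Xmx4g for a larger heap) go before -jar.
exec java -jar /path/to/wdtk-client-0.4.0.jar "$@"
EOF

# Mark the script as executable so that it can be called directly.
chmod +x wdtk-client.sh
```

The quoted 'EOF' delimiter keeps the shell from expanding "$@" while the file is written, so the wrapper forwards its own arguments when it is run later.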

You could also specify additional Java parameters in this way. The script could then be marked executable and placed into a directory on your path, so that you only need to call wdtk-client.sh to start the client. In the rest of this page, we use this as a command, but you can also call java with the -jar parameter each time if preferred.

Basic usage
The WDTK Client can execute several actions. Typically, an action runs an operation on the latest Wikidata content dump. If necessary, this dump is downloaded first, which can take a while. Once downloaded, the file remains stored locally so that it can be reused in future runs. The place where dumps are stored can be configured; by default it is a directory called "dumpfiles" in the working directory of the client. It is also possible to run the client in offline mode, which restricts it to previously downloaded dumps rather than trying to get the most recent one. Dumps are usually compressed, and they are not decompressed to disk, so you need no disk space beyond what the downloaded dump file occupies (plus space for any output that the client may create).

As of WDTK 0.4.0, there are two actions available:
 * json: converts Wikidata dumps to JSON format
 * rdf: converts Wikidata dumps to RDF format in a variety of ways (this is used to generate the Wikidata RDF dumps)

An action is selected with the parameter --action (or -a). For example, to run the JSON conversion, one could call "wdtk-client.sh -a json".

Note that the official dumps are already in JSON format, so this action would merely recreate the JSON dump with largely the same data (possibly filtering out some errors). The JSON action makes more sense when used with other parameters, e.g., if you want to create a JSON dump that contains only English language labels.

Some actions may require further parameters to be set in order to specify what exactly should be done. For example, "rdf" requires a parameter --rdftasks to be set. Usually, an action should provide you with documentation on missing parameters if you just run it. For example, the documentation for --rdftasks is shown when running "wdtk-client.sh -a rdf".

Different actions might require different parameters, but the documentation should always be provided if no parameters are given.

General parameters
There are a number of further parameters that apply to most actions. For example, they can be combined to run the JSON action in offline mode (no new downloads), writing the output to a file my-output.json that is compressed with BZ2 (so the actual output file is my-output.json.bz2) and that includes only terms in English and German and no site links.
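A sketch of how this combination might be written on the command line follows. Only --action is confirmed by the text above. The other flag names (--offline, --output, --compression, --fLang, --fSite) and the use of "-" to exclude all site links are assumptions that should be checked against the help message of your client version. The guard around the call is added for illustration: it prints the command instead of failing when wdtk-client.sh is not on your path.

```shell
# Build the argument list. Flag names other than --action are assumptions;
# verify them against the help message of your client version.
set -- --action json \
       --offline \
       --output my-output.json \
       --compression bz2 \
       --fLang en,de \
       --fSite -

if command -v wdtk-client.sh >/dev/null 2>&1; then
    # The wrapper script is available: run the client with these arguments.
    wdtk-client.sh "$@"
else
    # Dry run: show the command that would be executed.
    echo "wdtk-client.sh $*"
fi
```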

The parameters are documented in the help message that the client prints when it is run without arguments.

In addition, there are several parameters that can be used to filter some parts of the data, i.e., to remove some data before it is further processed. This can be useful, e.g., to create dump files that contain only the data for a particular language.

Configuration files: running multiple actions at once
Using the command-line options, only a single action can be configured. However, basic tasks such as decompressing and parsing the JSON file are required by many actions, and it is not efficient to run them three times if three actions need to be performed. For this reason, WDTK Client supports the configuration of multiple actions that will all be performed together in a single run. This is achieved by writing a simple configuration file.

The configuration file is a simple "ini" file. Here is an example:
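A sketch of what such a file might look like, written to disk with a shell here-document so that it can be tried directly. Only the keys action and rdftasks are named elsewhere on this page. The file name, the [general] section for shared settings, the other section names, the remaining keys, the paths, and the task lists are all assumptions for illustration. Running the rdf action without parameters shows the tasks that are actually supported.

```shell
# Write a sample configuration file. Apart from "action" and "rdftasks", all
# names and values (file name, section names, keys, paths, task lists) are
# assumptions for illustration.
cat > wdtk-example.ini <<'EOF'
# General parameters: apply to all actions
[general]
offline=true
dumps=/path/to/dumps
fLang=en,de

# First RDF dump
[rdf-labels]
action=rdf
rdftasks=labels
output=labels.nt
compression=gz

# Second RDF dump
[rdf-statements]
action=rdf
rdftasks=statements
output=statements.nt
compression=gz

# JSON dump
[json-export]
action=json
output=export.json
compression=bz2
EOF
```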

This file configures three actions: two RDF dumps and one JSON dump. In addition, it defines some general parameters at the top of the file that are applied to all actions. The parameter names are exactly the long names of the command-line parameters. To use this configuration file, pass its location to the client with the parameter --config.
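As a short sketch, a run with a configuration file could look like this. The file name wdtk-example.ini is hypothetical, and the guard is added for illustration so that the command is printed rather than failing when wdtk-client.sh is not on your path.

```shell
# wdtk-example.ini is a hypothetical file name; --config is the parameter
# named in the text above.
set -- --config wdtk-example.ini

if command -v wdtk-client.sh >/dev/null 2>&1; then
    # All actions configured in the file are performed in a single run.
    wdtk-client.sh "$@"
else
    # Dry run: show the command that would be executed.
    echo "wdtk-client.sh $*"
fi
```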

You can use additional command-line parameters to specify some settings on the command line rather than in the configuration file. You can also configure one more action through the command-line parameters if you find this useful.

Some parameters can only be used as general parameters (they will always refer to all actions), while others can only be used for individual actions. The example shows this: action configuration, output, and compression are specific to each action, whereas dump location, offline mode, and general filters apply to all actions.