Wikidata Toolkit

From MediaWiki.org

The Wikidata Toolkit is an open-source Java library for using data from Wikidata and other Wikibase sites. Its main goal is to make it easy for external developers to take advantage of this data in their own applications. The project started in early 2014, supported by an Individual Engagement Grant of the Wikimedia Foundation. The original project proposal envisions features for loading data from dumps or through the Web API, as well as query functionalities to access and analyse the data.

This page and its subpages provide the main entry points to documentation and resources about the Wikidata Toolkit.

What is Wikidata? What is Wikibase?

Wikidata is a project of the Wikimedia Foundation that aims to gather data from all Wikipedias and many other projects in a single location. It is a wiki, and anyone can edit the data. If you want to know more about the project, its goals, content, and development, then the introductory article Wikidata: A Free Collaborative Knowledge Base is a good place to start. More details can be found on the project pages at wikidata.org.

The software that runs this site is Wikibase. This is an extension to the MediaWiki software, which is still used underneath. Indeed, Wikidata also has many wikitext pages that co-exist with the data pages that make up most of the content. It is possible to use Wikibase on other sites (the first example of this is the wiki of the EAGLE Project). The Wikidata Toolkit is written to support such sites as well.

How to use Wikidata Toolkit

There are two main ways of using Wikidata Toolkit right now:

  1. As a Java library to process Wikidata content in your application
  2. As a stand-alone command-line client to process Wikidata content

The rest of this page focuses on the use in Java. For more information on the usage of the client, see the Wikidata Toolkit Client documentation. The current version of Wikidata Toolkit can be used to automatically download and process dump-files from Wikidata.org in Java. This is useful if you want to process all data in Wikidata.org in a streaming fashion in Java. Advanced query capabilities will be added in upcoming versions.
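The streaming style of processing can be pictured with a small self-contained sketch. It does not use the Wikidata Toolkit API: the Entity class, the EntityProcessor interface, and the sample values are hypothetical stand-ins, chosen only to show the idea that each entity is handed to a processor once and then discarded, so that only running totals stay in memory. The example documentation shows how the toolkit's real classes are used.

```java
import java.util.Arrays;
import java.util.Iterator;

class StreamingSketch {

    /** Hypothetical stand-in for one entity read from a dump (not a toolkit class). */
    static final class Entity {
        final String id;
        final int statementCount;

        Entity(String id, int statementCount) {
            this.id = id;
            this.statementCount = statementCount;
        }
    }

    /** Hypothetical callback interface: called once for every entity in the stream. */
    interface EntityProcessor {
        void process(Entity entity);
    }

    /** Example processor that keeps running totals only, never the entities themselves. */
    static final class StatisticsProcessor implements EntityProcessor {
        long entities;
        long statements;
        String largestId;
        int largestCount = -1;

        @Override
        public void process(Entity entity) {
            entities++;
            statements += entity.statementCount;
            if (entity.statementCount > largestCount) {
                largestCount = entity.statementCount;
                largestId = entity.id;
            }
        }
    }

    /** Feeds the entities to the processor one at a time, in the order they arrive. */
    static void processAll(Iterator<Entity> source, EntityProcessor processor) {
        while (source.hasNext()) {
            processor.process(source.next());
        }
    }

    public static void main(String[] args) {
        // Made-up sample values; a real source would read entities from a dump file.
        Iterator<Entity> source = Arrays.asList(
                new Entity("Q1", 12), new Entity("Q2", 30), new Entity("Q3", 7)).iterator();
        StatisticsProcessor stats = new StatisticsProcessor();
        processAll(source, stats);
        System.out.println(stats.entities + " entities, " + stats.statements
                + " statements, most statements: " + stats.largestId);
    }
}
```

In a real program, the iterator would be backed by a dump file reader rather than an in-memory list, which is what makes it possible to process data that is far larger than the available memory.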

An example of how to use Wikidata Toolkit in a Java project is provided by the Wikidata examples project. It contains a number of example programs and bots that demonstrate basic functions. The following is a selection; for complete details, see the example documentation.

  • FetchOnlineDataExample reads some live data from Wikidata.org (without having to download a dump file first).
  • EditOnlineDataExample writes data to test.wikidata.org. This shows how to create new items, and how to update existing statements. Wikidata Toolkit will merge duplicate statements for you, combining their references.
  • EntityStatisticsProcessor computes some simple statistics of the current dumps. This will need to download a dump file first (this is done only once for each dump).
  • GreatestNumberProcessor: Which TV series had the largest number of episodes? Find out with this simple program that scans the data for the largest value of a property.
  • LifeExpectancyProcessor: Reveals the average life expectancy of people on Wikipedia and creates a CSV file that you can open in a spreadsheet to draw a graph – the results are surprising. Also shows how to handle times.
  • GenderRatioProcessor: Inspired by the work of Max Klein, this program analyses the distribution of genders on Wikipedia and other Wikimedia projects. Again, some surprising results can be seen from the resulting table in CSV format.
  • JsonSerializationProcessor: This program reads the dumps and writes the data to an output file using the standard JSON format. With some variation, you could create files that contain only selected items that can be used by scripts that understand the JSON format.
  • SitelinksExample: Wikidata Toolkit can also be used to resolve links to Wikimedia projects, e.g., to find the proper URL of an article on German Wikivoyage. This example shows how to do this.
  • RdfSerializationExample: Example program that creates the RDF exports of Wikidata. Such exports can also be created with the Wikidata Toolkit Client.

The code repository contains further documentation on how to run these examples.
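To give an idea of the kind of aggregation that a program like LifeExpectancyProcessor performs, here is a minimal sketch under simplifying assumptions. It is not the code of that example: the class and method names are hypothetical, the input is passed in as plain year numbers rather than read from Wikidata statements, and the life span is approximated as the difference of the two years.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class LifeSpanSketch {

    /** Running totals per birth year: [0] = summed life spans, [1] = number of people. */
    private final Map<Integer, long[]> totals = new TreeMap<>();

    /** Records one person; meant to be called once per item that has both dates. */
    void addPerson(int birthYear, int deathYear) {
        long[] t = totals.get(birthYear);
        if (t == null) {
            t = new long[2];
            totals.put(birthYear, t);
        }
        // Approximation: ignores months and days.
        t[0] += deathYear - birthYear;
        t[1]++;
    }

    /** Renders a header line plus one CSV row per birth year, in ascending order. */
    String toCsv() {
        StringBuilder sb = new StringBuilder("birth_year,people,average_life_span\n");
        for (Map.Entry<Integer, long[]> e : totals.entrySet()) {
            long[] t = e.getValue();
            double average = (double) t[0] / t[1];
            sb.append(e.getKey()).append(',')
              .append(t[1]).append(',')
              // Locale.ROOT keeps the decimal point independent of system settings.
              .append(String.format(Locale.ROOT, "%.1f", average))
              .append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        LifeSpanSketch sketch = new LifeSpanSketch();
        // Made-up sample values, for illustration only.
        sketch.addPerson(1900, 1970);
        sketch.addPerson(1900, 1980);
        sketch.addPerson(1950, 2000);
        System.out.print(sketch.toCsv());
    }
}
```

Because only two numbers are stored per birth year, this kind of aggregation fits well with processing the data in a single streaming pass.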

Download and installation

The current release of Wikidata Toolkit is version 0.7.0. The easiest way of using the library is with Maven. Maven users must add the following dependency to the dependencies in their pom.xml file:

<dependency>
	<groupId>org.wikidata.wdtk</groupId>
	<artifactId>wdtk-dumpfiles</artifactId>
	<version>0.7.0</version>
</dependency>

You need to use Java 1.7 or above. If you are using Maven from Eclipse, this might require a change in your Maven config (see beginner's guide below).

Currently, the following Maven modules (artifacts) are available:

  • wdtk-wikibaseapi: Reading and writing data to Wikidata or any other Wikibase site via the Web API. This can be used to write Java-based bots for Wikidata, or simply to fetch some live data in a program. The example programs show how to read and write data with this library.
  • wdtk-dumpfiles: Downloading and processing dumpfiles. As shown in the examples, this can be used to get Java access to all Wikidata.org data. It could also be used to download and process XML dumpfiles for arbitrary MediaWiki projects, especially for the Wikimedia projects that publish dumps at dumps.wikimedia.org. However, this access would be on the wikitext level; a parser for MediaWiki wikitext is not included.
  • wdtk-datamodel: Representing Wikibase data in Java. This is an implementation of the Wikibase datamodel as used by Wikidata and other Wikibase sites. It also includes all code needed to convert back and forth between the standard JSON format and Java objects.
  • wdtk-rdf: Code for serializing Wikibase data in RDF, the W3C Resource Description Framework.
  • wdtk-storage: Custom data structures that are used by Wikidata Toolkit for storing data in memory.
  • wdtk-util: Utility code that is not specific to any of the other modules.

Most likely, you want to use wdtk-wikibaseapi and/or wdtk-dumpfiles at the current stage (they depend on other modules, which will be downloaded automatically). You could also use wdtk-datamodel alone to represent Wikibase data in your application. However, be aware that some API details may still change until the first stable release.

If you are interested in RDF, you may want to look at the RDF exports for Wikidata that are created regularly using Wikidata Toolkit. This might already be enough for many applications. For more customized RDF exports, however, you could also adapt the underlying code yourself.

If you are not using Maven, you can still download the jars for the above modules manually from Maven Central. Alternatively, there is also a single all-in-one jar available from the Wikidata Toolkit release page. Note that this code depends on other libraries (this is why using Maven is so much simpler).

The source code is hosted at GitHub, where it can be browsed, forked, and downloaded.

Beginner's guide

If you have not worked with Java a lot yet, then the above may still seem somewhat daunting to you. To get started, we suggest you get the free IDE Eclipse to make things a little simpler. You should also install the Eclipse plugin for Maven support as described on our Eclipse setup page (don't worry about git unless you want to use this already).

The following instructions describe how to set up a new project using Wikidata Toolkit from scratch. Alternatively, you could also clone the existing examples project and modify it to your needs.

If you don't know Maven at all, you might want to have a glance at the Maven Getting Started Guide. This is already quite long, so don't read it yet, but it's good to know where to start if you have Maven questions. Anyway, to get started right away, open Eclipse, select File -> New -> Project ... -> Maven -> Maven Project and click "Next". Check "Create simple project" at the top and click "Next". You are then asked for basic data about your new project (the meaning of these fields is explained in the Maven guide). Enter a group id (e.g., "org.example") and an artifact ID (e.g., "my-first-test-project"); the other fields are optional. Click "Finish".

You should now see your new project folder with several standard files pre-created in the Eclipse package explorer on the left. You can browse it to find pom.xml, your Maven configuration. Double-click on this file to open a pom.xml editor. Go to the tab "Dependencies" and click "Add ...". Enter the data given above: group id "org.wikidata.wdtk", artifact id "wdtk-dumpfiles", version "0.7.0" (or whatever is current). Click "OK".

You can look at the "pom.xml" tab to see what this did to the actual file. While you are there, copy and paste the following block to the end of that XML file (just before </project>):

<build>
	<plugins>
		<plugin>
			<!-- Used to set JRE version; will be used by IDEs like Eclipse as the 
				target JRE (default is 1.5) -->
			<groupId>org.apache.maven.plugins</groupId>
			<artifactId>maven-compiler-plugin</artifactId>
			<version>3.1</version>
			<configuration>
				<source>1.7</source>
				<target>1.7</target>
			</configuration>
		</plugin>
	</plugins>
</build>

This tells Eclipse to use Java 1.7, which is required for Wikidata Toolkit. Don't forget to save the file. Right-click your project in the Package explorer on the left, "Maven -> Update project", "OK" to make sure the changed Java version is really set properly for Eclipse. You have thus configured your project to use Wikidata Toolkit in your programs. Right-click on the project on the left, select "Run As -> Maven install" to see if Maven is reasonably happy with this configuration so far.

Your project has no packages or classes yet. You can create them with Eclipse as usual (right click on "src/main/java" -> "New ...") and start hacking away. You can also try out some Wikidata Toolkit example code in your own project (copying just the code from the files rather than importing the whole project). To run the code of EntityStatisticsProcessor (for example), you need to add another dependency to your project (this is used for logging; you can also use something else there; see the example code for how this is configured):

<dependency> 
	<groupId>org.slf4j</groupId>
	<artifactId>slf4j-log4j12</artifactId>
	<version>1.7.6</version>
</dependency>

Add this to your pom.xml (you can use the editor you used before or paste it into the XML right away; but don't forget to save the file). You are now able to run the program: find your class file in the package explorer on the left, right click "Run As -> Java application". If anything fails, run "Run As -> Maven install" or "Maven -> Update project" on the project again first to make sure the configuration works.

If you (really) want to run the example from the command line using Maven without Eclipse, you can do this by changing to the root directory of your project (the one where your "src" folder is) and running:

mvn compile
mvn exec:java -Dexec.mainClass="org.wikidata.wdtk.examples.FetchOnlineDataExample" 

where "org.wikidata.wdtk.examples.FetchOnlineDataExample" is the class you want to run (with its full package name). Note that the command line does not use the same Maven installation that Eclipse is using, so you might need to install "mvn" for your platform first.
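If you are unsure whether your setup works at all, it can help to run a trivial class first, before adding any toolkit code. The following sketch prints the Java version that is actually used and checks that it is at least 1.7. The class name EnvironmentCheck is just a suggestion, and the sketch assumes that the file is saved as EnvironmentCheck.java directly in "src/main/java", i.e., without a package declaration.

```java
class EnvironmentCheck {

    /**
     * Maps a "java.specification.version" value to its major number:
     * older releases report "1.7" or "1.8", newer ones report "9", "11", and so on.
     */
    static int majorVersion(String specVersion) {
        String v = specVersion.startsWith("1.") ? specVersion.substring(2) : specVersion;
        return Integer.parseInt(v);
    }

    public static void main(String[] args) {
        int major = majorVersion(System.getProperty("java.specification.version"));
        if (major >= 7) {
            System.out.println("Running on Java " + major + ", which is recent enough.");
        } else {
            System.out.println("Running on Java " + major + ", but Java 1.7 or above is needed.");
        }
    }
}
```

Under that assumption, the value to pass as exec.mainClass is simply EnvironmentCheck. If this class does not run, the problem lies in the Java or Maven setup rather than in your use of Wikidata Toolkit.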

Getting help

Bugs and feature requests should be reported on GitHub in the Wikidata Toolkit issue tracker. For further discussion, the mailing lists wikidata-l (usage, general requirements) and wikidata-tech (technical discussions, development) should be used.

For convenience, we also provide an online version of the Wikidata Toolkit API documentation for the current development branch. API documentation for the releases ships with the Maven packages and should be accessible in Eclipse as soon as you have configured your dependencies.

Getting involved

Developers are invited to contribute to the toolkit. Developers can download or fork the GitHub repository, and are generally invited to send comments and requirements. The project uses Maven to manage dependencies and to build the code, making it very easy for developers to compile the project. Change to the folder that the source code has been downloaded to and run the following commands to compile and to test the code (this requires Maven 3.0 or later to be installed):

mvn install
mvn test

Maven integration is available for standard Java IDEs.

People

The project is led by Markus Kroetzsch; see also the project team listed in the IEG proposal. The list of contributors can be found at GitHub.