WikidataEntitySuggester

The Wikidata Entity Suggester aims to make the task of adding or editing Items on Wikidata easier by suggesting different entities to the author.

Features
Here is a breakdown of its prime features:


 * Suggest properties to be used in a claim, based on the properties that already exist in the item's claims.
   * The API can take an item's prefixed ID and recommend properties for it.
   * The API can also be fed a list of properties and recommend properties based on that list.
 * Suggest properties to be used in source references, based on the properties that already exist in the claim containing the source ref.
   * The API can take a claim GUID and recommend properties for its source ref.
   * The API can also be fed a list of properties and recommend properties based on that list.
 * Suggest qualifiers for a given property.
 * Suggest values for a given property.

Basic components + Software requirements
The Suggester consists of two main parts - a backend REST API written in Java and a frontend MediaWiki extension containing the API module written in PHP.

The backend consists of a number of parts: it has two Myrrix instances (i.e. two WAR files or Java EE apps running on Tomcat) and another Java EE WAR app (the REST API containing the Recommenders, Servlets etc.). The REST API provides a number of servlets to suggest entities and to ingest datasets (train the recommendation engine). In order to train the recommendation engine, a number of CSV-style datasets need to be generated. Python MapReduce scripts, run on Hadoop through Hadoop Streaming, generate the training datasets from a Wikidata data dump such as wikidatawiki-20130922-pages-meta-current.xml.bz2 from the Wikidata dumps page.
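Once Hadoop is running (see the Hadoop Setup section below), a Hadoop Streaming job of this kind is typically launched along the following lines. This is only a sketch: the streaming jar path, the script names (mapper.py, reducer.py) and the HDFS paths are placeholders, not the project's actual file names.

 # hypothetical invocation -- adjust the jar path, script names and HDFS paths to your setup
 /hadoop/bin/hadoop fs -put wikidatawiki-20130922-pages-meta-current.xml /user/hadoop/dump/
 /hadoop/bin/hadoop jar /hadoop/contrib/streaming/hadoop-streaming-*.jar \
     -input /user/hadoop/dump/ \
     -output /user/hadoop/training-csv \
     -mapper mapper.py \
     -reducer reducer.py \
     -file mapper.py \
     -file reducer.py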

So, the external software required to run the backend API is (assuming Python, Java, PHP etc. are installed and configured as usual on a LAMP server):
 * Apache Tomcat (tested with Apache Tomcat 7.0.39, downloadable here)
 * Hadoop (tested with Hadoop 0.20.2-cdh3u6, downloadable here)

Everything has been tested with Oracle Java build 1.7.0_25-b15. It is recommended that you use Oracle Java 1.7; otherwise Hadoop may give you problems.
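A quick way to check which Java the system will pick up:

 java -version
 # should report a version string starting with 1.7, e.g. java version "1.7.0_25"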

Installation and Setup
Create a directory where you'd like to do all Entity Suggester work and setup.

Tomcat Setup
Let's begin by setting up Tomcat.

 wget http://archive.apache.org/dist/tomcat/tomcat-7/v7.0.39/bin/apache-tomcat-7.0.39.tar.gz
 tar xzf apache-tomcat-7.0.39.tar.gz
 cd apache-tomcat-7.0.39
 mkdir ROOT_backup
 mv webapps/ROOT* ./ROOT_backup/
 rm -rf work

Set some JVM parameters for Tomcat. Size the heap according to your available resources; at least 4 GB is recommended, and you might need 5-6 GB to support both Myrrix instances and the REST API, not counting the memory you should keep aside for Hadoop.
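As a rough sketch, assuming a default Tomcat 7 layout, the JVM options can go into a bin/setenv.sh file (sourced automatically by catalina.sh at startup); the heap sizes below are examples only:

 # apache-tomcat-7.0.39/bin/setenv.sh  (example values; tune -Xmx to your machine)
 export CATALINA_OPTS="-Xms1024m -Xmx5120m"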

There, it's ready.

Hadoop Setup
There is no dearth of information on Hadoop on the internet; it's easy to find helpful tutorials and beginner's guides on what it is and how to set it up. This section is meant to point you to the right places and get you running. You may take a look at this blog post for a more elaborate version of these instructions.

Hadoop consists of four kinds of services:

 * NameNode: stores the metadata for HDFS. Runs on the master node.
 * DataNode: stores and retrieves the actual data in HDFS. Runs on the slave nodes.
 * JobTracker: coordinates and schedules jobs and distributes them to the TaskTracker nodes. Runs on the master node.
 * TaskTracker: runs the tasks, performs the computations and reports progress back to the JobTracker.

Assuming there is only one node with multiple cores (say, 4 cores) in this scenario, here's how to proceed:

Disable iptables if it's enabled, or open up the Hadoop-specific ports.
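If you keep iptables enabled, the RPC ports used by the configuration below and Hadoop's default web UI ports can be opened with rules along these lines (a sketch only; 50070 and 50030 are the stock Hadoop web UI ports, not values set anywhere in this guide):

 sudo iptables -I INPUT -p tcp --dport 54310 -j ACCEPT   # NameNode RPC (fs.default.name)
 sudo iptables -I INPUT -p tcp --dport 54311 -j ACCEPT   # JobTracker RPC (mapred.job.tracker)
 sudo iptables -I INPUT -p tcp --dport 50070 -j ACCEPT   # NameNode web UI (default port)
 sudo iptables -I INPUT -p tcp --dport 50030 -j ACCEPT   # JobTracker web UI (default port)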

Add a user for hadoop and its own directory:

 sudo useradd -m hadoop
 sudo mkdir /hadoop
 sudo chown -R hadoop:hadoop /hadoop

Download the CDH3 Hadoop tarball from Cloudera and set it up:

 sudo su hadoop
 cd /hadoop
 wget http://archive.cloudera.com/cdh/3/hadoop-0.20.2-cdh3u6.tar.gz
 tar xzf hadoop-0.20.2-cdh3u6.tar.gz
 mv hadoop*/* ./
 rm *.tar.gz

Enable passwordless SSH authentication for the hadoop user for localhost, so that you can ssh into localhost from the shell without requiring a password. Remember to chmod the contents of ~hadoop/.ssh/ to 600.
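One common way to do this, as the hadoop user:

 ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
 cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
 chmod 600 ~/.ssh/id_rsa ~/.ssh/authorized_keys
 ssh localhost   # should log you in without prompting for a password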

Edit conf/masters and conf/slaves so that both contain the word "localhost" without quotes.
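For a single-node setup this amounts to (assuming the Hadoop tree lives in /hadoop as above):

 echo localhost > /hadoop/conf/masters
 echo localhost > /hadoop/conf/slaves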

Properly configure the JAVA_HOME variable in conf/hadoop-env.sh. Next, modify the *-site.xml files in the /hadoop/conf directory and add these:

/hadoop/conf/core-site.xml:
 <configuration>
   <property>
     <name>hadoop.tmp.dir</name>
     <value>/home/hadoop/tmp</value>
   </property>
   <property>
     <name>fs.default.name</name>
     <value>hdfs://node1:54310</value>
   </property>
 </configuration>

/hadoop/conf/hdfs-site.xml:
 <configuration>
   <property>
     <name>dfs.replication</name>
     <value>3</value>
     <final>true</final>
   </property>
   <property>
     <name>dfs.permissions</name>
     <value>false</value>
     <final>true</final>
   </property>
 </configuration>

/hadoop/conf/mapred-site.xml:
 <configuration>
   <property>
     <name>mapred.reduce.tasks</name>
     <value>4</value>
   </property>
   <property>
     <name>mapred.job.reuse.jvm.num.tasks</name>
     <value>-1</value>
   </property>
   <property>
     <name>mapred.map.tasks</name>
     <value>4</value>
   </property>
   <property>
     <name>mapred.tasktracker.map.tasks.maximum</name>
     <value>4</value>
   </property>
   <property>
     <name>mapred.tasktracker.reduce.tasks.maximum</name>
     <value>4</value>
   </property>
   <property>
     <name>mapred.job.tracker</name>
     <value>hdfs://node1:54311</value>
   </property>
   <property>
     <name>mapred.child.java.opts</name>
     <value>-Xmx2048m</value>
   </property>
   <property>
     <name>mapred.child.ulimit</name>
     <value>5012m</value>
   </property>
 </configuration>
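With the configuration files in place, a single-node Hadoop 0.20 cluster is usually brought up as follows (run as the hadoop user; format HDFS only once):

 /hadoop/bin/hadoop namenode -format
 /hadoop/bin/start-all.sh
 jps   # should list NameNode, SecondaryNameNode, DataNode, JobTracker and TaskTracker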

Prototype Status
This is a prototype for the Entity Suggester's first and second objectives: suggesting properties and values for a new item on Wikidata. I'll be working on adding this entity suggester to Wikidata and on improving the sorting order of the entity selector for Google Summer of Code 2013.

As of now, Myrrix is used to build a basic model. The optimal value of lambda and the number of features found with the ParameterOptimizer are not being used yet; more experimentation is needed for that.

It's an initial prototype written in Java and PHP, using Myrrix's Java API and Guzzle. The Java backend is a Myrrix instance plus a couple of custom wrapper servlets that are used to push data into the Myrrix instance and get recommendations from it. The PHP client is built on top of Guzzle and exposes a neat PHP API that can be used to query the backend.

Setting it up is easy: basically, fire up Tomcat with the backend WAR file and run a few commands, then use the PHP API to query it. I have included a standalone command-line client jar too. After building from source, you can find it here:

Wiki Pages
Please read these pages in sequence to learn how to set everything up and how it works. The instructions are for Ubuntu, so it should be fairly easy to follow them and set this up on Labs.
 * How to set everything up on linux (must read!)
 * CSV file explanation
 * Using the PHP client (also contains examples)
 * Using the command line client (also contains examples)
 * Which class does what

Acknowledgements

 * Byrial, for sharing the programs used to generate database tables from the Wikidata data dump. The property statistics have also been very helpful. I have written a couple of SQL scripts to generate CSV files as required by the Entity Suggester. The C programs are slow and their char arrays often break, so they are not portable or robust; therefore, I'm migrating to Python MapReduce scripts (run through Hadoop Streaming) that I'm writing myself to parse the wiki dumps.
 * bcc-myrrix, a PHP client for Myrrix built on top of Guzzle. I used its code and modified it to suit my needs for the Entity Suggester PHP client.

Progress Reports
I'll be maintaining monthly and weekly reports on this page.