WikidataEntitySuggester

The Wikidata Entity Suggester aims to make the task of adding or editing Items on Wikidata easier by suggesting different entities to the author.

= Features =

Here is a breakdown of its prime features:


 * Suggest properties to be used in a claim, based on the properties that already exist in the item's claims.
 ** The API can take an item's prefixed ID and recommend properties for it.
 ** The API can also be fed a list of properties and recommend properties based upon the list.
 * Suggest properties to be used in source references, based on the properties that already exist in the claim containing the source ref.
 ** The API can take a claim GUID and recommend properties for its source ref.
 ** The API can also be fed a list of properties and recommend properties based upon the list.
 * Suggest qualifiers for a given property.
 * Suggest values for a given property.

= Basic components + Software requirements =

The Suggester consists of two main parts: a backend REST API written in Java, and a frontend MediaWiki extension (written in PHP) containing the API module.

The backend itself has several parts: two Myrrix instances (i.e. two WAR files, or Java EE apps, running on Tomcat) and another Java EE WAR app, the REST API containing the Recommenders, Servlets etc. The REST API provides a number of servlets to suggest entities and to ingest datasets (train the recommendation engine). In order to train the recommendation engine, a number of CSV-style datasets need to be generated. Python MapReduce scripts, run on Hadoop through Hadoop Streaming, generate these training datasets from a Wikidata data dump like wikidatawiki-20130922-pages-meta-current.xml.bz2 on this page.

So, the external software required to run the backend API is (assuming Python, Java, PHP etc. are installed and configured as usual on a LAMP server):
 * Apache Tomcat (tested with Apache Tomcat 7.0.39, downloadable here)
 * Hadoop (tested with Hadoop 0.20.2-cdh3u6, downloadable here)

Everything has been tested with Oracle Java build 1.7.0_25-b15. It is recommended that you use Oracle Java 1.7; otherwise Hadoop may cause problems.

= Setup =

== Software Installation ==

I have detailed setup and installation instructions for Tomcat and Hadoop here.

== Setting up the Entity Suggester ==

Clone the WikidataEntitySuggester repo and build it:

 git clone https://gerrit.wikimedia.org/r/mediawiki/extensions/WikidataEntitySuggester
 cd WikidataEntitySuggester
 mvn install

Copy the built Myrrix war files to Tomcat's webapps directory:

 cp myrrix-claimprops/target/myrrix-claimprops.war /webapps/
 cp myrrix-refprops/target/myrrix-refprops.war /webapps/

Check the catalina log file in the Tomcat directory to see whether the Myrrix WARs have been deployed successfully, and confirm that both Myrrix instances are running.

Now, copy the REST API war to webapps:

 cp client/target/entitysuggester.war /webapps/

Wait for it to be deployed by the server, then check that the welcome page comes up with examples of possible actions.

= Training the Suggester =

Download the latest Wikidata data dump, decompress it and push it to HDFS:

 wget http://dumps.wikimedia.org/wikidatawiki/latest/wikidatawiki-latest-pages-meta-current.xml.bz2
 bzip2 -d wikidatawiki-latest-pages-meta-current.xml.bz2
 cd /hadoop
 bin/hadoop dfs -copyFromLocal /wikidatawiki-latest-pages-meta-current.xml /input/dump.xml

You can find two Python scripts in the /wikiparser source directory, called mapper.py and reducer.py. There are docstrings at the beginning that explain how to run these files with Hadoop.
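The Hadoop Streaming contract these scripts follow is simple: the mapper reads raw lines on stdin and emits tab-separated key/value lines, Hadoop sorts them by key, and the reducer reads the sorted stream. A minimal Python sketch of that contract (the function bodies are illustrative placeholders, not the actual mapper.py/reducer.py logic):

```python
from itertools import groupby

def map_lines(lines):
    """Mapper step: emit one tab-separated "key\t1" line per occurrence.
    (Illustrative placeholder; the real mapper.py parses the XML dump.)"""
    for line in lines:
        for token in line.split():
            if token.startswith("P"):
                yield "%s\t1" % token

def reduce_lines(lines):
    """Reducer step: sum counts per key. Hadoop Streaming sorts the
    mapper output by key, so equal keys are adjacent and groupby() works."""
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield "%s\t%d" % (key, sum(int(count) for _, count in group))

# With Hadoop Streaming, each script reads sys.stdin and prints its
# output lines; Hadoop handles the shuffle/sort between the two steps.
```

The docstrings in the real mapper.py and reducer.py remain the authoritative usage reference.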

There are different ways you may run Hadoop. The method mentioned in this document relies on the Cloudera tarball, without setting Hadoop environment variables. If you install Hadoop through RPMs or debs instead, hadoop will already be in your path, and you may have to modify the runhadoop.sh shell script in the wikiparser source directory accordingly.

Assuming you're going to run Hadoop from /hadoop, change the hadoop and hadoop_command variables in runhadoop.sh to "/hadoop" and "bin/hadoop". (Yes, it has to be bin/hadoop, run with /hadoop as the present directory, or else it'll break with access problems.) Make sure the hadoop user has permission to access /hadoop.

 cp wikiparser/*.py /hadoop/
 cp wikiparser/*.sh /hadoop/
 chmod a+x /hadoop/*.py /hadoop/*.sh
 cp -rf wikiparser/target /hadoop/
 chown -R hadoop:hadoop /hadoop
 cd /hadoop
 ./runhadoop.sh global-ip-pairs /input/dump.xml /output/global-ip-pairs ./train-claimprops.csv

Alternatively, if the hadoop binaries are already in your path and everything is set up correctly, you may not need to modify the script at all and can run it directly.

= How to use the backend REST API =

The REST API has two types of servlets - suggester (/suggest/*) servlets and ingester (/ingest/*) servlets. Please note that all entity IDs that the suggester deals with are prefixed IDs because the training datasets contain prefixed IDs. The suggester makes no internal assumption of prefixes or the nature of IDs and treats them as raw strings. Therefore, it'll behave just the way it is trained to by the training datasets.
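As a sketch, the suggest endpoints can be queried from Python like this (the localhost:8080 base URL and the howMany value are assumptions; adjust them for your Tomcat deployment):

```python
import json
from urllib.request import urlopen

# Assumed base URL; adjust host/port for your Tomcat setup.
BASE = "http://localhost:8080"

def suggest_url(servlet, properties, how_many):
    """Build a suggester URL such as
    /entitysuggester/suggest/claimprops/P41,P24?howMany=5 under BASE."""
    return "%s/entitysuggester/suggest/%s/%s?howMany=%d" % (
        BASE, servlet, ",".join(properties), how_many)

def fetch_suggestions(servlet, properties, how_many=5):
    """Query the servlet and decode its JSON array-of-arrays response."""
    with urlopen(suggest_url(servlet, properties, how_many)) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Because the suggester treats IDs as raw strings, the property list is passed through verbatim, prefixes and all.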

== Suggester Servlets ==

 * Claim property suggester:
 ** This servlet can suggest properties based on a comma-separated input list of properties, like this: <code>/entitysuggester/suggest/claimprops/P41,P24,P345</code>
 ** One can also omit the list of properties to get some default suggestions provided by a popularity-sorted property recommender.
 ** A howMany parameter should be added to limit the number of suggestions; it is 0 by default.


 * Source ref property suggester:
 ** This servlet can suggest properties based on a comma-separated input list of properties.
 ** One can also omit the list of properties to get some default suggestions provided by a popularity-sorted property recommender.
 ** A howMany parameter should be added to limit the number of suggestions; it is 0 by default.

NOTE: The two different property suggesters are trained by different data sets; hence they provide different suggestions.


 * Qualifier property suggester:
 ** This servlet suggests qualifiers for a mandatory single property input.
 ** A howMany parameter should be added to limit the number of suggestions; it is 0 by default.


 * Value suggester:
 ** This servlet suggests values for a mandatory single property input.
 ** A howMany parameter should be added to limit the number of suggestions; it is 0 by default.

== Output Format for Suggester Servlets ==

All the suggester servlets give output in JSON format. As an example, a suggest request may yield an output like:

 [["P143",0.9924422],["P248",0.007505652]]

It is an array of arrays, where each constituent array consists of the entity ID (string) and the relative score (float).
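A client can decode and rank such a response as follows (the scores here are the example values from above):

```python
import json

# A response body in the format described above.
body = '[["P143", 0.9924422], ["P248", 0.007505652]]'

suggestions = json.loads(body)
# Each element is [entity_id, relative_score]; the servlet output is
# already ranked, but sorting defensively costs little.
ranked = sorted(suggestions, key=lambda pair: pair[1], reverse=True)
top_id, top_score = ranked[0]
```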

== Ingester Servlets ==

All ingest servlets read the datasets from the POST body. As explained in the "Training the Suggester" section above, it's easy to train the suggester by using curl to POST the training file to the servlet. Example:

 curl -X POST --data-binary @value-train.csv http://machine_ip:8080/entitysuggester/ingest/values
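The same ingestion can be done from Python. This sketch only builds the POST request, mirroring the curl example; the base URL and servlet name are deployment-specific assumptions:

```python
from urllib.request import Request, urlopen

def ingest_request(base_url, servlet, csv_path):
    """Build a POST request whose body is the raw training CSV,
    equivalent to curl's --data-binary @file upload."""
    with open(csv_path, "rb") as f:
        body = f.read()
    return Request("%s/entitysuggester/ingest/%s" % (base_url, servlet),
                   data=body)

# To actually send it:
#   urlopen(ingest_request("http://machine_ip:8080", "values",
#                          "value-train.csv"))
```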

= Progress Reports =

I'll be maintaining monthly and weekly reports on this page.