WikidataEntitySuggester

The Wikidata Entity Suggester aims to make the task of adding or editing Items on Wikidata easier by suggesting different entities to the author.

= Features =

Here is a breakdown of its main features:

 * Suggest properties to be used in a claim, based on the properties that already exist in the item's claims.
 ** The API can take an item's prefixed ID and recommend properties for it.
 ** The API can also be fed a list of properties and recommend properties based on that list.
 * Suggest properties to be used in source references, based on the properties that already exist in the claim containing the source ref.
 ** The API can take a claim GUID and recommend properties for its source ref.
 ** The API can also be fed a list of properties and recommend properties based on that list.
 * Suggest qualifiers for a given property.
 * Suggest values for a given property.

= Basic components + Software requirements =

The Suggester consists of two main parts: a backend REST API written in Java, and a frontend MediaWiki extension written in PHP that contains the API module.

The backend consists of a number of parts: it has two Myrrix instances (i.e. two WAR files or Java EE apps running on Tomcat) and another Java EE WAR app (the REST API containing the recommenders, servlets etc.). The REST API provides a number of servlets to suggest entities and to ingest datasets (train the recommendation engine). In order to train the recommendation engine, a number of CSV-style datasets need to be generated. Python MapReduce scripts have been written, to be run on Hadoop through Hadoop Streaming, that generate the training datasets from a Wikidata data dump like wikidatawiki-20130922-pages-meta-current.xml.bz2.

So, the external software required to run the backend API is (assuming Python, Java, PHP etc. are installed and configured as usual on a LAMP server):
 * Apache Tomcat (tested with Apache Tomcat 7.0.39)
 * Hadoop (tested with Hadoop 0.20.2-cdh3u6)

Everything has been tested with Oracle Java build 1.7.0_25-b15. It is recommended that you use Oracle Java 1.7; otherwise Hadoop will cause problems.

= Setup =

== Software Installation ==

I have detailed setup and installation instructions for Tomcat and Hadoop here.

== Setting up the Entity Suggester ==

Clone the WikidataEntitySuggester repo and build it:

 git clone https://gerrit.wikimedia.org/r/mediawiki/extensions/WikidataEntitySuggester
 cd WikidataEntitySuggester
 mvn install

Copy the built Myrrix WAR files to Tomcat's webapps directory:

 cp myrrix-claimprops/target/myrrix-claimprops.war /webapps/
 cp myrrix-refprops/target/myrrix-refprops.war /webapps/

Check the catalina log file in the Tomcat directory to see whether the Myrrix WARs have been deployed successfully. Check http://machine_ip:8080/myrrix-claimprops/ and http://machine_ip:8080/myrrix-refprops/ to see whether the Myrrix instances are running.

Now, copy the REST API WAR to webapps:

 cp client/target/entitysuggester.war /webapps/

Wait for it to be deployed by the server and check http://machine_ip:8080/entitysuggester/ to see if the welcome page has come up with examples of possible actions.
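If you would rather script this check, here is a minimal Python sketch that polls all three webapps. The host and port (machine_ip:8080) and the context paths are assumptions derived from the WAR file names and the example URLs used later on this page:

 # Hypothetical helper: poll the three deployed webapps and report whether
 # each one responds. The base URL and context paths are assumptions.
 from urllib.request import urlopen
 BASE = "http://machine_ip:8080"
 for app in ("myrrix-claimprops", "myrrix-refprops", "entitysuggester"):
     try:
         with urlopen("%s/%s/" % (BASE, app), timeout=10) as resp:
             print("%s: HTTP %d" % (app, resp.status))
     except Exception as exc:
         print("%s: not reachable (%s)" % (app, exc))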

= Training the Suggester =

Download the latest Wikidata data dump, decompress it and push it to HDFS:

 cd 
 wget http://dumps.wikimedia.org/wikidatawiki/latest/wikidatawiki-latest-pages-meta-current.xml.bz2
 bzip2 -d wikidatawiki-latest-pages-meta-current.xml.bz2
 cd /hadoop
 bin/hadoop dfs -copyFromLocal /wikidatawiki-latest-pages-meta-current.xml /input/dump.xml

You can find two Python scripts in the /wikiparser source directory, called mapper.py and reducer.py. There are docstrings at the beginning that explain how to run these files with Hadoop.

Copy the wikiparser JAR and the aforementioned .py scripts from the source build directory to somewhere the hadoop user can access:

 cp wikiparser/mapper.py ~hadoop/
 cp wikiparser/reducer.py ~hadoop/
 chmod a+x ~hadoop/*.py
 cp wikiparser/target/wikiparser-0.1.jar ~hadoop/

Both files contain six examples each, which can be used to build the six datasets needed to train the entity suggester. The datasets are:
 # To train the claim property suggester
 # To train the claim property suggester for empty items (if no property input is given)
 # To train the source ref property suggester
 # To train the source ref property suggester for empty claims (if no property input is given)
 # To train the qualifier suggester
 # To train the value suggester

Here is one of the six examples, which builds the dataset for training the value suggester:

 bin/hadoop jar contrib/streaming/hadoop*streaming*jar -libjars ~hadoop/wikiparser-0.1.jar \
     -inputformat org.wikimedia.wikibase.entitysuggester.wikiparser.WikiPageInputFormat \
     -input /input/dump.xml -output /output/prop-values \
     -file ~hadoop/mapper.py -mapper '~hadoop/mapper.py prop-values' \
     -file ~hadoop/reducer.py -reducer '~hadoop/reducer.py prop-values'

After the Hadoop job completes, you may copy the output from HDFS to a local file:

 bin/hadoop dfs -cat /output/prop-values/part-* > value-train.csv
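Before ingesting, it can be worth a quick sanity check that the generated file looks like the comma-separated training data you expect; the exact columns are defined by mapper.py and reducer.py. A minimal Python sketch:

 # Sanity-check a generated training file before ingesting it: print the
 # first few lines and count the total. "value-train.csv" is the file
 # produced in the step above.
 count = 0
 with open("value-train.csv") as f:
     for line in f:
         if count < 5:
             print(line.rstrip("\n"))
         count += 1
 print("total lines:", count)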

Now, to train the Entity Suggester for suggesting values, do an HTTP POST with the file's contents in the POST body:

 curl -X POST --data-binary @value-train.csv http://machine_ip:8080/entitysuggester/ingest/values

Similarly, all the other five suggesters can be trained using the /ingest/* servlets.
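The same POST can also be scripted from Python. Here is a minimal sketch using only the standard library; note that of the ingest servlet names, only /ingest/values and /ingest/claimprops appear on this page, so check the entitysuggester welcome page for the exact names of the others:

 # Sketch: POST a training CSV to an /ingest/* servlet, mirroring the curl
 # example above. The base URL is an assumption; adjust it to your host.
 from urllib.request import Request, urlopen
 def ingest(csv_path, servlet, base="http://machine_ip:8080/entitysuggester"):
     with open(csv_path, "rb") as f:
         req = Request("%s/ingest/%s" % (base, servlet), data=f.read())
     with urlopen(req) as resp:
         return resp.read().decode("utf-8", "replace")
 print(ingest("value-train.csv", "values"))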

NOTE: Building the dataset used to suggest global claim properties (the first example case in mapper.py and reducer.py) is a bit different from the rest. In this case, you run the command given in the Python docstrings like this:

 bin/hadoop jar contrib/streaming/hadoop*streaming*jar -libjars ~hadoop/wikiparser-0.1.jar \
     -inputformat org.wikimedia.wikibase.entitysuggester.wikiparser.WikiPageInputFormat \
     -input /input/dump.xml -output /output/global-ipv-pairs \
     -file ~hadoop/mapper.py -mapper '~hadoop/mapper.py /hadoop/global-ipv-pairs' \
     -file ~hadoop/reducer.py -reducer '~hadoop/reducer.py global-ipv-pairs /hadoop/item-prop.csv /hadoop/item-propvalue.csv'

Two CSV files will be generated in the /hadoop directory. item-prop.csv is what we need to train the claim property suggester:

 curl -X POST --data-binary @item-prop.csv http://machine_ip:8080/entitysuggester/ingest/claimprops

You may safely ignore the second CSV file (item-propvalue.csv); as of now it's not being used by the suggester.

= How to use the backend REST API =

The REST API has two types of servlets: suggester (/suggest/*) servlets and ingester (/ingest/*) servlets. Please note that all entity IDs the suggester deals with are prefixed IDs, because the training datasets contain prefixed IDs. The suggester makes no internal assumptions about prefixes or the nature of IDs and treats them as raw strings; it therefore behaves exactly the way the training datasets train it to.

== Suggester Servlets ==

 * Claim property suggester:
 ** This servlet can suggest properties based on a comma-separated input list of properties, like this: <code>/entitysuggester/suggest/claimprops/P41,P24,P345</code>
 ** One can also omit the list of properties to get some default suggestions provided by a popularity-sorted property recommender: <code>/entitysuggester/suggest/claimprops/</code>
 ** A howMany parameter should be added to limit the number of suggestions; it is 0 by default. Example: <code>/entitysuggester/suggest/claimprops/P41,P24,P345?howMany=10</code> or <code>/entitysuggester/suggest/claimprops/?howMany=10</code>


 * Source ref property suggester:
 ** This servlet can suggest properties based on a comma-separated input list of properties, like this: <code>/entitysuggester/suggest/refprops/P143,P248</code>
 ** One can also omit the list of properties to get some default suggestions provided by a popularity-sorted property recommender: <code>/entitysuggester/suggest/refprops/</code>
 ** A howMany parameter should be added to limit the number of suggestions; it is 0 by default. Example: <code>/entitysuggester/suggest/refprops/P143?howMany=10</code> or <code>/entitysuggester/suggest/refprops/?howMany=10</code>

NOTE: The two different property suggesters are trained by different data sets; hence they provide different suggestions.


 * Qualifier property suggester:
 ** This servlet can suggest qualifiers for a mandatory single property input, like: <code>/entitysuggester/suggest/qualifiers/P41</code>
 ** A howMany parameter should be added to limit the number of suggestions; it is 0 by default. Example: <code>/entitysuggester/suggest/qualifiers/P41?howMany=10</code>


 * Value suggester:
 ** This servlet can suggest values for a mandatory single property input, like: <code>/entitysuggester/suggest/values/P41</code>
 ** A howMany parameter should be added to limit the number of suggestions; it is 0 by default. Example: <code>/entitysuggester/suggest/values/P41?howMany=10</code>

== Output Format for Suggester Servlets ==

All the suggester servlets give output in JSON format. As an example, a request like <code>/entitysuggester/suggest/refprops/?howMany=2</code> may yield an output like:

 [["P143",0.9924422],["P248",0.007505652]]

It is an array of arrays, where each constituent array consists of the entity ID (a string) and the relative score (a float).
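Here is a minimal Python sketch of a client consuming this format; the URL follows the claim property examples above, and the property list is just an example input:

 # Sketch: query a suggester servlet and parse the JSON array-of-arrays
 # output. The URL and property list are example inputs, not fixed values.
 import json
 from urllib.request import urlopen
 url = "http://machine_ip:8080/entitysuggester/suggest/claimprops/P41,P24?howMany=5"
 with urlopen(url) as resp:
     suggestions = json.loads(resp.read().decode("utf-8"))
 for entity_id, score in suggestions:
     print("%s\t%.4f" % (entity_id, score))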

== Ingester Servlets ==

All ingest servlets read the datasets from the POST body. As explained in the "Training the Suggester" section above, it's easy to train the suggester by using curl to POST the training file to the servlet. Example:

 curl -X POST --data-binary @value-train.csv http://machine_ip:8080/entitysuggester/ingest/values

= Progress Reports =

I'll be maintaining monthly and weekly reports on this page.