WikidataEntitySuggester/Proposal

Wikidata Entity Suggester

 * Public URL: Entity Suggester
 * Bugzilla report: Entity Suggester : Bug #46555, Entity Selector sort : Bug #45351
 * Mailing list thread: wikitech-l archive link
 * Announcement: proposal announcement

Name and contact information

 * Name: Nilesh Chakraborty
 * Email: nilesh@nileshc.com
 * IRC or IM networks/handle(s):
   * jabber: nilesh@nileshc.com
   * freenode nick: nileshc
 * Location: Kolkata, India (UTC+05:30)
 * Typical working hours: 15:00–24:00 IST (9:30–18:30 UTC)

Summary
Wikidata authors spend a considerable amount of time finding the required properties and their values. This project is meant to make that task easier. Its goal is three-fold: (i) suggest properties relevant to the context (that is, to the item being edited), (ii) suggest values for the recommended properties or for a new property that the author starts with, and (iii) make the sorting mechanism of the entity selector smarter so that more relevant properties appear at the top. A collaborative filtering approach will be used to suggest the properties and do the sorting. To suggest values, a separate approach (collaborative filtering, complex SQL queries) has to be used for each type of property.

Benefit to Wikidata
This project will make adding a new item to Wikidata more efficient and easier for authors: they will receive real-time recommendations for properties and values instead of having to come up with all the properties themselves every time. The ordering of properties under an item will also be improved.

Project overview and implementation ideas
I will describe my initial ideas for each of the three objectives. (Please see the last section on this page for information on a prototype I'm building.)


 * 1) Suggesting properties:
   * a) Write a map-reduce job with Apache Hadoop to parse the latest wikidatawiki pages-meta-current dump and extract user-item pairs with the required metadata. This will be used to train the recommendation engine. Let's call this Dataset 1.
   * b) Feed the user-item pairs and required metadata (if any) in Dataset 1 into Myrrix, where 'user' means a Wikidata item (e.g. New York City) and 'item' means a property. Using collaborative filtering, properties will be recommended to items.
 * 2) Suggesting values: This will need an individual approach for each property or type of property. At first I will write a map-reduce job that parses the latest wikidatawiki pages-meta-current dump, as in objective 1, to yield a different kind of dataset with the unnecessary information stripped off (details will be decided after some more experimentation). Let's call this Dataset 2. It will most probably be fed into an SQL database. (I'll investigate the possibility of a NoSQL one too, but that is unlikely, since JOINs and complex queries may have to be performed on this data.) Let's consider a few examples now:
   * Place-oriented properties like place of birth and country of citizenship: suggest values based on already entered values, i.e. it is highly probable that the place of birth lies within the person's country (or countries) of citizenship.
   * Relationship-related properties such as Father/Son: if the item being edited is already listed as a Father/Son of another item, the corresponding Father or Son field can be suggested easily. It's similar for aunts, uncles, spouses etc.
   * Properties like alma mater, occupation, field of work, employer and any music-related property (namely performer, composer, producer, record label etc.): I think it'll be a good idea to use collaborative filtering to suggest values for properties like these, since they often overlap or show "A is a member of this, so probably A is also a member of that"-style characteristics.
   * Product- and literature-based properties should respond well to a collaborative filtering approach too. I need to experiment on individual types of properties to see which method fits.
 * 3) Improving the sort order of properties on an item page: This can be done with a similar collaborative filtering approach, treating Wikidata item-property pairs as user-item pairs and recommending "items" to "users". However, sorting the order of items in search results will require some other heuristics.
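
As a rough illustration of objectives 1 and 3, the following Python sketch recommends properties to an item from plain co-occurrence counts. It is not Myrrix and does not use its API: Myrrix builds a matrix-factorization model, whereas this swaps in a much simpler "properties that appear together" heuristic just to show the item-as-user, property-as-item framing. The toy data and the names `build_cooccurrence` and `suggest_properties` are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence(item_properties):
    """Count how often each pair of properties appears on the same item."""
    cooccur = defaultdict(Counter)
    for props in item_properties.values():
        for a, b in combinations(sorted(set(props)), 2):
            cooccur[a][b] += 1
            cooccur[b][a] += 1
    return cooccur


def suggest_properties(existing, cooccur, top_n=3):
    """Rank missing properties by co-occurrence with the ones already present."""
    scores = Counter()
    for prop in existing:
        for other, count in cooccur[prop].items():
            if other not in existing:
                scores[other] += count
    # Highest score first; ties are broken by name so the order is stable.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [prop for prop, _ in ranked[:top_n]]


# Hypothetical toy data: Wikidata item ("user") -> properties ("items") it has.
ITEMS = {
    "Douglas Adams": {"place of birth", "country of citizenship", "occupation"},
    "Ada Lovelace": {"place of birth", "country of citizenship", "field of work"},
    "Kolkata": {"country", "population"},
    "New York City": {"country", "population", "mayor"},
}

cooccur = build_cooccurrence(ITEMS)
suggestions = suggest_properties({"place of birth"}, cooccur)
print(suggestions)
```

The same ranked list could in principle drive the entity selector's sort order, since the score already orders candidate properties by relevance to the item being edited.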

Project goals and prime deliverables

 * 1) Build an entity suggester module for suggesting properties for statements, that can be trained with existing datasets.
 * 2) Add support for suggesting values (for a selected few types of properties).
 * 3) Make the sorting mechanism of the entity selector smarter so that more relevant properties appear at the top.

Tentative roadmap/timeline
The following is a breakdown of how I wish to complete this project. Each section, ranging from a few days to more than a week, has a tentative time period or deadline and a specific task to be done:

About you
I am a third-year undergraduate student of computer science, pursuing my B.Tech degree. In short, I love programming, and it's pretty much what I do all day unless I'm busy with something else! I have unending enthusiasm for anything related to big data, data mining, machine learning and recommendation engines, and I like researching those topics because I'm passionate about them.

Finding the idea of building an entity suggester for Wikidata on the MediaWiki GSoC ideas page was pure serendipity. If I could build something that makes the job easier for Wikidata authors and lets them become more efficient, it would be nothing short of fabulous. Since I have thorough experience with recommendation engines (both Apache Mahout and Myrrix), I believe that I can use my skills to the fullest and make the entity suggester quite possibly "the most awesomest wiki enhancement ever". :-)

Participation
I will post weekly or bi-weekly on my blog at nileshc.com about my progress on the project, the status of the milestones, etc., and communicate with my mentor and the community via the wikitech-l mailing list.

Honestly, though, I'm not much of a blogger and prefer to focus on working, with only a moderate amount of interaction.

I will use GitHub to track the source code and will share the repo link with my mentors and the community once it's set up.

Past open source experience
Honestly, I do not have a lot of published open source code. I am currently working on a Facebook friend-suggester that recommends friends based on the semantic similarity of their interests. Previously, I worked in a team of four on building an online interactive social college magazine from scratch (using Java EE/JSF, WebSphere and a DB2 server), and I designed the database schema for it. Unfortunately, it never reached completion. The database schema and use-case diagrams I designed are available here.

Currently I'm trying to work on the following bugs: 21139, 6653, 21299

Any other info
Byrial has written a few C programs that have turned out to be really helpful to me. Please check out this link: http://www.wikidata.org/wiki/User:Byrial

I've generated a CSV file with item-property pairs (~44 MB) to use with the recommendation engine. I'm writing a simple PHP file to accept a few parameters and call the engine. I tried to host this on a remote VPS that I currently have access to, but unfortunately its burst RAM only goes up to 1 GB, and the recommendation engine alone currently uses a heap of 1100 MB. I'll experiment on my own machine and share the results here soon.
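
For readers who want to picture the data, here is a minimal Python sketch of loading such item-property pairs into memory. The actual layout of the CSV file is not described above, so the two-column `item,property` format, the sample rows and the helper name `load_item_properties` are all assumptions made for illustration.

```python
import csv
import io
from collections import defaultdict


def load_item_properties(lines):
    """Group (item, property) rows into a dict: item ID -> set of property IDs."""
    item_properties = defaultdict(set)
    for row in csv.reader(lines):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        item_id, property_id = row[0].strip(), row[1].strip()
        item_properties[item_id].add(property_id)
    return dict(item_properties)


# Assumed sample rows (item ID, property ID); the real file may look different.
SAMPLE = io.StringIO("Q42,P19\nQ42,P27\nQ60,P17\nQ42,P19\n")
pairs = load_item_properties(SAMPLE)
print(sorted(pairs["Q42"]))
```

Using a set per item drops duplicate pairs, which keeps a later co-occurrence or recommendation step from counting the same statement twice.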