WikidataEntitySuggester/Proposal

Wikidata Entity Suggester

Name and contact information

  • Name: Nilesh Chakraborty
  • Email: nilesh@nileshc.com
  • IRC or IM networks/handle(s):
    • jabber: nilesh@nileshc.com
    • freenode nick: nileshc
  • Location: Kolkata, India (UTC+05:30)
  • Typical working hours: 15:00–24:00 IST (9:30–18:30 UTC)

Synopsis

Summary

Wikidata authors spend a considerable amount of time finding the right properties and the values for them. This project aims to make that task easier. It has three goals: (i) suggest properties relevant to the context, which depends on the item being edited; (ii) suggest values for the recommended properties, or for a new property that the author starts with; and (iii) make the sorting mechanism of the entity selector smarter so that more relevant properties appear at the top. A collaborative filtering approach will be used to suggest the properties and do the sorting. To suggest values, a separate approach (collaborative filtering or complex SQL queries) will have to be chosen for each type of property.

Benefit to Wikidata

This project will make adding a new item to Wikidata much easier and more efficient for authors, since they will receive real-time recommendations for properties and values instead of having to come up with all the properties themselves every time. The ordering of properties under an item will also improve.

Project overview and implementation ideas

I will describe my initial ideas for each of the three objectives. (Please see the last section of this page for information on a prototype I'm building.)

  1. Suggesting properties :
    1. Write a map-reduce job with Apache Hadoop to parse the latest wikidatawiki pages-meta-current dump and extract user-item (item-property) pairs with the required metadata. These pairs will be used to train the recommendation engine. Let's call this Dataset 1.
    2. Feed the user-item pairs and required metadata (if any) in Dataset 1 into Myrrix, where 'user' means a Wikidata item (e.g. New York City) and 'item' means a property. Using collaborative filtering, properties will then be recommended for items.

  2. Suggesting values :
    1. This will need individual approaches for each property or type of property. At first I will write a map-reduce job that parses the latest wikidatawiki pages-meta-current dump, as in objective 1, to yield a different kind of dataset with the unnecessary information stripped off (details will be decided after some more experimentation). Let's call this Dataset 2. It will most probably be loaded into an SQL database. I'll investigate the possibility of a NoSQL one too, but that is unlikely to fit, since JOINs and complex queries may have to be run on this data.
      Let's consider a few examples now:
      1. Place-oriented properties like place of birth and country of citizenship: suggest values based on values that have already been entered, i.e. it is highly probable that a person's place of birth lies within one of their countries of citizenship.
      2. Relationship-related properties (e.g. father/son): if the item being edited is already listed as the father or son of another item, the corresponding father or son field can easily be suggested. The same applies to aunts, uncles, spouses, etc.
      3. Properties like alma mater, occupation, field of work, employer and any Music-related property (namely performer, composer, producer, record label etc) : I think it'll be a good idea to use collaborative filtering to suggest values to properties like these since they often overlap or show "A is a member of this, so probably A is also a member of that"-style characteristics.
      4. Product and Literature based properties should respond well to a collaborative filtering approach too. I need to experiment on individual types of properties to see which method fits.

    2. Investigate using the same collaborative filtering method as in objective 1 (suggesting properties).

  3. Improving the sorting order of properties on an item page : This can be done with a similar collaborative filtering approach, treating Wikidata item-property pairs as user-item pairs and recommending "items" to "users".
    Sorting the order of items in search results, however, will require some other heuristics.
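
To make the "item as user, property as item" framing in objectives 1 and 3 more concrete, here is a minimal sketch in Python. It is not the Myrrix recommender: a plain co-occurrence count is swapped in purely for illustration, and the function names and the toy data are hypothetical. Properties that often appear together with the ones an item already has are ranked highest, and the same ranking could serve as a sort order.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence(item_properties):
    """Count how often each pair of properties appears on the same item."""
    counts = defaultdict(lambda: defaultdict(int))
    for props in item_properties.values():
        for a, b in combinations(sorted(set(props)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def suggest_properties(item_props, counts, top_n=3):
    """Rank properties the item lacks by co-occurrence with those it has."""
    present = set(item_props)
    scores = defaultdict(int)
    for prop in present:
        for other, n in counts.get(prop, {}).items():
            if other not in present:
                scores[other] += n
    # Sort by descending score, then by name so the order is deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [prop for prop, _ in ranked[:top_n]]


# Hypothetical toy data: item -> properties already used on it.
ITEMS = {
    "person_a": ["place of birth", "country of citizenship", "occupation"],
    "person_b": ["place of birth", "country of citizenship", "alma mater"],
    "person_c": ["place of birth", "occupation", "employer"],
    "album_a": ["performer", "record label"],
}

COUNTS = build_cooccurrence(ITEMS)
print(suggest_properties(["place of birth"], COUNTS))
```

With this toy data, an item that so far only has "place of birth" gets person-related properties suggested and none of the music-related ones. A real recommender would replace the raw counts with a trained model, but the input (item-property pairs) and the output (a ranked list of properties) would have the same shape.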

Deliverables

Project goals and prime deliverables

  1. Build an entity suggester module for suggesting properties for statements, that can be trained with existing datasets.
  2. Add support for suggesting values (for a selected few types of properties).
  3. Make the sorting mechanism of the entity selector smarter so that more relevant properties appear at the top.

Tentative roadmap/timeline

The following is a breakdown of how I plan to complete this project. Each stage, lasting from a few days to more than a week, has a tentative time period or deadline and a specific task to be done:

Time period | Tasks
Up to May 21 | Research: decide upon the optimal implementation methods for each objective. Work out more implementation details and map-reduce job details. Get familiar with the MediaWiki source code and with how to integrate new functionality or Java .jar files into Wikidata.
May 21 - June 11 | University exams. I will be available during this period, but will not be able to start coding. We can use this time for the necessary discussions and to work out more details of my objectives.
June 11 - June 20 | Set up Apache Hadoop on the MediaWiki vagrant server. Write the mapper and reducer for objective 1 to parse the Wikidata dump.
June 21 - June 25 | Feed user-item pairs from the reducer output into Myrrix and try out recommendations. Ask mentors and the community about documentation on integration into Wikidata. Write a simple JavaScript recommender client that will be inserted into the item page. It will call the Myrrix REST API via AJAX and retrieve recommendations.
June 26 - June 30 | Test usability with mentors and the community. Write documentation. Fix bugs.
July 1 - July 7 | Write another map-reduce job to parse the data dump and start adding support for suggesting values, beginning with music-related properties.
July 8 - July 11 | Test usability. Fix bugs. Write documentation. Consult with mentors regarding the accuracy of suggestions.
July 12 - July 20 | Add support for suggesting values for product- and literature-related properties.
July 21 - July 23 | Polish code and fix bugs, if any.
July 24 - July 28 | Add support for relationship- and place-oriented properties. Plan for other kinds of properties.
July 29 - August 6 | Buffer week. Implement support for other properties if planned.
August 7 - August 16 | Experiment and decide on the best methods for smart sorting of properties and items.
August 17 - August 22 | Implement smart sorting of properties.
August 23 - August 28 | Implement smart sorting of items.
August 29 - September 6 | Finish integration with the Wikidata frontend.
September 6 - September 13 | Write documentation. Check bug reports and fix them. Test usability.
September 14 - September 22 | Finishing touches: last-minute polishing, bug fixes, writing unit tests, improving documentation if needed.
Post GSoC | Optimize recommendation scripts if needed. Add support for more properties for value suggestion. Try making the property/item sorter more intelligent.

About you

I am a third-year undergraduate student of computer science, pursuing my B.Tech degree. In short, I love programming, and it's pretty much what I do all day when I'm not busy with something else! I have unending enthusiasm for anything related to big data, data mining, machine learning and recommendation engines, and I enjoy researching those topics because I'm passionate about them.

Finding the idea of building an entity suggester for Wikidata on the MediaWiki GSoC ideas page was pure serendipity. If I could build something that makes the job easier for Wikidata authors and lets them become more efficient, it would be nothing short of fabulous. Since I have thorough experience with recommendation engines (both Apache Mahout and Myrrix), I believe that I can use my skills to the fullest and make the entity suggester quite possibly "the most awesomest wiki enhancement ever". :-)

Participation

I will make a weekly or bi-weekly post on my blog at nileshc.com about my progress on the project, the status of the milestones etc., and communicate with my mentor and the community via the wikitech-l mailing list. I will set up a page under my User page and make the same post there too. I will maintain documentation, HOW-TOs etc. on the Entity Suggester's wiki page. I'll post my monthly and weekly reports on this page.

Honestly, though, I'm not much of a blogger and prefer to just focus on working, with only a moderate amount of interaction.

I would prefer to use this Gerrit repository to track the source code.

Past open source experience

Honestly, I do not have a lot of published open source code. I am currently working on a Facebook friend-suggester that recommends friends based on the semantic similarity of their interests. Previously, as part of a team of 4, I worked on building an online interactive social college magazine from scratch (using Java EE/JSF, WebSphere and a DB2 server) and designed the database schema for it. Unfortunately, it was never completed. The database schema and use-case diagrams I designed are available here.

Any other info

Byrial has written a few C programs that have turned out to be really helpful to me. Please check out this link: http://www.wikidata.org/wiki/User:Byrial. Unfortunately, the last couple of Wikidata dumps seem to break those C programs, so I wrote my own Hadoop MapReduce scripts in Python.
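
For readers unfamiliar with this style of script, the following is a rough sketch of a map step and a reduce step in the Hadoop Streaming pattern (lines in, tab-separated key/value lines out). It is not the code from my repository. It assumes a simplified, pre-processed input with one JSON object per line carrying an "id" and a "claims" mapping, which is not the actual dump format, and the function names and sample IDs are illustrative only.

```python
import json
from itertools import groupby


def map_lines(lines):
    """Mapper: emit 'item<TAB>property' for each property used on an item."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except ValueError:
            continue  # skip malformed records instead of failing the job
        if not isinstance(record, dict):
            continue
        item_id = record.get("id")
        for prop in sorted(record.get("claims") or {}):
            yield f"{item_id}\t{prop}"


def reduce_lines(sorted_lines):
    """Reducer: collapse sorted 'item<TAB>property' lines to one line per item."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for item_id, group in groupby(pairs, key=lambda kv: kv[0]):
        props = sorted({prop for _, prop in group})
        yield f"{item_id}\t{','.join(props)}"


# Hypothetical sample input in the assumed one-JSON-object-per-line format.
SAMPLE = [
    '{"id": "Q1", "claims": {"P19": [], "P27": []}}',
    "not json",
    '{"id": "Q2", "claims": {"P106": []}}',
]

# Hadoop sorts mapper output by key before the reducer runs; sorted() mimics that.
mapped = sorted(map_lines(SAMPLE))
print(list(reduce_lines(mapped)))
```

In an actual streaming job the two functions would read from standard input and print to standard output, with Hadoop doing the sorting between them. Keeping them as plain generators makes them easy to test locally on a small sample first.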

This GitHub repo was the initial place where I began prototyping. I've moved my code to Gerrit. Please check the entity suggester's extension page which I'll update once the project matures.