WikidataEntitySuggester/Progress

Monthly reports

I'll be dividing each report into "Things done" and "Things to do" sections, the former being what I did over the month, the latter being my immediate goals.

June

Things done:

  1. Did some research on techniques to provide recommendations for values.
  2. Finished the documentation for the Wikidata Entity Suggester prototype; I'm in the middle of posting it on this page and its sub-pages.
  3. Set up the Gerrit repository and got access to an m1.large Labs instance, wikidata-suggester.
  4. In May I wrote MapReduce scripts in Python for use with Disco, to replace the C programs that Byrial shared (which parse the wiki dump and generate a CSV file and database tables), since the C programs sometimes broke when fields exceeded certain size limits. Because Disco has an Erlang dependency, I changed the scripts to run on Hadoop through Hadoop Streaming. I've configured Hadoop on the wikidata-suggester server and tested the scripts; a sketch of the Streaming contract they follow appears after this list.
  5. Transferred the prototype code to Gerrit. I have yet to push the PHP client.
  6. Made a few changes to my GSoC proposal page to reflect the new developments (addition of the wiki pages, the extension page, etc.).
  7. Partially deployed the prototype on the Labs server; it should be finished in a couple of days. The instance now has a public IP, and I have opened a few ports to monitor Hadoop, Myrrix, etc.
  8. Created the extension page for the entity suggester here.
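
To make the Hadoop Streaming point in item 4 concrete, here is a minimal mapper sketch showing the contract the Python scripts follow: read records from stdin, write tab-separated key/value pairs to stdout, and let Hadoop sort by key before the reducer runs. The one-record-per-line, comma-separated input format here is a placeholder, not the actual dump format.

```python
#!/usr/bin/env python
"""Minimal Hadoop Streaming mapper sketch. The input format (one
comma-separated record per line) is a placeholder; the real scripts
parse the Wikidata dump, which is more involved."""
import sys

for line in sys.stdin:
    fields = line.strip().split(",")
    if not fields or not fields[0]:
        continue
    # Emit key<TAB>value; Hadoop Streaming sorts these lines by key
    # before piping them into the reducer.
    print("%s\t%s" % (fields[0], ",".join(fields[1:])))
```

A mapper/reducer pair like this runs with something along the lines of `hadoop jar hadoop-streaming.jar -input <in> -output <out> -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py` (jar name and paths illustrative).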

Things to do:

  1. Add some functionality to the MapReduce scripts to create database tables.
  2. Finish deploying the prototype on labs.
  3. Receive feedback, ask for new ideas/features.
  4. Do more research on recommendation and case-based reasoning, and write code.


July

Things done:

  1. Almost finished the property suggester; two servlets and some code review remain.
  2. Wrote tests for the Java-side backend with the property suggester. (Higher-level HTTP-based tests may be written later too.)
  3. Added code docs for the Python MapReduce scripts and the Java classes.
  4. Made a few small improvements to the Java code and removed the value-suggestion code and extra dependencies.
  5. Wrote MapReduce scripts for counting the property frequencies for a) all items and b) source references (see the sketch after this list).
  6. Planned how to implement value and qualifier suggestions, and briefly discussed the requirements of the MediaWiki PHP-side API to be written for the entity suggester.
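
To illustrate item 5: assuming the mapper emits one `property_id<TAB>1` line per statement (a simplification of the real scripts), counting frequencies reduces to summing per key, since Hadoop delivers the mapper output to the reducer sorted by key.

```python
#!/usr/bin/env python
"""Sketch of a Hadoop Streaming reducer that sums the 1s emitted by
the mapper for each property ID. Assumes well-formed, key-sorted
"property_id<TAB>count" input lines."""
import sys

current, count = None, 0
for line in sys.stdin:
    prop, _, n = line.rstrip("\n").partition("\t")
    if prop != current:
        # Key changed: flush the finished property's total.
        if current is not None:
            print("%s\t%d" % (current, count))
        current, count = prop, 0
    count += int(n)
if current is not None:
    print("%s\t%d" % (current, count))
```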

Things to do:

  1. Write Java servlets that suggest properties for empty items (items that don't have any properties yet) and for empty source references. Initially this will be a naive implementation that fetches the top N properties from the two ordered property-frequency lists (see item 5 above, and the sketch after this list). This should finish the property suggester.
  2. Work on the MediaWiki API: read the docs, determine what needs to be done, and get started.
  3. Implement value and qualifier suggestions. Adding features like these should be simple, since one backend feature is already complete.
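
The servlets in item 1 will be written in Java; purely to illustrate the naive logic, here is a Python sketch that reads the top N entries from a frequency list, assuming a `property_id<TAB>count` file already sorted by count in descending order (the file name is hypothetical).

```python
def top_n_properties(path="property-frequencies.tsv", n=10):
    """Return the first n property IDs from a frequency list that is
    assumed to be pre-sorted by count, highest first."""
    with open(path) as f:
        return [line.rstrip("\n").split("\t", 1)[0]
                for line, _ in zip(f, range(n))]
```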


August to mid-September

  1. Made thorough changes to the code for the backend Java REST API; the backend can now be trained and can suggest properties for claims, source references, and qualifiers, as well as values for a given property (see the illustrative client call after this list).
  2. Wrote tests and code-level docs for the Java REST API.
  3. Organized the Python scripts into two all-encompassing modules, mapper.py and reducer.py, complete with documentation.
  4. Tested all features of the REST API on the wikidata-suggester test instance.
  5. The bulk of the coding for the MediaWiki API module is complete; some code reviews remain.
  6. The demonstration of the PHP API module is not finished, and tests need to be written.
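
To give a feel for how a client might exercise the trained backend, here is a hypothetical query against the REST API. The host, port, endpoint path, and parameter names are illustrative assumptions, not the deployed interface.

```python
import json
from urllib.request import urlopen

def suggest_properties(item_id, limit=10,
                       base="http://localhost:8080/suggest/properties"):
    """Hypothetical call asking the suggester backend for the top
    `limit` property suggestions for an item. Every name here is
    illustrative, not the actual API."""
    url = "%s?item=%s&limit=%d" % (base, item_id, limit)
    with urlopen(url) as resp:
        return json.load(resp)

# e.g. suggest_properties("Q42")
```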