I'll be dividing each report into "Things done" and "Things to do" sections, the former being what I did over the month, the latter being my immediate goals.
- Did some research on techniques to provide recommendations for values.
- Finished the documentation for the Wikidata Entity Suggester prototype; I'm in the middle of posting it at this page and its sub-pages.
- Set up the Gerrit repository and got access to an m1.large Labs instance, wikidata-suggester.
- In May I had written MapReduce scripts in Python, to be used with Disco, to replace the C programs Byrial shared (which parse the wiki dump and generate a CSV file and database tables), since the C programs sometimes broke when fields overshot certain limits. Disco has an Erlang dependency, so I decided to rework the scripts to run on Hadoop through Hadoop Streaming. I've configured Hadoop on the wikidata-suggester server and tested the scripts.
- Transferred the prototype code to Gerrit; I have yet to push the PHP client.
- Made a few changes to my GSoC proposal page to reflect the new developments (addition of the wiki pages, the extension page, etc.).
- Partially deployed the prototype on the Labs server; it should be finished in a couple of days. The instance now has a public IP, and I've opened a few ports to monitor Hadoop, Myrrix, etc.
- Created the extension page for the entity suggester here.
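The Hadoop Streaming scripts mentioned above talk to Hadoop purely over stdin/stdout. A minimal sketch of that contract, assuming tab-separated (entity, property) records — the field names are illustrative, not the actual scripts:

```python
"""Illustrative sketch of the Hadoop Streaming contract: the mapper
reads dump records on stdin and writes "key<TAB>value" lines; Hadoop
sorts by key and feeds consecutive groups of lines to the reducer."""
from itertools import groupby

def map_records(lines):
    """Emit (entity, property) pairs from tab-separated dump records,
    skipping malformed rows instead of crashing on odd field counts
    (the failure mode the original C programs had)."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            yield fields[0], fields[1]

def reduce_pairs(pairs):
    """Count occurrences per key. The input must already be sorted by
    key, which Hadoop guarantees between the map and reduce phases."""
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield key, sum(1 for _ in group)
```

In a real Streaming job the two functions would live in separate executables passed to `hadoop-streaming.jar` via `-mapper` and `-reducer`, each reading `sys.stdin` and printing tab-separated lines.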
Things to do:
- Add some functionality to the MapReduce scripts to create database tables.
- Finish deploying the prototype on Labs.
- Receive feedback, ask for new ideas/features.
- Do more research on recommendation and case-based reasoning, and write code.
- Almost finished the property suggester; two servlets and some code review remain.
- Wrote tests for the Java-side backend with the property suggester. (Higher-level HTTP-based tests may be written later, too.)
- Added code docs for the Python MapReduce scripts and the Java classes.
- Made a few small improvements to the Java code and removed the value-suggestion code and extra dependencies.
- Wrote MapReduce scripts for counting the property frequencies for a) all items and b) source references.
- Planned how to implement value and qualifier suggestions, and briefly discussed the requirements of the MediaWiki PHP-side API to be written for the entity suggester.
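One of the items above mentions higher-level HTTP-based tests. A rough sketch of what such a test could look like — the host, endpoint path, query parameters, and plain-text response format are all assumptions for illustration, not the actual servlet API:

```python
"""Hedged sketch of a higher-level HTTP test against the suggester's
REST backend. Endpoint path and response shape are hypothetical."""
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen         # Python 2

def parse_suggestions(body):
    """Parse one suggested property id per non-empty response line."""
    return [line.strip() for line in body.splitlines() if line.strip()]

def check_suggest_endpoint(base_url, item_id="Q42", limit=5):
    """Hit a hypothetical /suggest endpoint and sanity-check the reply."""
    url = "%s/suggest?item=%s&limit=%d" % (base_url, item_id, limit)
    props = parse_suggestions(urlopen(url).read().decode("utf-8"))
    assert props, "expected at least one suggestion"
    assert all(p.startswith("P") for p in props), "expected property ids"
    return props
```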
Things to do:
- Write Java servlets for suggesting properties for empty items (items that don't have any props yet) and for empty source refs. Initially, a naive implementation will fetch the top N properties from the two ordered property-frequency lists (see the property-frequency scripts above). This should finish the property suggester.
- Work on the MediaWiki API: read up on the docs, figure out what needs to be done, and start on it.
- Implement value and qualifier suggestions. Should be simple to add new features like these since one feature of the backend is already done.
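The naive top-N fetch planned above boils down to a few lines. A hedged Python sketch of the idea (the real implementation will be a Java servlet, and the frequency list comes from the MapReduce scripts; names here are illustrative):

```python
"""Sketch of the naive empty-item suggester: with no existing
properties to condition on, just return the globally most frequent
properties from a precomputed frequency list."""

def top_n_properties(freq_list, n=10):
    """freq_list: iterable of (property_id, count) pairs.
    Return the n property ids with the highest counts."""
    ranked = sorted(freq_list, key=lambda pc: pc[1], reverse=True)
    return [prop for prop, _count in ranked[:n]]
```

The same helper would serve both cases by swapping in the all-items or the source-refs frequency list.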
August to mid September
- Made extensive changes to the code for the backend Java REST API: the backend can now be trained and can suggest properties for claims, source refs, and qualifiers, as well as values for a given property.
- Wrote tests and code-level docs for the Java REST API.
- Organized the Python scripts into two all-encompassing modules, mapper.py and reducer.py, complete with documentation.
- Tested all features of the REST API on the wikidata-suggester test instance.
- The bulk of the coding for the MediaWiki API module is complete; some code reviews remain.
- Demonstration of the PHP API module is not finished, and tests still need to be written.