User:Kalan/gsoc


 * Name: Kalan
 * Email: kalan.001@gmail.com
 * Project title: Caterpillar the Category Tool

Contact/working info

 * Timezone: UTC+4
 * Typical working hours: 09:00–23:00 UTC (can easily be changed on request)
 * IRC or IM networks/handle(s): enhydra on freenode; IMs are listed on m:User:Kalan

Project summary
There are several tools that intersect two categories (or a category with some other set of pages). However, they cover only a small part of the operations one may want. Ideally, a user should be able to perform any set-theoretic operation on categories. This will add a whole new way to browse a wiki’s content and make some maintenance tasks much simpler.

I want Caterpillar to be available for any MediaWiki install (with enough RAM, of course), not only for Wikimedia wikis. So this is going to be a MediaWiki extension, not a Toolserver project.

Required deliverables

 * C or C++ daemon that stores the category graph in memory, responds to elementary queries, and caches intermediate results
 * MediaWiki extension
 * Special page and API that accept queries
 * Language parser that interprets them
 * Maintenance script that (re)builds the category graph
 * Hook that updates the daemon’s version of the graph in real time and invalidates corresponding caches
 * Fully automated installation script
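
As a rough illustration of what the daemon has to do, here is a minimal sketch of the in-memory graph, one elementary query (all pages in a subtree) and cache invalidation on update. It is written in Python for brevity rather than the planned C or C++, and every class and method name in it is hypothetical, not part of any existing MediaWiki API.

```python
from collections import defaultdict


class CategoryGraph:
    """Toy in-memory category graph with a cache of subtree results."""

    def __init__(self):
        self.subcats = defaultdict(set)  # category -> direct subcategories
        self.parents = defaultdict(set)  # category -> direct parent categories
        self.pages = defaultdict(set)    # category -> pages placed directly in it
        self._cache = {}                 # category -> frozenset of subtree pages

    def add_subcategory(self, parent, child):
        self.subcats[parent].add(child)
        self.parents[child].add(parent)
        self._invalidate(parent)

    def add_page(self, category, page):
        self.pages[category].add(page)
        self._invalidate(category)

    def remove_page(self, category, page):
        self.pages[category].discard(page)
        self._invalidate(category)

    def _invalidate(self, category):
        # A change in one category affects the cached subtree of that
        # category and of every ancestor, so drop all of those entries.
        seen, stack = set(), [category]
        while stack:
            cat = stack.pop()
            if cat not in seen:
                seen.add(cat)
                self._cache.pop(cat, None)
                stack.extend(self.parents[cat])

    def members(self, category):
        """Pages placed directly in the category."""
        return frozenset(self.pages[category])

    def subtree(self, category):
        """Pages in the category and in all of its descendants."""
        if category not in self._cache:
            seen, stack, result = set(), [category], set()
            while stack:
                cat = stack.pop()
                if cat in seen:  # the visited set guards against cycles
                    continue
                seen.add(cat)
                result |= self.pages[cat]
                stack.extend(self.subcats[cat])
            self._cache[category] = frozenset(result)
        return self._cache[category]
```

With such a structure, intersecting two subtrees is a plain set operation, for example `graph.subtree("18th-century births") & graph.subtree("19th-century births")`. The real-time update hook would call methods like `add_page` and `remove_page`, which is where the corresponding caches get invalidated.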

If time permits

 * AJAX suggestions for category names
 * Slow and daemonless version for small installations

Project schedule

 * Apr: getting familiar with Wikimedia infrastructure
 * May: connecting with mentor, clarifying the schedule, setting up a testing environment capable of holding real category graphs
 * May 21: start
 * May 21…30: implementing the graph-holding daemon
 * Jun 01…04: proof-of-concept special page
 * Jun 05…13: real special page that parses DSL queries
 * Jun 14…16: API for this page
 * Jun 17…26: means to build the tree from the database
 * Jun 27…30: documenting, shooting remaining bugs
 * Jul 01…10: lots of testing and probably some bugfixing (and Wikimania!)
 * Jul 10…13: mid-term evaluations
 * Jul 13…22: hook for real-time updates, researching consistency
 * Jul 23…31: installation script
 * Aug 01…12: something from “if time permits”
 * Aug 13…20: polishing code and documentation
 * Aug 13…20: finish

About
Caterpillar was inspired by another user who complained that existing tools are not enough for the maintenance he is doing on Commons. He makes the lists he needs by hand or with very suboptimal scripts or bots, sometimes asking me for help. I sometimes see manually maintained categories or lists that really shouldn’t be there, but they exist because there is no way yet to do without them, and I feel an urge to replace them with a more suitable tool.

This project is a good challenge involving a complicated system of heterogeneous components working together, and it excites me a lot.

Participation
I have a Gerrit account (kalan), and I plan to keep my commits there self-explanatory, so that anyone who wants a detailed report on what I do and how can get it from them. I am always (literally always) on IRC and available for questions. Should serious architectural concerns arise, I will start a thread on Wikimedia-l; for minor things, I will ask on IRC.

Past open source experience
I have been a Wikimedian since 2006 and a Toolserver user (kalan) since 2007. On wikis (in particular Russian Wikipedia, my home wiki), I mostly focus on templates, CSS and JavaScript. On the Toolserver, I run Python bots and systems for arbcom elections and POTY voting (to be rolled out in the next few days). Server-side MediaWiki coding is something I have yet to try.

Any other info
Cumbersome interfaces with lots of text fields would take a tremendous amount of time to build, debug, and use. The solution is a small DSL (domain-specific language) that makes complex queries as clear as math notation. The queries may look like this:

 * Intersect two categories:  AND 
 * Union of two subtrees intersected with the third:  (<<18th-century births>> OR <<19th-century births>>) AND <> 
 * Applying an arbitrary Boolean function to categories and subtrees:  01101011(<>, , <>) (a function defined by its name as a horizontal truth table column; I expect this feature to be used rarely)
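
To make the truth-table notation concrete, here is a minimal Python sketch of how a function named by its truth table column could be applied to sets of pages. The bit ordering is my assumption, not something fixed yet: the character at position i of the name is the output for the input combination whose binary index is i, with the first argument as the most significant bit. The function name `apply_truth_table` and the `universe` parameter are hypothetical.

```python
def apply_truth_table(name, *sets, universe=None):
    """Apply the Boolean function whose truth table column is `name`.

    Assumption: name[i] is the output for the membership pattern with
    binary index i, where the first operand is the most significant bit.
    """
    n = len(sets)
    if len(name) != 2 ** n or set(name) - {"0", "1"}:
        raise ValueError("name must consist of exactly 2**n binary digits")
    if universe is None:
        if name[0] == "1":
            # Such a function is true for pages that are in no operand,
            # so the result is only defined relative to a set of all pages.
            raise ValueError("a function with a leading 1 needs a universe")
        universe = set().union(*sets)
    result = set()
    for page in universe:
        index = 0
        for operand in sets:
            # Shift in one membership bit per operand, first operand first.
            index = (index << 1) | (page in operand)
        if name[index] == "1":
            result.add(page)
    return result
```

Under this ordering, `0001` is the intersection of two sets, `0111` is their union and `0110` is their symmetric difference, so the named operators are special cases of the general form.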