User:Kalan/gsoc

See this page as of April 6th.
 * Name: Kalan
 * Email: kalan.001@gmail.com
 * Project title: Caterpillar the Category Tool

Contact/working info

 * Timezone: UTC+4
 * Typical working hours: 09–23 (UTC; may be easily altered if requested)
 * IRC or IM networks/handle(s): enhydra on freenode; IMs are listed on m:User:Kalan

Project summary
There are several tools that intersect two categories (or a category with some other set of pages). However, they cover only a few of the operations a user may want. Ideally, a user should be able to perform any set-theoretic operation on categories. This will add a whole new way to browse a wiki’s content and make some maintenance tasks much simpler.

I want Caterpillar to be available for any MediaWiki install (with enough RAM, of course), not only for Wikimedia wikis. So this is going to be a MediaWiki extension, not a Toolserver project.

Related projects

 * DynamicPageList can build intersections of categories and their negations. It only allows simple queries: one can’t specify an OR operation, and the default limit is 200 pages in a set. Also, it is designed for lists displayed to readers, so there is no way to retrieve the result of a query automatically other than screen-scraping. This extension is enabled on only some Wikimedia wikis, apparently because of performance issues.
 * Multi-Category Search provides similar simple queries, but returns a navigable list.
 * Semantic MediaWiki provides a powerful querying mechanism, and page categories are among the parameters that can be queried. However, it does not allow the NOT operator for categories, making a large subset of queries impossible.

None of the above projects offers tree traversal (the “category with all subcategories” operation).

They are all written in PHP and communicate directly with the database, which makes them unsuitable for large sites with branchy category trees, such as the largest Wikipedias.
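
The “category with all subcategories” operation amounts to a graph traversal. Here is a minimal sketch, assuming the category graph is already held in memory as two plain dictionaries; the names `subtree_pages`, `subcats` and `members` and the toy data are hypothetical and not part of any existing tool:

```python
from collections import deque


def subtree_pages(root, subcats, members):
    """Collect the pages of `root` and of every category reachable below it."""
    seen = {root}
    queue = deque([root])
    pages = set()
    while queue:
        cat = queue.popleft()
        pages |= members.get(cat, set())
        for child in subcats.get(cat, ()):
            # Real category graphs may contain cycles, so visit each node once.
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return pages


# Hypothetical toy graph with a deliberate cycle back to the root.
subcats = {
    "Births": ["18th-century births", "19th-century births"],
    "19th-century births": ["Births"],
}
members = {
    "Births": {"Index of births"},
    "18th-century births": {"Mozart"},
    "19th-century births": {"Darwin"},
}
```

Calling `subtree_pages("Births", subcats, members)` walks both subcategories once and stops at the cycle instead of looping forever.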


 * The Wikimedia-specific CatScan is the best-known tool. It can only intersect one category with some other set of pages. It queries a copy of the database, so some queries may be slow.
 * The Wikimedia-specific CatGraphApi, currently in beta testing, allows one to build a conjunctive normal form on categories. It uses an approach similar to the one suggested on this page, with the help of graphcore and graphserv, free software helpers implemented under a contract with Wikimedia Deutschland. However, the beta test is currently unbearably slow, and its interface of text fields and dropdown boxes is baffling and inconvenient for building queries.

Also see bugzilla:5244.

Required deliverables

 * MediaWiki extension
 * Special page and API that accepts queries
 * Language parser that interprets them
 * Maintenance script that (re)builds the category graph
 * Hook that updates the daemon’s version of the graph in real time and invalidates corresponding caches
 * Installation script
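
The graph-building script and the real-time update hook listed above could share one small data structure. This is a rough sketch only, assuming link rows have already been read from the database as `(member, category, kind)` tuples, loosely modelled on the `cl_from`, `cl_to` and `cl_type` columns of MediaWiki’s `categorylinks` table. The class and method names are hypothetical, and this in-process structure merely stands in for the daemon; it does not reflect the graphserv protocol.

```python
from collections import defaultdict


class CategoryGraph:
    """In-memory category graph: subcategory edges and page membership."""

    def __init__(self):
        self.subcats = defaultdict(set)  # category -> child categories
        self.pages = defaultdict(set)    # category -> member pages

    def _table(self, kind):
        # Assumption: anything that is not a subcategory counts as a page.
        return self.subcats if kind == "subcat" else self.pages

    def add_link(self, member, category, kind):
        """What an update hook might call when a page gains a category."""
        self._table(kind)[category].add(member)

    def remove_link(self, member, category, kind):
        """What an update hook might call when a page loses a category."""
        self._table(kind)[category].discard(member)

    @classmethod
    def build(cls, rows):
        """Full (re)build, as the maintenance script would do."""
        graph = cls()
        for member, category, kind in rows:
            graph.add_link(member, category, kind)
        return graph
```

The full rebuild and the incremental updates go through the same `add_link` path, so the two cannot drift apart in how they interpret a row.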

If time permits

 * AJAX suggestions for category names
 * Slow and daemonless version for small installations

Project schedule

 * Apr: getting familiar with Wikimedia infrastructure and existing code
 * May: connecting with mentor, clarifying the schedule, setting up a testing environment capable of holding real category graphs
 * May 21: start
 * May 21…30: possible fixes to graphcore and graphserv
 * Jun 01…04: proof-of-concept special page
 * Jun 05…13: real special page that parses queries
 * Jun 14…16: API for this page
 * Jun 17…26: means to build the tree from the database
 * Jun 27…30: documenting, fixing remaining bugs
 * Jul 01…10: lots of testing and probably some bugfixing (and Wikimania!)
 * Jul 10…13: mid-term evaluations
 * Jul 13…22: hook for real-time updates, researching consistency
 * Jul 23…31: installation script
 * Aug 01…12: something from “if time permits”
 * Aug 13…20: polishing code and documentation
 * Aug 13…20: finish

About
Caterpillar was inspired by another user who complained that existing tools are not enough for the maintenance he is doing on Commons. He makes the lists he needs by hand or with very suboptimal scripts or bots, sometimes asking me for help. I sometimes see manually maintained categories or lists that really shouldn’t be there, but they exist because there is not yet any way to do without them, and I feel an urge to replace them with a more suitable tool.

This project is a good challenge involving a complicated system of heterogeneous components working together, and it excites me a great deal.

Participation
I have a Gerrit account (kalan), and I plan to keep my commits there self-explanatory, so that they can serve as a detailed report on what I do and how. I am always (literally always) on IRC and available for questions. Should serious architectural concerns arise, I will start a thread on Wikimedia-l; for minor things, I will ask on IRC.

Past open source experience
I have been a Wikimedian since 2006 and a Toolserver user (kalan) since 2007. On wikis (in particular on Russian Wikipedia, my home wiki), I mostly focus on templates, CSS and JavaScript (for example, see Voting.js and Gadget-markblocked.js). On the Toolserver, I run Python bots and systems for arbcom elections and POTY voting (to be rolled out by April 22nd). Server-side MediaWiki coding is something I have yet to try out.

Any other info
Cumbersome interfaces with lots of text fields would take a tremendous amount of time to build, debug, and use. The solution is a small DSL (domain-specific language) that makes complex queries as clear as math notation. The queries may look like this:

 * Intersect two categories:  AND
 * Union of two subtrees intersected with the third:  (<<18th-century births>> OR <<19th-century births>>) AND <>
 * Applying an arbitrary Boolean function to categories and subtrees:  01101011(<>, , <>) (a function defined by its name as a horizontal truth table column; I expect this feature to be used rarely)
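
To make the notation concrete, here is a small recursive-descent evaluator. It is a sketch under several assumptions rather than the planned implementation. Of the syntax, only `<<name>>` for a subtree, `AND`, `OR`, parentheses and the bit-string function call are visible in the examples above. The single-bracket `<name>` form for a plain category, `AND` binding tighter than `OR`, and the truth table ordering (leftmost bit for “in no argument”, first argument as the most significant bit) are guesses. The function names and the `category`/`subtree` lookup callables are hypothetical.

```python
import re

TOKEN = re.compile(r"\s*(<<[^<>]+>>|<[^<>]+>|[01]+|AND|OR|[(),])")


def tokenize(text):
    text = text.strip()
    pos, tokens = 0, []
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise ValueError(f"unexpected input at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def apply_table(bits, args):
    """Apply a Boolean function, given as a truth table string, to page sets."""
    if len(bits) != 2 ** len(args):
        raise ValueError("truth table length must be 2 ** number of arguments")
    result = set()
    # Only pages in at least one argument are considered, so the bit for
    # "in none of the arguments" is ignored in this sketch.
    for page in set().union(*args):
        index = 0
        for arg in args:
            index = index * 2 + (page in arg)
        if bits[index] == "1":
            result.add(page)
    return result


def evaluate(query, category, subtree):
    """Evaluate `query`; `category` and `subtree` map a name to a set of pages."""
    tokens = tokenize(query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take(expected=None):
        nonlocal pos
        tok = peek()
        if tok is None or (expected is not None and tok != expected):
            raise ValueError(f"expected {expected!r}, got {tok!r}")
        pos += 1
        return tok

    def expr():  # expr := term ("OR" term)*
        result = term()
        while peek() == "OR":
            take()
            result = result | term()
        return result

    def term():  # term := factor ("AND" factor)*
        result = factor()
        while peek() == "AND":
            take()
            result = result & factor()
        return result

    def factor():
        tok = take()
        if tok == "(":
            result = expr()
            take(")")
            return result
        if tok.startswith("<<"):
            return subtree(tok[2:-2])
        if tok.startswith("<"):
            return category(tok[1:-1])
        if tok[0] in "01":  # truth table call: bits(arg, arg, ...)
            take("(")
            args = [expr()]
            while peek() == ",":
                take()
                args.append(expr())
            take(")")
            return apply_table(tok, args)
        raise ValueError(f"unexpected token {tok!r}")

    result = expr()
    if peek() is not None:
        raise ValueError(f"trailing token {peek()!r}")
    return result
```

With this ordering, `0110(<A>, <B>)` yields the pages in exactly one of the two categories, and `0010(<A>, <B>)` yields the pages in A but not in B. A real implementation would also need a page universe to support tables whose first bit is 1.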