WMDE contract offers/Rewrite CatScan
CatScan (m:User:Duesentrieb/CatScan) is an online tool for searching categories recursively according to diverse criteria. Among other things, it can be used to intersect two categories (including subcategories) or to find new pages in a category. CatScan is especially useful for maintenance work in topic-based projects like Wikipedia's "portals".
CatScan is pretty complex already, but there is a constant flow of requests for new criteria for filtering, output formats, etc. It would make sense to rewrite CatScan using a flexible framework, and to provide specialized user interfaces for specific tasks.
CatScan is to be rewritten with the goal of creating a flexible framework that allows for the easy creation of tools for category-based queries.
- Generators for generating a set of categories, and listing all pages in that set of categories
- Generator for listing subcategories recursively, down to some given level.
- Combinators for category-contents
- Combinator for finding the cross-section of the pages in two category-sets
- Combinator for finding the union of the pages in two category-sets
- Combinator for finding the difference of the pages in two category-sets
- Filters for category-contents
- Filter by namespace (only main, only categories, only images, only non-talk, etc)
- Filter by template(s)
- Filter by date of last change (optionally excluding edits by bots, or only including edits by anons, or un-flagged revisions)
- Filter by size and number of links (stub detection)
- Filters should be invertible
- Filters can add information to the result that explains why and how the filter criteria apply, e.g. the size of the page or the date of the last change.
It shall be possible to combine the filters programmatically, and they should interact in such a way as to produce a reasonably efficient set of database queries.
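To illustrate how generators, combinators and invertible filters might fit together, here is a minimal Python sketch. It is not a proposed design: the in-memory dictionaries `SUBCATS`, `MEMBERS` and `PAGE_SIZE`, the `Filter` class and all function names are hypothetical stand-ins for what would really be database queries.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set

# Hypothetical in-memory data; a real tool would query the wiki database.
SUBCATS: Dict[str, List[str]] = {
    "Science": ["Physics", "Biology"],
    "Physics": ["Optics"],
    "Germany": ["German scientists"],
}
MEMBERS: Dict[str, Set[str]] = {
    "Science": {"Scientific method"},
    "Physics": {"Albert Einstein", "Photon"},
    "Optics": {"Lens"},
    "Biology": {"Cell"},
    "Germany": {"Berlin"},
    "German scientists": {"Albert Einstein"},
}
PAGE_SIZE: Dict[str, int] = {
    "Scientific method": 9000, "Albert Einstein": 120000, "Photon": 800,
    "Lens": 400, "Cell": 15000, "Berlin": 90000,
}


def category_set(root: str, max_depth: int) -> Set[str]:
    """Generator: the root category plus subcategories down to max_depth."""
    seen = {root}
    frontier = [root]
    for _ in range(max_depth):
        next_frontier = []
        for parent in frontier:
            for child in SUBCATS.get(parent, []):
                if child not in seen:  # guards against category cycles
                    seen.add(child)
                    next_frontier.append(child)
        frontier = next_frontier
    return seen


def pages_in(categories: Set[str]) -> Set[str]:
    """Generator: all pages contained in any category of the set."""
    pages: Set[str] = set()
    for category in categories:
        pages |= MEMBERS.get(category, set())
    return pages


@dataclass(frozen=True)
class Filter:
    name: str
    test: Callable[[str], bool]    # does the page pass?
    explain: Callable[[str], str]  # extra info to attach to the result

    def inverted(self) -> "Filter":
        return Filter("not " + self.name, lambda p: not self.test(p), self.explain)


def apply_filters(pages: Set[str], filters: List[Filter]) -> Dict[str, Dict[str, str]]:
    """Keep pages passing every filter; attach each filter's explanation."""
    result = {}
    for page in sorted(pages):
        if all(f.test(page) for f in filters):
            result[page] = {f.name: f.explain(page) for f in filters}
    return result


def smaller_than(limit: int) -> Filter:
    return Filter(f"size<{limit}",
                  lambda p: PAGE_SIZE[p] < limit,
                  lambda p: f"{PAGE_SIZE[p]} bytes")


science = pages_in(category_set("Science", max_depth=2))
germany = pages_in(category_set("Germany", max_depth=1))
both = science & germany      # cross-section combinator
either = science | germany    # union combinator
only_science = science - germany  # difference combinator
stubs = apply_filters(science, [smaller_than(1000)])
```

Here the combinators are plain set operations, which is only workable for small result sets. The point of the sketch is the shape of the interfaces: a filter carries both its test and the explanation it contributes to the output, and inverting it keeps the explanation intact.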
There shall be output renderers for the following formats:
- HTML (direct display)
- CSV/TSV (for bots)
- Wiki-Text (for copy & paste)
- Possibly others, such as PHP-Serialized, JSON, XML, YAML...
When producing the output, any additional info provided by filter components should be incorporated in a meaningful way.
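As a rough sketch of what such renderers could look like, the following Python functions render the same result rows as TSV, wiki-text and JSON. The row layout (a `title` key plus arbitrary extra keys supplied by filters) and the function names are assumptions made for this example only.

```python
import csv
import io
import json
from typing import Dict, List

Row = Dict[str, str]  # "title" plus any extra info added by filters


def columns(rows: List[Row]) -> List[str]:
    """Collect column names in first-seen order, with "title" first."""
    cols = ["title"]
    for row in rows:
        for key in row:
            if key not in cols:
                cols.append(key)
    return cols


def render_tsv(rows: List[Row]) -> str:
    """Tab-separated output for bots; missing extra info becomes empty."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns(rows), delimiter="\t",
                            lineterminator="\n", restval="")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


def render_wikitext(rows: List[Row]) -> str:
    """Bullet list of wiki links, with the extra info in parentheses."""
    # Note: titles in the Category or File namespace would need a leading
    # colon inside the link; this sketch does not handle that.
    lines = []
    for row in rows:
        extras = ", ".join(f"{k}: {v}" for k, v in row.items() if k != "title")
        lines.append(f"* [[{row['title']}]]" + (f" ({extras})" if extras else ""))
    return "\n".join(lines)


def render_json(rows: List[Row]) -> str:
    return json.dumps(rows, ensure_ascii=False, indent=2)


rows = [
    {"title": "Photon", "size": "800"},
    {"title": "Lens", "size": "400", "last_change": "2009-01-02"},
]
tsv = render_tsv(rows)
wikitext = render_wikitext(rows)
```

Because every renderer receives the same list of rows, information added by a filter shows up in all formats without the filter knowing anything about output.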
There are several options for implementing combinable filters:
- The components build a complex SQL query, which is then run.
- The components run SQL queries and store the result into a temporary table, which is then used as the input for the next component.
- A combination of the above
- Perhaps some kinds of intermediary results can be cached for some hours or a day. This seems particularly prudent for the recursive contents of categories.
- Perhaps it would be best to use a specialized search and indexing system for this. Lucene would be an obvious choice, since Wikimedia already uses it to implement the site search.
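The first three options can be sketched with Python's `sqlite3` module. The schema below is a heavily simplified assumption modelled on MediaWiki's `page` and `categorylinks` tables (namespace 14 being the category namespace); the function names and the sample data are hypothetical. The recursive category set is materialized in a temporary table, and the final result then comes from a single composed query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_namespace INTEGER, page_title TEXT);
CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT);
""")
conn.executemany("INSERT INTO page VALUES (?, ?, ?)", [
    (1, 14, "Physics"), (2, 14, "Optics"), (3, 0, "Photon"),
    (4, 0, "Lens"), (5, 14, "Stubs"), (6, 1, "Lens"),
])
conn.executemany("INSERT INTO categorylinks VALUES (?, ?)", [
    (2, "Physics"),  # Category:Optics is a subcategory of Physics
    (3, "Physics"), (4, "Optics"), (4, "Stubs"), (6, "Optics"),
])


def build_category_set(conn, table, root, max_depth):
    """Store root and its subcategories, down to max_depth, in a temp table."""
    # `table` must be a trusted identifier chosen by the caller, not user input.
    conn.execute(f"CREATE TEMP TABLE {table} (title TEXT PRIMARY KEY, depth INTEGER)")
    conn.execute(f"INSERT INTO {table} VALUES (?, 0)", (root,))
    for depth in range(max_depth):
        # Add category pages that are members of the previous level.
        conn.execute(f"""
            INSERT OR IGNORE INTO {table}
            SELECT p.page_title, ? FROM categorylinks cl
            JOIN page p ON p.page_id = cl.cl_from AND p.page_namespace = 14
            JOIN {table} t ON t.title = cl.cl_to AND t.depth = ?
        """, (depth + 1, depth))


def pages_in_a_not_b(conn, set_a, set_b, namespace=0):
    """One composed query: pages in category set A but in no category of set B."""
    sql = f"""
        SELECT DISTINCT p.page_title FROM page p
        JOIN categorylinks cl ON cl.cl_from = p.page_id
        JOIN {set_a} a ON a.title = cl.cl_to
        WHERE p.page_namespace = ?
          AND NOT EXISTS (
              SELECT 1 FROM categorylinks cl2
              JOIN {set_b} b ON b.title = cl2.cl_to
              WHERE cl2.cl_from = p.page_id)
        ORDER BY p.page_title
    """
    return [row[0] for row in conn.execute(sql, (namespace,))]


build_category_set(conn, "set_physics", "Physics", max_depth=2)
build_category_set(conn, "set_stubs", "Stubs", max_depth=0)
result = pages_in_a_not_b(conn, "set_physics", "set_stubs")
```

A temporary table such as `set_physics` is also a natural unit for the caching idea: if the recursive contents of a category were kept for some hours, only the final, cheaper query would need to run per request. Whether that pays off would have to be measured against the real database.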
User Interface Specification
User interfaces are to be developed for several use cases:
- Find all pages in category A, including subcategories, optionally filtered by namespace.
- Find all pages that are in category A and (not) in category B, including subcategories, optionally filtered by namespace.
- Find all pages in category A that are in at least one (or in none) of a set of categories, including subcategories, optionally filtered by namespace.
- Find all pages in category A that have been created/edited in the last x hours/days, including subcategories, optionally filtered by namespace. In addition, filter minor edits, bot edits, etc.
- Find all pages in category A that are considered a stub (by some criteria), including subcategories, optionally filtered by namespace.
Integration with MediaWiki proper, as an extension, has been suggested and is being discussed on the talk page.
See WMDE_contract_offers for how to apply for a contract.