Laxative

Basic idea: be able to do character-level searches of current wikitext of a particular Wikimedia wiki using regular expressions in a very short period of time (30 seconds or less would be fantastic!). Basically, we want fast and loose dumps.

This goal is split into two primary tasks:


 * 1) retrieving, storing, and constantly updating the current wikitext of all pages on a particular wiki; and
 * 2) scanning that data using regular expressions at a very fast speed.

Why? Being able to search at the character level all current wikitext of a wiki in a short amount of time has a number of possible uses. In addition, with current ongoing work on the visual editor, being able to find particular wikimarkup constructs quickly will become more important.

Prep
This section encompasses what's needed to be done to do large-scale retrieval, storage, and interactive updating of current wikitext.

Decompress and store the data in a file system
In order to speed things up, the idea is stop using tools like bzcat and stop requiring parsing of the dump using an XML reader. Instead, take each page and store it in a file system tree.

The idea would be to go by page ID and store in two or three levels deep of directories, hopefully using the page ID.

For a page with the ID "209320", you could store in /enwiki/0/23/209320, where the file system lookup is based on reading the ID backward (read backward to avoid 1 and 2). Plus it means you can find the file on the system with needing any hashes, you just need to be able to reverse and truncate the string (which surely every language can do very quickly and very easily). For IDs that aren't long enough, you can zero-pad.

You're ignoring other bits of data from the dump, but in return, by storing and retrieving by ID, you can join against the Toolserver's replicated databases and get current stats (about template uses, etc.). Assuming the wikitext is current, you're golden.

Regularly update the data
Dumps currently come out about monthly.[someone should check this] Nobody wants month-old data. The page text is constantly changing for certain pages, so a script needs to follow a feed (via IRC, RSS, API requests, whatever) to know to update a particular file to keep the database current at all times. I guess the feed will need to have page ID, so IRC might not be an option. Depends how creative you want to get.

You could not use a dump at all and simply poll the API, but that would take over a month at 1/second on the English Wikipedia! The idea here is that you would need fewer initial API calls by starting with a dump (basically you front-load and then hope a bunch of pages haven't updated in the meantime).

Then you have a second process that occasionally scans and updates the whole tree or rebuilds the database. Orphaned page IDs will eventually get in the way, though with a join against current page IDs, they'll be easily filterable. It'll just be a matter of longer and longer searches (more data).

[Hmmm, time between dump and when the feed starts being read needs to be accounted for. Dump recent changes (past 30 days) and get unique page IDs to update and then overlap for a brief period? Dunno.]

[Maybe possible to track deletions. Or do a set comparison of current page IDs to IDs on the file system and do deletions occasionally?]

Scan
This section encompasses what's needed to be done to scan a large amount of wikitext at a character level using regular expressions.

Research scanning software
With decompression, storage, and currency (updatedness?) of the wikitext out of the way, all that's left is scanning! Two basic requirements to scanning:


 * 1) support regular expressions (hopefully including syntax such as \d rather than requiring [0-9] nonsense); and
 * 2) work at the character level (so that you can find, for example, ": [[" in wikitext easily).

grep might be an option here. Or even something like Python, assuming that Python can recurse into the file system fast enough....

It kind of depends what the input is like here. If you looked up by page IDs on the Toolserver and then just found all article IDs, you could easily read each file and not need to traverse the tree at all. That might be fastest.

If you traverse the tree, that could be equivalent to searching all pages. Other idea is to have two different data sets: articles and all pages. Doesn't expand well to other wikis, though. You want to be able to customize at a high level (by namespace).