Basic idea: be able to do character-level searches of current wikitext of a particular Wikimedia wiki using regular expressions in a very short period of time (30 seconds or less would be fantastic!). Basically, we want fast and loose dumps.
This goal is split into two primary tasks:
- retrieving, storing, and constantly updating the current wikitext of all pages on a particular wiki; and
- scanning that data using regular expressions at a very fast speed.
Why? Being able to search at the character level all current wikitext of a wiki in a short amount of time has a number of possible uses. In addition, with current ongoing work on the visual editor, being able to find particular wikimarkup constructs quickly will become more important.
This section encompasses what's needed to be done to do large-scale retrieval, storage, and interactive updating of current wikitext.
Decompress and store the data in a file system
In order to speed things up, the idea is stop using tools like bzcat and stop requiring parsing of the dump using an XML reader. Instead, take each page and store it in a file system tree.
The idea would be to go by page ID and store in two or three levels deep of directories, hopefully using the page ID.
For a page with the ID "209320", you could store in /enwiki/0/23/209320, where the file system lookup is based on reading the ID backward (read backward to avoid 1 and 2). Plus it means you can find the file on the system with needing any hashes, you just need to be able to reverse and truncate the string (which surely every language can do very quickly and very easily). For IDs that aren't long enough, you can zero-pad.
You're ignoring other bits of data from the dump, but in return, by storing and retrieving by ID, you can join against the Toolserver's replicated databases and get current stats (about template uses, etc.). Assuming the wikitext is current, you're golden.
Regularly update the data
Dumps currently come out about monthly.[someone should check this] Nobody wants month-old data. The page text is constantly changing for certain pages, so a script needs to follow a feed (via IRC, RSS, API requests, whatever) to know to update a particular file to keep the database current at all times. I guess the feed will need to have page ID, so IRC might not be an option. Depends how creative you want to get.
You could not use a dump at all and simply poll the API, but that would take over a month at 1/second on the English Wikipedia! The idea here is that you would need fewer initial API calls by starting with a dump (basically you front-load and then hope a bunch of pages haven't updated in the meantime).
Then you have a second process that occasionally scans and updates the whole tree or rebuilds the database. Orphaned page IDs will eventually get in the way, though with a join against current page IDs, they'll be easily filterable. It'll just be a matter of longer and longer searches (more data).
[Hmmm, time between dump and when the feed starts being read needs to be accounted for. Dump recent changes (past 30 days) and get unique page IDs to update and then overlap for a brief period? Dunno.]
[Maybe possible to track deletions. Or do a set comparison of current page IDs to IDs on the file system and do deletions occasionally?]
This section encompasses what's needed to be done to scan a large amount of wikitext at a character level using regular expressions.
Research scanning software
With decompression, storage, and currency (updatedness?) of the wikitext out of the way, all that's left is scanning! Two basic requirements to scanning:
- support regular expressions (hopefully including syntax such as \d rather than requiring [0-9] nonsense); and
- work at the character level (so that you can find, for example, ": [[" in wikitext easily).
grep might be an option here. Or even something like Python, assuming that Python can recurse into the file system fast enough....
It kind of depends what the input is like here. If you looked up by page IDs on the Toolserver and then just found all article IDs, you could easily read each file and not need to traverse the tree at all. That might be fastest.
If you traverse the tree, that could be equivalent to searching all pages. Other idea is to have two different data sets: articles and all pages. Doesn't expand well to other wikis, though. You want to be able to customize at a high level (by namespace).
The impediments to this pipe dream are mainly that computers suck.
There are a few speed bottlenecks that are vaguely easy to overcome with enough resources:
- compression of the data; and
- requiring the data to be parsed as XML.
By decompressing the data and putting it into a file system and parsing the XML so that each file contains only the MediaWiki page's wikitext, you kill both of these bottlenecks without too much trouble.
However, concurrent reads of a hard drive aren't really possible right now, so you quickly hit a bottleneck of being able to quickly scan the data, even decompressed.
There are three general ideas for resolving the remaining speed issues:
- put the data into memory;
- use multiple hard drives; or
- distribute the workload between a lot of computers.
Each of these ideas has severe limitations.
Putting the data into memory requires having about 64 GB of memory free (for the decompressed wikitext of every page on the English Wikipedia) and having that memory dedicated to storing the data for the indefinite future. This is a fuck-ton of memory to have, much less have available on an ongoing basis.
Using multiple hard drives is expensive, particularly if the hard drives must be stored in a datacenter. Storing the data locally is a challenge for several reasons, particularly getting the data and keeping it up-to-date. A considerable amount of bandwidth is needed for both operations. If you could split the data across 64 1 GB drives, it might be possible to scan the data quickly, but that's prohibitively expensive.
Distributing the load across a network of computers is possible, but significantly more difficult to implement. It requires having people willing to keep a connection open, download data constantly, and process data constantly. Any scan would have to hit every computer, any one of which may not be available, so you'd need redundancy. It would be a coordination nightmare and could ultimately be slower than simply doing the scan in linear time on a single machine.
Constant updates to the data risk disk thrashing and other hardware issues.
Indexing the data is easy enough, but defeats nearly the entire purpose of the project: namely character-level searches.
$ time wget "http://dumps.wikimedia.org/enwiki/20120307/enwiki-20120307-pages-meta-current.xml.bz2" real 62m40.527s user 0m22.917s sys 2m55.503s
This means that 15,931,147,003 bytes were downloaded in a bit over an hour.