WMF Projects/Data Dumps

Feature justification
We produce XML-formatted copies of the content of each of the wiki projects in each active language. These files are used by researchers doing analysis, by contributors who run bots to manipulate content, and by organizations that mirror the projects, among others. In addition, these files make it possible to fork the content at any time, much as downloadable source code lets the community of an open source project fork it when that is deemed appropriate.

Since 2007 the availability of the dumps has been inconsistent. The dumps of the English-language Wikipedia in particular are so large that they take a long time to run. If something goes wrong during that run (a network issue, a power outage, a problem with the server hosting the files or the server generating the dumps, or a change to some part of the MediaWiki code that causes corruption), the entire job is lost and must be restarted.

Speed and availability of current files are not the only issues; we have also had neither mirrors nor backups of these files.

We have also received many requests to add fields to the dumps and to provide subsets of the dumps.

This project will address these issues.

Tasks
See Dumps/Development 2011 on wikitech.wikimedia.org.

Current Status
See Dumps/Development status 2011 on wikitech.wikimedia.org.

Related documents
See:
 * Dumps on wikitech
 * Dumps Parallelization on wikitech
 * Mirroring Wikimedia project XML dumps
 * Research Data Proposals

Schedule
Ongoing (fill this in)