WMF Projects/Data Dumps

Status as of 31 January 2012: A problem with the rsync to our mirror site was located and fixed. Another organization agreed to mirror the dumps as well, and we are waiting for their server to come online. Back issues of dumps from 2002 through 2006 were made available, for folks interested in historical data. New hardware has arrived in our Virginia datacenter, and we'll be copying all dumps over there as soon as it's ready. We're thinking about how to provide image dumps in some fashion, even if we don't keep local copies of the dumps or they are not run on a regular basis. We also cleaned up the dumps documentation and drafted this year's development plans. Finally, we have a contractor, Christian Aistleitner, who will be working on a test suite for dump generation.

Feature justification
We produce XML-formatted copies of the content of each wiki project in each active language. These files are used by researchers doing analysis, by contributors who run bots to manipulate content, and by organizations that mirror the projects, among others. In addition, because these files exist, the content can be forked at any time, much as downloadable source code lets the community around an open source project fork it when that is deemed appropriate.
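
As a rough illustration of how a researcher or bot operator might consume one of these XML files, the sketch below streams through a dump and yields page titles without loading the whole file into memory. It is a minimal sketch, not the tooling used to generate or read the dumps: the `page` and `title` element names are assumed to match the dump layout, the namespace URI in the sample is a placeholder, and `local_name` and `iter_page_titles` are hypothetical helper names.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def iter_page_titles(stream):
    """Yield the title of each <page> element, one at a time."""
    # Assumption: pages appear as <page> elements with a <title> child.
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) == "page":
            for child in elem:
                if local_name(child.tag) == "title":
                    yield child.text
                    break
            # Drop the finished page's children so memory use stays small.
            elem.clear()


# Tiny stand-in for a dump file; the namespace URI is a placeholder.
SAMPLE = b"""<mediawiki xmlns="http://example.org/hypothetical-export-ns">
  <page><title>Alpha</title><revision><text>first</text></revision></page>
  <page><title>Beta</title><revision><text>second</text></revision></page>
</mediawiki>"""

titles = list(iter_page_titles(io.BytesIO(SAMPLE)))
```

For a real file, the same function could be passed an open binary file handle (or a decompressing file object) in place of the in-memory sample; streaming matters here because the larger dumps are far too big to parse in one piece.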

Since 2007 the availability of the dumps has been inconsistent. The dumps of the English-language Wikipedia in particular are so large that they take a long time to run. If anything goes wrong during that run (a network issue, a power outage, a problem with the server hosting the files or the server generating the dumps, or a change to some part of the MediaWiki code that causes corruption), the entire job is lost and must be restarted.

Speed and availability of current files are not the only issues; we have had neither mirrors nor backups of these files.

We have also had many requests for additional fields to be added to the dumps and for subsets of the dumps to be provided.

This project will address these issues.

Tasks
See Dumps/Development 2011 on wikitech.wikimedia.org.

Current Status
See Dumps/Development status 2011 on wikitech.wikimedia.org.

Related documents
See:
 * Dumps on wikitech
 * Dumps Parallelization on wikitech
 * Mirroring Wikimedia project XML dumps
 * Research Data Proposals

Schedule
Ongoing