WMF Projects/Data Dumps

Feature justification
We produce XML-formatted copies of the content of each wiki project in each active language. These files are used by researchers doing analysis, by contributors who run bots to manipulate content, and by organizations that mirror the projects, among others. In addition, the existence of these files permits a fork of the content at any time, much as the availability of downloadable source code for open source projects permits the community to fork them when it is deemed appropriate.

Since 2007 the availability of the dumps has been inconsistent. The dumps of the English-language Wikipedia in particular are so large that they take a long time to run, and if anything goes wrong during a run (a network issue, a power outage, a problem with the server hosting the files or the server generating the dumps, a change to some part of the MediaWiki code that causes corruption), the entire job is lost and must be restarted.

Speed and availability of current files are not the only issues; we have had neither mirrors of these files nor backups.

We have had many requests for additional fields to be added to the dumps or for subsets of dumps to be provided as well.

This project will address these issues.

Tasks

 * Backups, availability
   * These files should be backed up to a local host from which they are immediately available at any time.
   * The files should also be backed up to remote storage.
   * We should work with other organizations and with individuals to set up mirroring of the files.
   * Old copies of missing dumps should be obtained from community members so that they can be archived and mirrored.


 * Speed
   * Dumps should be run in batches, with smaller wikis in one set, larger wikis in another, and enwikipedia in a third, so that every wiki project is dumped at a regular interval without excessive waiting in the queue.
   * Dumps of larger wikis, or at least of enwikipedia, should be broken into pieces that can be run in parallel, with each piece taking approximately the same time to run.
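As a rough sketch of the parallelization idea, the page-ID space of a wiki could be divided into contiguous ranges, one per worker. This is a hypothetical helper, not the actual dump code, and a naive even split by page ID will not equalize runtimes (revision counts per page vary widely), so a real version would weight the ranges, for example by revision count.

```python
def chunk_page_ranges(max_page_id, num_chunks):
    """Split page IDs 1..max_page_id into num_chunks contiguous,
    roughly equal (start, end) inclusive ranges, one per dump worker."""
    base = max_page_id // num_chunks
    remainder = max_page_id % num_chunks
    ranges = []
    start = 1
    for i in range(num_chunks):
        # Spread the leftover IDs across the first `remainder` chunks.
        size = base + (1 if i < remainder else 0)
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges
```

Each worker would then dump only the pages in its assigned range, and the resulting pieces could be concatenated or served separately.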


 * Robustness
   * Each dump consists of a number of smaller steps; each step should be restartable from the beginning in case of failure, rather than requiring a restart of the entire dump.
   * Individual steps of a dump should be runnable independently of the main dump process, so that new dumps of a project can be generated while an old one is being tidied up.
   * We should have safeguards in place against writing wrong or corrupt text into the dump files.
   * We should have an automated means of doing random spot checks of dump file content for accuracy.
   * The text of some older revisions for various projects is missing both from the database and from current dumps. We should examine older dumps to see whether this content is available and can be restored to the database.
   * Dumps should be tested on a regular schedule before MediaWiki code is synced from the deployment branch, rather than having the code sync automatically via puppet.
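A minimal sketch of the restartable-step idea, assuming each step records a "done" marker file on success so that a rerun skips completed work; the names and layout here are hypothetical, not the actual dump scripts:

```python
import os

def run_steps(steps, status_dir):
    """Run (name, callable) steps in order, skipping any already done.

    A step that raises leaves no marker behind, so the next run
    retries only that step rather than redoing the whole dump.
    """
    os.makedirs(status_dir, exist_ok=True)
    for name, func in steps:
        marker = os.path.join(status_dir, name + ".done")
        if os.path.exists(marker):
            continue  # completed in a previous run
        func()  # may raise; on failure no marker is written
        with open(marker, "w") as marker_file:
            marker_file.write("done\n")
```

Because each step's state lives in its own marker file, a step could also be invoked on its own, which is what running steps independently of the main dump process requires.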


 * Configuration, running, and monitoring
   * The dump process should support alternate configuration files and alternate lists of wiki projects.
   * The dump process should support starting and stopping a number of runs via a script, rather than manually, with appropriate cleanup at termination.
   * We should generate and keep statistics about the number of downloads for each project in a given time frame, about bandwidth usage, and about bot downloads. (Perhaps we should ask whether organizations doing automated downloads would like to host a local copy.)
   * When a dump process hangs, we should be notified by some automated means so that we can investigate.
   * All non-MediaWiki software needed to run the dumps should be packaged, and the installation of those packages puppetized.
   * The MediaWiki backups code should include setup and operation documentation and sample configuration files.
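One simple way to detect a hung run, assuming an in-progress dump keeps appending to its output file: flag the run as stalled when the file's modification time is older than some threshold. The function below is a hypothetical sketch; wiring it to mail or a monitoring system is left out.

```python
import os
import time

def dump_is_stalled(output_path, timeout_seconds):
    """Return True if output_path has not been modified within the timeout."""
    age = time.time() - os.path.getmtime(output_path)
    return age > timeout_seconds
```

A cron job could call this for each active run and send a notification for any path it reports as stalled.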


 * Enhancements
   * There have been many requests for new data fields to be included in the dumps. These need to be prioritized and added as appropriate. See Research Data Proposals.
   * We have had many requests for "incremental" dumps that would include only the moves, deletions, changed content, and new pages since the last run. We should evaluate this carefully to see whether it is feasible.
   * People have asked us for image dumps. While we would not provide downloads of the entire set of images (8T? Sorry, folks ;-)), we should consider providing smaller, reasonably sized subsets for download.
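To make the "incremental" idea concrete, here is a minimal sketch under the assumption that we can compare per-page metadata (page ID and latest revision ID) between two runs; a real implementation would read this from the stub dumps or the database, and would also need to track moves and deletions explicitly:

```python
def diff_page_sets(old_pages, new_pages):
    """Compare {page_id: latest_rev_id} maps from two dump runs.

    Returns (added, removed, changed) sets of page IDs; an incremental
    dump would then carry content only for added and changed pages.
    """
    old_ids = set(old_pages)
    new_ids = set(new_pages)
    added = new_ids - old_ids
    removed = old_ids - new_ids
    changed = {pid for pid in old_ids & new_ids
               if old_pages[pid] != new_pages[pid]}
    return added, removed, changed
```

The hard part this sketch glosses over is that the database keeps changing while a run is in progress, so the "since the last run" boundary has to be pinned to a consistent snapshot.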

Related documents
See:
 * Dumps on wikitech
 * Dumps Parallelization on wikitech
 * Mirroring Wikimedia project XML dumps
 * Research Data Proposals

Schedule
Ongoing (fill this in)