Wikimedia Platform Engineering/MediaWiki Core Team/Backlog/Improve dumps

Improve Wikimedia dumping infrastructure is a collection of ideas about how the current dump creation and distribution infrastructure could be improved to increase stability and add often-requested features.

How to ensure the right to fork and preservation?
Criteria need to be identified for selecting what *must* be made available.

Mirroring needs to have clear goals; we currently rely on the goodwill of a few poorly assisted mirrors and on some volunteers who mirror datasets on archive.org.

Who uses (or would like to use) dumps? What do they want to do with the data?
Actually use Research Data Proposals


 * Researchers doing various types of analysis on user behavior or revision content changes
 * Bot runners or AWB users generating a list of pages on which to operate
 * Power users generating a list of "problem" pages so that other editors may work on the pages on this list
 * Power users doing various types of analysis and putting the results on the wiki, where they are shown to readers using templates
 * Offline reader creators/maintainers (covers a large range, from OpenZim to small mobile apps to 'I have a subset of Wikipedia on my laptop')
 * Companies keeping local mirrors of current content, which they present to their users with added value
 * Dbpedia-style extraction of specific content such as infoboxes, converted to db format and queryable by users

Cf. Data analysis/mining of Wikimedia wikis.

What dump formats would be most universally useful?

 * raw wikitext (what wrapper format?)
 * static HTML dumps (wikitext parsed to HTML including expansion of all templates): missing since, it's too much
 * something that can be consumed by a basic displayer à la WikiTaxi/XOWA, ideally for other wikis as well (would be useful to browse wikiteam dumps)
 * HTML expansion of just certain templates e.g. infoboxes; alternatively markup of full HTML such that template content is identifiable
 * sql page/revision/etc tables that can be directly imported or that can provide easy access to metadata (such as categorylinks, templatelinks etc.) without the need to download and parse a full wikitext dump, in particular for tasks where access to page content is not needed.
 * word lists, DICT
 * media tarballs
 * Additionally an index page for all dumps and all dump files is needed, preferably in a machine-readable format (json, xml, ...)
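A machine-readable index could be as simple as a JSON document listing every dump file with its size and checksum. The following is a minimal sketch, assuming a hypothetical directory layout of `<root>/<wiki>/<date>/<file>`; the field names and the `build_index` and `index_as_json` helpers are illustrative and do not describe an existing format.

```python
import hashlib
import json
import os


def sha1_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large dump files are never loaded whole into memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_index(root):
    """Walk root/<wiki>/<date>/<file> (an assumed layout) and return a nested dict."""
    index = {}
    for wiki in sorted(os.listdir(root)):
        wiki_dir = os.path.join(root, wiki)
        if not os.path.isdir(wiki_dir):
            continue
        for date in sorted(os.listdir(wiki_dir)):
            date_dir = os.path.join(wiki_dir, date)
            if not os.path.isdir(date_dir):
                continue
            files = []
            for name in sorted(os.listdir(date_dir)):
                path = os.path.join(date_dir, name)
                if os.path.isfile(path):
                    files.append({
                        "name": name,
                        "size": os.path.getsize(path),
                        "sha1": sha1_of(path),
                    })
            index.setdefault(wiki, {})[date] = files
    return index


def index_as_json(root):
    """Serialize the index; sorted keys keep the output stable between runs."""
    return json.dumps(build_index(root), indent=2, sort_keys=True)
```

Publishing the output of `index_as_json` next to the dumps would let downloaders discover files and verify them without scraping HTML pages.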

Compression

 * Bring to a conclusion the work and investigation on further compression formats.
 * Find a way to deprecate bz2 to save one order of magnitude of disk space (this probably requires ensuring that wikistats, pywikibot and other dump parsers are able to stream 7z as well; "7zcat"?).
 * Take advantage of the parallel LZMA compression available in the latest version of LZMA utils.
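For dump parsers, supporting several compression formats mostly means choosing a decompressor and then streaming as usual. The sketch below uses only the Python standard library, which covers bz2, gzip and xz (LZMA) streams; it does not read the 7z archive container, which would need a third-party library or a pipe from an external tool. The `open_dump` and `count_pages` names are hypothetical.

```python
import bz2
import gzip
import lzma

# Map file suffixes to stdlib openers; anything else is read as plain text.
OPENERS = {".bz2": bz2.open, ".gz": gzip.open, ".xz": lzma.open}


def open_dump(path):
    """Return a text stream for the dump, decompressing on the fly."""
    for suffix, opener in OPENERS.items():
        if path.endswith(suffix):
            return opener(path, "rt", encoding="utf-8")
    return open(path, "rt", encoding="utf-8")


def count_pages(path):
    """Stream the file line by line and count lines containing a <page> tag."""
    with open_dump(path) as stream:
        return sum(1 for line in stream if "<page>" in line)
```

Because the stream is decompressed incrementally, memory use stays flat regardless of the size of the dump.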

How do we support and embrace third party data distributors?
Kiwix is especially important. Parsoid made the ZIM production easier and your.org/mirrorservice.org provided mirrors, but much more needs to be done.

Are there practical file size limits that should be considered?
Even rsyncing large files (tens of gigabytes) gets annoying when interrupted; a larger number of smaller files (5 GB or less?) is better for that purpose.

Some downloaders have explicitly said they have problems with very large files and an inability to resume interrupted downloads.

On the other side of the issue, some downloaders prefer to process one large file rather than multiple small files. Tools for recombining these files might alleviate these concerns.
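A recombining tool can be very small if the pieces are plain byte ranges of one file. The following sketch assumes exactly that, with a hypothetical `.partNNNN` naming scheme; it does not handle pieces that are independently compressed XML documents, whose headers and footers would need extra care.

```python
import shutil


def split_file(path, chunk_bytes):
    """Write consecutive byte ranges of path to path.part0000, path.part0001, ...

    Each chunk is read into memory once, so chunk_bytes should be modest.
    """
    parts = []
    with open(path, "rb") as source:
        index = 0
        while True:
            chunk = source.read(chunk_bytes)
            if not chunk:
                break
            part_path = f"{path}.part{index:04d}"
            with open(part_path, "wb") as part:
                part.write(chunk)
            parts.append(part_path)
            index += 1
    return parts


def join_parts(parts, out_path):
    """Concatenate the parts, in the order given, into one file."""
    with open(out_path, "wb") as target:
        for part_path in parts:
            with open(part_path, "rb") as part:
                shutil.copyfileobj(part, target)
```

Downloaders who prefer one large file could run `join_parts` after fetching, while everyone else resumes or retries individual parts.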

Should we supply tools for requesting and processing the dumps?
We can help maintain an index of tools produced by others (cf. ), but realistically we'll probably want to provide a reference set of tools that always works and isn't horribly slow. We need not commit to having it work on all platforms, however.

What is needed for incrementals to be useful to mirror maintainers?

 * delete/move/add lists for revision content
 * delete/insert/update sql for all sql tables that are dumped
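To illustrate how a mirror maintainer might consume such lists, here is a minimal sketch. It assumes a hypothetical changeset structure with `delete`, `move` and `add` keys and models the mirror as a dict from page title to text; the order of application (deletes, then moves, then adds) is also an assumption.

```python
def apply_changeset(mirror, changeset):
    """Return a new mirror dict with the changeset applied; the input is not modified."""
    updated = dict(mirror)
    # 1. Remove deleted pages; a missing title is ignored.
    for title in changeset.get("delete", []):
        updated.pop(title, None)
    # 2. Rename moved pages, keeping their content.
    for old_title, new_title in changeset.get("move", []):
        if old_title in updated:
            updated[new_title] = updated.pop(old_title)
    # 3. Add new pages or overwrite pages that have new content.
    for title, text in changeset.get("add", {}).items():
        updated[title] = text
    return updated


mirror = {"A": "old a", "B": "b", "C": "c"}
changeset = {"delete": ["B"], "move": [("C", "D")], "add": {"A": "new a", "E": "e"}}
print(apply_changeset(mirror, changeset))
```

The same pattern would extend to the sql tables, with the delete/insert/update statements replayed against a local database instead of a dict.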

How often should full dumps be produced? Incremental dumps?
Incrementals: some users want daily changesets. Others want updates even more often than that (every few minutes, once an hour).

Having incremental dumps would reduce the need for frequent full dumps. Also, the disk space needed for incremental dumps grows linearly, whereas the space consumed by full dumps grows quadratically.
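The linear versus quadratic comparison can be illustrated with simple arithmetic. The sketch below makes two simplifying assumptions that are not stated above: the wiki grows by a constant amount per period, and every dump ever produced is kept.

```python
def cumulative_storage(periods, growth_per_period):
    """Return (space for all full dumps, space for all incrementals) after n periods."""
    full_total = 0
    incremental_total = 0
    for period in range(1, periods + 1):
        # A full dump taken at this period contains everything written so far.
        full_total += period * growth_per_period
        # An incremental contains only what was added in this period.
        incremental_total += growth_per_period
    return full_total, incremental_total


print(cumulative_storage(10, 1))   # (55, 10)
print(cumulative_storage(100, 1))  # (5050, 100)
```

Under these assumptions the full dumps total n(n+1)/2 units after n periods, while the incrementals total n units.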

The present frequency of dumps is far from sufficient. Full dumps and sql page/revision/categorylinks/templatelinks/redirect/iwlinks/etc tables should be produced often enough that lists of "problem" pages can easily be updated every few days, which increases motivation to deal with the pages found and listed.

How long should full and incremental dumps be retained?
Full dumps: Until the next full dump has completed successfully.

Interdependencies

 * Can we find synergies with Wikia and other wikifarms which suffer considerable pains in dumps production?
 * How can we make sure that WMF retains a sense of urgency about keeping dumps running even if internal consumers like Wikistats abandon dumps in favour of DB access?

User stories
A company wants to mirror English Wikipedia with relatively up to the minute changes (or at least up to the hour) and use it to return search results to its customers, with changes to the format, extra capabilities, etc.

A Wiktionary editor wants to update all pages containing definitions for a word in a given language, that have a certain template.

A Wiktionary editor wants to update all pages with new interwiki links depending on the pages added or removed on other Wiktionary pages.

A researcher wants to examine reversions of revisions across all articles in the main Wikipedia namespace.

Someone wants to create an offline reader based on a static dump of Wikipedia. It would be nice if the output were easily mungeable HTML (not the skin and all that other stuff). They then want to update it once a week with new content and removal of deleted content.

A bot runner wants to check all recent pages for formatting and spelling issues once a day, doing corrections as needed.

Someone wants to download all the articles about X topic and dump them into a local wiki, or an offline reader, or serve them as HTML on their laptop.

Someone wants to mirror the contents of all projects in their language, with a homegrown search across all wiki projects in that language returning results.

Someone wants to do research on Wikipedia as it existed 2 years ago and as it exists today, setting up two sql dbs with the appropriate content.

Someone wants to grab all the infoboxes and turn their content into a searchable cross-wiki database.

Someone wants to process the Wikipedia article graph to provide an analytical toolkit for browsing and managing Wikipedia categories.

Architecture
Full dumps are going to get larger and larger over time, taking more and more capacity to generate and to store. If parallelized in some smart way, the run time might be kept relatively stable, but the time required to combine the results of many small jobs into a single downloadable file or a small group of files will increase; it would be nice to minimize this by being clever about the final output format.

Ideally there would not be one host (a SPOF) with an NFS filesystem mounted on all the generating hosts. Dump content would be generated locally on a cluster of hosts and then synced to one host or a cluster of hosts for storage and web service.

Users are used to seeing what's going on with the dumps in real time and downloading the latest files the instant they are available, so any new setup would need to take that into account.

Dump jobs can be interrupted for a number of reasons: a MW core bug discovered at deployment time, swapping out one db server for another, a network outage, rebooting the host that generates them or the host where they are written (for security updates), etc. I'd love it if we could apply security updates instantly to any of these hosts, as often as needed, so it would be ideal if a dump run could automatically pick up where it left off, losing at most, say, 30 minutes of run time data. This may impact the file size question raised above. This applies to both sql table dumps and revision content dumps.
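One common way to get this behaviour is to checkpoint progress after each batch of work. The sketch below is an illustration only, not a description of the existing dump scripts: `run_dump`, the checkpoint file format and the `dump_page` callback are all hypothetical, and pages are assumed to be processed in ascending id order.

```python
import json
import os


def load_checkpoint(path):
    """Return the last page id recorded as done, or None if no checkpoint exists."""
    if not os.path.exists(path):
        return None
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)["last_done"]


def save_checkpoint(path, last_done):
    """Write to a temporary file, then rename, so a crash never leaves a half-written file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as handle:
        json.dump({"last_done": last_done}, handle)
    os.replace(tmp_path, path)


def run_dump(page_ids, dump_page, checkpoint_path, batch_size=1000):
    """Process pages in id order, skipping everything at or below the checkpoint."""
    last_done = load_checkpoint(checkpoint_path)
    pending = [p for p in sorted(page_ids) if last_done is None or p > last_done]
    for start in range(0, len(pending), batch_size):
        batch = pending[start:start + batch_size]
        for page_id in batch:
            dump_page(page_id)
        # Only whole batches are recorded; an interrupted batch is redone on restart,
        # so dump_page must tolerate being called again for the same page.
        save_checkpoint(checkpoint_path, batch[-1])
```

The batch size bounds how much work is lost on interruption, which is where the link to output file size comes in: writing one output file per batch makes a partially written file easy to discard and regenerate.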

Currently dumps are done on a 'rolling' basis: the wiki with the oldest dump goes next, and a wiki which has never been dumped is considered oldest and so jumps to the top of the queue. We want this fairness in the new setup.
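The selection rule described here fits in a few lines. This is a minimal sketch, assuming a hypothetical mapping from wiki name to the timestamp of its last completed dump, with `None` meaning the wiki has never been dumped; ties are broken by wiki name, which is an added assumption.

```python
def next_wiki(last_dumped):
    """Pick the wiki to dump next: never-dumped wikis first, then the oldest dump."""
    def sort_key(item):
        wiki, timestamp = item
        # False sorts before True, so wikis without a dump come first.
        return (timestamp is not None, timestamp or 0, wiki)

    return min(last_dumped.items(), key=sort_key)[0]


queue = {"enwiki": 1700000000, "dewiki": 1690000000, "newwiki": None}
print(next_wiki(queue))  # newwiki
```

Once `newwiki` has a timestamp recorded, the same call returns `dewiki`, the wiki with the oldest completed dump.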

The current rolling dumps can't be run by cron, since they run continuously; their design also doesn't lend itself to startup via puppet. The new setup should do one or the other.

We should be able to customize a list of jobs per wiki for special cases, including e.g. json wikibase repo dumps for wikidatawiki, as opposed to what we do now with those running separately by cron and stored in a separate directory tree. Incrementals should be integrated into the process in the same way.
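A per-wiki job list could be expressed as a default list plus per-wiki additions. The sketch below is one possible shape; the job names and the `jobs_for` helper are hypothetical placeholders, not names used by the current dump scripts.

```python
# Jobs every wiki gets, in the order they run (names are illustrative).
DEFAULT_JOBS = ["sql_tables", "revision_content", "incrementals"]

# Extra jobs for special cases, keyed by wiki name.
PER_WIKI_EXTRA = {"wikidatawiki": ["wikibase_json"]}


def jobs_for(wiki, defaults=DEFAULT_JOBS, extras=PER_WIKI_EXTRA):
    """Return the ordered job list for one wiki: the defaults, then any extras."""
    return list(defaults) + list(extras.get(wiki, []))


print(jobs_for("wikidatawiki"))
print(jobs_for("enwiki"))
```

Keeping the special cases in the same configuration as the regular jobs means their output lands in the same directory tree and is covered by the same scheduling and monitoring.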