Wikimedia Platform Engineering/MediaWiki Core Team/Backlog/Improve dumps

This page collects ideas about how the current dump creation and distribution infrastructure could be improved, both to increase stability and to add frequently requested features.

Who uses (or would like to use) dumps? What do they want to do with the data?

 * Researchers doing various types of analysis on user behavior or revision content changes
 * Bot runners or AWB users generating a list of pages on which to operate
 * Offline reader creators/maintainers (covers a large range, from OpenZim to small mobile apps to 'I have a subset of Wikipedia on my laptop')
 * Companies keeping local mirrors of current content, which they present to their users with added value
 * DBpedia-style extraction of specific content such as infoboxes, converted to database form and queryable by users

What dump formats would be most universally useful?

 * raw wikitext (what wrapper format? a streaming-read sketch for the current XML wrapper follows this list)
 * wikitext parsed to HTML, including expansion of all templates
 * HTML expansion of just certain templates, e.g. infoboxes; alternatively, markup of the full HTML such that template content is identifiable
 * SQL page/revision/etc. tables that can be imported directly
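For the raw-wikitext option, the current XML wrapper can already be consumed as a stream. A minimal Python sketch, assuming a local pages-articles bz2 dump and the 0.10 export schema (the namespace version varies between dumps):

```python
# A minimal sketch: stream page titles and wikitext out of a
# pages-articles dump without loading it into memory. The filename and
# the export-schema namespace version are assumptions; check your dump.
import bz2
import xml.etree.ElementTree as ET

DUMP = "enwiki-latest-pages-articles.xml.bz2"       # assumed local file
NS = "{http://www.mediawiki.org/xml/export-0.10/}"  # schema version varies

with bz2.open(DUMP, "rb") as f:
    for _, elem in ET.iterparse(f, events=("end",)):
        if elem.tag == NS + "page":
            title = elem.findtext(NS + "title")
            text = elem.findtext(f"{NS}revision/{NS}text") or ""
            print(title, len(text))   # stand-in for real processing
            elem.clear()              # free memory as we go
```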

Are there practical file size limits that should be considered?
Even rsyncing large (tens of gigabytes) files gets annoying when interrupted; more, smaller files (5 GB or less?) are better for that purpose.

Some downloaders have explicitly said they have problems with very large files and with the inability to resume interrupted downloads.

On the other side of the issue, some downloaders prefer to process one large file rather than many small ones. Tools for recombining split files might alleviate that concern.
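Resumability is largely solvable on the client side once the server honors HTTP Range requests, which the dump servers generally do. A minimal sketch, with an illustrative URL and chunk size:

```python
# A minimal resume-capable downloader using HTTP Range requests; the
# URL and chunk size are illustrative. Requires the requests package.
import os
import requests

url = "https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2"
dest = url.rsplit("/", 1)[-1]

done = os.path.getsize(dest) if os.path.exists(dest) else 0
headers = {"Range": f"bytes={done}-"} if done else {}
with requests.get(url, headers=headers, stream=True, timeout=60) as r:
    r.raise_for_status()
    # 206 means the server honored the range; anything else restarts cleanly.
    mode = "ab" if r.status_code == 206 else "wb"
    with open(dest, mode) as out:
        for chunk in r.iter_content(chunk_size=1 << 20):
            out.write(chunk)
```

The status check matters: a server that ignores the Range header answers 200 with the whole file, and blindly appending that would corrupt the local copy.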

Should we supply tools for requesting and processing the dumps?
We can maintain an index of tools produced by others, but realistically we'll want to provide a reference set of tools that always works and isn't horribly slow. We need not commit to having it work on all platforms, however.

What is needed for incrementals to be useful to mirror maintainers?

 * delete/move/add lists for revision content
 * delete/insert/update SQL for all SQL tables that are dumped (a hypothetical application sketch follows this list)
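Purely hypothetical, since no incremental format is defined yet: the sketch below invents a tab-separated action list (action, page_id, optional title) and a minimal local page table, just to make the requirement concrete.

```python
# Hypothetical: apply a delete/move/add changeset to a local mirror.
# The changeset format and the page table schema are invented here
# for illustration only.
import sqlite3

db = sqlite3.connect("mirror.db")
db.execute("CREATE TABLE IF NOT EXISTS page (page_id INTEGER PRIMARY KEY, title TEXT)")

def apply_changeset(path):
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            action, page_id = parts[0], int(parts[1])
            title = parts[2] if len(parts) > 2 else None
            if action == "delete":
                db.execute("DELETE FROM page WHERE page_id = ?", (page_id,))
            elif action == "move":
                db.execute("UPDATE page SET title = ? WHERE page_id = ?", (title, page_id))
            elif action == "add":
                db.execute("INSERT OR REPLACE INTO page (page_id, title) VALUES (?, ?)",
                           (page_id, title))
    db.commit()
```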

How often should full dumps be produced? Incremental dumps?
Incrementals: some users want daily changesets; others want updates even more often than that (once an hour, or even every few minutes).
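For the high-frequency end of that range, the existing EventStreams service already publishes recent changes live over server-sent events; incremental dump files would mainly cover the daily and hourly cases. A minimal consumer sketch, assuming the third-party sseclient package:

```python
# Sketch of consuming the live recent-changes feed; assumes the
# third-party sseclient package. Each event is a JSON record.
import json
from sseclient import SSEClient as EventSource

URL = "https://stream.wikimedia.org/v2/stream/recentchange"

for event in EventSource(URL):
    if event.event == "message" and event.data:
        change = json.loads(event.data)
        if change.get("wiki") == "enwiki":
            print(change["type"], change["title"])
```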

User stories
A company wants to mirror English Wikipedia with relatively up-to-the-minute (or at least up-to-the-hour) changes and use it to return search results to its customers, with changes to the format, extra capabilities, etc.

A Wiktionary editor wants to update all pages that contain definitions for a word in a given language and that carry a certain template.

A Wiktionary editor wants to update all pages with new interwiki links depending on the pages added or removed on other Wiktionaries.

A researcher wants to examine reversions of revisions across all articles in the main Wikipedia namespace.
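A full-history dump already carries per-revision SHA-1 hashes, which support the classic identity-revert heuristic: a revision whose hash matches an earlier revision of the same page restores that earlier state. A sketch (the export-schema namespace version is an assumption):

```python
# Sketch of the identity-revert heuristic over a full-history dump: a
# revision whose SHA-1 matches an earlier revision of the same page is
# treated as a revert to that state.
import bz2
import xml.etree.ElementTree as ET

NS = "{http://www.mediawiki.org/xml/export-0.10/}"

def find_reverts(path):
    seen = set()
    with bz2.open(path, "rb") as f:
        for _, elem in ET.iterparse(f, events=("end",)):
            if elem.tag == NS + "revision":
                sha1 = elem.findtext(NS + "sha1")
                if sha1:
                    if sha1 in seen:
                        yield elem.findtext(NS + "id")
                    seen.add(sha1)
            elif elem.tag == NS + "page":
                seen.clear()   # hashes only compare within one page
                elem.clear()
```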

Someone wants to create an offline reader based on a static dump of Wikipedia. It would be nice if the output were easily mungeable HTML (not the skin and all that other stuff). They then want to update it once a week, adding new content and removing deleted content.
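The skin-free HTML this story asks for is roughly what the per-page REST API returns today; a static HTML dump could ship the same payload in bulk. A sketch of fetching one page that way (the title and User-Agent string are illustrative):

```python
# Sketch: fetch one page as body-only Parsoid HTML from the existing
# REST API. Title and User-Agent string are illustrative.
import requests
from urllib.parse import quote

def fetch_clean_html(title, wiki="en.wikipedia.org"):
    slug = quote(title.replace(" ", "_"), safe="")
    url = f"https://{wiki}/api/rest_v1/page/html/{slug}"
    r = requests.get(url, headers={"User-Agent": "offline-reader-sketch/0.1"},
                     timeout=30)
    r.raise_for_status()
    return r.text   # article HTML only, no skin chrome

print(fetch_clean_html("Text corpus")[:200])
```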

A bot runner wants to check all recent pages for formatting and spelling issues once a day, doing corrections as needed.

Someone wants to download all the articles about X topic and dump them into a local wiki, or an offline reader, or serve them as HTML on their laptop.

Someone wants to mirror the contents of all projects in their language, with a homegrown search across all wiki projects in that language returning results.

Someone wants to do research on Wikipedia as it existed two years ago and as it exists today, setting up two SQL databases with the appropriate content.

Someone wants to grab all the infoboxes and turn their content into a searchable cross-wiki database.
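As a rough illustration of the harvesting step, the third-party mwparserfromhell library can pull template parameters straight out of raw wikitext; the wikitext itself would come from whatever dump reader is in use.

```python
# Sketch of infobox harvesting with the third-party mwparserfromhell
# library: collect parameters from any template whose name starts
# with "Infobox".
import mwparserfromhell

def extract_infoboxes(wikitext):
    boxes = []
    for tpl in mwparserfromhell.parse(wikitext).filter_templates():
        if tpl.name.strip().lower().startswith("infobox"):
            boxes.append({str(p.name).strip(): str(p.value).strip()
                          for p in tpl.params})
    return boxes
```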