Wikimedia Platform Engineering/MediaWiki Core Team/Backlog/Improve dumps

This page collects ideas about how the current dump creation and distribution infrastructure could be improved, both to increase stability and to add often-requested features.

Open questions

 * Who uses (or would like to use) dumps?
 * What do consumers want to do with the dump data (populate a wiki, have a data corpus for non-wiki uses, create an index of some sort for a power user tool, ...)?
 * What dump formats would be most universally useful?
 * Are there practical file size limits that should be considered?
 * What transports are most efficient/easiest to use for downloading dumps?
 * Should we supply tools for requesting and processing the dumps?
 * How often should full dumps be produced?
 * How often should incremental dumps be produced?
 * How long should full and incremental dumps be retained?

User stories
A company wants to mirror English Wikipedia with up-to-the-minute (or at least up-to-the-hour) changes and use it to return search results to its customers, possibly with format changes, extra capabilities, and so on.

A Wiktionary editor wants to update all pages that contain definitions for a word in a given language and that use a certain template.
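Today this kind of task means streaming the XML dump and filtering on wikitext. As a rough illustration of the effort involved, here is a minimal sketch that scans a pages-articles dump for pages invoking a given template, assuming the standard `<page>`/`<revision>`/`<text>` export schema (the schema namespace URI varies between dump versions, so tag namespaces are stripped):

```python
# Sketch only: naive substring match for "{{Template", which misses
# whitespace variants, redirects, and nested transclusions.
import xml.etree.ElementTree as ET

def pages_with_template(xml_source, template):
    """Yield titles of pages whose wikitext contains {{template...}}."""
    needle = "{{" + template
    title = None
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]  # drop any XML namespace prefix
        if tag == "title":
            title = elem.text
        elif tag == "text":
            if elem.text and needle in elem.text:
                yield title
        elif tag == "page":
            elem.clear()  # free memory; dumps are far larger than RAM
```

Streaming with `iterparse` and clearing each `<page>` keeps memory flat, which matters when the decompressed dump runs to tens or hundreds of gigabytes.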

A Wiktionary editor wants to update all pages with new interwiki links, depending on the pages added to or removed from the other Wiktionaries.

A researcher wants to examine reverts of revisions across all articles in Wikipedia's main namespace.
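Full-history dumps include a `<sha1>` checksum per revision, so one common heuristic is to flag a revision as a revert when its content hash matches that of an earlier revision of the same page. A minimal sketch of that heuristic (the revision IDs and hashes below are illustrative, not real dump data):

```python
# Sketch only: exact-match revert detection. It catches identity
# reverts (content restored byte-for-byte) but not partial reverts.
def find_reverts(revisions):
    """revisions: list of (rev_id, sha1) in chronological order.
    Return rev_ids whose content exactly matches an earlier revision."""
    seen = set()
    reverts = []
    for rev_id, sha1 in revisions:
        if sha1 in seen:
            reverts.append(rev_id)
        else:
            seen.add(sha1)
    return reverts
```

Whether this is feasible for a researcher depends heavily on dump format and cadence: it needs the full revision history (or at least stubs with hashes), not just the latest-revision dumps.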

Someone wants to create an offline reader based on a static dump of Wikipedia. It would be nice if the output were easily processable HTML (article content only, without the skin and other page chrome). They then want to update it once a week, adding new content and removing deleted content.

A bot runner wants to check all recently changed pages for formatting and other issues once a day, making corrections as needed.

Someone wants to download all the articles about X topic and dump them into a local wiki, or an offline reader, or serve them as HTML on their laptop.

Someone wants to mirror the contents of all projects in their language, with a homegrown search that returns results from all wiki projects in that language.