Wikimedia Platform Engineering/MediaWiki Core Team/Backlog/Improve dumps

Improve Wikimedia dumping infrastructure is a collection of ideas about how the current dump creation and distribution infrastructure could be improved to increase stability and add often requested features.

How to ensure the right to reuse, to fork and preservation?
Mirroring needs to have clear goals; we currently rely on the good will of few poorly-assisted mirrors and on some volunteers who mirror datasets on archive.org.

Criteria need to be identified that allow to select what *must* be made available.

Availability of content needs to be secured and well known/adverted.

Who uses (or would like to use) dumps? What do they want to do with the data?
Actually use Research Data Proposals


 * Researchers doing various type of analysis on user behaviour, revision content changes or text mining.
 * Bot runnners or AWB users generating a list of pages on which to operate
 * Power users generating a list of "problem" pages so that other editors may work on the pages on this list
 * Power users doing various type of analysis and putting the results on the wiki where the results are shown to readers using templates
 * Offline reader creators/maintainers (covers a large range, from OpenZim to small mobile apps to 'I have a subset of Wikipedia on my laptop')
 * Companies keeping local mirrors of current content which with value added is presented to their users
 * Dbpedia-style extraction of specific content such as infoboxes, converted to db format and queriable by users
 * End-users needing an offline copy.
 * Analysis of some data changing over time can only be done on dumps. For example, the categories a page was in last year.  Ideally, dumps would be incremental and more frequent to enable ongoing research and increase the resolution of this data that's otherwise discarded by mediawiki.
 * Data Analysis wants to produce stats on content and content creators where direct processing of online databases (e.g. via hadoop) is less efficient

Cf. Data analysis/mining of Wikimedia wikis.

What dump formats would be most universally useful?

 * ZIM format, most of the users are probably end-users and the ZIM format is the best format for them (quick random access, heavily compressed, portable, multimedia, directly usable, maintained).
 * raw wikitext (what wrapper format?)
 * static HTML dumps (wikitext parsed to HTML including expansion of all templates): missing since, it's too much
 * something that can be consumed by a basic displayer à la WikiTaxi/XOWA, ideally for other wikis as well (would be useful to browse wikiteam dumps)
 * HTML expansion of just certain templates e.g. infoboxes; alternatively markup of full HTML such that template content is identifiable
 * sql page/revision/etc tables that can be directly imported or that can provide easy access to metadata (such as categorylinks, templatelinks etc.) without the need to download and parse a full wikitext dump, in particular for tasks where access to page content is not needed.
 * word lists, DICT
 * media tarballs
 * An incremental file format would be great. Git seems convenient, rsync patch, not sure what the best choice is.  The main reason is speed of export so that ongoing research can be done quickly and resolution of data changing over time (categorylinks) is higher than we have now.
 * Directing a complete, rich version of the recent changes stream into a fast append-mostly analysis data store like Druid. This may be the best of labsdb, dumps, and incremental dumps with the least engineering effort on our part.
 * Something splittable on a consistent separator.I.e. any given record contains all relevant metadata. The current XML hierarchy means that whole files have to be parsed in order to read a single revision.   Flat JSON objects would be fine.  (i.e. schema used in analytics hadoop work)

How to publish the dumps?

 * Additionally an index page for all dumps and all dump files is needed, preferably in a machine-readable format (json, xml (opds), ...)
 * http://download.wikimedia.org is not end-user friendly, something better should be proposed.
 * People don't know (well) about the dumps . The availability of offline dumps should be better advertised.
 * SQL dumps could be published in (compressed) cross-platform CSV format, instead of the current MySQL (mysqldump) format (e.g. categorylinks, externallinks, pagelinks, etc.). The table schema and even a fixed DB engine (InnoDB) is currently embedded in the dump file. This makes it difficult to use alternative engines like Percona's XtraDB or Aria (in MariaDB), or different database products (e.g. PostgreSQL).

Compression

 * Bring to a conclusion the work and investigation on further compression formats.
 * Find a way to deprecate bz2 to save one order of magnitude of disk space (probably requires ensuring wikistats, pywikibot and other dump parsers are able to stream 7z as well, "7zcat"?)?
 * Take advantage of the parallel LZMA compressiona available in the latest version of LZMA utils.
 * ZIM format is LZMA compressed.
 * Something splittable! bz2 is fine, LZO maybe?
 * Consider switching from .7z files (bound to the 7-zip toolset) to .xz files (managed with XZ utils). Since version 5.2.0 XZ utils support multithreaded LZMA2 compression/decompression. Currently, 7-zip has a good compression ratio/performance balance (with multiple threads) but multithreaded decompression is not available. This option could speed up data processing on the client side significantly. Besides, the Python 3 standard library supports xz files but not 7z files. This would enable creating a decompression stream directly from Python code, instead of calling a shell subprocess to pipe the output from 7za (as it is currently customary in most existing tools).

How do we support and embrace third party data distributors?

 * Kiwix is especially important. Parsoid made the ZIM production easier and your.org/mirrorservice.org provided mirrors, but much more needs to be done.
 * Hydriz is in charge of moving the dumps to the Internet Archive via Wikimedia Labs. Would be great to have a public API for the dumps generation progress so that it is easier to find out which dumps have just been completed without using hackish screen-scraping methods.

Are there practical file size limits that should be considered?
Even rsyncing large (10s of gigabytes) files gets annoying when interrupted; more smaller (5gb or less?) files are better for that purpose.

Some downloaders have explicitly said they have problems with vry large files and an inability to resume interrupted downloads.

On the other side of the issue, some downloaders prefer to process one large file rather than multiple small files. Tools for recombining these files might alleviate these concerns.

What transports are most efficient/easiest to use for downloading dumps?
Bittorrent is great (in particular with web seeds) but needs an extra download software (in addition to the browser). HTTP is the easiest but not adapted for big files (higher risk of data corruption). Providing both and a combination of them is important.

Should we supply tools for requesting and processing the dumps?
We can help maintain an index of tools produced by others (cf. ), but being realistic, we'll probably want to provide a reference set of tools that always works and isn't horribly slow. We need not commit to have it work on all platforms, however.

What about applying a series of incrementals to a full dump? What is needed for incrementals to be useful to mirror maintainers?

 * delete/move/add lists for revision content
 * delete/insert/update sql for all sql tables that are dumped
 * The openZIM project has projects (which are partly implemented) to provide this feature, this is an ongoing effort.
 * The htmldumper script updates a file-based HTML dump incrementally by only downloading revisions that are missing locally while paging through the list of current title revisions using the PHP API. This is less efficient than doing the same based on a list of changes, but does work on arbitrary base versions of a dump.
 * The htmldumper script updates a file-based HTML dump incrementally by only downloading revisions that are missing locally while paging through the list of current title revisions using the PHP API. This is less efficient than doing the same based on a list of changes, but does work on arbitrary base versions of a dump.

How often should full dumps be produced? Incremental dumps?
Incrementals: some users want daily changesets. Others want updates even more often than that (every few minutes, once an hour).

Having incremental dumps would reduce the need for frequent full dumps. Also, incremental dumps have linear disk space need, whereas full dumps have quadratic space consumption.

The present frequency of dumps is by far insufficient. Full dumps and sql page/revision/categorylinks/templatelinks/redirect/iwlinks/etc tables should be produced quite often so that lists of "problem" pages can be easily updated every few days, which increases motivation to deal with the pages found and listed.

Well connected end-users want new ZIM files on a daily basis, the other ones (the ones who really needs offline snapshots) don't care so much (they have other more critical issues to fix in their life). Monthly updates seems to be a good compromise between the need to have fresh dumps and IT resource consumption.

How long should full and incremental dumps be retained?
Full dumps: Until the next full dump has completed successfully.

Interdependencies

 * Can we have sinergies with Wikia and other wikifarms which suffer considerable pains in dumps production?
 * How to make sure that WMF keeps having an urgency to keep dumps running even if internal consumers like Wikistats happened to abandon dumps in favour of DB access?

support of wikipedia vs. other mediawiki installs
It would be good if whatever processes are developed are also tested against generic installs of mediawiki. Background: We are presently trying to use OCG/ZIM on our OER wiki at http://oer.educ.cam.ac.uk, and while these run well for wikipedia, we are struggling significantly in terms of using this on our wiki, see.

User stories
A company wants to mirror English Wikipedia with relatively up to the minute changes (or at least up to the hour) and use it to return search results to its customers, with changes to the format, extra capabilities, etc.

A Wiktionary editor wants to update all pages containing definitions for a word in a given language, that have a certain template.

A Wiktionary editor wants to update all pages with new interwiki links depending on the pages added or removed on other Wiktionary pages.

A Wiktionary editor counts template-enclosed interproject links (to Wikispecies and WP) to measure "demand" for entries that would otherwise be redlinks.

A researcher wants to examine reversions of revisions across all articles in the main Wikipedia namespace.

A researcher wants to use MapReduce (via Hadoop) to do analysis of English Wikipedia dumps. Dump files should be easily splittable by Hadoop.

Someone wants to create an offline reader based on a static dump of Wikipedia. It would be nice if the output were easily mungeable HTML (not the skin and all that other stuff). They then want to update it once a week with new content and removal of deleted content.

A bot runner wants to check all recent pages for formatting and spelling issues once a day, doing corrections as needed.

Someone wants to download all the articles about X topic and dump them into a local wiki, or an offline reader, or serve them as HTML on their laptop.

Someone wants to mirror the contents of all projects in their language, with a homegrown search across all wiki prjects in that language returning results.

Someone wants to do research on Wikipedia as it existed 2 years ago and as it exists today, setting up two sql dbs with the appropriate content.

Someone wants to grab all the infoboxes and turn their content into a searchable cross-wiki database.

Someone wants to process Wikipedia articles graph to provide analytical toolkit for browsing and managing Wikipedia categories.

A user don't have a good (uncensored, cheap, fast) broadband access and prefer to have an offline copy.

Configuring Additional Dumps
It would be very useful to be able to define additional types of dumps to be created for projects with special needs, such as JSON and RDF dumps for Wikidata. Adding another dump should be a simple config change to be submitted via gerrit. The configuration for an additional dump (as a JSON file or similar) should define:
 * maintenance script(s) to run, or table to dump
 * output file(s) generated
 * the dump schedule
 * a short name
 * description text (wikitext or HTML? translatable? a wiki page?)

If we want to get fancy, we could add support for dependency tracking between dumps: if dump A gets generated from dump X, we could declare this, and the system could generate the dumps in the correct order, and know that they will be consistent. But that's perhaps asking a bit much for now.

Dump Organization and Discovery
Dumps are currently organized into timestamped directories. But different dumps are created at different times, and perhaps at different intervals, so this kind of organization is misleading at best. We should:
 * drop the misleading "one dir with one timestamp for all dumps" approach
 * have one "timeline" per dump file instead
 * for dumps that are guaranteed to be consistent (one generated from the other), generate a "timeline" of directories with symlinks to the actual files.

Dumps should be easily discoverable. There should be stable, well documented URLs for fetching the most recent dump of a specific type for a given project. Besides providing stable, predictable URLs, there should also be a machine readable inventory. We should have (per project, at least):
 * A JSON file describing all the available dumps, their URL, descriptions, status (scheduled/doing/done/failed), timestamp, etc. The JSON file should be placed at a predictable URL.
 * On top of that, we may want to support a data portal inventory standard like DCAT-AP.

Architecture
Full dumps are going to get larger and larger over time, taking more and more capacity to generate and to store. If parallelized in some smart way, the run time might be kept relatively stable, but the time required to combine the results of many small jobs to a single or small group of downloadable files will increase; it would be nice to minimize this by being clever about the final output format.

Ideally there would not be one host (spof) with an nfs mounted filesystem mounted on all the generating hosts. Dump content would be generated locally on a cluster of hosts and then synced to one or a cluster of hosts for storage and web service.

Users are used to seeing what's going on with the dumps in real time and downloading the latest files the instant they are available, so any new setup would need to tke that into account.

Dump jobs can be interrupted for a number of reasons: MW core bug discovered at deployment time, swapping out one db server for another, network outage, rebooting the host that generates them or the host where they are written, for security updates, etc. I'd love it if we could apply security updates instantly to any of these hosts, as often as needed, so it would be ideal if a dump run could automatically pick up where it left off, losing at most say 30 minutes of run time data. This may impact the file size question raised above. This applies for both sql table dumps and for revision content dumps.

Currently dumps are done on a 'rolling' basis: the wiki with the oldest dump goes next, and a wiki which has never been dumped is considered oldest and so jumps to the top of the queue. We want this fairness in the new setup.

The current rolling dumps can't be run by cron, since they run continuously; their design also doesn't lend itself to startup via puppet. The new setup should do one or the other.

We should be able to customize a list of jobs per wiki for special cases, including e.g. json wikibase repo dumps for wikidatawiki¸ as opposed to what we do now with those running separately by cron and stored in a separate directory tree. Incrementals should be integrated into the process in the same way.

ZIM dumps should be generated in Wikimedia datacenter on time a month and published on http://download.wikimedia.org.org. Kiwix project is currently setting up a few VM in labs to create the ZIM files. http://download.wikimedia.org/kiwix/ should be extended to store all of them. An effort to provide a nice download user interface should be accomplished.

Database schemata
Database schemata are showing genetic drift. SQL dump files (2015-June) do not agree with maintenance/tables.sql (2015-June). This impedance mismatch causes unnecessary manual intervention and cognitive load. Here are the main issues.

Superfluous columns in SQL dump

 * page.sql.gz shows page_counter field, which should be removed (all hit counters were removed from core).T103564
 * page.sql.gz shows page_no_title_convert field, which should be removed (this functionality was moved to page_props.pp_propname).T103571
 * site_stats.sql.gz shows ss_total_views field, which should be removed (all hit counters were removed from core).T103564

Missing columns in SQL dump

 * page.sql.gz is missing page_lang field.T103718

Misdefined columns in SQL dump

 * image.sql.gz shows img_description field, which should be resized.T103575
 * protected_titles.sql.gz show pt_reason field, which should be resized.

(Dis)order of columns in SQL dump

 * categorylinks.sql.gz should reposition cl_sortkey_prefix.T103583
 * imagelinks.sql.gz should reposition il_from_namespace.
 * interwiki.sql.gz should reposition all columns.
 * page.sql.gz should reposition page_links_updated.
 * page_restrictions.sql.gz should reposition pr_id.
 * pagelinks.sql.gz should reposition pl_from_namespace.
 * templatelinks.sql.gz should reposition tl_from_namespace.

(Dis)order of indexes in SQL dump

 * imagelinks.sql.gz shows il_backlinks_namespace index, which should be reordered.T103579
 * templatelinks.sql.gz show tl_backlinks_namespace index, which should be reordered.

Inadequate INSERT command in SQL dump
The INSERT commands in the SQL dump files, do not list the columns. This should be fixed, so that an SQL dump using one column order can be INSERTed seemlessly into a database table using a different column order. SQL dumps should use the format:

INSERT INTO table (col1,col2, ...) VALUES (val1,val2, ...)(val1,val2, ...);

and not, as is currently the case, the format:

INSERT INTO table VALUES (val1,val2, ...)(val1,val2, ...);

Superfluous table in SQL dump

 * interwiki.sql.gz is an empty table (use interwiki.cdb instead).T103589

(Dis)order of rows in SQL dump
Disordered rows can kill INSERT performance for large tables. Here are the worst offenders. Each of these tables should have a declared PRIMARY KEY; and not, as is currently the case, leave it to the DBMS to make one that is opaque. For, there is no guarantee that the opaque index generated by one DBMS will be the same as that generated by another. INSERT performance for enwiki_yyyymmdd_pagelinks.sql.gz is roughly this: in order takes 2 days; out of order takes 2 months (due to billions of disk read/write head movements).
 * imagelinks.sql.gz (enwiki 700m rows)
 * maintenance/tables.sql
 * column order (il_from,il_from_namespace,il_to)
 * UNIQUE INDEX (il_from,il_to)
 * imagelinks.sql.gz
 * column order (il_from,il_to,il_from_namespace)
 * UNIQUE INDEX (il_from,il_to)
 * row order   (il_to,il_from_namespace,il_from)
 * langlinks.sql.gz (enwiki 200m rows)
 * maintenance/tables.sql
 * column order (ll_from,ll_lang,ll_title)
 * UNIQUE INDEX (ll_from,ll_lang)
 * langlinks.sql.gz
 * column order (ll_from,ll_lang,ll_title)
 * UNIQUE INDEX (ll_from,ll_lang)
 * row order   (ll_lang,ll_title,ll_from)
 * pagelinks.sql.gz (enwiki 900m rows)
 * maintenance/tables.sql
 * column order (pl_from,pl_from_namespace,pl_namespace,pl_title)
 * UNIQUE INDEX (pl_from,pl_namespace,pl_title)
 * pagelinks.sql.gz
 * column order (pl_from,pl_namespace,pl_title,pl_from_namespace)
 * UNIQUE INDEX (pl_from,pl_namespace,pl_title)
 * row order   (pl_from_namespace,pl_namespace,pl_title,pl_from)
 * templatelinks.sql.gz (enwiki 500m rows)
 * maintenance/tables.sql
 * column order (tl_from,tl_from_namespace,tl_namespace,tl_title)
 * UNIQUE INDEX (tl_from,tl_namespace,tl_title)
 * templatelinks.sql.gz
 * column order (tl_from,tl_namespace,tl_title,tl_from_namespace)
 * UNIQUE INDEX (tl_from,tl_namespace,tl_title)
 * row order   (tl_from_namespace,tl_namespace,tl_title,tl_from)