SQL/XML Dumps/Phabricator task management & future

Every day it's good to look at the Dumps Generation project workboard in Phabricator and decide which tasks to work on, keeping future work in mind. This document describes how that process currently works.

Dumps Generation workboard on Phabricator

What are all these columns?

Any new reports will come into the "Backlog" column, appearing on top. Things that are urgent or small should move to the top of the queue; urgent things should be looked at immediately, and small things should be addressed within a few days. If at all possible, investigate enough to reply within one day so that the reporter knows their issue is being addressed. Small items often live here for their whole life cycle, getting worked on and then resolved within a few days at most.

Some of these tasks may be primarily for other teams to work on; this includes issues that involve updates to MW maintenance scripts in extensions used for dumping specialized datasets, such as CirrusSearch or Content Translation. Such tasks can be moved to the "Other teams" column.

Items in the "Active" column should be those larger or harder items currently being worked on. It's good for the "Active" list not to get too large.

If items in the "Active" column are waiting on something for more than a few days (user verification of a fix, action from another team, the next deployment window in two weeks, etc.), they can be moved to the "Blocked/Stalled/Waiting for Event" column.

Large or hard items that will be started as soon as some time frees up from the "Active" items should go into the "Up Next" column. How does one decide which items go here? Ah ha, time to talk about our next topic.

Prioritization

This is not an exhaustive list of what to do for every task but is meant to describe current thinking about priorities, so that the process is at least not entirely opaque.

Items involving bad data that breaks the dumps should always be looked at as soon as possible. They may wind up "Blocked" once some workaround has been put in place that allows the dumps to continue; actually fixing up bad data in the database is tricky and generally happens slowly.

Beyond "these dumps are broken for some reason", there are also requests for new datasets to be dumped. These tasks will be next in priority. Some time from each work week should be dedicated to moving these along at a steady pace. Members of the team maintaining the service or extension from which the data comes, should be involved in the work. Typically, MW maintenance scripts will be written entirely by members of said team, but they should also be introduced to the relevant puppet manifests and encouraged to submit patches.

Hardware or SRE-related tasks pop up from time to time. These can include everything from replacing or expanding the existing server pools to enabling IPv6 where it makes sense. They typically play out over a long time frame; for example, getting from quote to order to racking to the first puppet run of a new server may take months. When these tasks are not waiting for action from others, try to give them some attention within a week of their being handed back to us.

The bulk of the work is intended to keep the dumps running in the same amount of time with ever-increasing amounts of data. The XML/SQL dumps cannot easily be scaled horizontally for the huge (en, wikidata) wikis, so that is where most of the effort is going. Eventually all dumps jobs should be run as small repeatable pieces covering some range of rows or pages, where any server in the pool could take any number of these jobs to run, as directed by a proper job manager such as Airflow. Current work is directed towards restructuring the XML/SQL dumps so that multiple servers can indeed run parts of the most expensive job for the same wiki, that of dumping page content for all revisions. This is a slow task because extensive multi-process testing must be done, and there's no proper testbed for that yet.
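
To make the idea of "small repeatable pieces" concrete, here is a minimal sketch in Python of how a page-content job might be cut into page-range chunks that any worker in the pool could claim. The names, chunk size, and page counts are illustrative assumptions, not the actual dumps code.

  from dataclasses import dataclass
  from typing import Iterator

  @dataclass
  class DumpChunk:
      """One small, repeatable unit of work: dump these pages for this wiki."""
      wiki: str
      job: str
      start_page: int  # first page id in the chunk, inclusive
      end_page: int    # last page id in the chunk, inclusive

  def split_into_chunks(wiki: str, job: str, max_page_id: int,
                        pages_per_chunk: int) -> Iterator[DumpChunk]:
      """Yield page-range chunks covering page ids 1..max_page_id."""
      start = 1
      while start <= max_page_id:
          end = min(start + pages_per_chunk - 1, max_page_id)
          yield DumpChunk(wiki, job, start, end)
          start = end + 1

  if __name__ == "__main__":
      # A job manager (Airflow or similar) would hand each chunk to any idle worker.
      for chunk in split_into_chunks("enwiki", "pages-meta-history",
                                     max_page_id=100_000, pages_per_chunk=20_000):
          print(f"{chunk.wiki} {chunk.job}: pages {chunk.start_page}-{chunk.end_page}")
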

There are some tasks in the "Backlog" column requesting partial dumps of tables with some private data in them. There are privacy implications for these. It is likely that the data from these tables cannot be published in a form that would be usable for import for making a copy of a wiki project. As such, these datasets are of lower priority, though still on the radar.

Future

Work on the XML/SQL dumps is done bearing in mind the larger picture. That picture, though a bit vague, looks like this:

  • All jobs (production of metadata for revisions, production of revision content, dumps of various SQL tables, production of so-called abstracts of pages, etc.) are split into small pieces, each of which can be completed in, say, 30 minutes.
  • These small pieces of jobs are farmed out to workers on any host in the pool, via some job manager such as Airflow.
  • Output files from these small pieces of jobs are stored in an object store such as Hadoop and are recombined into larger files and published for public download or for use in WMCS or by users of the stats hosts.
  • All recombining of files is done by concatenating smaller files together after stripping out intermediate headers (see the sketch after this list). This may not be possible for the 7zip files, but it should be doable for the rest.
  • Generation of hash sums for the files to be published for download is also treated as a task to be farmed out to any idle worker, one hash per worker.
  • No NFS servers are involved at all, since all files are written to Hadoop and from there stored in a staging area to be rsynced to public areas for WMCS use or download. This involves reading data from Hadoop as well, since older page content dumps are read during the generation of new dumps.
  • And somewhere during all this, uh oh. We must move MediaWiki to Kubernetes, and that means moving the dumps processes as well, which are stateful and really the wrong sort of thing to run inside of a pod. Heavy sigh.
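
As a rough illustration of the recombination step, here is a minimal sketch assuming plain-text XML pieces in which each piece repeats the same header and closing tag. The marker strings and file names are assumptions for illustration; the real dump pieces are compressed and handled differently.

  # Stitch per-chunk XML pieces back into one file: keep the header from the
  # first piece and the footer from the last, strip them from everything else.
  HEADER_END = "</siteinfo>\n"  # assumed end-of-header marker
  FOOTER = "</mediawiki>\n"     # assumed closing tag written by every piece

  def recombine(piece_paths, out_path):
      with open(out_path, "w", encoding="utf-8") as out:
          for i, path in enumerate(piece_paths):
              with open(path, "r", encoding="utf-8") as piece:
                  text = piece.read()
              start = 0
              if i > 0:
                  # drop the duplicate header on all but the first piece
                  start = text.index(HEADER_END) + len(HEADER_END)
              end = len(text)
              if i < len(piece_paths) - 1:
                  # drop the footer on all but the last piece
                  end = text.rindex(FOOTER)
              out.write(text[start:end])

  if __name__ == "__main__":
      # placeholder file names for illustration only
      recombine(["part1.xml", "part2.xml", "part3.xml"], "full.xml")
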

Note that at every step of the progress towards this eventual goal, appropriate tests and testbeds for running those tests must be devised.

At this point, if resources were available, the rest of the vague plans for Dumps 2.0 could be revisited.