SQL/XML Dumps/Anatomy of a dumps job

Overview
A job sets up its output filename, the human-readable description of the dump job, and the dump job name for inclusion in various report files used by the code and by dumps end users.

It then sets up the command(s) that must be run in order to produce the output.

And finally, it runs these commands, reporting back if there is an error.

Flow.py
This is a good case to look at because it doesn’t involve multiple output files, it doesn’t run commands in parallel, and it doesn’t need to generate a bunch of temporary files. It also doesn’t depend on input files from any other dump jobs.

Source of the file is here.

Let’s look at the code bit by bit. Don’t worry if you don’t know Python; we don’t care so much about the specific syntax as the general idea of what’s being done. All you really need to know is that indentation levels have meaning, since they, rather than brackets, are used to determine which statements are part of a code block.


Basic class setup
    class FlowDump(Dump):
        """Dump the flow pages."""

        def __init__(self, name, desc, history=False):
            self.history = history

We dump all flow boards; during the "full" dumps we dump separately all revisions of flow boards. Both dumps use the same MediaWiki maintenance script, with a small difference in command-line arguments. Naturally the output filenames differ as well.

            Dump.__init__(self, name, desc)

        def detail(self):
            return "These files contain flow page content in xml format."

This line is displayed in the list of output files per dump run that users see when they view the web page for the dump run of a given wiki and date. It helps users decide which files they need to download.

        def get_filetype(self):
            return "xml"

        def get_file_ext(self):
            return "bz2"

        def get_dumpname(self):
            if self.history:
                return 'flowhistory'
            return 'flow'

These are the parts of a dump output filename.

In the case of flow output, the dump name part of the filename is flow for regular dumps, and flowhistory for the full revision history dumps.
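As a rough illustration of how the values returned by these methods could end up in a filename, here is a minimal sketch. The wiki-date-dumpname.filetype.ext layout, the function names, and the wiki name and run date are all assumptions made for this example; in the dumps code itself, filenames are represented by DumpFilename objects.

```python
def make_output_filename(wiki, date, dumpname, filetype, file_ext):
    """Join the parts of a dump output filename (assumed layout, illustration only)."""
    return "{wiki}-{date}-{dumpname}.{filetype}.{ext}".format(
        wiki=wiki, date=date, dumpname=dumpname, filetype=filetype, ext=file_ext)


def flow_dumpname(history):
    """Mirror FlowDump.get_dumpname(): the name differs for full history dumps."""
    return "flowhistory" if history else "flow"


# hypothetical wiki name and dump run date
regular = make_output_filename("examplewiki", "20200801", flow_dumpname(False), "xml", "bz2")
full = make_output_filename("examplewiki", "20200801", flow_dumpname(True), "xml", "bz2")
print(regular)
print(full)
```

Only the dump name changes between the two jobs; the file type and extension stay the same.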

Building a command
        def build_command(self, runner, output_dfname):

The output_dfname argument will be explained later. For now, know that it is an object representing a typical dump output filename with all of the parts that constitute it.

            if not os.path.exists(runner.wiki.config.php):
                raise BackupError("php command %s not found" % runner.wiki.config.php)

The path to the php binary is configurable; this was useful when we were moving from PHP 5 to PHP 7, and from HHVM to PHP.

            flow_output_fpath = runner.dump_dir.filename_public_path(
                output_dfname)

Here runner.dump_dir is the object that knows about the dump directories. Only the public directories are rsynced to other hosts. The private data is not copied anywhere and is not backed up in any way.

This method returns the path in the public directory for the given wiki and dump run date unless the wiki itself is private, in which case the path in the private directory is returned.

            script_command = MultiVersion.mw_script_as_array(
                runner.wiki.config, "extensions/Flow/maintenance/dumpBackup.php")

This method determines, based on the dumps configuration file, whether a multiversion wrapper script should be called with the maintenance script needed for this dump job, or whether php may be called directly.

            command = [runner.wiki.config.php]
            command.extend(script_command)
            command.extend(["--wiki=%s" % runner.db_name,
                            "--current", "--report=1000",
                            "--output=bzip2:%s" % DumpFilename.get_inprogress_name(flow_output_fpath)])

The in-progress name is just the output file name with a temporary suffix tacked onto the end. We write files out with this suffix so that we can easily clean up all partially written files if a job dies, and so that rsyncs of files to the public-facing servers include only completed files ready for download, saving grief for the users and bandwidth for the servers in the rsync. Remember that the source server for the rsync is an NFS share, so we jealously guard our bandwidth.

            if self.history:
                command.append("--full")
            pipeline = [command]
            series = [pipeline]

Our command runner expects a list of command series which are run in parallel. Each series in the list is a list of command pipelines to be run one after the other.

Each pipeline is a set of commands with pipe symbols to be run as a unit.

This means that if we have just one command to run, we must convert it to a pipeline (list of commands) with one entry; then that pipeline into a series with one entry, and then later a commands-in-parallel unit, with one entry.

            return series
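The nesting described above can be sketched with plain Python lists. This is only an illustration of the data shape: the render() helper, the example wiki name and the second example are hypothetical, and nothing is executed here. A semicolon in the rendered text stands for "run one after the other".

```python
import shlex


def render(commands_in_parallel):
    """Return one shell-like string per series, for display only."""
    rendered = []
    for series in commands_in_parallel:
        pipelines = []
        for pipeline in series:
            # commands within one pipeline are joined by pipe symbols
            pipelines.append(" | ".join(shlex.join(command) for command in pipeline))
        # pipelines within one series run one after the other
        rendered.append("; ".join(pipelines))
    return rendered


# a single command, wrapped step by step
command = ["php", "dumpBackup.php", "--wiki=examplewiki", "--current"]
pipeline = [command]             # a pipeline with one command
series = [pipeline]              # a series with one pipeline
commands_in_parallel = [series]  # a parallel unit with one series
print(render(commands_in_parallel))

# a hypothetical series of two pipelines, the first of which has two commands
first_pipeline = [["cat", "input.txt"], ["gzip"]]
second_pipeline = [["echo", "done"]]
print(render([[first_pipeline, second_pipeline]]))
```

Wrapping a lone command three times looks redundant, but it means the runner only ever has to handle one shape of input.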

Running the command
        def run(self, runner):
            self.cleanup_old_files(runner.dump_dir, runner)

Old output files for this run are removed if this is enabled. In general we don't do this; old output files are complete files and will be left intact, with only missing files (in the case of jobs that write multiple files) filled in on a retry.

            dfnames = self.oflister.list_outfiles_for_build_command(
                self.oflister.makeargs(runner.dump_dir))

Defined elsewhere in the dumps code, this lister uses the file type, file ext, and dump name to construct the output file name. If there are multiple files produced with different dumpnames, that's handled here too. If files are produced by processes running in parallel, they are numbered and those names returned. DumpFilename objects are returned, to be more precise.

            if len(dfnames) > 1:
                raise BackupError("flow content step wants to produce more than one output file")

In this case we want just one output file.

            output_dfname = dfnames[0]
            command_series = self.build_command(runner, output_dfname)
            self.setup_command_info(runner, command_series, [output_dfname])

Also defined elsewhere in the dumps code, this adds an associative array (dict) to a list we keep of information about commands to be run in this job. It allows us to connect a specific command series to the output files that it will produce, so that in case of error the information can be relayed back to the caller and the files can potentially be cleaned up before a retry.

            error, _broken = runner.run_command([command_series],
                                                callback_stderr=self.progress_callback,
                                                callback_stderr_arg=runner,
                                                callback_on_completion=self.command_completion_callback)

Here we have converted the command series into a list of command series to be run in parallel, which has just the one series as an entry.

We jump through all these hoops with special callers not just to catch errors but also to provide a callback that either reports progress as the maintenance script emits progress lines, or checks on the size of the output file in progress and reports on that at regular intervals. This progress information is written to a file in the dump output directory so that it can be used in generating the web page seen by users; if you look at e.g. https://dumps.wikimedia.org/wikidatawiki/ for the most recent dump, assuming it is still in progress, you will see something like:

    2020-08-27 09:01:58 in-progress All pages, current versions only.
    b'2020-08-27 09:39:24: wikidatawiki (ID 45809) 2999 pages (150.8|456.9/sec all|curr), 3000 revs (150.8|152.3/sec all|curr), 97.0%|97.5% prefetched (all|curr), ETA 2020-12-02 13:54:38 [max 1266340344]'

This information is made available courtesy of one of the progress callbacks.
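To make the idea of a progress callback on stderr concrete, here is a minimal sketch using only the Python standard library. It is not the dumps command runner: run_with_callbacks() and the tiny child script are hypothetical stand-ins, and the real runner also handles pipelines, series and parallel execution.

```python
import subprocess
import sys


def run_with_callbacks(command, callback_stderr, callback_on_completion):
    """Run one command, passing each stderr line to a callback as it arrives.

    Returns True if the command failed, mirroring an "error" flag.
    """
    proc = subprocess.Popen(command, stdout=subprocess.DEVNULL, stderr=subprocess.PIPE)
    for line in proc.stderr:
        # lines arrive as bytes, much like the b'...' text in the status shown above
        callback_stderr(line.rstrip())
    proc.stderr.close()
    returncode = proc.wait()
    if returncode == 0:
        callback_on_completion()
    return returncode != 0


progress_lines = []
completed = []

# stand-in for a maintenance script that reports progress on stderr
child = [sys.executable, "-c",
         "import sys\nfor n in (1000, 2000):\n    sys.stderr.write('%d pages\\n' % n)"]

error = run_with_callbacks(child, progress_lines.append, lambda: completed.append(True))
print(error, progress_lines, completed)
```

In this sketch the callback just collects the lines; a real progress callback would write the latest line to a status file instead.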

If your maintenance script does not produce such output as it runs, then you will want to check the file sizes at regular intervals and write that information out to various status files; a separate callback, defined elsewhere in the dumps code, does this. The tables dump jobs all do this; they save output to a compressed file using a helper which automatically handles the choice of the appropriate callback. You can see this in the source for those jobs as well.
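Checking the size of an output file at regular intervals might look roughly like the following. This is a sketch under stated assumptions, not the dumps implementation: the function names, the status line format, the file names and the fake job that stands in for a running dump are all made up for the example.

```python
import os
import tempfile
import time


def report_size(path, status_path):
    """Append the current size of an in-progress output file to a status file."""
    size = os.path.getsize(path) if os.path.exists(path) else 0
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime())
    with open(status_path, "a") as status:
        status.write("%s: %s (written %d bytes)\n" % (stamp, os.path.basename(path), size))
    return size


def watch(path, status_path, is_running, interval=1.0):
    """Report the file size at regular intervals for as long as is_running() is true."""
    sizes = []
    while is_running():
        sizes.append(report_size(path, status_path))
        time.sleep(interval)
    return sizes


workdir = tempfile.mkdtemp()
output_path = os.path.join(workdir, "example-output.partial")  # hypothetical name
status_path = os.path.join(workdir, "status.txt")
chunks = [b"a" * 10, b"b" * 20, b"c" * 30]


def fake_job_still_running():
    """Stand-in for a running dump: write one more chunk per poll until none remain."""
    if not chunks:
        return False
    with open(output_path, "ab") as output:
        output.write(chunks.pop(0))
    return True


sizes = watch(output_path, status_path, fake_job_still_running, interval=0.01)
print(sizes)
```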

The command completion callback moves the temporary (in-progress) files into their permanent location, and may optionally check if the files are empty or, for compressed files, if they are complete or have only been partially written (decompression fails).

            if error:
                raise BackupError("error dumping flow page files")

We hope this never happens :-)
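The completion step described above (check the file, then move it to its permanent name) can be sketched as follows. The suffix value, the function names and the sample files are assumptions for illustration; the actual suffix and the actual callback are not shown in this text.

```python
import bz2
import os
import tempfile

INPROGRESS_SUFFIX = ".partial"  # hypothetical; the real suffix is not shown here


def is_complete_bz2(path):
    """Return True if the whole file decompresses without error."""
    try:
        with bz2.open(path, "rb") as compressed:
            while compressed.read(1024 * 1024):
                pass
        return True
    except (OSError, EOFError):
        # truncated or corrupt data makes decompression fail
        return False


def on_completion(inprogress_path):
    """Move a good in-progress file to its permanent name; return None if it is bad."""
    if not inprogress_path.endswith(INPROGRESS_SUFFIX):
        raise ValueError("not an in-progress file: %s" % inprogress_path)
    if os.path.getsize(inprogress_path) == 0 or not is_complete_bz2(inprogress_path):
        return None
    final_path = inprogress_path[:-len(INPROGRESS_SUFFIX)]
    os.replace(inprogress_path, final_path)
    return final_path


workdir = tempfile.mkdtemp()
good = os.path.join(workdir, "good.xml.bz2" + INPROGRESS_SUFFIX)
truncated = os.path.join(workdir, "truncated.xml.bz2" + INPROGRESS_SUFFIX)
data = bz2.compress(b"<page>example</page>\n" * 1000)
with open(good, "wb") as out:
    out.write(data)
with open(truncated, "wb") as out:
    out.write(data[:len(data) // 2])  # simulate a job that died part way through

good_result = on_completion(good)
truncated_result = on_completion(truncated)
print(good_result, truncated_result)
```

The truncated file keeps its in-progress name, so a later cleanup pass can find and remove it.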