SQL/XML Dumps/Wikibase dumps via cron

All about Wikibase (rdf) dumps via cron
Okay, maybe not all about them, but enough to get you adding your own scripts for new datasets.

Much of this is specific to the WMF puppet repo, but it builds on scripts in the Wikibase extension that are used to dump Wikidata entities in various formats. We can look at just one of these formats to see how it all works. It may be helpful to have a look at SQL/XML_Dumps/"Other"_dump_jobs_via_cron in combination with this document.

All dumps except for xml/sql dumps run from cron via shell scripts which can be found in our puppet manifests. See   for these. We’ll start by looking at common functions available to all scripts.

Common functions for dump scripts
The script   provides paths to a number of directories, such as the base directory tree for all of these dumps, and the path to the dumps configuration files. It also provides a function for extracting values from the output of a little python script that gets multiple values from those config files, such as paths to useful executables, and the directory tree for temporary files. You should always source  at the top of your script.
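As a minimal sketch, assuming a placeholder path and file name (the real ones live in puppet), the very top of a dump script looks something like this:

<syntaxhighlight lang="bash">
#!/bin/bash
# Placeholder path and file name; the real ones come from puppet.
# Sourcing this makes the directory variables and helper functions
# described above available to the rest of the script.
source /usr/local/bin/dump_functions.sh
</syntaxhighlight>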

Configuration files
We have a few dumps configuration files available: one is for the xml/sql dumps, and you won't want that; one is for WMCS instances, and you won't want that either. The one you do want is  in the   dir, available to you in the   variable. The "other" name is historical; everything other than the xml/sql dumps was considered "other" and is put in a separate directory tree, and even generated by separate hosts writing to a separate filesystem.

Where to write files, what to name them
We like output file names to include the wiki db name, the date in YYYYMMDD format, and the sort of dump being produced, with the output type (json, txt, ttl, etc.) and the compression type as the extension. Typically files are arranged in directories by dump type and then date. For example, cirrussearch dumps go in the subdirectory cirrussearch, with a further subdirectory named for the YYYYMMDD date the run was started, and files for all wikis in that same directory. Output files for mediainfo data from commons in rdf format might be in  for example.

Thus, at the beginning of the script you should save the date in the right format. The base directory tree for these dump output files is available to you in  so you can just tack on the dump name and the date as subdirs after that.
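For example, a minimal sketch of that setup (the base-directory variable name below is a placeholder for whatever the sourced functions file actually provides, and the dump name is made up):

<syntaxhighlight lang="bash">
# Save the run date once, in YYYYMMDD format, and reuse it everywhere.
today=$(date +%Y%m%d)

# Placeholder variable name for the base output tree provided by the sourced
# functions file; tack on the dump name and the date as subdirectories.
outputDir="${otherDumpsBaseDir}/mynewdump/${today}"
mkdir -p "$outputDir"

# Example output file name: wiki db name, date, dump type, format, compression.
outputFile="${outputDir}/commonswiki-${today}-mynewdump.ttl.gz"
</syntaxhighlight>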

Getting config values for your script
We use absolute paths for everything, including e.g. gzip and php. This makes things a bit safer and also lets us switch between versions of php on a rolling basis when we start moving to a new version. We also need to be able to get the directory where  and its ilk reside. The   script is available for retrieving these values. You set up a single argument with the config file sections and setting names you want to retrieve, call the script once, get the output, and then use the  function (included in the   script we saw earlier) to extract each value into its own shell variable. You'll also want to check that each value isn't empty (that something horrible hasn't gone wrong with the config file or whatever) by using , also provided in
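Here is a hedged sketch of that pattern; the script name, helper function names, config sections and settings below are illustrative stand-ins rather than the exact ones in puppet:

<syntaxhighlight lang="bash">
# One call to the little python config script, asking for several settings at
# once; the script name, argument syntax and section/setting names here are
# all illustrative.
results=$( /usr/bin/python3 "${repodir}/getconfigvals.py" --configfile "$configFile" \
           --args 'tools:php,tools:gzip,output:temp' )

# Extract each value into its own shell variable using the extraction helper
# from the common functions file (placeholder function name).
php=$( getsetting "$results" "tools" "php" ) || exit 1
gzip=$( getsetting "$results" "tools" "gzip" ) || exit 1
tempDir=$( getsetting "$results" "output" "temp" ) || exit 1

# Check that nothing horrible went wrong with the config file: no empty values.
for settingname in php gzip tempDir; do
    checkval "$settingname" "${!settingname}"
done
</syntaxhighlight>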

Example code
Enough blah blah, let’s look at code right from   and see for ourselves. This code does all of the above for you, but it’s a good idea to know what all the pieces are before you start building on top of them.

Reminder that all of this is done for you in wikibasedumps-shared.sh so you need only source that in your dumpwikibase(somenewformat).sh script.

There is more here, but we don't need to look at all of the code for it. Instead, we summarize the interesting parts below.

Shared functions for wikibase dumps
A number of convenience functions are provided for you in  so please make use of them:


 * tosses old log files, typically written to files in   or
 * runs DCAT.php on the output files if there is a config specified for it.
 * generates md5 and sha1 checksums for an output file and stashes them in files to be made available for download.
 * gets the number of batches we’ll need to run, based on the number of processes we run at once, the max page id, and a few other details.
 * sets the first and last page id we want to retrieve in a specific batch, etc.
 * gets a properly sorted list of all the output files matching some wildcard.
 * gets the byte count of one or more files. This is used as a sanity check to make sure we don’t suddenly have much smaller output files than expected.
 * logs errors and handles retries of a batch.
 * is used in the case that we have run this script manually to continue a previously failed run. It will determine where we left off.
 * is used at the end to move the temporary files we write to their permanent names and locations.

These are all pretty self-explanatory, and the calls to them in e.g.  should be good examples.
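To give a feel for how the helpers above fit together, here is a hedged sketch of a main loop; every function and variable name in it is a stand-in for one of the helpers just described, not necessarily its real name:

<syntaxhighlight lang="bash">
# Placeholder names throughout: work out how many batches we need, based on
# the max page id and how many dump processes run at once.
numberOfBatchesNeeded=$( getNumberOfBatchesNeeded "$wiki" )

for (( batch=0; batch<numberOfBatchesNeeded; batch++ )); do
    # Work out the first and last page id to retrieve in this batch.
    setPerBatchVars "$batch"

    # Run the maintenance script for this batch, logging errors and retrying
    # via the shared error-handling helper if it fails.
    runDumpBatch "$batch" "$firstPageIdToDump" "$lastPageIdToDump" \
        || handleBatchError "$batch"
done

# Sanity check: refuse to publish output that is suddenly much smaller than expected.
fileSize=$( getFileSize "${tempDir}/${dumpName}"*.gz )
if [ "$fileSize" -lt "$minExpectedSize" ]; then
    echo "output files much smaller than expected, aborting" >&2
    exit 1
fi

# Move the temporary files to their permanent names and write checksum files.
moveOutputFiles "$tempDir" "$outputDir"
writeDumpChecksums "$outputDir"
</syntaxhighlight>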

Per-project functions for wikibase rdf dumps
You’ll be setting up scripts for some other format, but these are the sorts of things you’ll want to provide.

Let’s look at the example for commons rdf dumps:

Pretty simple stuff.
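As a rough illustration only (the function names and values below are made up, not the actual puppet code), per-project functions of this kind mostly just fill in the project-specific settings that the shared script leaves open:

<syntaxhighlight lang="bash">
# Illustrative only: placeholder per-project helpers for commons-style
# rdf dumps, setting the project- and entity-specific bits.
setProjectName() {
    projectName="commons"
}

setEntityType() {
    # mediainfo entities are what we dump from commons.
    entityTypes="--entity-type mediainfo"
}

setFilename() {
    # wiki name, date and dump type end up in the file name, per the
    # conventions described earlier.
    filename="commons-${today}-mediainfo"
}
</syntaxhighlight>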

Now let’s see how all that gets used in