SQL/XML Dumps/"Other" dump jobs via cron

“Other” dump jobs via cron
So far we’ve looked at sql/xml dumps and at Wikibase dumps for Wikidata entities. Now let’s look at how other dumps are generated, and how we might use this approach for the watchlist dumps if we decided to run them separately on a weekly basis rather than include them in the sql/xml dumps.

This isn’t as far-fetched as you might think. The sql output in our sql/xml dumps is typically a dump of an entire table via mysqldump, which can be loaded wholesale for import into a new wiki. The watchlist table dump, while producing data from a table for each wiki, is in a special format lending itself to use as a dataset rather than as a potential import. And most such files are produced in the manner described below.

Puppet manifests for dumps: layout
Everything about the various dumps is puppetized, and so are these. They run from a dedicated snapshot host, and the role applied to that host is role::dumps::generation::worker::dumper_misc_crons_only

In general, we have roles for dumps generation workers, which are the snapshot hosts that run the python and MediaWiki scripts, and for dumps generation servers, which are the dumpsdata NFS servers.

The profiles break down the same way; there are profiles for the workers, with the prefix profile::dumps::generation::worker, and profiles for the NFS servers, with the prefix, can you guess? That’s right, profile::dumps::generation::server. All of the worker hosts have the profile profile::dumps::generation::worker::common in common, and then they have a secondary profile according to what they run. In our case, the snapshot host we are interested in has the profile profile::dumps::generation::worker::cronrunner for running these “other” dumps via cron.

If we drill down into that, we can see that the one class it applies is snapshot::cron and you notice right away a difference in the naming scheme. At the class level, workers get classes starting with the name “snapshot”, and the NFS servers get classes with the name “dumps”.

Note that other servers also get classes with the name “dumps”, including the public facing web server and the server that supplies NFS service of the dumps to stat1007.

Digging into the cron class
If you look at the beginning of the code block of the cron class, you’ll see the creation of the file /usr/local/etc/dump_functions.sh. You might remember this file from our discussion of the Wikibase dumps. These functions are there for you to use so that you know where to write output, among other things.

After that we have each cron job listed in random order. Please don’t snark about that too much. Some of these are quite complex jobs, like the Wikibase ones, and some are quite simple. Let’s look at a simple case.

A simple “other” dump: shorturls
snapshot::cron::shorturls is a tiny little class for a tiny self-contained dump. Let’s have a look.

{{CodeCommentary|type=code|lang=puppet|start=1|icon=eyes|pos=14|content= class snapshot::cron::shorturls(
    $user      = undef,
    $filesonly = false,
) {
    $cronsdir = $snapshot::dumps::dirs::cronsdir
    $repodir = $snapshot::dumps::dirs::repodir
    $confsdir = $snapshot::dumps::dirs::confsdir

    if !$filesonly {
        cron { 'shorturls':
            ensure      => 'present',
            environment => 'MAILTO=ops-dumps@wikimedia.org',
            user        => $user,
            command     => "cd ${repodir}; python3 onallwikis.py --wiki metawiki --configfile ${confsdir}/wikidump.conf.dumps:monitor --filenameformat 'shorturls-{d}.gz' --outdir '${cronsdir}/shorturls' --script extensions/UrlShortener/maintenance/dumpURLs.php 'compress.zlib://{DIR}'",
            minute      => '5',
            hour        => '8',
            weekday     => '1',
        }
    }
}
}}
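To make the watchlist idea concrete, here is a rough sketch of what a sibling class might look like if we copied the shorturls pattern. Treat all of it as assumption: the class name snapshot::cron::watchlists, the script path maintenance/dumpWatchlists.php, the output file name and the schedule are all made up for illustration, and the {d} date placeholder is assumed to be what onallwikis.py expects. The sketch also keeps the single-wiki invocation from shorturls; covering every wiki is not shown here.

```puppet
# Hypothetical sketch only: neither this class nor the maintenance script
# named below is taken from the real puppet repo.
class snapshot::cron::watchlists(
    $user      = undef,
    $filesonly = false,
) {
    # same directory lookups as snapshot::cron::shorturls
    $cronsdir = $snapshot::dumps::dirs::cronsdir
    $repodir  = $snapshot::dumps::dirs::repodir
    $confsdir = $snapshot::dumps::dirs::confsdir

    if !$filesonly {
        cron { 'watchlists':
            ensure      => 'present',
            environment => 'MAILTO=ops-dumps@wikimedia.org',
            user        => $user,
            # maintenance/dumpWatchlists.php is a placeholder script name, assumed
            # to accept an output path argument the way dumpURLs.php does
            command     => "cd ${repodir}; python3 onallwikis.py --wiki metawiki --configfile ${confsdir}/wikidump.conf.dumps:monitor --filenameformat 'watchlists-{d}.gz' --outdir '${cronsdir}/watchlists' --script maintenance/dumpWatchlists.php 'compress.zlib://{DIR}'",
            # weekly: Wednesdays at 09:20 (cron weekday 3), an arbitrary choice
            minute      => '20',
            hour        => '9',
            weekday     => '3',
        }
    }
}
```

The weekday attribute is what makes this a weekly job; shorturls uses '1' (Monday), so picking a different day and hour keeps the two jobs from starting at the same time.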

Cleaning up old dumps
We don’t keep dumps forever. Disks are cheap but they still take power and space, and rack space is NOT cheap. So at some point we delete old dumps of all kinds to make way for more recent ones.

In the case of “other” dumps run from cron, there’s a class for this: dumps::web::cleanups::miscdumps. From the name, you can tell it must run on the dumpsdata and/or web server hosts, and not on the snapshots. This is correct! Each host that gets copies of the dump output files also cleans up old ones on its own. We could rsync --delete but that can be very dangerous when it goes wrong. This is the safe and easily configurable approach.

Let’s have a look.

{{CodeCommentary|type=code|lang=puppet|start=1|icon=eyes|pos=11|content= class dumps::web::cleanups::miscdumps(
    $isreplica = undef,
    $miscdumpsdir = undef,
) {
    file { '/usr/local/bin/cleanup_old_miscdumps.sh':
        ensure => 'present',
        path   => '/usr/local/bin/cleanup_old_miscdumps.sh',
        mode   => '0755',
        owner  => 'root',
        group  => 'root',
        source => 'puppet:///modules/dumps/web/cleanups/cleanup_old_miscdumps.sh',
    }

}}

{{CodeCommentary|type=code|lang=puppet|start=34|content= {{ZWS}}    if ($isreplica == true) {
        $addschanges_keeps = '40'
    } else {
        $addschanges_keeps = '7'
    }

    # adds-changes dumps cleanup; these are in incr/wikiname/YYYYMMDD for each day, so they can't go into the above config/cron setup
    $cleanup_addschanges = "find ${miscdumpsdir}/incr -mindepth 2 -maxdepth 2 -type d -mtime +${addschanges_keeps} -exec rm -rf {} \\;"
}}

{{CodeCommentary|type=code|lang=puppet|start=42|icon=eyes|pos=8|content= {{ZWS}}    cron { 'cleanup-misc-dumps':
        ensure      => 'present',
        environment => 'MAILTO=ops-dumps@wikimedia.org',
        command     => "${cleanup_miscdumps} ; ${cleanup_addschanges}",
        user        => root,
        minute      => '15',
        hour        => '7',
        require     => File['/usr/local/bin/cleanup_old_miscdumps.sh'],
    }
}
}}

The output directory though…
The top-level output directory (e.g. “watchlists”) can be created either by your script or by puppet. Letting the script create it is maybe not the best thing, in case your script gets confused and tries to create a directory in some weird place, so having puppet create it is the safer choice.

Yes, there’s a puppet manifest for that! dumps::generation::server::dirs is the place to add your new dir. Define it in the list of declarations of dirs, and then add it to the stanza commented with “subdirs for various generated dumps”, and it will be automatically generated for you.
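As a rough illustration of those two steps, the sketch below declares a new directory and adds it to a file resource that creates the subdirs. It is not a copy of dumps::generation::server::dirs: the class name, the parameter names ($miscdatasetsdir, $user, $group) and the variable $watchlistsdir are assumptions made up for this example, so check the real class for the names it actually uses.

```puppet
# Hypothetical sketch: class, parameter and variable names are illustrative
# and are not taken from the real dumps::generation::server::dirs class.
class example::dumps_dirs_sketch(
    $miscdatasetsdir = undef,
    $user            = undef,
    $group           = undef,
) {
    # step 1: declare the new directory alongside the other declarations
    $watchlistsdir = "${miscdatasetsdir}/watchlists"

    # step 2: subdirs for various generated dumps; add the new variable
    # to the list so puppet creates the directory
    file { [ $watchlistsdir ]:
        ensure => 'directory',
        mode   => '0755',
        owner  => $user,
        group  => $group,
    }
}
```

Because the file resource takes a list of paths, adding one more dump type should only mean one new variable and one new list entry.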

But the downloaders!
Your downloaders need to know how to find your shiny new dumps. Otherwise the whole exercise is a bit like navel-gazing. There’s an index.html file where you can add an entry. This file is long overdue for restructuring, but until then, just add your entry somewhere in the list, probably not at the top though. Don’t add an index.html file in the subdirectory where your datasets go, unless you are prepared to generate it dynamically with links to each file as it’s created.

Rsyncing?
No need to worry about this. We rsync everything in the root of the “other” dumps tree on the dumpsdata NFS server out to the web servers, so if you write it, it will appear. Just make sure not to store some private data over there temporarily, thinking it will remain private!