SQL/XML Dumps/Running a dump job

Running dump jobs
At some point you actually want to run one or more dump jobs, for testing if nothing else. We’ve talked about how the list of jobs is assembled, how just the requested jobs are marked to run, and how a given job runs. Today, let’s look at the worker.py script that is run from the command line, along with all of its options.

worker.py
This, like all of the python dump scripts, is a python3 script. Because Debian Stretch has python 2 as the default, we have to invoke python3 explicitly:

ariel@snapshot1009:/srv/deployment/dumps/dumps/xmldumps-backup$ python3 ./worker.py --help

We won’t discuss all of the options here, just the most useful ones.
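For example, to test a single job for a single wiki by hand, you might run something like the following. This is illustrative only: the date value and the trailing wiki database name are my example, so check --help for the exact syntax:

ariel@snapshot1009:/srv/deployment/dumps/dumps/xmldumps-backup$ python3 ./worker.py --configfile /etc/dumps/confs/wikidump.conf.dumps --date 20201001 --log --job xmlstubsdump elwikivoyage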

Sample commands
We talked about “stage files” a while back. These are all located in /etc/dumps/stages, and they contain all of the worker.py commands that are run automatically. We have stage files for both full dump runs (page content with all revisions) and partial runs; let’s look at what’s required for a partial run.
Each entry in a stage file is a list of space-separated fields:

slots_used numcommands on_failure error_notify command

with the command last, since it contains spaces. Here are the entries for a partial run, in order:

1) stubs and then tables, so that inconsistencies between stubs and tables aren’t too huge:

1 max continue none /bin/bash ./worker --skipdone --exclusive --log --configfile /etc/dumps/confs/wikidump.conf.dumps --date {STARTDATE} --onepass --prereqs --job xmlstubsdump; /bin/bash ./worker --skipdone --exclusive --log --configfile /etc/dumps/confs/wikidump.conf.dumps --date {STARTDATE} --onepass --prereqs --job tables

Notice the use of skipdone, exclusive, log, prereqs, the job name, and the start date.

2) stubs, recombines, tables for big wikis:

6 max continue none /bin/bash ./worker --skipdone --exclusive --log --configfile /etc/dumps/confs/wikidump.conf.dumps:bigwikis --date {STARTDATE} --onepass --prereqs --job xmlstubsdump,xmlstubsdumprecombine; /bin/bash ./worker --skipdone --exclusive --log --configfile /etc/dumps/confs/wikidump.conf.dumps:bigwikis --date {STARTDATE} --onepass --prereqs --job tables

You can see here that the config file is specified with the config section “bigwikis”, which includes special settings for these wikis; they run 6 parallel processes at once.

3) regular articles:

1 max continue none /bin/bash ./worker --skipdone --exclusive --log --configfile /etc/dumps/confs/wikidump.conf.dumps --date {STARTDATE} --onepass --prereqs --job articlesdump

4) regular articles, recombines for big wikis:

6 max continue none /bin/bash ./worker --skipdone --exclusive --log --configfile /etc/dumps/confs/wikidump.conf.dumps:bigwikis --date {STARTDATE} --onepass --prereqs --job articlesdump,articlesdumprecombine

Nothing too exciting here.

5) regular articles multistream:

1 max continue none /bin/bash ./worker --skipdone --exclusive --log --configfile /etc/dumps/confs/wikidump.conf.dumps --date {STARTDATE} --onepass --prereqs --job articlesmultistreamdump

6) regular articles multistream, recombines for big wikis:

6 max continue none /bin/bash ./worker --skipdone --exclusive --log --configfile /etc/dumps/confs/wikidump.conf.dumps:bigwikis --date {STARTDATE} --onepass --prereqs --job articlesmultistreamdump,articlesmultistreamdumprecombine

7) articles plus meta pages:

1 max continue none /bin/bash ./worker --skipdone --exclusive --log --configfile /etc/dumps/confs/wikidump.conf.dumps --date {STARTDATE} --onepass --prereqs --job metacurrentdump

8) articles plus meta pages, recombines for big wikis:

6 max continue none /bin/bash ./worker --skipdone --exclusive --log --configfile /etc/dumps/confs/wikidump.conf.dumps:bigwikis --date {STARTDATE} --onepass --prereqs --job metacurrentdump,metacurrentdumprecombine

More boring entries.

9) all remaining jobs except for the history revs, for regular and then for big wikis:

1 max continue none /bin/bash ./worker --skipdone --exclusive --log --configfile /etc/dumps/confs/wikidump.conf.dumps --date {STARTDATE} --onepass --prereqs --skipjobs metahistorybz2dump,metahistorybz2dumprecombine,metahistory7zdump,metahistory7zdumprecombine,xmlflowhistorydump

6 max continue none /bin/bash ./worker --skipdone --exclusive --log --configfile /etc/dumps/confs/wikidump.conf.dumps:bigwikis --date {STARTDATE} --onepass --prereqs --skipjobs metahistorybz2dump,metahistorybz2dumprecombine,metahistory7zdump,metahistory7zdumprecombine,xmlflowhistorydump

Note how we specify certain jobs to be skipped. Because they are never marked to run, once the rest of the jobs are complete, the entire dump is marked as complete.
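To make the field layout concrete, here is a minimal python sketch of parsing a single stage entry. This is my illustration rather than the actual scheduler code, and it assumes only the five-field format described above:

def parse_stage_entry(line):
    """Split one stage file entry into its five fields."""
    # The command comes last precisely because it contains spaces
    # (and may even use semicolons to chain several worker runs),
    # so split on whitespace at most four times and keep the rest.
    line = line.strip()
    if not line:
        return None
    slots_used, numcommands, on_failure, error_notify, command = line.split(None, 4)
    return {
        'slots_used': slots_used,
        'numcommands': numcommands,    # a number or the literal 'max'
        'on_failure': on_failure,      # e.g. 'continue'
        'error_notify': error_notify,  # e.g. 'none'
        'command': command,
    }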

Configuration
I’ve thought about replacing these standard python config format files with yaml or json. And that’s as far as it’s gotten: thinking about it.

The production configuration settings live in /etc/dumps/confs/wikidump.conf.dumps, so let’s look at a few of them. Note that the file is generated from a puppet template, so I can’t link you to a copy of the completed config in our repo.
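As an illustration of the “file:section” notation you saw in the stage file commands above (wikidump.conf.dumps:bigwikis), here is a rough python sketch of how such a spec could be resolved with the standard configparser module. The helper names and the fallback behavior are my assumptions for the sketch, not the production code:

import configparser

def load_conf(spec):
    # Split 'path:section' into the file to read and an optional
    # override section, e.g. '/etc/dumps/confs/wikidump.conf.dumps:bigwikis'.
    path, _, override = spec.partition(':')
    conf = configparser.ConfigParser()
    conf.read(path)
    return conf, (override or None)

def get_setting(conf, override, base_section, option):
    # Prefer the override section (e.g. 'bigwikis') when it defines
    # the option; otherwise fall back to the base section's value.
    if override is not None:
        value = conf.get(override, option, fallback=None)
        if value is not None:
            return value
    return conf.get(base_section, option)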

There may be other settings added from time to time; check the docs and the puppet manifests for details!