SQL/XML Dumps/Running a dump job

Running dump jobs
At some point you actually want to run one or more dump jobs, for testing if nothing else. We’ve talked about the list of jobs that is assembled, how just the jobs requested are marked to run, and how a given job runs. Today, let’s look at the worker.py script that is run from the command line, along with all of its options.

worker.py
This, like all python dump scripts, is a python3 script. Because debian stretch has python 2 as the default, we will have to invoke python3 explicitly.

ariel@snapshot1009:/srv/deployment/dumps/dumps/xmldumps-backup$ python3 ./worker.py --help

We won’t discuss all of the options here, just the most useful ones.

--dryrun     Just print the commands that would be run.
--verbose    Print commands as they are run.

--configfile Path to the dumps config file, optionally followed by a colon and a config section name with extra settings. Known names are: bigwikis, wd (wikidata), en (enwiki).
--date       We run dumps on the 1st and 20th of the month, so all jobs will have a date 202x0y01 or 202x0y20. If you don’t specify one of these, the run will have today’s date, new directories for output will be created, and so on.

--job        Which job or jobs to run, comma-separated. If you leave this blank, you will get a list of all known jobs. Note that if you need to rerun all the sql table dumps (yes, this has happened), you can just specify “tables” instead of each job one at a time. Default: run everything.
--skipjobs   Comma-separated list of jobs NOT to run, in case you didn’t specify --job.
--skipdone   Do not rerun any jobs that completed successfully (default: DO rerun them).
--prereqs    If the prerequisite job for any specified job is missing, run it first. Example: the articlesdump job requires the xmlstubsdump job to run first.

--addnotice  Add a file “notice.txt” in the dump run directory for this wiki and date, which will be inserted into the index.html file for that wiki and dump run, in case there is a known problem.
--delnotice  Remove any such notice.

--exclusive  Lock the wiki for this run date so that nothing else runs a job. This means that different jobs for the same wiki cannot be run at the same time... yet. But it also means that you won’t have multiple processes trying to run the same job.

--log        Write all progress and other messages to a logging facility, as determined by the config file.

--cutoff     Provide a date in YYYYMMDD format, and get the name of the next wiki with no dump run for the specified job(s) for that date, choosing the wiki with the oldest previous run. Age here is determined first by the name of the dump run directory, and secondly by timestamp in case of a tie. If there are no such wikis, the script exits with no output.
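For example, to see what the stubs job would do without actually running anything, an invocation might look like this (a hypothetical example; the date here is made up):

ariel@snapshot1009:/srv/deployment/dumps/dumps/xmldumps-backup$ python3 ./worker.py --configfile /etc/dumps/confs/wikidump.conf.dumps --date 20201001 --skipdone --prereqs --dryrun --job xmlstubsdump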

Sample commands
We talked about “stage files” a while back. These are all located in /etc/dumps/stages and they contain all of the worker.py commands that are run automatically, so let’s look at one of those. We have stage files for both full dump runs (page content with all revisions) and partial runs; here’s what’s required for a partial run.
Each entry in a stage file is a list of space-separated fields, with the command last since it contains spaces:

slots_used numcommands on_failure error_notify command

First, stubs and then tables, so that inconsistencies between stubs and tables aren't too huge:

1 max continue none /bin/bash ./worker --skipdone --exclusive --log --configfile /etc/dumps/confs/wikidump.conf.dumps --date {STARTDATE} --onepass --prereqs --job xmlstubsdump; /bin/bash ./worker --skipdone --exclusive --log --configfile /etc/dumps/confs/wikidump.conf.dumps --date {STARTDATE} --onepass --prereqs --job tables

Notice the use of skipdone, exclusive, log, prereqs, the job name and the start date.

Next, stubs, recombines, and tables for the big wikis:

6 max continue none /bin/bash ./worker --skipdone --exclusive --log --configfile /etc/dumps/confs/wikidump.conf.dumps:bigwikis --date {STARTDATE} --onepass --prereqs --job xmlstubsdump,xmlstubsdumprecombine; /bin/bash ./worker --skipdone --exclusive --log --configfile /etc/dumps/confs/wikidump.conf.dumps:bigwikis --date {STARTDATE} --onepass --prereqs --job tables

You see here that the config file is specified with the config section “bigwikis”, which includes special settings for these wikis; they run 6 parallel processes at once.

Then the regular articles:

1 max continue none /bin/bash ./worker --skipdone --exclusive --log --configfile /etc/dumps/confs/wikidump.conf.dumps --date {STARTDATE} --onepass --prereqs --job articlesdump

And the regular articles plus recombines for the big wikis:

6 max continue none /bin/bash ./worker --skipdone --exclusive --log --configfile /etc/dumps/confs/wikidump.conf.dumps:bigwikis --date {STARTDATE} --onepass --prereqs --job articlesdump,articlesdumprecombine

Nothing too exciting here. Then the regular articles multistream:

1 max continue none /bin/bash ./worker --skipdone --exclusive --log --configfile /etc/dumps/confs/wikidump.conf.dumps --date {STARTDATE} --onepass --prereqs --job articlesmultistreamdump

And the regular articles multistream plus recombines for the big wikis:

1 max continue none /bin/bash ./worker --skipdone --exclusive --log --configfile /etc/dumps/confs/wikidump.conf.dumps:bigwikis --date {STARTDATE} --onepass --prereqs --job articlesmultistreamdump,articlesmultistreamdumprecombine
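All of these entries share the same five-field layout. Here is a minimal sketch, in Python, of parsing one such entry, just to make the format concrete; this is illustrative only, not the actual scheduler code from the dumps repo.

def parse_stage_entry(line, startdate):
    # The first four fields are single tokens; everything after them is the
    # shell command, which may itself contain spaces and semicolons.
    slots_used, numcommands, on_failure, error_notify, command = line.split(None, 4)
    # Stage files use a {STARTDATE} placeholder for the run date.
    return {
        "slots_used": slots_used,
        "numcommands": numcommands,
        "on_failure": on_failure,
        "error_notify": error_notify,
        "command": command.replace("{STARTDATE}", startdate),
    }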

Articles plus the meta pages:

1 max continue none /bin/bash ./worker --skipdone --exclusive --log --configfile /etc/dumps/confs/wikidump.conf.dumps --date {STARTDATE} --onepass --prereqs --job metacurrentdump

And articles plus meta pages, with recombines, for the big wikis:

6 max continue none /bin/bash ./worker --skipdone --exclusive --log --configfile /etc/dumps/confs/wikidump.conf.dumps:bigwikis --date {STARTDATE} --onepass --prereqs --job metacurrentdump,metacurrentdumprecombine

More boring entries. Finally, all remaining jobs except for the history revisions:

1 max continue none /bin/bash ./worker --skipdone --exclusive --log --configfile /etc/dumps/confs/wikidump.conf.dumps --date {STARTDATE} --onepass --prereqs --skipjobs metahistorybz2dump,metahistorybz2dumprecombine,metahistory7zdump,metahistory7zdumprecombine,xmlflowhistorydump

6 max continue none /bin/bash ./worker --skipdone --exclusive --log --configfile /etc/dumps/confs/wikidump.conf.dumps:bigwikis --date {STARTDATE} --onepass --prereqs --skipjobs metahistorybz2dump,metahistorybz2dumprecombine,metahistory7zdump,metahistory7zdumprecombine,xmlflowhistorydump

Note how we specify certain jobs to be skipped. They are thus never marked to run, and when the rest of the jobs are complete, the entire dump is marked as complete.
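The effect of --skipjobs is easy to picture; here is a toy illustration with an abbreviated job list (the real list is much longer):

# Jobs named in --skipjobs are never marked to run, so once everything
# else finishes, the run as a whole can be marked complete.
all_jobs = ["xmlstubsdump", "tables", "articlesdump", "metacurrentdump",
            "metahistorybz2dump", "metahistory7zdump", "xmlflowhistorydump"]
skipjobs = {"metahistorybz2dump", "metahistory7zdump", "xmlflowhistorydump"}
marked_to_run = [job for job in all_jobs if job not in skipjobs]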

Configuration
I’ve thought about replacing these standard python config format files with yaml or json. And that’s as far as it’s gotten: thinking about it.

The production configuration settings are in /etc/dumps/confs/wikidump.conf.dumps, so let’s look at some of these settings.

[wiki]
dblist=/srv/mediawiki/dblists/all.dblist
privatelist=/srv/mediawiki/dblists/private.dblist
closedlist=/srv/mediawiki/dblists/closed.dblist
skipdblist=/etc/dumps/dblists/skip.dblist
flowlist=/srv/mediawiki/dblists/flow.dblist

All of the lists of wiki databases of various sorts go here. Some are maintained by us (skipdblist) but most are not. skipdblist is used for dump runs that do “all the regular wikis” but not the big or huge wikis (no wikis that need multiple processes running at once); there’s a sketch of how these lists combine after the [output] section below.

dir=/srv/mediawiki

Where the mediawiki repo is.

adminsettings=private/PrivateSettings.php

I think we used to parse this directly. Not any more! I should just remove the adminsettings entry.

tablejobs=/etc/dumps/confs/table_jobs.yaml

These are the tables we will dump via mysqldump. Need a new table? Just add it to the list.

multiversion=/srv/mediawiki/multiversion

If you have a wikifarm with our sort of setup, this is the path to the location of MWScript.php.

[output]
public=/mnt/dumpsdata/xmldatadumps/public
private=/mnt/dumpsdata/xmldatadumps/private
temp=/mnt/dumpsdata/xmldatadumps/temp
templatedir=/etc/dumps/templs

Those are python (NOT PUPPET!) templates for pieces of html files.

index=backup-index.html
webroot=http://download.wikimedia.org
fileperms=0o644
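As promised, here is a rough sketch of how the dblists combine to pick the wikis a run works through, assuming the usual one-database-name-per-line dblist format. The real selection logic lives in the dumps scripts (and privatelist/closedlist get their own handling there); this is illustrative only.

def read_dblist(path):
    # dblist files are plain text, one wiki database name per line
    with open(path) as dblist_file:
        return {line.strip() for line in dblist_file if line.strip()}

all_wikis = read_dblist("/srv/mediawiki/dblists/all.dblist")
skip_wikis = read_dblist("/etc/dumps/dblists/skip.dblist")

# a "regular wikis" run works through everything not in skipdblist; for
# the production skip.dblist, that leaves out the big and huge wikis,
# which get their own runs with the bigwikis (or en/wd) settings
regular_wikis = all_wikis - skip_wikis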

[reporting]
adminmail=ops-dumps@wikimedia.org
mailfrom=root@wikimedia.org
smtpserver=localhost
staleage=900
skipprivatetables=1

adminmail is who gets emails on dump failure. Who wants to be on this alias with me? staleage is for a job that cleans up stale locks every so often, in case a process died or was shot, leaving its lock files around; 15 minutes is long enough to decide a lock is expired, right? skipprivatetables is deprecated; we don’t dump private tables ever.

[database]
max_allowed_packet=32M

A mysql/mariadb setting. There’s no easy way to keep it in sync with the mariadb config; we have to do it manually :-(

[tools]
php=/usr/bin/php7.2
mysql=/usr/bin/mysql
mysqldump=/usr/bin/mysqldump
gzip=/bin/gzip
bzip2=/bin/bzip2
sevenzip=/usr/bin/7za
lbzip2=/usr/bin/lbzip2
checkforbz2footer=/usr/local/bin/checkforbz2footer
writeuptopageid=/usr/local/bin/writeuptopageid
recompressxml=/usr/local/bin/recompressxml

Full paths to everything. For php, this lets us specify a different php for different dump groups if we want. writeuptopageid and recompressxml are part of the collection of C utils for working with MW xml dump files.

[cleanup]
keep=10

This used to be useful, and third parties might still use it (if any); we now clean up via a separate cron job.

[chunks]
chunksEnabled=0
retryWait=30

Are we writing page content files in page ranges? Honestly the chunk name is awful and will be changed $someday.

[otherformats]
multistream=1
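Putting the [tools] paths above together: conceptually, a table job ends up running something like the line below. This is a hand-wavy sketch; the real command gets its connection options from the MediaWiki settings, and the exact output filename may differ.

/usr/bin/mysqldump <connection options> enwiki site_stats | /bin/gzip > enwiki-20201001-site_stats.sql.gz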

[misc]
sevenzipprefetch=1
maxRetries=3

[stubs]
minpages=1
maxrevs=100000

We write stubs to a flat file and then read it and pass it to gzip, which is gross. We want around maxrevs revisions in each such temp file, to be nice to the db servers.

[bigwikis]
checkpointTime=720
chunksEnabled=1
chunksForAbstract=6
chunksForPagelogs=6
dblist=/etc/dumps/dblists/bigwikis.dblist
fixeddumporder=1
keep=8
lbzip2forhistory=1
lbzip2threads=3
recombineHistory=0
revsMargin=100
revsPerJob=1500000
skipdblist=/etc/dumps/dblists/skipnone.dblist

These are the generic settings for the big wikis: all wikis that require multiple processes to run, except enwiki and wikidatawiki (which have their own config sections), use them. The ‘6’ you see in places is how many output files, and therefore how many processes, are used for various jobs. The dblist file is different so that running through all of the big wikis, until there are none left to do, means running through just that list. lbzip2 uses multiple cores, but we don’t want to use 6 of them because there will be other (input) processes running too, so we use 3 as a good compromise. We used to produce giant bz2 files with all the history in them; now we don’t. If you want full history, download a bunch of smaller files. …

[arwiki]
pagesPerChunkHistory=340838,864900,1276577,1562792,2015625,1772989

Specific settings for arwiki.

[commonswiki]
pagesPerChunkHistory=10087570,13102946,15429735,17544212,19379466,18705774

Specific settings for commonswiki. Per-wiki settings like these are picked up automatically by the config file reader. For “big wikis” such as these, all of the special settings are in the “bigwikis” section except for the number of pages in each output file, which obviously varies per wiki.
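To see what “picked up automatically” could look like, here is a minimal sketch of per-wiki and per-group setting lookup, assuming stock Python configparser semantics; the dumps code has its own config reader, so this is illustrative only.

import configparser

conf = configparser.ConfigParser()
conf.read("/etc/dumps/confs/wikidump.conf.dumps")

def get_setting(name, wiki=None, group=None, fallback_section="wiki"):
    # most specific section wins: the wiki itself, then a group such as
    # "bigwikis", then a generic section like [wiki]
    for section in (wiki, group, fallback_section):
        if section and conf.has_section(section) and conf.has_option(section, name):
            return conf.get(section, name)
    return None

value = get_setting("pagesPerChunkHistory", wiki="commonswiki", group="bigwikis")
if value is not None:
    # comma-separated page counts, one per output file ("chunk")
    pages_per_chunk = [int(count) for count in value.split(",")]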

The rest of the config file has more per-wiki settings, so we won’t look at those here.

There may be other settings added from time to time; check the docs and the puppet manifests for details!