SQL/XML Dumps/Daily life with the dumps

Daily life with the (sql/xml) dumps
This presentation was given on August 6, 2020.

What we try to do via current architecture (xml/sql dumps)

 * minimize the impact of memory leaks from php or specific mediawiki scripts
 * catch all exceptions, retry everything
 * make all jobs tiny so that restarts of anything broken is cheap and fast (still in progress)
 * make sure no job can hang; either it fails or it runs, nothing in between
 * make restarts automatic: everything retries after various intervals unless stuff is so broken a human must intervene, or it's late in the run and no more restarts of  the main scheduler will happen
 * keep load on the dbs as light as possible
 * analytics wants their stubs near 1st of the month so do that
 * keep tables and metadata dumped "near" each other to minimize inconsistency issues
 * keep bandwidth use on main nfs server minimal so reads/writes for dumps generation are priority
 * rsync files out to public servers with minimal delay (couple hours? 1 hour?)
 * provide lots of ways for downloaders to check status of any run and its outputs
 * complete all full dump runs by end of day 19th of the month, all partial runs by end of month
 * provide small files for downloaders that want them but also large recombined files
 * multiple workers running at once, so all the locking stuff that goes with that

Daily check that everything's ok

 * Hey, are the dumps running? How do they look?
 * Check the web page for sql/xml dumps: dumps.wikimedia.org/backup-index.html
 * The top level page dumps.wikimedia.org has a list of links to various types of dumps, you can use it to get there if you forget.
 * Are the dumps idle?
 * If so, you will see a message "Dump process is idle" at the top.
 * Is the date near the end of the month (partial run) or near the 20th (full run)?
 * If not, there may be a problem; look for dumps marked failed or aborted.
 * If the dumps are running, is the output current?
 * Check the timestamps of the top line of a few wiki entries; are they within 1-2 hours of the current time?
 * Note that these timestamps are UTC.
 * If some timestamps are several hours old, there may be problems with the dump run.
 * Check progress specifically of three big wikis: enwiki, commonswiki, wikidatawiki
 * Typically enwiki and commonswiki complete within the given time even if there are reruns of jobs
 * Typically wikidatawiki needs manual intervention to complete the full run before the 20th of the month. (This is being fixed.)
 * Check ops-dumps@wikimedia.org email alias
 * Any mail from the exception checker ? This reports exceptions logged from the last day. There should be none.
 * Any mail from the job watcher ? This reports runs with current locks but old output, possibly indicating hung or aborted jobs.
 * Check phabricator Dumps Generation project
 * Any new bug reports? They may not have my name on them, so checking the workboard is the only way to be aware of new reports
 * https://phabricator.wikimedia.org/tag/dumps-generation/
 * Check xmldatadumps-l mailing list, sometimes (rarely) people report issues there; also check for held messages
 * https://lists.wikimedia.org/postorius/lists/xmldatadumps-l.lists.wikimedia.org/

Typical things you may want to find out
On which hosts do enwiki or wikidatawiki dumps run? How about the other wikis?
 * Wikitech is usually up to date: https://wikitech.wikimedia.org/wiki/Dumps/Snapshot_hosts#Current_setup
 * If you prefer checking the deployed code, our puppet manifests have the info:
 * https://github.com/wikimedia/puppet/blob/production/hieradata/hosts/snapshot1006.yaml and
 * https://github.com/wikimedia/puppet/blob/production/hieradata/hosts/snapshot1009.yaml (but these locations could change)
 * If you prefer checking the hosts themselves:
 * Run from laptop if you have root on the snapshot hosts: https://gist.github.com/apergos/ffd4408c05a72a46f2bcd5897e27c7e3
 * Run from  if you have sudo there:
 * Note that this is noisier than the gist, and requires you to remember the name of the crontab and the field number. meh.
 * I keep wikidatawiki on 6, misc (non-xml/sql dumps) on 8, the rest may move around
 * I keep wikidatawiki on 6, misc (non-xml/sql dumps) on 8, the rest may move around

What dumps are currently running on what hosts?
 * If you want to see all jobs running on all hosts:
 * https://github.com/apergos/misc-wmf-crap/blob/master/whats_running.sh
 * If you want to see specific wikis currently running:
 * Get on the primary dumps nsf server, and  to   and
 * Alternatively, get on any snapshot host and  to   and
 * There should be a lock file lock_YYYYMMDD and its contents include the process id of the dumper and the snapshot hostname

Which dumps nfs server is primary for xml/sql dumps?
 * Wikitech is usually up to date: https://wikitech.wikimedia.org/wiki/Dumps/Dumpsdata_hosts#Hardware
 * If you prefer checking in production, ssh to any snapshot host running xml/sql dumps, and do
 * I keep  as the primary and   as the nfs store for misc (non-xml/sql) dumps.

Checking on a specific exception:

 * Dumps logs live in  on primary dumpsdata host
 * The exceptions get written out with "error from command" and the specific mediawiki command that failed
 * Generally it's a matter of rerunning the MediaWiki command by hand from the testbed snapshot host as the dumpsgen user
 * You usually want to try to find the page(s) or revision(s) that are a problem with the current code
 * Tools for narrowing down the stubs fed to page content generation, if needed:
 * on snapshot hosts
 * You can also use this to whittle down prefetch files used for page content
 * Most commands for dumping metadata or page content take a page range
 * is your friend
 * If the command you run generates a list of other python commands, you will need to take these and pass them  as well to see how the underlying MediaWiki maintenance script is called with all its args

Examples of things that have broken in the past:

 * exceptions thrown elsewhere in MediaWiki (RevisionRecord etc) that WikiExporter didn't catch
 * wrong content dumped for a page (for several years!)
 * messed up autoload etc, some class not found
 * hung when the prefetcher xmlreader died leaving all other processes waiting for i/o
 * wrong xml output (order of elements changed or new attribute added by accident)
 * slow because content loaded for all revisions for metadata (stubs) dumps
 * slow because query is suddenly slow and we need to FORCE some index
 * slow because prefetches are mostly failing or being skipped
 * rsyncs took forever because the load on the dumpsdata primary host got large because multiple copies of the exception checker stacked up there
 * the "monitor" script (which really just writes out the top level index.html and cleans up stale locks) fails to run

And the all-time favorite:
 * DB server goes away -> MediaWiki throws exception -> dump job dies.

What should happen: MediaWiki dumps maintenance script should reload LB config and try again up to a max number of retries (5? with wait intervals in bet ween retries) and only then give up.

Unfortunately reloading is somewhere in the bowels rather than easily accessible in the maintenance script.

Part One end note
A lot of this stuff, and more, is documented at https://wikitech.wikimedia.org/wiki/Dumps

Congrats, you made it through the first half. And now, a break and some cute cat pictures because that’s how we roll!

https://commons.wikimedia.org/wiki/File:Sleeping_cat_on_her_back.jpg

https://commons.wikimedia.org/wiki/File:Lynx_kitten.jpg

BREAK
Ten minutes, snacks!

Part Two: adding a dumps job
Why do this? People ask for it, and it's a good way to understand all the moving parts.

Reminder:
A dump consists of a dump run per wiki.

Each dump run for a wiki consists of a number of dump jobs in a specific order.

A dump job is one mysqldump or MediaWiki maintenance script run to produce one or more related output files.

Flow of a dump job
The classes and methods for running a dump job are primarily in runner.py and supporting utilities are in runnerutils.py, both in the dumps/ subdirectory.

At the beginning of a dump run, we do some tasks: which currently just sets up the temp files where we will collect hashes of output files

At the beginning of a job, we do some tasks: which writes a dumpruninfo file in text and json formats, containing the status of all dump jobs. You can see an example of one here: https://dumps.wikimedia.org/elwiki/20200801/dumpruninfo.txt and https://dumps.wikimedia.org/elwiki/20200801/dumpruninfo.json

Then we run the job.

After a job is completed, we do some tasks: copies the files with hashes to their permanent location, adds symlinks in the wiki/latest directory pointing to the new output files, updates the RSS feed with these new files, gets rid of old symlinks and RSS feed data, and updates the dumpruninfo files.

After a dump run is completed, we do some tasks: Do one last update of the dumpruninfo file, and one final cleanup of dead ‘latest’ symlinks and old RSS feed data.

All of this is taken care of automatically when we add a new job, but it's good to know how the job fits into the existing structure.

File listers
A dump job will produce some output files… unless it is very, very broken.
 * Some of the output files should be listed for download, but perhaps not all.
 * Some of the output files should be cleaned up on retry, but perhaps not all.
 * Some partially-written files (typically with ".inprog" at the end) should be cleaned up on retry.
 * Some other job may want the output files from this job as its input.
 * This job may need to check if its output files are truncated or empty, and respond accordingly.
 * This job may need to provide a list of the output files to its build_command method.

If any of these require special treatment from the default OutputFileLister methods, we'll need tosubclass it and override where needed.

The quick way to dump new tables
If we can dump the entire table using mysqldump (no private fields, no rows we should skip), then we can just… add the table name and related information to the configuration file in puppet. Example: https://gerrit.wikimedia.org/r/c/operations/puppet/+/527505/4/modules/snapshot/files/dumps/table_jobs.yaml

What are we adding?
 * table name as it appears in the db
 * type: public (no other type is supported, this will eventually be removed. private used to be ok)
 * job name: typically the table name with underscores removed
 * description: this text will show up in the dump index.html page for this item in any dump run; we like entries that fit on one line best

The slow way to dump new tables
When we can't dump all the fields and we might need to skip some of the rows as well, we usually ask MediaWiki to do the work for us.

There may be additional privacy considerations as well, and it is good to get someone on Security who deals with MediaWiki security, data retention and privacy issues, to look at this with you.

Example: watchlists

What should we be able to publish? (Schema here: https://www.mediawiki.org/wiki/Manual:Watchlist_table) Entries in a watchlist that exist (if the page exists, there should be nothing scary in the title). That's it. So for each entry when walking through a watchlist we would write it out if. This likely means we get a list of xml entries, which is typically how we handle dump output from MediaWiki.

Here’s a strawdog xml schema: blah some title ...

Remember that these dumps are for PUBLIC consumption and never as a means of backup.

They are also support for the right to fork, though in the specific case, watchlists cannot be hooked up again to users. A user who moves to a fork and no longer has access to their account on the WMF projects might be able to reconstruct their watchlist by wading through the piles of entries, depending on the wiki size and contributor base, but I wouldn't count on it.

We would need to add some code to   and    to support a watchlist section; currently only dumps of pages, page logs and abstracts are supported.

We could look at the page logs to see how data is retrieved via a pager, if it is, and then adjust for the watchlist table.

Scale
How many entries are in all the watchlists on enwiki, wikidatawiki, commonswiki? We need to think about performance on the biggest wikis immediately and bear in mind that this number will only get bigger over time.

Remember that we will have to do a title exists check for every entry. We could cache these so that if we ask again for the same title in a short period of time we get it from cache instead of from the db, that's reasonable. A short period of time for us can be up to a minute; it could really be up until the end of the run, for consistency's sake, but how much cruft can we really shove into the WANObjectCache?

Implementing jobs
We need to override some methods in jobs.py, which are there only as stubs.
 * What is the file extension of our output?
 * What is the base name (dumpname) of our output file?
 * What kind of content is it, xml, sql, other?
 * What is a one line description of the content produced?
 * How do we construct a command to run the MediaWiki maintenance script?
 * How do we run the dump job itself, calling build_command, running the maintenance script and dealing with errors?

We can look at apijobs.py for a bare minimum example of how this works. Config for api jobs is at https://github.com/wikimedia/operations-dumps/blob/master/xmldumps-backup/default_api.yaml The class only overrides the necessary methods to get the job done and has nothing extra.

Xml jobs
Some xml jobs have so much code that they are in their own modules.

For the watchlist let's assume that we can just shove it in with the rest: https://github.com/wikimedia/operations-dumps/blob/master/xmldumps-backup/dumps/xmljobs.py

Metadata (stubs) dumps are here, abstracts (small snippets from each page) are here, and so are page logs.

Testing
Testing a change to a MediaWiki script can be done first on a local installation; I have a setup with subsets of wikidata, elwiki and tenwiki all configured differently for dumping. I also have scripts that will run on select wikis a dump with a given date and compare all output to the previously saved output, providing a list of differences if there are any.

I log all sql queries run on my laptop so when testing MW code that may involve changes to existing queries, I go spelunking in these logs in between runs as well.

'''Next it should be tested on deployment-prep. To get there''': ssh deployment-snapshot01.deployment-prep.eqiad.wmflabs

Unfortunately testing there is difficult. When everything in beta is working properly, one still must copy the changed files into  manually, verify that the newly copied files have not yet been overwritten, and then run the command to check the output. Every half an hour the code is overwritten by an automatic sync update, which means that your changes will disappear.

In deployment-prep, verifying the output is a matter of eyeballing the content.

Where files live
Configuration files of all sorts are in.

Code is in.

The main script you will want to run most of the time is worker.py. A sample run, as the dumpsgen user from that directory: python3 ./worker.py --configfile /etc/dumps/confs/wikidump.conf.labs --skipdone enwikinews

Output goes to In production, these are mounted nfs shares from a dumpsdata100x server. In deployment-prep, they are just funnily-named directories on the local instance.

Logs
If you want logging you must specify it as an arg to worker.py. This is done automatically for the dumps running out of cron in production.

Logs live in a private part of the tree which, in production, is not rsynced anywhere. In deployment-prep there are no rsyncs of anything and none of the data is truly private anyways.

If you are curious, you can look at  for the logs for a given run. You can also look at the files in  in production which aggregate everything somewhat messily.

Merge
Once testing has been done here, the change is ready for merge; we want to keep an eye on new code during the first production run just to be sure nothing was missed.

Testing a change to the python dump scripts
The same procedure applies as above, with the exception that updated python scripts can be scp-ed to the snapshot instance in deployment-prep and tested without worrying about them being overwritten by a random scap.

If there is uncertainty about performance, manual runs can be done on the  testbed host, as the dumpsgen user, with a different configuration file that writes output to a separate directory tree on the NFS share. There, the output can be reviewed at one's leisure, and scripts can be rerun as often as one likes without impacting the production run.

Deployment
MediaWiki changes are deployed via the train; Wikibase changes are deployed in a backport window or other deployment windows. Emergency deploys (dumps are broken! omg!) go out in accordance with https://wikitech.wikimedia.org/wiki/Deployments/Emergencies

Puppet changes (generally configuration updates, rsync tweeks and so on) go whenever they are fully tested and the deployer can stick around for awhile to make sure the change works as expected. For non-SRE folks there are designated puppet windows for small changes.

The python dumps scripts are deployed by scap deploy 'message goes here' run as a deployment user from   on the current deployment server. Don't forget to  to see what's currently deployed, then   to get the updates, then   again to see what's new and what you ought to put in your scap message. The same goes for deployment-prep.

Things not covered
So you want to refactor how rsync pushes out dump files to the secondary NFS server and the public-facing servers. Good Luck With That (TL;DR: it sucks).

You want to deploy a new host? It's not hard but it's beyond the scope of these notes. See Wikitech and poke me for review or help.

You want to switch host roles around so some other host is running wikidata? Also not hard as long as you're not switching while processes are currently dumping it, but still beyond the scope of these notes.

Upgrading to buster? HAHAHAHAHAHA. You need MediaWiki on buster first and that's going to be a lot of careful work. Any local caching mechanism, any php extension, could behave differently on a new version of Debian; we've seen this in the past. It's going to be a long slow slog.

Doing manual runs of wikidata page content dumps at the same as the dump run is going, so we get done by the 20th, because Wikidata Will Eat Us All Alive (for breakfast!)? A bit tedious and soon no longer necessary. See me if you're really interested.

Part Two end note
Congratulations! You’ve made it to the end of part two, and so as a reward here are your cute puppy webcams, because that, too, is how we roll.

https://www.youtube.com/watch?v=XLwYWCzlfIw

https://www.youtube.com/watch?v=5EhQ5i60BwU

Part Three: Questions Questions Questions
You know where to find me (apergos@irc, ariel _at_ wikimedia.org for email), or leave a message on a phabricator task. Hit me up!