SQL/XML Dumps/Stubs, page logs, abstracts

How (not) to generate stubs, page logs, and abstract dumps
TL;DR: It's complicated. Too complicated.

And now, here's the long version.

Among the other items we dump are metadata for each page and revision (“stubs”), the logs of what actions have been taken on pages (moves, deletions, patrolling, page creation and so on), and abstracts, which are small snippets of each page taken from the introductory section of the content.

As we do for all dumps, we try to break these into small pieces, so that if any one piece fails, we can retry it a few times before giving up completely. Giving up would mean that we have to restart the job from the beginning, and for larger wikis, even though the output files are broken into pieces, each piece may still take days to generate.
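
The retry idea can be sketched roughly like this. It is a minimal illustration under assumptions, not the dumps code itself: `run_piece_with_retries`, `max_attempts`, and the `make_flaky` helper are hypothetical names made up for this example.

```python
def run_piece_with_retries(generate_piece, piece_id, max_attempts=3):
    """Run generate_piece(piece_id), retrying on failure.

    Returns True if some attempt succeeded, False if every attempt failed
    (at which point the caller would have to give up on the whole job).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            generate_piece(piece_id)
            return True
        except Exception as exc:  # sketch only: real code would be pickier
            print("piece %s: attempt %d failed: %s" % (piece_id, attempt, exc))
    return False


def make_flaky(failures_before_success):
    """Hypothetical stand-in for a dump step that fails a few times first."""
    state = {'calls': 0}

    def generate(piece_id):
        state['calls'] += 1
        if state['calls'] <= failures_before_success:
            raise RuntimeError("simulated failure")

    return generate
```

With the default of three attempts, `run_piece_with_retries(make_flaky(2), 1)` succeeds on the third try, while a piece that keeps failing makes the function return `False`.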

Let’s see how we do this for page logs, the simplest of the cases. The basics of the code flow are the same for the other two jobs.

Generation of page logs
Let’s look at xmllogs.py.

{{CodeCommentary|type=code|lang=python|start=27|content= ​   outfiles = {'logs': {'name': outfile{{))}}
    for filetype in outfiles:
}}
{{CodeCommentary|type=comment|content= We write into a file in the dumps temp dir and tack “_tmp” onto the end of the name for good measure. }}
{{CodeCommentary|type=code|lang=python|start=29|content= ​       outfiles[filetype]['temp'] = os.path.join(
            FileUtils.wiki_tempdir(wikidb, wikiconf.temp_dir),
            os.path.basename(outfiles[filetype]['name']) + "_tmp")
}}
{{CodeCommentary|type=comment|content= Dryrun? We won’t compress at all. Because we won’t be actually running anything, heh. }}
{{CodeCommentary|type=code|lang=python|start=32|content= ​       if dryrun:
            outfiles[filetype]['compr'] = [None, outfiles[filetype]['name']]
        else:
}}
{{CodeCommentary|type=comment|content= Yes we could be gzipping a file which ends in .bz2 or .txt, but that’s the caller’s problem for using a crap output file name :-P }}
{{CodeCommentary|type=code|lang=python|start=35|content= ​           outfiles[filetype]['compr'] = [gzippit_append, outfiles[filetype]['name']]
}}
{{CodeCommentary|type=comment|content= This gets the path to the maintenance script to run, with the path to MWScript.php prepended if needed (this is determined by the dumps config file). }}
{{CodeCommentary|type=code|lang=python|start=37|content= ​   script_command = MultiVersion.mw_script_as_array(wikiconf, "dumpBackup.php") }}
{{CodeCommentary|type=comment|content= We set up (part of) the command to be run. The “--logs” says to dump page logs rather than some other type of thing. }}
{{CodeCommentary|type=code|lang=python|start=38|content= ​   command = [wikiconf.php] + script_command

    command.extend
}}
{{CodeCommentary|type=comment|content= We write an xml header at the beginning of the file, with no other output. The header is gzipped all by itself. We write an xml footer at the end of the file, with no other output. It too is gzipped all by itself. We don’t want headers or footers in the middle, which contains all of the data. That too is a separate gzipped stream. When all of these are concatenated together, that is a valid gzip object which gzip tools can decompress with no special parameters required. }}
{{CodeCommentary|type=code|lang=python|start=45|content= do_xml_stream(wikidb, outfiles, command, wikiconf,
              start, end, dryrun, 'log_id', 'logging',
              50000, 100000, ' \n', verbose=verbose, header=True)
do_xml_stream(wikidb, outfiles, command, wikiconf,
              start, end, dryrun, 'log_id', 'logging',
              50000, 100000, ' \n', verbose=verbose)
do_xml_stream(wikidb, outfiles, command, wikiconf,
              start, end, dryrun, 'log_id', 'logging',
              50000, 100000, ' \n', verbose=verbose, footer=True)
}}
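
If the bit about separately gzipped header, body and footer sounds too good to be true, here is a small standalone check using only the Python standard library. It is a sketch, not the dumps code: `append_gzip_stream`, `roundtrip` and the file name are hypothetical, and the real job feeds data through `gzippit_append` rather than through anything shown here.

```python
import gzip
import os
import tempfile


def append_gzip_stream(path, data):
    """Compress data as its own complete gzip stream and append it to path."""
    with open(path, 'ab') as outfile:
        outfile.write(gzip.compress(data))


def roundtrip(header, body, footer):
    """Write three separate gzip streams to one file, then read it all back."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, 'pagelogs.xml.gz')  # made-up name
        for chunk in (header, body, footer):
            append_gzip_stream(path, chunk)
        # No special options: the reader just walks through all the streams.
        with gzip.open(path, 'rb') as infile:
            return infile.read()
```

Calling `roundtrip(b'<a>\n', b'<b/>\n', b'</a>\n')` gives back the three chunks joined together, which is the property the header/body/footer split relies on.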

Pretty simple, right? But now we get to look at xmlstreams.py which is where do_xml_stream is defined. Here we go!

Yes, it is gross, I admit it
For the last bit of gross, note that xmllogs.py is called as a script from the xmljobs.py module. So we have:

worker.py -> potentially multiple copies of xmllogs.py -> multiple runs of dumpBackup.php for each copy, where “->” is the “why yes we are forking out” operator :-P
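
For the curious, the shape of that chain can be imitated with a toy made of nothing but Python processes. Everything here is an assumption for illustration: `run_copies`, `CHILD_CODE` and `GRANDCHILD_CODE` are invented names, the children are tiny Python one-liners standing in for the real scripts, and the copies run one after another to keep the output predictable, where the real ones may run side by side.

```python
import subprocess
import sys

# Innermost level: stands in for one run of the maintenance script.
GRANDCHILD_CODE = 'import sys; print("run for " + sys.argv[1])'

# Middle level: stands in for one copy of the per-job script, which
# itself forks out once per run.
CHILD_CODE = (
    'import subprocess, sys\n'
    'for run in range(int(sys.argv[2])):\n'
    '    subprocess.run([sys.executable, "-c", sys.argv[1],\n'
    '                    "copy %s run %d" % (sys.argv[3], run)], check=True)\n'
)


def run_copies(copies, runs_per_copy):
    """Top level: fork one child per copy and collect everything printed."""
    lines = []
    for copy in range(copies):
        result = subprocess.run(
            [sys.executable, '-c', CHILD_CODE, GRANDCHILD_CODE,
             str(runs_per_copy), str(copy)],
            stdout=subprocess.PIPE, check=True, universal_newlines=True)
        lines.extend(result.stdout.splitlines())
    return lines
```

Each grandchild inherits its parent's stdout, so the top level sees one line per innermost run, for example `run_copies(2, 2)` collects four lines.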