SQL/XML Dumps/Writing maintenance scripts

Guidelines for writing maintenance scripts to add to sql/xml dumps

If you are writing a maintenance script in order to add a dump job to the sql/xml dumps, please keep in mind the following:

Reruns/Retries

If the dump job may take a long time (more than an hour for the biggest wiki), it should be broken down into small pieces that can each be re-run independently in case of a failure. These re-runs can be handled inside the script itself, with the maximum number of retries for a given piece passed in as an argument. This may necessitate storing the output from a piece in some temporary location until the piece completes and the output is verified, and only then appending the output to the proper output file.
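
A minimal sketch of that pattern follows. The method and property names here (dumpPiece(), verifyOutput(), appendToOutputFile(), $this->tempDir) are hypothetical and only illustrate the shape of the retry loop, not an existing API:

	/**
	 * Illustrative sketch only: run one piece of the dump, writing to a
	 * temporary file first and appending to the real output only after
	 * the piece has been verified.
	 */
	protected function runPieceWithRetries( $pieceId, $maxRetries ) {
		for ( $attempt = 1; $attempt <= $maxRetries; $attempt++ ) {
			$tempPath = $this->tempDir . "/piece-$pieceId.tmp";
			try {
				// write this piece's output to a temporary file
				$this->dumpPiece( $pieceId, $tempPath );
				if ( $this->verifyOutput( $tempPath ) ) {
					// only verified output gets appended to the real file
					$this->appendToOutputFile( $tempPath );
					unlink( $tempPath );
					return true;
				}
			} catch ( Exception $e ) {
				fwrite( STDERR, "piece $pieceId failed on attempt $attempt: " .
					$e->getMessage() . "\n" );
			}
		}
		return false; // the caller decides whether to give up on the whole job
	}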

Bear in mind that dumps jobs only get slower as wikis get larger, so if your job is on the borderline today, next year it will be too slow :-)

If you break your job up internally into batches based on the number of rows in some table, or something similar, make the batch size a command line argument. This will facilitate testing in development environments, where the test db may have much smaller tables.
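
For example, assuming your script extends the standard Maintenance base class, the batch size might be wired up roughly like this (the option name --batch-size and the default of 1000 are only illustrative):

	public function __construct() {
		parent::__construct();
		$this->addDescription( 'Example dump maintenance script' );
		// Batch size as an option, so small test databases can use small batches
		$this->addOption( 'batch-size', 'Number of rows to process per batch',
			false, true );
	}

	public function execute() {
		$batchSize = intval( $this->getOption( 'batch-size', 1000 ) );
		// ... process rows $batchSize at a time ...
	}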

Error handling

Exceptions from elsewhere in MediaWiki should be caught and the dump code block retried where that makes sense. The script should only throw an error and die if the maximum number of retries has been reached for some part of the job, or if there is some uncorrectable error (bad arguments, a utility specified in the configuration can't be found, etc.).

Errors should be reported to stderr; this is done with the usual throw new MWException( __METHOD__ . ' my nice message here' ). If you are catching an exception thrown elsewhere, please include its details via "my message (" . $e->getMessage() . ")", where $e is the caught exception.

If you are working with a particular page or revision, please include the page or revision id in any exception you throw; it will help us to debug any issues later.
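
Putting the above together, a catch block might look roughly like the following; getRevisionText() is a hypothetical helper standing in for whatever call can throw:

	try {
		$text = $this->getRevisionText( $revId ); // hypothetical helper
	} catch ( Exception $e ) {
		// Include the revision id and the original message, so the error
		// that lands in logstash is enough to start debugging.
		throw new MWException( __METHOD__ .
			" failed to get text for revision $revId (" . $e->getMessage() . ")" );
	}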

Errors emitted to stderr in this fashion will automatically end up in logstash.

Progress messages, console output

Any job that takes more than just a few minutes should emit regular progress messages, which should include the raw number of rows, pages, or whatever entity is being processed, out of the total number, and optionally an ETA to completion based on those numbers. This is typically done by introducing a function like the following:

	protected function progress( $string ) {
		if ( $this->reporting ) {
			fwrite( $this->stderr, $string . "\n" );
		}
	}

in your maintenance script's main class.
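
A caller might then report progress along these lines; the counters, the reporting interval, and the ETA arithmetic are illustrative only:

	// $startTime, $done and $total are assumed to be maintained by the caller
	$done++;
	if ( $done % 1000 == 0 ) {
		$elapsed = microtime( true ) - $startTime;
		$rate = $done / $elapsed;                  // entities per second
		$etaSeconds = ( $total - $done ) / $rate;  // rough time remaining
		$this->progress( "processed $done of $total pages, ETA " .
			round( $etaSeconds ) . " seconds" );
	}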

Output generated by your script will be grabbed by our dump infrastructure and written out to a little HTML snippet, which is then made available in the HTML page for that wiki, date, and dump run. Thus users and dumps maintainers can check the progress of a dump just by looking at the web page.

Any other output that your script produces should either be suppressible by a --quiet flag, or (preferred) be produced only when a --verbose flag is passed.
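
One simple way to do the latter, assuming --verbose has been registered with addOption() in the constructor:

	private function verboseMessage( $msg ) {
		// Only chatter when the operator explicitly asked for it
		if ( $this->hasOption( 'verbose' ) ) {
			$this->output( $msg . "\n" );
		}
	}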

Output files

Output files should be compressed, preferably with gzip. If files can be very large (> 2 GB with gz compression), consider using bzip2 instead.

The path of any output file should be specified via a command line argument. Dump output files have particular locations and naming conventions, handled by the dump infrastructure, and your script should not be concerned with this.
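
A rough sketch of writing gzip-compressed output to a path taken from the command line, using PHP's standard zlib functions; the option name 'output' and the generateLines() generator are hypothetical:

	// The output path comes from the command line; the dumps infrastructure
	// decides the real location and naming convention.
	$outPath = $this->getOption( 'output' );
	$outFile = gzopen( $outPath, 'wb' );
	if ( $outFile === false ) {
		throw new MWException( __METHOD__ . " could not open $outPath for writing" );
	}
	foreach ( $this->generateLines() as $line ) { // hypothetical generator
		gzwrite( $outFile, $line . "\n" );
	}
	gzclose( $outFile );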

Intermediate files should be written into a temporary directory which, once again, is managed by the dump infrastructure. If your script produces such files, you should provide a command line argument for their location.

Invoking other commands

Any executables, including php itself, if called from within your script, must be passed in as command line arguments. The path to PHP may vary by version or for other reasons across wikis or clusters, so it must not be hardcoded into the script. The same goes for any other utilities you may need, though you should keep these to a minimum. The paths to some utilities are already available in the dumps infrastructure configuration; if the utility you want to invoke has not been used before, it will have to be added. See the "tools" section of the config file template.
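
As an illustration, a script might run a configured bzip2 binary to test an output file's integrity via MediaWiki's Shell class; the 'bzip2-path' option name is made up for this example:

	// Requires: use MediaWiki\Shell\Shell; at the top of the file.
	// The path to the utility comes from the command line or configuration,
	// never hardcoded.
	$bzip2 = $this->getOption( 'bzip2-path' );
	$result = Shell::command( $bzip2, '-t', $outPath )->execute();
	if ( $result->getExitCode() !== 0 ) {
		throw new MWException( __METHOD__ . ': integrity check failed: ' .
			$result->getStderr() );
	}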

Resource use

Database connections

You should only connect to database servers in the "dump" db group, reserved for slow dumps-related queries; see (TO BE ADDED) for more about database groups. The following snippet should get you a connection to an appropriate replica db server:

$lb = wfGetLBFactory()->newMainLB();
$db = $lb->getConnection( DB_REPLICA, 'dump' );

Dumps maintenance scripts should never write to the database.

Database queries

Any query more complex than a simple SELECT should have sign-off from a DBA. You can do some performance checking in advance by doing an EXPLAIN on the query, although this is not foolproof. The surest method of determining what the MariaDB query optimizer will do is to run SHOW EXPLAIN against the query as it executes on a large wiki, but this should not be done without coordination with the DBAs, and probably only on a database in the inactive datacenter, in the dump group. Once you know what the query will do, you can attempt to eliminate filesorts, temp tables, and so on. For more on optimizing your queries, see Database_optimization.
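
For instance, a batched read of the page table over the connection obtained above might look like the following sketch ($batchSize comes from the command line, as discussed earlier); paging by page_id keeps each query a simple indexed SELECT:

	$lastId = 0;
	do {
		$res = $db->select(
			'page',
			[ 'page_id', 'page_namespace', 'page_title' ],
			[ 'page_id > ' . intval( $lastId ) ],
			__METHOD__,
			[ 'ORDER BY' => 'page_id', 'LIMIT' => $batchSize ]
		);
		foreach ( $res as $row ) {
			$lastId = intval( $row->page_id );
			// ... write the row to the output file ...
		}
	} while ( $res->numRows() === $batchSize );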

Memory and CPU use

Dumps maintenance scripts run on hosts with many other similar processes running at the same time. Avoid memory leaks by keeping your jobs short; don't use more than one CPU; in general, be nice to the server. All of the processes your script may run simultaneously, including decompression of input files or compression of output streams, must use no more than 2GB of memory as a hard limit; please try to keep usage well below this. Make sure that any file reads, and any utilities you use that process files, do not read the entire file into memory unless you know that the file will always be tiny.
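
For example, compressed input should be streamed rather than slurped; PHP's zlib functions make this straightforward ($inputPath is assumed to be a gzipped file):

	// Stream the file line by line; avoid gzfile() or file_get_contents(),
	// which would pull the whole (decompressed) file into memory.
	$fh = gzopen( $inputPath, 'rb' );
	while ( ( $line = gzgets( $fh ) ) !== false ) {
		// ... handle one line at a time ...
	}
	gzclose( $fh );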

Content reuse for speed

Dumps maintenance scripts that retrieve revision content must use a "prefetch" mechanism, retrieving the content of revisions from the previous dump of the wiki if those revisions were included in it. We do this to avoid overloading the database servers, and because, even with the decompression of the previous dump, it's much faster.
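
Conceptually, the flow is something like the following; $prefetch and the two methods shown are hypothetical stand-ins for the real prefetch reader and text fetcher:

	// Conceptual only: try the previous dump first, fall back to the database.
	$text = $prefetch->getRevisionText( $pageId, $revId ); // hypothetical prefetch reader
	if ( $text === null ) {
		// Not in the previous dump (new or previously skipped revision):
		// only now do we pay for a database/external store fetch.
		$text = $this->fetchTextFromStore( $revId );        // hypothetical helper
	}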

Privacy concerns

Before dumping a new dataset for public download, make sure that all fields are either already public, or that you have gotten the appropriate privacy review.

If you are dumping records that may be wholly or partly private, do not duplicate in your script the logic that determines whether a column or row is viewable by the public. Instead, invoke the appropriate MediaWiki classes and methods that already do this. This avoids data leaks when your code gets out of sync with MediaWiki core or extension code.
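
For example, for revision text you would lean on RevisionRecord's audience checks rather than inspecting rev_deleted bits yourself (sketch only; $revRecord is assumed to be a MediaWiki\Revision\RevisionRecord):

	// Requires: use MediaWiki\Revision\RevisionRecord; at the top of the file.
	// Let core decide whether the text may be shown to the public.
	if ( $revRecord->audienceCan(
		RevisionRecord::DELETED_TEXT, RevisionRecord::FOR_PUBLIC )
	) {
		// safe to include the text in the public dump
	} else {
		// omit the content or emit a placeholder
	}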

General PHP style

Dumps maintenance scripts should be written in accordance with the MediaWiki PHP style guidelines. In particular, this includes inline documentation, which is relied on extensively to generate the MediaWiki PHP documentation pages.

Unit/integration tests

Any new code should be accompanied by appropriate tests. A test db is available for maintenance script unit tests, with minimal content. If it does not contain the specific content you need, you'll have to add it. (Note: add link to example data and how ingested here!)
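
A test might be sketched roughly as follows, using MediaWikiIntegrationTestCase and addDBDataOnce() to insert the rows the script needs; the class, table, and column names are hypothetical:

	class DumpMyTableTest extends MediaWikiIntegrationTestCase {

		public function addDBDataOnce() {
			// The shared test db starts out nearly empty, so insert the
			// minimal rows the dump script needs.
			$this->db->insert(
				'my_table',                              // hypothetical table
				[ 'mt_id' => 1, 'mt_value' => 'example' ],
				__METHOD__
			);
		}

		public function testDumpProducesExpectedOutput() {
			// ... run the maintenance script and assert on its output file ...
		}
	}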