Toolserver:Job scheduling


 * For the newtask command on Solaris, see batch project (outdated)

Job scheduling is the primary method by which tools should be started on the Toolserver. Jobs (i.e., tools) are submitted to the scheduler, which then starts the job on an appropriate host, based on factors like current load. Using batch scheduling means you don't need to worry about where to start a job, or whether the job should be started during off-peak hours, etc. Job scheduling can be used for any sort of tool, whether it's a one-off job, a tool like a bot which needs to run permanently, or a regular job run from cron.

While it's possible to run jobs on a server directly, without using job scheduling, this is strongly discouraged, since it makes it harder for the Toolserver administrators to manage server resources and load.

Job scheduling works using queues. When jobs are submitted, they are placed in a queue. When there are sufficient free system resources to execute a job, it is removed from the queue and starts running. If the system is busy, there might be no free resources, and jobs will be queued until more resources become available. At present, it's very unlikely that jobs will be queued in this way, since we have plenty of free resources.

Note: the cron-related examples on this page use cronie, not crontab. See cron for more information.

Quick start
You should read the rest of this page to become familiar with the job scheduler, but here are some examples to get you started.

Converting an existing cron job to use the scheduler
Example: You have a tool, mytool.py, which runs from cronie at 0300 UTC every day:

0 3 * * * $HOME/mytool.py

To run this tool under the job scheduler, change it to:

0 3 * * * cronsub -s mytool $HOME/mytool.py

"mytool" is the name cronsub will give to the job. The output from the tool will be written to $HOME/mytool.out.

Converting a Phoenix tool to use the scheduler
Example: You have a tool, mytool.py, which runs under Phoenix so it's automatically (re)started:

*/10 * * * * phoenix $HOME/phoenix-mytool $HOME/mytool.py

To run this tool under the job scheduler, change it to:

*/10 * * * * cronsub -sl mytool $HOME/mytool.py

The output from the tool will be written to $HOME/mytool.out.

Receiving mail when the job starts or finishes
To receive mail when the job finishes, add this line to the script:

#! /usr/bin/python
#$ -m ae

To receive mail when it starts as well:

#$ -m bae

Submitting jobs
To submit a job, use the qsub command:

$ qsub -l h_rt=0:30:00 -j y -o $HOME/mytool.out $HOME/mytool.py
Your job 80570 ("mytool.py") has been submitted

The scheduler will place the job in the default queue, and eventually (probably immediately) run it on a suitable host. Once the job has finished, it will be removed from the system.

-l h_rt=0:30:00 specifies the runtime limit of the job (hh:mm:ss), in this case 30 minutes. To avoid broken tools using too many resources, you should set this to the expected maximum runtime; if the job runs any longer, it will be killed. If you don't specify a limit, the default is 6 hours.

-j y places the normal output (stdout) and error output (stderr) of the job in the same file, and -o $HOME/mytool.out specifies the location of that file. (By default, stdout and stderr go to separate files, which is usually not what you want.)
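If you submit jobs from a Python script rather than typing the command by hand, the same options can be assembled programmatically. The following is a minimal sketch and not part of the Toolserver tooling: the helper names format_h_rt and build_qsub_command are hypothetical, and the code only builds the argument list shown above without running qsub.

```python
import os


def format_h_rt(seconds):
    """Format a runtime limit in seconds as the h:mm:ss string used by -l h_rt."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return "%d:%02d:%02d" % (hours, minutes, secs)


def build_qsub_command(script, runtime_seconds, output_file):
    """Return the qsub argument list for a job with merged stdout/stderr."""
    return [
        "qsub",
        "-l", "h_rt=" + format_h_rt(runtime_seconds),  # runtime limit
        "-j", "y",                                      # merge stderr into stdout
        "-o", output_file,                              # location of the output file
        script,
    ]


home = os.path.expanduser("~")
command = build_qsub_command(home + "/mytool.py", 30 * 60, home + "/mytool.out")
print(" ".join(command))
```

To actually submit the job, the resulting list could be passed to subprocess.call on a host where qsub is available.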

Rather than specifying arguments to qsub every time the job is run, you can instead put them in the script itself, using special directives starting with #$:

#! /usr/bin/python
#$ -l h_rt=0:30:00
#$ -j y
#$ -o $HOME/mytool.out
... rest of script ...

If you want to receive mail when a job finishes, use -m e. To receive mail when a job starts and when it finishes, use -m be.

If you want a warning before your job is killed, specify s_rt with a value lower than h_rt, for example:

$ qsub -l h_rt=1:00:00 -l s_rt=0:55:00 slowjob.py

This will send a SIGUSR1 signal after 55 minutes, which you can catch to perform cleanup before the job ends. After 1 hour, SIGKILL will be sent.
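One way a Python tool might react to the SIGUSR1 warning is to set a flag in the signal handler and check it between work items, so the job can stop cleanly before SIGKILL arrives. This is only a sketch under that assumption: the names handle_soft_limit, run and process are hypothetical, and a real tool would put its own cleanup where the loop exits.

```python
import os
import signal

# Set to True by the handler once the soft runtime limit (s_rt) is reached.
stop_requested = False


def handle_soft_limit(signum, frame):
    """Record that the scheduler has warned us; do the real work elsewhere."""
    global stop_requested
    stop_requested = True


# Install the handler for the warning signal described above.
signal.signal(signal.SIGUSR1, handle_soft_limit)


def run(work_items, process):
    """Process items in order, stopping early once a warning was received.

    Returns the list of items that were fully processed.
    """
    done = []
    for item in work_items:
        if stop_requested:
            break  # leave the loop so cleanup can run before SIGKILL
        process(item)
        done.append(item)
    return done
```

Keeping the handler this small avoids doing complex work inside a signal handler; the main loop decides when it is safe to stop.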

Submitting jobs from cronie
While it's sometimes useful to run a single job from the command line, most tools need to run regularly, using cron. We provide a script called cronsub to make this easier. cronsub is a wrapper around qsub which provides some additional functionality.

Example: if you wanted mytool.py to be submitted at 0300h UTC every day, you could use an entry like this in your cronietab:

0 3 * * * cronsub mytool $HOME/mytool.py

The first argument (mytool) will be used as the name of the job, and the second argument is the command to run. The output file will be set to $HOME/NAME.out, where NAME is the job name; in this case, $HOME/mytool.out.

Among other things, cronsub will prevent a job from running if a job of the same name already exists. This means that if your job is queued, or takes longer to run than expected, a second duplicate job won't be started.

You can specify <tt>qsub</tt> arguments to <tt>cronsub</tt>:

cronsub mytool -l h_rt=0:30:00 $HOME/mytool.py

... but generally it's easier to use <tt>#$</tt> lines in the script itself.

submit.toolserver.org
We have set up a pair of redundant hosts to act as SGE job submission servers. These work by sharing each user's cronietab between both hosts, and executing jobs on whichever server is working. This avoids the problem where jobs run from cronie on one login server (such as willow) fail to run if that host is down, even when other login servers are available.

To use the new hosts, log into submit.toolserver.org and set up a cronietab (*not* a crontab) as normal.

Note that these hosts are *only* for submitting SGE jobs, not for running tools on.

Submitting long-running jobs
Some tools, like bots, are meant to run continuously, and restart if they exit. These tools are not suitable for running in the default queue (<tt>all.q</tt>); instead, we provide a separate queue called <tt>longrun</tt>. To start a job in the <tt>longrun</tt> queue:

$ qsub -q longrun $HOME/longtool.py

However, a better way to start such tools is using <tt>cronsub</tt>. Since <tt>cronsub</tt> won't start duplicate jobs, you can try to start your long-running tools regularly (for example, every 10 minutes); if the job is running, nothing will happen, but if it has exited for some reason, it will be restarted. An example of using <tt>cronsub</tt> this way might be:

*/10 * * * * cronsub -sl longtool $HOME/longtool.py

This will run <tt>cronsub</tt> every 10 minutes. The <tt>-l</tt> argument instructs <tt>cronsub</tt> to start the job in the <tt>longrun</tt> queue.

Running jobs under screen
Many tools need to run under <tt>screen</tt> because they require a terminal to work. You can still run these jobs using the job scheduler, but you need to create a small wrapper script to start <tt>screen</tt>. For <tt>mytool.py</tt>, such a script might be called <tt>mytool.sh</tt> and look like this:

#! /bin/sh
# screen doesn't produce any output, so use /dev/null to avoid creating empty files
#$ -j y
#$ -o /dev/null
exec screen -D -m -S mytool python $HOME/mytool.py

You can then use <tt>cronsub</tt>, as above, to submit the job from <tt>cronie</tt>:

*/10 * * * * cronsub -l mytool $HOME/mytool.sh

This will create a screen session named <tt>mytool</tt> and start <tt>mytool.py</tt> inside it. Tools that run under <tt>screen</tt> are almost always meant to run forever, so here we used the <tt>longrun</tt> queue (<tt>cronsub -l</tt>).

You can attach to the <tt>screen</tt> session using <tt>screen -r mytool</tt> as normal, but first you need to check which host the job is running on:

$ qstat | grep mytool
  80463 0.55500 mytool     rriver       r     11/17/2010 03:48:26 longrun@willow.toolserver.org      1

Here the job is running on <tt>willow</tt>, so that's where you need to run <tt>screen -r</tt>. Attaching to the screen session lets you see what the program is doing, whether it is waiting for input, and what errors it has reported.
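If you look up the host often, the lookup can be scripted. The sketch below assumes the qstat line layout shown in the example above (job ID, priority, name, user, state, date, time, queue@host, slots) and simply splits on whitespace; the function name find_job_hosts is hypothetical, and the parsing is not guaranteed to match every qstat output format.

```python
def find_job_hosts(qstat_output, job_name):
    """Return the hosts on which jobs named job_name are currently running.

    Expects text in the layout of the qstat example above. Queued jobs have
    no queue@host column and are therefore skipped.
    """
    hosts = []
    for line in qstat_output.splitlines():
        fields = line.split()
        # Data rows start with a numeric job ID; skip headers and separators.
        if len(fields) < 8 or not fields[0].isdigit():
            continue
        if fields[2] != job_name:
            continue
        queue_instance = fields[7]  # e.g. longrun@willow.toolserver.org
        if "@" in queue_instance:
            hosts.append(queue_instance.split("@", 1)[1])
    return hosts


sample = (
    "  80463 0.55500 mytool     rriver       r     "
    "11/17/2010 03:48:26 longrun@willow.toolserver.org      1"
)
print(find_job_hosts(sample, "mytool"))
```

In practice the text would come from running qstat, for example through subprocess, rather than from a hard-coded sample.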

Scheduling SQL queries
When writing batch jobs that perform SQL queries, the most important resource is often available SQL capacity rather than CPU or memory. In this case, it is possible to specify that your job needs to run an SQL query on one or more clusters:

#! /bin/sh
#$ -N sqltest
#$ -l sqlprocs-s1=1
mysql -h sql-s1 -BNe 'select count(*) from revision' enwiki_p

The line <tt>#$ -l sqlprocs-s1=1</tt> indicates that this script needs 1 execution slot on the sql-s1 cluster. If free slots are available, the job will run immediately; otherwise, it will wait for a slot to become available. You can also configure this on the <tt>qsub</tt> command line:

% qsub -l sqlprocs-s1=1 sql.sh

Currently, 10 SQL slots are configured for each server, and each query running for longer than 60 seconds counts as using a slot. Replication lag is currently not taken into account, but this will probably change soon.

Note: For long-running jobs (as opposed to jobs which run once then exit), do not reserve any SQL slots; since the program runs continuously, it will take the slots forever and prevent other jobs from running.

Allowing jobs to be automatically restarted or migrated
By default, when a cluster node crashes or reboots, all jobs on it are terminated and will not be restarted, because it's not always safe to restart a job that was previously running. If you would like your job to be restarted when this happens, you can start it as a restartable job using <tt>-r y </tt>. There is no need to do this for jobs in the <tt>longrun</tt> queue, since jobs in that queue are restartable by default.

Migration allows jobs to be moved between nodes while they're running, which improves load distribution and results in better performance. Migration relies on checkpointing -- the ability of a job to save its state and resume when restarted.

We do not provide any automatic checkpointing system; if you wish your job to be migrated, you need to implement this yourself. Examples of jobs that are suitable for migration include:
 * Jobs which work by removing work items from a queue and processing them; when migrated, the job just starts from the top of the queue
 * Jobs which are event-based and wait for work to do, e.g. most IRC bots or recentchanges bots
 * Jobs which regularly save their working state and can resume from the saved state if they are restarted

Most jobs in the <tt>longrun</tt> queue are probably suitable for migration, but it is not enabled by default. To mark a job as a checkpointing (migratable) job, start it with the <tt>-ckpt default</tt> argument.
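As an illustration of the last kind of job in the list above, here is a minimal sketch of a tool that saves its progress after every work item and resumes from the saved state when restarted. The state file format and the names load_state, save_state and run are assumptions made for this example, not something the scheduler provides.

```python
import json
import os
import tempfile


def load_state(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except (IOError, ValueError):
        return {"next_index": 0}


def save_state(path, state):
    """Write the checkpoint to a temporary file, then rename it into place.

    The rename is atomic, so a crash never leaves a half-written checkpoint.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def run(items, process, state_path):
    """Process items in order, resuming after the last checkpointed item."""
    state = load_state(state_path)
    for index in range(state["next_index"], len(items)):
        process(items[index])
        state["next_index"] = index + 1
        save_state(state_path, state)
```

With this structure, an item that was being processed when the job was interrupted is processed again after the restart, so the processing step should be safe to repeat.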

Binaries
Jobs are assumed to be textual scripts of some sort, e.g. shell scripts, Python, Perl, etc. However, it is possible to submit a binary executable as a job:

$ qsub -b y $HOME/mybinary

If you do this, you cannot specify arguments to <tt>qsub</tt> in the executable itself, so you might find it easier to create a shell script that wraps the executable.

Array jobs
Array jobs allow you to run multiple copies of an identical job. This allows you to, for example, submit multiple jobs to process several input files. While you could just use <tt>qsub</tt> with a separate script for every input file, using array jobs is easier, and allows more intelligent scheduling of your jobs.

To submit an array job, use the <tt>-t</tt> argument to <tt>qsub</tt>. For example, if you have input files named <tt>input.1</tt> ... <tt>input.10</tt>, you could use a script like this:

#! /bin/sh
#$ -t 1-10
$HOME/process_input $HOME/input.$SGE_TASK_ID

The <tt>$SGE_TASK_ID</tt> environment variable is set to the task id (from 1 to 10) of each job.
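If the array job is itself a Python script, the task id can be read from the environment in the same way. This is a small sketch, assuming the input.N naming used in the example above; the function name input_path_for_task is hypothetical.

```python
import os


def input_path_for_task(environ, home):
    """Return the input file for this array task, e.g. <home>/input.7.

    Raises KeyError if SGE_TASK_ID is missing and ValueError if it is not
    a number, which would indicate the script is not running as an array task.
    """
    task_id = int(environ["SGE_TASK_ID"])
    return os.path.join(home, "input.%d" % task_id)


if "SGE_TASK_ID" in os.environ and os.environ["SGE_TASK_ID"].isdigit():
    print(input_path_for_task(os.environ, os.path.expanduser("~")))
```

Passing the environment in as an argument keeps the function easy to try out without submitting a job.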

If you request a resource, like <tt>sqlprocs</tt>, the resource is reserved separately for every task. This means if you request 1 SQL slot on s1, and submit an array job with 10 tasks, you will use 10 SQL slots, and none will be left for other users. To avoid this, you can limit concurrent execution of the jobs using <tt>-tc</tt>:

#! /bin/sh
#$ -t 1-10
#$ -tc 3
$HOME/process_input $HOME/input.$SGE_TASK_ID

This will run no more than 3 of the 10 jobs at once.

(Actually, since SQL slots are limited to 5 per cluster per user, this particular example wouldn't be a problem in practice.)

Limiting concurrent jobs
Sometimes, you might want to limit the number of jobs you run at once, e.g. if you run a large number of jobs that access remote sites, and want to prevent overloading the site. If you use array jobs (above), you can do this with <tt>-tc</tt>; otherwise, you can use the <tt>user_slot</tt> resource. This is a resource with an arbitrary limit of 10 per user, which you can request like this:


#$ -l user_slot=2

If you run 15 tasks which each request 2 user slots, no more than 5 of them (10 user slots' worth) will be run at once.
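The arithmetic behind this can be written out as a one-line calculation. The function name jobs_running_at_once is hypothetical, and the sketch only models the user_slot limit, not any other scheduling constraint.

```python
def jobs_running_at_once(total_jobs, slot_limit, slots_per_job):
    """Return how many jobs can hold user slots simultaneously."""
    if slots_per_job <= 0:
        raise ValueError("slots_per_job must be positive")
    # Each running job holds slots_per_job slots out of slot_limit.
    return min(total_jobs, slot_limit // slots_per_job)


# 15 jobs requesting 2 slots each, with the per-user limit of 10 slots.
print(jobs_running_at_once(15, 10, 2))
```

Requesting more slots per job therefore lowers the number of jobs that run together: with 5 slots each, only 2 jobs would run at once.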

You cannot request user slots for jobs in the <tt>longrun</tt> queue, because jobs in this queue are meant to run forever rather than being scheduled.

Managing jobs
To list all your running jobs, use <tt>qstat</tt>:

job-ID  prior   name       user         state submit/start at     queue            slots ja-task-ID
---------------------------------------------------------------------------------------------------
  80576 0.56000 mytool.py  rriver       r     11/17/2010 08:16:10 all.q@wolfsbane      1

This indicates that the job is running (r) in <tt>all.q</tt> (the default queue) on <tt>wolfsbane</tt>.

Deleting jobs
To delete jobs, use the <tt>qdel</tt> command, replacing job-ID with the ID of the job as shown by <tt>qstat</tt>:

% qdel job-ID

If the job is currently running (rather than queued), this will terminate it.

Suspending and unsuspending jobs
Suspending a job allows it to be temporarily paused, and then resumed later. To suspend a job:

% qmod -sj job-ID

The job will be paused by sending it SIGSTOP, and will have the 's' state in <tt>qstat</tt>.

To unsuspend the job and let it continue running:

% qmod -usj job-ID

Advanced features
Sun Grid Engine has several more advanced features, such as array jobs (automatically submitting the same job many times with different arguments), and job dependencies (specifying that a job cannot run until a different job has completed). For more information on these, see the documentation.

Troubleshooting
If you receive an error like "<tt>/usr/bin/env: No such file or directory</tt>", it's probably due to your line endings. Do not use Windows (CRLF) line endings; use Unix (LF) line endings instead.
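If you are not sure whether a script has Windows line endings, a few lines of Python can check and convert it. This is a sketch only; the function names has_crlf and convert_to_lf are hypothetical, and the demonstration works on a throwaway temporary file rather than a real tool.

```python
import os
import tempfile


def has_crlf(path):
    """Return True if the file contains any Windows (CRLF) line endings."""
    with open(path, "rb") as f:
        return b"\r\n" in f.read()


def convert_to_lf(path):
    """Rewrite the file in place with Unix (LF) line endings."""
    with open(path, "rb") as f:
        data = f.read()
    with open(path, "wb") as f:
        f.write(data.replace(b"\r\n", b"\n"))


# Demonstration on a temporary file with CRLF line endings.
fd, demo_path = tempfile.mkstemp(suffix=".py")
with os.fdopen(fd, "wb") as f:
    f.write(b"#! /usr/bin/env python\r\nprint('hello')\r\n")
print(has_crlf(demo_path))
convert_to_lf(demo_path)
print(has_crlf(demo_path))
os.remove(demo_path)
```

The files are opened in binary mode so that Python does not translate the line endings itself while reading or writing.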