Toolserver:Job scheduling


 * For the newtask command on Solaris, see batch project

Job scheduling is the primary method by which tools should be started on the Toolserver. Jobs (i.e., tools) are submitted to the scheduler, which then starts the job on an appropriate host, based on factors like current load. Using batch scheduling means you don't need to worry about where to start a job, or whether the job should be started during off-peak hours, etc. Job scheduling can be used for any sort of tool, whether it's a one-off job, a tool like a bot which needs to run permanently, or a regular job run from cron.

While it's possible to run jobs on a server directly, without using job scheduling, this is strongly discouraged, since it makes it harder for the Toolserver administrators to manage server resources and load.

Job scheduling works using queues. When jobs are submitted, they are placed in a queue. When there are sufficient free system resources to execute a job, it is removed from the queue and starts running. If the system is busy, there might be no free resources, and jobs will be queued until more resources become available. At present, it's very unlikely that jobs will be queued in this way, since we have plenty of free resources.

Quick start
You should read the rest of this page to become familiar with the job scheduler, but here are some examples to get you started.

NB: These examples use the cronsub command. On Linux (nightshade) you need to use /usr/local/bin/cronsub instead.

Converting an existing cron job to use the scheduler
Example: You have a tool, mytool.py, which runs from cron at 0300 UTC every day:

 0 3 * * * $HOME/mytool.py

To run this tool under the job scheduler, change it to:

 0 3 * * * cronsub -s mytool $HOME/mytool.py

The output from the tool will be written to $HOME/mytool.out.

Converting a Phoenix tool to use the scheduler
Example: You have a tool, mytool.py, which runs under Phoenix so it's automatically (re)started:

 0,10,20,30,40,50 * * * * phoenix $HOME/phoenix-mytool $HOME/mytool.py

To run this tool under the job scheduler, change it to:

 0,10,20,30,40,50 * * * * cronsub -sl mytool $HOME/mytool.py

The output from the tool will be written to $HOME/mytool.out.

Running a tool only on Solaris or Linux
To force your tool to run only on Solaris, add these lines near the top (assuming a Python script):

 #! /usr/bin/python
 #$ -l arch=sol-amd64

Or to force it to run under Linux:

 #! /usr/bin/python
 #$ -l arch=lx24-amd64

Receiving mail when the job starts or finishes
To receive mail when the job finishes, add these lines (assuming a Python script):

 #! /usr/bin/python
 #$ -m ae

To receive mail when it starts as well, use instead:

 #$ -m bae

Submitting jobs
To submit a job, use the qsub command:

 $ qsub $HOME/mytool.py
 Your job 80570 ("mytool.py") has been submitted

The job ID is 80570, and the job name is "mytool.py". The scheduler will place the job in the default queue, and eventually run it on a suitable host. Once the job has finished, it will be removed from the system.

You can use qstat to see the job running:

 willow% qstat
 job-ID prior   name        user         state submit/start at     queue           slots ja-task-ID
 --------------------------------------------------------------------------------------------------
  80570 0.56000 mytool.py   rriver       r     11/17/2010 08:16:10 all.q@wolfsbane     1

If the job produced output, this will be saved in <tt>$HOME/mytool.py.o80570</tt> for normal output (stdout), and <tt>$HOME/mytool.py.e80570</tt> for errors (stderr). Having two separate files with effectively random names is not always very helpful, so you can force the output to go to a single file when submitting the job:

 $ qsub -j y -o $HOME/mytool.out $HOME/mytool.py

The <tt>-j y</tt> argument forces all output to go to a single file, and <tt>-o</tt> specifies the location of the file.

Rather than specifying arguments to <tt>qsub</tt> every time the job is run, you can instead put them in the script itself, using special directives starting with <tt>#$</tt>:

 #! /usr/bin/python
 #$ -j y
 #$ -o $HOME/mytool.out
 ... rest of script ...

If you want to receive mail when a job finishes, use <tt>-m e</tt>. To receive mail when a job starts and when it finishes, use <tt>-m be</tt>.

By default, jobs are limited to 6 hours runtime. If a job runs for longer than this, it will be killed. This is done to prevent runaway jobs accidentally using a large amount of system resources. If you expect your job to need longer than 6 hours to complete, you can request more time using <tt>-l</tt>:

 $ qsub -l h_rt=24:00:00 slowjob.py  # allow up to 24 hours runtime

You cannot request more than 120 hours (5 days).

If you want a warning before your job is killed, specify <tt>s_rt</tt> with a value lower than <tt>h_rt</tt>, for example:

 $ qsub -l h_rt=1:00:00 -l s_rt=0:55:00 slowjob.py

This will send a SIGUSR1 signal after 55 minutes, which you can catch to perform cleanup before the job ends. After 1 hour, SIGKILL will be sent.
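From a Python script, the SIGUSR1 warning can be caught with the standard signal module. A minimal sketch (the handler name and what it does on warning are illustrative; a real tool would save its state here):

```python
import os
import signal

cleanup_done = False

def on_warning(signum, frame):
    # SGE sends SIGUSR1 when the s_rt soft limit is reached; this is
    # the chance to save state before SIGKILL arrives at the h_rt limit.
    global cleanup_done
    cleanup_done = True

signal.signal(signal.SIGUSR1, on_warning)

# Simulate the scheduler's warning by signalling ourselves:
os.kill(os.getpid(), signal.SIGUSR1)
```

Note that SIGKILL cannot be caught, so any cleanup must happen in the window between <tt>s_rt</tt> and <tt>h_rt</tt>.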

Since jobs can be started on any host, it's possible that they will be started on either a Linux or Solaris server. If your tool can only run on Solaris, you can request that it only be started on a Solaris host:

 $ qsub -l arch=sol-amd64 soljob.py

You can also request a Linux host using <tt>-l arch=lx24-amd64</tt>, but we will be converting the last Linux host to Solaris in January 2011, so this is not recommended.

Submitting jobs from cron
While it's sometimes useful to run a single job from the command line, most tools need to run regularly, using cron. We provide a script called <tt>cronsub</tt> to make this easier.

Example: if you wanted <tt>mytool.py</tt> to be submitted at 0300h UTC every day, you could use an entry like this in your crontab:

 0 3 * * * cronsub mytool $HOME/mytool.py

The first argument (<tt>mytool</tt>) will be used as the name of the job, and the second argument is the command to run.

Among other things, <tt>cronsub</tt> will prevent a job from running if a job of the same name already exists. This means that if your job is queued, or takes longer to run than expected, a second duplicate job won't be started.

NB: on Linux (<tt>nightshade</tt>), you need to use <tt>/usr/local/bin/cronsub</tt> instead.

Submitting long-running jobs
Some tools, like bots, are meant to run continuously, and restart if they exit. These tools are not suitable for running in the default queue (<tt>all.q</tt>); instead, we provide a separate queue called <tt>longrun</tt>. To start a job in the <tt>longrun</tt> queue:

 $ qsub -q longrun $HOME/longtool.py

However, a better way to start such tools is using <tt>cronsub</tt>. Since <tt>cronsub</tt> won't start duplicate jobs, you can try to start your long-running tools regularly (for example, every 10 minutes); if the job is running, nothing will happen, but if it has exited for some reason, it will be restarted. An example of using <tt>cronsub</tt> this way might be:

 0,10,20,30,40,50 * * * * cronsub -l longtool $HOME/longtool.py

This will run <tt>cronsub</tt> every 10 minutes. The <tt>-l</tt> argument instructs <tt>cronsub</tt> to start the job in the <tt>longrun</tt> queue.

Running jobs under screen
Many tools need to run under <tt>screen</tt> because they require a terminal to work. You can still run these jobs using the job scheduler, but you need to create a small wrapper script to start <tt>screen</tt>. For <tt>mytool.py</tt>, such a script might be called <tt>mytool.sh</tt> and look like this:

 #! /bin/sh
 # screen doesn't produce any output, so use /dev/null to avoid creating empty files
 #$ -j y
 #$ -o /dev/null

 exec screen -D -m -S mytool python $HOME/mytool.py

You can then use <tt>cronsub</tt>, as above, to submit the job from <tt>cron</tt>:

 0,10,20,30,40,50 * * * * cronsub -l mytool $HOME/mytool.sh

This will create a screen session named <tt>mytool</tt> and start <tt>mytool.py</tt> inside it. Tools that run under <tt>screen</tt> are almost always meant to run forever, so here we used the <tt>longrun</tt> queue (<tt>cronsub -l</tt>).

You can attach to the <tt>screen</tt> session using <tt>screen -r myprog</tt> as normal, but first you need to check which host the job is running on:

 $ qstat | grep mytool
  80463 0.55500 mytool    rriver       r     11/17/2010 03:48:26 longrun@willow.toolserver.org      1

Here the job is running on <tt>willow</tt>, so that's where you need to run <tt>screen -r</tt>.

Scheduling SQL queries
When writing batch jobs that perform SQL queries, the most important resource is often available SQL capacity rather than CPU or memory. In this case, it is possible to specify that your job needs to run an SQL query on one or more clusters:

 #! /bin/sh
 #$ -N sqltest
 #$ -l sqlprocs-s1=1

 mysql -h sql-s1 -BNe 'select count(*) from revision' enwiki_p

The line <tt>#$ -l sqlprocs-s1=1</tt> indicates that this script needs 1 execution slot on the sql-s1 cluster. If free slots are available, the job will run immediately; otherwise, it will wait for a slot to become available. You can also configure this on the <tt>qsub</tt> command line:

% qsub -l sqlprocs-s1=1 sql.sh

Currently, 10 SQL slots are configured for each server, and each query running for longer than 60 seconds counts as using a slot. Replication lag is currently not taken into account, but this will probably change soon.

Note: For long-running jobs (as opposed to jobs which run once then exit), do not reserve any SQL slots; since the program runs continuously, it will take the slots forever and prevent other jobs from running.

Allowing jobs to be automatically restarted or migrated
By default, when a cluster node crashes or reboots, all jobs on it are terminated and will not be restarted, because it's not always safe to restart a job that was previously running. If you would like your job to be restarted when this happens, you can start it as a restartable job using <tt>-r y</tt>. There is no need to do this for jobs in the <tt>longrun</tt> queue, since jobs in that queue are restartable by default.

Migration allows jobs to be moved between nodes while they're running, which improves load distribution and results in better performance. Migration relies on checkpointing -- the ability of a job to save its state and resume when restarted.

We do not provide any automatic checkpointing system; if you wish your job to be migrated, you need to implement this yourself. Examples of jobs that are suitable for migration include:
 * Jobs which work by removing work items from a queue and processing them; when migrated, the job just starts from the top of the queue
 * Jobs which are event-based and wait for work to do, e.g. most IRC bots or recentchanges bots
 * Jobs which regularly save their working state and can resume from the saved state if they are restarted

Most jobs in the <tt>longrun</tt> queue are probably suitable for migration, but it is not enabled by default. To mark a job as a checkpointing (migratable) job, start it with the <tt>-ckpt default</tt> argument.
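The last pattern, regularly saving working state so a restarted job can resume, might be sketched like this in Python (the state file name and the work loop are illustrative, not part of any Toolserver API):

```python
import json
import os

STATE_FILE = "mytool.state"  # hypothetical checkpoint file

def load_state():
    # Resume from the last checkpoint if one exists.
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f)
    return {"next_item": 0}

def save_state(state):
    # Write to a temporary file and rename, so being killed
    # mid-write can't corrupt the checkpoint.
    tmp = STATE_FILE + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.rename(tmp, STATE_FILE)

state = load_state()
for item in range(state["next_item"], 10):
    # ... process work item `item` ...
    state["next_item"] = item + 1
    save_state(state)
```

If the job is killed and restarted, it picks up from the last checkpointed item rather than starting over.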

Binaries
Jobs are assumed to be textual scripts of some sort, e.g. shell scripts, Python, Perl, etc. However, it is possible to submit a binary executable as a job:

 $ qsub -b y $HOME/mybinary

If you do this, you cannot specify arguments to <tt>qsub</tt> in the executable. You might find it easier to create a shell script that wraps the executable.
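Such a wrapper might look like the following (the binary path and job name are placeholders); since the wrapper is a text script, <tt>#$</tt> directives can live in it even though they can't go in the binary:

```shell
#! /bin/sh
#$ -N mybinary
#$ -j y
#$ -o $HOME/mybinary.out
exec $HOME/mybinary "$@"
```

The wrapper is then submitted with plain <tt>qsub</tt>, with no need for <tt>-b y</tt>.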

Array jobs
Array jobs allow you to run multiple copies of an identical job. This allows you to, for example, submit multiple jobs to process several input files. While you could just use <tt>qsub</tt> with a separate script for every input file, using array jobs is easier, and allows more intelligent scheduling of your jobs.

To submit an array job, use the <tt>-t</tt> argument to <tt>qsub</tt>. For example, if you have input files named <tt>input.1</tt> ... <tt>input.10</tt>, you could use a script like this:

 #! /bin/sh
 #$ -t 1-10

 $HOME/process_input $HOME/input.$SGE_TASK_ID

The <tt>$SGE_TASK_ID</tt> environment variable is set to the task id (from 1 to 10) of each job.
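Inside a Python job, the task id can be read from the environment in the usual way; a minimal sketch (the default of 1 is only for running the script by hand, outside the scheduler):

```python
import os

# SGE sets SGE_TASK_ID in the environment of each task of an array job.
task_id = int(os.environ.get("SGE_TASK_ID", "1"))
input_file = os.path.expanduser("~/input.%d" % task_id)
```

Each task then processes only its own input file, so the ten tasks can run in parallel without coordination.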

If you request a resource, like <tt>sqlprocs</tt>, the resource is reserved separately for every task. This means if you request 1 SQL slot on s1, and submit an array job with 10 tasks, you will use 10 SQL slots, and none will be left for other users. To avoid this, you can limit concurrent execution of the jobs using <tt>-tc</tt>:

 #! /bin/sh
 #$ -t 1-10
 #$ -tc 3

 $HOME/process_input $HOME/input.$SGE_TASK_ID

This will run no more than 3 of the 10 jobs at once.

(Actually, since SQL slots are limited to 5 per cluster per user, this particular example wouldn't be a problem in practice.)

Limiting concurrent jobs
Sometimes, you might want to limit the number of jobs you run at once, e.g. if you run a large number of jobs that access remote sites, and want to prevent overloading the site. If you use array jobs (above), you can do this with <tt>-tc</tt>; otherwise, you can use the <tt>user_slot</tt> resource. This is a resource with an arbitrary limit of 10 per user, which you can request like this:


 #$ -l user_slot=2

If you run 15 tasks which each request 2 user slots, no more than 5 of them (10 user slots' worth) will be run at once.

You cannot request user slots for jobs in the <tt>longrun</tt> queue, because jobs in this queue are meant to run forever rather than being scheduled.

Managing jobs
To list all your running jobs, use <tt>qstat</tt>:

 job-ID prior   name        user         state submit/start at     queue           slots ja-task-ID
 --------------------------------------------------------------------------------------------------
  80576 0.56000 mytool.py   rriver       r     11/17/2010 08:16:10 all.q@wolfsbane     1

This indicates that the job is running (r) in <tt>all.q</tt> (the default queue) on <tt>wolfsbane</tt>.

Deleting jobs
To delete jobs, use the <tt>qdel</tt> command:

 % qdel <job-id>

If the job is currently running (rather than queued), this will terminate it.

Suspending and unsuspending jobs
Suspending a job allows it to be temporarily paused, and then resumed later. To suspend a job:

 % qmod -sj <job-id>

The job will be paused by sending it SIGSTOP, and will have the 's' state in <tt>qstat</tt>.

To unsuspend the job and let it continue running:

 % qmod -usj <job-id>

Advanced features
Sun Grid Engine has several more advanced features beyond those described here, such as job dependencies (specifying that a job cannot run until a different job has completed). For more information on these, see the documentation.