Toolserver:Job scheduling


 * For the newtask command on Solaris, see batch project

Batch job scheduling allows users to submit background jobs (e.g. cron tasks) and have them automatically run on a suitable (less busy) host. Batch jobs will be familiar to many people who have used university, research or mainframe systems. The main difference between interactive jobs and batch jobs is that when you run an interactive job, it starts immediately, runs to completion, then exits. A batch job is submitted to the job server; when sufficient system resources are available, the job server starts the job on a suitable (idle) server. The job might be suspended during execution if load is too high, and will resume when resources are available again. After submitting a batch job, you can log out and come back later, when the job has finished, to examine its output. If you like, you can ask to receive mail when the job starts or finishes.

Batch scheduling can be used with any sort of job: regularly scheduled jobs (e.g. tools which run from cron), jobs which run continuously, jobs which need to be started from a CGI script, or tools which are run occasionally and are not time-critical. While a batch job will normally be scheduled for execution immediately, if no system resources are free, it could be delayed. (This is currently very unlikely on the Toolserver.)

For users, the main advantage of batch jobs is that you do not need to worry about where to start a job, or whether the job needs to use a particular nice level, and so on; the job server will handle that for you. As we add more job execution servers to the cluster, your jobs will automatically take advantage of the new resources with no changes needed from you.

For admins, batch jobs give us tighter control over resource allocation and usage, and allow us to see more clearly how the Toolserver is being used.

The batch job software we have chosen is Sun Grid Engine. Full documentation for users is available here; some common examples are described below.

Queues and jobs
The basic method by which jobs are scheduled is the queue. A queue is a list of jobs which have been submitted. Each queue has a certain number of execution slots; when free slots are available, jobs are moved into them and begin running. When no more slots are available, jobs are queued, and will start when an execution slot is available.

There are two queues on the Toolserver: all.q, the default queue, and longrun, which should be used for long-running jobs (those which are meant to run forever).

Quick start
Here we will demonstrate how to migrate a typical tool -- a script that runs from cron -- to SGE. However, you should probably still at least skim the rest of this page after reading this.

This procedure works for any script file or binary, e.g. shell script, Perl, Python, PHP, etc. In this example we will use a Python script, "mytool.py".

First, you need to create a shell script that starts the job. The first line of the script should be #! /bin/sh, and the following lines should be SGE directives: lines starting with "#$" that modify how SGE runs the job. Our example script to start mytool.py will be called mytool.sh.


 * Add the header: <tt>#! /bin/sh</tt>.
 * Add the following line: <tt>#$ -j y</tt>. This sends normal output and errors from the job to the same file, which is usually what you want.
 * If you want to receive mail when the job ends, add this line: <tt>#$ -m e</tt>. If you also want to receive mail when the job starts, use this line instead: <tt>#$ -m be</tt>.
 * Decide where to send the output from the job: <tt>#$ -o $HOME/mytool.out</tt>. If you don't want to save the output, use this: <tt>#$ -o /dev/null</tt>.
 * If your job runs SQL queries, request SQL execution slots. For example, if you run a single query at a time on s1 (enwiki), add: <tt>#$ -l sqlprocs-s1=1</tt>. If the job is complex and the number of SQL slots isn't known in advance, you can either estimate it, or leave this out entirely.
 * Finally, add a line to start the actual job: <tt>exec $HOME/mytool.py</tt>.

Note: For long-running jobs (as opposed to jobs which run once then exit), do not reserve any SQL slots; since the program runs continuously, it will take the slots forever and prevent other jobs from running.

Here is how <tt>mytool.sh</tt> might look:

  #! /bin/sh
  #$ -j y
  #$ -o $HOME/mytool.out
  #$ -l sqlprocs-s1=1,sqlprocs-s4=1
  exec $HOME/mytool.py

Next, edit the crontab entry for the job and change it to use <tt>cronsub</tt> with the shell script you just created.

For a regularly scheduled job, the cron entry might look like this:

  0 3 * * * $HOME/mytool.py

To use SGE, the entry should be changed to:

  0 3 * * * cronsub mytool $HOME/mytool.sh

For a long-running job using Phoenix, the entry might look like this:

  0,5,10,15,20,25,30,35,40,45,50,55 * * * * phoenix $HOME/phoenix-program python $HOME/whatever/program.py

This should be changed to:

  0,5,10,15,20,25,30,35,40,45,50,55 * * * * cronsub -l mytool $HOME/mytool.sh

The <tt>-l</tt> argument places the job into the <tt>longrun</tt> queue, which should be used for jobs that run forever.

(NB: on Linux, you need to use <tt>/usr/local/bin/cronsub</tt> instead.)

The name of the job, here "mytool", is used to make the job easier to identify in listings, and to prevent two copies of the same tool running at once. Make sure you use a different name for each tool. Since <tt>cronsub</tt> will not start another job if a job with the same name already exists, you can regularly try to restart a long-running job, ensuring it is started again if it crashes or if the system reboots.

Once you're done, your job is set up to run under SGE. When the job starts, you can run <tt>qstat</tt> to observe that it's running, and see which host it was started on.

Submitting a job
To submit a job, use the <tt>qsub</tt> command:

% qsub -N test test.sh

This submits the shell script "test.sh" as a job with the name "test". Giving the job a name is not required, but is recommended so you can easily identify the job. If you want to run a binary instead of a shell script, use <tt>qsub -b y</tt>.
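For example, a compiled program could be submitted directly like this (the binary path is just an illustration):

```
% qsub -b y -N mybinary $HOME/bin/mybinary
```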

If you want to receive mail when a job completes, use the <tt>-m e</tt> option:

% qsub -N test -m e test.sh

To receive mail when a job starts and when it finishes, use <tt>-m be</tt>.

When your job is finished, its output (stdout) will be written to the file <tt>test.oX</tt>, where <tt>test</tt> is the job name and X is the job ID. The job's standard error is written to <tt>test.eX</tt>. You can override this using the <tt>-o</tt> and <tt>-e</tt> options to <tt>qsub</tt>.

If you want the job to be scheduled immediately, rather than being queued, use the <tt>-now y</tt> option. If the job cannot be scheduled immediately (because there are no available resources), it will fail.

To submit a job to a specific queue, use the <tt>-q</tt> option:

% qsub -q longrun -N test test.sh

Displaying jobs
To display your jobs, use the <tt>qstat</tt> command:

% qstat
job-ID  prior    name   user    state  submit/start at      queue                        slots  ja-task-ID
-----------------------------------------------------------------------------------------------------------
     8  0.55500  test   rriver  r      09/15/2009 16:15:51  all.q@willow.toolserver.org      1

Here, the job test is running (state r), in the all.q queue (the default) on willow.

Suspending and unsuspending jobs
To suspend a job, pass its job ID (as shown by <tt>qstat</tt>) to <tt>qmod</tt>:

% qmod -sj <job-id>

The job will be paused by sending it SIGSTOP, and will have the 's' state in <tt>qstat</tt>.

To unsuspend the job and let it continue running:

% qmod -usj <job-id>

Deleting jobs
To delete a job, use the <tt>qdel</tt> command with the job ID:

% qdel <job-id>

Interactive jobs
An interactive job is a special kind of job which, instead of running a command, requests a shell on an idle system. To start an interactive job, use <tt>qlogin</tt>:

% qlogin
Your job 11 ("QLOGIN") has been submitted
waiting for interactive job to be scheduled ...
Your interactive job 11 has been successfully scheduled.
Establishing builtin session to host willow ...
Sun Microsystems Inc.  SunOS 5.10      Generic January 2005
%

To exit the interactive session, exit the shell (e.g. by typing CTRL+D).

Special queues
The default queues, all.q and longrun, include all login servers in the cluster. If you have a job which can only be run on a particular type of host, you can request that it only be executed on a particular operating system.

To run a job on Solaris only:

% qsub -l arch=sol-amd64 -N test test.sh

To run a job on Linux only:

% qsub -l arch=lx24-amd64 -N test test.sh

You can also specify that a job runs on a particular server:

% qsub -l hostname=willow -N test test.sh

We strongly recommend that you write jobs which can run on either kind of server, and use the default queue. This will provide the most flexibility when scheduling your job. (In particular, we are unlikely to add more Linux servers, so writing jobs which also run on Solaris will increase the resources available to your jobs.)

Long-running jobs -- those which are meant to run forever and restart if they exit -- should be started in the longrun queue instead of all.q:

% qsub -q longrun -N test test.sh

Embedding options in the script
Instead of specifying options on the <tt>qsub</tt> command line, it is possible to embed these options in the script, using comment lines starting with <tt>#$</tt>. For example,

  #! /bin/sh
  # Name the job "testing".
  #$ -N testing
  # Send email when job finishes.
  #$ -m e
  # Store output in a different place.
  #$ -o /home/jsmith/testing.out
  # Send errors to the normal output file instead of a separate error file.
  #$ -j y
  <rest of script...>

Scheduling SQL queries
When writing batch jobs that perform SQL queries, the most important resource is often available SQL capacity rather than CPU or memory. In this case, it is possible to specify that your job needs to run an SQL query on one or more clusters:

  #! /bin/sh
  #$ -N sqltest
  #$ -l sqlprocs-s1=1
  mysql -h sql-s1 -BNe 'select count(*) from revision' enwiki_p

The line <tt>#$ -l sqlprocs-s1=1</tt> indicates that this script needs 1 execution slot on the sql-s1 cluster. If free slots are available, the job will run immediately; otherwise, it will wait for a slot to become available. You can also configure this on the <tt>qsub</tt> command line:

% qsub -l sqlprocs-s1=1 sql.sh

Currently, 10 SQL slots are configured for each server, and each query running for longer than 60 seconds counts as using a slot. Replication lag is currently not taken into account, but this will probably change soon.

Note: For long-running jobs (as opposed to jobs which run once then exit), do not reserve any SQL slots; since the program runs continuously, it will take the slots forever and prevent other jobs from running.

Running jobs from cron
It is possible to invoke <tt>qsub</tt> from cron in order to schedule jobs regularly; this is preferable to simply running the job directly from cron, as it will handle resource allocation, and execute the job on an idle host.

However, if the job runs for a long time, or spends a long time in the queue, you need to avoid scheduling the job multiple times. (For example, if you run a job every 10 minutes, and it is queued for 15 minutes, another job will be scheduled before the first job has run.) Additionally, <tt>qsub</tt> needs some environment variables, such as <tt>$SGE_ROOT</tt>, which are not set in cron by default.

To avoid both these problems, we provide a script called <tt>cronsub</tt>, which you can run from cron like this:

0 3 * * * cronsub -s myjob $HOME/myjob.sh

On Linux, you need to specify the full path to <tt>cronsub</tt>:

0 3 * * * /usr/local/bin/cronsub -s myjob $HOME/myjob.sh

This will create a job called <tt>myjob</tt> which executes <tt>$HOME/myjob.sh</tt>; however, if a job with that name already exists, it will do nothing. The <tt>-s</tt> argument causes output to be sent to <tt>$HOME/myjob.out</tt> instead of the SGE default files (<tt>myjob.oX</tt> and <tt>myjob.eX</tt>, where X is the job ID).

As an alternative to providing a script file, you can embed the script file in crontab, using %:

0 3 * * * cronsub -s myjob % $HOME/dosomething % echo "Done!"

Each % is treated as a new line; lines after the first (the <tt>cronsub</tt> command) are sent as input to <tt>qsub</tt>, which treats them as a script to execute.

<tt>cronsub</tt> does not allow you to pass any additional options to <tt>qsub</tt>, but you can specify these in the script itself using the <tt>#$</tt> syntax described above.

The executable file you pass to <tt>cronsub</tt> needs to be a plain shell script that will execute on the default system shell; the <tt>#!</tt> line is ignored. Binary files will not work.

Since <tt>cronsub</tt> will not start a job if a job with the same name is already running, it can be used to replace Phoenix; simply try to start the job from cron every 10 minutes, and if the job has exited for some reason (e.g. if the server rebooted), it will be restarted. Long-running jobs like this should be started with <tt>cronsub -l</tt>, which will place the job in the longrun queue.
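For example, a crontab entry that keeps a hypothetical long-running bot alive might look like this (the job name and script path are placeholders):

```
0,10,20,30,40,50 * * * * cronsub -l mybot $HOME/mybot.sh
```

If a job named mybot is still running, each invocation does nothing; if the job has exited, it is resubmitted to the longrun queue.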

Checkpointing / restartable jobs
Checkpointing jobs are jobs which can be interrupted (killed) and restarted on another node. Checkpointing allows SGE to migrate jobs to another node if the current node becomes overloaded. Restartable jobs are jobs which should be restarted (instead of aborted) if the node they are running on crashes or reboots. You should try to create checkpointing or restartable jobs if possible.

The Toolserver supports "user-defined" checkpointing, which assumes that jobs themselves write whatever state data they need to restart later.

Examples of jobs that can use checkpointing include:


 * Long-running bots which spend most of their time waiting for an event to act on (e.g. IRC bots, recentchanges bots, etc)
 * Jobs which read a list of tasks from a file or database and remove tasks once completed
 * Jobs which save their state while running and can use this state to continue processing if they are restarted
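As a minimal sketch of the task-list pattern above (the file name and tasks are invented for illustration), a job can keep its remaining work in a file and remove each task once it is finished, so a restarted job simply resumes with whatever is left:

```shell
#! /bin/sh
# Hypothetical task list, one task per line; a real job would already have this file.
TASKS=./tasks.txt
printf 'taskA\ntaskB\ntaskC\n' > "$TASKS"

while [ -s "$TASKS" ]; do
    task=`head -n 1 "$TASKS"`
    echo "processing $task"        # the real work would happen here
    # Drop the finished task; if the job is killed and restarted,
    # it picks up from the first remaining line.
    sed 1d "$TASKS" > "$TASKS.tmp" && mv "$TASKS.tmp" "$TASKS"
done
```

Because all progress lives in the file, the job can be interrupted between tasks and restarted on another node without losing work.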

Examples of jobs that can be restartable include:


 * Jobs which finish quickly and can be restarted from the beginning without a problem
 * Jobs which should not be interrupted while running, but which can be safely restarted if they are
 * IRC bots, if you don't want the bot to occasionally disconnect and reconnect to the server (as they might with checkpointing)
 * Any jobs in the <tt>longrun</tt> queue are probably suitable for restarting, and are restartable by default

Any job which can be interrupted and then restarted is suitable for checkpointing. However, if you want to be sure that the job will never be voluntarily migrated to another node (e.g. because it would restart processing from the beginning), do not use a checkpointing job. If the job should not be voluntarily migrated, but should be restarted on node failure, create it as a restartable job.

To indicate that your job supports checkpointing, start it with <tt>qsub -ckpt default</tt>, or include <tt>#$ -ckpt default</tt> in the job's start script.

To indicate that a job is restartable, start it with <tt>qsub -r y</tt>, or include <tt>#$ -r y</tt> in the job's start script.
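Putting the two directives together, the start script for a job that is both checkpoint-aware and restartable might begin like this (the job name and program path are hypothetical):

```
#! /bin/sh
#$ -N mybot
#$ -ckpt default
#$ -r y
exec $HOME/mybot.py
```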

Note: jobs started in the <tt>longrun</tt> queue are restartable by default. To disable this, start the job with <tt>-r n</tt>.

Advanced features
Sun Grid Engine has several more advanced features, such as array jobs (automatically submitting the same job many times with different arguments), and job dependencies (specifying that a job cannot run until a different job has completed). For more information on these, see the documentation.