Toolserver:User:River/Job scheduling

Job scheduling (also called Sun Grid Engine, or SGE) is the preferred method to run tools on the Toolserver. Users submit jobs to a queue, and the scheduler runs them on an appropriate host (which might not be the host they were submitted on). While it's possible to run jobs without using the job scheduler, this is discouraged.

The usual command to submit jobs is qsub, but we provide a simplified wrapper called cronsub. To run a tool "mytool.py" using cronsub:

 $ cronsub -s mytool $HOME/mytool.py

The first argument (mytool) is the job name, and the rest of the line is the command to start the tool. The output from the command will be written to $HOME/mytool.out.

By default, jobs are limited to 6 hours runtime, to prevent runaway tools from using too many system resources. If your tool needs more time than this to complete, you can add a line like this to the script itself:

 #$ -l h_rt=8:00:00

which would request 8 hours. You cannot request more than 120 hours (5 days).

If your tool is meant to run forever, then instead of specifying h_rt, you should use cronsub -l:

 $ cronsub -sl mytool $HOME/mytool.py

To see your running jobs, use qstat:

 $ qstat
 job-ID  prior   name        user         state submit/start at     queue           slots ja-task-ID
 ---------------------------------------------------------------------------------------------------
  80576  0.56000 mytool.py   rriver       r     11/17/2010 08:16:10 all.q@wolfsbane     1

That's all you need to know to start using the job scheduler. You might want to read the rest of this page, which documents some more advanced features.

Running jobs under screen
Many tools need to run under screen because they require a terminal to work. You can still run these jobs using the job scheduler, but you need to create a small wrapper script to start screen. For mytool.py, such a script might be called mytool.sh and look like this:

 #! /bin/sh
 # screen doesn't produce any output, so use /dev/null to avoid creating empty files
 #$ -j y
 #$ -o /dev/null
 exec screen -D -m -S mytool python $HOME/mytool.py

You can then use cronsub, as above, to submit the job from cron:

 0,10,20,30,40,50 * * * * cronsub -l mytool $HOME/mytool.sh

This will create a screen session named mytool and start mytool.py inside it. Tools that run under screen are almost always meant to run forever, so here we used cronsub -l.

You can attach to the screen session using screen -r mytool as normal, but first you need to check which host the job is running on:

 $ qstat | grep mytool
  80463  0.55500 mytool      rriver       r     11/17/2010 03:48:26 longrun@willow.toolserver.org      1

Here the job is running on willow, so that's where you need to run screen -r.
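The host lookup can also be scripted. The following is a minimal sketch, not part of the Toolserver tooling: job_host is a hypothetical helper name, and it assumes the default qstat column layout shown above (job name in the third column, queue@host in the eighth) and a job name short enough that qstat does not truncate it.

```shell
#!/bin/sh
# job_host is a hypothetical helper, not a Toolserver or SGE command.
# It reads qstat output on standard input and prints the execution host
# of every job whose name (column 3) equals $1. Jobs that are still
# waiting have no queue@host entry in column 8 and are skipped.
job_host() {
    awk -v name="$1" '$3 == name && $8 ~ /@/ {
        sub(/^[^@]*@/, "", $8)   # drop the queue name and the "@"
        print $8
    }'
}

# Demonstration using the sample qstat line from this page.
printf '%s\n' '80463 0.55500 mytool    rriver       r     11/17/2010 03:48:26 longrun@willow.toolserver.org      1' |
    job_host mytool
# prints willow.toolserver.org
```

On a login host the same function would be fed by the real command, for example qstat | job_host mytool.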

Mail notifications
To receive mail when a job finishes or is aborted, add this line to the script:

 #$ -m ae

To receive mail when the job starts as well:

 #$ -m bae

Output file
If you want to change the output file (by default $HOME/toolname.out), remove the -s argument to cronsub, and add these lines to the script:

 #$ -j y
 #$ -o $HOME/some/path/output.txt

Scheduling SQL queries
When writing batch jobs that perform SQL queries, the most important resource is often available SQL capacity rather than CPU or memory. In this case, it is possible to specify that your job needs to run an SQL query on one or more clusters:

 #! /bin/sh
 #$ -l sqlprocs-s1=1
 mysql -h sql-s1 -BNe 'select count(*) from revision' enwiki_p

The line <tt>#$ -l sqlprocs-s1=1</tt> indicates that this script needs 1 execution slot on the sql-s1 cluster. If free slots are available, the job will run immediately; otherwise, it will wait for a slot to become available.

Currently, 20 SQL slots are configured for each server, and each query running for longer than 60 seconds counts as using a slot.

Note: For long-running jobs (as opposed to jobs which run once then exit), do not reserve any SQL slots; since the program runs continuously, it will take the slots forever and prevent other jobs from running.

Array jobs
An array job is a collection of identical jobs which act on different data. An example of an array job might be a job which runs the same report on several wikis.

To create an array job, add this to the script:

 #$ -t 1-15

This will create 15 tasks, numbered from 1 to 15. The task number is available in the script as <tt>$SGE_TASK_ID</tt>; you will need some way to map this to the task's input, e.g. taking the Nth line of a text file.
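One way to map the task number to its input is sketched below, under stated assumptions: the input list $HOME/wikis.txt (one wiki name per line) and the helper name nth_line are hypothetical, and the echo line stands in for whatever the real report command would be.

```shell
#!/bin/sh
#$ -t 1-15

# nth_line is a hypothetical helper: print line number $1 of file $2.
# Lines are counted from 1, which matches the task numbering.
nth_line() {
    sed -n "${1}p" "$2"
}

# Only do the work when running as an array task. Grid Engine sets
# SGE_TASK_ID to "undefined" for jobs that are not array jobs.
if [ -n "${SGE_TASK_ID:-}" ] && [ "$SGE_TASK_ID" != "undefined" ]; then
    # $HOME/wikis.txt is an assumed input file with one wiki per line.
    wiki=$(nth_line "$SGE_TASK_ID" "$HOME/wikis.txt")
    echo "task $SGE_TASK_ID: running report for $wiki"
fi
```

With this layout, task 3 would process whatever is on the third line of the list, so the file needs at least as many lines as there are tasks.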

If you want to limit the number of tasks in an array job which run at the same time, use <tt>-tc</tt>:
 # Limit to 5 concurrent tasks
 #$ -tc 5

Limiting concurrent jobs
Sometimes, you might want to limit the number of jobs you run at once, e.g. if you run a large number of jobs that access remote sites, and want to prevent overloading the site. If you use array jobs (above), you can do this with <tt>-tc</tt>; otherwise, you can use the <tt>user_slot</tt> resource. This is a resource with an arbitrary limit of 10 per user, which you can request like this:


 #$ -l user_slot=2

If you run 15 tasks which each request 2 user slots, no more than 5 of them (10 user slots' worth) will be run at once.
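The arithmetic generalises to other slot requests. As a rough sketch (max_concurrent is a hypothetical helper, and the default limit of 10 is taken from the text above), the number of jobs that can run at once is the limit divided by the slots each job requests, rounded down:

```shell
#!/bin/sh
# max_concurrent is a hypothetical helper, not an SGE command.
# $1 = user slots requested per job
# $2 = per-user slot limit (assumed to be 10 if not given)
max_concurrent() {
    per_job=$1
    limit=${2:-10}
    echo $(( limit / per_job ))   # shell integer division rounds down
}

max_concurrent 2    # prints 5
max_concurrent 3    # prints 3
```

Requesting 3 slots per job therefore allows 3 jobs at once and leaves one slot unused.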