Toolserver:Job scheduling


 * For the newtask command on Solaris, see batch project (outdated)

Job scheduling is the primary method by which tools should be started on the Toolserver. Jobs (i.e., tools) are submitted to the scheduler, which then starts the job on an appropriate host, based on factors like current load and resources needed. Using batch scheduling means you don't need to worry about where to start a job, or whether the job should be started during off-peak hours, etc. Job scheduling can be used for any sort of tool, whether it is a one-off job, a tool like a bot which needs to run permanently, or a regular job run from cron.

While it's possible to run jobs on a server directly, without using job scheduling, this is strongly discouraged, since it makes it harder for the Toolserver administrators to manage server resources and load.

Introduction
The Toolserver uses Sun Grid Engine 6.2 (SGE) for scheduling jobs.

First you submit a job and specify the resources it will need. If a host has sufficient resources available, your job is started there. If the system is busy and no resources are free, jobs are queued until resources become available. Jobs with a short maximum runtime and low memory requirements are very unlikely to be queued for more than a few minutes, except during maintenance.

If a host crashes or shuts down, all of its jobs will be restarted on other hosts if the host is unavailable for more than an hour.

Unlike in earlier versions of this documentation, you no longer have to choose between different queues; SGE handles this for you. You only have to state the maximum runtime, the expected memory usage and any other resources needed by your job when you submit it.

Submitting jobs
To submit a job, use the qsub or qcronsub command.

qcronsub has exactly the same syntax as qsub. The only difference is that the job is not submitted if a job with the same name is already running or queued. The examples below generally use qcronsub.

$ qcronsub -l h_rt=0:30:00 -l virtual_free=100M $HOME/mytool.py
Your job 80570 ("mytool.py") has been submitted

The scheduler will place the job in the queue, and eventually (probably immediately) run it on a suitable host. Once the job has finished, it will be removed from the system. The -l resource=value arguments request specific resources that the job needs at runtime. The output stream is saved to a file at ~/<jobname>.o<jobid> and the error stream to ~/<jobname>.e<jobid>. If your job wrote nothing to a stream, the corresponding file is removed automatically.

obligatory resources
Resources are requested with the -l argument to qcronsub. You can use this argument multiple times to request multiple resources.


 * h_rt=0:30:00
 * specifies the runtime limit of the job ([dd:]hh:mm:ss), in this case 30 minutes. Set this to the expected maximum runtime; if the job runs any longer, it will be killed. For tools like IRC bots that should never stop you can specify -l h_rt=INFINITY. Wikibots (such as interwiki bots) should not have an unlimited execution time. Even jobs with a long runtime of a week or a month are handled differently from infinite jobs by the grid engine.


 * <tt>virtual_free=50M</tt>
 * specifies the peak memory usage of the job during runtime, in this case 50 megabytes. The maximum memory you can request per job is 1000M. SGE ensures that the jobs scheduled on one host have not requested more than 1G of virtual_free in total (Account limits).
 * Currently this is a soft limit, but jobs using more memory than requested will be killed if a server runs out of memory.

<tt>h_rt</tt> and <tt>virtual_free</tt> are the only resources you must always include in your request.

Currently, jobs that do not request these two resources are still accepted, but this will change in the future. In these cases <tt>-l h_rt=6:00:00 -l virtual_free=100M</tt> is used as the default.

In general you must request all resources that your job needs. If you don't, your job may be scheduled on a host that lacks the resource. All resources you request must be maximum or peak values. E.g. if you run a bot that normally uses about 100 MB of memory but can rise to 500 MB in error cases, you have to request <tt>-l virtual_free=500M</tt>. It is not a problem if your job uses fewer resources than requested (requested resources are not blocked fully exclusively). Jobs that need many resources or have a long runtime get a lower scheduling priority and less CPU time if the system is busy.
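The rule about requesting peak values can be illustrated with a small helper that assembles a qcronsub command line. This is only a sketch in Python: the function name `build_qcronsub_command` and the script path are hypothetical, and the helper merely builds the argument list without submitting anything.

```python
import shlex


def build_qcronsub_command(script, h_rt, virtual_free, extra_resources=None):
    """Return the argument list for a qcronsub call (hypothetical helper).

    h_rt and virtual_free must be the maximum runtime and the peak memory
    usage of the job, not typical values.
    """
    resources = {"h_rt": h_rt, "virtual_free": virtual_free}
    resources.update(extra_resources or {})
    args = ["qcronsub"]
    for name, value in resources.items():
        # Each resource gets its own -l argument.
        args += ["-l", f"{name}={value}"]
    args.append(script)
    return args


# A bot that normally needs about 100M but can peak at 500M requests 500M.
command = build_qcronsub_command("/home/example/mybot.py", "6:00:00", "500M")
print(shlex.join(command))
```

The resulting list could be passed to `subprocess.run` on a submission host; optional resources such as `sql-s1-rr` would go into `extra_resources`.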

optional resources
These resources need only be requested if your job requires them.


 * <tt>sql-s1-user=1</tt>
 * <tt>sql-s2-user=1</tt>
 * <tt>sql-s3-user=1</tt>
 * <tt>sql-s4-user=1</tt>
 * <tt>sql-s5-user=1</tt>
 * <tt>sql-s6-user=1</tt>
 * <tt>sql-s7-user=1</tt>
 * These resources indicate that the job will use a database connection to database cluster 1, 2, 3, 4, 5, 6 or 7, which also contains writable user databases. This corresponds to the SQL server DNS aliases sql-sX-user. If your job needs more than one database connection at the same time, you can request up to 4 connections to one cluster per job. If the database is in read-only mode, this resource is not available.


 * <tt>sql-s1-user-readonly=1</tt>
 * <tt>sql-s2-user-readonly=1</tt>
 * <tt>sql-s3-user-readonly=1</tt>
 * <tt>sql-s4-user-readonly=1</tt>
 * <tt>sql-s5-user-readonly=1</tt>
 * <tt>sql-s6-user-readonly=1</tt>
 * <tt>sql-s7-user-readonly=1</tt>
 * These resources indicate that the job will use a database connection to database cluster 1, 2, 3, 4, 5, 6 or 7, which also contains user databases. This corresponds to the SQL server DNS aliases sql-sX-user. If your job needs more than one database connection at the same time, you can request up to 4 connections to one cluster per job. This resource is available even if the database is in read-only mode. In that case you can only read data or use temporary tables. (Note: temporary tables can only be created in user databases. New user databases cannot be created in read-only mode.)


 * <tt>sql-s1-rr=1</tt>
 * <tt>sql-s2-rr=1</tt>
 * <tt>sql-s3-rr=1</tt>
 * <tt>sql-s4-rr=1</tt>
 * <tt>sql-s5-rr=1</tt>
 * <tt>sql-s6-rr=1</tt>
 * <tt>sql-s7-rr=1</tt>
 * These resources indicate that the job will use a database connection to database cluster 1, 2, 3, 4, 5, 6 or 7. This corresponds to the SQL server DNS aliases sql-sX-rr. If your job needs more than one database connection at the same time, you can request up to 4 connections to one cluster per job. No user databases are available, and you cannot use temporary tables because there might be no database that you have access to.


 * <tt>sql=1</tt>
 * This resource indicates that the job will use a database connection to the user database cluster. This corresponds to the SQL server DNS alias sql. If your job needs more than one database connection at the same time, you can request up to 4 connections to one cluster per job. This resource is only available if the database is not in read-only mode.


 * <tt>sql-toolserver=1</tt>
 * This resource indicates that the job will use a database connection to the database cluster containing the toolserver database. This corresponds to the SQL server DNS alias sql-toolserver. If your job needs more than one database connection at the same time, you can request up to 4 connections to one cluster per job. You only have read access to this server.


 * <tt>sql-mapnik=1</tt>
 * This resource indicates that the job will use a database connection to the OSM database cluster. This corresponds to the PostgreSQL server DNS alias sql-mapnik. If your job needs more than one database connection at the same time, you can request up to 4 connections to one cluster per job. For more information about how to connect to this server, read OpenStreetMap.


 * <tt>temp_free=100M</tt>
 * This resource must be requested if your job needs temporary space. The environment variable $TMPDIR contains the directory you have to use. This location differs for each job. Never assume that it is /tmp.
 * The $TMPDIR directory can generally be accessed at very high speed. For short-running jobs the grid engine may give you a directory that is mapped into RAM instead of onto a hard disk.
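A job can pick up its scratch directory from the environment roughly as follows. This is a minimal Python sketch; the helper names `job_scratch_dir` and `write_scratch_file` are hypothetical, and refusing to run when $TMPDIR is unset is a design choice of this example rather than scheduler behaviour.

```python
import os
import tempfile


def job_scratch_dir():
    """Return the per-job scratch directory provided in $TMPDIR."""
    scratch = os.environ.get("TMPDIR")
    if not scratch:
        # Outside the grid engine $TMPDIR may be unset; do not fall back to /tmp.
        raise RuntimeError("TMPDIR is not set; was the job started by the scheduler?")
    return scratch


def write_scratch_file(data):
    """Write bytes to a new temporary file inside the job's scratch directory."""
    fd, path = tempfile.mkstemp(dir=job_scratch_dir(), suffix=".tmp")
    with os.fdopen(fd, "wb") as handle:
        handle.write(data)
    return path
```

Passing the directory explicitly to `tempfile.mkstemp` avoids any hard-coded path in the script.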


 * <tt>fs-user-store=1</tt>
 * You need to request this resource if your job needs access to the user-store directory, which is always mounted at /mnt/user-store.


 * <tt>user_slot=1</tt>
 * This resource is limited to 10 slots for each user. It has no specific meaning and can be used to limit the number of jobs that a single user executes in parallel.
 * E.g. if you have different scripts that all edit wiki pages and you would like them to run sequentially, so that only one job runs at a time, you can request <tt>-l user_slot=10</tt> for each job. A running job then consumes all ten available user_slots, and all other jobs requesting this resource are queued until the first job has finished.
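The arithmetic behind user_slot can be sketched as follows, assuming every job in the group requests the same number of slots. The function name `max_parallel_jobs` is hypothetical; only the limit of 10 comes from the description above.

```python
USER_SLOTS_PER_USER = 10  # limit stated in the text above


def max_parallel_jobs(slots_per_job):
    """How many jobs requesting `slots_per_job` user_slots can run at once."""
    if not 1 <= slots_per_job <= USER_SLOTS_PER_USER:
        raise ValueError("user_slot request must be between 1 and 10")
    return USER_SLOTS_PER_USER // slots_per_job


for request in (10, 5, 3, 1):
    print(f"-l user_slot={request}: at most {max_parallel_jobs(request)} job(s) in parallel")
```

Requesting 10 slots serializes the jobs, while requesting 5 would allow two of them to run side by side.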

 * <tt>s_rt=0:55:00</tt>
 * Unlike h_rt, this is a soft runtime limit. Your script won't be killed, but a SIGUSR1 signal will be sent to your process.
 * If you want a warning before your job is killed, specify <tt>s_rt</tt> with a value lower than <tt>h_rt</tt>, for example:

$ qsub -l h_rt=1:00:00 -l s_rt=0:55:00 slowjob.py

This will send a SIGUSR1 signal after 55 minutes, which you can catch to perform cleanup before the job ends. After 1 hour, SIGKILL will be sent.
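A job can catch the SIGUSR1 sent at the soft limit roughly as follows. This is a Python sketch under the assumption that the job processes independent work items in a loop; the names `handle_soft_limit` and `run` are hypothetical.

```python
import os
import signal

stop_requested = False


def handle_soft_limit(signum, frame):
    """Remember that the soft limit was reached; the main loop checks this flag."""
    global stop_requested
    stop_requested = True


signal.signal(signal.SIGUSR1, handle_soft_limit)


def run(work_items, process):
    """Process items until done or until SIGUSR1 arrives; return unprocessed items."""
    remaining = list(work_items)
    while remaining and not stop_requested:
        process(remaining.pop(0))
    # Cleanup (closing files, saving state) would go here.
    return remaining
```

Setting a flag in the handler and checking it between work items keeps the handler short and lets the current item finish cleanly before the hard limit is reached.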

Jobs requesting resources other than those specified above are rejected.

arguments to qsub/qcronsub

 * <tt>-N  Jobname</tt>
 * This specifies the name of your job. The name is also used by <tt>qcronsub</tt> to check if a job is already running. By default this will be your script file name.


 * <tt>-m  ae</tt>
 * If you want to receive mail when a job finishes, use <tt>-m e</tt>. To receive mail when a job starts and when it finishes, use <tt>-m be</tt>. The argument can be any combination of:


 * b - Mail is sent at the beginning of the job.
 * e - Mail is sent at the end of the job.
 * a - Mail is sent when the job is aborted or rescheduled.
 * s - Mail is sent when the job is suspended.
 * n - No mail is sent.


 * <tt>-j y</tt>
 * This merges the standard error stream into the standard output stream, so that both streams are written to the same file ~/<jobname>.o<jobid>.


 * <tt>-o  $HOME/mytool.out</tt>
 * If you don't want your standard output stream to be written to a file called ~/<jobname>.o<jobid>, you can define a different file name.


 * <tt>-e  $HOME/mytool.out</tt>
 * If you don't want your standard error stream to be written to a file called ~/<jobname>.e<jobid>, you can define a different file name.


 * <tt>-b y</tt>
 * In this case the script you submit is not copied to the execution server. This is useful for binary files or files that expect to be located in a specific directory. The filename of the script you submit must be an absolute path in this case. The file permissions must be set so that the file is executable by the user.
 * If the file is run by an interpreter, this might be slower because it is read from an NFS drive.


 * <tt>-a  201501010100.00</tt>
 * This specifies that your job will not be executed before the specified time. The format used for datetime values is  [[CC]YY]MMDDhhmm[.SS] . Jobs that are submitted to the grid engine with an execution time more than half an hour in the future get a higher scheduling priority once they become eligible for execution. This might be useful for jobs that must update wiki pages at e.g. midnight.
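A value in the [[CC]YY]MMDDhhmm[.SS] format can be produced from a Python datetime as in the sketch below. The helper name `sge_start_time` is hypothetical; it always emits the full form with century and seconds.

```python
from datetime import datetime


def sge_start_time(when):
    """Format a datetime in the CCYYMMDDhhmm.SS form accepted by -a."""
    return when.strftime("%Y%m%d%H%M.%S")


# 1 January 2015, 01:00:00
print(sge_start_time(datetime(2015, 1, 1, 1, 0, 0)))  # 201501010100.00
```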


 * <tt>-wd  $HOME</tt>
 * This sets the working directory ($PWD) of your script. Scripts using relative paths for reading other files rely on this.

resource definition@script
Rather than specifying arguments to <tt>qcronsub</tt> every time the job is run, you can instead put them in the script itself, using special directives starting with <tt>#$</tt>:

#! /usr/bin/python
#$ -l h_rt=0:30:00
#$ -j y
#$ -o $HOME/mytool.out
... rest of script ...

submit.toolserver.org
We have set up a pair of redundant hosts to act as SGE job submission servers. These work by sharing each user's cronietab between both hosts, and executing jobs on whichever server is working. This avoids the problem where jobs run from cronie on one login server (such as willow) fail to run if that host is down, even when other login servers are available.

To use the new hosts, log into submit.toolserver.org and set up a cronietab (*not* a crontab) as normal.

Note that these hosts are *only* for submitting SGE jobs, not for running tools from cron.

Array Jobs
An array job contains multiple single tasks. Every task runs as a normal job with all the resources requested for the job. So for the grid engine there is no difference between submitting one array job containing ten tasks and submitting ten single jobs.

Array jobs can be very useful if you have to run one job multiple times with different parameters. When a task is started, the environment variable <tt>$SGE_TASK_ID</tt> contains the current task number.


 * <tt>-t 10</tt>
 * Array jobs are submitted using the <tt>-t</tt> parameter. The value represents the number of tasks. You can also specify ranges or more complex values. In this example ten tasks are submitted, with task numbers 1-10.
 * <tt>-tc 5</tt>
 * This is the number of tasks that can be executed at the same time. By default this value is 50. Limiting the number of tasks executed in parallel can be useful if more tasks would cause too many locks on the same resource (e.g. if all tasks are writing heavily to the same file).
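Inside a task, the task number can be used to select that task's parameter, as in the Python sketch below. The helper names are hypothetical, and falling back to task 1 when <tt>$SGE_TASK_ID</tt> is unset or not numeric is an assumption made so that the script also runs outside an array job.

```python
import os


def current_task_id(default=1):
    """Return the array task number from $SGE_TASK_ID."""
    raw = os.environ.get("SGE_TASK_ID", "")
    # Assumption: outside an array job the variable is unset or not numeric.
    return int(raw) if raw.isdigit() else default


def parameter_for_task(parameters):
    """Pick the parameter for this task; task numbers start at 1."""
    return parameters[current_task_id() - 1]
```

With a list of ten parameters and a job submitted with <tt>-t 10</tt>, each task would process exactly one entry.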

Scheduling SQL queries
When writing batch jobs that perform SQL queries, the most important resource is often available SQL capacity rather than CPU or memory. In this case, it is possible to specify that your job needs to run an SQL query on one or more clusters:

#! /bin/sh
#$ -N sqltest
#$ -l sql-s1-rr=1
mysql -h enwiki-p.rrdb -wcBNe 'select count(*) from revision' enwiki_p

The line <tt>#$ -l sql-s1-rr=1</tt> indicates that this script needs one database connection on the sql-s1-rr cluster. If free slots are available, the job will run immediately; otherwise, it will wait for a slot to become available. You can also configure this on the <tt>qcronsub</tt> command line:

% qcronsub -l sql-s1-rr=1 sql.sh

Allowing jobs to be automatically restarted or migrated
By default, when a cluster node crashes or reboots, all jobs on it are terminated and will be restarted. Because it's not always safe to restart a job that was previously running, you can start it as a non-restartable job using <tt>-r n</tt>.

Migration allows jobs to be moved between nodes while they're running, which improves load distribution and results in better performance. Migration relies on checkpointing -- the ability of a job to save its state and resume when restarted.

We do not provide any automatic checkpointing system; if you wish your job to be migrated, you need to implement this yourself. Examples of jobs that are suitable for migration include:
 * Jobs which work by removing work items from a queue and processing them; when migrated, the job just starts from the top of the queue
 * Jobs which are event-based and wait for work to do, e.g. most IRC bots or recentchanges bots
 * Jobs which regularly save their working state and can resume from the saved state if they are restarted

Most jobs in the <tt>longrun</tt> queue are probably suitable for migration, but it is not enabled by default. To mark a job as a checkpointing (migratable) job, start it with the <tt>-ckpt default</tt> argument.
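A job that saves its working state and resumes from it could look roughly like the Python sketch below. The function names and the JSON state file are assumptions of this example, not something the Toolserver provides; the state is written to a temporary file and renamed so that an interruption does not leave a half-written checkpoint.

```python
import json
import os


def load_state(path):
    """Return the saved state, or a fresh one if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    return {"next_index": 0}


def save_state(path, state):
    """Write the state to a temporary file, then rename it over the old one."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)


def run_with_checkpoints(items, process, state_path):
    """Process items in order, checkpointing after each one."""
    state = load_state(state_path)
    for index in range(state["next_index"], len(items)):
        process(items[index])
        state["next_index"] = index + 1
        save_state(state_path, state)
```

If the job is restarted on another host, calling `run_with_checkpoints` again with the same state file continues with the first unprocessed item.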

Best practices
Earlier documentation on this page was different: you had to choose the right queue yourself. This has changed, so that SGE now chooses the best queue instance based on your resource requirements. The script that should be used for submitting jobs from cron has also changed (<tt>qcronsub</tt> instead of <tt>cronsub</tt>).

In general you should follow some rules:


 * Do not request resources that are not specified above, even if they are available. These resources may be deleted at any time. You can ignore additional resources that are added automatically during job submission.
 * Do not use <tt>-clear</tt> as an argument to qsub/qcronsub. This deactivates some JSV scripts which are used to guarantee backward compatibility for some time after the configuration or resources have changed.
 * In general, reading from and writing to NFS file systems can be a bottleneck if you only write those files in very small parts.
 * If these files are only needed temporarily during job runtime, always use $TMPDIR and request the resource <tt>-l temp_free=XXXM</tt> as mentioned above.
 * This can also be useful if you write huge files with little throughput. E.g. if you are (un)compressing files, you can write the result to $TMPDIR and then move it to your home directory or user-store, so that it is written at once.
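The last point can be sketched in Python as follows: compress into $TMPDIR first, then move the finished file to its final location in one step. The function name `compress_via_scratch` and the paths in the usage note are hypothetical, and the job is assumed to have been started by the grid engine so that $TMPDIR is set.

```python
import gzip
import os
import shutil


def compress_via_scratch(source_path, destination_path):
    """Gzip source_path into $TMPDIR, then move the result to destination_path."""
    scratch_dir = os.environ["TMPDIR"]  # set by the grid engine for each job
    scratch_path = os.path.join(scratch_dir, os.path.basename(destination_path))
    with open(source_path, "rb") as src, gzip.open(scratch_path, "wb") as dst:
        # The many small writes go to fast local scratch space, not to NFS.
        shutil.copyfileobj(src, dst)
    # One large sequential copy to the final location.
    shutil.move(scratch_path, destination_path)
    return destination_path
```

A call such as `compress_via_scratch("/home/example/dump.txt", "/mnt/user-store/example/dump.txt.gz")` would then write to user-store only once.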