Toolserver:Job scheduling

From mediawiki.org

This page was moved from the Toolserver wiki.
Toolserver has been replaced by Toolforge. As such, the instructions here may no longer work, but may still be of historical interest.
Please help by updating examples, links, template links, etc. If a page is still relevant, move it to a normal title and leave a redirect.

For the newtask command on Solaris, see batch project (outdated)

Job scheduling is the primary method by which tools should be started on the Toolserver. Jobs (i.e., tools) are submitted to the scheduler, which then starts the job on an appropriate host, based on factors like current load and resources needed. Using batch scheduling means you don't need to worry about where to start a job, or whether the job should be started during off-peak hours, etc. Job scheduling can be used for any sort of tools, whether they're one-off jobs, tools like bots which need to run permanently, or regular jobs run from cron.

While it's possible to run jobs on a server directly, without using job scheduling, this is strongly discouraged, since it makes it harder for the Toolserver administrators to manage server resources and load.

Introduction[edit]

The Toolserver uses Sun Grid Engine 6.2 (SGE) for scheduling jobs.

First you submit a job and specify a list of resources that your jobs will need. If there is a host that has sufficient resources available as requested your job will be started there. If the system is busy, there might be no free resources, and jobs will be queued until more resources become available. Jobs having a short maximum runtime and needing less memory are very unlikely to be queued for more than just few minutes in this way, except if there is some maintenance.

If a host crashes or shuts down and is unavailable for more than one hour all jobs will be migrated to other hosts.

Different to former documentation you don't have to care about different queues. SGE will handle this for you. You only have to add some information about maximum runtime, expected memory usage and other resources needed by your job on submit.

Submitting jobs[edit]

To submit a job, use the qsub or qcronsub command.

qcronsub has the same syntax as qsub with the difference that jobs with the same name are not resubmitted if already running or queued. All examples below use qcronsub.

$ qcronsub -l h_rt=0:30:00 -l virtual_free=100M "$HOME/mytool.py --my-arguments"
Your job 80570 ("mytool.py") has been submitted

The scheduler will place the job in the queue, and eventually (probably immediately) run it on a suitable host. Once the job has finished, it will be removed from the system. -l resource=value arguments request specific resources that this job needs during runtime. The output stream will be saved in a files located at ~/<jobname>.o<job_id> and the error steam to '~/<jobname>.e<job_id>. If nothing was sent to these streams by your job the files will be automatically removed.

Mandatory resources[edit]

Resources are added by using -l argument to qcronsub. You can use this argument multiple times for requesting multiple resources.

  • h_rt=0:30:00
    specifies the runtime limit of the job (hh:mm:ss), in this case 30 minutes. You should set this to the expected maximum runtime; if the job runs any longer, it will be killed. For tools like irc bots that should never stop you can specify -l h_rt=INFINITY. Wikibots (like interwikibots) should not have a unlimited execution time. Even jobs with a long runtime of a week or a month are in principle handled differently than infinite jobs by grid engine, although long jobs (more than ~200 hours currently, from anecdotal evidence) will probably end up in the longrun queue on willow, which has less resources. Note that changing the job runtime after submition is not allowed, otherwise you could have your job running at a queue with very high CPU priority forever; therefore you need to be careful and provide a limit that comfortably fits the runtime of your job.
  • virtual_free=50M
    specifies the peak memory usage of the job during runtime, in this case 50 Megabyte. The maximum available memory you can request per job is 1000M per host. SGE will ensure that all jobs scheduled on one host have not requested more than 1G virtual_free in sum (Account limits#Memory), so if the new job would bring you above 1G on host A the job will be sent to another host even if A has enough free memory.
    Currently this is a soft limit. But jobs using more memory than requested will be killed if a server runs out of memory.
  • arch=sol
    specifies the operating system to use
    • '*' – any available operating system (recommended)
    • lx – linux servers (currently yarrow and nightshade)
    • sol – solaris servers (currently wolfsbane, ortelius, willow, clematis and hawthorn) (default)

h_rt and virtual_free are the only resources you always must add to your request.

Currently also jobs not requesting these two resources are accepted but this will change in future. In these cases -l h_rt=6:00:00 -l virtual_free=100M is used as default values. See job statistics section for viewing the resources used in past jobs (which you can use estimate future ones).

In general you must request all resources that your job needs. If you don't, your job may be scheduled on a host not having this resource. All resources you request must be maximum or peak values.

E.g. if you run a bot that normally has a memory usage of about 100Mb but in error cases this could raise to 500Mb then you have to request -l virtual_free=500M. It is not a problem if your job uses less resources than requested (requested resources aren't blocked fully exclusively). Jobs needing much resources or having a long runtime have a lower priority for scheduling and get less CPU time when the system is busy.

Optional resources[edit]

These resources must only be requested if they are needed by your jobs.

  • sql-s1-user=1
  • sql-s2-user=1
  • sql-s3-user=1
  • sql-s4-user=1
  • sql-s5-user=1
  • sql-s6-user=1
  • sql-s7-user=1
    These resources indicate that the job will use a database connection to database cluster 1/2/3/4/5/6 or 7 which also contains writable user databases. This correspond to the sql server dns aliases sql-sX-user. If your jobs needs more than one db connection at the same time you can request up to 4 connection to one cluster for a job. If the database is in read-only modus this resource is not available.
  • sql-s1-user-readonly=1
  • sql-s2-user-readonly=1
  • sql-s3-user-readonly=1
  • sql-s4-user-readonly=1
  • sql-s5-user-readonly=1
  • sql-s6-user-readonly=1
  • sql-s7-user-readonly=1
    These resources indicate that the job will use a database connection to database cluster 1/2/3/4/5/6 or 7 which also contains user databases. This correspond to the sql server dns aliases sql-sX-user. If your jobs needs more than one db connection at the same time you can request up to 4 connection to one cluster for a job. This resource is available even if the database is in read-only modus. In this case you can only read data or use temporary tables (Note: temporary tables can only be created on user databases. New user databases cannot be created on read-only modus.)
  • sql-s1-rr=1
  • sql-s2-rr=1
  • sql-s3-rr=1
  • sql-s4-rr=1
  • sql-s5-rr=1
  • sql-s6-rr=1
  • sql-s7-rr=1
    These resources indicate that the job will use a database connection to database cluster 1/2/3/4/5/6 or 7. This correspond to the SQl server DNS aliases sql-sX-rr. If your jobs needs more than one db connection at the same time you can request up to 4 connection to one cluster for a job. No user databases are available and you cannot use temporary tables because there might be not database you have access to.
  • sql-user-a=1
  • sql-user-b=1
  • sql-user-y=1
  • sql-user-z=1
    These resources indicate that the job will use a database connection to the user database cluster. This correspond to the sql server dns aliases sql-user-X. If your jobs needs more than one db connection at the same time you can request up to 4 connection to one cluster for a job. This resource is only available if the database is not in read-only modus.
  • sql-toolserver=1
    These resources indicate that the job will use a database connection to the database cluster containing the toolserver database. This correspond to the sql server dns aliases sql-toolserver. If your jobs needs more than one db connection at the same time you can request up to 4 connection to one cluster for a job. You only have read access to this server.
  • sql-mapnik=1
    These resources indicate that the job will use a database connection to the OSM database cluster. This correspond to the PostgreSQL server DNS aliases sql-mapnik. If your jobs needs more than one db connection at the same time you can request up to 4 connection to one cluster for a job. For more information about how connection to this server read OpenStreetMap#first_steps.
  • temp_free=100M
    This resource must be used if your job needs temporary space. The environment variable $TMPDIR contains the directory you have to use. This location differs you each job. Please never suggest that this is /tmp
    The tmpdir directory can be used at very high speed in general. For short running jobs grid engine may give you a directory that is mapped into RAM instead of HardDisk.
  • fs-user-store=1
    You need to request this resource if you job needs access to the user-store directory which is always mounted at /mnt/user-store.
  • user_slot=1
    This resource is limit to 10 slots for each user. It has no specific meaning and can be used for limiting the number of job that are executed in parallel by a single user.
    E.g. if you have different scripts that all edit wiki pages and you would like to have them run sequential, so that only one job runs at the same time, you can request -l user_slot=10 for each job. If one job is running it consumes all available ten user_slots and all other job requesting this resources are queued until the first job has finished.
  • s_rt=1
    Differently to h_rt this is a soft runtime limit. Your script won't be kill, but a SIGUSR1 signal will be send to your process.
    If you want a warning before your job is killed, specify s_rt with a value lower than h_rt, for example:
$ qsub -l h_rt=1:00:00 -l s_rt=0:55:00 slowjob.py
This will send a SIGUSR1 signal after 55 minutes, which you can catch to perform cleanup before the job ends. After 1 hour, SIGKILL will be sent.

Job requesting other resources than specified above are rejected.

arguments to qsub/qcronsub[edit]

  • -N Jobname
    This specifies the name of your job. The name is also used by qcronsub to check if a job is already running. By default this will be your script file name.
  • -m ae
If you want to receive mail when a job finishes, use -m e. To receive mail when a job starts and when it finishes, use -m be. The argument can my any combination of
  • b - Mail is sent at the beginning of the job.
  • e - Mail is sent at the end of the job.
  • a - Mail is sent when the job is aborted or rescheduled.
  • s - Mail is sent when the job is suspended.
  • n - No mail is sent.
  • -j y
    This merges the standard error stream into the standard output stream, so that both streams are written to the same file ~/<jobname>.o<job_id>.
  • -o $HOME/mytool.out
    If you don't want that your standard output stream is written to a file called ~/<jobname>.o<job_id> you can define a different file name.
  • -e $HOME/mytool.out
    If you don't want that your standard error stream is written to a file called ~/<jobname>.e<job_id> you can define a different file name.
  • -b y
    In this case the script you submit is not copied to the execution server. This is useful for binary files or files that expect to be placed in a specific directory. The filename of the script you submit must be a absolute path in this case. The file permission must be set, so that it is executable by the user.
    If the file is run by an interpreter, this might be slower because it's read from a NFS drive.
  • -a 20150101010000
    This specifies that your job should not be executed until the specified time is elapsed. The format used as datetime values is [[CC]YY]MMDDhhmm[.SS]. Jobs that are submitted to the grid engine and have an execution time that is more than half an hour in the future, will have a higher priority to be scheduled after it is eligible for execution. This might be useful for jobs that must update wiki pages at e.g midnight.
  • -wd PATH
  • This sets the working directory ($PWD) of your script. Scripts using relative paths for reading other files rely on this. You can't use "$HOME" here.

resource definition@script[edit]

Rather than specifying arguments to qcronsub every time the job is run, you can instead put them in the script itself, using special directives starting with #$:

#! /usr/bin/python
#$ -l h_rt=0:30:00
#$ -j y
#$ -o $HOME/mytool.out
... rest of script ...

submit.toolserver.org[edit]

We have set up a pair of redundant hosts to act as SGE job submission servers. These work by sharing each user's cronietab between both hosts, and executing jobs on whichever server is working. This avoids that problem where jobs run from cronie on one login server (such as willow) will fail to run if that host is down, even when other login servers are available.

To use the new hosts, log into submit.toolserver.org and set up a cronietab (not a crontab) as normal.

Note that these hosts are only for submitting SGE jobs, not for running tools from cron.

Array Jobs[edit]

Array jobs contains multiple single tasks. Every task runs as a normal job using all resources as requested to the job. So for grid engine there is no difference if you submit one array job containing ten tasks or ten single jobs.

Array jobs can be very useful if you have to run one job multiple times with different parameters. When the task is started the environment variable $SGE_TASK_ID contains the current task number.

  • -t 10
    Array jobs are submitted using -t parameter. The value represents the number of tasks. You can also specify ranges or more complex values. In this example ten tasks are submitted containing task number 1-10.
  • -tc 5
    This is the number of tasks that can be executed at the same time. By default, this value is 50. Limiting the number of tasks executed in parallel can be useful if more tasks would cause many locks on the same resource. (e.g. if all tasks are massively written to the same file)

Examples[edit]

Scheduling SQL queries[edit]

When writing batch jobs that perform SQL queries, the most important resource is often available SQL capacity rather than CPU or memory. In this case, it is possible to specify that your job needs to run a SQL query on one or more clusters:

#! /bin/sh
#$ -N sqltest
#$ -l sql-s1-rr=1

mysql -h enwiki-p.rrdb -wcBNe 'select count(*) from revision' enwiki_p

The line #$ -l sql-s1-rr=1 indicates that this script needs 1 one database connection on the sql-s1-rr cluster. If free slots are available, the job will run immediately; otherwise, it will wait for a slot to become available. You can also configure this on the qcronsub command line:

% qcronsub -l sqlprocs-s1=1 sql.sh

Running interwiki bots from cron[edit]

Let's assume you would like to run an interwiki bot[1] that should work on the last 100 recent changes of enwiki. Your directory containing the SVN snapshot is /home/username/pywikipediabot. The expected maximum runtime is 3 days. Pywikipedia interwiki bots running too long have a memory problem and scripts should be updated regular from svn.

*/10 * * * * qcronsub -b y -wd /home/username/pywikipediabot -l h_rt=72:0:0 -l virtual_free=500M /home/username/pywikipediabot/interwiki.py -lang:en -auto -cleanup -quiet -ns:0 -recentchanges:100

Allowing jobs to be automatically restarted or migrated[edit]

By default, when a cluster node crashes or reboots, all jobs on it are terminated and will be restarted. Because it's not always safe to restart a job that was previously running, you can start it as a non restartable job using -r n .

Migration allows jobs to be moved between nodes while they're running, which improves load distribution and results in better performance. Migration relies on check-pointing -- the ability of a job to save its state and resume when restarted.

We do not provide any automatic check-pointing system; if you wish your job to be migrated, you need to implement this yourself. Examples of jobs that are suitable for migration include:

  • Jobs which work by removing work items from a queue and processing them; when migrated, the job just starts from the top of the queue
  • Jobs which are event-based and wait for work to do, e.g. most IRC bots or recentchanges bots
  • Jobs which regularly save their working state and can resume from the saved state if they are restarted

Most jobs in the longrun queue are probably suitable for migration, but it is not be enabled by default. To mark a job as a checkpointing (migratable) job, start it with the -ckpt default argument.

Best practices[edit]

Former documentation on this page was different. There you had to choose the right queue by yourself. Now this has changed, so that SGE will choose the best queue instance based on your resource requirements. Also the script that should be used for submitting jobs from cron has changed (qcronsub instead of cronsub)

But in general you should follow some rules:

  • Do not request resources that are not specified above even if they are available. These resources may be deleted at any time. Don't care about additional resources added automatically during job submission.
  • Do not use -clear as argument for qsub/qcronsub. This deactivates some jsv scripts which are used to guarantee backward comparability for some time after configuration/resources may be changed.
  • In general reading and writing from and to NFS files system can be a bottleneck if you only write those files in very small parts.
    If these files are only temporary during job runtime always use $TMPDIR and request the resources -l tmp_free=XXXM as mentioned above.
    This can be also useful if you write huge files with little throughput. E.g. if your are (un)compressing files you can write the result to $TMPDIR and move them to your home directory or user-sore, so that they are written at once.

Managing jobs[edit]

To list all your running jobs, use qstat:

job-ID  prior   name        user         state submit/start at     queue           slots ja-task-ID 
---------------------------------------------------------------------------------------------------
  80576 0.56000 mytool.py   rriver       r     11/17/2010 08:16:10 all.q@wolfsbane     1

This indicates that the job is running (r) in all.q on wolfsbane.

State Description
r running
qw queued/waiting
d deleted
E Error

Using qstat -j {job-ID} you can get more information on a particular job.

Deleting jobs[edit]

To delete jobs, use the qdel command:

% qdel <job_id>

If the job is currently running (rather than queued), this will terminate it.

Suspending and unsuspending jobs[edit]

Suspending a job allows it to be temporarily paused, and then resumed later. To suspend a job:

% qmod -sj <job_id>

The job will be paused by sending it SIGSTOP, and will have the 's' state in qstat.

To unsuspend the job and let it continue running:

% qmod -usj <job_id>

Server usage and job statistics[edit]

In order to be able to properly estimate the resources needed by a job, use:

% qacct -j <job_id> -d <days> [-o <owner>]

Statistics about the jobs started during the last d days will be displayed.

Advanced features[edit]

Sun Grid Engine has several more advanced features, such as array jobs (automatically submitting the same job many times with different arguments), and job dependencies (specifying that a job cannot run until a different job has completed). For more information on these, see the SGE documentation.

Grid status[edit]

An overview of running jobs on the Toolserver grid is available here.

  1. Assuming you were allowed to do so. A bot needing this setup would likely violate the rule about continuous interwikibots.