Toolserver:Jobserver

The jobserver is a centralised system that allows long-running jobs (i.e. tools) to be started easily and restarted on reboot or after a crash.

The jobserver only runs on Solaris, so it's only available on willow (not nightshade).

Simple usage
Start a new job:

$ job add -e $HOME/myjob.pl
New job ID is 3.
$ job list
ID NAME        USER             STATE      RSTATE     CMD
 3 myjob.pl    jsmith           enabled    running    /home/jsmith/myjob.pl

Stop a running job:

$ job disable 3
$ job list
ID NAME        USER             STATE      RSTATE     CMD
 3 myjob.pl    jsmith           disabled   stopped    /home/jsmith/myjob.pl

Delete a job:

$ job delete 3
$ job list
No jobs defined.

If the job command contains spaces, you will need to give the job a name:

$ job add -n myjob $HOME/myjob.pl

Otherwise, the default job name will be the command with everything up to the last slash removed.
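The default-name rule can be reproduced with shell parameter expansion, which helps predict the name a job will receive. This is only a sketch of the rule as described above; `default_job_name` is a hypothetical helper and not part of the job command.

```shell
# Hypothetical helper: strip everything up to and including the last slash,
# mirroring the default job name rule described above.
default_job_name() {
    printf '%s\n' "${1##*/}"
}

default_job_name /home/jsmith/myjob.pl   # prints "myjob.pl"
```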

By default, the job server will restart the job on reboot. If it exits or crashes, it will not be restarted. Currently this behaviour cannot be configured.

A crash is defined as the job, or any process it starts, dumping core or exiting with a crash signal (e.g. SIGSEGV). If the first job process exits, the job server will stop the entire job.
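When testing a tool by hand before adding it as a job, the shell's exit status shows whether the tool ended on a signal, which is the kind of exit treated as a crash above. The sketch below relies on the common shell convention that a status above 128 means termination by signal; `describe_exit` is a hypothetical helper, and how the jobserver itself detects crashes is not shown here.

```shell
# Classify a shell exit status. By shell convention, a status above 128
# means the command was terminated by signal number (status - 128).
describe_exit() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "exited normally"
    elif [ "$status" -gt 128 ]; then
        # kill -l translates an exit status into the matching signal name.
        echo "killed by signal $(kill -l "$status")"
    else
        echo "exited with status $status"
    fi
}
```

A typical use would be to run the tool in the foreground and then call `describe_exit $?`.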

Stopping jobs
By default, the jobserver will stop a job by sending SIGTERM, waiting 30 seconds, then sending SIGKILL to any remaining processes. Most tools will exit sensibly when given SIGTERM, but some might require special handling. For these jobs, you can set a stop method:

$ job set 3 stop=$HOME/stop_tool.pl

The jobserver will execute your stop method instead of sending a signal. It will still send SIGKILL after 30 seconds if the job hasn't exited.
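For a tool written as a shell script, exiting sensibly on SIGTERM usually means trapping the signal, finishing the current unit of work, and cleaning up well within the 30-second window. The following is a minimal sketch under that assumption; `main_loop`, `cleanup`, and the flag variables are hypothetical names, not something the jobserver requires.

```shell
#!/bin/sh
# Hypothetical long-running tool that shuts down cleanly on SIGTERM.
stopping=0
cleaned_up=0

# On SIGTERM, only set a flag; the main loop finishes its current step first.
trap 'stopping=1' TERM

cleanup() {
    # Placeholder for real work: flush buffers, close connections, remove lock files.
    cleaned_up=1
}

main_loop() {
    while [ "$stopping" -eq 0 ]; do
        # One unit of work would go here; keep each step short so that
        # the loop notices the flag long before SIGKILL arrives.
        sleep 1
    done
    cleanup
}
```

A real tool would end the script with a call to `main_loop`. Setting a flag in the trap, rather than exiting immediately, lets the current step complete before cleanup runs.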

To remove a stop method:

$ job set 3 stop=

Scheduled jobs
Instead of having a job that runs continuously, you can create a scheduled job that runs once at scheduled intervals. The jobserver will not automatically restart these jobs once they exit.

To add a new scheduled job:

$ job add -S "every monday at 03:00" $HOME/test.sh
New job ID is 0.
$ job list
ID NAME        USER             STATE      RSTATE     CMD
 0 test.sh     river            scheduled  stopped    /home/river/test.sh

To schedule an existing job:

$ job schedule 4 "every minute"

To unschedule a job:

$ job unschedule 4

Note that since the schedule specification contains spaces, it must be enclosed in quotes.

The following are valid schedule specifications:

spec                             example
================================ =====================
every minute
every hour at <minute>           every hour at 15
every day at <hour>:<minute>     every day at 03:00
every <day> at <hour>:<minute>   every sunday at 03:00

Unlike cron, the jobserver will never start two copies of the same job. If a job is already running when its scheduled start time comes around again, nothing will happen.
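For comparison, a cron job has to guard itself against overlapping runs. A minimal sketch of such a guard is shown below, using an atomic `mkdir` as a lock. The jobserver makes this unnecessary, and its own mechanism is not documented here; `run_exclusive` and the lock path are hypothetical.

```shell
# Run a command only if no earlier run still holds the lock.
# mkdir is atomic, so two overlapping runs cannot both acquire it.
run_exclusive() {
    lock_dir=$1
    shift
    if ! mkdir "$lock_dir" 2>/dev/null; then
        echo "already running, skipping"
        return 0
    fi
    "$@"
    status=$?
    rmdir "$lock_dir"
    return "$status"
}
```

Under cron this might be called as `run_exclusive "$HOME/.test.lock" "$HOME/test.sh"`. If the command is killed before `rmdir` runs, the lock directory is left behind and has to be removed by hand.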

If you want a scheduled job to run immediately, you can start it with the job start command. This will not affect its schedule, but if the next scheduled time comes around and it's still running, it won't be started again.

Maintenance state
If a continuous job (not a scheduled job) fails to stay running, it is not restarted but placed in the maintenance state. Once in this state, it will not be restarted, even on reboot, until the error is cleared.

$ job add -e $HOME/test.sh
New job ID is 2.
$ job list
ID NAME     USER     STATE     RSTATE    CMD
 2 test.sh  rriver   enabled   running   /home/rriver/test.sh
$ job list
ID NAME     USER     STATE         RSTATE    CMD
 2 test.sh  rriver   maintenance   stopped   /home/rriver/test.sh

If a job is placed in the maintenance state, you should fix whatever problem kept it from working, then clear the error:

$ job clear 2
$ job list
ID NAME     USER     STATE     RSTATE    CMD
 2 test.sh  rriver   enabled   running   /home/rriver/test.sh

Currently, a job is placed in the maintenance state if it runs for less than 5 minutes before exiting.

Logs
Output (stdout and stderr) from each job is sent to $HOME/.job/job_.log. The job server will automatically rotate the log file when it reaches 1MB in size.

Environment variables
By default, the job server sets the following standard environment variables for new processes:


 * SHELL
 * LOGNAME
 * USER
 * PATH
 * HOME

If you want to set additional variables, you should create a file called $HOME/.environment, and write NAME=VALUE pairs in it, one per line. The jobserver will read this file when starting your job, and set the appropriate variables.

If you load this file in your shell rc file as well, you can share the same variables between jobs and login sessions.
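One way to load the file from a shell rc file is sketched below for a POSIX-style shell. It assumes the file contains only plain NAME=VALUE lines with no quoting or comments; `load_environment` is a hypothetical function name, and the jobserver's own parsing rules may differ.

```shell
# Export each NAME=VALUE line from an environment file into the current shell.
load_environment() {
    env_file=${1:-$HOME/.environment}
    # Nothing to do if the file is missing or unreadable.
    [ -r "$env_file" ] || return 0
    while IFS= read -r line || [ -n "$line" ]; do
        case $line in
            *=*) export "$line" ;;   # NAME=VALUE pair
            *)   continue ;;         # skip blank or malformed lines
        esac
    done < "$env_file"
}

# Load the default file, $HOME/.environment, if it exists.
load_environment
```

Because the loop reads from a redirect rather than a pipe, the exports take effect in the shell that calls the function.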

Resource limits
Resource limits allow you to limit resources (such as memory and CPU time) used by a job. Setting resource limits allows you to ensure a job doesn't accidentally use more resources than it actually needs, e.g. due to a bug or unexpected input.

Resource controls are set using the job limit command:

$ job list
ID NAME     USER     STATE      RSTATE    CMD
 9 test.sh  rriver   disabled   stopped   /home/rriver/test.sh
$ job limit 9 max-data-size 10485760
$ job limit 9 max-cpu-time 3600
$ job limit 9
max-data-size = 10.00GB
max-cpu-time = 1h
$ job unlimit 9 max-cpu-time
$ job limit 9
max-data-size = 10.00GB

The following resource controls are available:


 * max-file-descriptor : maximum number of file descriptors (e.g. files and network connections) the job can open.
 * max-core-size : (bytes) maximum size of a core dump the job can leave. Set this to 0 to disable core dumps.
 * max-stack-size : (bytes) maximum size of the job's stack.
 * max-data-size : (bytes) maximum amount of memory the job can allocate.
 * max-file-size : (bytes) largest file size the job can create.
 * max-cpu-time : (seconds) maximum amount of CPU time (not wall clock time) the job can consume.

(There are some other less commonly used limits available; see resource_controls(5) for a list.)
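Since the limits above are given as plain byte and second counts, shell arithmetic is a convenient way to work out the numbers. The helper below is a small sketch; `to_bytes` is a hypothetical name, and whether the job command itself accepts unit suffixes is not stated here, so the helper produces plain integers.

```shell
# Convert a size with a binary unit suffix (K, M or G) to a byte count.
to_bytes() {
    value=$1
    unit=$2
    case $unit in
        K) echo $((value * 1024)) ;;
        M) echo $((value * 1024 * 1024)) ;;
        G) echo $((value * 1024 * 1024 * 1024)) ;;
        *) echo "unknown unit: $unit" >&2; return 1 ;;
    esac
}

to_bytes 512 M           # prints 536870912
to_bytes 2 G             # prints 2147483648
echo $((2 * 60 * 60))    # two hours of CPU time in seconds: prints 7200
```

The result can then be passed as the value of a limit, for example `job limit 9 max-data-size "$(to_bytes 512 M)"`.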

Projects
It is possible to configure which project a job will start in; this is mostly useful for starting jobs in the batch project (which is preferred to using nice on Solaris):

$ job set 9 project=batch

TODO / feature requests
Add more features here, if you want.


 * Allow user to specify action on exit/crash/fail (github #9)
 * Allow log rotation behaviour to be configured (github #10)
 * ACLs (for MMTs) (github #11)
 * Distributed Jobserver: start jobs across an array of machines
 * A way to limit the max wall clock time of a scheduled job (github #12)
 * A way to see the upcoming jobs in a specified time period (list all jobs set to run in the next day or next week) (github #8)

Bugs
Report any of those here.