Toolserver:Jobserver

The jobserver is a centralised system that makes it easy to start long-running jobs (i.e. tools) and to have them restarted on reboot or if they crash.

The jobserver only runs on Solaris, so it's only available on willow (not nightshade).

Simple usage
Start a new job:

$ job add -e $HOME/myjob.pl
New job ID is 3.
$ job list
ID NAME        USER             STATE      RSTATE     CMD
 3 myjob.pl    jsmith           enabled    running    /home/jsmith/myjob.pl

Stop a running job:

$ job disable 3
$ job list
ID NAME        USER             STATE      RSTATE     CMD
 3 myjob.pl    jsmith           disabled   stopped    /home/jsmith/myjob.pl

Delete a job:

$ job delete 3
$ job list
No jobs defined.
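Since `job disable` and `job delete` take a numeric ID, it can be handy to look the ID up by name. The following is a minimal sketch that assumes `job list` keeps the whitespace-separated column layout shown above (ID first, NAME second); the function name `job_id_by_name` is made up for this example and is not part of the jobserver.

```shell
#!/bin/sh
# Print the ID of every job whose NAME column matches the first argument.
# Reads `job list` output on standard input.
# Assumption: columns are whitespace-separated, with ID first and NAME second.

job_id_by_name() {
    wanted=$1
    while read -r id name rest; do
        # Skip the header line.
        if [ "$id" = "ID" ]; then
            continue
        fi
        if [ "$name" = "$wanted" ]; then
            echo "$id"
        fi
    done
    return 0
}
```

It would be used as `job list | job_id_by_name myjob.pl`. When no job matches, or when the list is empty ("No jobs defined."), it prints nothing.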

By default, the job server will restart the job on reboot. If it exits or crashes, it will not be restarted. This behaviour cannot currently be configured.

A crash is defined as the job, or any process it starts, dumping core or exiting with a crash signal (e.g. SIGSEGV). If the first job process exits, the job server will stop the entire job.

Stopping jobs
By default, the jobserver will stop a job by sending SIGTERM, waiting 30 seconds, then sending SIGKILL to any remaining processes. Most tools will exit sensibly when given SIGTERM, but some might require special handling. For these jobs, you can set a stop method:

$ job set 3 stop=$HOME/stop_tool.pl

The jobserver will execute your stop method instead of sending a signal. It will still send SIGKILL after 30 seconds if the job hasn't exited.
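This page does not say how a stop method finds the job's processes, so the following is only a sketch under stated assumptions: the tool writes its own PID to a pidfile when it starts (here the made-up path `$HOME/.mytool.pid`), and it treats SIGUSR1 as a request for an orderly shutdown. Neither is something the jobserver provides.

```shell
#!/bin/sh
# Hypothetical stop method. Assumptions (not jobserver features):
#   - the tool writes its PID to $HOME/.mytool.pid when it starts
#   - the tool treats SIGUSR1 as a request for an orderly shutdown

stop_tool() {
    pidfile=$1
    # No pidfile: the tool is presumably not running, so there is nothing to do.
    [ -f "$pidfile" ] || return 0
    pid=$(cat "$pidfile")
    # Ask the tool to shut down; ignore the error if it has already gone.
    kill -USR1 "$pid" 2>/dev/null || true
    rm -f "$pidfile"
    return 0
}

stop_tool "${1:-$HOME/.mytool.pid}"
```

The script only sends the request and returns. If the tool has not exited after 30 seconds, the jobserver's SIGKILL still applies.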

To remove a stop method:

$ job set 3 stop=
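A tool that handles SIGTERM itself does not need a stop method. As a rough sketch of what "exiting sensibly" can look like in a shell job, the loop below traps SIGTERM and finishes cleanly. The names `do_work` and `main_loop` and the 60-second interval are placeholders for this example.

```shell
#!/bin/sh
# Sketch of a long-running job that exits cleanly on SIGTERM.
# do_work stands in for the tool's real work.

do_work() {
    echo "working at $(date)"
}

main_loop() {
    interval=${1:-60}
    running=1
    # On SIGTERM, ask the loop to stop after the current iteration.
    trap 'running=0' TERM
    while [ "$running" -eq 1 ]; do
        do_work
        # Sleep in the background and wait on it, so the shell can run
        # the TERM trap straight away instead of after sleep returns.
        sleep "$interval" &
        sleep_pid=$!
        wait "$sleep_pid" || true
    done
    # Do not leave the background sleep behind.
    kill "$sleep_pid" 2>/dev/null || true
    echo "shutting down cleanly"
    return 0
}

# In a real job script, end with a call such as:
#   main_loop 60
```

Waiting on a background `sleep` matters here: a shell defers trap handlers while a foreground command is running, so a plain `sleep 60` could delay shutdown past the 30-second SIGKILL deadline.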

Scheduled jobs
Instead of having a job that runs continuously, you can create a scheduled job that runs once at each scheduled time. The jobserver will not automatically restart these jobs once they exit.

To add a new scheduled job:

$ job add -S "every monday at 03:00" $HOME/test.sh
New job ID is 0.
$ job list
ID NAME        USER             STATE      RSTATE     CMD
 0 test.sh     river            scheduled  stopped    /home/river/test.sh

To schedule an existing job:

$ job schedule 4 "every minute"

To unschedule a job:

$ job unschedule 4

Note that because the schedule specification contains spaces, it must be quoted so that the shell passes it to job as a single argument.

The following are valid schedule specifications:

spec                                example
==================================  =====================
every minute
every hour at <minute>              every hour at 15
every day at <hour>:<minute>        every day at 03:00
every <weekday> at <hour>:<minute>  every sunday at 03:00
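To catch typos before calling `job schedule`, a specification can be checked against the forms listed above. The following is a rough sketch and not the jobserver's own parser. It assumes lowercase English weekday names, 24-hour HH:MM times and minutes from 0 to 59, which is what the examples on this page suggest; the function names are made up for this example.

```shell
#!/bin/sh
# Rough checker for the schedule forms listed above.
# Not the jobserver's parser; weekday names and value ranges are assumptions.

valid_minute() {
    case $1 in
        [0-9]|[0-5][0-9]) return 0 ;;
        *) return 1 ;;
    esac
}

valid_time() {
    case $1 in
        [01][0-9]:[0-5][0-9]|2[0-3]:[0-5][0-9]) return 0 ;;
        *) return 1 ;;
    esac
}

valid_weekday() {
    case $1 in
        monday|tuesday|wednesday|thursday|friday|saturday|sunday) return 0 ;;
        *) return 1 ;;
    esac
}

is_valid_schedule() {
    # Split the specification into words, with globbing switched off
    # so that characters such as * are not expanded.
    set -f
    set -- $1
    set +f
    [ "${1:-}" = "every" ] || return 1
    case $# in
        2)  [ "$2" = "minute" ] ;;
        4)  [ "$3" = "at" ] || return 1
            case $2 in
                hour) valid_minute "$4" ;;
                day)  valid_time "$4" ;;
                *)    valid_weekday "$2" && valid_time "$4" ;;
            esac ;;
        *)  return 1 ;;
    esac
}
```

The function returns success for a recognised form and failure otherwise, so it can guard a `job schedule` call in an `if` statement.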

Unlike cron, the jobserver will never start two copies of the same job. If a job is already running when its scheduled start time comes around again, nothing will happen.

Logs
Output (stdout and stderr) from each job is sent to $HOME/.job/job_.log. The job server will automatically rotate the log file when it reaches 1MB in size.

TODO / feature requests
Add more features here, if you want.


 * Allow user to specify action on exit/crash/fail
 * Allow log rotation behaviour to be configured
 * ACLs (for MMTs)
 * Provide a way to set the environment for started jobs (or load shell rc file?)
 * Distributed Jobserver: start jobs across an array of machines

Bugs
Report any of those here.