Facebook Open Academy/Cron

A common requirement in infrastructure maintenance is the ability to execute tasks at scheduled times and intervals. On Unix systems (and, by extension, Linux) this is traditionally handled by a cron daemon. A traditional cron daemon, however, runs on a single server, so it does not scale and is a single point of failure. While there are a few open source alternatives to cron that provide distributed scheduling, they either depend on a specific "cloud" management system or on other complex external dependencies, or are not generally compatible with cron.

Requirements
Wikimedia Labs needs a scheduler that:


 * Is configurable by traditional crontabs;
 * Can run on more than one server, distributing execution between them; and
 * Guarantees that scheduled events execute as long as at least one server is operational.

The ideal distributed cron replacement would have as few external dependencies as possible.

Research
Some interesting avenues of investigation, along with possible alternatives and counterarguments, have already been mentioned in the related Bugzilla entry (which see).

What are the current solutions that exist and what lessons can be learned from them?
 * Chronos
   * Supports dependencies between jobs
   * Will retry failed jobs
   * One of multiple nodes is elected master
   * Has many dependencies, including Apache Mesos and Zookeeper
 * Cronie
   * If I understand the man page correctly:
     * It only allows jobs to be executed on one chosen server at a time
     * The chosen server must be switched manually if it goes down
     * Requires a network-mounted share for the directory containing the shared crontabs
 * Jenkins
   * Meant for continuous integration, not job scheduling
   * Only allows one master (single point of failure)
 * Gearman
   * Framework for distributing tasks
   * Has fault tolerance and job retries
   * A scheduler and a worker application would still have to be written against its APIs
   * Seems better suited to executing jobs at arbitrary times than to running them on a schedule

Language to use
Set to Python by fiat for expediency
 * Widely distributed, well known by a large development base
 * High availability of libraries
 * Many (most) Linux distributions default to it for system scripts
 * Python 2 or Python 3? GLM

How to store and distribute the schedule between servers

 * Perhaps using a standby-sparing technique with multiple computers acting as hot spares. JT

How to decide which server actually runs the command when the time comes, and how to synchronize that information

 * Elect a quorum leader? Some other method?
 * Generate a random permutation of the servers for each task and distribute it to every node. Notification that a task has completed is then passed along linearly, in the order of that permutation. GLM
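The permutation idea can be sketched as follows. This is only an illustration under a stated assumption: instead of generating a random permutation and distributing it, each node derives the same pseudo-random order locally by hashing the task identifier together with each server name. The function name `run_order` and the string identifiers are hypothetical, and the sketch is written for Python 3.

```python
import hashlib


def run_order(task_id, servers):
    """Return the order in which servers should attempt a task.

    Assumption: every node knows the same set of server names. Each node
    then computes the same order without any extra communication, because
    the order depends only on the task id and the server names.
    """
    def rank(server):
        # SHA-256 gives a stable value across processes and machines,
        # unlike the built-in hash(), which is salted per process.
        data = "{}|{}".format(task_id, server).encode("utf-8")
        return hashlib.sha256(data).hexdigest()

    # Sorting by the hash yields a pseudo-random but reproducible permutation.
    return sorted(servers, key=rank)


if __name__ == "__main__":
    servers = ["node-a", "node-b", "node-c"]
    # Different tasks generally get different orders, which spreads load.
    for task in ("backup", "logrotate", "report"):
        print(task, run_order(task, servers))
```

The first server in the returned list would run the task; later servers act as fallbacks in that order. Whether a derived order is preferable to a distributed one is an open design question.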

Are libraries available to solve subproblems?

 * Need to make a survey of:
   * Python's Distributed Wiki FC
   * dispy JT
   * Pyro JT

How late can a job run?

 * It will take some time for the servers to communicate that they have done a job. If we are waiting on timeouts and a server is scheduled last, it can be a while before that server runs the job. GLM
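A rough way to bound the lateness, assuming a scheme in which servers attempt a job in a fixed order and each one waits a full timeout for every server ahead of it before taking over. Both the scheme and the names `worst_case_delay` and `start_delays` are assumptions made for illustration (Python 3).

```python
def worst_case_delay(position, timeout_seconds):
    """Seconds a server waits before running the job itself.

    `position` is the server's 0-based place in the order. The server
    only runs the job after every earlier server has failed to report
    completion within `timeout_seconds`.
    """
    if position < 0 or timeout_seconds < 0:
        raise ValueError("position and timeout must be non-negative")
    return position * timeout_seconds


def start_delays(order, timeout_seconds):
    """Map each server in `order` to its worst-case start delay."""
    return {server: worst_case_delay(i, timeout_seconds)
            for i, server in enumerate(order)}


if __name__ == "__main__":
    # With five servers and a 30 second timeout, the last server
    # starts up to 4 * 30 = 120 seconds after the scheduled time.
    print(start_delays(["s1", "s2", "s3", "s4", "s5"], 30))
```

Under these assumptions the worst-case lateness grows linearly with the number of servers, so the timeout has to be chosen with the cluster size in mind.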

How bad is it if a deleted job still gets run?

 * If a server gets isolated and the crontab is then updated, it will keep running every job in the old table for as long as it cannot connect to any of the other servers. GLM

Known TODO

 * Parse crontabs
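As a starting point for the parsing task, here is a minimal sketch in Python 3. It handles only the five numeric time fields (with `*`, lists, ranges and steps) followed by a command. Month and weekday names, `@reboot`-style shortcuts, environment variable lines and per-user fields are deliberately left out. The names `parse_field` and `parse_crontab_line` are hypothetical.

```python
# (field name, lowest value, highest value) for the five time fields.
FIELD_RANGES = [
    ("minute", 0, 59),
    ("hour", 0, 23),
    ("day_of_month", 1, 31),
    ("month", 1, 12),
    ("day_of_week", 0, 7),  # 0 and 7 both mean Sunday
]


def parse_field(text, low, high):
    """Expand one crontab field into the set of integers it matches."""
    values = set()
    for part in text.split(","):
        step = 1
        if "/" in part:
            part, step_text = part.split("/", 1)
            step = int(step_text)
            if step < 1:
                raise ValueError("step must be positive: %r" % text)
        if part == "*":
            start, end = low, high
        elif "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if not (low <= start <= end <= high):
            raise ValueError("value out of range: %r" % text)
        values.update(range(start, end + 1, step))
    return values


def parse_crontab_line(line):
    """Return (schedule, command), or None for blank and comment lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    fields = line.split(None, 5)  # five time fields, then the command
    if len(fields) < 6:
        raise ValueError("expected 5 time fields and a command: %r" % line)
    schedule = {}
    for text, (name, low, high) in zip(fields, FIELD_RANGES):
        schedule[name] = parse_field(text, low, high)
    # Normalise Sunday so that 7 is stored as 0.
    if 7 in schedule["day_of_week"]:
        schedule["day_of_week"].discard(7)
        schedule["day_of_week"].add(0)
    return schedule, fields[5]


if __name__ == "__main__":
    print(parse_crontab_line("*/15 0-6 * * 1,3 /usr/bin/backup --full"))
```

A real implementation would also need to decide how to combine the day-of-month and day-of-week fields, which cron treats specially when both are restricted.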