Facebook Open Academy/Cron

A common requirement in infrastructure maintenance is the ability to execute tasks at scheduled times and intervals. On Unix systems (and, by extension, Linux) this is traditionally handled by a cron daemon. A traditional cron daemon, however, runs on a single server, so it does not scale and is a single point of failure. A few open source alternatives to cron provide distributed scheduling, but they either depend on a specific "cloud" management system or on other complex external dependencies, or are not generally compatible with cron.

Requirements
Wikimedia Labs needs a scheduler that:


 * Is configurable by traditional crontabs;
 * Can run on more than one server, distributing execution between them; and
 * Guarantees that scheduled events execute as long as at least one server is operational.

The ideal distributed cron replacement would have as few external dependencies as possible.

Research
Some interesting avenues of investigation, along with possible alternatives and counterarguments, have already been mentioned in the related Bugzilla report (which see).

What are the current solutions that exist and what lessons can be learned from them?

Language to use
Set to Python by fiat, for expediency:
 * Widely distributed, well known by a large development base
 * High availability of libraries
 * Many (most) Linux distributions default to it for system scripts

How to store and distribute the schedule between servers

 * Perhaps using a standby-sparing technique with multiple computers acting as hot spares. JT
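
One way to picture the standby-sparing idea is the in-memory simulation below. It is only a sketch under stated assumptions: the `Node` class and the `replicate`, `update_schedule` and `fail_over` helpers are hypothetical names, there is no networking, and liveness is a plain flag rather than a real health check. It shows each hot spare holding a full copy of the schedule so that any live spare can take over.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One scheduler host (hypothetical model, no real networking)."""
    name: str
    alive: bool = True
    version: int = 0  # bumped on every schedule change
    schedule: dict = field(default_factory=dict)  # job id -> crontab line


def replicate(primary, spares):
    """Copy the primary's schedule to every live spare that is behind."""
    for spare in spares:
        if spare.alive and spare.version < primary.version:
            spare.schedule = dict(primary.schedule)
            spare.version = primary.version


def update_schedule(primary, spares, job_id, cron_line):
    """Change the schedule on the primary, then push it to the spares."""
    primary.schedule[job_id] = cron_line
    primary.version += 1
    replicate(primary, spares)


def fail_over(nodes):
    """Return the first live node in priority order, or None if all are down."""
    for node in nodes:
        if node.alive:
            return node
    return None
```

A spare that was down during an update stays behind until the next call to `replicate`, so a real implementation would also need a catch-up step when a server rejoins.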

How to decide which server actually runs the command when the time comes, and how to synchronize that information

 * Elect quorum leader? Some other method?
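
As one possible "other method", the sketch below uses rendezvous (highest random weight) hashing, a technique swapped in here for illustration and not something the project has chosen. Every server computes the same score for each (server, job, fire time) triple, so all servers that agree on the set of live servers pick the same runner without exchanging messages. The function name `pick_runner` and its parameters are hypothetical.

```python
import hashlib


def pick_runner(job_id, fire_time, live_servers):
    """Return the server that should run job_id at fire_time.

    Assumes every caller passes the same set of live server names;
    agreeing on that set is a separate problem this sketch does not solve.
    """
    if not live_servers:
        raise ValueError("no live servers")

    def score(server):
        key = f"{server}|{job_id}|{fire_time}".encode()
        return hashlib.sha256(key).hexdigest()

    # Sorting first makes the result independent of input order.
    return max(sorted(live_servers), key=score)
```

If servers disagree about which peers are alive, a job could run twice or not at all, which is the gap a quorum or leader election would have to close.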

Are libraries available to solve subproblems?

 * Need to make a survey of candidate libraries; suggestions so far:
 * dispy JT
 * Pyro JT

Known TODO

 * Parse crontabs
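
As a starting point for the parsing task, here is a minimal sketch of a parser for one crontab line. It assumes the common five-field format (minute, hour, day of month, month, day of week) followed by the command, and it deliberately leaves out environment assignments, `@reboot`-style shortcuts, and month or weekday names. `CronEntry`, `parse_field` and `parse_line` are hypothetical names chosen for the example.

```python
from dataclasses import dataclass

# Allowed (low, high) values for the five time fields, in crontab order.
FIELD_RANGES = [(0, 59), (0, 23), (1, 31), (1, 12), (0, 7)]


@dataclass
class CronEntry:
    minutes: set
    hours: set
    days_of_month: set
    months: set
    days_of_week: set  # 0-6, with 0 meaning Sunday
    command: str


def parse_field(text, low, high):
    """Expand one field such as '*/15', '1-5' or '0,30' into a set of ints."""
    values = set()
    for part in text.split(","):
        rng, _, step_text = part.partition("/")
        step = int(step_text) if step_text else 1
        if rng == "*":
            start, end = low, high
        elif "-" in rng:
            first, last = rng.split("-")
            start, end = int(first), int(last)
        else:
            start = end = int(rng)
        if step < 1 or not (low <= start <= end <= high):
            raise ValueError(f"bad crontab field: {text!r}")
        values.update(range(start, end + 1, step))
    return values


def parse_line(line):
    """Parse one crontab line; return None for blank lines and comments."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    parts = line.split(None, 5)  # five time fields, then the whole command
    if len(parts) < 6:
        raise ValueError(f"expected 5 time fields and a command: {line!r}")
    fields = [parse_field(text, low, high)
              for text, (low, high) in zip(parts[:5], FIELD_RANGES)]
    days_of_week = {day % 7 for day in fields[4]}  # treat 7 as Sunday (0)
    return CronEntry(fields[0], fields[1], fields[2], fields[3],
                     days_of_week, parts[5])
```

For example, `parse_line("*/15 0-6 * * 1-5 /usr/bin/true")` yields an entry whose `minutes` set is `{0, 15, 30, 45}` and whose `days_of_week` set covers Monday to Friday.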