Wikimedia Labs/Tool Labs/Design

Here is the current design / architecture for the Tool Labs, with possible implementation details.

Design requirements
(Roughly in order of relative importance)


 * 1) Reliability
   * Running tools must be trusted to keep running
 * 2) Scalability
 * 3) Simplicity
   * We want new community developers to be able to join the Tool Labs and hit the ground running
 * 4) Minimize impact for tools developed on the toolserver
   * The fewer changes needed to existing tools the better
 * 5) Low maintenance
   * Developers should need minimal sysadmin intervention to do their work – with the objective being no human intervention required to make a new tool.

Use cases
A point to be noted is that many (possibly most) tools fall within more than one of those categories; a continuously-running bot may well have a web interface for status reporting or control, and a web service may well have need to run scheduled jobs. There should be no artificial restriction on "kinds" of tools, or what they can do at need.

Interactive

 * Jobs or tasks started interactively by the tool operator
 * Most one-off tasks fall in that category: whether running a query against the databases, or running an on-demand process to do a single task.

Continuous operation

 * This encompasses most bots and automated editing tools


 * Need "always-up" reliability; automatic restart and notification of the maintainers when it fails.
 * Need a method to reliably assess the status of the tool, start, and stop it.

Scheduled jobs

 * Some classes of bots do jobs on a schedule rather than run continuously; additionally, many other kinds of tools may have scheduled maintenance/operations as part of their normal operation


 * Flexible scheduling
 * Good reporting; it's important to know whether the job succeeded or failed and why

Web services

 * Many tools have end user-facing interfaces (or are entirely web-based)

Architecture
The basic proposed architecture has three fundamental components (four, counting the project database replication which is provided by the Labs infrastructure):

1) Interactive / Development environment
The objective here is to have one (or possibly two, see below) hosts which serve as the login and development environment. Those should be configured as identically to the actual execution nodes as possible, so that new tools can be developed and tested with certainty that they will function in the same manner.

In practice, there will be a few differences:
 * The login host will not be an execution node


 * Development host?

It may end up useful or necessary to have more than one login host, so that heavier processes (compilation, tool test runs) can run somewhere that does not degrade the interactive performance of the primary login host; such a secondary login box would otherwise be kept strictly identical to the primary.

2) Computing grid

 * Why a grid?
 * Reliability.
 * Having multiple identical nodes all capable of running a job or continuous process means that any one of them can be used in case of failure or running out of resources
 * Continuous processes can be restarted on a different node if they fail
 * Resources can be dedicated and reallocated to specific tools at need with fallback to the "common pool" so that they can keep running
 * Scalability.
 * Solving a lack of resources is as simple as adding a node to the grid

The grid node configuration will be maintained to be identical to the development environment so that tools are, by definition, mobile. A node that breaks because of configuration error can be simply taken out of the queues.

Implementation
Fundamentally, any distributed computing / queue management system would do; but my initial leaning is towards the Open Grid Scheduler, the open-source fork of and successor to the Sun Grid Engine with which toolserver users are (or should be) already familiar. It meets all the requirements for interactive, scheduled and continuous jobs, is in wide use in scientific and academic circles, and has a robust community of users and developers.

The general design calls for a queue manager (which can be supplemented with a hot-takeover slave for additional reliability), and an arbitrary number of execution nodes.

Initially, all the execution nodes would be identical and placed on a single queue available to all tools. Tools would simply start their jobs by placing them on the queue, and OGS would ensure each one is dispatched to an appropriate node.

As need develops, we can add nodes dedicated to specific tasks (possibly with different requirements) and place them on different queues; yet the ability to fall back on the default queue would remain in case of node failure.
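Under OGS/SGE, that fallback behaviour can be expressed at submission time with a soft queue request; a sketch only, with illustrative queue and job names:

```shell
# Prefer the dedicated queue, but allow the scheduler to place the job
# elsewhere if that queue is full or down ("-soft" marks the following
# request as a preference rather than a hard requirement).
qsub -soft -q dedicated.q -N heavy-tool ./heavy-tool-job.sh
```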

Wrappers will be provided to hide the implementation details for those tools which have no need to fiddle with specialized settings of the queues (which would be the vast majority of them). I would expect that for the typical case, things should be as simple as:

 tools$ start-continuous the-bot
 tools$ stop-continuous the-bot
 tools$ do-this-thing one-off-job
 tools$ do-that-interactively interactive-thingy
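To make the idea concrete, here is a hypothetical sketch of what such a wrapper might look like internally; the wrapper name, the "continuous" queue, and the qsub options chosen are assumptions, not the actual Tool Labs implementation:

```shell
#!/bin/sh
# Sketch of a "start-continuous"-style wrapper: the tool operator names a
# job, and the wrapper builds the qsub invocation that submits it to a
# continuous queue with restart-on-failure enabled.

QSUB=${QSUB:-qsub}      # overridable so the sketch can be dry-run

start_continuous() {
    job="$1"
    # -N   name the job after the tool
    # -q   target the (assumed) continuous-jobs queue
    # -r y ask the scheduler to restart the job if its node fails
    # -b y treat the command as a binary rather than a job script
    "$QSUB" -N "$job" -q continuous -r y -b y "$job"
}

# Dry run: print the qsub arguments instead of actually submitting.
QSUB=echo
start_continuous the-bot
```

The point of the dry run is just to show that the operator-facing interface stays a single command while all scheduler detail lives in one place.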

Scheduled jobs may still be started by something as simple and well-understood as cron; reliability would not be impacted since the only work the crontab needs to do is enqueue the request (again, a wrapper would be provided for error reporting that would suffice for the vast majority of tools).

Note: the cronsub and qcronsub qsub wrappers support this use case.
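For instance, a crontab entry need only invoke the wrapper; queueing, execution and error reporting all happen on the grid. The schedule and job name here are illustrative:

```shell
# Runs daily at 04:17; cron's only work is to enqueue the job via cronsub.
17 4 * * * cronsub daily-maintenance
```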

3) Web service
The general architecture will function under the same principle as that of the job system and with the same requirements.

I envision an arbitrary number of identical Apaches, each configured so that it could serve requests for every tool. In front of them, a single public reverse proxy distributes incoming requests to the specific web servers behind it, and can trivially redirect them depending on load and/or failure.

One thing to examine is the possibility of using an actual load balancer rather than a simpler reverse proxy setup. Are there many stateful web services that would break in that situation?
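As an illustration only, such a front end could be a stock Apache using mod_proxy_balancer; the host names and pool name below are hypothetical:

```apache
# Hypothetical backend pool; each member is one of the identical Apaches.
<Proxy "balancer://toolsweb">
    BalancerMember "http://webnode-01.example"
    BalancerMember "http://webnode-02.example"
</Proxy>
ProxyPass        "/" "balancer://toolsweb/"
ProxyPassReverse "/" "balancer://toolsweb/"
```

This setup is already a rudimentary load balancer; its sticky-session support (the stickysession parameter) is one possible answer to the stateful-service question above.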