Wikimedia Release Engineering Team/Staging Cluster

(From Chris M's description at http://fab.wmflabs.org/T621)

Over the past two years, a growing number of teams have come to rely on having unreleased software available for use on the shared test environment known as the Beta Cluster. The ways that teams use Beta Cluster have become more sophisticated, more diverse, and more important. Disruption of service or unexpected system behavior on Beta Cluster is therefore quite expensive across all of WMF and the development community.

At the same time, we need a shared, realistic environment where we can test big, risky, system-wide software changes.

A little history
Two years ago Beta Cluster was a hodgepodge of symlinks, old versions of core and some extensions, and not much else. There were no policies at all governing its use. Over time we made it so that beta always ran the latest version of the master branch of code. As beta became more valuable, we instituted one policy there: the only code allowed on Beta Cluster was for extensions and features that already existed in the production cluster. Beta Cluster has never been a "wild west" environment; we have always closely controlled what software runs there. My vision was that Beta Cluster would become the system of record for testing WMF software, and this sort of control is required for Beta Cluster to be such a system.

But as our software matured, it became more feasible to implement large-scale system-wide changes. Beta Cluster was the only reasonable environment in which to test such large-scale changes. So we changed the policy: we would allow non-production software on Beta Cluster only if that software had a scheduled release to the production environment. Our first such project in Beta Cluster was CirrusSearch, the next was the Flow extension, then HHVM after that. Besides these, we have also run experiments on existing Beta Cluster infrastructure, such as changing the Varnish cache and changing the database master-slave relationship.

Every one of these system-wide changes on Beta Cluster has caused significant disruption, downtime, and problems for automated testing.

But that was okay at the time. Beta Cluster has always been primarily a target for automated browser tests, and very few teams used it in any meaningful way early in its history.

That situation has changed dramatically over the last year or so. As we discovered when HHVM testing brought down all of Beta Cluster for extended periods of time, almost all of WMF relies on Beta Cluster in one way or another to do its work, from testing bleeding-edge software changes, to public demonstrations of new features, to community engagement.

So we need test environments to serve the parties addressed by both of our previous Beta Cluster policies. We need to support the latest unreleased changes to software that already exists in production; but we also need a realistic test environment for *unreleased* software with the potential to disrupt the entire cluster.

Proposal
Conservative Beta Cluster vs Liberal Beta Cluster

So I propose two beta-labs-like test environments. These test environments will not differ in the software features they offer, nor in their nominal response to users, but only in the policies governing their use.

Both environments would:


 * be updated automatically with the master branch of core and extensions
 * have their databases updated automatically as they are today
 * be the target of automated browser test builds
 * be available to the public as Beta Cluster is today
 * serve as the system of record for software testing and software changes

Conservative Beta Cluster would follow our original maintenance policy: no software or systems in place other than those already released to production. This would very likely be the existing Beta Cluster, so as to cause as little disruption as possible to the teams already using it. The aim of this system would be to supply as reliable a user experience as possible given that the system will be running the master branch of all production code and nothing other than production code.

Liberal Beta Cluster would follow our amended maintenance policy: software with a projected release to production would be in use here. If it is not in production already, Liberal Beta Cluster would be the place to make it work. Liberal Beta Cluster might host major changes to cross-cutting systems such as


 * Search (affects Core, MobileFrontend, VisualEditor, etc.)
 * System optimizations (HHVM, new caching schemes, major db changes, etc.)
 * New extensions (especially those requiring untested db updates)
 * Any system-wide config change needing testing
 * Potentially, experiments that are not destined for a production release but exist for research purposes
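
The two policies can be read as a simple placement rule. The following is only an illustrative sketch of that rule: the names `Cluster`, `Component`, and `eligible_clusters` are hypothetical and do not correspond to any existing WMF tooling. It assumes that software already in production runs on both clusters, and it treats research-only experiments as an optional case, as the list above suggests.

```python
from dataclasses import dataclass
from enum import Enum


class Cluster(Enum):
    CONSERVATIVE = "conservative"
    LIBERAL = "liberal"


@dataclass(frozen=True)
class Component:
    """Hypothetical description of a piece of software or a system change."""
    name: str
    in_production: bool
    has_scheduled_release: bool = False
    research_only: bool = False


def eligible_clusters(component):
    """Return the set of clusters a component may run on under the proposal."""
    # Original policy: anything already released to production is allowed
    # on Conservative Beta Cluster (assumed here to run on Liberal as well).
    if component.in_production:
        return {Cluster.CONSERVATIVE, Cluster.LIBERAL}
    # Amended policy: unreleased software with a projected production release
    # (or, potentially, a research experiment) belongs on Liberal only.
    if component.has_scheduled_release or component.research_only:
        return {Cluster.LIBERAL}
    # Everything else is allowed on neither cluster.
    return set()


# Example placements, using made-up release states for illustration.
released = Component("AlreadyInProduction", in_production=True)
upcoming = Component("ScheduledForRelease", in_production=False,
                     has_scheduled_release=True)
unplanned = Component("NoReleasePlan", in_production=False)
```

Under these assumptions, `eligible_clusters(upcoming)` yields only the Liberal cluster, while `eligible_clusters(unplanned)` yields no cluster at all.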

Testing HHVM proved that we are stretching the capacity and capability of Beta Cluster beyond what we can reliably support. Having a Conservative Beta Cluster test environment for features bound for production would serve our existing users, while having a Liberal Beta Cluster test environment where we can manipulate the entire system would serve our future users.