Wikimedia Release Engineering Team/CI Futures WG/Report

THIS IS A DRAFT. IT DOES NOT YET RECOMMEND ANYTHING.

= Introduction =

The Wikimedia Foundation (WMF) has had a continuous integration (CI) system for almost a decade, managed and maintained by the Release Engineering (RelEng) team. It consists of several components, such as Gerrit (code review), Zuul (gating), Gearman (distributing jobs to workers), and Jenkins (controlling workers), with virtual machines as build workers.

The CI system works well, but there are several reasons to consider changing it.


 * The Zuul component is a WMF fork of Zuul version 2.5.1, which is no longer supported by upstream. We will need to replace or upgrade this. The current upstream version of Zuul is entirely different, and runs builds via Ansible rather than Jenkins.
 * The CI system requires a fair bit of routine administration by RelEng (FIXME: give an example).
 * In general, our developers do not relish having to deal with CI. It feels strange, cumbersome, and slow to them.

Discussions about the CI system and what it should be have been ongoing within RelEng for at least two years, but no consensus has been reached organically. Additionally, RelEng (and others) are working on a continuous deployment pipeline, part of which will remove the daily SWAT and weekly train deployments. This will make the CI system an even more critical component of the development workflow for the Wikipedia movement.

In late February 2019, Greg (the RelEng manager) tasked a working group with proposing what the future of our CI tooling and software should look like. This is the report of that group.

The working group documented its work at Wikimedia Release Engineering Team/CI Futures WG. For details, see that wiki page.

= Evaluation process =

The working group decided to evaluate various CI software and settled on the following rough process:


 * collect requirements
 * classify the requirements according to importance, with a list of very hard requirements that any candidate needs to fulfill
 * collect a list of candidates
 * evaluate candidates based on requirements, rejecting any candidates that don't fulfill the very hard requirements
 * implement a toy CI project on any candidate that warrants a closer look
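The rejection step in the process above can be sketched as a simple filter. This is only an illustration, not a tool the working group used: the requirement keys are hypothetical short names for the very hard requirements listed in this report, and the candidates and their verdicts are placeholders, not evaluation results.

```python
from dataclasses import dataclass, field

# Hypothetical short keys for the very hard requirements in this report.
HARD_REQUIREMENTS = (
    "self_hostable",
    "open_source",
    "supports_git",
    "easy_to_evaluate",
    "understandable",
    "self_serve",
)


@dataclass
class Candidate:
    name: str
    # Maps requirement key -> True/False, as judged by the evaluators.
    meets: dict = field(default_factory=dict)


def failed_requirements(candidate):
    """Return the hard requirements the candidate does not meet.

    A requirement with no recorded verdict counts as not met.
    """
    return [r for r in HARD_REQUIREMENTS if not candidate.meets.get(r, False)]


def shortlist(candidates):
    """Split candidates into (accepted names, {rejected name: failed requirements})."""
    accepted, rejected = [], {}
    for candidate in candidates:
        failed = failed_requirements(candidate)
        if failed:
            rejected[candidate.name] = failed
        else:
            accepted.append(candidate.name)
    return accepted, rejected


if __name__ == "__main__":
    # Placeholder candidates; the verdicts are invented for the example.
    all_met = {r: True for r in HARD_REQUIREMENTS}
    candidates = [
        Candidate("candidate-a", dict(all_met)),
        Candidate("candidate-b", {**all_met, "self_hostable": False}),
    ]
    accepted, rejected = shortlist(candidates)
    print("warrants a closer look:", accepted)
    print("rejected:", rejected)
```

Treating a missing verdict as a failure keeps the filter conservative: a candidate only moves on to the toy project once every hard requirement has been checked.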

The toy CI project consisted of building the Blubber software. It is written by RelEng itself, in Go, and is the simplest realistic project we can think of. The toy project would not involve deploying the software, only building it from source, and running its unit tests.
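As a rough sketch of what the toy project asks of any candidate, the two steps (build from source, run unit tests) can be written as a short script. The commands shown are the generic Go toolchain invocations and are an assumption; the actual Blubber build may use different commands. The function names are hypothetical.

```python
import subprocess


def toy_ci_steps():
    """Commands for the toy project: build from source, then run unit tests.

    Assumes a standard Go module layout; the real project may differ.
    """
    return [
        ["go", "build", "./..."],
        ["go", "test", "./..."],
    ]


def run_steps(steps, workdir, runner=subprocess.run):
    """Run each step in order inside workdir, stopping at the first failure.

    Returns the list of steps that completed successfully.
    """
    done = []
    for step in steps:
        result = runner(step, cwd=workdir)
        if result.returncode != 0:
            raise RuntimeError(f"step failed: {' '.join(step)}")
        done.append(step)
    return done
```

Calling `run_steps(toy_ci_steps(), "path/to/checkout")` in a checkout of the source would run both steps; there is deliberately no deployment step.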

= Very hard requirements =

The working group settled on the following very hard, non-negotiable requirements:

Must be hostable by the foundation. It's not acceptable to rely on outside services.
WMF wants to host the CI system itself, for various reasons, including the following:


 * we don't want to be dependent on external services
 * it's a core service for our development process, without which we can't do development, so having direct control is beneficial
 * CI builds and deploys software, which has a direct impact on the security of our servers, and using external services would be a questionable choice

Must be free software / open source. "Open core" like GitLab might be good enough.
We would prefer a fully open source, free software version, all other things being equal.

Preferring "open source" is a core value of WMF and the Wikipedia movement (see Wikimedia Foundation Guiding Principles). We have only considered candidates that are open source. However, some of them are "open core", where there are two versions of the software: an open source "community edition", and a proprietary "enterprise edition", where the proprietary version has additional functionality. Depending on the project, an open core approach may mean that the open source version is not fully functional, has worse support, and any external contributions to it may require allowing them to be re-licensed in a closed-source, proprietary manner for the proprietary version.

Open core is sub-optimal, but we've decided that it can be acceptable, depending on the details of how a candidate does open core. The open source version needs to be sufficient for WMF. WMF won't use a proprietary version.

From a pragmatic point of view, if we were to choose an open core candidate for CI, it would need to be fairly easy to migrate from it to another (free-er) system if need be.

The working group, and RelEng in general, understands that open core is controversial and personally deeply distasteful to many in the movement, including at least one member of the working group itself. However, software freedom is not the only important value for the Wikipedia movement. Getting things done without wasting donations on needless work is also important. Choosing an open core solution would be a strategic move to enable more productivity.

Must support git.
We currently use git as our version control system. It works well. Changing the version control system is not an option.

Must have a version we can easily use for evaluation.
The time allotted for evaluation is too short for the working group to do from-scratch installations.

Must be understandable without too much effort to our developers so that they can use CI/CD productively.
Whatever we do, our developers will need to learn new stuff. The current CI system is already criticised and switching to something that isn't easier would be a move against the interests of our developers, and would not improve productivity.

Must support self-serve CI, meaning we don't block people if they want CI for a new repo.
We feel that empowering our developers to have more control over how their software is built and tested, without compromising on the safety and security of our production systems, would help our developers be happier with CI, and work more productively for the improvement of our various sites, and the advancement of the Wikipedia movement's goals.

= Evaluations =

Tools are sorted alphabetically.

Argo CD
T218827

Concourse CI
T217595

FIXME.

GitLab CI/CD
T217594

FIXME.

GoCD
T218332

FIXME.

Jenkins X
T218334

FIXME.

Phabricator Harbormaster
The only thing it supports is running the build on an external CI system, like Jenkins.

A couple of quotes from the documentation:

"The current version of Harbormaster can perform some basic build tasks, but has many limitations and is not a complete build platform."

"Currently, the only useful type of build step is 'Make HTTP Request', which you can use to make a call to an external build system like Jenkins."

More information is available at T217901.

sourcehut builds
sourcehut (hosted version at sr.ht) is an AGPL-licensed suite of interoperable tools for code hosting, issue tracking, CI, etc. Individual components, including the build service, can be used standalone. Builds are run inside virtual machines, and described in a YAML manifest in a repository which specifies image type, packages to be installed, and a series of commands to run. A lightweight web interface is provided. The simplicity of this approach is very appealing, and usability is quite good, but sourcehut is still in early stages and may not be a good technical fit for our container-centric pipeline goals.

Workflow is simple:


 * push  to a repo (example: d7b657a )
 * build is triggered (example: 42539 )

Example :

More information is available at T217852.

Spinnaker
T218335

FIXME.

Tekton
T217912

FIXME.

Zuul
T218138

Zuul v3 is a significant departure from the 2.x release of Zuul currently in use at the Foundation. It's situated firmly in the OpenStack ecosystem, and relies on a service called Nodepool to provide nodes for executing jobs. Nodepool originally required OpenStack, but now offers a Kubernetes driver, which is a more realistic possibility for hosting on our infrastructure. It's also developed with Gerrit in mind. Configuration is flexible, but somewhat complex and spread between several sources of truth. These include pipeline definitions, central Zuul config, Nodepool config, in-repository job definitions (which occupy a shared namespace), and Ansible playbooks. Jobs are implemented in Ansible. In summary, Zuul v3 seems capable and feature-rich, but configuration-heavy and likely to impose some cognitive overhead on developers.

= The working group's recommendation =

FIXME. All of this, including subsections, needs to be discussed and edited and expanded. Note that we can recommend several options to look at in more detail.

The working group recommends... FIXME.

We should outline how the software we recommend fits together to form a complete CDep pipeline. We should also identify any tradeoffs involved in our specific choices.

Next actions
RelEng and SRE need to discuss how the new software should be deployed and maintained, and which team takes responsibility for what in the new CI system.

RelEng needs to ensure there are several people within the team who understand how the CI system works, and who can investigate and fix any problems.

If RelEng chooses an open core solution, we also recommend that RelEng keep an eye out for possible non-open-core CI systems in the future, re-evaluating the CI software choice annually and switching again if it seems a free-er system is ready for us. It may be a good idea to explore CI tooling on an ongoing basis, such as quarterly, at least on a surface level, if not with in-depth evaluations.

It would be good for RelEng to cultivate a constructive working relationship with the upstream project of whatever tools we choose.

The new CI system should be documented well, and the documentation should be maintained.

It would be good to train our developers on an ongoing basis on using the CI system.