Wikimedia Release Engineering Team/CI Futures WG/Report

THIS IS A DRAFT. IT DOES NOT YET RECOMMEND ANYTHING.

= Introduction =

The Wikimedia Foundation (WMF) has had a continuous integration (CI) system for almost a decade, managed and maintained by the Release Engineering (RelEng) team. It consists of several components, such as Gerrit (code review), Zuul (gating), Gearman (distributing jobs to workers), and Jenkins (controlling workers), with virtual machines as build workers.

The CI system works well, but there are several reasons to consider changing it.


 * The Zuul component is a WMF fork of Zuul version 2.5.1, which is no longer supported by upstream. We will need to replace or upgrade this. The current upstream version of Zuul is entirely different, and runs builds via Ansible rather than Jenkins.
 * The CI system requires a fair bit of routine administration by RelEng (FIXME: give an example. Creating/updating/deleting Jenkins jobs, Zuul configuration and Docker containers? See project:integration/config in Gerrit.).
 * In general, our developers do not relish having to deal with CI. It feels strange, cumbersome, and slow to them.

Discussions about the CI system and what it should be have been ongoing within RelEng for at least two years, but no consensus has been reached organically. Additionally, RelEng (and others) are working on a continuous deployment pipeline, part of which will remove the daily SWAT and weekly train deployments. This will make the CI system an even more critical component of the development workflow for the Wikipedia movement.

In late February 2019, Greg (the RelEng manager) tasked a working group with proposing what the future of our CI tooling and software should look like. This is the report of that group.

The working group documented its work at Wikimedia Release Engineering Team/CI Futures WG. For details, see that wiki page.

= Evaluation process =

The working group decided to evaluate various CI software and settled on the following rough process:


 * collect requirements
 * classify the requirements according to importance, with a list of very hard requirements that any candidate needs to fulfill
 * collect a list of candidates
 * evaluate candidates based on requirements, rejecting any candidates that don't fulfill the very hard requirements
 * implement a toy CI project on any candidate that warrants a closer look
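
The rejection step in this process can be sketched in a few lines of Python. This is only an illustration of the rule "reject any candidate that misses a very hard requirement". The requirement labels, the Candidate type, and the candidate names are invented for this sketch and do not describe any real evaluation result.

```python
from dataclasses import dataclass, field

# Short labels for the very hard requirements. The labels are invented
# for this sketch; the report states the requirements in prose.
HARD_REQUIREMENTS = frozenset({
    "self-hostable",
    "open-source-or-open-core",
    "git",
    "easy-to-evaluate",
    "understandable",
    "self-serve",
})


@dataclass
class Candidate:
    """A hypothetical CI candidate and the requirements it fulfils."""
    name: str
    fulfilled: set = field(default_factory=set)


def shortlist(candidates):
    """Split candidates into accepted names and rejected names.

    A candidate is rejected if it misses even one hard requirement.
    Rejected candidates map to the sorted list of what they miss.
    """
    accepted, rejected = [], {}
    for candidate in candidates:
        missing = HARD_REQUIREMENTS - candidate.fulfilled
        if missing:
            rejected[candidate.name] = sorted(missing)
        else:
            accepted.append(candidate.name)
    return accepted, rejected
```

Only the accepted candidates would go on to the toy CI project. Recording what each rejected candidate misses keeps the reason for rejection visible.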

The toy CI project consisted of building the Blubber software. It is written by RelEng itself, in Go, and is the simplest realistic project we can think of. The toy project would not involve deploying the software, only building it from source, and running its unit tests.

= Very hard requirements =

The working group settled on the following very hard, non-negotiable requirements:

Must be hostable by the foundation. It's not acceptable to rely on outside services.
WMF wants to host the CI system itself, for various reasons, including the following:


 * we don't want to be dependent on external services
 * it's a core service for our development process, without which we can't do development, so having direct control is beneficial
 * CI builds and deploys software, which has a direct impact on the security of our servers, and using external services would be a questionable choice

Must be free software / open source. "Open core" like GitLab might be good enough.
We would prefer a fully open source, free software version, all other things being equal.

Preferring "open source" is a core value of WMF and the Wikipedia movement (see Wikimedia Foundation Guiding Principles). We have only considered candidates that are open source. However, some of them are "open core", where there are two versions of the software: an open source "community edition", and a proprietary "enterprise edition", where the proprietary version has additional functionality. Depending on the project, an open core approach may mean that the open source version is not fully functional or has worse support, and that external contributors must allow their contributions to be re-licensed in a closed-source, proprietary manner for the proprietary version.

Open core is sub-optimal, but we've decided that it can be acceptable, depending on the details of how a candidate does open core. The open source version needs to be sufficient for WMF. WMF won't use a proprietary version.

From a pragmatic point of view, if we were to choose an open core candidate for CI, it would need to be fairly easy to migrate from it to another (free-er) system if need be.

The working group, and RelEng in general, understands that open core is controversial and personally deeply distasteful to many in the movement, including at least one member of the working group itself. However, software freedom is not the only important value for the Wikipedia movement. Getting things done without wasting donations on needless work is also important. Choosing an open core solution would be a strategic move to enable more productivity.

Must support git.
We currently use git as our version control system. It works well. Changing the version control system is not an option.

Must have a version we can easily use for evaluation.
The time allotted for evaluation is too short for the working group to evaluate from-scratch installations.

Must be understandable without too much effort to our developers so that they can use CI/CD productively.
Whatever we do, our developers will need to learn new stuff. The current CI system is already criticised and switching to something that isn't easier would be a move against the interests of our developers, and would not improve productivity.

Must support self-serve CI, meaning we don't block people if they want CI for a new repo.
We feel that empowering our developers to have more control over how their software is built and tested, without compromising on the safety and security of our production systems, would help our developers be happier with CI, and work more productively for the improvement of our various sites, and the advancement of the Wikipedia movement's goals.

= Evaluations =

We collected a list with many candidates. Many were excluded due to their license (neither open source nor open core), or for not fulfilling one of the other very hard requirements we listed. We looked at several options, summarised below.

The list below is sorted alphabetically.

Argo
T218827

Argo comprises a few different projects that have well defined concerns and would work well together to provide a fully functional CI system. Similar to Tekton, it provides Kubernetes CRDs that delegate the workload scheduling and execution to k8s. Unlike Tekton, however, it provides a nice CLI interface, a specialized controller for workflow triggering, a separate project for consuming and propagating external events (Argo Events), and a simple but functional web UI. Benefits and drawbacks include:


 * Benefit: It’s easy to install and run. Installing it and getting Blubber to build on it took only about 15 minutes.
 * Benefit: As a k8s native solution, it’s straightforward to operate given you have knowledge of k8s and.
 * Benefit: The Workflow CRD that Argo provides is simple to understand with its concepts of inputs/outputs and containerized steps, and supports serial or DAG style execution. These workflow manifests could potentially be maintained either directly by teams or generated from our.
 * Benefit: Very little overhead. Again, like Tekton, these CRDs essentially spin off Pods and k8s does the workload scheduling. In addition, Argo supplies two controllers, one for workflow triggering and integration with Argo Events, and one for the UI.
 * Benefit: It supplies a web UI. Granted it’s a very simple read-only UI, but it provides the things that are needed 99% of the time: workflow build status and history, logs, and links to artifacts.
 * Benefit: The team that maintains it seems invested and responsive so far. They are writing a lot of code, giving talks, and participating in k8s office hours—which is where the project was discovered. The evaluator joined the Slack channel to ask some questions and they were respectful, helpful, and responsive.
 * Benefit: The Argo Events gateways provide well defined interfaces for consumption of events from external systems. According to the developers, we have a few decent options for evented Gerrit integration, using either webhooks, kafka, or a custom gateway that would maintain a connection over SSH.
 * Benefit: Multiple external artifact stores are supported and integrated into the UI. The decoupled design could also be considered a benefit as it allows for migration of systems independently.
 * Drawback: The web UI is limited. If complete control over workflow builds (CRUD operations) is what we need, we would need to modify the existing UI or create our own.
 * Drawback: Debugging of operational problems might be difficult for developers given Argo’s k8s native model, though I’m not sure debugging of operational issues by end users is really a requirement.
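
To illustrate what "serial or DAG style execution" means for a workflow of containerized steps, here is a minimal Python sketch. It does not use Argo or its manifest format. The step names and their dependencies are hypothetical, and the ordering comes from the standard library's graphlib module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each step maps to the set of steps it depends on.
steps = {
    "checkout": set(),
    "build": {"checkout"},
    "unit-test": {"build"},
    "lint": {"checkout"},
    "report": {"unit-test", "lint"},
}


def execution_waves(dependencies):
    """Group steps into waves; steps in the same wave could run in parallel.

    Raises graphlib.CycleError if the dependencies contain a cycle.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves


print(execution_waves(steps))
```

With these hypothetical steps, "build" and "lint" land in the same wave because both only need "checkout". A purely serial workflow would run every step in its own wave.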

Concourse CI
T217595

Lars failed to get Blubber built, because he could not get Concourse to use a sufficiently new Go Docker image. The same image worked fine with GitLab CI/CD. Also, Concourse wants the "fly" command line tool to be used for many operations, which would fit badly with how we expect our CI to be used. It would be possible to build tooling around Concourse to have it work well for our development community, but it seems like it'd be a lot of work. Not recommended.

GitLab CI/CD
T217594

It was remarkably easy to import the Blubber git repository to the gitlab.com instance, and to add a .gitlab-ci.yml file to build Blubber and run the unit tests. A drawback is that GitLab CI/CD is open core, but the open source version should work well for WMF, and be reasonably low-risk for us. We would need to build integration with Gerrit (and possibly Phabricator), to have similar workflows for code review and merging as we currently have. The GitLab API should make that possible, but it's something to consider. Recommended for further evaluation.

GoCD
T218332

This doesn't seem to have a demo instance or other easy way to evaluate it. Installing from the .deb packages they provide worked, without too much pain. Consists of a server component, which also provides the web UI, plus an agent component to be run on each build worker. Got Blubber built and its unit tests to run. Configuration is via the web UI; there might be an API for automation, but did not look into that. A Travis-style CI config in the repo is not supported out of the box, and would have to be built. The web UI has no authentication by default, but it can be turned on. Worker build environments need to be managed manually, e.g., installing build dependencies, and since the workers are just plain old Unix hosts (bare metal or VM), this may turn out to be a lot of work, as different projects have different, and conflicting, requirements. Not recommended, mainly due to not supporting self-serve CI well.

Jenkins X
T218334

Jenkins X feels like an over-engineered kitchen sink that imposes an obtuse and opinionated workflow that will not fit our needs without a lot of customization effort. It expects an installation per team, and the evaluator does not wish the installation process on anyone (from Release Engineering or any other team). Hard pass.

Phabricator Harbormaster
The only thing it supports is running the build on an external CI system, like Jenkins.

A couple of quotes from the documentation:

"The current version of Harbormaster can perform some basic build tasks, but has many limitations and is not a complete build platform."

"Currently, the only useful type of build step is 'Make HTTP Request', which you can use to make a call to an external build system like Jenkins."

More information is available at T217901.

sourcehut builds
sourcehut (hosted version at sr.ht) is an AGPL-licensed suite of interoperable tools for code hosting, issue tracking, CI, etc. Individual components, including the build service, can be used standalone. Builds are run inside virtual machines, and described in a YAML manifest in a repository which specifies image type, packages to be installed, and a series of commands to run. A lightweight web interface is provided. The simplicity of this approach is very appealing, and usability is quite good, but sourcehut is still in early stages and may not be a good technical fit for our container-centric pipeline goals.

Workflow is simple:


 * push to a repo (example: d7b657a)
 * build is triggered (example: 42539)

Example :

More information is available at T217852.

Spinnaker
T218335

FIXME.

Tekton
T217912

Tekton is narrow in scope but it seems to do what it does well: It provides a coherent set of Custom Resource Definitions (CRD) necessary to get CI type workloads running on k8s efficiently and quickly. Its narrowness in scope and CRD nature yield these benefits and drawbacks:


 * Benefit: It took very little time and effort, about an hour, to install Tekton CRDs into minikube and get Blubber built using the new Pipeline/Task resources.
 * Benefit: For someone with k8s knowledge, it was perfectly clear what was going on under the hood and the running system was easy to interrogate using,   etc.
 * Benefit: Execution of the task had almost no additional overhead since k8s is doing all the work (i.e. TaskRuns simply spawn Pods).
 * Benefit: The PipelineResource, Pipeline, Task, PipelineRun, TaskRun resources are all very flexible in their design. I could see these being either maintained by teams themselves or being generated by a higher level abstraction that we provide (e.g. a ).
 * Drawback: For a developer having no k8s knowledge, interrogating the running system would not be easy. A Web UI and/or CLI tooling built around  would be straightforward to implement but would have to be implemented nonetheless.
 * Drawback: This is a barebones system that would require us to implement a UI and possibly other components (e.g. a Gerrit event-stream handler and reporting; however, that’s true for other systems too).

Overall, Tekton comprises an incomplete CI system that would require UI implementation. Therefore it cannot be recommended at this time.

Zuul
T218138

Zuul v3 is a significant departure from the 2.x release of Zuul currently in use at the Foundation. It's situated firmly in the OpenStack ecosystem, and relies on a service called Nodepool to provide nodes for executing jobs. Nodepool originally required OpenStack, but now offers a Kubernetes driver, which is a more realistic possibility for hosting on our infrastructure. It's also developed with Gerrit in mind. Configuration is flexible, but somewhat complex and spread between several sources of truth. These include pipeline definitions, central Zuul config, Nodepool config, in-repository job definitions (which occupy a shared namespace), and Ansible playbooks. Jobs are implemented in Ansible. In summary, Zuul v3 seems capable and feature-rich, but configuration-heavy and likely to impose some cognitive overhead on developers.

= The working group's recommendation =

FIXME. All of this, including subsections, needs to be discussed and edited and expanded. Note that we can recommend several options to look at in more detail.

The working group recommends that RelEng looks deeper into X, Y, and Z, and implements a prototype self-serve CI system, integrated with Gerrit on at least one of them. The working group regrets not having enough time to do that during the evaluation period. The prototype or prototypes would listen to the production Gerrit event stream, do test builds on proposed changes, and on any changes to the master branch. A build would follow build and test configuration as specified in a file at the root of the repository, possibly followed by additional build steps defined by RelEng.
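
As a rough illustration of the event handling described above, the following Python sketch decides whether one line from the Gerrit event stream should trigger a build. It is a sketch under assumptions, not a design: the event types and field names ("patchset-created", "ref-updated", "change", "refUpdate", "refName") follow our understanding of the JSON emitted by Gerrit's stream-events command and should be checked against the documentation for the deployed Gerrit version. The function name, the return values, and the project name in the sample are hypothetical.

```python
import json


def should_build(event_line, main_refs=("master", "refs/heads/master")):
    """Return (project, reason) if the event warrants a build, else None.

    event_line is one JSON document, as read from the event stream.
    Both short and full ref names are accepted, because the form of
    refName is assumed to vary between Gerrit versions.
    """
    event = json.loads(event_line)
    kind = event.get("type")
    if kind == "patchset-created":
        # A new or updated proposed change: run a test build.
        return event["change"]["project"], "proposed-change"
    if kind == "ref-updated":
        update = event["refUpdate"]
        if update["refName"] in main_refs:
            # The master branch itself changed: build it as well.
            return update["project"], "branch-update"
    return None


# Hypothetical sample event, reduced to the fields used above.
sample = '{"type": "patchset-created", "change": {"project": "example/project"}}'
print(should_build(sample))
```

A prototype would read such lines continuously and hand each positive result to the chosen CI system. Everything else, such as comments and votes, is ignored here.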

For example, the Blubber repository would have a .foo.yaml file (name may vary between CI systems) which would specify that the build uses the "wmf-golang-1.11" Docker image, that the build happens by running "make", and that commit stage tests are run with the command "make test". Further, RelEng would build a Docker image using a Dockerfile generated from .pipeline/blubber.yaml, and test the image using commands specified in that file. The image would not necessarily be published anywhere, nor deployed to production.
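
The Blubber example can be made concrete with a small Python sketch that turns such a configuration into the commands a worker might run. The configuration is shown as an already parsed dictionary, since the file format is not decided. The key names ("image", "build", "test"), the plan_commands function, and the use of "docker run" with a bind-mounted checkout are all assumptions of this sketch, not part of any evaluated CI system.

```python
import shlex

# Hypothetical parsed contents of the repository's CI file (.foo.yaml).
config = {
    "image": "wmf-golang-1.11",
    "build": "make",
    "test": "make test",
}


def plan_commands(cfg, checkout):
    """Return argv lists that run the build, then the tests, in the image.

    checkout is the path of the git checkout on the worker; it is
    mounted at /src inside the container, which is an arbitrary choice.
    """
    base = [
        "docker", "run", "--rm",
        "-v", f"{checkout}:/src",
        "-w", "/src",
        cfg["image"],
    ]
    return [base + shlex.split(cfg[stage]) for stage in ("build", "test")]


for argv in plan_commands(config, "/tmp/blubber"):
    print(" ".join(argv))
```

The additional steps defined by RelEng, such as building and testing an image from .pipeline/blubber.yaml, would be appended after these repository-defined stages.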

The prototype would be set up on some suitable infrastructure, ideally provided by WMF for the purpose. It would be run in parallel with the current CI in a way that lets us evaluate the prototype deployment, without having it interfere with normal development.

Next actions
RelEng and SRE need to discuss how the new software should be deployed and maintained, and which team takes responsibility for what in the new CI system.

RelEng needs to ensure there are several people within the team who understand how the CI system works, and who can investigate and fix any problems.

If RelEng chooses an open core solution, we also recommend RelEng keeps an eye out for possible non-open-core CI systems in the future, re-evaluating the CI software choice annually, and switching again if it seems a free-er system is ready for us. It may be a good idea to explore CI tooling on an ongoing basis, such as quarterly, at least on a surface level, if not with in-depth evaluations.

It would be good for RelEng to cultivate a constructive working relationship with the upstream project of whatever tools we choose.

The new CI system should be documented well, and the documentation should be maintained.

It would be good to train our developers on an ongoing basis on using the CI system.