Wikimedia Release Engineering Team/CI Futures WG/Report

= Introduction =

The Wikimedia Foundation (WMF) has had a continuous integration (CI) system for almost a decade, managed and maintained by the Release Engineering (RelEng) team. It consists of several components: Gerrit (code review), Zuul (gating), Gearman (distributing jobs to workers), Jenkins (controlling workers), and virtual machines that act as build workers.

The CI system works well, but there are several reasons to consider changing it.


 * The Zuul component is a WMF fork of Zuul version 2.5.1, which is no longer supported by upstream. We will need to replace or upgrade this. The current upstream version of Zuul is entirely different, and runs builds via Ansible rather than Jenkins.
 * The CI system requires a fair bit of routine administration by RelEng (FIXME: give an example. Creating/updating/deleting Jenkins jobs, Zuul configuration and Docker containers? See project:integration/config in Gerrit.).
 * In general, our developers do not relish having to deal with CI. It feels strange, cumbersome, and slow to them.

Discussions about the CI system and what it should be have been ongoing within RelEng for at least two years, but no consensus has been reached organically. Additionally, RelEng (and others) are working on a continuous deployment pipeline, part of which will remove the daily SWAT and weekly train deployments. This will make the CI system an even more critical component of the development workflow for the Wikipedia movement.

In late February 2019, Greg (the RelEng manager) tasked a working group with proposing what the future of our CI tooling and software should look like. This is the report of that group.

The working group documented its work at CI Futures WG. For details, see that wiki page.

= Evaluation process =

The working group decided to evaluate various CI software and settled on the following rough process:


 * collect requirements
 * classify the requirements according to importance, with a list of very hard requirements that any candidate needs to fulfil
 * collect a list of candidates
 * evaluate candidates based on requirements, rejecting any candidates that don't fulfil the very hard requirements
 * implement a toy CI project on any candidate that warrants a closer look
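
The rejection step in the process above can be sketched in a few lines of Python. This is only an illustration, not a tool the working group used: the requirement labels paraphrase the "very hard requirements" section, and the `Candidate` class, the `shortlist` function, and the demo candidates are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical labels that paraphrase the report's very hard requirements.
HARD_REQUIREMENTS = frozenset({
    "self-hostable",
    "open-source",
    "supports-git",
    "easy-to-evaluate",
    "comprehensible",
    "self-serve",
})


@dataclass
class Candidate:
    name: str
    # Requirement labels this candidate is judged to meet.
    satisfied: set = field(default_factory=set)


def shortlist(candidates):
    """Split candidates into (accepted, rejected) using the hard requirements.

    `rejected` maps each rejected name to the sorted list of hard
    requirements that the candidate does not meet.
    """
    accepted, rejected = [], {}
    for candidate in candidates:
        missing = HARD_REQUIREMENTS - candidate.satisfied
        if missing:
            rejected[candidate.name] = sorted(missing)
        else:
            accepted.append(candidate.name)
    return accepted, rejected


if __name__ == "__main__":
    # Made-up candidates, not results from the actual evaluation.
    demo = [
        Candidate("candidate-a", set(HARD_REQUIREMENTS)),
        Candidate("candidate-b", set(HARD_REQUIREMENTS) - {"self-serve"}),
    ]
    print(shortlist(demo))
```

Only candidates in the accepted list would move on to the toy CI project.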

The toy CI project consisted of building the Blubber software. It is written by RelEng itself, in Go, and is the simplest realistic project we can think of. The toy project would not involve deploying the software, only building it from source, and running its unit tests.

= Very hard requirements =

The working group settled on the following very hard, non-negotiable requirements:

Must be hostable by the Foundation
WMF wants to host the CI system itself, for various reasons, including the following:


 * we don't want to be dependent on external services
 * it's a core service for our development process, without which we can't do development, so having direct control is beneficial
 * CI builds and deploys software, which has a direct impact on the security of our servers, and using external services would be a questionable choice

It's not acceptable for WMF to depend on outside services for CI, just like it isn't acceptable to rely on outside HTTP servers for hosting wikis.

Must be free software / open source
"Open core" like GitLab might be good enough, but needs to be considered carefully.

We would prefer a fully open source, free software version, all other things being equal.

Preferring "open source" is a core value of WMF and the Wikipedia movement (see Wikimedia Foundation Guiding Principles). We have only considered candidates that are open source. However, some of them are "open core", where there are two versions of the software: an open source "community edition", and a proprietary "enterprise edition", where the proprietary version has additional functionality. Depending on the project, an open core approach may mean that the open source version is not fully functional, has worse support, and any external contributions to it may require allowing them to be re-licensed in a closed-source, proprietary manner for the proprietary version.

Open core is sub-optimal, but we've decided that it can be acceptable, depending on the details of how a candidate does open core. The open source version needs to be sufficient for WMF. WMF won't use a proprietary version.

From a pragmatic point of view, if we were to choose an open core candidate for CI, it would need to be fairly easy to migrate from it to another (free-er) system if need be.

The working group, and RelEng in general, understands that open core is controversial and personally deeply distasteful to many in the movement, including at least one member of the working group itself. However, software freedom is not the only important value for the Wikipedia movement. Getting things done without wasting donations on needless work is also important. Choosing an open core solution would be a strategic move to enable more productivity.

Must support git
We currently use git as our version control system. It works well. Changing the version control system is not an option.

Must have a version we can easily use for evaluation
The time allotted for evaluation is too short for the working group to evaluate from-scratch installations.

Must be comprehensible without too much effort
Whatever we do, our developers will need to learn new stuff. The current CI system is already criticised and switching to something that isn't easier would be a move against the interests of our developers, and would not improve productivity. A CI system that's difficult to understand is an impediment to productivity and developer satisfaction.

Must support self-serve CI
We feel that empowering our developers to have more control over how their software is built and tested, without compromising on the safety and security of our production systems, would help our developers be happier with CI, and work more productively for the improvement of our various sites, and the advancement of the Wikipedia movement's goals.

To clarify, self-serve means our developers should be able to work as much as possible without RelEng being needed. For example, once a new git repository is created, CI should "just happen" and the developers should have some control over how CI builds and tests the project.

= Evaluations =

We collected a long list of candidates. Many were excluded because of their license (neither open source nor even open core), or because they did not fulfil one of the other very hard requirements we listed. We looked at several options, summarised below.

The list below is sorted alphabetically.

Argo
Argo comprises a few different projects that have well defined concerns and would work well together to provide a fully functional CI system. Similar to Tekton, it provides Kubernetes CRDs that delegate the workload scheduling and execution to k8s. Unlike Tekton, however, it provides a nice CLI interface, a specialized controller for workflow triggering, a separate project for consuming and propagating external events (Argo Events), and a simple but functional web UI. Benefits and drawbacks include:


 * Benefit: It’s easy to get installed and running. Getting it installed and Blubber building on it took only about 15 minutes or so.
 * Benefit: As a k8s native solution, it’s straightforward to operate given you have knowledge of k8s and.
 * Benefit: The Workflow CRD that Argo provides is simple to understand with its concepts of inputs/outputs and containerized steps, and supports serial or DAG style execution. These workflow manifests could potentially be maintained either directly by teams or generated from our.
 * Benefit: Very little overhead. Again, like Tekton, these CRDs essentially spin off Pods and k8s does the workload scheduling. In addition, Argo supplies two controllers, one for workflow triggering and integration with Argo Events, and one for the UI.
 * Benefit: It supplies a web UI. Granted it’s a very simple read-only UI, but it provides the things that are needed 99% of the time: workflow build status and history, logs, and links to artifacts.
 * Benefit: The team that maintains it seems invested and responsive so far. They are writing a lot of code, giving talks, and participating in k8s office hours—which is where the project was discovered. The evaluator joined the Slack channel to ask some questions and they were respectful, helpful, and responsive.
 * Benefit: The Argo Events gateways provide well defined interfaces for consumption of events from external systems. According to the developers, we have a few decent options for evented Gerrit integration, using either webhooks, kafka, or a custom gateway that would maintain a connection over SSH.
 * Benefit: Multiple external artifact stores are supported and integrated into the UI. The decoupled design could also be considered a benefit as it allows for migration of systems independently.
 * Drawback: The web UI is limited. If complete control over workflow builds (CRUD operations) is what we need, we would need to modify the existing UI or create our own.
 * Drawback: Debugging of operational problems might be difficult for developers given Argo’s k8s-native model, though I’m not sure debugging of operational issues by end users is really a requirement.
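
To illustrate what "serial or DAG style execution" means for a workflow, here is a minimal Python sketch that groups steps into waves that could run in parallel. It does not use Argo or its manifest format. The step names and the `execution_waves` helper are hypothetical, and the ordering is done with the standard library `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each step maps to the set of steps it depends on.
WORKFLOW = {
    "checkout": set(),
    "build": {"checkout"},
    "unit-tests": {"build"},
    "lint": {"checkout"},
    "report": {"unit-tests", "lint"},
}


def execution_waves(steps):
    """Group steps into waves; all steps in one wave can run in parallel."""
    sorter = TopologicalSorter(steps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        # Steps whose dependencies have all been marked done.
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(execution_waves(WORKFLOW), start=1):
        print(number, wave)
```

A serial workflow is the special case in which every step depends only on the previous one, so each wave contains exactly one step.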

✅ Recommended for further evaluation. More information is available at T218827.

Concourse CI
Lars failed to get Blubber built, because a sufficiently new Go Docker image could not be used. The same image worked fine with GitLab CI/CD. Also, Concourse wants its command line tool to be used for many operations, which would fit badly with how we expect our CI to be used. It would be possible to build tooling around Concourse to have it work well for our development community, but it seems like it'd be a lot of work.


 * Benefit: Upstream (Pivotal).
 * Drawback: Command line tool needed for many operations.

❌ Not recommended for further evaluation. More information is available at T217595.

GitLab CI/CD
It was remarkably easy to import the Blubber git repository to the gitlab.com instance, and to add a CI configuration file to build Blubber and run the unit tests. A drawback is that GitLab CI/CD is open core, but the open source version should work well for WMF, and be reasonably low-risk for us. We would need to build integration with Gerrit (and possibly Phabricator) to have workflows for code review and merging similar to those we currently have. The GitLab API should make that possible, but it's something to consider.


 * Benefit: Ease of use.
 * Benefit: CI configuration in repository.
 * Benefit: The open source version should be good enough for WMF.
 * Benefit: Tested at the scale we would need.
 * Drawback: Open core.
 * Drawback: Integration with Gerrit and Phabricator would have to be built.

✅ Recommended for further evaluation. More information is available at T217594.

GoCD
This doesn't seem to have a demo instance or other easy way to evaluate it. Installing from the packages they provide worked without too much pain. It consists of a server component, which also provides the web UI, plus an agent component to be run on each build worker. We got Blubber built and its unit tests to run. Configuration is via the web UI; there might be an API for automation, but we did not look into that. A Travis-style CI config in the repo is not supported out of the box, and would have to be built. The web UI has no authentication by default, but it can be turned on. Worker build environments need to be managed manually, e.g., installing build dependencies, and since the workers are just plain old Unix hosts (bare metal or VM), this may turn out to be a lot of work, as different projects have different, and conflicting, requirements.


 * Benefit: Easy installation.
 * Benefit: Web interface.
 * Drawback: Does not support job configuration in the repository.
 * Drawback: Worker build environments need to be managed manually.

❌ Not recommended for further evaluation, mainly due to not supporting self-serve CI well. More information is available at T218332.

Jenkins X
Jenkins X feels like an over-engineered kitchen sink that imposes an obtuse and opinionated workflow that will not fit our needs without a lot of customization effort. It expects an installation per team, and the evaluator does not wish the installation process on anyone (from Release Engineering or any other team).


 * Benefit: The Jenkins component is a familiar technology in which we have some investment (e.g. the Release Pipeline).
 * Drawback: A lot of grokking and customization would be required by teams themselves since it's a one-pipeline-per-team model.
 * Drawback: Installation was very difficult and components somehow seemed both disparate and bloated.
 * Drawback: Full installation required a lot of resources to run (at bare minimum 4 vCPUs, 4G memory). For a single installation this wouldn't be an issue, but the one-pipeline-per-team model multiplies those requirements greatly.

❌ Not recommended for further evaluation. More information is available at T218334.

Phabricator Harbormaster
The only thing it supports is running the build at an external CI system, like Jenkins.

A couple of quotes from the documentation:

"The current version of Harbormaster can perform some basic build tasks, but has many limitations and is not a complete build platform."

"Currently, the only useful type of build step is 'Make HTTP Request', which you can use to make a call to an external build system like Jenkins."


 * Benefit: Already part of Phabricator, which WMF already uses.
 * Drawback: At a very early stage of development.

❌ Not recommended for further evaluation. More information is available at T217901.

sourcehut builds
sourcehut (hosted version at sr.ht) is an AGPL-licensed suite of interoperable tools for code hosting, issue tracking, CI, etc. Individual components, including the build service, can be used standalone. Builds are run inside virtual machines, and are described in a YAML manifest in the repository, which specifies the image type, the packages to be installed, and a series of commands to run. A lightweight web interface is provided. The simplicity of this approach is very appealing, and usability is quite good, but sourcehut is still in its early stages and may not be a good technical fit for our container-centric pipeline goals.


 * Benefit: Ease of use.
 * Benefit: Web interface.
 * Drawback: No built-in support for containers.
 * Drawback: At an early stage of development.

❌ Not recommended for further evaluation. More information is available at T217852.

Spinnaker
Spinnaker expects to consume images from an existing CI system (Jenkins, Travis, artifact release to Docker registry or GitHub, etc.) for deployment to a cloud platform. As it's focused on delivery / deployment, Spinnaker doesn't seem suited to evaluation under the CI WG's current charter, though it may be relevant to future overall discussions of the pipeline.

❌ Not recommended for further evaluation. More information is available at T218335.

Tekton
Tekton is narrow in scope but it seems to do what it does well: It provides a coherent set of Custom Resource Definitions (CRD) necessary to get CI type workloads running on k8s efficiently and quickly. Its narrowness in scope and CRD nature yield these benefits and drawbacks:


 * Benefit: It took very little time and effort to install Tekton CRDs into minikube and get Blubber built using the new Pipeline/Task resources, ~ an hour or so.
 * Benefit: For someone with k8s knowledge, it was perfectly clear what was going on under the hood and the running system was easy to interrogate using,   etc.
 * Benefit: Execution of the task had almost no additional overhead since k8s is doing all the work (i.e. TaskRuns simply spawn Pods).
 * Benefit: The PipelineResource, Pipeline, Task, PipelineRun, TaskRun resources are all very flexible in their design. I could see these being either maintained by teams themselves or being generated by a higher level abstraction that we provide (e.g. a ).
 * Drawback: For a developer having no k8s knowledge, interrogating the running system would not be easy. A Web UI and/or CLI tooling built around  would be straightforward to implement but would have to be implemented nonetheless.
 * Drawback: This is a barebones system that would require us to implement a UI and possibly other components (e.g. a Gerrit event-stream handler and reporting, though that’s true for other systems too).

Overall, Tekton is an incomplete CI system that would require us to implement a UI. Therefore it cannot be recommended at this time.

❌ Not recommended for further evaluation. More information is available at T217912.

Zuul
Zuul v3 is a significant departure from the 2.x release of Zuul currently in use at the Foundation. It's situated firmly in the OpenStack ecosystem, and relies on a service called Nodepool to provide nodes for executing jobs. Nodepool originally required OpenStack, but now offers a Kubernetes driver, which is a more realistic possibility for hosting on our infrastructure. It's also developed with Gerrit in mind. Configuration is flexible, but somewhat complex and spread between several sources of truth. These include pipeline definitions, central Zuul config, Nodepool config, in-repository job definitions (which occupy a shared namespace), and Ansible playbooks. Jobs are implemented in Ansible. In summary, Zuul v3 seems capable and feature-rich, but configuration-heavy and likely to impose some cognitive overhead on developers.


 * Benefit: v2 already in use at WMF.
 * Benefit: Works well with Gerrit.
 * Drawback: Ansible.
 * Drawback: Configuration in several repositories.

✅ Recommended for further evaluation. More information is available at T218138.

= The working group's recommendation =

The working group recommends that RelEng look deeper into GitLab, Argo, and Zuul v3, and implement a prototype self-serve CI system, integrated with Gerrit, on at least one of them. The prototype or prototypes would listen to the production Gerrit event stream, and do test builds on proposed changes and on any changes to the master branch. A build would follow the build and test configuration specified in a file at the root of the repository, possibly followed by additional build steps defined by RelEng.
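
As a rough sketch of the event handling described above, the Python function below decides whether one line from the Gerrit event stream should trigger a build. It is not part of any prototype. The field names (`type`, `change.project`, `patchSet.ref`, `refUpdate.refName`) reflect our understanding of Gerrit's stream-events JSON and should be checked against the Gerrit version in use; the `build_request` function and the shape of its return value are hypothetical.

```python
import json


def build_request(event_line, branch="master"):
    """Return a dict describing the build to run, or None to ignore the event.

    `event_line` is one JSON document, as emitted one per line by
    Gerrit's stream-events command (an assumption about the input).
    """
    event = json.loads(event_line)
    kind = event.get("type")

    if kind == "patchset-created":
        # A proposed change: test-build the uploaded patch set.
        return {
            "project": event["change"]["project"],
            "ref": event["patchSet"]["ref"],
            "reason": "proposed-change",
        }

    if kind == "ref-updated":
        update = event["refUpdate"]
        # Accept both the short and the fully qualified branch name.
        if update["refName"] in (branch, "refs/heads/" + branch):
            return {
                "project": update["project"],
                "ref": "refs/heads/" + branch,
                "reason": "branch-update",
            }

    # Comments, votes, and all other event types do not trigger builds here.
    return None
```

A real listener would read such lines from a long-lived SSH connection to Gerrit and hand each non-empty result to the CI system's job queue.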

For example, the Blubber repository would have a  file (name may vary between CI systems) which would specify that the build uses the   Docker image, the build happens by running , that commit stage tests are run with the command. Further, RelEng builds a Docker image using a  generated from , tests the image using commands specified in that file. The image would not necessarily be published anywhere, nor deployed to production.
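
The ordering described above, developer-defined steps first and RelEng-defined steps afterwards, can be sketched as follows. Everything in this example is an assumption made for illustration: the configuration keys (`image`, `build`, `test`), the image name, the commands, and the `build_plan` function do not come from Blubber or from any of the evaluated CI systems.

```python
import shlex

# Hypothetical keys that a repository's CI configuration must provide.
REQUIRED_KEYS = ("image", "build", "test")


def build_plan(repo_config, releng_steps=()):
    """Return the ordered build plan as (stage, owner, argv) tuples.

    Steps from the repository's own configuration come first, followed
    by any extra steps that RelEng appends for every project.
    """
    missing = [key for key in REQUIRED_KEYS if key not in repo_config]
    if missing:
        raise ValueError("missing configuration keys: " + ", ".join(missing))

    plan = [
        ("build", "repo", shlex.split(repo_config["build"])),
        ("test", "repo", shlex.split(repo_config["test"])),
    ]
    for stage, command in releng_steps:
        plan.append((stage, "releng", shlex.split(command)))
    return plan


if __name__ == "__main__":
    # Made-up configuration for a Go project; not Blubber's real settings.
    config = {
        "image": "example/go-build:latest",
        "build": "go build ./...",
        "test": "go test ./...",
    }
    extra = [("package", "echo 'build and test the image here'")]
    for step in build_plan(config, extra):
        print(step)
```

Keeping the RelEng steps outside the repository's file is one way to give developers control over their own build while leaving the later stages under central control.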

The prototype would be set up on some suitable infrastructure, ideally provided by WMF for the purpose. It would be run in parallel with the current CI in a way that lets us evaluate the prototype deployment, without having it interfere with normal development.

The need to support mobile development with CI was raised. Primarily this is a concern for building iOS applications, which require Macs running OS X. The working group recommends that dedicated Mac build workers are used, but controlled (perhaps over SSH) from the CI, or else that a separate CI infrastructure for iOS development is kept. The Mac build workers could be hosted at the WMF office or be rented as managed systems from a suitable supplier.

Next actions
RelEng and SRE need to discuss how the new software should be deployed and maintained, and which team takes responsibility for what in the new CI system.

RelEng needs to ensure there are several people within the team who understand how the CI system works, and who can investigate and fix any problems.

RelEng should keep evaluating the CI software choice, looking at whether switching again is a good idea. It may be a good idea to explore CI tooling on an ongoing basis, at least on a surface level, if not with in-depth evaluations.

It would be good for RelEng to cultivate a constructive working relationship with the upstream project of whatever tools we choose.

The new CI system should be documented well, and the documentation should be maintained.

It would be good to train our developers on an ongoing basis on using the CI system.