Continuous integration/Architecture/Isolation

Wikimedia's continuous integration system lacks the ability to securely run code proposed by people outside of the organization. This document proposes an architecture that spawns isolated virtual machines in which to run untrusted code. The concept is similar to Travis CI, but built around Jenkins and an OpenStack cloud.

A first step towards this goal was the Sandboxing continuous integration RFC document. Some parts are reused here.

The implementation of this architecture document was tracked in Phabricator as T86171.

Historical context

The first continuous integration system at the Wikimedia Foundation started in summer 2011 with the installation of PHPUnderControl. Soon after, we needed an overhaul and migrated to Jenkins running on a single bare-metal machine: gallium (still in use as of March 2015, though now only as the Jenkins master).

Security was not much of an issue, since only people with an account in the Wikimedia Labs LDAP were allowed to send code that would trigger a build on the Jenkins server. When Wikimedia Labs opened up public registration to the world, we restricted test execution to a whitelist of participating developers as a "temporary" measure. That is being pushed to the limit by our growing community.

Not all build steps need to execute code (e.g. linting PHP files). Therefore, to allow at least some feedback to new users, Timo Tijhof implemented a secondary pipeline known as "check" (to complement "test"). The check pipeline is unrestricted and runs its jobs no matter who sent the patch. The test pipeline only runs if the patch author is on our whitelist in Zuul (the CI scheduler). Untrusted users hence get a limited build with only lint checks, lacking the feedback of a full test run.

The user authentication roughly works like this:
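As a rough illustration of the rule described above, the routing can be sketched in Python. This is a minimal sketch, not the actual Zuul configuration: the whitelist contents, the e-mail addresses, and the function name are hypothetical.

```python
# Hypothetical whitelist; the real one lives in the Zuul configuration.
TRUSTED_AUTHORS = {"alice@example.org"}


def pipelines_for(author_email: str) -> list[str]:
    """Return the pipelines triggered by a patch from the given author."""
    # The check pipeline (lint only) is unrestricted.
    pipelines = ["check"]
    # The test pipeline executes code, so it only runs for whitelisted authors.
    if author_email in TRUSTED_AUTHORS:
        pipelines.append("test")
    return pipelines
```

An author who is not on the whitelist only ever gets the lint feedback from "check".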

Limitations of the current system

Having to manually maintain a whitelist of users often causes community members to feel like third-class citizens. The process around whitelisting is obscure (one has to figure out where and how to update the configuration file for the whitelist).

Our current Jenkins infrastructure consists of one master and about a dozen slaves. The slave pool contains two production servers and several virtual machines (Labs instances).

These slave servers persist between builds. Therefore, projects are not allowed to specify dependencies in their repository, install global dependencies or background services, or take any other action that survives tearing down the build workspace. (We do allow npm-install and composer-install within the workspace.) Instead, all environmental requirements are puppetized and usually made available as Debian packages for Ubuntu Precise (or Trusty). Such centralization is a bottleneck: it slows down expansion and experimentation, and results in slaves having to provide requirements and background services for many unrelated projects alongside each other.

Similarly, the sequence of commands that makes up a job (via Jenkins Job Builder) has to be crafted and approved by CI sysadmins (selected Wikimedia staff and contractors). That does not scale well. We need developers to be able to define their test environment themselves and modify their build commands over time. Changing the test command may seem rare in theory, but happens on a daily basis at our scale. In most cases it is less a change than an expansion or progression (e.g. adding or upgrading checkers and linters, or making their configuration stricter).

Builds all share a common set of UNIX credentials and environment variables. There is no secure way to provide different parameters to individual builds. This means we cannot grant one build permission to do something another build cannot. A typical example is publishing documentation generated by a build to a production web server over ssh: the ssh key cannot safely be made available to the build. Isolating each build would allow us to expose credentials at the host level to that one build only.

Finally, we have the issue of leftovers from previous builds interacting with future builds. We mitigated that by carefully cleaning up the workspace at the start of every build, but that is a tedious task and is often imperfect and inadequate. Likewise, we have had a few cases of race conditions when concurrent builds of the same job run on the same host.

In summary, we would like to lift these limitations and gain:

Ability to install additional packages on build servers.
Fine-tuned configuration of the environment.
Arbitrary command execution to match test needs.
Configuration delegated to developers instead of CI sysadmins.
One-time disposable sandbox for each build to avoid race conditions and ensure a clean testing environment.

To achieve these goals, we need to build a sandboxed infrastructure.

Architecture overview

To achieve isolation we would use KVM-based virtual machines. Wikimedia already maintains a cloud to host volunteer projects: Wikimedia Labs, which is based on OpenStack with KVM providing the virtualization. Moreover, the OpenStack Foundation itself uses Jenkins with an OpenStack cloud of its own to boot disposable sandboxes with Nodepool. By building on top of Wikimedia Labs and OpenStack CI's Nodepool, we start our journey on safe ground.

The new architecture is based on the previous one. As of October 2015, we continue to use Jenkins and Zuul on gallium.wikimedia.org. The architecture introduces two new servers directly in the Labs subnet (labnodepool1001.eqiad.wmnet and scandium.eqiad.wmnet). The first hosts Nodepool and communicates directly with the OpenStack API. The second will host two Zuul mergers, each bound to a virtual IP and a dedicated SSD (the git merge operation is mostly I/O bound). At first we will set up a single Zuul merger to simplify the migration.

Software

Nodepool

The centerpiece of the new infrastructure is Nodepool (upstream documentation), a system developed by the OpenStack infrastructure team to set up and maintain a pool of VMs that can be used by the Zuul scheduling daemon.

Nodepool uses the OpenStack API to spawn new instances and delete them once a job has finished. It also supports creating and refreshing Glance images, ensuring they are reasonably fresh when a new instance is spawned (e.g. puppet has run, git repos are pre-cloned and up to date, etc.).

At first, a pool of VMs is created. They are then dynamically registered as Jenkins slaves using Jenkins' REST API. Each Jenkins slave is given only one executor slot, since the VM will be used for a single build and deleted afterwards.

Representation of OpenStack CI instances

The graph above represents 24 hours of Nodepool activity for the OpenStack project. Each instance has four possible states:

Building (yellow): Instance is being spawned and refreshed (puppet run, setup scripts).
Available (green): Instance is pending job assignment.
In use (blue): Build is being run.
Deleting (purple): Build has finished and Zuul asked Nodepool to dispose of the VM.
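The four states above can be read as a simple lifecycle. The following is a minimal sketch of that lifecycle as a state machine; it assumes a strictly linear progression, whereas the real Nodepool may allow other transitions (for instance deleting an instance that failed to build). The class and function names are hypothetical.

```python
from enum import Enum


class NodeState(Enum):
    BUILDING = "building"    # being spawned and refreshed
    AVAILABLE = "available"  # waiting for a job assignment
    IN_USE = "in use"        # a build is running
    DELETING = "deleting"    # build finished, VM being disposed of


# Assumption: each state has exactly one successor, DELETING is terminal.
NEXT_STATE = {
    NodeState.BUILDING: NodeState.AVAILABLE,
    NodeState.AVAILABLE: NodeState.IN_USE,
    NodeState.IN_USE: NodeState.DELETING,
}


def advance(state: NodeState) -> NodeState:
    """Move a node to the next state of its lifecycle."""
    if state not in NEXT_STATE:
        raise ValueError(f"{state.name} is a terminal state")
    return NEXT_STATE[state]
```

Because a VM is never returned to the available state, each build gets a fresh instance.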

Nodepool is a Python daemon that communicates with four other systems:

Jenkins: over ZeroMQ, which requires Jenkins to have the ZMQ event publisher plugin installed. The TCP port is configurable.

Zuul: Nodepool connects to Zuul Gearman server (TCP port 4730).

Database: Nodepool requires an SQL server to keep the state of the VMs. MySQL with the InnoDB storage engine is recommended. Nodepool holds one database connection for each VM. OpenStack recommends supporting at least twice as many connections as the number of nodes we expect to use at once.

statsd: Report metrics about the pool of VMs, such as the graph above.
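Metrics reach statsd as small plaintext UDP datagrams. The following is a minimal sketch of reporting a gauge in the statsd line protocol; the metric name used in the example is hypothetical and does not reflect the names Nodepool actually emits, and the default host and port are the ones listed in the security matrix of this document.

```python
import socket


def format_gauge(name: str, value: int) -> bytes:
    """Encode a gauge in the statsd line protocol: <name>:<value>|g"""
    return f"{name}:{value}|g".encode("ascii")


def send_gauge(name: str, value: int,
               host: str = "statsd.eqiad.wmnet", port: int = 8125) -> None:
    """Send one gauge as a single fire-and-forget UDP datagram."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(format_gauge(name, value), (host, port))
```

For example, `format_gauge("nodepool.example.ready", 12)` yields the payload `nodepool.example.ready:12|g`.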

Nodepool is not available as a package in Debian or Ubuntu; see the Packaging section below.

Zuul

Zuul is the central scheduler. It listens for Gerrit events and reacts by triggering Gearman functions (such as triggering a build or creating a merge commit). We will add a parameter OFFLINE_NODE_WHEN_COMPLETE, which instructs Nodepool to depool the VM from Jenkins when a build has finished and then delete it.
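One way to inject the OFFLINE_NODE_WHEN_COMPLETE parameter is a Zuul parameter function, a Python hook that adjusts the parameters of a job before it is launched. The sketch below assumes the `(item, job, params)` signature used by Zuul's parameter-function option; the function name and the value `'1'` are assumptions, not taken from the Wikimedia configuration.

```python
def offline_when_complete(item, job, params):
    """Hypothetical Zuul parameter function.

    Assumption: Zuul calls this with the queue item, the job, and a mutable
    dict of parameters that is passed on to the build.
    """
    # Ask for the node to be taken offline (and hence deleted) after the build.
    params['OFFLINE_NODE_WHEN_COMPLETE'] = '1'
```

The function mutates the parameter dictionary in place, so existing parameters such as ZUUL_URL are left untouched.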

Zuul merger

When a patch is proposed, Zuul merges it on top of the tip of the target branch or on top of changes which are already enqueued for merging (and hence will be the new tip of the branch). To do so, the scheduler triggers a Gearman function to craft a merge commit; the function is run by a separate process, zuul-merger. Low network latency with Gerrit is important. These merge operations are I/O bound and, moreover, are done sequentially. To spread the load, we would need two separate zuul-merger processes, each bound to a dedicated SSD. Both can run on the same host, though. (To be confirmed.)
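To illustrate what "on top of changes which are already enqueued" means, the following minimal sketch computes, for each change in a queue, the stack of commits it would be merged onto. It is a simplified model and not Zuul's implementation; the function name and the change labels are hypothetical.

```python
def merge_plan(branch_tip: str, queue: list[str]) -> dict[str, list[str]]:
    """Map each enqueued change to the commits it is merged on top of."""
    plan = {}
    ahead = []  # changes enqueued before the current one
    for change in queue:
        # Each change is tested against the branch tip plus everything ahead.
        plan[change] = [branch_tip] + ahead
        ahead = ahead + [change]
    return plan
```

With the queue `["A", "B", "C"]`, change A is merged onto the branch tip alone, B onto the tip plus A, and C onto the tip plus A and B.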

The git repositories maintained by zuul-merger are exposed via git-daemon on port 9418. We would need each daemon to listen on a different IP address. Hence the server hosting them would need two sub interfaces with dedicated IPs. The zuul-merger handling the Gearman "merge" function will reply back with its IP address, which is then sent as a parameter to the Jenkins build. When executing on a VM, the build fetches the patch from the zuul-merger instance it has been given.

Until October 2015, the git repository was exposed on gallium under the canonical DNS name zuul.eqiad.wmnet. This has been changed to the host's fully qualified domain name to let us add several Zuul mergers. The URL is passed to the Gearman functions as the ZUUL_URL parameter and is then used by the Jenkins job to fetch the patch to be tested. In the new architecture, the Zuul merger git repositories would be in the labs subnet. One potential pitfall is making sure the production slaves (gallium and lanthanum) have access to these nodes in the labs subnet.
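A job could turn ZUUL_URL into git commands roughly as follows. This is a minimal sketch: besides ZUUL_URL it assumes the build also receives ZUUL_PROJECT and ZUUL_REF parameters (not mentioned in this document), and the example values are purely illustrative.

```python
def fetch_commands(env: dict[str, str]) -> list[list[str]]:
    """Build the git commands that fetch the patch prepared by a Zuul merger."""
    # Assumption: the merger serves each project under <ZUUL_URL>/<ZUUL_PROJECT>.
    remote = f"{env['ZUUL_URL']}/{env['ZUUL_PROJECT']}"
    return [
        ["git", "fetch", remote, env["ZUUL_REF"]],
        ["git", "checkout", "FETCH_HEAD"],
    ]


# Illustrative values only.
example_env = {
    "ZUUL_URL": "git://scandium.eqiad.wmnet",
    "ZUUL_PROJECT": "mediawiki/core",
    "ZUUL_REF": "refs/zuul/master/Zexample",
}
commands = fetch_commands(example_env)
```

Because the URL is a build parameter, each build fetches from whichever merger prepared its merge commit, which is what allows several mergers to coexist.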

Packaging

The legacy Jenkins master will stay as-is. To migrate it we would have to puppetize its configuration and package the multiple plugins we are using. That is a non-trivial effort which will further delay the isolation project. Moreover, we want to phase out use of Jenkins for CI needs and instead have instances register jobs directly with the Gearman server.

Zuul is not packaged, though some efforts have been made with upstream (pending at https://review.openstack.org/gitweb?p=openstack-infra/zuul-packaging.git). The zuul-cloner component will be needed on each instance, so a package is definitely needed (bug T48552).

Nodepool has been packaged (bug T89142) and we crafted a puppet manifest (bug T89143).

The rest of Zuul is fully puppetized.

Hardware

We had two servers allocated:

labnodepool1001.eqiad.wmnet: hosts Nodepool. The service is lightweight enough that the server could be shared with other services. The CI admins have access to it, though without root since that is not needed. Later, we might migrate the Zuul server to it.

scandium.eqiad.wmnet: will host the Zuul merger processes. The server has two SSDs in RAID with at least 128GB each. We will start with a single Zuul merger instance. Later, each of the two Zuul merger processes would be assigned one SSD. We can afford data loss: on replacement, the Zuul merger will simply re-clone the repositories from Gerrit.

These servers are placed inside the labs subnet for two reasons. Nodepool needs to interact with the OpenStack API directly, and connections to the labs subnet from the production realm are restricted. The Zuul mergers expose git repositories which the spawned VMs need to access, and the VMs cannot access the production realm.

The labs infrastructure might need to be allocated some additional hardware to support all the extra VMs. See the capacity planning section below.

Capacity planning

Number of VMs

In January 2015, the Jenkins installation had:

4 Precise instances, each having 5 executors (20).
5 Trusty instances, each having 5 executors (25).
2 Precise production servers with a total of 12 executors.

That is a total of 57 executors. We will probably want to start with a pool of 50 VMs. Depending on the time required to rebuild a VM and have it ready, and on the rate of consumption, we might need to increase the pool to support peaks.
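The executor total can be checked with a few lines of Python, using only the figures listed above; the dictionary keys are descriptive labels, not configuration names.

```python
# Executors per group in the January 2015 installation.
executors = {
    "Precise instances": 4 * 5,           # 4 instances, 5 executors each
    "Trusty instances": 5 * 5,            # 5 instances, 5 executors each
    "Precise production servers": 12,     # 2 servers, 12 executors in total
}

total_executors = sum(executors.values())
print(total_executors)  # 57
```

Since each disposable VM has a single executor, a pool of 50 VMs offers slightly fewer concurrent build slots than the 57 executors it replaces.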

In 2015 we have reduced the number of jobs being executed per change, most notably by using test entry points that execute several tests in a single job (e.g. npm test and composer test).

MariaDB

When Nodepool adds or deletes a node, it holds a database connection for each node. Thus, on startup with a pool of 100 VMs, it will hold 100 connections to the database while the instances are being spawned and set up. Upstream recommends configuring the database server to support at least twice as many connections as the number of nodes you expect to be in use at once. For 100 VMs that is 200 concurrent connections.
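The sizing rule is a simple doubling, sketched below. The function name is hypothetical; the factor of two is the upstream recommendation quoted above.

```python
def required_db_connections(peak_nodes: int) -> int:
    """Minimum connections the database should accept for a given pool size."""
    if peak_nodes < 0:
        raise ValueError("peak_nodes must not be negative")
    # Upstream recommendation: twice the number of nodes in use at once.
    return 2 * peak_nodes


print(required_db_connections(100))  # 200
```

For the initial pool of 50 VMs proposed in the previous section, the same rule gives 100 connections.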

Nodepool has been assigned to m5-master.

Security matrix

Protocol	Source Zone	Source IP	Dest Zone	Dest IP	Dest Port	Description
Misc
TCP	labs subnet	scandium	production	gallium	4730	Zuul merger to Zuul Gearman server
TCP	production	gallium	contintcloud	multiples	22	Jenkins server/client connection to VMs
Nodepool
TCP	labs subnet	labnodepool1001	production	m5-master	3306	Nodepool MariaDB connections
TCP	labs subnet	labnodepool1001	production	gallium	8888	Nodepool / Jenkins ZeroMQ
TCP	labs subnet	labnodepool1001	production	gallium	443	Nodepool / Jenkins REST API
TCP	labs subnet	labnodepool1001	labs subnet	virt????	TBD	Nodepool / wmflabs OpenStack API
Git to Zuul mergers
TCP	contintcloud	multiples	production	gallium	9418	Git connections from disposable VMs to the legacy Git daemon on gallium (transient, not needed if production slaves can reach the new Zuul mergers).
TCP	contintcloud	multiples	labs subnet	scandium then VIPs	9418	Git connections from disposable VMs to the Git daemons serving the Zuul merger repositories.
~~TCP~~	~~lanthanum~~	~~10.64.0.161~~	~~labs subnet~~	~~virt????~~	~~9418~~	~~Git connections from legacy production slave to the new Git daemons.~~
TCP	gallium	208.80.154.135	labs subnet	scandium then VIPs	9418	Git connections from legacy production slave to the new Git daemons.
Statsd
UDP	labs subnet	labnodepool1001	production	statsd.eqiad.wmnet	8125	Nodepool metrics to statsd
UDP	labs subnet	scandium	production	statsd.eqiad.wmnet	8125	Zuul merger metrics to statsd
UDP	production	gallium	production	statsd.eqiad.wmnet	8125	Zuul scheduler metrics to statsd

Note: Non-exhaustive list, to be refined.
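For reviewing or testing firewall changes, the matrix can be expressed as data. The following is a minimal sketch that encodes the rows above whose port is known (the OpenStack API row is omitted since its port is still to be determined, and the struck-through row is left out); the `Rule` type and `is_allowed` helper are hypothetical and do not correspond to any existing tooling.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    protocol: str
    source: str
    dest: str
    port: int


# Rows transcribed from the security matrix; hosts are identified by name.
RULES = {
    Rule("TCP", "scandium", "gallium", 4730),
    Rule("TCP", "gallium", "contintcloud", 22),
    Rule("TCP", "labnodepool1001", "m5-master", 3306),
    Rule("TCP", "labnodepool1001", "gallium", 8888),
    Rule("TCP", "labnodepool1001", "gallium", 443),
    Rule("TCP", "contintcloud", "gallium", 9418),
    Rule("TCP", "contintcloud", "scandium", 9418),
    Rule("TCP", "gallium", "scandium", 9418),
    Rule("UDP", "labnodepool1001", "statsd.eqiad.wmnet", 8125),
    Rule("UDP", "scandium", "statsd.eqiad.wmnet", 8125),
    Rule("UDP", "gallium", "statsd.eqiad.wmnet", 8125),
}


def is_allowed(protocol: str, source: str, dest: str, port: int) -> bool:
    """Return True if the flow matches a row of the matrix (default deny)."""
    return Rule(protocol, source, dest, port) in RULES
```

For instance, `is_allowed("TCP", "contintcloud", "m5-master", 3306)` is false: disposable VMs have no rule granting them access to the Nodepool database.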

gallium is in the production network with IP 208.80.154.135. It still has some jobs running that would need to reach scandium over git://.

Todo: Need to list what contintcloud VMs are allowed to do. For npm, pip, gem, packagist, we can use a shared web proxy and thus only allow traffic to it.