Continuous integration/Architecture/Isolation

The Wikimedia continuous integration system lacks the ability to securely run code proposed by people outside of the organization. This document proposes an architecture that spawns isolated virtual machines in which to run untrusted code. The concept is similar to Travis but homegrown and based on the OpenStack continuous integration infrastructure.

A first step was the RFC: sandboxing continuous integration document; the parts that still apply are reused here.

This architecture document is tracked in Phabricator as T86171.

= Context =

History
The continuous integration project at the Wikimedia Foundation started during summer 2011 with the installation of PHPUnderControl. An overhaul promptly followed and we migrated to Jenkins running on a single production machine: gallium (still in use as of May 2014).

Security was not much of an issue at the beginning since only people with an account in the Wikimedia Labs LDAP were allowed to send code that would trigger a run of unit tests on the production system. Once Wikimedia Labs opened self-registration to the world, we came up with a whitelist of participating developers. That approach showed its limits as our community kept growing.

Since some tests do not execute any code (e.g. linting PHP files), Timo Tijhof implemented two pipelines known as check and test. The check pipeline is unrestricted and always executes its jobs no matter who sends the patch; the test pipeline only runs when the patch proposer's email is in a whitelist defined in Zuul (the CI scheduler). Untrusted people hence have a limited number of tests run for them and lack the useful results of a full test run.
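The gating rule described above can be sketched in a few lines of Python. This is only an illustration of the logic, not the actual Zuul configuration: the function name, the whitelist contents and the email addresses are hypothetical.

```python
# Hypothetical sketch of the check/test gating rule; the real whitelist
# lives in the Zuul configuration and is not reproduced here.
TRUSTED_EMAILS = {"alice@example.org"}  # illustrative entry only


def pipelines_for(author_email, whitelist=TRUSTED_EMAILS):
    """Return the pipelines that would run for a patch from this author."""
    pipelines = ["check"]  # jobs that never execute the proposed code
    if author_email in whitelist:
        pipelines.append("test")  # jobs that do execute the proposed code
    return pipelines
```

A patch from an address outside the whitelist only gets the check pipeline, which is the limitation the rest of this document tries to remove.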

The user authentication overview is roughly:



Limitations with current system
Having to manually maintain a whitelist of users often causes community members to be treated as third-class citizens, and the process to whitelist them is obscure (one has to figure out where to add the email and which one to use).

The Jenkins infrastructure consists of two production servers and several labs instances. We do not let developers install whatever they might need to test their code, which means everything needs to be puppetized and available in Ubuntu Precise. Such centralization is a bottleneck that slows down the addition of new tests, since changes have to be crafted and approved by a very small team with the appropriate rights.

All jobs share the same UNIX credentials on a host, so we cannot give them any credentials: those could easily be disclosed by running test code. Isolating the tests would let us keep the credentials at the host level and have the host, rather than the job, run the tasks requiring them. A typical example is uploading documentation generated by a job to a production host over ssh: the ssh key must not be available to the job.

The tests run on production or labs instances, which do not let users finely configure the test environment, for example by installing and configuring additional packages or relying on third-party utilities. New additions need to be requested from the continuous integration team, which then makes the required changes to the host environment.

Similarly, the sequence of commands being executed is validated when defining jobs (via Jenkins Job Builder) and only Wikimedia staff and contractors can update the jobs. That does not scale anymore: we need any developer to be able to define their test environment easily and to list whatever commands match their use case.

Finally, we have an issue with leftovers from previous builds interacting with future builds. We mitigated that by carefully cleaning up the job workspaces before running them, but that is a tedious task and it is often missed. In the same vein, we had a few occurrences of race conditions when the same test suite was run concurrently on the same host.

In summary, lifting these limitations would give us:

 * the ability to install additional packages on the testing servers
 * fine tuning of the environment configuration
 * arbitrary command execution to match test needs
 * configuration delegated to the developers instead of a handful of people
 * one-time disposable sandboxes to avoid race conditions and ensure a clean testing environment

To achieve these goals, we need to build a sandboxed infrastructure.

= Architecture Overview =

To achieve isolation we would use KVM-based virtual machines. Wikimedia already maintains a virtualization infrastructure to host volunteer projects: WMFLabs, which is based on OpenStack with KVM providing the virtualization. Moreover, the OpenStack continuous integration system relies on an OpenStack cloud to boot disposable sandboxes. By reusing the WMFLabs infrastructure and the OpenStack CI code, we start our journey on safe ground.



The new architecture is built on top of the current one, namely reusing Jenkins and Zuul on gallium. It introduces two new servers directly in the labs subnet. The first hosts Nodepool and communicates directly with the labs OpenStack API. The second hosts two Zuul mergers, each bound to a virtual IP and a dedicated SSD (the git merge operation is mostly I/O bound).

= Software =

NodePool
The centerpiece of the new infrastructure is Nodepool (upstream documentation), software developed by the OpenStack infrastructure team to set up and maintain a pool of VMs which are used by the Zuul scheduling daemon.

Nodepool communicates over the OpenStack API to spawn instances and delete them once a job has completed. It also supports creating and refreshing Glance disk images, ensuring they are reasonably fresh when a new instance is built from them (e.g. Puppet has run, git repositories are pre-cloned and up to date).

At first, a pool of VMs is created; they are then dynamically added as Jenkins slaves using the Jenkins REST API. Each Jenkins slave has a single executor since the instance runs only one job.



The graph above represents 24 hours of the NodePool activity for the OpenStack project. Each instance has four possible states:


 * Building (yellow): instance is being spawned and refreshed (puppet run, setup scripts)
 * Available (green): instance is pending job assignment
 * In use (blue): job is being run
 * Deleting (purple): job is complete and Zuul asked to dispose of the VM
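
The four states above form a one-way lifecycle, since an instance is never reused after its job. The following is a minimal sketch of that lifecycle, not Nodepool's own code: the `NodeState` enum and the `advance` helper are hypothetical names introduced for illustration, and failure paths (for example a build that never becomes available) are deliberately left out.

```python
from enum import Enum


class NodeState(Enum):
    """The four states shown on the graph (names are illustrative)."""
    BUILDING = "building"    # instance is being spawned and refreshed
    AVAILABLE = "available"  # instance is pending job assignment
    IN_USE = "in use"        # job is being run
    DELETING = "deleting"    # job is complete, VM is being disposed of


# Each state has exactly one successor; DELETING is terminal.
NEXT_STATE = {
    NodeState.BUILDING: NodeState.AVAILABLE,
    NodeState.AVAILABLE: NodeState.IN_USE,
    NodeState.IN_USE: NodeState.DELETING,
}


def advance(state):
    """Return the state following `state`, or raise if the VM is gone."""
    if state not in NEXT_STATE:
        raise ValueError(f"{state.name} is terminal: the VM is never reused")
    return NEXT_STATE[state]
```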

NodePool is a Python daemon which communicates with four other systems:

 * Jenkins: over ZeroMQ, requires the Jenkins ZMQ Event Publisher plugin. The TCP port is set in the plugin.
 * Zuul: NodePool connects to the Zuul Gearman server (TCP port 4730).
 * Database: requires a SQL server to keep the state of the VMs. MySQL with the InnoDB storage engine is recommended. NodePool holds a database connection for each VM; upstream recommends supporting at least twice as many connections as the number of nodes we expect to be in use at once.
 * statsd: reports metrics about the pool of VMs, such as the graph above.

NodePool is not packaged for Debian/Ubuntu.

Zuul
Zuul is the central scheduler. It listens for Gerrit events and reacts by triggering Gearman functions (such as running a job or creating a merge commit). We would add a job parameter, OFFLINE_NODE_WHEN_COMPLETE, which instructs NodePool to depool the VM from Jenkins on job completion and delete it.
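
One plausible way to inject that parameter is a Zuul parameter function, a Python hook that Zuul 2.x can call on the job parameters before launching a job. The sketch below assumes that mechanism and the `(item, job, params)` signature; the function name and the value `"1"` are assumptions, not something this document specifies.

```python
def set_node_options(item, job, params):
    """Hypothetical Zuul parameter function.

    Assumes Zuul calls it with the queue item, the job and a mutable
    dict of job parameters before the job is launched.
    """
    # Ask for the Jenkins slave to be taken offline once the job finishes,
    # so the VM is never handed a second job. The value "1" is an assumption.
    params["OFFLINE_NODE_WHEN_COMPLETE"] = "1"
```

Only jobs meant to run on disposable VMs would reference such a function; jobs staying on the existing slaves would be left untouched.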

Zuul merger
When a patch is proposed, Zuul merges it on top of the tip of the branch or on top of changes which are already enqueued for merging (and hence will be the new tip of the branch). To do so, the scheduler triggers a Gearman function to craft a merge commit; the function is handled by a separate process: zuul-merger. The operation is I/O bound (network latency with Gerrit and git disk I/O); moreover, the merges are done sequentially. To spread the load, we would need two separate zuul-merger processes, each bound to a dedicated SSD. Both zuul-mergers can run on the same host though (to be confirmed).

The git repositories maintained by zuul-merger are exposed via git-daemon on port 9418. We would need each daemon to listen on a different IP address, hence the server hosting them would need two sub-interfaces, each with a dedicated IP. The zuul-merger handling the function replies with its IP address, which is then sent as a parameter to the Jenkins job. When executing on a VM, the job fetches the patch from the zuul-merger instance it has been given.
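
As an illustration of that last step, the job could build its fetch command from the merger address it received. This is a sketch under assumptions: the helper name is hypothetical, and the project and ref are expected to arrive as job parameters alongside the merger IP.

```python
def merger_fetch_command(merger_ip, project, ref, port=9418):
    """Build the git command fetching `ref` from a given zuul-merger.

    All arguments are expected to come from the Jenkins job parameters;
    9418 is the git-daemon port mentioned above.
    """
    url = f"git://{merger_ip}:{port}/{project}"
    return ["git", "fetch", url, ref]


# Example with made-up values (192.0.2.10 is a documentation-only address).
cmd = merger_fetch_command("192.0.2.10", "mediawiki/core", "refs/zuul/master/Zexample")
```

Returning the command as a list makes it directly usable with `subprocess.run(cmd, check=True)` inside a cloned repository, without shell quoting concerns.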

= Hardware =

We would need two new servers:


 * labci001: hosts Nodepool. The service is probably lightweight enough to share a host with other services, though the CI admins will need access to it and will probably end up requiring root access on the host to conduct upgrades. We could later migrate the Zuul server to it.


 * labci002: hosts the Zuul merger processes. The server would need two SSDs of at least 128GB, one assigned to each of the two Zuul merger processes. We can afford data loss: on replacement, the Zuul merger will simply reclone the repositories from Gerrit.

The two servers should be placed inside the labs subnet. NodePool needs to interact with the labs OpenStack API directly, and connections to the labs subnet from the production realm are restricted. The Zuul mergers expose the git repositories that are fetched by the labs instances, which cannot access the production realm.

The labs infrastructure might need some new hardware allocated to host the spawned VMs. See the capacity planning section below.

= Capacity planning =

Number of VMs
The current Jenkins installation has:


 * 4 Precise instances each having 5 executors (20)
 * 5 Trusty instances each having 5 executors (25)
 * 2 Precise production servers for a total of 12 executors

That is a total of 57 executors. We will probably want to start with a pool of 50 VMs. Depending on the time required to rebuild a VM and have it ready, and on the rate of consumption, we might need to increase the pool to be able to absorb peaks.
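
The executor count above, and a rough way to reason about pool size, can be written down as follows. The `nodes_needed` helper is a back-of-the-envelope estimate introduced here for illustration: it assumes a steady job rate and that each VM is tied up for the job duration plus its rebuild time. The example rate and durations are made up, not measurements.

```python
import math

# Executor counts taken from the list above.
executor_groups = {
    "Precise labs instances": 4 * 5,
    "Trusty labs instances": 5 * 5,
    "Precise production servers": 12,
}
total_executors = sum(executor_groups.values())  # 57


def nodes_needed(jobs_per_minute, job_minutes, rebuild_minutes):
    """Estimate VMs busy at once: arrival rate times time each VM is tied up."""
    return math.ceil(jobs_per_minute * (job_minutes + rebuild_minutes))


# Hypothetical load: 2 jobs/minute, 10 minute jobs, 5 minute rebuilds.
estimate = nodes_needed(2, 10, 5)  # 30 VMs under these assumed numbers
```

Peaks above the assumed rate would need extra headroom on top of such an estimate.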

We have started reducing the number of jobs being executed per change, most notably by using a test entry point that executes several tests in a single job (ex: npm test, composer test).

MariaDB
When NodePool adds or deletes nodes, it holds open a database connection for each node. Thus, when starting with a pool of 100 VMs, it will hold 100 connections to the database while the instances are being set up. Upstream recommends configuring the database server to support at least twice as many connections as the number of nodes you expect to be in use at once. For 100 VMs that is 200 concurrent connections.
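
The sizing rule is simple enough to express directly. The helper name below is hypothetical; the factor of two is the upstream recommendation quoted above.

```python
def min_db_connections(nodes_in_use):
    """Minimum connections the database should accept: twice the node count."""
    return 2 * nodes_in_use


connections_for_100 = min_db_connections(100)  # 200, as stated above
connections_for_50 = min_db_connections(50)    # 100 for the initial pool of 50 VMs
```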

= Security matrix =

Note: Non-exhaustive list, to be refined.

TODO: list what the `contintcloud` VMs are allowed to access. For npm, pip, gem and packagist, we can use a shared web proxy and thus only allow traffic to it.