Wikimedia Release Engineering Team/Seakeeper proposal

Background

Engineering productivity is a highlighted Medium Term Plan priority of Wikimedia Foundation’s Technology Department. The work in support of this priority is led by the Engineering Productivity team, with support and co-leadership from SRE’s Service Operations team. Together, they have been hard at work on a number of projects aimed at improving practices and supporting systems for the development, testing, delivery, and operation of MediaWiki-related software:

  • A MediaWiki development environment that seeks better parity with testing, beta, and production environments.[1]
  • Testing frameworks and practices for MediaWiki projects that increase our confidence in the software we deploy.[2]
  • Production Kubernetes clusters for the efficient deployment of software and better distribution of computing needs across our various data centers.[3]
  • A new continuous delivery system that will empower developers to control exactly how their software is gated and delivered to production.[4]
  • Migration to a more performant, understandable, and maintainable continuous integration system.[5] Zuul v2 and the Jenkins Gearman plugin, our current gating systems, are currently unmaintained by their respective upstreams, and Jenkins is undergoing drastic changes to its architecture to be more "cloud native," away from the central-server model on which we currently rely.

These projects and initiatives are all closely interrelated, but one stands out as the central and critical component of a more productive developer future at WMF.

Without a healthier and more capable CI system, these other projects and systems will be hamstrung. A new and better CI is required.

Evaluations

Thanks to Lars Wirzenius from Release Engineering, who authored a thorough and thoughtful New CI Architecture document and led the formation of a CI Futures Working Group in 2019, we were able to systematically and collaboratively evaluate a number of candidate CI platforms that might better serve WMF Technology and Wikimedia project contributors, and eventually decide on a next generation platform.

Process and timeline

February 2019

Informed by the high-level architecture document, the CI Futures Working Group was established to research and evaluate existing CI platforms. Lars Wirzenius first announced the formation of the working group on wikitech-l, engineering-l, and qa-l, calling for feedback and participation.

An initial one-month charter was granted by Greg Grossmeier, with initial working group members consisting of Lars Wirzenius, Brennen Bearnes, and Željko Filipin. Dan Duvall would join in the beginning of March following his return from vacation.

March 2019

By the end of its initial charter, a set of requirements had been established with which to evaluate various CI platforms.

Ten candidate platforms were divided among WG members and evaluated based on these requirements, with evaluation process and results tracked by a series of Phabricator tasks. At the end of the process, three candidates were deemed suitable enough to move to the next round of evaluation: Argo, GitLab CI, and Zuul v3.

June 2019

The CI Futures WG’s charter was extended and a second, proof-of-concept, phase of evaluation began with an announcement by Lars to the same lists as before.

"[...] We are currently writing up what the new CI system should look like in more detail. The approach taken is to start with what's needed and wanted, rather than what the tools provide. The document has had a first round of internal review, to get rid of the worst issues, and v2 is now open for feedback from the whole movement. [...]"

August/September 2019

A long process of setting up and evaluating proofs of concept began in earnest in early August and extended into mid-September. Installations of Argo, GitLab CI, and Zuul v3 were all completed, the first using GKE and the latter two using local minikube environments.

Following the proof-of-concept evaluations, the WG conducted a vote by ranking each system based on distillations of the high-level requirements.[6]

Through these processes and informed but not bound by the results of the vote, the CI WG decided that Argo would be the proposed core software component for our future CI system.

Argo

For much more detail on the pros and cons of Argo, see its WG proof of concept and evaluation, but its aspects most relevant to this proposal for a CI Kubernetes cluster are the following.

Concepts and design

  • Argo is a "cloud native," or more precisely a "k-native" (Kubernetes) system, meaning it relies on its underlying cluster orchestration and computing platform to do the heavy lifting in executing workloads—in this context, workloads are essentially CI builds.
  • The Argo umbrella project comprises a few different subsystems (Workflow, Events, UI) that can be used to construct a working CI. Each can be installed into its own k8s namespace for better isolation.
  • The Argo Workflow project implements a Workflow Custom Resource Definition (CRD) and a controller that responds to CRUD events for Workflow resources within the Kubernetes cluster on which the controller is running.
  • A Workflow object defines a number of steps that can either be serially executed or scheduled in parallel—in directed acyclic graph (DAG) fashion. Each step is executed as a Pod in the same namespace as the Workflow.
  • Argo Events runs as a separate subsystem responsible for consuming one or more external events and responding to them by creating any number of k8s resources including, but not limited to, Workflow resources.
  • Argo UI runs as a separate subsystem and lets users get (read-only) workflow outputs, histories, artifacts, etc.
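
As a rough illustration of the Workflow concept described above, the sketch below writes a hypothetical Workflow manifest as a Python dict and groups its DAG tasks into waves that could run in parallel. The task names, the container image, and the `execution_waves` helper are assumptions made for illustration. Only the general manifest shape (`argoproj.io/v1alpha1`, `entrypoint`, `templates`, `dag.tasks`, `dependencies`) follows Argo Workflow conventions. The real scheduling is done by the workflow controller, not by code like this.

```python
from graphlib import TopologicalSorter

# Hypothetical Workflow manifest, expressed as a Python dict for illustration.
workflow = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "Workflow",
    "metadata": {"generateName": "example-ci-"},
    "spec": {
        "entrypoint": "ci",
        "templates": [
            {
                "name": "ci",
                "dag": {
                    "tasks": [
                        {"name": "checkout", "template": "run"},
                        {"name": "lint", "template": "run", "dependencies": ["checkout"]},
                        {"name": "test", "template": "run", "dependencies": ["checkout"]},
                        {"name": "report", "template": "run", "dependencies": ["lint", "test"]},
                    ]
                },
            },
            # Each task runs this container template as its own Pod.
            {"name": "run", "container": {"image": "example/ci:latest", "command": ["true"]}},
        ],
    },
}


def execution_waves(manifest):
    """Group the entrypoint's DAG tasks into waves of mutually independent tasks."""
    entry = manifest["spec"]["entrypoint"]
    template = next(t for t in manifest["spec"]["templates"] if t["name"] == entry)
    # Map each task to the set of tasks it depends on.
    graph = {t["name"]: set(t.get("dependencies", [])) for t in template["dag"]["tasks"]}
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


print(execution_waves(workflow))  # [['checkout'], ['lint', 'test'], ['report']]
```

Here `lint` and `test` depend only on `checkout`, so they land in the same wave, which is the "scheduled in parallel" case from the list above.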

Below is a logical design diagram for the Argo proof of concept used during evaluation.

A logical diagram of an Argo system set up by Wikimedia Release Engineering in 2019 as a proof of concept for a future continuous integration system.

Note that integration with Gerrit for the PoC was accomplished via webhooks, and reporting back to Gerrit was done by spawning reporter workflows.

There are many possible ways of approaching integration due to Argo Events being a generalized event-processing system. For example, Kafka could be used as an intermediate event broker; Gerrit has a Kafka plugin for its event stream; Argo Events already provides a gateway for consuming Kafka events.

Better infrastructure needed

A new CI system will only be as healthy as its underlying infrastructure, and our existing VPS-based CI infrastructure is insufficient for running Argo. Aside from the obvious constraints of a k-native system requiring Kubernetes, running Argo on VPS instances would increase contention for computing resources already in high demand.

Proposal

Release Engineering (RelEng) proposes that a new Kubernetes cluster be provisioned by SRE’s Service Operations (ServiceOps) with sufficient computing resources for running all CI subsystems and developer-provided CI workloads. RelEng also requires that the system provide persistent volumes for performant and long-term artifact storage. After the transition, the existing CI resources from Cloud Services will be released.

Resource requirements

RelEng would prefer to leave hardware specifications to ServiceOps, as that is their domain of expertise. However, by analyzing our current Jenkins build trends and the resource quotas of the integration WMCS project, RelEng can propose a reasonable starting point for cluster resource allocation.

Build concurrency

One simple method of determining usage patterns for our current CI system is to analyze the daily maximum and median concurrency of builds executed by our Jenkins installation.[7] From the daily maximum, we can determine a rough acceptable upper bound for cluster capacity. From the daily median and seven-day moving median, we can see what weekly usage patterns are like as well as project a rough two-month forecast.
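
The statistics described above can be sketched in a few lines. The sample numbers below are invented for illustration and are not real Jenkins data, and the `moving_median` helper is a hypothetical name. It uses a trailing window, which is an assumption: the source does not say whether the window was trailing or centered.

```python
from statistics import median


def moving_median(values, window=7):
    """Trailing moving median; the first window-1 entries have no value yet."""
    return [
        median(values[i - window + 1 : i + 1]) if i >= window - 1 else None
        for i in range(len(values))
    ]


# Hypothetical samples of concurrent builds taken over one day.
one_day = [0, 0, 3, 12, 20, 5, 1]
print(max(one_day), median(one_day))  # 20 3

# Hypothetical daily medians for nine consecutive days.
daily_medians = [4, 6, 5, 7, 1, 0, 5, 6, 8]
print(moving_median(daily_medians))
```

The idle samples (zeros) pull an average down far more than they pull the median, which is the reason given in the endnotes for preferring medians.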

The following chart represents concurrency patterns for Docker-based Jenkins builds from June 9, 2019 through October 22, 2019. Maximum and median daily values are plotted alongside a seven-day moving median. Forecasting was accomplished by training an ARIMA model with historical data in AWS Forecast, and a linear trend was drawn along the P90 forecast to give us a very conservative growth rate for the coming months.

A time series graph showing build concurrency maximums, moving medians, and an ARIMA forecast for Wikimedia Foundation's continuous integration system in FY19Q1.

The figure shows a bursty usage pattern with weekly seasonality that averages out to a slightly upward linear trend over the near term.

Along the P90 forecast, two future values are plotted, one on Oct 23 and the other on Nov 20. The difference between those values can be used to calculate a 28-day growth coefficient of 0.0492. Considering there are nine 28-day periods between Oct 23 and RelEng’s current OKR for partial migration to a new CI system by end of FY19-20, a conservative approach would be to set the target for concurrent build capacity of the new system at 44.3% above current capacity (9 * 0.0492 = 0.4428, applied linearly rather than compounded). That would give us a very conservative, and rough, basis for cluster resource requirements from now through the migration period.
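
The arithmetic in the preceding paragraph can be checked with a short sketch. The fiscal year end date of June 30, 2020 is an assumption (the text only says "end of FY19-20"), and the variable names are illustrative.

```python
from datetime import date
from math import ceil

GROWTH_PER_PERIOD = 0.0492  # 28-day growth coefficient taken from the P90 forecast
PERIOD_DAYS = 28

start = date(2019, 10, 23)
fiscal_year_end = date(2020, 6, 30)  # assumption: FY19-20 ends on June 30, 2020

# Number of 28-day periods needed to cover the span, rounded up.
periods = ceil((fiscal_year_end - start).days / PERIOD_DAYS)

# Linear extrapolation, as in the text: coefficient times number of periods.
target_growth = periods * GROWTH_PER_PERIOD

print(periods)                  # 9
print(round(target_growth, 3))  # 0.443
```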

Note that CI usage patterns are likely to change substantially after the migration has been completed, as repo owners will have greater control over their job definitions and the ability to define workflows that branch to a greater degree of parallel execution.

Current VPS quotas

There are currently 17 instances running in the integration project that are registered as Jenkins agents for running Docker-based builds. Each is configured to use the mediumram flavor which allocates 8 vCPU, 24G memory, and 80G disk space. Each agent/instance is configured on the Jenkins master to allow for 4 concurrent executors.

                       vCPU   Memory (G)   Executors
Each instance             8           24           4
Total cluster (x17)     136          408          68
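
The totals above follow directly from the per-instance figures. A minimal sketch, with illustrative variable names, that also derives the share of resources behind each executor:

```python
INSTANCES = 17
PER_INSTANCE = {"vcpu": 8, "memory_g": 24, "executors": 4}

# Cluster totals are the per-instance figures multiplied by the instance count.
totals = {name: value * INSTANCES for name, value in PER_INSTANCE.items()}

# Resources available to each executor if an instance is split evenly.
per_executor = {
    "vcpu": PER_INSTANCE["vcpu"] / PER_INSTANCE["executors"],
    "memory_g": PER_INSTANCE["memory_g"] / PER_INSTANCE["executors"],
}

print(totals)        # {'vcpu': 136, 'memory_g': 408, 'executors': 68}
print(per_executor)  # {'vcpu': 2.0, 'memory_g': 6.0}
```

The even split per executor is a simplification: builds on the same instance share its resources and are not guaranteed a fixed slice.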

New cluster resource estimation

Total vCPU and memory are the constraining resources for the number of possible executors, which in turn sets the upper bound for build concurrency. Disk space allocation follows as a requirement for each registered executor and doesn’t need cluster-wide estimation.

Applying the 44.3% 9x28-day growth rate to the table of current cluster resources gives us a conservative estimate for the initial workload computing needs of a new CI cluster by the target date for our goal of partial migration.[8]

                                     Target date   vCPU   Memory (G)   Executors
Current cluster quota (Q)            current        136          408          68
Seakeeper cluster (ceil(Q*1.443))    2020-07-01     197          589          99
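
The Seakeeper row can be reproduced by applying ceil(Q*1.443) to each current quota. A small sketch with illustrative names:

```python
from math import ceil

GROWTH = 0.443  # 44.3% target growth over nine 28-day periods
current_quota = {"vcpu": 136, "memory_g": 408, "executors": 68}

# Round up so the estimate never falls below the projected need.
seakeeper = {name: ceil(q * (1 + GROWTH)) for name, q in current_quota.items()}

print(seakeeper)  # {'vcpu': 197, 'memory_g': 589, 'executors': 99}
```

As the endnotes point out, these figures cover CI workloads only and leave out what Argo itself and other cluster subsystems would consume.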

System design

Deployment of Argo to a WMF CI Kubernetes cluster could be done in a number of different ways. Below is a tentative deployment model that isolates various CI subsystems based on the access needs of external actors and system processes. RelEng is completely open to different deployment models and would work with SRE ServiceOps to achieve an overall system that is both secure and accessible to CI users and admins.

The subsystem interactions, actors, and data flows of a proposed future CI system based on Argo.

The above diagram represents how the tenant CI processes (the Argo subsystems, Gerrit integration processes, and project workflows) might be deployed to a given Kubernetes cluster. Of particular focus is the separation of pods and services into namespaces based on how processes and actors will interact.

argo-system

Living in this namespace are the core controllers for both Argo Workflow and Argo Events. All of these controllers require a degree of cluster-wide access to fulfill changes made to their respective custom resources—fulfillment in this case typically means deployments to the same namespaces in which each custom resource was created.

processes

  • gateway-controller – Fulfills argoproj.io/Gateway resource creation and modification in any cluster namespace. Specifications for Gateway resources typically encapsulate two container definitions (client and server) for a single pod. The pod is spawned in the namespace of its parent Gateway resource.
  • sensor-controller – Fulfills argoproj.io/Sensor resource creation and modification in any cluster namespace. Specifications for Sensor resources typically encapsulate one container/pod definition and one service definition. Both are spawned in the namespace of their parent Sensor resource.
  • workflow-controller – Fulfills argoproj.io/Workflow resource creation and modification in any cluster namespace. Specifications for Workflow resources encapsulate a number of pods/containers ("steps") that can be executed either serially or by a directed acyclic graph (DAG) scheduler. All pods are spawned in the namespace of their parent Workflow resource.

actors/access

RelEng has sufficient access to this namespace to perform periodic (re-)deployments of core Argo and Argo Events controllers.

gerrit-gateway

Living in this namespace are the Sensor and Gateway resources needed for downstream Gerrit integration—specifically the consumption and handling of events such as patchset-created and ref-updated.

processes

  • gerrit-gateway – An argoproj.io/Gateway encapsulating a single pod of two containers.
    1. A "server" process implemented by RelEng that connects to the Gerrit event stream over SSH, reading in JSON event data.
    2. The standard Argo gateway client that ferries Gerrit JSON event payloads from the gateway to subscribed Sensor pods.
  • gerrit-event-sensor – An argoproj.io/Sensor encapsulating the following.
    1. A standard Argo pod/container that reads in event payloads from the gerrit-gateway.
    2. A number of configured filters and conditions that limit which Gerrit events are handled. These filters are highly configurable.
    3. A trigger telling the sensor pod to spawn a new Workflow based on a definition retrieved from the event payload’s project repo, nominally the refs/meta/config branch so as to restrict access to Workflow definitions.
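
The filter-then-trigger flow described above might look roughly like the following. This is a plain Python sketch of the logic, not Argo Events configuration. The `HANDLED_TYPES` set, the `handle_line` helper, and the shape of the returned trigger description are all hypothetical. The event field names (`type`, `change.project`, `refUpdate.project`) follow the JSON emitted by Gerrit's event stream.

```python
import json

# Hypothetical filter: only these Gerrit event types lead to a Workflow.
HANDLED_TYPES = {"patchset-created", "ref-updated"}


def project_of(event):
    """Extract the Gerrit project name from a stream event."""
    if event["type"] == "patchset-created":
        return event["change"]["project"]
    if event["type"] == "ref-updated":
        return event["refUpdate"]["project"]
    return None


def handle_line(line):
    """Return a trigger description for a handled event, or None if filtered out."""
    event = json.loads(line)
    if event.get("type") not in HANDLED_TYPES:
        return None
    return {
        "event_type": event["type"],
        "project": project_of(event),
        # The Workflow definition would be read from this restricted branch.
        "definition_ref": "refs/meta/config",
    }


created = '{"type": "patchset-created", "change": {"project": "mediawiki/core", "number": 1}}'
comment = '{"type": "comment-added", "change": {"project": "mediawiki/core", "number": 1}}'

print(handle_line(created))
print(handle_line(comment))  # None
```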

actors/access

Access would be granted to RelEng for (re-)deployment of the Sensors and Gateways needed for Gerrit integration, as well as other supporting controllers.[9]

argo-ui

Living in this namespace would be Argo’s web server.

processes

  • argo-ui – A deployment of pods/containers for Argo’s web server and artifact proxy, and a service that exposes it publicly.

actors/access

Access would be granted to ServiceOps for periodic (re-)deployment of the Argo web server. Public access would be granted for the Argo UI over HTTP.[10]

project namespaces

In this deployment model, Workflow resources are sequestered by the gerrit-event-sensor to a namespace based on the project of the originating Gerrit event.[11] Assigning namespaces per project affords a high degree of access control which in turn allows for use of standard tooling—argo and kubectl, for example—by trusted users. Additionally, project-based namespace assignment would allow for better scheduling of compute resources, ensuring that workloads for deployments, gating of essential patchsets, and other critical jobs, are not starved.
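
One practical detail of per-project namespaces is that Gerrit project names such as mediawiki/extensions/VisualEditor are not valid Kubernetes namespace names, which must be DNS labels: lowercase alphanumerics and hyphens, at most 63 characters, starting and ending with an alphanumeric. The sketch below shows one possible mapping. The `ci-` prefix, the hashing scheme for long names, and the `namespace_for` helper are assumptions for illustration, not part of the proposal.

```python
import hashlib
import re

MAX_LABEL = 63  # maximum length of a DNS label, and so of a namespace name


def namespace_for(project, prefix="ci-"):
    """Map a Gerrit project name to a valid namespace name (hypothetical scheme)."""
    # Lowercase, then collapse every run of disallowed characters into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", project.lower()).strip("-")
    if not slug:
        raise ValueError(f"cannot derive a namespace from {project!r}")
    name = prefix + slug
    if len(name) > MAX_LABEL:
        # Truncate and append a short digest so long names stay distinct.
        digest = hashlib.sha1(project.encode()).hexdigest()[:8]
        name = name[: MAX_LABEL - 9].rstrip("-") + "-" + digest
    return name


print(namespace_for("mediawiki/extensions/VisualEditor"))
# ci-mediawiki-extensions-visualeditor
```

Lowercasing and collapsing separators means two distinct projects could map to the same namespace, so a real scheme would need collision handling.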

processes

  • [project]-[n] – Workflow resources created by the gerrit-event-sensor based on configuration retrieved from the project repo’s refs/meta/config branch and the Gerrit event payload. Each Workflow can constitute any number of pods scheduled to execute serially or in parallel according to the Workflow.

actors/access

Repo owners would be granted limited access to their project namespace in order to make use of argo CLI tooling for the management of Workflow and related resources. Ideally we would also grant read/watch access to a trusted subset of contributors.

A public view of workflow status and history would remain limited to the Argo UI.

Administrative access

While RelEng ultimately defers to SRE on matters of cluster security, RelEng requires sufficient administrative access to namespaces occupied by Argo and other CI subsystems.

Network access

To achieve immediate parity with our existing CI system, unrestricted egress traffic from project namespaces is the current expectation.

Known unknowns

Gerrit reporting

The Argo proof of concept implemented Gerrit reporting by installing an additional Gateway and Sensor pair that listened for the completion of project Workflow resources and spawned its own reporter Workflow resources in a restricted namespace—restricted to isolate the Gerrit user credentials needed to comment on patchsets. While this implementation was sufficient for the proof of concept, it incurred an unacceptable degree of overhead due to the setup/teardown of each workflow/pod for the sake of making a single Gerrit API request.

A better implementation of Gerrit reporting will be needed for a production rollout. One solution might be to implement and install a persistent controller into the proposed gerrit-gateway namespace which can watch for Workflow completion and continuously make Gerrit API requests without having to spawn additional Kubernetes resources.
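
The core of such a persistent reporter could be a small function that turns a finished Workflow into a single Gerrit review request. The sketch below only builds the request and makes no network call. The mapping from Workflow phase to a Verified vote, the message text, and the `review_request` helper are hypothetical. The endpoint path follows Gerrit's REST API for setting a review, and `status.phase` is where Argo records a Workflow's outcome.

```python
from urllib.parse import quote

# Hypothetical mapping from a finished Workflow phase to a Verified vote.
PHASE_TO_VERIFIED = {"Succeeded": 1, "Failed": -1, "Error": -1}


def review_request(base_url, change_number, revision, workflow):
    """Build (url, json_body) for a Gerrit review, or None if not finished."""
    phase = workflow.get("status", {}).get("phase")
    if phase not in PHASE_TO_VERIFIED:
        return None  # still pending or running, nothing to report yet
    name = workflow["metadata"]["name"]
    path = "/a/changes/{}/revisions/{}/review".format(
        quote(str(change_number), safe=""), quote(str(revision), safe="")
    )
    body = {
        "message": f"Workflow {name} finished with phase {phase}.",
        "labels": {"Verified": PHASE_TO_VERIFIED[phase]},
    }
    return base_url.rstrip("/") + path, body


done = {"metadata": {"name": "example-ci-abc12"}, "status": {"phase": "Succeeded"}}
running = {"metadata": {"name": "example-ci-def34"}, "status": {"phase": "Running"}}

print(review_request("https://gerrit.example.org/", 12345, "1", done))
print(review_request("https://gerrit.example.org/", 12345, "1", running))  # None
```

A long-running process calling a function like this once per completed Workflow avoids the per-report pod setup and teardown that made the proof-of-concept approach expensive.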

Project namespaces

As mentioned in the design section, having one namespace per project is just one possible model. While a single Kubernetes cluster can theoretically host up to 10,000 namespaces, it’s unknown to what degree the control-plane performance might suffer from having a large number of namespaces defined.[12]

Another question that arises from this proposed model is: How are such project namespaces created? Given we’re working toward a self-serve system, this should be streamlined from a user perspective yet not triggered in a way that might compromise security.

For example, Argo Events does have the ability to create arbitrary resources based on consumed events. One possible model for automated namespace creation might be to run a Sensor/Gateway pair—in a protected namespace—that listens for a Gerrit event signaling the creation of a Workflow definition in refs/meta/config, and have the sensor provision the project namespace.

Gating

One of our current CI subsystems, Zuul v2, provides a feature known as project gating.

Gating ensures that a change will not merge until all of its project’s builds pass. Builds in this context comprise linters, test suites, and any other automated processes that can determine an acceptable degree of correctness for the repo state after a change is applied.

In addition to providing gating for a linear set of changes made to single, independent projects, Zuul provides a mechanism for gating changes made to multiple projects that have upstream/downstream relationships (i.e. cross-project dependencies) and are submitted concurrently. This mechanism is called the Dependent Pipeline Manager.
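
The effect of dependent gating can be modeled in a few lines. This is a simplified sketch of the outcome only: each queued change is tested on top of the changes accepted ahead of it, and a failing change is dropped so that later changes are evaluated without it. Zuul reaches this result with speculative parallel builds, which the model does not attempt to reproduce. The `gate` and `passes` functions and the changes A, B, and C are invented for illustration.

```python
def gate(queue, passes):
    """Return (merged, rejected) for an ordered queue of changes."""
    merged, rejected = [], []
    for change in queue:
        # Test the change together with everything already accepted ahead of it.
        if passes(merged + [change]):
            merged.append(change)
        else:
            rejected.append(change)
    return merged, rejected


def passes(stack):
    """Hypothetical build result: B is broken, and C only works with A present."""
    if "B" in stack:
        return False
    if "C" in stack and "A" not in stack:
        return False
    return True


print(gate(["A", "B", "C"], passes))  # (['A', 'C'], ['B'])
```

In the cross-project case, A and C would belong to different repositories, which is what makes the feature harder to reproduce than gating within one project.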

Independent gating is an essential feature and can likely be achieved to some degree with Argo Events. Whether cross-project-dependency gating is necessary for our projects and workflows has been much discussed. Because it is a feature of our current system, however, we must evaluate how it might be implemented in the new system to achieve immediate parity.

Artifacts

Argo provides a generalized interface for artifact saving and retrieval. Minio—an S3 compatible object store—was used for the proof-of-concept. We’ll need to further evaluate which backend artifact storage system is best for a production rollout.

Security workflows

Security patches and other sensitive workflows are conceptually possible given the proposed system design. For example, an additional Sensor could be installed to a restricted namespace that would respond to security patches submitted to some trusted external system (Phabricator, restricted Gerrit branches, etc.), spawning Workflows in the same or another restricted namespace. However, more research on this facet of the system, and security requirements in general, is needed.

Endnotes

  1. FY2019/TEC12/O1, Developer Productivity, "Local development is unified with testing and production"
  2. FY2019/TEC3/O2, Deployment Pipeline, "Deployers have a better assessment of risk with each deploy"
  3. [[1]], Deployment Pipeline, "Deployments happen through percentage based stages (eg: canaries, 10%, 100%)," "Developers are able to create services that achieve production level standards with minimal overhead," "Services and the deployment pipeline are hosted on production-level infrastructure"
  4. FY2019/TEC3/O6, Deployment Pipeline, "Developers and deployers are aware of the platform, its benefits and how to make use of it"
  5. FY2019/TEC3/O1, Deployment Pipeline, "Continuous Integration is unified with production tooling and developer feedback is faster"
  6. It should be noted that by the time voting took place, Željko Filipin had left the WG and Antoine Musso had joined.
  7. Medians were chosen over averages as frequent idling heavily skews the latter downward.
  8. Does not include computing needs of Argo system itself and other K8s/CI subsystems.
  9. We’ll need something to report workflow status back upstream to Gerrit but the exact requirements have yet to be defined.
  10. TLS termination and load balancing are not represented in the diagram but are also requirements.
  11. Other models of namespace assignment are certainly possible given the ability of Sensors to set arbitrary properties of created resources with values from the event payload.
  12. As of the time of this writing, our Gerrit instance hosts 2,199 active projects of type CODE.