Draft:Wikimedia Release Engineering Team/Seakeeper proposal (FY2019-20Q4 rework)

= Introduction =

This document serves as an up-to-date (as of FY 19-20 Q4) compendium of goals, architecture, and requirements for the Seakeeper project, which aims to replace the Wikimedia Foundation's existing general purpose continuous integration (CI) system with one that is more performant, capable, secure, and user friendly.

Purpose
The primary purpose of this document is twofold: to organize the goals and requirements of a viable replacement CI system, and to propose a path forward for its adoption and implementation. While it is highly informed by, and may outright borrow from, work in this domain over the past year and the lessons learned from it, it is not bound by any specific assertions of that work. It should be considered the latest iteration and primary reference document pertaining to CI replacement and is meant to be exhaustive in its representation of goals, requirements, and architecture.

Scope
While prior documents covered requirements and architecture for a general purpose CI replacement as well as some aspects of a prospective continuous delivery system, this document will only cover aspects of the former. The reasoning for this narrowing of scope is summed up by the following comment from Security Concept Review task T240943.

We recognize that some of our documentation and process has conflated the requirements and policy of a general purpose CI system with that of the Deployment Pipeline project or another form of continuous delivery/deployment that we are working towards in the long term. While these systems are highly interrelated, they are also distinct and therefore can (and should) be reasoned about separately, for the sake of clarity in forming security policy, modeling threat, and proposing implementation.

The Deployment Pipeline is another important and ambitious project that will no doubt benefit from the success of this one; it both hinges on the success of a well planned and implemented CI platform, and deserves its own properly scoped process of planning, review, and implementation.

From its outset, this project has been driven by a very real need to replace the aging CI system we run now, which for the most part handles general purpose workloads, is critical in supporting the daily work of WMF staff and volunteers, and is composed of unmaintained (some fully deprecated) components. Narrowing scope to accomplish a timely replacement seems self-evidently justifiable.

Audience
Wikimedia Site Reliability Engineering, the Wikimedia Release Engineering Team, and the Wikimedia Security Team are the intended primary audience of this document, as they have been the most involved in deliberations so far. Additional audiences may include management and product owners from Wikimedia Technology and Wikimedia Product, as well as third party vendors should we engage in any formal procurement of, or consultative process for, PaaS.

Individual users of our existing CI system are not the intended audience of this document. However, feedback is welcome from any and all stakeholders.

= Overview =

Described in this section are the problems we're aiming to solve with the Seakeeper project by replacing our existing CI system and our specific goals in implementing a replacement.

Statement of need
Staff and volunteer contributors rely heavily on our existing CI system for the static analysis, functional testing, and integration of patchsets to over 2,200 different projects. Daily usage ranges from several hundred to several thousand builds, and from a few dozen to several hundred build-time hours. Concurrency levels vary, but the daily 95th percentile falls most often in a range of 20-40 concurrent builds. Overall usage has grown steadily over time.

While this need grows steadily, our current CI stack is languishing. Zuul v2 and the Jenkins Gearman plugin, our current gating systems, are unmaintained by their respective upstreams, and Jenkins is undergoing drastic changes to its architecture to be more "cloud native," moving away from the central-server model on which we currently rely.

Configuration of CI jobs remains prohibitively cumbersome for most of our users, with most changes to the configuration repo being made by a specialized few with knowledge of the esoteric Jenkins Job Builder DSL.

To meet the growing and variable capacity demands of our CI users and achieve a high degree of self service, we need a system built on a scalable underlying platform and based on accessible interfaces and schema.

Lastly, our current system lacks the isolation mechanisms necessary to run security-sensitive workloads such as Debian package building and automated application/testing of embargoed security patches. In order to ensure the integrity of production deployed artifacts, we will need a system that can schedule trusted and  untrusted workloads to specific isolated environments that are controlled logically through namespacing and access control, and physically through separation of underlying hardware nodes.

Design goals
A successful implementation of Seakeeper is one where:


 * CI software is maintained and supported by upstream.
 * CI deployments are easily manageable by Release Engineering.
 * Project contributors can easily understand workflow definitions for their own repos.
 * A subset of privileged contributors have direct access to modify workflow definitions for their own repos.
 * Any contribution can be securely analyzed and tested by untrusted workflows.
 * Approved contributions can be further analyzed and tested by trusted workloads and result in deployable artifacts.
 * System compute capacity can be scaled easily to meet demand.
 * Artifacts produced by trusted workloads are verifiable and production deployable.
 * Artifacts produced by untrusted workloads are usable in analysis and testing but are not promoted or production deployable.
 * Migration from current Jenkins Job Builder based job definitions is automated.
 * Workload results are readily accessible to CI users.
 * Workload results originating from Gerrit events are reported back to Gerrit.
 * Workload results can be propagated to external systems for further analysis.
 * Workload logs are kept for a sufficient amount of time for troubleshooting.
 * Artifacts are auditable, traceable to originating workloads and Gerrit patchsets.

Stakeholders

 * Wikimedia Quality and Test Engineering Team : Responsible for improving software practices at WMF, QTE has a direct stake in seeing that developers can have their patchsets reliably analyzed and tested by CI.


 * Wikimedia Release Engineering Team : Tasked with ensuring timely and safe production deployments, Release Engineering has a direct stake in keeping our continuous integration platform running to facilitate proper analysis and testing of project patches. Additionally, RelEng is a direct contributor to the Deployment Pipeline project which depends on a stable and performant underlying CI platform.


 * Wikimedia Security Team : Providing "services to inform risk and to cultivate a culture of security," Security has a direct stake in a CI system that can perform security analysis for teams and enforce policy on software patchsets.


 * Wikimedia Site Reliability Engineering : Responsible for ensuring the integrity, stability, and availability of production systems and networks, SRE has a direct stake in a CI system that can securely produce verifiable production artifacts. SRE is also a direct contributor to the Deployment Pipeline project which depends on a stable and performant underlying CI platform.


 * Wikimedia Technology and Wikimedia Product : All other departments and teams at WMF writing software patchsets for eventual production deployment have a direct stake in a CI system that is accessible, stable, and performant.


 * Technical Contributors : Volunteers that wish to contribute software changes to Wikimedia projects benefit directly from a CI system that provides clear and actionable feedback.


 * Wikimedia Project Users : Editors, readers, and other end users of Wikimedia projects have an indirect stake in a CI system that contributes to the overall security and reliability of our production sites and backend services.

Assumptions

 * 1)  Argo will continue to be well maintained for the next 4-6 years.
 * 2)  Argo's membership in the Cloud Native Computing Foundation (CNCF) means it has a good chance of continuing to garner wide community support.
 * 3) Kubernetes will continue to be well maintained for the next 4-6 years.
 * 4) Gerrit will continue to be the code review system for Wikimedia projects for the next 2-3 years.
 * 5)  Release Engineering will continue to have the resourcing necessary to maintain a CI system.
 * 6)  Release Engineering will continue to have enough internal Kubernetes knowledge to administer a K-native system.
 * 7)  SRE will have the capacity to collaborate on  trusted workload requirements to the extent that they will be confident having produced artifacts deployed to production.
 * 8)  Security will have the capacity to review this design document and advise on matters of security risk, constraints, and policy.

= Architecture =

The architecture of the proposed CI system comprises distinct conceptual layers.

Logical architecture
Like other general purpose CI systems, there is a pattern of scheduling workloads, performing work, storing artifacts, and reporting results.



Schedule
Scheduling is where upstream Gerrit events are consumed and where workloads are spawned based on event parameters for the analysis of both untrusted and trusted patchsets. This is the primary (albeit indirect) interface between contributors and CI, as it allows for analysis of project contributions.

Work
Workloads in the form of Argo Workflows are scheduled on the Kubernetes cluster and routed to one of two worker pools depending on their previously determined  trusted or  untrusted status.

Argo Workflows can have either a serial or a directed acyclic graph (DAG) execution structure composed of one or more discrete tasks.

Store
Workflow tasks may output artifacts for immediate use by subsequent tasks in the same workflow or future use by external consumers such as deployment processes or package publishers.

Report
Results are processed and submitted to Gerrit as feedback for patch authors and reviewers, or aggregated to other systems for long-term retention and future analysis. Code merges are also performed based on results.

Functional architecture
Expanding the conceptual overview, the system proposed will function in the following ways.

Scheduling
Gerrit events are consumed by a custom Gateway that listens to the Gerrit event stream over a persistent SSH connection. Events are quickly passed off to a Sensor that applies a number of constraints and conditions on each event payload to determine whether it should result in a triggered  Workflow.

Conditions for scheduling can include anything present in the Gerrit event stream but are chiefly:


 * Event type such as,  ,  ,.
 * Change details such as project, branch.
 * Patchset details such as ref, author or uploader email address, and paths of modified files.
 * Comment details such as message and commenter/reviewer.
 * Approvals (label changes) such as CR+2.

Additionally, parameters in the originating event payload will determine whether the workload should be considered trusted or  untrusted. See security architecture for details.
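
As a rough illustration of the kind of conditions a Sensor might apply, the following Python sketch filters Gerrit stream events and assigns a provisional trust level. The event and approval field names follow Gerrit's stream-events JSON; the rule set and the `classify` helper are hypothetical, and the actual trust policy is a matter for the security architecture.

```python
from typing import Optional

# Hypothetical rule set: the Gerrit event types this sketch reacts to.
WATCHED_TYPES = {"patchset-created", "comment-added", "change-merged"}


def has_code_review_plus_two(event: dict) -> bool:
    """Return True if the event carries a Code-Review +2 approval."""
    for approval in event.get("approvals", []):
        if approval.get("type") == "Code-Review" and int(approval.get("value", 0)) == 2:
            return True
    return False


def classify(event: dict) -> Optional[str]:
    """Return "trusted", "untrusted", or None when nothing should be triggered.

    Assumption (not taken from the proposal): merges and CR+2 votes yield
    trusted workloads, new patchsets yield untrusted ones, and any other
    comment is ignored.
    """
    event_type = event.get("type")
    if event_type not in WATCHED_TYPES:
        return None
    if event_type == "change-merged":
        return "trusted"
    if event_type == "comment-added":
        return "trusted" if has_code_review_plus_two(event) else None
    return "untrusted"
```

A real Sensor would express equivalent rules declaratively rather than in Python, and would also match on project, branch, and modified paths.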

Work
Once an event passes its constraints, the Sensor applies its Trigger Templates: it fetches a Workflow definition from a branch of the project's repo, annotates it with values from the event payload, enforces a user namespace, and then submits the resulting Workflow to the resident Kubernetes API.

Submitted Workflow objects are fulfilled by the  Argo Workflow controller, each task being scheduled as an independent pod in the enforced namespace. Execution order is determined by the serial step definitions or DAG of tasks in the  Workflow definition, and each task specifies explicit input and output parameters for data and result passing.
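
To make DAG execution order concrete, here is a minimal Python sketch that groups tasks into "waves" whose dependencies have all completed. It uses the standard library's `graphlib` (Python 3.9+); the task names are hypothetical. It is a simplification: a real controller would start each task as soon as that task's own dependencies finish rather than waiting for a whole wave.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def execution_waves(dag: dict) -> list:
    """Group tasks into waves; every task in a wave has all dependencies met.

    `dag` maps each task name to the set of task names it depends on.
    """
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks that could run in parallel
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave as completed
    return waves


# Hypothetical task names, for illustration only.
example = {
    "checkout": set(),
    "lint": {"checkout"},
    "unit-test": {"checkout"},
    "build-image": {"lint", "unit-test"},
}
print(execution_waves(example))
# prints [['checkout'], ['lint', 'unit-test'], ['build-image']]
```

In this example `lint` and `unit-test` could run as two independent pods at the same time, while `build-image` waits for both.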

Store
Any task in a Workflow may save files generated by its completed container as artifacts. Tasks may also take artifacts as input, allowing for pipeline like processing.

Passing artifacts between tasks—binding one task's artifact output parameters to another's input parameters—involves a pair of round-trip save and fetch operations between the resident node—on which the task's pod is executing—and the storage backend, incurring network I/O, archive and compression CPU time, and storage read/write overhead.

Argo provides a number of artifact drivers with differing degrees of support for input and output, the most notable for us being cloud provider storage engines like AWS (and compatible) S3 and GCS for saving blobs of data as outputs, and Git for retrieving and analyzing project patchsets and other refs as inputs.

Note that artifacts are not the only means of binding task inputs and outputs. For smaller values, standard Workflow parameters are a better option.
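
The overhead of artifact passing can be illustrated with a small Python sketch of one save and fetch round trip. A plain dictionary stands in for the storage backend (S3, GCS, or similar), and the key and file names are hypothetical; the point is only that every hand-off archives, compresses, transfers, and unpacks the data, which is why small values are better passed as parameters.

```python
import io
import tarfile
import tempfile
from pathlib import Path


def save_artifact(store: dict, key: str, src_dir: Path) -> int:
    """Archive and gzip the files in `src_dir` into `store[key]`; return size in bytes."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for path in sorted(src_dir.iterdir()):
            if path.is_file():
                tar.add(path, arcname=path.name)
    store[key] = buf.getvalue()
    return len(store[key])


def fetch_artifact(store: dict, key: str, dest_dir: Path) -> None:
    """Decompress `store[key]` and write its files into `dest_dir`."""
    with tarfile.open(fileobj=io.BytesIO(store[key]), mode="r:gz") as tar:
        for member in tar.getmembers():
            if member.isfile():
                data = tar.extractfile(member).read()
                (dest_dir / member.name).write_bytes(data)


def demo() -> str:
    """Pass a file from a producer task directory to a consumer task directory."""
    store = {}  # stands in for an object storage bucket
    with tempfile.TemporaryDirectory() as producer, tempfile.TemporaryDirectory() as consumer:
        (Path(producer) / "report.txt").write_text("42 tests passed\n")
        save_artifact(store, "workflow-1/report", Path(producer))
        fetch_artifact(store, "workflow-1/report", Path(consumer))
        return (Path(consumer) / "report.txt").read_text()
```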

Consumers of saved artifacts may retrieve them by way of the Argo UI web interface or directly, the former being preferred wherever possible.

For separation of artifacts originating from untrusted and  trusted workloads into different stores, see  security architecture.

Report
Completed Workflows trigger further processing for contributor feedback and merging of patchsets.

Processing can be implemented in one of two ways.

One option is to use an additional Gateway and  Sensor to subscribe and react to completed Gerrit-originating  Workflows and trigger an internal "reporting"  Workflow that comments in Gerrit and conditionally merges patchsets. This method would have the benefit of re-using the same Argo Events subsystem responsible for processing Gerrit events, and of scaling with the overall node capacity of the cluster. A major downside is the overhead required to schedule and execute Workflows (and thus Pods) for every reporting event.

Another option is to implement a custom resident controller that watches the Kubernetes API for completed Workflows and submits comments and merges in Gerrit using a separate thread or process. The benefit to this approach would be efficiency of forking instead of Pod scheduling. The downsides include the maintenance burden of a custom controller and the lack of scalability with reporting load despite having cluster capacity.
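
Whichever option is chosen, the core of the reporting step is a mapping from a completed Workflow to Gerrit feedback. The Python sketch below shows one possible shape of that mapping. It assumes the Workflow object exposes its result in `status.phase` (as Argo Workflows do) and that CI votes use a `Verified` label; the message format and the `build_review` helper are hypothetical, and submitting the review or merging the patchset is left out.

```python
from typing import Optional

# Workflow phases that indicate the Workflow has finished.
TERMINAL_PHASES = {"Succeeded", "Failed", "Error"}


def build_review(workflow: dict) -> Optional[dict]:
    """Return a Gerrit review payload for a finished Workflow, else None.

    Assumption: a succeeded Workflow votes Verified +1 and any other
    terminal phase votes Verified -1.
    """
    phase = workflow.get("status", {}).get("phase")
    if phase not in TERMINAL_PHASES:
        return None  # still pending or running: nothing to report yet
    name = workflow["metadata"]["name"]
    vote = 1 if phase == "Succeeded" else -1
    return {
        "message": f"Workflow {name} finished: {phase}",
        "labels": {"Verified": vote},
    }
```

In the first option this logic would run inside a "reporting" Workflow task; in the second it would run inside the custom controller's watch loop.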

Security architecture
= Scenarios =

= Design =

Security design
= Risks =

Cost
= Definitions =

 * Argo : Umbrella term for all Argo sub-projects relevant to this system design, namely  Argo Workflow,  Argo UI, and  Argo Events.
 * Argo Events : Argo sub-system that consumes events from external sources (e.g. Gerrit) and conditionally spawns Kubernetes resources (e.g.  workflows) based on each event payload.
 * Argo UI : Argo sub-system that provides a web view into  workflow history and links to artifacts.
 * Argo Workflow : Argo sub-system that extends Kubernetes functionality to support workflow definitions as native custom Kubernetes resources.
 * Custom Resource Definition (CRD) : Kubernetes configuration that allows a new type of object to be submitted to a Kubernetes cluster's API.
 * Directed Acyclic Graph (DAG) Execution : A specific form of parallel execution whereby discrete tasks are fulfilled according to a directed graph structure.
 * Gateway : A  custom resource provided by  Argo Events that defines an event source to listen to. Many kinds of gateways are supported out of the box (e.g. Kafka, Webhooks, etc.) and can also be provided in the form of a Docker image path. Gateways are subscribed to by one or more  Sensors for processing.
 * Job Definition : Human readable/writable configuration that provides the CI system with programs to execute—as well as inputs to read and outputs to save—in the context of a given project repo with patchset(s) applied.
 * K-native : A software sub-system that is implemented through a set of CRDs and software controllers such that it embeds itself into a Kubernetes cluster's core functioning, extending its API with new types of objects and fulfilling CRUD operations on those objects.
 * Sensor : A custom resource provided by Argo Events that defines a number of event gateways to subscribe to, event dependencies/constraints/conditions, and trigger templates that fetch definitions for resources to be spawned (e.g. Argo Workflows).
 * Trigger Template : The section of an  Argo Events Sensor that defines what kind of resource to spawn as the result of an accepted event (e.g. an  Argo Workflow) and how to fetch that resource's definition (e.g. from a branch/path of the originating event's project Git repo).
 * Trusted Workload : A workload that is the result of some interaction (e.g. +2 vote in Gerrit, merge, tag push, etc.) by a user or subsystem that has merge privileges and post-merge access to a given repo.
 * Untrusted Workload : A workload that is the result of some interaction (e.g. patch submission, manual workflow submission, etc.) by a user that lacks merge privileges or post-merge access to a given repo.
 * Workflow Definition : Essentially the same as Job Definition but in  Argo Workflow parlance and having the capability of  DAG execution.
 * Workload : A single discrete execution of either a workflow definition or  job definition.

= References =