Draft:Wikimedia Release Engineering Team/Seakeeper proposal (FY2019-20Q4 rework)

= Introduction =

This document serves as an up-to-date (as of FY 19-20 Q4) compendium of goals, architecture, and requirements for the Seakeeper project, which aims to replace the Wikimedia Foundation's existing general purpose continuous integration (CI) system with one that is more performant, capable, secure, and user friendly.

Purpose
The primary purpose of this document is twofold: to organize the goals and requirements of a viable replacement CI system, and to propose a path forward for its adoption and implementation. While it is highly informed by (and may outright borrow from) work in this domain over the past year and the lessons learned from it, it is not bound by any specific assertions of that work. It should be considered the latest iteration and primary reference document pertaining to CI replacement and is meant to be exhaustive in its representation of goals, requirements, and architecture.

Scope
While prior documents covered requirements and architecture for a general purpose CI replacement as well as some aspects of a prospective continuous delivery system, this document covers only the former. The reasoning for this narrowing of scope is summed up by the following comment from Security Concept Review task T240943.

We recognize that some of our documentation and process has conflated the requirements and policy of a general purpose CI system with that of the Deployment Pipeline project or another form of continuous delivery/deployment that we are working towards in the long term. While these systems are highly interrelated, they are also distinct and therefore can (and should) be reasoned about separately, for the sake of clarity in forming security policy, modeling threat, and proposing implementation.

The Deployment Pipeline is another important and ambitious project that will no doubt benefit from the success of this one. It both hinges on the success of a well planned and implemented CI platform and deserves its own properly scoped process of planning, review, and implementation.

From its outset, this project has been driven by a very real need to replace the aging CI system we run now, which handles mostly general purpose workloads, is critical in supporting the daily work of WMF staff and volunteers, and is composed of unmaintained (some fully deprecated) components. Narrowing scope to accomplish a timely replacement seems self-evidently justifiable.

Audience
Wikimedia Site Reliability Engineering, the Wikimedia Release Engineering Team, and the Wikimedia Security Team are the intended primary audience of this document, as they have been the most involved in deliberating on it. Additional audiences may include management and product owners from Wikimedia Technology and Wikimedia Product, as well as third-party vendors should we engage in any formal procurement of, or consultative process for, PaaS.

Individual users of our existing CI system are not the intended audience of this document. However, feedback is welcome from any and all stakeholders.

= Overview =

This section describes the problems we aim to solve with the Seakeeper project by replacing our existing CI system, and our specific goals in implementing a replacement.

Statement of need
Staff and volunteer contributors rely heavily on our existing CI system for the static analysis, functional testing, and integration of patchsets to over 2,200 different projects. Daily usage ranges from several hundred to several thousand builds per day, and from a few dozen to several hundred build-time hours in a single day. Concurrency levels vary, but the daily 95th percentile falls most often in a range of 20-40 concurrent builds. Overall usage has grown steadily over time.
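
As an illustration of how a figure like the daily 95th percentile of concurrent builds can be derived, the following is a minimal sketch and not the method actually used to produce the numbers above. It assumes build records are available as (start, end) timestamps in seconds, samples concurrency at a fixed interval, and applies a nearest-rank percentile. The function names and the example data are hypothetical.

```python
import math

def concurrency_samples(builds, start, end, step):
    """Count how many builds are running at each sample instant in [start, end).

    `builds` is a list of (build_start, build_end) timestamps in seconds.
    A build counts as running at instant t when build_start <= t < build_end.
    """
    return [
        sum(1 for (b_start, b_end) in builds if b_start <= t < b_end)
        for t in range(start, end, step)
    ]

def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical data: three builds within a 60 second window, sampled every 5 seconds.
BUILDS = [(0, 60), (10, 20), (15, 30)]
SAMPLES = concurrency_samples(BUILDS, 0, 60, 5)
P95 = percentile(SAMPLES, 95)  # 3: the peak dominates this tiny sample set
P50 = percentile(SAMPLES, 50)  # 1: most of the window has a single build running
```

With real data the window would span a full day, and the sampling interval would be chosen to be shorter than the shortest typical build.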

While this need grows steadily, our current CI stack languishes. Zuul v2 and the Jenkins Gearman plugin, our current gating systems, are unmaintained by their respective upstreams, and Jenkins is undergoing drastic changes to its architecture to become more "cloud native," moving away from the central-server model on which we currently rely.

Configuration of CI jobs remains prohibitively cumbersome for most of our users, with most changes to the repo being made by a specialized few who know the esoteric Jenkins Job Builder DSL.

To meet the growing and variable capacity demands of our CI users and achieve a high degree of self service, we need a system built on a scalable underlying platform and based on accessible interfaces and schema.

Lastly, our current system lacks the isolation mechanisms necessary to run security-sensitive workloads such as Debian package building and automated application/testing of embargoed security patches. In order to ensure the integrity of production deployed artifacts, we will need a system that can schedule trusted and untrusted workloads to specific isolated environments that are controlled logically through namespacing and access control, and physically through separation of underlying hardware nodes.
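
The routing requirement above can be sketched roughly as follows. This is a minimal illustration under stated assumptions and not a proposed implementation: the namespace names, node pool names, and event names are hypothetical, and the rule that anything not clearly trusted falls back to the untrusted environment is an assumption made here for safety, not a decision stated in this document.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Placement:
    namespace: str  # logical isolation (namespacing and access control)
    node_pool: str  # physical isolation (separate hardware nodes)

# Hypothetical environment names, for illustration only.
TRUSTED = Placement(namespace="ci-trusted", node_pool="trusted-nodes")
UNTRUSTED = Placement(namespace="ci-untrusted", node_pool="untrusted-nodes")

# Hypothetical event names for interactions that imply approval of a change.
TRUSTED_EVENTS = frozenset({"approved", "merged", "tag-pushed"})

def place_workload(event_type: str, user_can_merge: bool) -> Placement:
    """Pick the isolated environment for a workload.

    A workload is treated as trusted only when the triggering user has merge
    privileges AND the event is one that implies approval. Everything else is
    routed to the untrusted environment (assumed fail-closed default).
    """
    if user_can_merge and event_type in TRUSTED_EVENTS:
        return TRUSTED
    return UNTRUSTED
```

For example, `place_workload("merged", True)` selects the trusted environment, while a patch submission, even from a user with merge privileges, stays in the untrusted one under this assumed default.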

Definitions

 * Argo : Umbrella term for all Argo sub-projects relevant to this system design, namely Argo Workflow, Argo UI, and Argo Events.
 * Argo Events : Argo sub-system that consumes events from external sources (e.g. Gerrit) and conditionally spawns Kubernetes resources (e.g. workflows) based on each event payload.
 * Argo UI : Argo sub-system that provides a web view into workflow history and links to artifacts.
 * Argo Workflow : Argo sub-system that extends Kubernetes functionality to support workflow definitions as native custom Kubernetes resources.
 * Custom Resource Definition (CRD) : Kubernetes configuration that allows a new type of object to be submitted to a Kubernetes cluster's API.
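 * Directed Acyclic Graph (DAG) Execution : A specific form of parallel execution whereby discrete tasks are fulfilled according to a directed graph structure that contains no cycles: each task starts only after the tasks it depends on have completed, and tasks that do not depend on one another may run in parallel.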
 * Job Definition : Human readable/writable configuration that provides the CI system with programs to execute—as well as inputs to read and outputs to save—in the context of a given project repo with patchset(s) applied.
 * K-native : A software sub-system that is implemented through a set of CRDs and software controllers such that it embeds itself into a Kubernetes cluster's core functioning, extending its API with new types of objects and fulfilling CRUD operations on those objects.
 * Trusted Workload : A workload that is the result of some interaction (e.g. +2 vote in Gerrit, merge, tag push, etc.) by a user or subsystem that has merge privileges and post merge access to a given repo.
 * Untrusted Workload : A workload that is the result of some interaction (e.g. patch submission, manual workflow submission, etc.) by a user that lacks merge privileges or post merge access to a given repo.
 * Workflow Definition : Essentially the same as Job Definition but in Argo Workflow parlance and having the capability of DAG execution.
 * Workload : A single discrete execution of either a workflow definition or job definition.
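
To make the notion of DAG execution concrete, here is a minimal sketch using Python's standard `graphlib` module (Python 3.9+). It is not how Argo Workflow schedules tasks internally. It only shows the ordering principle: tasks are grouped into "waves", where every task in a wave has all of its dependencies satisfied and could therefore run in parallel. The task names in the example workflow are hypothetical.

```python
from graphlib import TopologicalSorter

def dag_waves(dependencies):
    """Group tasks into waves of tasks that can run in parallel.

    `dependencies` maps each task name to the set of tasks it depends on.
    Raises graphlib.CycleError if the graph contains a cycle, i.e. if it is
    not a valid DAG.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        # Every task returned here has all of its dependencies completed.
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves

# Hypothetical workflow: lint and unit tests run in parallel, the image build
# waits for both, and integration tests wait for the image build.
EXAMPLE_WORKFLOW = {
    "lint": set(),
    "unit-test": set(),
    "build-image": {"lint", "unit-test"},
    "integration-test": {"build-image"},
}

WAVES = dag_waves(EXAMPLE_WORKFLOW)
# WAVES == [["lint", "unit-test"], ["build-image"], ["integration-test"]]
```

A real scheduler would start each task as soon as its own dependencies finish rather than waiting for a whole wave, but the dependency rule is the same.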

Design goals
A successful implementation of Seakeeper is one in which:


 * CI software is maintained and supported by upstream.
 * CI deployments are easily manageable by Release Engineering.
 * Project contributors can easily understand workflow definitions for their own repos.
 * A subset of privileged contributors have direct access to modify workflow definitions for their own repos.
 * Any contribution can be securely analyzed and tested by untrusted workloads.
 * Approved contributions can be further analyzed and tested by trusted workloads and result in deployable artifacts.
 * System compute capacity can be scaled easily to meet demand.
 * Artifacts produced by trusted workloads are verifiable and production deployable.
 * Artifacts produced by untrusted workloads are usable in analysis and testing but are not promoted or production deployable.
 * Migration from current Jenkins Job Builder based job definitions is automated.
 * Workload results are readily accessible to CI users.
 * Workload results originating from Gerrit events are reported back to Gerrit.
 * Workload results can be propagated to external systems for further analysis.
 * Workload logs are kept for a sufficient amount of time for troubleshooting.
 * Artifacts are auditable and traceable to their originating workloads and Gerrit patchsets.
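
The artifact-related goals above (verifiable, traceable, and promotable only when produced by trusted workloads) could be modeled along the following lines. This is a sketch under assumptions: it treats "verifiable" as a SHA-256 digest comparison, which is only one possible mechanism (signatures are another), and every field and function name is hypothetical.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class ArtifactRecord:
    sha256: str         # content digest, used here for verification
    workload_id: str    # originating workload (traceability)
    gerrit_change: str  # originating Gerrit change (traceability)
    patchset: int       # originating patchset number (traceability)
    trusted: bool       # whether a trusted workload produced the artifact

def record_artifact(content, workload_id, gerrit_change, patchset, trusted):
    """Create an audit record for an artifact at the time it is produced."""
    digest = hashlib.sha256(content).hexdigest()
    return ArtifactRecord(digest, workload_id, gerrit_change, patchset, trusted)

def verify(record, content):
    """Check that `content` is byte-for-byte what the workload produced."""
    return hashlib.sha256(content).hexdigest() == record.sha256

def is_promotable(record, content):
    """Only verified artifacts from trusted workloads may be promoted."""
    return record.trusted and verify(record, content)
```

Under this model, an artifact from an untrusted workload remains usable for analysis and testing, but `is_promotable` always returns `False` for it.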

Stakeholders

 * Wikimedia Quality and Test Engineering Team : Responsible for improving software practices at WMF, QTE has a direct stake in seeing that developers can have their patchsets reliably analyzed and tested by CI.


 * Wikimedia Release Engineering Team : Tasked with ensuring timely and safe production deployments, Release Engineering has a direct stake in keeping our continuous integration platform running to facilitate proper analysis and testing of project patches. Additionally, RelEng is a direct contributor to the Deployment Pipeline project which depends on a stable and performant underlying CI platform.


 * Wikimedia Security Team : Providing "services to inform risk and to cultivate a culture of security," Security has a direct stake in a CI system that can perform security analysis for teams and enforce policy on software patchsets.


 * Wikimedia Site Reliability Engineering : Responsible for ensuring the integrity, stability, and availability of production systems and networks, SRE has a direct stake in a CI system that can securely produce verifiable production artifacts. SRE is also a direct contributor to the Deployment Pipeline project which depends on a stable and performant underlying CI platform.


 * Wikimedia Technology and Wikimedia Product : All other departments and teams at WMF writing software patchsets for eventual production deployment have a direct stake in a CI system that is accessible, stable, and performant.


 * Technical Contributors : Volunteers that wish to contribute software changes to Wikimedia projects benefit directly from a CI system that provides clear and actionable feedback.


 * Wikimedia Project Users : Editors, readers, and other end users of Wikimedia projects have an indirect stake in a CI system that contributes to the overall security and reliability of our production sites and backend services.

Assumptions

 * 1) Argo will continue to be well maintained for the next 4-6 years.
 * 2) Argo's membership in the Cloud Native Computing Foundation (CNCF) means it has a good chance of continuing to garner wide community support.
 * 3) Kubernetes will continue to be well maintained for the next 4-6 years.
 * 4) Gerrit will continue to be the code review system for Wikimedia projects for the next 2-3 years.
 * 5) Release Engineering will continue to have the resourcing necessary to maintain a CI system.
 * 6) Release Engineering will continue to have enough internal Kubernetes knowledge to administer a K-native system.
 * 7) SRE will have the capacity to collaborate on trusted workload requirements to the extent that they will be confident in having the resulting artifacts deployed to production.
 * 8) Security will have the capacity to review this design document and advise on matters of security risk, constraints, and policy.

= Architecture =

Security architecture
= Scenarios =

[Role3] scenarios
= Design =

Security design
= Risks =

Cost
= References =