Draft:Wikimedia Release Engineering Team/Seakeeper proposal (FY2019-20Q4 rework)

= Introduction =

This document serves as an up-to-date (as of FY 19-20 Q4) compendium of goals, architecture, and requirements for the Seakeeper project which aims to replace Wikimedia Foundation's existing general purpose continuous integration (CI) system with one that is more performant, capable, secure, and user friendly.

Purpose
The primary purpose of this document is two fold, to organize the goals and requirements of a viable replacement CI system as well as propose a path forward for its adoption and implementation. While it is highly informed by—and may outright borrow from—work in this domain over the past year —and lessons learned —it is not bound by any specific assertions of that work. It should be considered the latest iteration and primary reference document pertaining to CI replacement and is meant to be exhaustive in its representation of goals, requirements, and architecture.

Scope
While prior documents covered requirements and architecture for a general purpose CI replacement as well as some aspects of prospective continuous delivery system, this document will only cover aspects of the former. The reasoning for this narrowing of scope is summed up by the following comment from Security Concept Review task T240943.

We recognize that some of our documentation and process has conflated the requirements and policy of a general purpose CI system with that of the Deployment Pipeline project or another form of continuous delivery/deployment that we are working towards in the long term. While these systems are highly interrelated, they are also distinct and therefore can (and should) be reasoned about separately, for the sake of clarity in forming security policy, modeling threat, and proposing implementation.

The Deployment Pipeline is another important and ambitious project that will no doubt benefit from the success of this one; It both hinges on the success of a well planned and implemented CI platform, and deserves its own properly scoped process of planning, review, and implementation.

At its outset, this project has been driven by a very real need to replace the aging CI system we run now which handles for the most part general purpose workloads, is critical in supporting the daily work of WMF staff and volunteers, and is composed of unmaintained (some fully deprecated) components. Narrowing scope to accomplish a timely replacement seems self-evidently justifiable.

Audience
Wikimedia Site Reliability Engineering, the Wikimedia Release Engineering Team, and the Wikimedia Security Team are the intended primary audience of this document as they have been most deliberative. Additional audiences may include management and product owners from Wikimedia Technology and Wikimedia Product as well as third party vendors should we engage in any formal procurement of or consultative process for PaaS.

Individual users of our existing CI system are not the intended audience of this document. However, feedback is welcome from any and all stakeholders.

= Overview =

Described in this section are the problems we're aiming to solve with the Seakeeper project by replacing our existing CI system and our specific goals in implementing a replacement.

Statement of need
Staff and volunteer contributors heavily rely on our existing CI system for the static analysis, functional testing, and integration of patchsets to over 2,200 different projects. Daily usage ranges from several hundreds of builds per day to several thousands and from a few dozen build-time hours in a single day to several hundred build-time hours. Concurrency levels vary, but the daily 95th percentile falls most often in a range of 20-40 concurrent builds. Overall usage has grown steadily over time.

Simultaneously with this steady and growing need is the languishing of our current CI stack. Zuul v2 and the Jenkins Gearman plugin, our current gating systems, are currently unmaintained by their respective upstreams, and Jenkins is undergoing drastic changes to its architecture to be more "cloud native," away from the central-server model on which we currently rely.

In addition to the performance and maintenance issues, configuration of CI jobs remains prohibitively cumbersome for most of our users with most changes to the  repo being made by a specialized few with knowledge of the esoteric  Jenkins Job Builder DSL.

To meet the growing and variable capacity demand of our CI users and achieve a high degree of self service, we need a system built on a scalable underlying platform and based on accessible interfaces and schema.

Assumptions
= Architecture =

Security architecture
= Scenarios =

[Role3] scenarios
= Design =

Security design
= Risks =

Cost
= References =