Wikimedia Release Engineering Team/Seakeeper proposal/Kubernetes vendor selection

From mediawiki.org

Kubernetes vendor selection

Overview

Kubernetes vendor criteria serve the Seakeeper proposal and are informed by the Future CI requirements. The requirements defined here cover only the products, services and relationship that WMF Release Engineering is seeking to establish with a Kubernetes PaaS provider for the purpose of rolling out and maintaining an Argo-based CI system. These requirements must bolster those of the greater CI system, and should directly reference them when relevant, but may also be informed by considerations outside the precise purview of CI.

Discussion between primary participating stakeholders should drive the creation of these requirements and should be kept to the discussion page of this article as much as possible.

Classification of requirements

Requirements will be classified into four distinct categories, each of which is a quadrant formed by two main category axes: function/quality and service/organization.

Function/Quality
A functional requirement describes what the k8s vendor and service need to do for us.
A quality (aka non-functional) requirement describes constraints on how the vendor and platform are to operate and deliver, generally expressed using some kind of quality attribute.
Service/Organization
A service requirement speaks to the k8s platform and supporting services being provided by the vendor.
An organization requirement speaks to the history/standing of the vendor organization itself, and its underlying ability to provide and support the service, akin to supplier evaluation criteria.

Categories

Figure: a four-quadrant model for classifying service vendor requirements.

Combined, these main axes constitute four category quadrants we can use for classifying requirements.

Service Function (SF)
A requirement that speaks to the behavior of the provided k8s platform and is informed by the material needs of Wikimedia technical contributors and WMF admins.
Examples
SF1 – API supports standard k8s toolchain
SF2 – Allows custom resource definitions (CRDs)
SF3 – Supports sole-tenant nodes
Service Quality (SQ)
A requirement that speaks to the qualities of the provided k8s PaaS and our needs around availability, scalability, performance, maintainability, etc. Some of these will map directly to service level objectives (SLOs).
Examples
SQ1 – Cluster availability exceeds 99.99%
SQ2 – Node performance meets or exceeds current levels
Organization Function (OF)
A requirement that speaks to the capacity of the vendor organization to deliver and continue to support its service.
Examples
OF1 – Financially stable as a k8s provider
OF2 – Demonstrates competency as a k8s operator
OF3 – Support level with a response window of 1 hour or less
Organization Quality (OQ)
A requirement that speaks to the vendor organization itself, its business practices and its alignment with WMF's culture and values.
Examples
OQ1 – Contributes improvements to k8s upstream
OQ2 – Track record of equitable labor practices
OQ3 – Scored four-star EFF data-request rating
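
The quadrant model above can be illustrated with a small lookup that splits a requirement code into its two axes. This is only a sketch: the names `Subject`, `Kind` and `classify` are hypothetical, and the code format (two-letter prefix followed by a positive number) is inferred from the examples on this page.

```python
import re
from enum import Enum


class Subject(Enum):
    """Service/organization axis (hypothetical name)."""
    SERVICE = "S"
    ORGANIZATION = "O"


class Kind(Enum):
    """Function/quality axis (hypothetical name)."""
    FUNCTION = "F"
    QUALITY = "Q"


# Assumed format: quadrant prefix (SF, SQ, OF, OQ) plus a positive number.
CODE_PATTERN = re.compile(r"^([SO])([FQ])([1-9][0-9]*)$")


def classify(code: str) -> tuple[Subject, Kind, int]:
    """Split a requirement code into its quadrant axes and sequence number."""
    match = CODE_PATTERN.match(code)
    if match is None:
        raise ValueError(f"not a valid requirement code: {code!r}")
    subject, kind, number = match.groups()
    return Subject(subject), Kind(kind), int(number)
```

For example, `classify("SQ1")` yields the service axis, the quality axis and the number 1, while a code with an unknown prefix raises `ValueError`.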

Requirements

Requirements listed below will later be used to score each prospective vendor. Each new requirement added to the table must include the following:

Code
Unique identifier formed by a classification prefix and number.
Weight
Whether this requirement constitutes a should-have or a must-have requirement. Should a vendor not meet a must-have requirement, it is highly likely it will not be chosen unless all other candidates fail in some way as well.
Name
Short but descriptive name of the requirement.
Description
Full requirement description.
Reason
Justification for including the requirement with links to upstream documents such as the Future CI requirements and Seakeeper proposal.
Status
Either needs review or reviewed (linking to any relevant discussion topic).
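
The fields above could be captured as a simple record, with a rough ranking helper built on top. This is a sketch under stated assumptions: this page does not define how scores are calculated, so the rule used here (fewest unmet must-have requirements first, then most should-have requirements met) is an assumption, and `Requirement`, `score_vendor` and `rank_vendors` are hypothetical names.

```python
from dataclasses import dataclass

WEIGHTS = ("should", "must")
STATUSES = ("needs review", "reviewed")


@dataclass(frozen=True)
class Requirement:
    code: str          # classification prefix plus number, e.g. "SF1"
    weight: str        # "should" or "must"
    name: str
    description: str
    reason: str
    status: str = "needs review"

    def __post_init__(self):
        # Reject values outside the vocabularies described on this page.
        if self.weight not in WEIGHTS:
            raise ValueError(f"weight must be one of {WEIGHTS}")
        if self.status not in STATUSES:
            raise ValueError(f"status must be one of {STATUSES}")


def score_vendor(requirements, met_codes):
    """Return (codes of unmet must-haves, count of should-haves met)."""
    failed_must = [r.code for r in requirements
                   if r.weight == "must" and r.code not in met_codes]
    should_met = sum(1 for r in requirements
                     if r.weight == "should" and r.code in met_codes)
    return failed_must, should_met


def rank_vendors(requirements, vendors):
    """Order vendor names best-first under the assumed scoring rule.

    `vendors` maps a vendor name to the set of requirement codes it meets.
    """
    def key(name):
        failed_must, should_met = score_vendor(requirements, vendors[name])
        # Fewer failed must-haves wins; ties go to more should-haves met.
        return (len(failed_must), -should_met, name)

    return sorted(vendors, key=key)
```

Ranking rather than hard filtering mirrors the note under Weight: a vendor that misses a must-have requirement sinks to the bottom but stays in the list in case every candidate fails in some way.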
| Code | Weight | Name | Description | Reason | Status |
| --- | --- | --- | --- | --- | --- |
| SF1 | should | Standard k8s toolchain | Once a cluster is provisioned, users can interact with its API endpoint using standard Kubernetes tools like kubectl. | A standard toolchain will enable those with Kubernetes experience to be immediately productive as administrators of the platform. Note that learning additional toolchains for interacting with the vendor's underlying compute platform may be unavoidable. | Needs review |
| SF2 | must | Custom k8s resources | Supports creation of custom resource definitions (CRDs) such as argoproj.io/Workflow. | Argo subsystems are Kubernetes-native, relying on CRDs to carry out their basic functions. | Needs review |
| SF3 | must | Sole-tenant nodes | Supports provisioning of node hardware that will be used exclusively by WMF. | In order to effectively create high-security or otherwise more "trusted" CI segments, we'll need to ensure sole tenancy on the underlying nodes for each segment. | Needs review |
| SF4 | should | Cluster autoscaling | Supports automatic adjustment of cluster size (horizontal) and/or reallocation of VM resources (vertical) based on forecasted changes in overall cluster load. | Although CI load can be bursty, average load patterns show daily and weekly seasonality. Autoscaling would allow us to save on cost during low times and provide better performance during peak times. | Needs review |
| SQ1 | should | Cluster availability exceeds 99.99% | Once a cluster is provisioned, its monthly uptime percentage is at least 99.99%. (Should reference vendor's SLO.) | Users of our CI system are distributed throughout the world's time zones. The system needs to be available for them at all times to ensure timely testing of contributions. | Needs review |
| OF1 | must | Established k8s operator | Vendor has been operating k8s clusters as a provider for some time. | Kubernetes is a fairly new technology, so finding a provider with a long history of providing service will be impossible. However, the operator should at least be well established relative to others in the solution space. | Needs review |
| OF2 | should | Established cloud provider | Vendor has been operating cloud compute resources as a provider for some time. | Kubernetes providers are almost certain to be running on virtualized systems lower in the stack that we'll be interacting with as well. | Needs review |
| OQ1 | should | Core k8s contributor | Vendor staff have contributed improvements/fixes to the k8s project. | This speaks to our first Wikimedia Foundation Guiding Principle of "Freedom and open source" that compels us to "work with upstream projects and contribute back improvements to their code." We should favor working with vendors who share this value. It also speaks to the vendor's deep knowledge of the Kubernetes system it operates. | Needs review |
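
As a rough illustration of what SQ1 implies in practice, a monthly uptime percentage can be converted into a downtime budget. The sketch below assumes a 30-day month and a simple wall-clock definition of uptime; a vendor's SLO may measure availability differently, and `downtime_budget_minutes` is a hypothetical helper name.

```python
def downtime_budget_minutes(uptime_percent: float, days_in_month: int = 30) -> float:
    """Minutes of downtime allowed per month at the given uptime percentage."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    total_minutes = days_in_month * 24 * 60  # 43,200 minutes in a 30-day month
    return total_minutes * (100 - uptime_percent) / 100
```

Under these assumptions, 99.99% monthly uptime leaves about 4.32 minutes of downtime per month, while 99.9% would leave about 43.2 minutes.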