Wikimedia Technology/Annual Plans/FY2019/TEC3: Deployment Pipeline


NOTE: This is a continuation of the FY18 program titled "Streamlined Services Delivery".

Figure: Containerized Continuous Deployment Pipeline 2018

We will build a new production platform for integrated development, testing, deployment, and hosting of applications. This will greatly reduce the complexity and increase the speed of delivering a service and maintaining it throughout its lifecycle, with fewer dependencies between teams and greater automation and integration. The platform will offer more flexibility through support for automatic high availability and scaling, abstraction from hardware, and a streamlined path from development through testing to deployment. Services will be isolated from each other for increased reliability and security.

A major focus for this year is reducing the risk of each deployment we make, which three of the work streams address:

  • Unifying our Continuous Integration infrastructure and tooling with production (with the added benefit of speeding up developer feedback),
  • Giving our Release Engineers the tools they need to assess and reduce risk of any given deployment, and
  • Deploying new updates to our users through percentage-based stages, so that issues are caught early, before all users have a bad experience.

NOTE: This program only covers the "build" through "deploy" stages in the above image. Support for the "dev" stage (and more) is covered in the Developer productivity proposal. If funded, that program would be merged into this one for manageability.

Goals

Program outline

Teams contributing to the program

Release Engineering, Site Reliability Engineering, and Services

Annual Plan priorities

Primary Goal: 3. Knowledge as a Service - evolve our systems and structures

How does your program affect annual plan priority?

By enabling our developers to quickly see their code in production, we will make product development faster and more efficient, aiding all who create and consume the sum of all human knowledge.

Program Goal

We will streamline and integrate the delivery of services by building a new production platform for integrated development, testing, deployment, and hosting of applications.

Wikimedia developers experience tooling parity between our Continuous Integration (CI) and production environments, which enables them to release code more frequently by continuously reducing risk.

Outcomes

Outcome 1: Continuous Integration is unified with production tooling and developer feedback is faster

Output 1.1
Convert current CI builds to use the new tooling (Blubber).
Output 1.2
Set up test execution time profiling with a report, and make a prioritized list of improvements to how tests are run.
Output 1.3
Research and share a report on our options for implementing delta-only/code-path-aware testing.

Outcome 2: Deployers have a better assessment of risk with each deploy

Output 2.1
Create a deployments report with metrics from the Code Health Group.
Output 2.2
Stretch: Create a dashboard for real-time insight into the deployment report.
Output 2.3
Improve our incident response, post-mortem, and follow-up management tooling.

Outcome 3: Deployments happen through percentage-based stages (e.g., canaries, 10%, 100%)

Output 3.1
Migrate services currently on our "shared service cluster" into Kubernetes deployments with staged rollout.
Output 3.2
Prepare for moving MediaWiki into the Kubernetes system by defining a set of broad service-level tests.
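
The percentage-based stages named in Outcome 3 can be illustrated with a small sketch. It does not show how Kubernetes or Helm performs a rollout, and nothing in it comes from this plan beyond the stage names: the 1% canary share, the hash-based bucketing, and the names `STAGES`, `bucket`, `receives_new_version`, and `run_rollout` are assumptions made for illustration only.

```python
import hashlib

# Hypothetical stage table. The plan names "canaries, 10%, 100%" but does not
# say what share of traffic a canary receives; 1% is an assumption.
STAGES = [("canary", 1), ("10%", 10), ("100%", 100)]


def bucket(unit_id: str) -> int:
    """Map an identifier to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(unit_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def receives_new_version(unit_id: str, stage_percent: int) -> bool:
    """A unit is exposed once the stage percentage exceeds its bucket."""
    return bucket(unit_id) < stage_percent


def run_rollout(unit_ids, is_healthy):
    """Walk through the stages in order and stop at the first unhealthy one.

    `is_healthy` is a caller-supplied callback (hypothetical) that receives
    the stage name and the list of exposed units and returns True or False.
    """
    for name, percent in STAGES:
        exposed = [u for u in unit_ids if receives_new_version(u, percent)]
        if not is_healthy(name, exposed):
            return ("rolled back", name)
    return ("complete", STAGES[-1][0])
```

Because each unit's bucket is fixed, every unit exposed at the canary stage stays exposed at 10% and 100%, so a problem seen early is not hidden by reshuffling who receives the new version.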

Outcome 4: Developers are able to create services that achieve production-level standards with minimal overhead

Output 4.1
Developers get a service creation experience that is on par with production-level standards with regard to logging, monitoring, security, and configuration.

Outcome 5: Services and the deployment pipeline are hosted on production-level infrastructure

Output 5.1
Adequately maintain the service infrastructure according to production standards. Upgrade the platform to new upstream versions to benefit from bug fixes and new features.

Outcome 6: Developers and deployers are aware of the platform, its benefits and how to make use of it

Output 6.1
Create a developer portal for the Deployment Pipeline platform with documentation and instructions
Output 6.2
Promote the platform's adoption

Resources

People
Release Engineering
  FY2017–18
  • ~2 Engineers
  FY2018–19
  • 0.25 ✕ QA Engineer (contractor, reallocated)
  • 0.75 ✕ Software Engineer (contractor, reallocated)
  • Software Engineer (reallocated)
  • Software Engineer (reallocated)
  • 0.5 ✕ Software Engineer (reallocated)
  • 0.75 ✕ Sr Software Engineer (reallocated)
  • 0.5 ✕ Engineering Manager (reallocated)
SRE
  FY2017–18
  • ~0.5 Site Reliability Engineers
  FY2018–19
  • Site Reliability Engineer (new hire)
  • 0.5 ✕ Site Reliability Engineer
Stuff (CapEx)
  • Kubernetes cluster for Continuous Integration use (should be in SRE's CapEx already).
Travel & Other
(Consolidated with Code Health and Reliability, Performance and Maintenance):
  • 8 x Developer Summit & Team offsite
  • 8 x Hackathon & Team offsite

Targets

Outcome 1: Continuous Integration is unified with production tooling and developer feedback is faster

Target 1.1
All Continuous Integration jobs are migrated to use production deployment tooling (e.g., Helm, Minikube, Docker, and Blubber).
Measurement method
  1. This is measured by the number of Jenkins jobs migrated to our production deployment tooling (e.g., Blubber).

Outcome 2: Deployers have a better assessment of risk with each deploy

Target 2.1
We reduce the number of MediaWiki deployment incidents by 10%.
Measurement method
  1. This is measured by the number of rollback-inducing deployments, whether made through the weekly release train or through SWAT deploys.
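
As a worked illustration of Target 2.1, the reduction can be computed from two counts of rollback-inducing deployments. This is only a sketch: the function names and the example counts are hypothetical, and the plan does not state the baseline figure or the comparison period.

```python
def percent_reduction(baseline: int, current: int) -> float:
    """Percentage drop in incidents relative to the baseline count."""
    if baseline <= 0:
        raise ValueError("baseline must be a positive count")
    return (baseline - current) * 100 / baseline


def meets_target(baseline: int, current: int, target_percent: float = 10.0) -> bool:
    """True when the drop is at least the target (10% in Target 2.1)."""
    return percent_reduction(baseline, current) >= target_percent


# Hypothetical counts: 40 rollback-inducing deployments in the baseline year,
# 36 in the following year, which is exactly a 10% reduction.
print(percent_reduction(40, 36))  # 10.0
print(meets_target(40, 36))       # True
```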

Outcome 3: Deployments happen through percentage-based stages (e.g., canaries, 10%, 100%)

Target 3.1
All services currently on our "shared service cluster" are deployed through percentage-based stages.
Measurement method
  1. This is measured by identifying which services are deployed on Kubernetes through a percentage-based rollout method.

Outcome 4: Developers are able to create services that achieve production-level standards with minimal overhead

Targets
  1. 100% of new services that follow our own coding standards will have their logs collected and their metrics exposed and monitored, and will use encryption.
Measurement method
  1. Number of Phabricator tasks under https://phabricator.wikimedia.org/tag/service-deployment-requests/

Outcome 5: Services and the deployment pipeline are hosted on production-level infrastructure

Targets
  1. 99% availability for the deployment pipeline
Measurement method
  1. CI availability metrics
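
To make the 99% figure concrete, the sketch below converts an availability target into a downtime budget and back. The function names and the 30-day measurement window are assumptions for illustration; the plan does not define the window or how CI downtime is recorded.

```python
def downtime_budget_minutes(total_minutes: float, target_percent: float) -> float:
    """Minutes of downtime allowed in the window while still meeting the target."""
    return total_minutes * (100 - target_percent) / 100


def availability_percent(total_minutes: float, downtime_minutes: float) -> float:
    """Observed availability over the window, as a percentage."""
    return (total_minutes - downtime_minutes) * 100 / total_minutes


# Assumed window: 30 days = 43,200 minutes.
WINDOW_MINUTES = 30 * 24 * 60

print(downtime_budget_minutes(WINDOW_MINUTES, 99))  # 432.0 minutes, i.e. 7.2 hours
print(availability_percent(WINDOW_MINUTES, 432))    # 99.0
```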

Outcome 6: Developers and deployers are aware of the platform, its benefits and how to make use of it

Targets
Measurement method
  1. Survey among the target audience

Dependencies

  • MediaWiki Platform: This program requires cross-team collaboration and planning for deploying MediaWiki and Services on a Kubernetes cluster.