Wikimedia Technology/Annual Plans/FY2019/TEC3: Deployment Pipeline/Goals

=Program Goals and Status for FY18/19=

TEC3 Deployment Pipeline
 * Goal Owner: Greg Grossmeier
 * Program Goals for FY18/19: We will streamline and integrate the delivery of services by building a new production platform for integrated development, testing, deployment and hosting of applications. Wikimedia developers experience tooling parity between our Continuous Integration (CI) and production environments, which enables them to release code more frequently by continuously reducing risk.
 * Annual Plan: TEC3 Deployment Pipeline
 * Primary Goal is Knowledge as a Service: Evolve our systems and structures
 * Tech Goal: Sustaining





= Q1 Goals =

Outcome 1 / Output 1.1
Continuous Integration is unified with production tooling and developer feedback is faster
 * Convert current CI builds to use the new tooling (Blubber).

Dependencies on: SRE team

Goal(s)

 * Move verify stage from Minikube to CI k8s namespace in production context

Status
July 2018

August 10, 2018
 * Work on a patch is still ongoing; the pipeline job needs to be refactored for the new namespace. This is a change to the existing service, and it will need to be refactored again when we get to the shared library.

September 14, 2018
 * This is now ✅!



= Q2 Goals =

Outcome 1 / Output 1.2
Continuous Integration is unified with production tooling and developer feedback is faster


 * Set up test execution time profiling with a report, and make a prioritized list of improvements to how tests are run.

Goal

 * Formalize the collection of CI infrastructure and tooling metrics

Status
October 2, 2018
 * This is now

November 7, 2018
 * dduvall gave a presentation Monday looking at CI performance percentiles
 * Work continues on automating the collection of these metrics.
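The percentile view of CI performance mentioned above can be sketched as follows. This is only an illustration using the nearest-rank method; the function names and the sample build durations are hypothetical and do not come from the actual CI metrics collection.

```python
import math


def percentile(durations, p):
    """Return the nearest-rank percentile (p in 1..100) of a list of durations."""
    if not durations:
        raise ValueError("at least one duration is required")
    ordered = sorted(durations)
    # Nearest-rank: the smallest value with at least p% of samples at or below it.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(durations):
    """Summarize build durations as the p50, p90 and p99 percentiles."""
    return {f"p{p}": percentile(durations, p) for p in (50, 90, 99)}


# Hypothetical CI build durations in seconds (not real data).
build_seconds = [62, 75, 81, 90, 118, 240, 95, 70, 88, 610]
print(summarize(build_seconds))
```

Percentiles are used here instead of an average because a few very slow builds would otherwise dominate the picture of typical developer feedback time.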

December 6, 2018
 * This goal is but we need to expose the interface of the metrics that we're collecting.

Outcome 2 / Output 2.3
Deployers have a better assessment of risk with each deploy


 * Improve our incident response, post-mortem, and follow-up management tooling.

Goal

 * Develop a set of metrics to assess incident reports/post-mortems.

Status
October 2, 2018
 * This work has not yet started.

November 7, 2018
 * This is now with Zeljko's analysis of the past year's worth of incident reports.

December 6, 2018
 * We can now determine how commits, repos, etc. are connected to the incident reports, and we consider this goal ✅. There is more work that we can do to further refine the metrics.

Outcome 3 / Output 3.1
Deployments happen through percentage-based stages (e.g. canaries, 10%, 100%)


 * Migration of services currently on our "shared service cluster" into Kubernetes (k8s) deployments with staged rollout
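A minimal sketch of what a percentage-based staged rollout might look like in code. The stage list (with 1% standing in for canaries), the `staged_rollout` function and the `is_healthy` callback are all hypothetical and are not the Deployment Pipeline's actual tooling.

```python
from typing import Callable, List, Sequence

# Hypothetical stages: 1% stands in for the canary stage.
STAGES = (1, 10, 100)


def staged_rollout(stages: Sequence[int], is_healthy: Callable[[int], bool]) -> List[int]:
    """Advance through the stages in order and stop at the first unhealthy one.

    Returns the list of stages (as percentages) that completed successfully.
    """
    completed: List[int] = []
    for percent in stages:
        # A real deploy would shift `percent` of traffic here before checking health.
        if not is_healthy(percent):
            break
        completed.append(percent)
    return completed


# Example: a deploy that only shows problems at full traffic stops after 10%.
print(staged_rollout(STAGES, lambda percent: percent < 100))
```

Stopping at the first unhealthy stage is what limits the risk of each deploy: a bad change is caught while it affects only a fraction of traffic.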

Primary teams: Service Operations, Release Engineering

Goal(s)

 * Adopt more services into the Deployment pipeline
 * Migrate graphoid to the Deployment pipeline ❌
 * Deploy zotero v2 to the Deployment pipeline ✅
 * Deploy blubberoid ✅
 * Reprise the work on the logging infrastructure

Status
October 2, 2018
 * This is now ✅

November 7, 2018
 * Deploy zotero v2 to the Deployment pipeline ✅
 * Currently living in k8s staging
 * Plan to go live next week
 * Deploy blubberoid ✅
 * liw working on changes to internal data structuring as a prerequisite to creating OpenAPI spec required for pipeline — on track.

December 6, 2018
 * Zotero is ✅, Graphoid will be recommended for stewardship review ❌, Blubber will be ✅ in the next week. The goal to reprise the work on the logging infrastructure is still ✅.



= Q3 Goals =

DRAFT, still to be put under the respective outcomes:

(SRE)


 * Adopt more services in the pipeline
 * Add session storage service, SSR
 * Stretch: Migrate cp-jobqueue, ORES
 * Conduct at least N trainings for new pipeline users
 * Increase documentation quality
 * Upgrade the infrastructure to recent/current software versions
 * Add dedicated security sensitive nodes to the Kubernetes clusters
 * Stretch: Implementation of a Helm chart management solution

Outcome 1 / Output 1.2 (RelEng)
Continuous Integration is unified with production tooling and developer feedback is faster


 * Set up test execution time profiling with a report, and make a prioritized list of improvements to how tests are run.

Goal

 * Instrument Quibble for data collection
 * Create a graph of where time is spent and make a prioritized list of improvements.
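One way the "where time is spent" breakdown could be derived from collected timings is sketched below. The `(stage, seconds)` sample format, the stage names and the numbers are hypothetical; they are not Quibble's actual output.

```python
from collections import defaultdict


def rank_by_time_spent(samples):
    """Total the seconds per stage and rank stages by share of overall time.

    `samples` is an iterable of (stage_name, seconds) pairs.
    Returns a list of (stage_name, total_seconds, share) sorted by total, descending.
    """
    totals = defaultdict(float)
    for stage, seconds in samples:
        totals[stage] += seconds
    grand_total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [
        (stage, seconds, seconds / grand_total if grand_total else 0.0)
        for stage, seconds in ranked
    ]


# Hypothetical timings in seconds (not real measurements).
samples = [("install", 40.0), ("phpunit", 300.0), ("install", 50.0), ("selenium", 110.0)]
for stage, seconds, share in rank_by_time_spent(samples):
    print(f"{stage}: {seconds:.0f}s ({share:.0%})")
```

The stages at the top of the ranking are the natural first candidates for the prioritized list of improvements.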

Status
January 10, 2019
 * Discussed that as we've just gotten back from our vacations, this work is ramping up and is

February 5, 2019
 * These tasks were documented in Phabricator during last week's All Hands meetings.

March 12, 2019
 * This goal is currently in danger of not finishing this quarter and will be part of the Q4 goals ❌

Outcome 2 / Output 2.1 (RelEng)
Deployers have a better assessment of risk with each deploy


 * Create a deployments report with metrics from the Code Health Group.

Goal(s)

 * Select and integrate a code health metric solution into our tooling.

Status
January 10, 2019
 * Discussed that as we've just gotten back from our vacations, this work is ramping up and is

February 2019
 * Discussed that this is contingent on some other program work in TEC13, but we should be able to get fully started on it soon.

March 12, 2019
 * We've selected SonarQube to be our metric solution, but we need to create and finalize the integration for it (with SonarCloud) for self-hosting and get it integrated into CI (still )

Outcome 3 / Output 3.1 (SRE / ServiceOps & RelEng)
Deployments happen through percentage-based stages (e.g. canaries, 10%, 100%)


 * Migration of services currently on our "shared service cluster" into Kubernetes (k8s) deployments with staged rollout

Primary teams: Service Operations, Release Engineering, Core Platform

Goal(s)
Adopt more services in the pipeline


 * cxserver, ORES (partially), citoid, changeprop, cpjobqueue (stretch)
 * Deploy eventgate

Status
January 10, 2019
 * Discussed that as we've just gotten back from our vacations, this work is ramping up and is

February 5, 2019
 * Discussed that one service (eventgate) is currently going through the pipeline...still

March 2019
 * cxserver: all prerequisites are ready, it just needs to be deployed; ORES: still and blockers identified; citoid ✅; changeprop is still ; cpjobqueue is ❌ and moves to Q4; eventgate has been deployed and is ✅

Outcome 4 / Output 4.1 (SRE / ServiceOps)
Developers are able to create services that achieve production level standards with minimal overhead


 * Developers get a service-creation experience that is on par with production-level standards with regard to logging, monitoring, security and configuration

Goal(s)

 * Evaluate Helm chart management solutions

Status
January 10, 2019
 * Discussed that as we've just gotten back from our vacations, this work is ramping up and is

February 5, 2019
 * Discussed this at All Hands, but we're working with SRE on it, still

March 2019
 * Discussed...

Outcome 5 / Output 5.1 (SRE / ServiceOps)
Services and the deployment pipeline are hosted on production-level infrastructure


 * Adequately maintain the service infrastructure according to production standards. Upgrade the platform to new upstream versions to benefit from bug fixes and new features.

Goal(s)

 * Aim for a more resilient and scalable Kubernetes cluster service that is easier to manage and upgrade
 * Upgrade cluster components to a newer version
 * Improve the Docker registry architecture

Status
January 10, 2019
 * Discussed that as we've just gotten back from our vacations, this work is ramping up and is

February 2019
 * Discussed...

March 2019
 * Discussed...

Outcome 6 / Output 6.1 (SRE / ServiceOps)
Developers and deployers are aware of the platform, its benefits and how to make use of it


 * Create a developer portal for the Deployment Pipeline platform with documentation and instructions

Goal(s)

 * Create a developer portal with Deployment Pipeline documentation

Status
January 10, 2019
 * Discussed that as we've just gotten back from our vacations, this work is ramping up and is

February 2019
 * Discussed...

March 2019
 * Discussed...

Outcome 6 / Output 6.2 (SRE / ServiceOps)
Developers and deployers are aware of the platform, its benefits and how to make use of it


 * Promote the platform's adoption

Goal(s)

 * Conduct promotion and training events for Wikimedia developers

Status
January 10, 2019
 * Discussed that as we've just gotten back from our vacations, this work is ramping up and is

February 2019
 * Discussed...

March 2019
 * Discussed...



= Q4 Goals =

Outcome 1 / Output 1.2 (RelEng)
Continuous Integration is unified with production tooling and developer feedback is faster


 * Set up test execution time profiling with a report, and make a prioritized list of improvements to how tests are run.

Goal

 * Instrument Quibble for data collection
 * Create a graph of where time is spent and make a prioritized list of improvements.
 * Prepare the Deployment Pipeline for changes to our CI tooling.

Status
April 2019
 * Discussed...

May 2019
 * Discussed...

June 2019
 * Discussed...

Outcome 3 / Output 3.1
Deployments happen through percentage-based stages (e.g. canaries, 10%, 100%)
 * Migration of services currently on our "shared service cluster" into Kubernetes deployments with staged rollout

Goal(s)

 * Create a .pipeline/config.yaml standard to give users more control over how their tests are run in the pipeline and to make it easy to save artifacts when the pipeline completes.
 * Migration of services: cpjobqueue, ORES

Status
April 2019
 * Discussed...

May 2019
 * Discussed...

June 2019
 * Discussed...