Wikimedia Technology/Annual Plans/FY2019/TEC3: Deployment Pipeline/Goals
Program Goals and Status for FY18/19
- Goal Owner: Greg Grossmeier
- Program Goals for FY18/19: We will streamline and integrate the delivery of services by building a new production platform for integrated development, testing, deployment and hosting of applications. Wikimedia developers will experience tooling parity between our Continuous Integration (CI) and production environments, which enables them to release code more frequently while continuously reducing risk.
- Annual Plan: TEC3 Deployment Pipeline
- Primary Goal is Knowledge as a Service: Evolve our systems and structures
- Tech Goal: Sustaining
Outcome 1 / Output 1.1
Continuous Integration is unified with production tooling and developer feedback is faster
- Convert current CI builds to use the new tooling (Blubber).
Dependencies on: SRE team
Goal(s)
- Move verify stage from Minikube to CI k8s namespace in production context
Status
Note: July 2018
In progress
Note: August 10, 2018
In progress. Discussed that work on a patch is still ongoing; the pipeline job needs to be refactored to use the new namespace. This will be a change to the existing service, and it will need to be refactored again when we get to the shared library.
Note: September 14, 2018
- This is now Done!
Outcome 1 / Output 1.2
Continuous Integration is unified with production tooling and developer feedback is faster
- Set up test execution time profiling with a report, and make a prioritized list of improvements to how tests are run.
Goal
- Formalize the collection of CI infrastructure and tooling metrics - T205923
Status
Note: October 2, 2018
- This is now In progress
Note: November 7, 2018
In progress
- dduvall gave a presentation Monday looking at CI performance percentiles
- Work continues on automating the collection of these metrics.
Note: December 6, 2018
- This goal is Partially done, but we still need to expose an interface to the metrics that we're collecting.
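As an illustration of the kind of percentile summary mentioned in the November note, here is a minimal sketch in Python. It is not the team's actual tooling: the function name `duration_percentiles`, the chosen percentile points, and the sample durations are all hypothetical, and it assumes build durations are already available as a plain list of seconds.

```python
from statistics import quantiles


def duration_percentiles(durations_s, points=(50, 90, 95, 99)):
    """Return {percentile: seconds} for a list of CI build durations.

    Hypothetical helper for illustration only.
    """
    if len(durations_s) < 2:
        raise ValueError("need at least two samples")
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99.
    cuts = quantiles(durations_s, n=100, method="inclusive")
    return {p: cuts[p - 1] for p in points}


# Made-up sample data, in seconds (not real CI measurements).
sample = [120, 135, 150, 180, 240, 300, 310, 420, 600, 1500]
result = duration_percentiles(sample)

for p, seconds in result.items():
    print(f"p{p}: {seconds:.0f}s")
```

Reporting high percentiles alongside the median makes slow outlier builds visible, which a plain average would hide.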
Outcome 2 / Output 2.3
Deployers have a better assessment of risk with each deploy
- Improve our incident response, post-mortem, and follow-up management tooling.
Goal
- Develop a set of metrics to assess incident reports/post-mortems.
Status
Note: October 2, 2018
- This work has not yet been started: To do
Note: November 7, 2018
- This is now In progress with Zeljko's analysis of the past year's worth of incident reports.
Note: December 6, 2018
- We can now determine how associated commits, repos, etc. are connected to the incident reports, and we consider this goal Done. There is more work that we can do to further refine the metrics.
Outcome 3 / Output 3.1
Deployments happen through percentage-based stages (e.g. canaries, 10%, 100%)
- Migration of services currently on our "shared service cluster" into Kubernetes (k8s) deployments with staged rollout
Primary teams: Service Operations, Release Engineering
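The idea of percentage-based stages can be sketched roughly as follows. This is a simplified illustration in Python, not the pipeline's implementation: the `staged_rollout` function, the `deploy` and `healthy` callbacks, and the 1% canary size are assumptions made for the example; only the stage names (canaries, 10%, 100%) come from the output description above.

```python
def staged_rollout(stages, deploy, healthy):
    """Deploy to each stage in order and stop at the first unhealthy one.

    stages:  list of (name, percent) tuples, smallest first
    deploy:  callable taking the target percentage (hypothetical hook)
    healthy: callable returning True if the stage looks good (hypothetical hook)
    """
    completed = []
    for name, percent in stages:
        deploy(percent)
        if not healthy(percent):
            # Halt here so later, larger stages never receive the bad release.
            return {"ok": False, "failed_at": name, "completed": completed}
        completed.append(name)
    return {"ok": True, "failed_at": None, "completed": completed}


# The canary percentage is an assumed value for illustration.
STAGES = [("canary", 1), ("10%", 10), ("100%", 100)]

deployed = []
outcome = staged_rollout(
    STAGES,
    deploy=deployed.append,              # record what would be deployed
    healthy=lambda percent: percent < 100,  # pretend the final stage fails
)
print(outcome)
```

Because each stage gates the next, a problem caught at the canary or 10% stage limits how many users are exposed to it.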
Goal(s)
- Adopt more services into Deployment pipeline
- Migrate graphoid to the Deployment pipeline: Postponed
- Deploy zotero v2 to the Deployment pipeline: Done
- Deploy blubberoid: Done
- Reprise the work on the logging infrastructure T207200
Status
Note: October 2, 2018
- This is now Done
Note: November 7, 2018
- Deploy zotero v2 to the Deployment pipeline: Done
- Currently living in k8s staging
- Plan to go live next week
- Deploy blubberoid: Done
- liw is working on changes to internal data structuring as a prerequisite to creating the OpenAPI spec required for the pipeline; on track.
Note: December 6, 2018
- Zotero is Done; Graphoid is Postponed and will be recommended for stewardship review; Blubber will be Done in the next week. Reprising the work on the logging infrastructure is still Done.
Outcome 1 / Output 1.2 (RelEng)
Continuous Integration is unified with production tooling and developer feedback is faster
- Set up test execution time profiling with a report, and make a prioritized list of improvements to how tests are run.
Goal
- Instrument Quibble for data collection
- Create a graph of where time is spent and make a prioritized list of improvements.
Status
Note: January 10, 2019
- Discussed that as we've just gotten back from our vacations, this work is ramping up and is In progress
Note: February 5, 2019
- These tasks were documented in Phab during last week's all-hands meetings.
Note: March 12, 2019
- This goal is Stalled: it is currently in danger of not being finished this quarter and will be part of the Q4 goals.
Outcome 2 / Output 2.1 (RelEng)
Deployers have a better assessment of risk with each deploy
- Create a deployments report with metrics from the Code Health Group.
Goal(s)
- Select and integrate a code health metric solution into our tooling.
Status
Note: January 10, 2019
- Discussed that as we've just gotten back from our vacations, this work is ramping up and is In progress
Note: February 2019
- Discussed that this is contingent on some other program work in TEC13, but we should be able to get fully started on it soon.
Note: March 12, 2019
- We've selected SonarQube as our metric solution, but we still need to create and finalize the integration for it (with SonarCloud) for self-hosting and get it integrated into CI (still In progress and moved into Q4 work).
Outcome 3 / Output 3.1 (SRE / ServiceOps & RelEng)
Deployments happen through percentage-based stages (e.g. canaries, 10%, 100%)
- Migration of services currently on our "shared service cluster" into Kubernetes (k8s) deployments with staged rollout
Primary teams: Service Operations, Release Engineering, Core Platform
Goal(s)
Adopt more services in the pipeline
- cxserver, ORES (partially), citoid, changeprop, cpjobqueue (stretch)
- Deploy eventgate
Status
Note: January 10, 2019
- Discussed that as we've just gotten back from our vacations, this work is ramping up and is In progress
Note: February 5, 2019
- Discussed that one service (eventgate) is currently going through the pipeline; still In progress
Note: March 2019
- cxserver: all prerequisites are ready, it just needs to be deployed; ORES is still In progress and blockers have been identified; citoid is Done; changeprop is still In progress; cpjobqueue is Postponed to Q4; eventgate has been deployed and is Done
Note: April 8, 2019
- cxserver is now Done
Outcome 4 / Output 4.1 (SRE / ServiceOps)
Developers are able to create services that achieve production-level standards with minimal overhead
- Developers get a service creation experience that is on par with production-level standards with regard to logging, monitoring, security and configuration
Goal(s)
- Evaluate Helm chart management solutions
Status
Note: January 10, 2019
- Discussed that as we've just gotten back from our vacations, this work is ramping up and is In progress
Note: February 5, 2019
- Discussed this at All Hands; we're working with SRE on it, and it is still In progress
To do March 2019
- Discussed...
Outcome 5 / Output 5.1 (SRE / ServiceOps)
Services and the deployment pipeline are hosted on production-level infrastructure
- Adequately maintain the service infrastructure according to production standards. Upgrade the platform to new upstream versions to benefit from bug fixes and new features.
Goal(s)
- Aim for a more resilient and scalable Kubernetes cluster service that is easier to manage and upgrade
- Upgrade cluster components to a newer version
- Improve docker registry architecture
Status
Note: January 10, 2019
- Discussed that as we've just gotten back from our vacations, this work is ramping up and is In progress
To do February 2019
- Discussed...
To do March 2019
- Discussed...
Outcome 6 / Output 6.1 (SRE / ServiceOps)
Developers and deployers are aware of the platform, its benefits and how to make use of it
- Create a developer portal for the Deployment Pipeline platform with documentation and instructions
Goal(s)
- Create a developer portal with Deployment Pipeline documentation
Status
Note: January 10, 2019
- Discussed that as we've just gotten back from our vacations, this work is ramping up and is In progress
To do February 2019
- Discussed...
To do March 2019
- Discussed...
Outcome 6 / Output 6.2 (SRE / ServiceOps)
Developers and deployers are aware of the platform, its benefits and how to make use of it
- Promote the platform's adoption
Goal(s)
- Conduct promotion and training events for Wikimedia developers
Status
Note: January 10, 2019
- Discussed that as we've just gotten back from our vacations, this work is ramping up and is In progress
To do February 2019
- Discussed...
To do March 2019
- Discussed...
Outcome 1 / Output 1.2 (RelEng)
Continuous Integration is unified with production tooling and developer feedback is faster
- Set up test execution time profiling with a report, and make a prioritized list of improvements to how tests are run.
Goal
- Instrument Quibble for data collection
- Create a graph of where time is spent and make a prioritized list of improvements.
- Prepare the Deployment Pipeline for changes to our CI tooling.
Status
Note: April 8, 2019
- This is now In progress, but the instrumentation work is Blocked right now.
Note: May 7, 2019
- This is still Blocked, as we are waiting on other teams.
Note: June 4, 2019
- This is still Blocked, as we are waiting for our team to finish other, more urgent work; it will probably go into next FY.
Outcome 3 / Output 3.1 (RelEng + SRE)
Deployments happen through percentage-based stages (e.g. canaries, 10%, 100%)
- Migration of services currently on our "shared service cluster" into Kubernetes deployments with staged rollout
Goal(s)
- Create a .pipeline/config.yaml standard to give users more control over how their tests are run in the pipeline and allow the easy saving of artifacts at pipeline completion. (RelEng)
- Migration of more services to the pipeline (RelEng + SRE) - T212801:
- Wikidata Termbox SSR
- Kask for Session Storage Service
- cpjobqueue (stretch)
- ORES (stretch)
Status
Note: April 8, 2019
- All goals are In progress
Note: May 7, 2019
- All goals are In progress
Note: June 4, 2019
- All goals are In progress. termbox is now producing images but is not yet in production; kask is also Partially done. Stretch goals will probably move to next FY. Pipeline config work is Done.
Note: June 13, 2019
- Migration should be done this quarter, but the overall program will remain In progress into next FY; stretch goals will be done next FY. (SRE update)
Outcome 5 / Output 5.1 (SRE)
Services and the deployment pipeline are hosted on production-level infrastructure
- Adequately maintain the service infrastructure according to production standards. Upgrade the platform to new upstream versions to benefit from bug fixes and new features.
Goal(s)
- Upgrade the infrastructure to recent/current software versions
- Add dedicated security sensitive nodes to the Kubernetes clusters
- Stretch: Implementation of a Helm chart management solution
Status
Note: June 13, 2019
- Upgrade is still In progress and will be done by end of quarter. The addition of the dedicated nodes is Done, and the stretch goal was started only recently and will also be done by the end of the quarter.
Outcome 6 / Output 6.1 + 6.2 (SRE)
Developers and deployers are aware of the platform, its benefits and how to make use of it
- Create a developer portal for the Deployment Pipeline platform with documentation and instructions
- Promote the platform's adoption
Goal(s)
- Conduct at least N trainings for new pipeline users
- Increase documentation quality
Status
Note: June 13, 2019
- Trainings have been ongoing and are still In progress
- Work on documentation quality has been going slowly and will probably go into next quarter.