Wikimedia Technology/Annual Plans/FY2019/TEC1: Reliability, Performance, and Maintenance/Goals

=Program Goals and Status for FY18/19=

TEC1: Reliability, Performance, and Maintenance
 * Goal Owners: Mark Bergsma; Ian Marlier; Nuria Ruiz; Bryan Davis
 * Program Goals for FY18/19: We will maintain the availability of Wikimedia’s sites and services for our global audiences and ensure they’re running reliably, securely, and with high performance. We will do this while modernizing our infrastructure and improving current levels of service when it comes to testing, deployments, and maintenance of software and hardware.
 * Annual Plan: TEC1: Reliability, Performance, and Maintenance
 * Primary Goal is Knowledge as a Service: Evolve our systems and structures
 * Tech Goal: Sustaining



===Q1 Goals===

Outcome / Output (Analytics)
We have scalable, reliable and secure systems for data transport.
 * Analytics stack maintains current level of service. OS Upgrades.

Goal(s)

 * Continue upgrading to Debian Stretch:
 * AQS ✅
 * thorium ✅
 * Archiva ✅
 * bohrium (where piwik resides)
 * Refresh of:
 * hadoop master nodes ✅
 * analytics1003

Status
July 2018
 * one goal

August 22, 2018
 * one goal

September 18, 2018
 * Remaining hosts to be completed before end of quarter

Outcome 1 / Output 1.1 (SRE)
Current levels of service are maintained and/or improved for all production sites, services and underlying infrastructure.
 * Deploy, update, configure, and maintain and improve production services, platforms, tooling, and infrastructure (Traffic infrastructure, databases & storage, MediaWiki application servers, (micro)services, network, Infrastructure Foundations, Analytics infrastructure, developer & release tooling, and miscellaneous sites & services)

Dependencies on: Core Platform (MediaWiki, RESTBase), Performance, Release Engineering, Parsing (Parsoid), Analytics (EventBus), Community Liaisons; Primary team: SRE

Goal(s)
Perform a datacenter switchover
 * Successfully switch backend traffic (MediaWiki, Swift, RESTBase, and Parsoid) to be served from codfw with no downtime and reduced read-only time ✅
 * Serve the site from codfw for at least 3 weeks
 * Refactor the switchdc script into a more re-usable automation library and update it to the newer switchover requirements ✅

Status
July 2018
 * Discussed...

August 14, 2018
 * This is ongoing and looking good. There were some minor delays, but the first switchover is scheduled for September, with the switch back in October.

September 11, 2018
 * The switch to codfw is happening this week. The switchdc refactoring is complete and has been published as spicerack, a generically re-usable (FOSS) automation library.

Outcome 1 / Output 1.1 (SRE / Traffic)
Current levels of service are maintained and/or improved for all production sites, services and underlying infrastructure.
 * Deploy, update, configure, and maintain and improve production services, platforms, tooling, and infrastructure (Traffic infrastructure, databases & storage, MediaWiki application servers, (micro)services, network, Infrastructure Foundations, Analytics infrastructure, developer & release tooling, and miscellaneous sites & services)

Primary teams: SRE & Traffic

Goal(s)

 * Deploy an ATS backend cache test cluster in core DCs: ✅
 * 2x 4 node clusters
 * Puppetization
 * Application-layer routing
 * Deploy a scalable service for ACME (LetsEncrypt) certificate management:
 * Features:
 * Core DC redundancy
 * Wildcards + SANs
 * Multiple client hosts for each cert
 * Coding + Puppetization
 * Deploy in both DCs
 * Live use for one prod cert
 * Increase network capacity:
 * eqiad: 2 rows with 3*10G racks
 * codfw: 2 rows with 3*10G racks
 * ulsfo: replace routers
 * eqdfw: replace router

Status
July 2018
 * Discussed...

August 14, 2018
 * The ATS testing and creation of packages is going well, and some of it is already in production. It covers the basic needs so far and is nearly done.
 * ACME service is also on track
 * The goal of increasing network capacity is moving much slower than expected, due to issues with the vendor over hardware installation (network switches) on the switch stacks. This goal is at risk for this quarter due to these hardware and testing issues with the new/old switches. The replacement routers will also be at risk for this quarter.

September 2018
 * ✅ The ATS goal is fully complete; an ATS test cluster has been set up in both core DCs.
 * ACME software has been written, deployment and use is expected by end of the quarter.
 * The goal of increasing network capacity has hit a roadblock, due to network switch stack topology incompatibility issues with the vendor, causing network instability in production. This goal is at risk for this quarter, as we can neither take any risks nor move swiftly on this essential and critical infrastructure. The replacement of the routers is expected to complete before EOQ.

Outcome 1 / Output 1.1 (Performance)
Current levels of service are maintained and/or improved for all production sites, services and underlying infrastructure.
 * Deploy, update, configure, and maintain and improve production services, platforms, tooling, and infrastructure (Traffic infrastructure, databases & storage, MediaWiki application servers, (micro)services, network, Infrastructure Foundations, Analytics infrastructure, developer & release tooling, and miscellaneous sites & services)

Primary team: Performance

Goal(s)

 * Improve synthetic monitoring
 * Monitor more WMF sites, including some in each deploy group
 * Configure alerts such that they're directed to the teams that are able to address them
 * Improve Navigation Timing data
 * Update dashboards to use newer navtiming2 keys
 * Move navtiming data from graphite to prometheus
 * Remove dependency on jQuery from MediaWiki's base JavaScript module ✅
 * Support and develop the MediaWiki ResourceLoader component
 * Dependency: Core Platform
 * Support and develop MediaWiki's data access components
 * Dependency: Core Platform

Status
July 2018

August 14, 2018
 * Synthetic monitoring and timing data are in progress but slightly delayed due to summer vacations. Everything else is in progress or done.

September 18, 2018
 * Synthetic monitoring and timing data are in progress and on track. The other goals are ongoing maintenance work that doesn't have a defined end state.

Outcome 2 / Output 2.1 (Performance)
Better designed systems
 * Assist in the architectural design of new services and making them operate at scale

Primary team: Performance

Goal(s)

 * Research performance perception in order to identify specific metrics that influence user behavior

Status
July 2018
 * Discussed...

August 14, 2018

September 18, 2018
 * Data collection is in progress and continues to go well. Data analysis will continue into Q2.

Outcome 3 / Output 3.1 (WMCS)
Users can leverage a reliable and public Infrastructure as a Service (IaaS) product ecosystem for VPS hosting.
 * Maintain existing OpenStack infrastructure and services

Primary team: WMCS

Goal(s)

 * Develop timeline for Ubuntu Trusty deprecation ✅
 * Communicate deprecation timeline to Cloud VPS community
 * Continue replacing Trusty with Debian Jessie/Stretch in infrastructure layer

Status
July 2018
 * 

August 10, 2018
 * Discussed that experiments have been completed but showed some issues ✅; we will need to build parallel grids, new Puppet code, etc.

September 12, 2018
 *  Brooke has made some progress on Puppet changes that will allow us to build a second grid engine deployment with Debian Stretch. This outcome will carry over into Q2 and progress should accelerate as the parallel Neutron migration project moves from development to implementation.

Outcome 3 / Output 3.2 (WMCS)
Users can leverage a reliable and public Infrastructure as a Service (IaaS) product ecosystem for VPS hosting.
 * Replace the current network topology layer with OpenStack Neutron

Primary team: WMCS

Goal(s)

 * Migrate at least one Cloud VPS project to the eqiad1 region and its Neutron SDN layer ✅

Status
July 2018
 * 

August 10, 2018
 * Discussed working on getting eqiad to play well with MAIN (shared services).

September 12, 2018
 * Arturo found a way to transparently route traffic between the legacy Nova Network deployment and the new Neutron deployment. This should make migrating projects to the new deployment significantly easier and more transparent for our end users.
 * Progress is currently stalled due to unexpected hardware compatibility issues with Debian Jessie on new cloudvirt servers. We are getting a lot of help from the DC Ops team and Moritz to find solutions for these issues, but until we have a working combination of hardware & software we are holding off on procuring new servers. This in turn will slow the rate at which we can move projects to the new deployment. This has a larger impact on our Q2 continuation of this program than on meeting the current Q1 goal of moving at least one project.



===Q2 Goals===

Outcome / Output (Analytics)
We have scalable, reliable and secure systems for data transport.


 * Analytics stack maintains current level of service.
 * Complete the OS upgrade to Debian Stretch. Prepare the replacement of the Analytics store database.

Goal(s)

 * Continue upgrading to Debian Stretch:
 * bohrium (where piwik resides) ✅
 * Order hardware and set up basic puppet configuration for the dbstore1002's replacement (multi-instance database hosts).
 * STRETCH GOAL: Set up MySQL data replication on the dbstore1002's replacement.
 * STRETCH GOAL: Add Prometheus metrics for varnishkafka instances running on caching hosts

Status
October 19, 2018
 * The upgrade to Debian Stretch is ✅ for the piwik machine; hardware orders for dbstore are in place.

November 14, 2018
 * Prometheus work is now

December 2018
 * Discussed...

Outcome 1 / Output 1.1 (SRE)
Current levels of service are maintained and/or improved for all production sites, services and underlying infrastructure.


 * Deploy, update, configure, and maintain and improve production services, platforms, tooling, and infrastructure (Traffic infrastructure, databases & storage, MediaWiki application servers, (micro)services, network, Infrastructure Foundations, Analytics infrastructure, developer & release tooling, and miscellaneous sites & services)

Primary team(s): SRE, Dependencies on: Cloud Services, Search, Core Platform

Goal(s)

 * Refresh hardware and perform necessary maintenance
 * Refresh expiring leased hardware with replacements:
 * pc1004-1006
 * pc2004-2006
 * labvirt1010-1011
 * elastic2001-2024
 * restbase2001-2006
 * Refresh of aging (purchased) hardware
 * Refresh and expand Swift cluster in eqiad and codfw
 * Procurement and OS install of db1061-db1073 (refresh)
 * Switch etcd in eqiad to new servers conf100[4-6]

Status
October 2018
 * This is now as hardware is being ordered

November 14, 2018
 * This is still, as hardware has been ordered and is being received and installed. The codfw Parser Cache hosts is and the etcd migration is ✅

December 12, 2018
 * Leased hardware will be sent back by next week, and the refresh will be ✅ in the next two weeks.

Outcome 1 / Output 1.1 (SRE/Traffic)
Current levels of service are maintained and/or improved for all production sites, services and underlying infrastructure.


 * Deploy, update, configure, and maintain and improve production services, platforms, tooling, and infrastructure (Traffic infrastructure, databases & storage, MediaWiki application servers, (micro)services, network, Infrastructure Foundations, Analytics infrastructure, developer & release tooling, and miscellaneous sites & services)

Primary team(s): SRE/Traffic

Goal(s)

 * ATS production-ready as a backend cache layer
 * Purging ✅
 * Logging ✅
 * Monitoring ✅
 * Alerting ✅
 * Multi-DC Routing ✅
 * Backend-side request-mangling ✅
 * All the above prepares us for potential ATS live deployment as cache_upload backends in FQ3
 * Audit all Varnish-fronted services for lack of TLS, ping owners ahead of needs in FQ3/4 and beyond


 * Migrate most standard public TLS certificates to CertCentral issuance ✅
 * Replaces previous minimal/legacy LetsEncrypt automation
 * Work out bugs / operational issues that arise as we scale up CertCentral usage
 * Explicitly out of scope public certs: the big unified wildcard, frack, labs


 * Increase Network Capacity
 * Follow-up and Follow-on to same goal from FQ1
 * eqiad: Finish new switches and migration to supported topologies - T187962 - T183585
 * codfw: Finish new switches and migration to supported topologies - T197147
 * Replace cr1-eqord in Chicago
 * Add cr2-eqsin in Singapore

Status
October 2018
 * Discussed...

November 14, 2018
 * This goal is as purging for the ATS goal has been implemented and the cr1-eqord Chicago router replacement has completed.
 * We have a possible ❌ issue as the Network Capacity goal is at risk for FQ2 due to the holidays and critical infrastructure freezes.

December 12, 2018
 * The ATS goal will be ✅ shortly, and the certificate work will also be ✅ in the next week. Increasing network capacity will be ✅ in Q3, as fundraising is still ongoing; the cabling will be done as soon as they are received.

December 13, 2018
 * The certificates goal is ✅

Outcome 1 / Output 1.1 (RelEng)
Current levels of service are maintained and/or improved for all production sites, services and underlying infrastructure.


 * Deploy, update, configure, and maintain and improve production services, platforms, tooling, and infrastructure (Traffic infrastructure, databases & storage, MediaWiki application servers, (micro)services, network, Infrastructure Foundations, Analytics infrastructure, developer & release tooling, and miscellaneous sites & services)

Dependencies on: SRE

Goal(s)

 * Determine the procedure and requirements for an automated MediaWiki branch cut.

Status
October 2, 2018
 * This work is still

November 7, 2018
 * Scoping work at https://phabricator.wikimedia.org/T156445

December 6, 2018
 * This goal will be nearly complete by finishing up https://phabricator.wikimedia.org/T208528 and https://phabricator.wikimedia.org/T208529 in the next week or so.

Outcome 1 / Output 1.1 (Performance)
Current levels of service are maintained and/or improved for all production sites, services and underlying infrastructure.


 * Deploy, update, configure, and maintain and improve production services, platforms, tooling, and infrastructure (Traffic infrastructure, databases & storage, MediaWiki application servers, (micro)services, network, Infrastructure Foundations, Analytics infrastructure, developer & release tooling, and miscellaneous sites & services)

Dependencies on: SRE

Goal(s)

 * Train feature developers on the use of performance metrics to detect and address regressions.
 * Create and deliver a training session on the use of synthetic metrics
 * Create and deliver a training session on the use of RUM metrics
 * Deliver high-traffic images as WebP ✅
 * Improve Navigation Timing data, by moving it from Graphite to Prometheus
 * Expand mobile testing
 * Run extended/ongoing tests of performance on mobile phones
 * Expand outreach and engagement with the wider Performance community
 * Attend W3C meeting, as a participant in the Web Performance Working Group ✅
 * Begin publishing a monthly blog post summarizing performance over the prior month
 * Figure out whether it is possible to publish Navigation Timing data sets in some appropriately anonymized form ❌
 * Test the effect of MediaWiki commits on Performance
 * Add a basic performance test as part of the Jenkins pipeline, as a non-voting member
 * Ongoing maintenance of components owned by Performance team or individuals
 * MediaWiki's data access components
 * ResourceLoader
 * WebPageTest and other synthetic testing infrastructure
 * Thumbor/thumbnail generation

Status
October 18, 2018
 * Discussed how things are on track right now for progress for the quarter.

November 14, 2018
 * Improve Navigation Timing data, by moving it from Graphite to Prometheus is now
 * Attendance at the last W3C meeting is ✅

December 12, 2018
 * Anonymized data publishing deferred to Q3 ❌
 * WebP thumbnails is ✅
 * Improve Navigation Timing data, by moving it from Graphite to Prometheus
 * Proposal for W3C is
 * Summary blog post will be sent out soon
 * Ongoing maintenance of components owned by Performance team or individuals is (always)

Outcome 2 / Output 2.1 (Performance)
Better designed systems
 * Assist in the architectural design of new services and making them operate at scale

Primary team: Performance

Goal(s)

 * Research performance perception in order to identify specific metrics that influence user behavior

Status
October 18, 2018
 * Discussed how this goal is still (was a carry-over from Q1)

December 12, 2018
 * The final milestone is an internal presentation of the research paper on December 17th. We will hear about acceptance in early 2019, and the goal is mostly ✅ at this point while we wait. Gilles is working with the Research team on this goal and will have follow-on work in 2019.

Outcome 3 / Output 3.1 (WMCS)
Users can leverage a reliable and public Infrastructure as a Service (IaaS) product ecosystem for VPS hosting.


 * Maintain existing OpenStack infrastructure and services

Primary team: WMCS

Goal(s)

 * Continue replacing Trusty with Debian Jessie/Stretch in infrastructure layer
 * Communicate Trusty deprecation timeline to Cloud VPS community ✅
 * Develop Trusty deprecation plan for Toolforge and communicate timeline to community
 * Track progress towards full removal of Trusty from Cloud VPS to encourage migration

Status
November 14, 2018
 * Updated per goal status and added links.

December 12, 2018
 * These goals are still, and we're still a bit stalled on the full plan communication to the community until it's actually done, so that the community can immediately go in and use it — we hope to be fully done by end of Q3.

Outcome 3 / Output 3.2 (WMCS)
Users can leverage a reliable and public Infrastructure as a Service (IaaS) product ecosystem for VPS hosting.


 * Replace the current network topology layer with OpenStack Neutron

Primary team: WMCS

Goal(s)

 * Migrate 50% of Cloud VPS projects to the eqiad1 region and its Neutron SDN layer ✅

Status
November 14, 2018
 * This goal is now, as of this week, we had 72 out of 170 (~42%) migrated.

December 12, 2018
 * We have surpassed this goal: ~65% of all projects have been migrated. This effort will continue in Q3, but this goal is considered ✅ for this quarter.
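
The November figure above (72 of 170 projects, ~42%) can be checked against the 50% target with a short calculation. The following is only an illustrative sketch: the function name `migration_progress` is hypothetical, and the project total of 170 is taken from the November 14 status and may have changed by December.

```python
import math


def migration_progress(migrated: int, total: int, target_fraction: float) -> dict:
    """Summarize progress toward a fractional migration target (hypothetical helper)."""
    if total <= 0:
        raise ValueError("total must be positive")
    # Smallest whole number of projects that meets or exceeds the target.
    needed = math.ceil(total * target_fraction)
    return {
        "percent_migrated": round(100 * migrated / total, 1),
        "projects_needed_for_target": needed,
        "projects_remaining": max(0, needed - migrated),
    }


# Figures from the November 14, 2018 status: 72 of 170 projects, 50% goal.
november = migration_progress(72, 170, 0.5)
print(november)
```

With these inputs the 50% target corresponds to 85 projects, so 13 more migrations were needed as of mid-November.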



===Q3 Goals ===

Outcome 1 / Output 1.1 (RelEng)
Current levels of service are maintained and/or improved for all production sites, services and underlying infrastructure.


 * Deploy, update, configure, and maintain and improve production services, platforms, tooling, and infrastructure (Traffic infrastructure, databases & storage, MediaWiki application servers, (micro)services, network, Infrastructure Foundations, Analytics infrastructure, developer & release tooling, and miscellaneous sites & services)

Dependencies on: SRE

Goal(s)

 * Automate the generation of change log notes
 * Investigate notification methods for developers with changes that are riding any given train

Status
January 2019


 * Discussed...

February 2019


 * Discussed...

March 2019


 * Discussed...

Outcome 1 / Output 1.1 (SRE / Traffic)
Current levels of service are maintained and/or improved for all production sites, services and underlying infrastructure.
 * Deploy, update, configure, and maintain and improve production services, platforms, tooling, and infrastructure (Traffic infrastructure, databases & storage, MediaWiki application servers, (micro)services, network, Infrastructure Foundations, Analytics infrastructure, developer & release tooling, and miscellaneous sites & services)

Primary teams: SRE & Traffic

Goal(s)

 * Deploy managed LetsEncrypt certs for all public use-cases:
 * wikiba.se
 * Global unified wildcard
 * Non-canonical domain redirects
 * Increase network capacity:
 * eqiad: Reconfigure Row A
 * eqsin: deploy cr2-eqsin
 * Test and deploy peering priority changes
 * https://phabricator.wikimedia.org/T204281
 * If testing turns out badly, obviously we'll choose not to deploy!
 * Ping offload:
 * https://phabricator.wikimedia.org/T190090
 * Finish design issues
 * Deploy fully to eqiad and codfw
 * Make a plan for the cache PoPs

Outcome 1 / Output 1.1 (Performance)
Current levels of service are maintained and/or improved for all production sites, services and underlying infrastructure.


 * Deploy, update, configure, and maintain and improve production services, platforms, tooling, and infrastructure (Traffic infrastructure, databases & storage, MediaWiki application servers, (micro)services, network, Infrastructure Foundations, Analytics infrastructure, developer & release tooling, and miscellaneous sites & services)

Dependencies on: SRE

Goals

 * Post quarterly metrics (carried over)
 * Develop a strategy for oversampling NavTiming data from unrepresented countries, in order to better understand performance characteristics in less connected/lower bandwidth geographies
 * Expand use of WebP thumbnails where it makes sense to do so, and actively clean up Swift.
 * Prepare and deliver presentations on both synthetic and RUM metrics, with a specific focus on how to make the data that comes from those metrics actionable.
 * Publish an initial ISP ranking, working with Comms to promote as appropriate
 * Expand performance testing on mobile devices
 * [Ongoing] Support and maintenance of MediaWiki's ResourceLoader and associated components
 * [Ongoing] Support and maintenance of MediaWiki's object caching and data access components
 * [Ongoing] Support and maintenance of Thumbor/thumbnail infrastructure
 * [Ongoing] Support and maintenance of WebPageTest and synthetic testing infrastructure

Status

January 9, 2019
 * Discussed and updated above.

Outcome 2 / Output 2.1 (Performance)
Better designed systems
 * Assist in the architectural design of new services and making them operate at scale

Primary team: Performance

Goal(s)

 * Research performance perception in order to identify specific metrics that influence user behavior
 * Testing/operational support for new session store
 * Testing/operational support for ATS migration

Status

January 9, 2019
 * Most of this is in progress, but is reliant on other teams.

Outcome / Output (Analytics)
We have scalable, reliable and secure systems for data transport.
 * Analytics stack maintains current level of service.
 * Dependencies on SRE teams

Goal(s)

 * Replace dbstore1002 before April. Move people away from dbstore1002 to the new set of hosts, and deprecate it before mid-Q3 (hard deadline, Ubuntu Trusty EOL)
 * STRETCH GOAL: Investigate if it is feasible to deprecate the research user account in favor of a multi-account solution

Status
January 2019
 * Discussed...

February 2019
 * Discussed...

March 2019
 * Discussed...

Outcome 1 / Output 1.1 (SRE)
Current levels of service are maintained and/or improved for all production sites, services and underlying infrastructure.


 * Deploy, update, configure, and maintain and improve production services, platforms, tooling, and infrastructure (Traffic infrastructure, databases & storage, MediaWiki application servers, (micro)services, network, Infrastructure Foundations, Analytics infrastructure, developer & release tooling, and miscellaneous sites & services)

Primary team(s): SRE, Dependencies on: Cloud Services, Analytics

Base system/distribution update

 * Remove remaining Ubuntu deployments from the production cluster
 * Adjust our operating system base layer to work on the forthcoming Debian 10/buster release
 * Install or upgrade 5 systems to buster
 * Draft a policy for operating systems lifecycle and subsequent EOL dates

Outcome 3 / Output 3.1 (WMCS)
Users can leverage a reliable and public Infrastructure as a Service (IaaS) product ecosystem for VPS hosting.


 * Maintain existing OpenStack infrastructure and services

Primary team: WMCS

Goal(s)

 * Replace Trusty with Debian Jessie/Stretch in Cloud Services infrastructure layer
 * Remove all Ubuntu-based instances from all Cloud VPS projects
 * Evaluate Ceph as a storage service component by building a proof of concept virtualized cluster

Status
January 2019


 * Discussed...

February 2019


 * Discussed...

March 2019


 * Discussed...

Outcome 3 / Output 3.2 (WMCS)
Users can leverage a reliable and public Infrastructure as a Service (IaaS) product ecosystem for VPS hosting.


 * Replace the current network topology layer with OpenStack Neutron

Primary team: WMCS

Goal(s)

 * Migrate 100% of Cloud VPS projects to the eqiad1 region and its Neutron SDN layer
 * Rebuild "labtest" staging environment as "cloud-dev" staging environment
 * (stretch goal) Upgrade OpenStack deployment to Newton or newer version on Debian Stretch hosts

Status
January 2019


 * Discussed...

February 2019


 * Discussed...

March 2019


 * Discussed...

Outcome 4 / Output 4.1 (WMCS)
Members of the Wikimedia movement are able to develop and deploy technical solutions with a reasonable investment of time and resources on the Wikimedia Cloud Services Platform as a Service (PaaS) product.


 * Maintain existing Grid Engine and Kubernetes web services infrastructure and ecosystems.

Primary team: WMCS

Goal(s)

 * Build Debian Stretch grid engine in Toolforge and assist community in migration
 * Upgrade Toolforge Kubernetes cluster to a well supported version and plan future upgrade cycles

Status
January 2019


 * Discussed...

February 2019


 * Discussed...

March 2019


 * Discussed...

<div style="padding:1.125em; display:inline-block; border:1px solid #a2a9b1; vertical-align:top; border-radius:2px; position:relative; box-shadow:0px 2px 2px rgba(0,0,0,0.1);">

=<div class="boxtitle" style="font-size:1.35em; padding-bottom:0.5625em; font-weight:bold; text-align:left; border-bottom: 1px solid #c9c9c9">Q4 Goals =

Outcome X / Output X
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
 * Nullam interdum, elit in malesuada aliquam, libero lorem auctor lacus, eu mattis lacus velit vitae mauris.

Dependencies on: ___________

Goal(s)

 * Ut eget sodales odio. Maecenas a varius leo.

Status
April 2019
 * Discussed...

May 2019
 * Discussed...

June 2019
 * Discussed...