Wikimedia Technology/Annual Plans/FY2019/TEC1: Reliability, Performance, and Maintenance/Goals

=Program Goals and Status for FY18/19=

TEC1: Reliability, Performance, and Maintenance
 * Goal Owners: Mark Bergsma; Ian Marlier; Nuria Ruiz; Bryan Davis
 * Program Goals for FY18/19: We will maintain the availability of Wikimedia’s sites and services for our global audiences and ensure they’re running reliably, securely, and with high performance. We will do this while modernizing our infrastructure and improving current levels of service when it comes to testing, deployments, and maintenance of software and hardware.
 * Annual Plan: TEC1: Reliability, Performance, and Maintenance
 * Primary Goal is Knowledge as a Service: Evolve our systems and structures
 * Tech Goal: Sustaining



 = Q1 Goals =

Outcome / Output (Analytics)
We have scalable, reliable and secure systems for data transport
 * Analytics stack maintains current level of service. OS Upgrades.

Primary teams: Analytics

Goal(s)

 * Continue upgrading to Debian Stretch:
 * AQS ✅
 * thorium ,
 * Archiva
 * bhorium (where piwik resides)
 * Refresh of:
 * hadoop master nodes
 * analytics1003
 * STRETCH GOAL: Add prometheus metrics for varnishkafka instances running on caching hosts

Status
July 2018
 * one goal

August 22, 2018
 * one goal

September 2018
 * Discussed...

Outcome 1 / Output 1.1 (SRE)
Current levels of service are maintained and/or improved for all production sites, services and underlying infrastructure.
 * Deploy, update, configure, and maintain and improve production services, platforms, tooling, and infrastructure (Traffic infrastructure, databases & storage, MediaWiki application servers, (micro)services, network, Infrastructure Foundations, Analytics infrastructure, developer & release tooling, and miscellaneous sites & services)

Dependencies on: Core Platform (MediaWiki, RESTbase), Performance, Release Engineering, Parsing (Parsoid), Analytics (EventBus),Community Liaisons; Primary team: SRE

Goal(s)
Perform a datacenter switchover
 * Successfully switch backend traffic (MediaWiki, Swift, RESTBase, and Parsoid) to be served from codfw with no downtime and reduced read-only time
 * Serve the site from codfw for at least 3 weeks
 * Refactor the switchdc script into a more re-usable automation library and update it to the newer switchover requirements ✅

Status
July 2018
 * Discussed...

August 14, 2018
 * this is ongoing and looking good, had some minor delays but it's scheduled for Sep for first switchover and back in October.

September 11, 2018
 * The switch to codfw is happening this week. The switchdc refactoring is complete and has been published as a generically re-usable (FOSS) automation library, as [and is published as spicerack (FOSS) spicerack].

Outcome 1 / Output 1.1 (SRE / Traffic)
Current levels of service are maintained and/or improved for all production sites, services and underlying infrastructure.
 * Deploy, update, configure, and maintain and improve production services, platforms, tooling, and infrastructure (Traffic infrastructure, databases & storage, MediaWiki application servers, (micro)services, network, Infrastructure Foundations, Analytics infrastructure, developer & release tooling, and miscellaneous sites & services)

Primary teams: SRE & Traffic

Goal(s)

 * Deploy an ATS backend cache test cluster in core DCs: ✅
 * 2x 4 node clusters
 * Puppetization
 * Application-layer routing
 * Deploy a scalable service for ACME (LetsEncrypt) certificate management:
 * Features:
 * Core DC redundancy
 * Wildcards + SANs
 * Multiple client hosts for each cert
 * Coding + Puppetization
 * Deploy in both DCs
 * Live use for one prod cert
 * Increase network capacity:
 * eqiad: 2 rows with 3*10G racks
 * codfw: 2 rows with 3*10G racks
 * ulsfo: replace routers
 * eqdfw: replace router

Status
July 2018
 * Discussed...

August 14, 2018
 * The ATS testing and creation of packages is going well and some is already in production. This is doing the basic needs so far but nearly done.
 * ACME service is also on track
 * Increasing network capacity goal is moving much slower than expected, due to hardware installation (network switches) on the switch stacks issues with the vendor. This goal is at risk for this quarter due to these hardware and testing issues with the new/old switches. The replacement routers will also be at risk for this quarter.

September 2018
 * ✅ This goal is fully complete; an ATS test cluster has been setup in both core DCs.
 * ACME software has been written, deployment and use is expected by end of the quarter.
 * Increasing network capacity goal has hit a roadblock, due to network switch stack topology incompatibility issues with the vendor, causing network instability in production. This goal is at risk for this quarter, as we can neither take any risks nor move swifty on this very essential and critical infrastructure. The replacement of the routers is expected to complete before EOQ.

Outcome 1 / Output 1.1 (Performance)
Current levels of service are maintained and/or improved for all production sites, services and underlying infrastructure.
 * Deploy, update, configure, and maintain and improve production services, platforms, tooling, and infrastructure (Traffic infrastructure, databases & storage, MediaWiki application servers, (micro)services, network, Infrastructure Foundations, Analytics infrastructure, developer & release tooling, and miscellaneous sites & services)

Primary team: Performance

Goal(s)

 * Improve synthetic monitoring
 * Monitor more WMF sites, including some in each deploy group
 * Configure alerts such that they're directed to the teams that are able to address them
 * Improve Navigation Timing data
 * Update dashboards to use newer navtiming2 keys
 * Move navtiming data from graphite to prometheus
 * Remove dependency on jQuery from Mediawiki's base Javascript module ✅
 * Support and develop the Mediawiki ResourceLoader component
 * Dependency: Core Platform
 * Support and develop Mediawiki's data access components
 * Dependency: Core Platform

Status
July 2018

August 14, 2018
 * Synthentic monitoring and timing data is in progress but slightly delayed due to summer vacations. Everything else is in progress or done.

September 2018
 * Discussed...

Outcome 2 / Output 2.1 (Performance)
Better designed systems
 * Assist in the architectural design of new services and making them operate at scale

Primary team: Performance

Goal(s)

 * Research performance perception in order to identify specific metrics that influence user behavior

Status
July 2018
 * Discussed...

August 14, 2018

September 2018
 * Discussed...

Outcome 3 / Output 3.1 (WMCS)
Users can leverage a reliable and public Infrastructure as a Service (IaaS) product ecosystem for VPS hosting.
 * Maintain existing OpenStack infrastructure and services

Primary team: WMCS

Goal(s)

 * Develop timeline for Ubuntu Trusty deprecation ✅
 * Communicate deprecation timeline to Cloud VPS community
 * Continue replacing Trusty with Debian Jessie/Stretch in infrastructure layer

Status
July 2018
 * 

August 10, 2018
 *  Discussed: that experiments have been completed but showed some issues ✅, will need to build parallel grids, new puppet code, etc 

September 2018
 * Discussed...

Outcome 3 / Output 3.2 (WMCS)
Users can leverage a reliable and public Infrastructure as a Service (IaaS) product ecosystem for VPS hosting.
 * Replace the current network topology layer with OpenStack Neutron

Primary team: WMCS

Goal(s)

 * Migrate at least one Cloud VPS project to the eqiad1 region and its Neutron SDN layer

Status
July 2018
 * 

August 10, 2018
 *  discussed working on getting eqiad to play well with MAIN (shared services)

September 2018
 * Discussed...



=Q2 Goals =

Outcome X / Output X
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
 * Nullam interdum, elit in malesuada aliquam, libero lorem auctor lacus, eu mattis lacus velit vitae mauris.

Dependencies on: ___________

Goal(s)

 * Ut eget sodales odio. Maecenas a varius leo.

Status
October 2018
 * Discussed...

November 2018
 * Discussed...

December 2018
 * Discussed...