Wikimedia Technology/Annual Plans/FY2019/TEC1: Reliability, Performance, and Maintenance/Goals
Program Goals and Status for FY18/19
- Goal Owners: Mark Bergsma; Kate Chapman; Nuria Ruiz; Bryan Davis
- Program Goals for FY18/19: We will maintain the availability of Wikimedia’s sites and services for our global audiences and ensure they’re running reliably, securely, and with high performance. We will do this while modernizing our infrastructure and improving current levels of service when it comes to testing, deployments, and maintenance of software and hardware.
- Annual Plan: TEC1: Reliability, Performance, and Maintenance
- Primary Goal is Knowledge as a Service: Evolve our systems and structures
- Tech Goal: Sustaining
Outcome / Output (Analytics)
We have scalable, reliable and secure systems for data transport and data processing.
- Analytics stack maintains current level of service; OS upgrades (T192642)
Goal(s)
- Continue upgrading to Debian Stretch:
- AQS
Done
- thorium
Done
- Archiva
Done
- bhorium (where piwik resides)
To do
- Refresh of:
- hadoop master nodes
Done
- analytics1003
In progress
Status
Note: July 2018
- one goal
In progress
Note: August 22, 2018
- one goal
Partially done
Note: September 18, 2018
- Remaining hosts to be completed before end of quarter
Partially done
Outcome 1 / Output 1.1 (SRE)
Current levels of service are maintained and/or improved for all production sites, services and underlying infrastructure.
- Deploy, update, configure, and maintain and improve production services, platforms, tooling, and infrastructure (Traffic infrastructure, databases & storage, MediaWiki application servers, (micro)services, network, Infrastructure Foundations, Analytics infrastructure, developer & release tooling, and miscellaneous sites & services)
Dependencies on: Core Platform (MediaWiki, RESTBase), Performance, Release Engineering, Parsing (Parsoid), Analytics (EventBus), Community Liaisons; Primary team: SRE
Goal(s)
Perform a datacenter switchover
- Successfully switch backend traffic (MediaWiki, Swift, RESTBase, and Parsoid) to be served from codfw with no downtime and reduced read-only time
Done
- Serve the site from codfw for at least 3 weeks
In progress
- Refactor the switchdc script into a more re-usable automation library and update it to the newer switchover requirements
Done
Status
Note: July 2018
- Discussed...
Note: August 14, 2018
In progress This is ongoing and looking good; we had some minor delays, but the first switchover is scheduled for September, with the switch back in October.
Note: September 11, 2018
In progress The switch to codfw is happening this week. The switchdc refactoring is complete and has been published as spicerack, a generically re-usable FOSS automation library.
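As an illustration of what the refactoring aims at, here is a minimal sketch of how an automation library can model a switchover as an ordered sequence of idempotent steps; all class and method names below are hypothetical, not the actual spicerack API:
<syntaxhighlight lang="python">
# Hypothetical sketch of a switchover task built on a re-usable automation
# library; names are illustrative, not the real spicerack interfaces.
import logging

logger = logging.getLogger("switchover")


class SwitchDC:
    """Drive a MediaWiki backend switchover between two datacenters."""

    def __init__(self, dc_from: str, dc_to: str):
        self.dc_from = dc_from
        self.dc_to = dc_to

    def run(self) -> None:
        # Each step is idempotent, so a failed run can be resumed safely.
        for step in (self.set_readonly, self.switch_services, self.set_readwrite):
            logger.info("running %s (%s -> %s)", step.__name__, self.dc_from, self.dc_to)
            step()

    def set_readonly(self) -> None:
        ...  # put MediaWiki in read-only mode in the source DC

    def switch_services(self) -> None:
        ...  # repoint MediaWiki, Swift, RESTBase and Parsoid to the target DC

    def set_readwrite(self) -> None:
        ...  # re-enable writes in the target DC


SwitchDC("eqiad", "codfw").run()
</syntaxhighlight>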
Outcome 1 / Output 1.1 (SRE / Traffic)
Current levels of service are maintained and/or improved for all production sites, services and underlying infrastructure.
- Deploy, update, configure, and maintain and improve production services, platforms, tooling, and infrastructure (Traffic infrastructure, databases & storage, MediaWiki application servers, (micro)services, network, Infrastructure Foundations, Analytics infrastructure, developer & release tooling, and miscellaneous sites & services)
Primary teams: SRE & Traffic
Goal(s)
- Deploy an ATS backend cache test cluster in core DCs:
Done
- 2x 4 node clusters
- Puppetization
- Application-layer routing
- Deploy a scalable service for ACME (LetsEncrypt) certificate management (a monitoring sketch follows this goal list):
In progress
- Features:
- Core DC redundancy
- Wildcards + SANs
- Multiple client hosts for each cert
- Coding + Puppetization
- Deploy in both DCs
- Live use for one prod cert
- Increase network capacity:
Partially done
- eqiad: 2 rows with 3*10G racks
- codfw: 2 rows with 3*10G racks
- ulsfo: replace routers
- eqdfw: replace router
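Related to the ACME goal above, a minimal monitoring sketch using only the Python standard library to check how much validity a served certificate has left; the host list is illustrative, and this is not the service's own code:
<syntaxhighlight lang="python">
import socket
import ssl
import time


def days_until_expiry(hostname: str, port: int = 443) -> float:
    """Return days until the certificate served by hostname:port expires."""
    ctx = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=hostname) as tls:
            cert = tls.getpeercert()
    # notAfter looks like "Jun  1 12:00:00 2025 GMT"
    not_after = ssl.cert_time_to_seconds(cert["notAfter"])
    return (not_after - time.time()) / 86400


for host in ("en.wikipedia.org", "wikiba.se"):
    print(f"{host}: {days_until_expiry(host):.0f} days of validity left")
</syntaxhighlight>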
Status
Note: July 2018
- Discussed...
Note: August 14, 2018
Partially done ATS testing and package creation are going well, and some of this work is already in production. It covers the basic needs so far and is nearly done.
In progress ACME service is also on track
In progress The network capacity goal is moving much slower than expected, due to switch-stack hardware installation issues with the vendor. This goal is at risk for this quarter because of these hardware and testing issues with the new/old switches. The replacement routers will also be at risk this quarter.
Partially done September 2018
Done This goal is fully complete; an ATS test cluster has been set up in both core DCs.
In progress ACME software has been written; deployment and use are expected by end of the quarter.
In progress The network capacity goal has hit a roadblock: network switch stack topology incompatibility issues with the vendor are causing network instability in production. This goal is at risk for this quarter, as we can neither take any risks nor move swiftly on this essential and critical infrastructure. The replacement of the routers is expected to complete before EOQ.
Outcome 1 / Output 1.1 (Performance)
Current levels of service are maintained and/or improved for all production sites, services and underlying infrastructure.
- Deploy, update, configure, and maintain and improve production services, platforms, tooling, and infrastructure (Traffic infrastructure, databases & storage, MediaWiki application servers, (micro)services, network, Infrastructure Foundations, Analytics infrastructure, developer & release tooling, and miscellaneous sites & services)
Primary team: Performance
Goal(s)
- Improve synthetic monitoring
In progress
- Monitor more WMF sites, including some in each deploy group
- Configure alerts such that they're directed to the teams that are able to address them
- Improve Navigation Timing data (a Prometheus exporter sketch follows this list)
In progress
- Update dashboards to use newer navtiming2 keys
- Move navtiming data from Graphite to Prometheus
- Remove dependency on jQuery from MediaWiki's base JavaScript module
Done
- Support and develop the MediaWiki ResourceLoader component
In progress
- Dependency: Core Platform
- Support and develop MediaWiki's data access components
In progress
- Dependency: Core Platform
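To illustrate the Graphite-to-Prometheus move, here is a sketch of a tiny exporter built on the prometheus_client library; the metric name, bucket boundaries, and beacon shape are assumptions for illustration only:
<syntaxhighlight lang="python">
from prometheus_client import Histogram, start_http_server

# Bucket boundaries in milliseconds, chosen only for illustration.
NAV_RESPONSE_START = Histogram(
    "navtiming_responsestart_ms",
    "Time from navigation start to responseStart for sampled pageviews",
    buckets=(50, 100, 250, 500, 1000, 2500, 5000),
)


def handle_beacon(event: dict) -> None:
    """Record one sampled NavigationTiming beacon."""
    NAV_RESPONSE_START.observe(event["responseStart"])


if __name__ == "__main__":
    start_http_server(9100)  # Prometheus scrapes /metrics on this port
    handle_beacon({"responseStart": 180})  # stand-in for a real beacon
</syntaxhighlight>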
Status
Note: July 2018
In progress
Note: August 14, 2018
In progress Synthetic monitoring and timing data are in progress but slightly delayed due to summer vacations. Everything else is in progress or done.
Note: September 18, 2018
In progress Synthetic monitoring and timing data are in progress and on track. The other goals are ongoing maintenance work without a defined end state.
Outcome 2 / Output 2.1 (Performance)
Better designed systems
- Assist in the architectural design of new services and making them operate at scale
Primary team: Performance
Goal(s)
- Research performance perception in order to identify specific metrics that influence user behavior
Status
Note: July 2018
- Discussed...
Note: August 14, 2018
In progress
Note: September 18, 2018
In progress Data collection is in progress and continues to go well. Data analysis will continue into Q2.
Outcome 3 / Output 3.1 (WMCS)
Users can leverage a reliable and public Infrastructure as a Service (IaaS) product ecosystem for VPS hosting.
- Maintain existing OpenStack infrastructure and services
Primary team: WMCS
Goal(s)
- Develop timeline for Ubuntu Trusty deprecation
Done
- Communicate deprecation timeline to Cloud VPS community
To do
- Continue replacing Trusty with Debian Jessie/Stretch in infrastructure layer
In progress
Status
Note: July 2018
In progress
Note: August 10, 2018
In progress Discussed that experiments have been completed but showed some issues.
Done, will need to build parallel grids, new Puppet code, etc.
In progress
Note: September 12, 2018
In progress Brooke has made some progress on Puppet changes that will allow us to build a second grid engine deployment with Debian Stretch. This outcome will carry over into Q2 and progress should accelerate as the parallel Neutron migration project moves from development to implementation.
Outcome 3 / Output 3.2 (WMCS)
Users can leverage a reliable and public Infrastructure as a Service (IaaS) product ecosystem for VPS hosting.
- Replace the current network topology layer with OpenStack Neutron
Primary team: WMCS
Goal(s)
- Migrate at least one Cloud VPS project to the eqiad1 region and its Neutron SDN layer
Done
Status
Note: July 2018
In progress
Note: August 10, 2018
In progress Discussed working on getting eqiad to play well with MAIN (shared services).
Note: September 12, 2018
In progress Arturo found a way to transparently route traffic between the legacy Nova Network deployment and the new Neutron deployment. This should make migrating projects to the new deployment significantly easier and more transparent for our end users.
In progress Progress is currently stalled due to unexpected hardware compatibility issues with Debian Jessie on new cloudvirt servers. We are getting a lot of help from the DC Ops team and Moritz to find solutions for these issues, but until we have a working combination of hardware and software we are holding off on procuring new servers. This in turn will slow the rate at which we can move projects to the new deployment. This has a larger impact on our Q2 continuation of this program than on meeting the current Q1 goal of moving at least one project.
Outcome / Output (Analytics)
We have scalable, reliable and secure systems for data transport and data processing.
- Analytics stack maintains current level of service.
- Complete the OS upgrade to Debian Stretch and prepare the replacement of the Analytics store database.
Goal(s)
- Continue upgrading to Debian Stretch:
- bhorium (where piwik resides)
Done
- Order hardware and set up basic Puppet configuration for dbstore1002's replacement (multi-instance database hosts).
Partially done
- STRETCH GOAL: Set up MySQL data replication on dbstore1002's replacement.
- STRETCH GOAL: Add prometheus metrics for varnishkafka instances running on caching hosts T196066
Status
Note: October 19, 2018
- The upgrade-to-Stretch goal is
Done for the piwik machine; hardware orders for dbstore are in place
Note: November 14, 2018
- Prometheus work is now
In progress
To do December 2018
- Discussed...
Outcome 1 / Output 1.1 (SRE)
Current levels of service are maintained and/or improved for all production sites, services and underlying infrastructure.
- Deploy, update, configure, and maintain and improve production services, platforms, tooling, and infrastructure (Traffic infrastructure, databases & storage, MediaWiki application servers, (micro)services, network, Infrastructure Foundations, Analytics infrastructure, developer & release tooling, and miscellaneous sites & services)
Primary team(s): SRE, Dependencies on: Cloud Services, Search, Core Platform
Goal(s)
- Refresh hardware and perform necessary maintenance
- Refresh expiring leased hardware with replacements:
- Refresh of aging (purchased) hardware
- Refresh and expand Swift cluster in eqiad and codfw
- Procurement and OS install of db1061-db1073 (refresh)
- Switch etcd in eqiad to new servers conf100[4-6]
Status
Note: October 2018
- This is now
In progress as hardware is being ordered
Note: November 14, 2018
- This is still
In progress, as hardware has been ordered and is being received and installed. The codfw Parser Cache hosts work is
Partially done and the etcd migration is
Done
Note: December 12, 2018
- Leased hardware will be sent back by next week, and the refresh will be
Done in the next two weeks.
Outcome 1 / Output 1.1 (SRE/Traffic)
Current levels of service are maintained and/or improved for all production sites, services and underlying infrastructure.
- Deploy, update, configure, and maintain and improve production services, platforms, tooling, and infrastructure (Traffic infrastructure, databases & storage, MediaWiki application servers, (micro)services, network, Infrastructure Foundations, Analytics infrastructure, developer & release tooling, and miscellaneous sites & services)
Primary team(s): SRE/Traffic
Goal(s)
- ATS production-ready as a backend cache layer
Partially done
- Purging
Done
- Logging
Done
- Monitoring
Done
- Alerting
Done
- Multi-DC Routing
Done
- Backend-side request-mangling
Done
- All the above prepares us for potential ATS live deployment as cache_upload backends in FQ3
- Audit all Varnish-fronted services for lack of TLS, ping owners ahead of needs in FQ3/4 and beyond
Partially done
- Migrate most standard public TLS certificates to CertCentral issuance
Done
- Replaces previous minimal/legacy LetsEncrypt automation
- Work out bugs / operational issues that arise as we scale up CertCentral usage
- Public certs explicitly out of scope: the big unified wildcard, frack, labs
- Increase Network Capacity
Partially done
- Follow-up and Follow-on to same goal from FQ1
- eqiad: Finish new switches and migration to supported topologies - phab:T187962 - phab:T183585
- codfw: Finish new switches and migration to supported topologies - phab:T197147
- Replace cr1-eqord in Chicago
- Add cr2-eqsin in Singapore
Status
To do October 2018
- Discussed...
Note: November 14, 2018
- This goal is
Partially done as purging for the ATS goal has been implemented and the cr1-eqord Chicago router replacement has completed.
- We have a possible
Stalled issue as the Network Capacity goal is at risk for FQ2 due to the holidays and critical infrastructure freezes.
Note: December 12, 2018
- ATS goal will be
Done shortly and the certificate work will also be
Done in the next week. Increasing network capacity will be
Done in Q3, as fundraising is still ongoing; the cabling will be done as soon as the cables are received
Partially done
Note: December 13, 2018
- The certificates goal is
Done
Outcome 1 / Output 1.1 (RelEng)
Current levels of service are maintained and/or improved for all production sites, services and underlying infrastructure.
- Deploy, update, configure, and maintain and improve production services, platforms, tooling, and infrastructure (Traffic infrastructure, databases & storage, MediaWiki application servers, (micro)services, network, Infrastructure Foundations, Analytics infrastructure, developer & release tooling, and miscellaneous sites & services)
Dependencies on: SRE
Goal(s)
- Determine the procedure and requirements for an automated MediaWiki branch cut.
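For flavor, a bare-bones sketch of what a branch cut amounts to in plain git; the repository path, remote, and branch name are placeholders, not the actual RelEng tooling:
<syntaxhighlight lang="python">
import subprocess


def cut_branch(repo: str, new_branch: str) -> None:
    """Create a wmf deployment branch from the latest master and push it."""
    git = ["git", "-C", repo]
    subprocess.run(git + ["fetch", "origin"], check=True)
    subprocess.run(git + ["checkout", "-b", new_branch, "origin/master"], check=True)
    subprocess.run(git + ["push", "origin", new_branch], check=True)


cut_branch("mediawiki-core", "wmf/1.32.0-wmf.26")  # example version only
</syntaxhighlight>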
Status
Note: October 2, 2018
- This work is still
To do
Note: November 7, 2018
- Scoping work at https://phabricator.wikimedia.org/T156445
Note: December 6, 2018
- This
In progress goal will be nearly complete by finishing up https://phabricator.wikimedia.org/T208528 and https://phabricator.wikimedia.org/T208529 in the next week or so.
Outcome 1 / Output 1.1 (Performance)
Current levels of service are maintained and/or improved for all production sites, services and underlying infrastructure.
- Deploy, update, configure, and maintain and improve production services, platforms, tooling, and infrastructure (Traffic infrastructure, databases & storage, MediaWiki application servers, (micro)services, network, Infrastructure Foundations, Analytics infrastructure, developer & release tooling, and miscellaneous sites & services)
Dependencies on: SRE
Goal(s)
- Train feature developers on the use of performance metrics to detect and address regressions.
- Create and deliver a training session on the use of synthetic metrics
Partially done
- Create and deliver a training session on the use of RUM metrics
Partially done
- Deliver high-traffic images as WebP
Done
- Improve Navigation Timing data, by moving it from Graphite to Prometheus
Not done
- Expand mobile testing
- Run extended/ongoing tests of performance on mobile phones
Done
- Expand outreach and engagement with the wider Performance community
- Attend W3C meeting, as a participant in the Web Performance Working Group
Done
- Begin publishing a monthly blog post summarizing performance over the prior month
Not done
- Figure out whether it is possible to publish Navigation Timing data sets in some appropriately anonymized form
Not done
- Test the effect of Mediawiki commits on Performance
- Add a basic performance test as part of the Jenkins pipeline, as a non-voting member
Done
- Ongoing maintenance of components owned by Performance team or individuals
Done
- MediaWiki's data access components
- ResourceLoader
- WebPageTest and other synthetic testing infrastructure
- Thumbor/thumbnail generation
Status
Note: October 18, 2018
- Discussed how things are on track for progress this quarter.
Note: November 14, 2018
- Improve Navigation Timing data, by moving it from Graphite to Prometheus is now
In progress
- Attendance at the last W3C meeting is
Done
Note: December 12, 2018
- Anonymized data publishing deferred to Q3
Stalled
- WebP thumbnails work is
Done
- Improve Navigation Timing data, by moving it from Graphite to Prometheus
Partially done
- Proposal for W3C is
In progress
- The summary blog post will be sent out soon
Partially done
- Ongoing maintenance of components owned by Performance team or individuals is (always)
In progress
Outcome 2 / Output 2.1 (Performance)
Better designed systems
- Assist in the architectural design of new services and making them operate at scale
Primary team: Performance
Goal(s)
- Research performance perception in order to identify specific metrics that influence user behavior
Done (continues in Q3)
Status
Note: October 18, 2018
- Discussed how this goal is still
In progress (was a carry-over from Q1)
Note: December 12, 2018
In progress The final milestone is the internal presentation of the research paper on December 17th; we will hear about acceptance in early 2019, and the goal is mostly
Done at this point while we wait. Gilles is working with the Research team on this goal and will have follow-on work in 2019.
Outcome 3 / Output 3.1 (WMCS)
Users can leverage a reliable and public Infrastructure as a Service (IaaS) product ecosystem for VPS hosting.
- Maintain existing OpenStack infrastructure and services
Primary team: WMCS
Goal(s)
- Continue replacing Trusty with Debian Jessie/Stretch in infrastructure layer
In progress
- Communicate Trusty deprecation timeline to Cloud VPS community
Done
- Develop Trusty deprecation plan for Toolforge and communicate timeline to community
In progress
- Track progress towards full removal of Trusty from Cloud VPS to encourage migration
In progress
Status
Note: November 14, 2018
- Updated per goal status and added links.
Note: December 12, 2018
- These goals are still
In progress. We are holding off on communicating the full plan to the community until it is actually done, so that the community can immediately go in and use it; we hope to be fully done by end of Q3.
Outcome 3 / Output 3.2 (WMCS)
Users can leverage a reliable and public Infrastructure as a Service (IaaS) product ecosystem for VPS hosting.
- Replace the current network topology layer with OpenStack Neutron
Primary team: WMCS
Goal(s)
- Migrate 50% of Cloud VPS projects to the eqiad1 region and its Neutron SDN layer
Done
Status
Note: November 14, 2018
- This goal is now
In progress; as of this week, we had 72 of 170 projects (~42%) migrated.
Note: December 12, 2018
- We have surpassed this goal, with ~65% of all projects migrated. This effort will continue in Q3, but this goal is considered
Done for this quarter.
Outcome 1 / Output 1.1 (RelEng)
Current levels of service are maintained and/or improved for all production sites, services and underlying infrastructure.
- Deploy, update, configure, and maintain and improve production services, platforms, tooling, and infrastructure (Traffic infrastructure, databases & storage, MediaWiki application servers, (micro)services, network, Infrastructure Foundations, Analytics infrastructure, developer & release tooling, and miscellaneous sites & services)
Dependencies on: SRE
Goal(s)
- Automate the generation of change log notes
- Investigate notification methods for developers with changes that are riding any given train
Status
Note: March 13, 2019
- Both goals are
Done
Outcome 1 / Output 1.1 (SRE / Traffic)
Current levels of service are maintained and/or improved for all production sites, services and underlying infrastructure.
- Deploy, update, configure, and maintain and improve production services, platforms, tooling, and infrastructure (Traffic infrastructure, databases & storage, MediaWiki application servers, (micro)services, network, Infrastructure Foundations, Analytics infrastructure, developer & release tooling, and miscellaneous sites & services)
Primary teams: SRE & Traffic
Goal(s)
- Deploy managed LetsEncrypt certs for all public use-cases:
- wikiba.se
- Global unified wildcard
- Non-canonical domain redirects
- https://phabricator.wikimedia.org/T213705
- Increase network capacity:
- eqiad: Reconfigure Row A
- eqsin: deploy cr2-eqsin
- https://phabricator.wikimedia.org/T213122
- Test and deploy equal prioritization of peering and transit
- Establish metrics to monitor: performance impact and transit bandwidth
- Trial the change and gather new metrics
- Compare and decide whether (and/or where) to keep the new priorities based on perf and cost analysis
- https://phabricator.wikimedia.org/T204281
- Implement Ping Offload service in core DCs
- Finish design issues
- Deploy offloading service for public ICMP ping traffic to primary service IPs in eqiad and codfw
- Make a plan for the cache/network PoPs
- https://phabricator.wikimedia.org/T190090
Status
Note: January 10, 2019
- Discussed that as we've just gotten back from our vacations, this work is ramping up and is
In progress
Note: February 19, 2019
- Most of these are still
In progress: certcentral renaming is nearing completion, cr2-eqsin is on track to deploy, the peering priority change has been done for esams, and ping offload is still working through final design issues
Note: March 27, 2019
- certcentral renaming is
Partially done and will finish up in Q4
- network capacity configuring and deploying is
Done
- prioritization of peering and transit is
Done
- Ping Offload service in core DCs is
Partially done and expected to be fully completed by end of March or first week of April.
Note: April 8, 2019
- All unfinished bits were finished up last week!
Done
Outcome 1 / Output 1.1 (Performance)
Current levels of service are maintained and/or improved for all production sites, services and underlying infrastructure.
- Deploy, update, configure, and maintain and improve production services, platforms, tooling, and infrastructure (Traffic infrastructure, databases & storage, MediaWiki application servers, (micro)services, network, Infrastructure Foundations, Analytics infrastructure, developer & release tooling, and miscellaneous sites & services)
Dependencies on: SRE
Goals
- Post quarterly metrics (carried over)
- Develop a strategy for oversampling NavTiming data from underrepresented countries, in order to better understand performance characteristics in less connected/lower bandwidth geographies
In progress
- Expand use of WebP thumbnails where it makes sense to do so, and actively clean up Swift (a conversion sketch follows this list).
In progress
- Prepare and deliver presentations on both synthetic and RUM metrics, with a specific focus on how to make the data that comes from those metrics actionable.
In progress
- Publish an initial ISP ranking, working with Comms to promote as appropriate
In progress
- Expand performance testing on mobile devices
In progress
- [Ongoing] Support and maintenance of MediaWiki's ResourceLoader and associated components
- [Ongoing] Support and maintenance of MediaWiki's object caching and data access components
- [Ongoing] Support and maintenance of Thumbor/thumbnail infrastructure
- [Ongoing] Support and maintenance of WebPageTest and synthetic testing infrastructure
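As a sketch of the kind of conversion the WebP goal involves, using the Pillow library; the file names and quality setting are illustrative, not the production Thumbor configuration:
<syntaxhighlight lang="python">
from PIL import Image


def to_webp(src: str, dest: str, quality: int = 80) -> None:
    """Re-encode an existing JPEG/PNG thumbnail as WebP."""
    with Image.open(src) as im:
        im.save(dest, "WEBP", quality=quality)


to_webp("thumb.jpg", "thumb.webp")  # example file names
</syntaxhighlight>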
Status
Note: January 9, 2019
- Discussed and updated above.
Note: February 13, 2019
- Discussed that everything is still in progress; we're working with Legal and Commons on the ISP ranking
Note: March 13, 2019
- Discussed that everything is still
In progress; ISP ranking service is
Partially done but waiting on word from Legal before publishing.
Outcome 2 / Output 2.1 (Performance)
Better designed systems
- Assist in the architectural design of new services and making them operate at scale
Primary team: Performance
Goal(s)
- Research performance perception in order to identify specific metrics that influence user behavior
In progress
- Testing/operational support for new session store
In progress
- Testing/operational support for ATS migration
To do
Status
Note: January 9, 2019
- Most of this is in progress, but is reliant on other teams.
Note: February 13, 2019
- Discussed that the session store work is now
In progress
Note: March 13, 2019
- We are still in a support role for this, but still
In progress
Outcome / Output (Analytics)
We have scalable, reliable and secure systems for data transport.
- Analytics stack maintains current level of service.
- Dependencies on: SRE teams
Goal(s)
- Replace dbstore1002 before April. Move people away from dbstore1002 to the new set of hosts, and deprecate it before mid-Q3 (hard deadline, Ubuntu Trusty EOL) T210478
Done
- STRETCH GOAL: Investigate if it is feasible to deprecate the research user account in favor of a multi-account solution
Status
Note: February 14, 2019
- Dbstore migration is on track and should be finished by mid-March. Users already have access to the new 3-machine clusters.
Note: March 14, 2019
- The cluster DB has been migrated through work by Analytics and the DBAs; we have decided to
Postpone any changes to the research user account, in light of our plan to move away from MySQL as the main data store for data access.
- Dbstore migration is now
Done
Outcome 1 / Output 1.1 (SRE)
Current levels of service are maintained and/or improved for all production sites, services and underlying infrastructure.
- Deploy, update, configure, and maintain and improve production services, platforms, tooling, and infrastructure (Traffic infrastructure, databases & storage, MediaWiki application servers, (micro)services, network, Infrastructure Foundations, Analytics infrastructure, developer & release tooling, and miscellaneous sites & services)
Primary team(s): SRE, Dependencies on: Cloud Services, Analytics
Goal(s)
Base system/distribution update
- Remove remaining Ubuntu deployments from the production cluster
- Adjust our operating system base layer to work on the forthcoming Debian 10/buster release
- Install or upgrade 5 systems to buster
- Draft a policy for operating systems lifecycle and subsequent EOL dates
Status
Note: January 10, 2019
- Discussed that as we've just gotten back from our vacations, this work is ramping up and is
In progress
Note: February 13, 2019
- Bare-metal Buster installations are now working; the first production system with Buster (stat1005) is being installed. On track.
In progress
Note: March 27, 2019
- Remove remaining Ubuntu deployments from the production cluster is
Partially done and will wrap up by April 25, 2019. The rest of this goal is
Done.
Outcome 3 / Output 3.1 (WMCS)
Users can leverage a reliable and public Infrastructure as a Service (IaaS) product ecosystem for VPS hosting.
- Maintain existing OpenStack infrastructure and services
Primary team: WMCS
Goal(s)
- Replace Trusty with Debian Jessie/Stretch in Cloud Services infrastructure layer
In progress
- Remove all Ubuntu-based instances from all Cloud VPS projects
In progress
- Evaluate Ceph as a storage service component by building a proof of concept virtualized cluster
To do
Status
Note: January 9, 2019
- Discussed that as we've just gotten back from our vacations, this work is ramping up and is
In progress
Note: February 13, 2019
- Discussed that the replacement of Trusty is still
In progress and will probably extend into April.
- Removing all Ubuntu-based instances is
Partially done and notifications went out last week for the new grid information; we hope to be totally done by end of quarter, one way or another.
- Evaluating Ceph is still
To do as we work on other more pressing goals.
Note: March 13, 2019
- The timeline for replacing Trusty slipped a bit, but is
Partially done and is expected to be finished in Q4
- Removing Ubuntu instances is
Partially done but will also be extended into Q4
- Eval of Ceph has been
Postponed to Q4.
Outcome 3 / Output 3.2 (WMCS)
Users can leverage a reliable and public Infrastructure as a Service (IaaS) product ecosystem for VPS hosting.
- Replace the current network topology layer with OpenStack Neutron
Primary team: WMCS
Goal(s)
- Migrate 100% of Cloud VPS projects to the eqiad1 region and its Neutron SDN layer
In progress
- Rebuild "labtest" staging environment as "cloud-dev" staging environment
In progress
- (stretch goal) Upgrade OpenStack deployment to Newton or newer version on Debian Stretch hosts
In progress
Status
Note: January 9, 2019
- Discussed that as we've just gotten back from our vacations, this work is ramping up and is
In progress
Note: February 13, 2019
- Migration is still
In progress but
Stalled on ToolForge issues - will be completed in April 2019
- Rebuild is
In progress, racking and unracking hardware is fully underway.
- Our stretch goal is now
In progress, as the spike to figure out what we need to do is
Done. Implementation is
Blocked on the migration to Debian.
Note: March 13, 2019
- We are currently
Blocked on the Trusty issue
- We are hoping to complete the staging environment, which is
Partially done
- Stretch goal will be pushed to Q4
Outcome 4 / Output 4.1 (WMCS)
Members of the Wikimedia movement are able to develop and deploy technical solutions with a reasonable investment of time and resources on the Wikimedia Cloud Services Platform as a Service (PaaS) product.
- Maintain existing Grid Engine and Kubernetes web services infrastructure and ecosystems.
Primary team: WMCS
Goal(s)
- Build Debian Stretch grid engine in Toolforge and assist community in migration
Done
- Upgrade Toolforge Kubernetes cluster to a well supported version and plan future upgrade cycles
In progress
Status
Note: January 9, 2019
- Discussed that as we've just gotten back from our vacations, this work is ramping up and is
In progress
Note: February 13, 2019
- The Debian Stretch grid has been built out
Done and we're migrating and helping the community with any fixes as needed, which is
In progress
- Upgrading k8s cluster is also
In progress, with lots of work to be done after the active planning phase is completed.
Note: March 13, 2019
- Stretch grid is
In progress with a revised schedule of tasks, hoping to wrap up by week of March 25th (508 tools still remaining, nearly 50% moved)
- Upgrading the k8s cluster is
In progress and currently in the planning stage
Outcome 1 / Output 1.1 (SRE)
Current levels of service are maintained and/or improved for all production sites, services and underlying infrastructure.
- Deploy, update, configure, and maintain and improve production services, platforms, tooling, and infrastructure (Traffic infrastructure, databases & storage, MediaWiki application servers, (micro)services, network, Infrastructure Foundations, Analytics infrastructure, developer & release tooling, and miscellaneous sites & services)
Primary teams: SRE / Infrastructure Foundations
Goal(s)
First steps towards Puppet 5
- Upgrade Puppet agents to v5.5 across all production jessie & stretch systems
- Upgrade Facter to v3.11 across all production jessie, stretch and buster systems
- [stretch] Explore further steps towards the Puppetmaster, Hiera and PuppetDB upgrades
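A per-host sanity check matching the two upgrade targets above; in practice this would run through the fleet-wide tooling rather than one machine at a time:
<syntaxhighlight lang="python">
import subprocess


def version(cmd: str) -> str:
    """Return the version string reported by `cmd --version`."""
    out = subprocess.run([cmd, "--version"], capture_output=True, text=True)
    return out.stdout.strip()


print("puppet agent:", version("puppet"))  # target: 5.5.x
print("facter:", version("facter"))        # target: 3.11.x
</syntaxhighlight>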
Status
Note: May 8, 2019
- Both goals are
Partially done, stretch goal is also
Partially done
Note: June 7, 2019
- Goal is
Done, stretch goal is also
Partially done
Outcome 1 / Output 1.1 (SRE)
Current levels of service are maintained and/or improved for all production sites, services and underlying infrastructure.
- Deploy, update, configure, and maintain and improve production services, platforms, tooling, and infrastructure (Traffic infrastructure, databases & storage, MediaWiki application servers, (micro)services, network, Infrastructure Foundations, Analytics infrastructure, developer & release tooling, and miscellaneous sites & services)
Primary teams: SRE / Service Operations Dependencies on: WMCS, Community Relations
Goal(s)
Discourse
Status
Note: May 8, 2019
- Prep work is ongoing and in early stages
In progress
Note: June 13, 2019
- This is
Postponed for now, we'll revisit next FY
Outcome 1 / Output 1.1 (SRE)
Current levels of service are maintained and/or improved for all production sites, services and underlying infrastructure.
- Deploy, update, configure, and maintain and improve production services, platforms, tooling, and infrastructure (Traffic infrastructure, databases & storage, MediaWiki application servers, (micro)services, network, Infrastructure Foundations, Analytics infrastructure, developer & release tooling, and miscellaneous sites & services)
Primary teams: SRE / Service Operations Dependencies on: WMCS, Community Relations
Goal(s)
Address Database infrastructure blockers on datacenter switchover & multi-dc deployment
- Purchase, setup (accelerated) codfw core DB hosts
- Provision and set up the 13 new eqiad DBs
- Failover some old eqiad DB masters to new hardware
Status
Note: May 8, 2019
- This is currently
Blocked on data center ops work now that the servers have been purchased, but overall it is
In progress and ongoing.
Note: June 13, 2019
- Purchase and setup is
Done, provisioning is
In progress; failover is
In progress and on track to finish this quarter. We'd like to fail over some of the eqiad masters and see where it takes us.
Outcome 1 / Output 1.1 (SRE / Traffic)
Current levels of service are maintained and/or improved for all production sites, services and underlying infrastructure.
- Deploy, update, configure, and maintain and improve production services, platforms, tooling, and infrastructure (Traffic infrastructure, databases & storage, MediaWiki application servers, (micro)services, network, Infrastructure Foundations, Analytics infrastructure, developer & release tooling, and miscellaneous sites & services)
Primary teams: SRE / Traffic
Goal(s)
Deploy TLS 1.3 for main cache clusters (a verification sketch follows the goal list)
- https://phabricator.wikimedia.org/T170567
- Determine whether we're implementing this via ATS or nginx
- Puppetize a solution
- Test
- Deploy
Implement secure redirect service for non-canonical domains
Convert cache_upload to ATS backends in ulsfo
- https://phabricator.wikimedia.org/T219967
- Puppetize the combination of varnish-frontend + ATS-backend on a single cache node
- Convert all of cache_upload @ ulsfo to the new combined configuration
- Stretch: convert more/all datacenters
Deploy Bird-lg public looking glass service
Deploy Anycast Recursive DNS
- https://phabricator.wikimedia.org/T186550
- Finish any remaining design / implementation review
- Test failure scenarios, test service from every DC
- Enable via resolv.conf for some minimal set of production clients
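A small verification sketch for the TLS 1.3 goal, runnable from any client with Python 3.7+; the hostname is an example target:
<syntaxhighlight lang="python">
import socket
import ssl


def negotiated_tls_version(hostname: str, port: int = 443) -> str:
    """Connect and report the negotiated TLS protocol version."""
    ctx = ssl.create_default_context()
    ctx.minimum_version = ssl.TLSVersion.TLSv1_3  # refuse anything older
    with socket.create_connection((hostname, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.version()


print(negotiated_tls_version("en.wikipedia.org"))  # expect "TLSv1.3"
</syntaxhighlight>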
Status
Note: May 8, 2019
- We will most likely be doing this in ATS and it is
In progress with test/deploy to be done later.
- Secure redirect has not been started yet
- Converting cache_upload is
Done and all hosts are running ATS currently
- Conversations are
In progress to decide whether we want to deploy the Bird-lg public looking glass service; it might be declined later on.
- Deploy Anycast Recursive DNS is
In progress
Note: June 13, 2019
- Deploy TLS 1.3 for main cache clusters may be finished this quarter and is
Partially done.
- Implement secure redirect service for non-canonical domains is currently
Stalled due to more urgent work
- Deploy Bird-lg public looking glass service is
Not done and will move to next FY
- Deploy Anycast Recursive DNS is
In progress and should be finished up by the end of the quarter.
Outcome 1 / Output 1.5 (SRE)
Current levels of service are maintained and/or improved for all production sites, services and underlying infrastructure.
- We have scalable, reliable and secure systems for data transport and storage.
Primary teams: SRE / Infrastructure Foundations Dependencies on: Analytics Engineering
Goal(s)
Transition Kafka main ownership from Analytics Engineering to SRE
- Review current architecture/capacity and establish plan for Kafka main cluster upgrade/refresh to cover needs for next 2-3 years
- Audit existing Kafka main producers/consumers and document their configuration and use cases
- Establish guideline documentation for Kafka cluster use cases (main, jumbo, logging, etc.) (note: added to clarify scope of Kafka main cluster going forward)
- Upgrade and expand Kafka main cluster
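As a starting point for the producer/consumer audit, a rough sketch using the kafka-python client; the broker address is a placeholder, not a production hostname:
<syntaxhighlight lang="python">
from kafka import KafkaConsumer


def list_topics(bootstrap: str) -> set:
    """Enumerate the topics on a Kafka cluster."""
    consumer = KafkaConsumer(bootstrap_servers=bootstrap)
    try:
        return consumer.topics()
    finally:
        consumer.close()


for topic in sorted(list_topics("kafka-main.example:9092")):
    print(topic)
</syntaxhighlight>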
Status
Note: May 8, 2019
- This is
In progress and we are working on hardware procurement.
Note: June 13, 2019
- Design and planning is
Done, the audit of existing producers/consumers is
In progress, documentation is
Done and the setup to upgrade and expand is
In progress.
Outcome / Output (Analytics)
We have scalable, reliable and secure systems for data transport and data processing.
Dependencies on: SRE
Goal(s)
- Remove computational bottlenecks in stats machines by modernizing our hardware: the addition of a GPU that can be used to train ML models (T148843)
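A quick sanity check for the new GPU, assuming a recent TensorFlow build on the stat host (older 1.x installs expose this through tf.config.experimental instead):
<syntaxhighlight lang="python">
import tensorflow as tf

# List the GPUs visible to TensorFlow (TF 2.x API).
gpus = tf.config.list_physical_devices("GPU")
print(f"TensorFlow sees {len(gpus)} GPU(s): {gpus}")
</syntaxhighlight>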
Status
To do May 2019
- The GPU is working and TensorFlow ML tests are running. We do, however, have one closed-source library containing OpenCL image-processing code that we need to see if we can stub out and use.
In progress
To do June 2019
- Discussed...
Outcome / Output (Analytics)
Current levels of service are maintained and/or improved for all production sites, services and underlying infrastructure.
Goal(s)
- Upgrade the Cloudera distribution of the analytics cluster to CDH 5.16 (T218343)
Note: this work is contingent on the team evaluating whether the new version provides enough value in terms of security upgrades.
Status
To do May 2019
Done
Outcome 1 / Output 1.1 (Performance)
Current levels of service are maintained and/or improved for all production sites, services and underlying infrastructure.
- Deploy, update, configure, and maintain and improve production services, platforms, tooling, and infrastructure (Traffic infrastructure, databases & storage, MediaWiki application servers, (micro)services, network, Infrastructure Foundations, Analytics infrastructure, developer & release tooling, and miscellaneous sites & services)
Dependencies on: SRE, Core Platform, Multimedia
Goal(s)
- Explore/implement {{PREVIEWMODE}} concept and curtail use of {{REVISIONID}}
- Evaluate alternatives for image-embedding in ResourceLoader stylesheets.
- Investigate how to make DB replication lag checks account for secondary datacenters (a lag-check sketch follows this list).
- Get AbuseFilterCachingParser re-enabled
- Help TimedMediaHandler migration from Kaltura to Video.js
- Mobile device testing (add documentation for setting up phones and server)
- Use Docker for WebPageTest.
- New performance APIs origin trials (Event Timing, Priority Hints, Feature Policy reporting, Element Timing for Images, Layout Stability)
- Reduce reliance on master-DB writes for file-dependency tracking.
- Swift cleanup + WebP ramp up.
- Write two performance topic blog posts.
- [Ongoing] Support and maintenance of MediaWiki's ResourceLoader and associated components
- [Ongoing] Support and maintenance of MediaWiki's object caching and data access components
- [Ongoing] Support and maintenance of Thumbor/thumbnail infrastructure
- [Ongoing] Support and maintenance of WebPageTest and synthetic testing infrastructure
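For the replication-lag item, a hedged sketch of the underlying idea: a check that considers the worst replica lag across every datacenter rather than only the local one. The DC names, lag values, and threshold are illustrative inputs:
<syntaxhighlight lang="python">
from typing import Dict

MAX_ALLOWED_LAG = 5.0  # seconds; illustrative threshold


def effective_lag(lag_by_dc: Dict[str, float]) -> float:
    """Worst replica lag (in seconds) across all datacenters."""
    return max(lag_by_dc.values(), default=0.0)


def safe_to_defer_to_replicas(lag_by_dc: Dict[str, float]) -> bool:
    return effective_lag(lag_by_dc) <= MAX_ALLOWED_LAG


print(safe_to_defer_to_replicas({"eqiad": 0.4, "codfw": 1.8}))  # True
</syntaxhighlight>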
Status
Note: May 8, 2019
- Evaluate alternatives for image-embedding in ResourceLoader stylesheets is
To do
- Getting AbuseFilterCachingParser re-enabled is also
To do, as is the TimedMediaHandler migration from Kaltura, which is
To do
- Swift cleanup + WebP ramp up has not started yet either and is
To do
- Everything else on this list is currently being worked on and is
In progress
Note: June 27, 2019
- Work on the AbuseFilterCachingParser is
Done for this goal/team.
- TimedMediaHandler migration is also
Done
- Swift cleanup + WebP ramp up is
In progress and will continue next FY
- Mobile device testing and documentation is
Done
- Docker for WebPageTest testing is
Done
- API origin trials are also
Done
- Reduce reliance on master-DB writes for file-dependency tracking is still
In progress and will go into early next quarter.
- Swift cleanup and WebP ramp up is
In progress and will continue into next quarter
- We wrote 5 performance topic blog posts, yay!
Done
Outcome 1 / Output 1.1 (RelEng)
Current levels of service are maintained and/or improved for all production sites, services and underlying infrastructure.
Goal(s)
- Undeploy the CodeReview extension.
Status
Note: April 8, 2019
- This is
In progress
Note: May 7, 2019
- James and Core Platform need to chat about this, so
Blocked for now
Note: June 4, 2019
- This is still
Blocked for now and probably won't be able to be completed this quarter. We want to clean up the code, but this is not an urgent goal.
Outcome 2 / Output 2.1 (Performance)
Better designed systems
- Assist in the architectural design of new services and making them operate at scale
Primary team: Performance
Goal(s)
- Supporting MW sessions in Cassandra for Beta Cluster.
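An illustrative sketch with the DataStax cassandra-driver of what session reads and writes look like; the contact point, keyspace, and table names are hypothetical, not the actual session-store schema:
<syntaxhighlight lang="python">
from cassandra.cluster import Cluster

cluster = Cluster(["cassandra-beta.example"])  # placeholder contact point
session = cluster.connect("sessions")          # hypothetical keyspace

# Sessions expire on their own thanks to the TTL.
session.execute(
    "INSERT INTO session (id, value) VALUES (%s, %s) USING TTL 3600",
    ("abc123", "serialized-session-blob"),
)
row = session.execute("SELECT value FROM session WHERE id = %s", ("abc123",)).one()
print(row.value)
cluster.shutdown()
</syntaxhighlight>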
Status
Note: May 8, 2019
- This is now
In progress
Note: June 27, 2019
- This is
In progress and is expected to finish up by end of next quarter.
Outcome 3 / Output 3.1 (WMCS)
Users can leverage a reliable and public Infrastructure as a Service (IaaS) product ecosystem for VPS hosting.
- Maintain existing OpenStack infrastructure and services
Primary team: WMCS
Goal(s)
- Complete partial goals from past quarters
Done
- Replace Trusty with Debian Jessie/Stretch in Cloud Services infrastructure layer
- Remove all Ubuntu-based instances from all Cloud VPS projects
- Evaluate Ceph as a storage service component by building a proof of concept virtualized cluster
Status
Note: June 6, 2019
- This is now
Done: Replace Trusty with Debian Jessie/Stretch in Cloud Services
- Ceph design and evaluation is fully
In progress; this work will continue into Q1 as a major goal.
Note: June 27, 2019
- Evaluating Ceph is still
In progress; we are getting hardware to build up the cluster (discussing things like where and how to rack servers, etc.), and this work will continue into next FY as the largest ongoing goal for the year.
Outcome 3 / Output 3.2 (WMCS)
Users can leverage a reliable and public Infrastructure as a Service (IaaS) product ecosystem for VPS hosting.
- Replace the current network topology layer with OpenStack Neutron
Primary team: WMCS
Goal(s)
- Complete partial goals from past quarters
- Migrate 100% of Cloud VPS projects to the eqiad1 region and its Neutron SDN layer
Done
- Rebuild "labtest" staging environment as "cloud-dev" staging environment
Done
- Upgrade OpenStack deployment to Newton or newer version on Debian Stretch hosts
In progress
Status
Note: June 6, 2019
- Migrating and rebuilding the new environment are both
Done and the upgrade of OpenStack deployment is
In progress and will probably run into Q1.
Note: June 27, 2019
- Neutron is
Done - yay! OpenStack update is still
In progress as a stretch goal and will continue into next FY with help from DC Ops to get onto bigger network links and switches.
Outcome 4 / Output 4.1 (WMCS)
Members of the Wikimedia movement are able to develop and deploy technical solutions with a reasonable investment of time and resources on the Wikimedia Cloud Services Platform as a Service (PaaS) product.
- Maintain existing Grid Engine and Kubernetes web services infrastructure and ecosystems.
Primary team: WMCS
Goal(s)
- Complete partial goals from past quarters
- Upgrade Toolforge Kubernetes cluster to a well supported version and plan future upgrade cycles
Status
Note: June 6, 2019
- The Toolforge upgrade is
Postponed due to unforeseen staffing issues but we hope to have it done early next FY.
Note: June 27, 2019
- The Toolforge upgrade is now
In progress, but will continue through next FY.
Outcome 6 / Output 6.1 (Core Platform)
Improved MediaWiki availability and reduced read-only impact from data center fail-overs
- Production deployment of routing of MediaWiki GET/HEAD requests to the secondary data center.
Dependencies on: SRE
Goal(s)
- Prepare MW and the secondary DC for enabling the ability to serve requests from both DCs
- Finish comment storage changes
- Finish actor storage changes
Status
Note: May 8, 2019
- This is
In progress
Note: June 27, 2019
- This is
In progress and we'll have a complete definition of done on July 1 (will wrap up by July 15)