Wikimedia Performance Team/Sprints

Status
 * 🔴 < 30% done
 * 🟡 < 70% done
 * 🟢 70 to 100% done

FY22-23 Q2 (Oct-Dec)
Improvement:


 * Research opportunities in static.php traffic to identify simpler and longer-lasting caching policies. Reduce backend traffic to static.php by more than 70%, and remove a custom WMF-specific endpoint in the process in favour of standard MediaWiki routes, which require less maintenance going forward. (T285232, T302465)

2021
See also internal 2021-2022 roadmap and internal Jan-Mar 2022 achievements.

Outreach:


 * Support product development by Inuka Team (Wikipedia Preview), Reading Web (NearbyPages, and RelatedArticles), CPT (WebAuthn), Design Systems Team (WVUI/Vue.js), and WMDE (Kartographer-revid)
 * Participate in SLO working group to help establish an SLO around MediaWiki Save Timing.
 * Participate in W3C WebPerf WG, provide feedback to Chrome team on Google Web Vitals and Chrome bugs.
 * Organise the Web Performance devroom for FOSDEM 2021 (recordings).
 * Speak at the We Love Speed conference (recording).
 * Organise four Web Perf Hero awards.

Insights:


 * Migrate our device lab to BitBar.
 * Evaluate and build proof-of-concept synthetic testing on bare metal instead of at AWS.
 * Write runbooks for investigating RUM alerts, WPT alerts, and WPR alerts.
 * Support to SRE Observability in developing a new Prometheus-compatible MW-Stats client library.
 * On-going maintenance of WebPageTest, WebPageReplay, and Fresh-node.

Improvement:


 * Multi-DC: Deploy MainStash DB and migrate away from Redis-based MainStash (T212129).
 * Multi-DC: MariaDB-TLS tested and enabled for all wikis.
 * Multi-DC: CDN routing logic written and deployed to Beta and Prod behind feature flag.
 * ResourceLoader debug mode v2, reducing wait time on complex pages from ~1 minute to ~1 second.
 * Guidance and code review for DBA-led normalization of "templatelinks" MediaWiki database table, to reduce storage pressure and improve query performance. (T299417)
 * Support to SRE ServiceOps for MW-on-K8s project.
 * Develop precache-based GlobalUserEdit API for CentralAuth, following an incident.

2020
See also internal 2020-2021 roadmap.

Outreach:


 * Support product launch by Anti-Harassment Team (IPInfo extension), and CPT (API Portal skin, API Portal OAuth extension, Changes to OAuth ext).
 * Support development kick-off of Abstract Wikipedia (WikiLambda) through early check-in and 1-month team residency/matrixing in both directions.
 * Organise the first Web Performance conference at FOSDEM (blogpost, recordings).
 * Organise the first Web Perf Hero award.
 * Get published in the Web Performance Calendar (4x: Human performance metrics, Profiling PHP at scale, Future of Web Vitals from a non-Googler, Setting up a device lab).
 * Enable teams to create their own production error dashboards in Logstash with a template, written guide, and video presentation.

Insights:


 * Expand navtiming RUM metrics pipeline with new Layout Shift metric.
 * Kobiton setup for our device lab, expand to include iOS in addition to Android.
 * Explore BitBar for our device lab.
 * Explore moving WPT/WPR infra away from AWS.

Improvement:


 * Multi-DC: Implement multi-dc strategy for ChronologyProtector (T254634).
 * Multi-DC: Determine and start implementing strategy for MainStash DB (T212129).

2019
See also 2019-20 Q1#Performance and internal 2019-2020 roadmap.


Outreach:


 * Design and implement the AS Report, to expand and formalize collaborations to leverage our influence with browser vendors and ISPs. (Announcement on Techblog).
 * Initiate and work on Wikimedia Foundation becoming an official W3C member organization. This expands the Performance Team's participation in web standards and moves us from an "invited expert" (individual) to a represented membership organisation. (Announcement on wikimediafoundation.org)
 * Support product launches by Parsing Team (Parsoid-PHP launch), Editing Team (DiscussionTools launch), Growth Team (GrowthExperiments launch), and Inuka Team (Wikipedia KaiOS app launch).
 * Support RelEng around establishing production error triage workflows and semi-automation thereof.
 * Organise WMF-wide frontend web performance training.
 * Provide performance expertise to Frontend Architecture Working Group (FAWG).
 * Get published in the Web Performance Calendar (2x: Measuring LT and FID, Big questions on RUM).

Insights:


 * Research, develop, and test new RUM metrics that better match user perception (T187299, Rossi 2019 paper).
 * Organise and oversee implementation of First Paint metric in WebKit for Apple Safari (blog post).
 * Introduce automatic developer-facing performance metrics for specific chunks of MediaWiki code in core and extensions, powered by WANObjectCache (T197849).
 * Add more RUM metrics to the navtiming pipeline, including instrumentation for First Input Delay (T332012).
 * Participate in Chrome Origin trial for Element Timing and provide feedback on upcoming W3C standard (blog post).
 * Release WikimediaDebug v2 (blog post).
 * Create our own Mobile Device Lab.
 * On-going first response to synthetic testing alerts, including investigating regressions after Chrome/Firefox releases and comms with upstream browser vendors.
 * On-going maintenance of WebPageTest and WebPageReplay.
 * On-going maintenance of XHGui, including dealing with MongoDB becoming non-free software by developing and upstreaming MySQL drivers for XHGui, and migrating our install from MongoDB to MySQL.


Improvement:


 * PHP7 Transition: Finish the transition from HHVM and support SRE with instrumentation, sampling, and benchmarking.
 * Multi-DC: Start work on MainStash DB.
 * Faster MediaWiki backend startup time to reclaim PHP7 latency increase in certain areas (T233886, T189966).
 * Faster page load time, by reducing ResourceLoader startup cost (blog post).
 * Guidance, CR and testing for new AbuseFilter parser (development by Daimona) to improve Save Timing (T156095).

2018
See also 2018-19 Q1, 2018-19 Q2, and internal 2018-2019 roadmap.

Insights:


 * Annual Plans/FY2019/TEC1: Current levels of service are maintained and/or improved.
 * Enhance performance testing infrastructure, including addition of Chrome Tracelog (T182510), and introduction of WebPageReplay+Browsertime (based on last year's research) to complement and eventually replace WebPageTest (T153360). Blog post: Performance testing in a controlled lab environment
 * Introduce Excimer, a new sampling profiler for PHP 7 to replace HHVM Xenon (T176916). Includes creation of the new php-excimer extension (blog post).
 * Implement new "Backend-Timing" metric on Apache PHP web servers, as first full measurement of MediaWiki latencies. Backed by Prometheus. (T131894)
 * Migrate WebPageTest hosting from Windows to Linux.
 * Expand synthetic testing to more non-English wikis.
 * Introduce Fresnel, performance testing in MediaWiki CI jobs (T133646).
 * Review current research on performance perception (T165272, T187299). Essay: Perceived Performance (2018). Blog posts: Mobile web performance: the importance of the device, Machine learning: how to undersample the wrong way.
 * Develop new "navtiming2" metric definitions, addressing what we learned since 2015, and enable use of stacked graphs (T104902).
 * On-going maintenance of navtiming.py service, including migration to dedicated hardware, and support for failover to secondary datacenter.

Outreach:


 * Measure performance from Asia both pre- and post- Singapore data center coming online (T169180, T168416), including a new navtiming capability for geographic oversampling (T169522). (blog post)
 * Publish the first post in the Perf Matters at Wikipedia series.
 * Get published in the Web Performance Calendar (5x: Magic numbers, Comparing HAR, Measuring Wikipedia, Why perf matters, AVIF).

Improvement:


 * Annual Plans/FY2019/TEC1: Improve MediaWiki availability and reduce read-only impact from data center switchovers.
 * Multi-DC: Develop integration and support for Mcrouter service in MediaWiki's WANObjectCache, support SRE's rollout of mcrouter service. (T198239)
 * Annual Plans/FY2019/TEC4: PHP7 Migration: Guide the work and support other teams.
 * Introduce support for packageFiles to ResourceLoader (T133462).
 * Introduce support for WebP compression format to Thumbor.
 * Reduce page load time by refactoring the startup module to need only one roundtrip instead of two, effectively loading jQuery in parallel outside the critical path (T192623).

2017
See also Annual Plan/2017-2018#Technology, 2017-18 Q3, 2017-18 Q4, and internal 2017-2018 roadmap.

Outreach:


 * Publish in the Web Performance Calendar (Automate performance regression alerts).

Insights:


 * Program 1. Availability, performance, and maintenance.
 * All production sites and services maintain current levels of availability or better.
 * Maintain a comprehensive toolset to measure the performance of our platforms.
 * Research reverse proxy technologies with the objective of obtaining more stable metrics from synthetic testing infrastructure, increasing confidence and reducing the minimum regression size for detection. Evaluated Mahimahi, WebPageReplay, and mitmproxy; selected WebPageReplay. Deployed WebPageReplay+Browsertime to complement and eventually replace WebPageTest (T153360).

Improvement:


 * Support for HHVM-PHP7 migration and upgrade.
 * Expand support in Thumbor to private wikis. Thumbor service replaces MediaWiki ImageHandler (3-part blog post series).
 * Program 8. Progress towards multi-datacenter support (Performance/Multi-DC MediaWiki).
 * Faster Wikipedia time-to-logo. (blog post, T100999)
 * Faster edit save timing. (blog post)
 * Faster page load time. Reduce load time on 3G-Slow connections by one whole second, from 14s to 13s. (T164299#3572231)
 * Phase out "mediawiki.legacy.wikibits" module to reduce page view cost. (T122755)
 * Migrate MediaWiki core and all deployed extensions to jQuery 3, a multi-month cross-team effort. (T124742)

2016
See also Perf Matters at Wikipedia in 2016 (Blog post), and Annual Plan/2016-2017 Program 4: Improve site performance.

Insights:


 * Enhance performance testing infrastructure, including speeding up the infrastructure to achieve hourly testing instead of every 3 hours (T151197), and adding new metrics for DOM size (T159362).

Improvement:


 * Help develop Thumbor as service to replace MediaWiki FileHandler in production (3-part blog post series).
 * Help guide and prepare for HTTP/2 roll out to Wikimedia CDN (blog post).
 * Progress towards multi-datacenter support (Performance/Multi-DC MediaWiki).

2015
See also Perf Matters at Wikipedia in 2015 (Blog post).