Wikimedia Performance Team
As the Wikimedia Foundation’s Performance Team, we want to create value for readers and editors by making it possible to retrieve and render content at the speed of thought, from anywhere in the world, on the broadest range of devices and connection profiles.
Outreach. Our team strives to develop a culture of performance first in the movement. Through communication, embedding ourselves in the product lifecycle and training, we want to make performance a prime consideration in technological and product developments across the movement.
Monitoring. By developing better tooling, designing better metrics, and automatically tracking regressions, all in a way that can be reused by anyone, we want to monitor the right metrics and discover issues that can be hard to detect.
Improvement. Some performance gains require a very high level of expertise and complex work to happen before they are possible. We undertake large projects, often on our legacy code base, that can yield important performance gains in the long run.
Knowledge. We are the movement's reference on all things performance, which requires keeping up with rapid changes in technology across our entire stack. In order to disseminate correct information in our outreach, we aim to build the most comprehensive knowledge base about performance.
Availability. Although the Wikimedia Foundation currently operates six data centers, MediaWiki runs in only one of them (in Ashburn, Virginia). If you are an editor in Jakarta, Indonesia, content has to travel over 15,000 kilometers to get from our servers to you (or vice versa). To run MediaWiki from multiple places across the globe, our code needs to be more resilient to failure modes that can occur when different subsystems are geographically remote from one another.
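To put that distance in perspective, the following is a rough lower bound on network latency, not a measurement. It assumes that light travels through optical fiber at roughly 200,000 km/s (about two thirds of its speed in a vacuum), that the path is a direct 15,000 km, and that routing, queuing and connection handshakes cost nothing. The constant and function names are illustrative.

```python
# Assumption: approximate propagation speed of light in optical fiber.
FIBER_SPEED_KM_PER_S = 200_000


def one_way_latency_ms(distance_km: float) -> float:
    """Lower bound on one-way propagation delay, in milliseconds."""
    return distance_km * 1000 / FIBER_SPEED_KM_PER_S


def round_trip_latency_ms(distance_km: float) -> float:
    """Lower bound on a full request/response round trip, in milliseconds."""
    return 2 * one_way_latency_ms(distance_km)


if __name__ == "__main__":
    print(one_way_latency_ms(15_000))     # 75.0
    print(round_trip_latency_ms(15_000))  # 150.0
```

Under these assumptions every round trip costs at least 150 ms before any server work happens, and an interaction that needs several round trips pays that cost each time.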
Performance testing infrastructure. WebPageTest provides a stable reference for a set of browsers, devices, and connection types from different points in the world. It collects very detailed telemetry that we use to find regressions and pinpoint where problems are coming from. This is in addition to the more basic Navigation Timing metrics we gather from real users in production.
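As an illustration of what "basic Navigation Timing metrics" can look like, here is a minimal sketch that turns raw timestamps into durations and takes a median across users. The attribute names (`navigationStart`, `responseStart`, `domInteractive`, `loadEventEnd`) come from the browser Navigation Timing API; the function names, the choice of metrics and the sample values are assumptions made for this example, not the team's actual pipeline.

```python
from statistics import median


def derive_metrics(timing: dict) -> dict:
    """Convert absolute Navigation Timing timestamps (ms) into durations."""
    start = timing["navigationStart"]
    return {
        "ttfb_ms": timing["responseStart"] - start,
        "dom_interactive_ms": timing["domInteractive"] - start,
        "load_ms": timing["loadEventEnd"] - start,
    }


def median_metric(samples: list, name: str) -> float:
    """Median of one derived metric across many page views."""
    return median(derive_metrics(sample)[name] for sample in samples)


if __name__ == "__main__":
    # Invented sample data: three page views with different load times.
    samples = [
        {"navigationStart": 0, "responseStart": 200,
         "domInteractive": 600, "loadEventEnd": 1000},
        {"navigationStart": 0, "responseStart": 300,
         "domInteractive": 900, "loadEventEnd": 1500},
        {"navigationStart": 0, "responseStart": 800,
         "domInteractive": 2000, "loadEventEnd": 3000},
    ]
    print(median_metric(samples, "load_ms"))
```

A median is used here because a single slow page view, such as the third sample, would otherwise dominate an average.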
Media stack. We're currently overhauling our thumbnail infrastructure to achieve several goals: improving long-term maintainability by moving thumbnail code out of MediaWiki-Core and into a service that performs thumbnail operations; saving on expensive storage by no longer storing multiple copies of every thumbnail on disk forever; enabling far-future expires for images, which greatly improves their client cacheability; and switching to the best-performing underlying software to generate thumbnails faster.
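For readers unfamiliar with the term, "far-future expires" means telling clients they may reuse a cached image for a long time without asking the server again. The sketch below shows the general idea with standard HTTP caching headers. It is a generic illustration, not the team's configuration: the one-year lifetime and the function name are assumptions.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime

# Assumption: a one-year cache lifetime, a common "far-future" choice.
ONE_YEAR_SECONDS = 365 * 24 * 60 * 60


def far_future_headers(now: datetime) -> dict:
    """Build response headers that let clients cache for a year.

    `now` must be a timezone-aware UTC datetime.
    """
    expires = now + timedelta(seconds=ONE_YEAR_SECONDS)
    return {
        "Cache-Control": f"public, max-age={ONE_YEAR_SECONDS}",
        "Expires": format_datetime(expires, usegmt=True),
    }


if __name__ == "__main__":
    for name, value in far_future_headers(datetime.now(timezone.utc)).items():
        print(f"{name}: {value}")
```

Long lifetimes like this are generally only safe when an image's URL changes whenever its content changes, since clients will not revalidate a cached copy.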
Presentations and blog posts
- Watch 2015-08-18 Tech talk on "Let's talk about web performance".
- Watch 2016-01-14 Tech talk on "Creating Useful Dashboards with Grafana".
- The speed of thought - ongoing blog posts from the Performance Team
A big part of our work is devoted to collecting and analyzing site performance data to ensure that we have a holistic and accurate understanding of what users experience when they access Wikimedia sites. You can discover our dashboards by visiting the Wikimedia performance portal. A selection of our dashboards is also provided here.
- Navigation Timing
- Job Queue Health
- Save Timing
- Edit Stash
- MySQL aggregate
Below is an overview of the various applications, tools and services we use for collecting, processing and displaying our data.
Maintained by the Wikimedia Performance Team:
- coal-web (see | backend | frontend) - [Python] Custom Graphite writer and web API. Frontend graphs using D3.
- PerformanceInspector (docs | GitHub | Gerrit) - [JS] MediaWiki plugin to profile the current page and find potential performance problems.
- performance.wikimedia.org (see | GitHub | Gerrit | Puppet role) - Static site.
- perflogbot (source) - An IRC bot for tracking MediaWiki ResourceLoader behavior.
- Xenon CLI tools
We also use:
- Grafana at https://grafana.wikimedia.org - See our documentation and #Dashboards.
- XHGui at https://performance.wikimedia.org/xhgui.
- FlameGraphs (brendangregg/FlameGraph) from MediaWiki production requests at https://performance.wikimedia.org/xenon.
- Logstash at logstash.wikimedia.org (NDA restricted).
- dbtree - Detailed MySQL cluster information (see also: Grafana: MySQL dashboard)
- WebPageTest at wpt.wmftest.org.
- XHProf (HHVM extension) - Profile any request via X-Wikimedia-Debug and view it in XHGui.
- wikimedia/arc-lamp - Subscribe to HHVM Xenon and send aggregated and sampled profiles from production requests to Redis. Used for flame graphs.
- Navigation Timing (docs | GitHub | Gerrit) - MediaWiki plugin to collect Navigation Timing data.
- WebPageTest configuration
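The flame graphs mentioned in this list are built from sampled call stacks. As a minimal sketch of the aggregation step, the code below collapses samples into the "folded" text format read by brendangregg/FlameGraph: frames joined by semicolons, followed by a space and a sample count. The function name and the sample frames are invented for illustration, and this is not arc-lamp's actual code.

```python
from collections import Counter


def fold_stacks(samples):
    """Collapse sampled call stacks into sorted folded-stack lines.

    Each sample is a list of frame names ordered from root to leaf.
    """
    counts = Counter(";".join(frames) for frames in samples)
    return [f"{stack} {count}" for stack, count in sorted(counts.items())]


if __name__ == "__main__":
    # Invented samples: the same stack seen twice, plus one other stack.
    samples = [
        ["main", "parse"],
        ["main", "parse"],
        ["main", "render"],
    ]
    for line in fold_stacks(samples):
        print(line)
```

Because identical stacks are merged and counted, the width of each box in the resulting flame graph reflects how often that code path appeared in the samples.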
We also collect data from the following features/services:
- Migration from Zend PHP to HHVM. This greatly reduced backend response time (2x faster).
- Helped with the HTTPS + HTTP/2 migration.
- Asynchronous ResourceLoader top queue.
- Improved cache hits for front-end resources.
- DeferredUpdates (src). Greatly contributed to bringing edit save time median below 1 second.
- WebPageTest. Now used by several teams to do synthetic performance testing.
- Xenon/Flame graphs. Surfaces the performance of our entire PHP backend.
- NavigationTiming improvements. We are now able to track real user performance in a more fine-grained fashion.
- Introduction of many Grafana dashboards to track performance (see list above).
- Statsv. Greatly simplifies sending lightweight data to statsd and Graphite from front-end code and apps.
- Helped improve the performance of the new portal page.
- Helped on the images lazy loading project, which focused on improving the performance of the mobile site.
- Implemented a performance metric alert system on top of Grafana.
- Thumbor. Rewrote the media thumbnailing layer for Wikimedia production.
- Migrated MediaWiki and all extensions to jQuery 3.
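The Statsv item above refers to sending metrics to statsd. For context, statsd accepts plain-text lines of the form `name:value|type`, where the type is `c` for a counter, `ms` for a timing, or `g` for a gauge. The sketch below formats such a line. It illustrates the statsd line format only and does not reproduce Statsv's own request format; the function name and the metric name are hypothetical.

```python
# The three core statsd metric types covered by this sketch.
VALID_TYPES = {"c", "ms", "g"}


def statsd_line(name: str, value: int, metric_type: str) -> str:
    """Format one metric as a statsd protocol line."""
    if metric_type not in VALID_TYPES:
        raise ValueError(f"unsupported statsd metric type: {metric_type!r}")
    return f"{name}:{value}|{metric_type}"


if __name__ == "__main__":
    # Hypothetical metric name, used only for this example.
    print(statsd_line("frontend.example.load", 1235, "ms"))
```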