This analysis is dated March 17th 2014. Media Viewer, which is what's being measured here, is currently in beta testing on various websites. For a few weeks now, we've been measuring detailed network performance on a sample of users.
What the graphs tell us
Some of the data has been put in the form of limn graphs. What do these graphs tell us?
- Most API calls have stable performance, usually < 200ms. Correction: this seems to be completely wrong. Now that I've looked directly at the SQL data, the data extracted on stat1001 appears to be incorrect. We'll have to revisit this entire section once it's fixed.
- The "Global usage" API call has noticeable spikes (10 times slower than usual on March 11th). We should investigate what could cause this API call to be slow sometimes.
- "Image usage" and "Image info" have smaller spikes; I'm not sure they're worth investigating right now.
- Image load performance has seen noticeable spikes. I think the issue here is that the graphs don't differentiate varnish cache hits and misses. It's likely that the spike on March 11th was due to someone visiting a gallery full of uncached thumbs. Further analysis is required.
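One possible way to flag spikes like the March 11th ones without reading them off a graph is to compare each day's mean against the median of all days. The sketch below is only an illustration: the function name, the factor of 10 (borrowed from the "10 times slower than usual" observation) and the sample values are all made up, not taken from the real data.

```python
from statistics import median


def find_spikes(daily_means, factor=10.0):
    """Return the days whose mean duration is at least `factor` times the median day."""
    baseline = median(daily_means.values())
    return sorted(day for day, value in daily_means.items() if value >= factor * baseline)


# Made-up daily means in milliseconds, for illustration only.
sample = {
    "2014-03-09": 180.0,
    "2014-03-10": 190.0,
    "2014-03-11": 2000.0,
    "2014-03-12": 185.0,
    "2014-03-13": 175.0,
}

print(find_spikes(sample))
```

Using the median as the baseline keeps a single very slow day from dragging the reference value up, which a plain average would do.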
What the graphs don't tell us (yet?) that we can see in the data
- What's the geographical distribution of performance? Are some countries impacted by poor performance that isn't proportional to these countries' average internet speed?
We don't have data for many countries, and for each country we don't have that many data points, but we can already see that the gap between the US and the rest of the world can be huge. This is probably an issue for our wikis in general, though, and not specific to Media Viewer. According to netindex.com, all the countries below have similar average broadband speeds, between 21 and 27 Mbps.
SELECT AVG(event_total) AS average, event_country, COUNT(*) AS country_count
FROM `MultimediaViewerNetworkPerformance_7488625`
WHERE event_country != '' AND event_type != 'image'
GROUP BY event_country
HAVING country_count > 10
ORDER BY average

average    event_country  country_count
278.2186   US             183
391.6897   DE             29
427.9286   GB             14
643.3846   FR             26
1632.2857  RU             14
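To make the gap easier to read, the averages returned by the query can be expressed relative to the US figure. This is a small sketch using only the five averages listed above (which exclude image requests); the function and variable names are illustrative.

```python
# Averages per country, copied from the query result above.
averages = {
    "US": 278.2186,
    "DE": 391.6897,
    "GB": 427.9286,
    "FR": 643.3846,
    "RU": 1632.2857,
}


def slowdown_vs(reference, per_country):
    """Express each country's average as a multiple of the reference country's average."""
    base = per_country[reference]
    return {country: round(avg / base, 2) for country, avg in per_country.items()}


ratios = slowdown_vs("US", averages)
for country, ratio in ratios.items():
    print(f"{country}: {ratio}x the US average")
```

On these numbers, Russia comes out close to six times slower than the US, even though its average broadband speed is in the same range. With only 14 to 29 data points for most countries, the ratios are indicative at best.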
With that in mind, we can start looking at US-only stats (the "best case scenario", in terms of networking), which should paint a different picture than the limn graphs. Google doc looking at data distribution for US stats.
The key finding when looking at US data is that when there's a varnish hit (presumably the really slow ones are varnish misses, to be confirmed), the bottleneck doesn't seem to be the images, but is in fact the API call(s). I think this shows that it's critical that we patch MediaWiki core to include the image size information in the DOM, because the thumbnail info API call is much slower than the subsequent thumbnail request itself. It would allow us to display the sharp thumbnail 200ms sooner on average.
- What's the performance of http vs https?
Since we're only measuring logged-in users at the moment, there isn't enough http data to compare.
- What's the mean image load time for varnish cache hits vs cache misses?
All calls except image calls have varnish data. This is because Chrome blocks header access on CORS requests. As a result we don't have any data to answer this question, which is probably the most important one. There is hope, though, as the "Access-Control-Expose-Headers" header might do the trick. I've sent a changeset to ops to turn it on for images. We basically need:
Access-Control-Expose-Headers: Content-Length, X-Cache, X-Varnish, Age, Date
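Once those headers are readable for image requests, the hit vs miss question could be answered along these lines. This is a rough sketch under stated assumptions, not the actual logging code: it assumes each logged sample carries the raw X-Cache value and a duration in milliseconds, and that the X-Cache value contains the word "hit" for every cache layer that served the object. The sample header strings and host names are hypothetical.

```python
import re
from statistics import mean


def is_cache_hit(x_cache):
    """Assumption: a request counts as a hit if any cache layer in X-Cache reports 'hit'."""
    return re.search(r"\bhit\b", x_cache, flags=re.IGNORECASE) is not None


def mean_by_cache_status(samples):
    """Group (x_cache, duration_ms) pairs by hit/miss and return the mean of each group."""
    groups = {"hit": [], "miss": []}
    for x_cache, duration_ms in samples:
        groups["hit" if is_cache_hit(x_cache) else "miss"].append(duration_ms)
    # Skip empty groups so that mean() never receives an empty list.
    return {status: mean(values) for status, values in groups.items() if values}


# Hypothetical samples, for illustration only.
samples = [
    ("cache-a hit (3), cache-b miss (0)", 120.0),
    ("cache-a miss (0), cache-b miss (0)", 900.0),
    ("cache-a hit (1), cache-b hit (7)", 80.0),
]

result = mean_by_cache_status(samples)
print(result)
```

Treating a hit at any layer as a hit is a simplification; if the layers turn out to behave very differently, they could be split into separate groups.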
- To what degree does our performance correlate with bandwidth?
We don't have any entries with "bandwidth" set (would come from mobile traffic). We'll need more data before we can answer that question.