Talk:Analytics/Reports/ULSFOImpact

Precise Differences on Overall Page Latency after Deployment per Country

(Just to avoid confusion, none of the following 7 items are a blocker for me. Caistleitner (talk) 19:04, 28 February 2014 (UTC))

1000 data points / week vs 1000 data points total

While the first paragraph ends with “countries for which we had at least 1000 data points per week.”, it seems the code actually sums over all weeks and then checks whether this grand total is above 1000. --Caistleitner (talk) 19:01, 28 February 2014 (UTC)

Corrected now; the data actually looks more consistent. NRuiz (WMF) (talk) 19:11, 28 February 2014 (UTC)
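For reference, a minimal sketch of the difference discussed above, using pandas on toy data (the column names and counts are made up; this is not the actual report code):

<syntaxhighlight lang="python">
import pandas as pd

# Toy stand-in for NavigationTiming events: one row per sample,
# with a country code and a week label (illustrative columns only).
events = pd.DataFrame({
    "country": ["JP"] * 2500 + ["MO"] * 1200,
    "week":    ["w1"] * 1500 + ["w2"] * 1000    # JP: 1500 + 1000
             + ["w1"] * 900  + ["w2"] * 300,    # MO: 900 + 300
})
counts = events.groupby(["country", "week"]).size().unstack(fill_value=0)

# Grand-total check: at least 1000 samples summed over all weeks.
total_ok = counts.sum(axis=1) >= 1000        # JP: True, MO: True
# Per-week check: at least 1000 samples in every single week.
weekly_ok = (counts >= 1000).all(axis=1)     # JP: True, MO: False

print(counts[total_ok].index.tolist())       # ['JP', 'MO']
print(counts[weekly_ok].index.tolist())      # ['JP']
</syntaxhighlight>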

Small buckets

For some countries, events per day are rather low (MO is <100 per day, BD is ~150 per day). 150 events per day comes down to roughly 10 minutes between events on average. Is that enough to reliably estimate small changes in response time? --Caistleitner (talk) 19:01, 28 February 2014 (UTC)

Corrected now: got rid of countries with too few data points and used only the ones that had at least 1000 per week. NRuiz (WMF) (talk) 18:29, 1 March 2014 (UTC)
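As a rough illustration of the concern about small daily samples raised above (made-up distribution and parameters, not the actual data), a bootstrap of the median on 150 simulated load times shows how much a daily 50th percentile can wobble from sampling noise alone:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(42)

# 150 simulated load times (ms) for one day, right-skewed with a
# true median of about 1350 ms (illustrative parameters only).
day = rng.lognormal(mean=np.log(1350), sigma=0.6, size=150)

# Resample the day 5000 times and recompute the median each time.
medians = np.array([
    np.median(rng.choice(day, size=day.size, replace=True))
    for _ in range(5000)
])
lo, hi = np.percentile(medians, [2.5, 97.5])
print(f"daily median: {np.median(day):.0f} ms, "
      f"95% bootstrap interval: [{lo:.0f}, {hi:.0f}] ms")
</syntaxhighlight>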

Outliers

The code seems to throw away times above 25 seconds, thereby getting rid of almost 3% of the data, even though the region around 25 seconds is still densely populated. Is it OK to throw away that much data?

In my experience, yes: data over 20 seconds (on desktop) in navigationTiming is normally bogus. We are also removing negative values. NRuiz (WMF) (talk) 19:11, 28 February 2014 (UTC)
It sometimes takes >20 seconds for me to get a page. Once they're loaded, they are perfectly fine pages and certainly not bogus for me.
And, yes, the code is removing readings below 0 ms. But this lower (0 ms) and upper (25 s) bound are nowhere near symmetric in terms of the number of shaved-off elements; indeed, they are two orders of magnitude apart. Hence (unless there is some part of EventLogging's source code that explains this heavy tilt, or even calls for it), it puts a bias on subsequent percentile computations. For example, it seems the code accepts readings in [0 ms, 10 ms) (a bucket of 9 readings, likely outliers, as 10 ms is faster than I could imagine a page getting fully loaded), while we discard [25000 ms, 25010 ms) (a bucket of 40 entries; I sometimes get page load times in that range on my machines).
If one wants to stick with this “cutting off twice” approach (see below) but have it percentile-neutral, it seems that cutting off at 25 s on the upper end requires cutting off at ~200 ms on the lower end. :-( --Caistleitner (talk) 09:56, 3 March 2014 (UTC)
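To make the “percentile-neutral” idea concrete, a small sketch on a toy distribution (the ~200 ms estimate above refers to the real data; the number produced here is only illustrative): it finds the lower cutoff that discards the same fraction of readings as the 25 s upper cutoff.

<syntaxhighlight lang="python">
import numpy as np

def neutral_lower_cutoff(times_ms, upper_ms=25_000):
    """Lower cutoff that trims the same fraction of readings as the
    upper cutoff, so capping both ends leaves the median roughly
    unchanged."""
    times_ms = np.asarray(times_ms)
    upper_fraction = np.mean(times_ms > upper_ms)
    return np.quantile(times_ms, upper_fraction)

# Toy right-skewed sample standing in for one country's load times (ms).
rng = np.random.default_rng(0)
sample = rng.lognormal(mean=np.log(1500), sigma=1.5, size=200_000)

print(f"fraction above 25 s: {np.mean(sample > 25_000):.3f}")
print(f"neutral lower cutoff: {neutral_lower_cutoff(sample):.0f} ms")
</syntaxhighlight>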

On the same note, the 50th percentile is a measure that is built to cope with outliers. It should be applied to the uncapped source. However, we first cut off at 25 seconds and afterwards compute the percentile of the result. Thereby, we're effectively shaving outliers off twice. --Caistleitner (talk) 19:01, 28 February 2014 (UTC)

True. In my experience, data in that range (on desktop) is normally bogus. NRuiz (WMF) (talk) 19:11, 28 February 2014 (UTC)
My remark was not so much about where the upper boundary is, but more about this “cutting off twice” introducing bias due to the choice of boundaries. --Caistleitner (talk) 09:56, 3 March 2014 (UTC)
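A toy illustration of the effect being discussed (made-up distribution, not the report's numbers): capping at 25 s before taking the 50th percentile trims the upper tail far more than the lower one, and therefore pulls the median down.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)
# Toy load times (ms) with a heavy right tail, so roughly 3% of the
# readings exceed 25 s (illustrative parameters only).
times = rng.lognormal(mean=np.log(1500), sigma=1.5, size=200_000)

median_uncapped = np.percentile(times, 50)
capped = times[(times >= 0) & (times <= 25_000)]   # the [0 ms, 25 s] filter
median_capped = np.percentile(capped, 50)

# Removing ~3% at the top but almost nothing at the bottom shifts the
# 50th percentile downwards, even though the median by itself already
# ignores extreme values.
print(f"uncapped median: {median_uncapped:.0f} ms")
print(f"capped median:   {median_capped:.0f} ms")
</syntaxhighlight>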

Week three has only data for 6 days

While there are a few outliers, we basically have no data for 2014-02-15, which is probably due to the focus on schema 6703470 and EventLogging's switch over to schema 7494934 around that time. But it also means that week 3 covers only 6 instead of 7 days. As traffic is not evenly distributed across the days of the week, does that influence the numbers? --Caistleitner (talk) 19:01, 28 February 2014 (UTC)

I considered this and preferred taking less data over a change to the reporting system. NRuiz (WMF) (talk) 19:11, 28 February 2014 (UTC)
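To illustrate the day-of-week concern (toy numbers; the weekend/weekday difference here is an assumption, not something taken from the data): dropping one day from a week shifts the weekly median whenever that day behaves differently from the rest.

<syntaxhighlight lang="python">
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

# One toy week of load times (ms); weekends are simulated as somewhat
# slower than weekdays purely for illustration.
frames = []
for day in ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]:
    base = 1600 if day in ("Sat", "Sun") else 1400
    frames.append(pd.DataFrame({
        "day": day,
        "time_ms": rng.lognormal(np.log(base), 0.5, size=1000),
    }))
week = pd.concat(frames, ignore_index=True)

median_7_days = week["time_ms"].median()
median_without_sat = week.loc[week["day"] != "Sat", "time_ms"].median()
print(f"7-day median: {median_7_days:.0f} ms")
print(f"median without Saturday: {median_without_sat:.0f} ms")
</syntaxhighlight>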

Data for 2014-02-14 looks too low

Numbers for 2014-02-14 are suspiciously low (compared to 2014-02-13) across the board. Of the relevant countries, JP exhibits the problem best by dropping from ~1350 (2014-02-13 50th percentile) down to ~1048 (2014-02-14 50th percentile). It is hard to imagine that this drop was caused by the eqiad -> ulsfo switch, as that switch happened 9 days earlier, on 2014-02-05, for JP.

New MediaWiki versions got deployed on 2014-02-13 very late in the UTC night; could that have led to an improvement?

Also, there was a ULSFO outage of at least two hours on 2014-02-14. This certainly affects the numbers for 2014-02-14, but do we understand in which ways?

But overall, it seems other causes influenced 2014-02-14 readings. Can we trust/use the 2014-02-14 data?

But if we ignore it, we'd only have 5 days for week 3. :-(

No clue. --Caistleitner (talk) 19:01, 28 February 2014 (UTC)

Agreed. I also mentioned that fact. NRuiz (WMF) (talk) 19:11, 28 February 2014 (UTC)

Switchover in week 3

While the page says “week2, when the deployment was taking place”, HK and MO were switched during week 3. --Caistleitner (talk) 19:01, 28 February 2014 (UTC)

Graphs showing distinct drop around switchover

Of all the 16 countries from the data section, it seems only AU, SG, TW, and VN show a distinct drop around the time when the switchover happened.

For all other countries, either the values jump around a lot anyway, or there is a general downward trend that is not tied to a specific date.

Are we sure that the drop for the other countries can be attributed to the ULSFO switch? --Caistleitner (talk) 19:01, 28 February 2014 (UTC)

Look at the “regional” plots; they definitely make the point that ULSFO had an impact at the regional level. That, I think, we can state with confidence. NRuiz (WMF) (talk) 19:11, 28 February 2014 (UTC)
If we need to look at “regional” plots to get the drops at the right dates, plotting individual “per country” data on the map sounds like a misleading level of detail. --Caistleitner (talk) 08:46, 3 March 2014 (UTC)
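For illustration, a sketch of the per-country vs. regional view being discussed, on toy data (the country list, sample sizes, switchover date, and effect size are all assumptions, not the report's values):

<syntaxhighlight lang="python">
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Toy per-event samples for a few countries over three weeks; assume a
# ~200 ms improvement after a nominal switchover date (made-up values).
rows = []
for country, per_day in [("AU", 1000), ("SG", 1000), ("MO", 100)]:
    for date in pd.date_range("2014-02-01", "2014-02-21"):
        base = 1600 if date < pd.Timestamp("2014-02-10") else 1400
        rows.append(pd.DataFrame({
            "country": country, "date": date,
            "time_ms": rng.lognormal(np.log(base), 0.6, size=per_day),
        }))
events = pd.concat(rows, ignore_index=True)

# Per-country daily medians: the MO curve jumps around a lot because
# each point rests on only ~100 events.
per_country = (events.groupby(["country", "date"])["time_ms"]
                     .median().unstack("date"))

# Regional daily median: pooling all countries of the region before
# taking the median smooths the curve, so the step around the
# switchover date stands out much more clearly.
per_region = events.groupby("date")["time_ms"].median()
</syntaxhighlight>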

Reason for distribution

Very interesting study. A basic question: is there a reason that most of the regions that saw an improvement in page load time are in Southeast Asia and Oceania? --Haitham (talk) 19:20, 4 March 2014 (UTC)

ULSFO is the new data center in San Francisco, and it currently serves traffic mostly for countries in Southeast Asia and Oceania. Hence, turning on ULSFO should ideally not affect non-ULSFO countries too much. For example, traffic for Germany would get routed through the closer European data center, and not through ULSFO. --Caistleitner (talk) 13:17, 6 March 2014 (UTC)
I see. That explains it. Thank you! --Haitham (talk) 16:51, 6 March 2014 (UTC)

Thanks

Thanks, Nuria, it's nice to have per-country data. --Nemo 15:05, 18 April 2014 (UTC)