Analytics/Reports/ULSFOImpact

Introduction
The operations team deployed ULSFO in February 2014, and we have done some data analysis to help them quantify the impact of the rollout on latency.

The exact dates of the rollout by country/region codes can be found in operations/dns' git history: https://git.wikimedia.org/summary/?r=operations/dns.git

Methodology
This first stab at the data analysis only calculates percentiles 50 and 90 for 3 regions (Oceania, Asia (including SE Asia) and North America) over a 3-week period: the week of the 26th of January (week1), which precedes the ULSFO deployment; the following one (week2, when the deployment was taking place); and the one after (week3).
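The grouping described above can be sketched as follows. This is a minimal illustration, not the code used for the report: the function names, the `(region, day, latency_ms)` tuple layout and the nearest-rank percentile definition are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import date
import math

# From the text: week1 starts on the 26th of January 2014.
WEEK1_START = date(2014, 1, 26)


def week_index(day):
    """Return 1, 2 or 3 for days inside the three-week window, else None."""
    offset = (day - WEEK1_START).days
    if 0 <= offset < 21:
        return offset // 7 + 1
    return None


def percentile(values, p):
    """Nearest-rank percentile (an assumed definition; others exist)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def weekly_percentiles(samples):
    """samples: iterable of (region, day, latency_ms) tuples (hypothetical layout).

    Returns {(region, week): (p50, p90)}; samples outside the window are ignored.
    """
    groups = defaultdict(list)
    for region, day, latency_ms in samples:
        week = week_index(day)
        if week is not None:
            groups[(region, week)].append(latency_ms)
    return {key: (percentile(v, 50), percentile(v, 90)) for key, v in groups.items()}
```

Nearest-rank is used here only because it is easy to check by hand; an interpolating percentile would give slightly different values on small samples.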

Data comes from the EventLogging NavigationTiming schema (https://meta.wikimedia.org/wiki/Schema:NavigationTiming). We have removed mobile data and only used data for anonymous (i.e. not logged-in) users. We have also removed redirects from the dataset, and for plots we only consider requests on a cold cache (i.e. not cached). Our original dataset had about 6 million datapoints; after removing mobile data, warm-cache data, etc., we were left with about 1.7 million datapoints for two weeks of data for the whole world. The size of the daily dataset we have to calculate weekly percentiles differs greatly by region. For Oceania we have about 2,000 points per day (for all countries) to calculate daily percentiles. For North America the daily dataset is an order of magnitude larger, about 20,000 samples per day. For SE Asia the daily sample for all countries is about 11,000.

We have less data for the 15th of February because the Navigation Timing schema changed on that date. To rule out any changes to the EL sampling or implementation, we only use data that we know comes from the same schema.

The select from the Navigation Timing table is below. We have filtered these records to plot only times for requests in which a DNS lookup happened, and we have also removed outliers.

 select timestamp, event_requestStart, event_responseEnd,
        event_mediaWikiLoadComplete, event_domInteractive,
        event_originCountry, event_dnsLookup, event_connectStart,
        event_responseStart, event_connectEnd
 from NavigationTiming_6703470
 where timestamp < 20140216000000
   and timestamp > 20140126000000
   and event_mobileMode is NULL
   and event_redirectCount is NULL
   and event_isAnon=True
 order by timestamp asc
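The post-query filtering could look roughly like the sketch below. The text does not say how "a DNS lookup is happening" or "outlier" were defined, so both rules here are assumptions: a row is kept when `event_dnsLookup` is greater than zero and `event_mediaWikiLoadComplete` falls under a hypothetical cap (`max_ms`).

```python
def keep_row(row, max_ms=60_000):
    """Decide whether a result row should be plotted.

    Assumptions (not taken from the report):
    - a DNS lookup "happened" when event_dnsLookup > 0
    - an outlier is a load time that is non-positive or above max_ms
    """
    dns = row.get("event_dnsLookup")
    load = row.get("event_mediaWikiLoadComplete")
    if dns is None or dns <= 0:
        return False
    if load is None or not (0 < load <= max_ms):
        return False
    return True


def filter_rows(rows, max_ms=60_000):
    """Return only the rows that pass keep_row, preserving order."""
    return [row for row in rows if keep_row(row, max_ms)]
```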

All Browsers
mediaWikiLoadComplete
This checkpoint is our own measure and is thus available for all browsers. Below we plot it only for browsers that report network times, i.e. browsers that implement the Navigation Timing API. It measures the time from MediaWiki's startup.js to the tick following the load event. It has an impact on UX: the lower this value, the faster the page renders.

Navigation Timing API measures
This data point is provided by the Navigation Timing API and is thus not available on IE8 and below. See: http://caniuse.com/#feat=nav-timing Note that big improvements in network time do not necessarily translate into faster page load times overall.



We have plotted here responseStart - connectStart, which represents the time spent in the network until the first byte arrives, minus the time spent in DNS lookups (for a more visual explanation take a look at the Navigation Timing graph). If there was a TCP connection drop, the time will include the setup of the new connection.
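As a small illustration, the plotted value can be derived from the two columns of the select like this. The function name and the choice to discard missing or negative values are assumptions for the example.

```python
def network_time_ms(event):
    """Time to first byte excluding DNS: responseStart - connectStart.

    Returns None when either value is missing or the difference is negative
    (treated here as unusable data; this rule is an assumption).
    """
    start = event.get("event_connectStart")
    first_byte = event.get("event_responseStart")
    if start is None or first_byte is None:
        return None
    delta = first_byte - start
    return delta if delta >= 0 else None
```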

= Results =

Latency: Plots
'''There are substantial drops in latency in the Oceania and Asia regions. Differences are not as substantial for North America.''' There seems to be anomalous data for the 14th of February for the SE Asia region.

The time to first byte measure displays bigger gains. It is important to understand that improvements in network time do not translate directly into gains in overall page latency. For example, suppose we need 4 network round trips to compose a page, and round trips 2, 3 and 4 happen while the browser is parsing the main document (round trip 1), which is, let's say, huge. In that case we only see the improvement from the 1st request; the subsequent ones are done in parallel and are completely hidden under the fetching of the first one.
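A toy model makes the round-trip example concrete. It is a deliberate simplification with made-up numbers: the first round trip is always paid in full, and the remaining trips run in parallel with the processing of the main document, so only the longer of the two is visible.

```python
def page_time_ms(rtt_ms, main_doc_ms):
    """Toy model: first round trip, then main-document processing overlapped
    with the parallel sub-resource round trips (which take one rtt together)."""
    return rtt_ms + max(main_doc_ms, rtt_ms)


def sequential_network_ms(rtt_ms, trips=4):
    """Total network time if the same trips were counted one after another."""
    return trips * rtt_ms


# Hypothetical numbers: the round trip halves from 200 ms to 100 ms while
# the main document takes 1000 ms to process.
before = page_time_ms(200, 1000)
after = page_time_ms(100, 1000)
page_gain = before - after
network_gain = sequential_network_ms(200) - sequential_network_ms(100)
```

With these numbers the summed network time improves by 400 ms, but the page only gets 100 ms faster, because three of the four trips were already hidden behind the main document.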

Precise Differences on Overall Page Latency after Deployment per Country
'''The gains in Japan and Indonesia are remarkable: page load times dropped by up to 300 ms. We see smaller (but measurable) improvements of 40 ms in the US too.'''

We have calculated overall page latency for SE Asia, Oceania and North America countries for three different weeks: the week of the 26th of January (week1), which precedes the ULSFO deployment; the following one (week2, when the deployment was taking place); and the one after (week3). The overall page latency measure is the weekly 50th percentile of mediaWikiLoadComplete, calculated per country per week for countries with at least 1000 data points per week. A bigger positive difference between weeks means the page got that much faster. Since we are measuring with mediaWikiLoadComplete, the faster times do have an impact on UX, that is, users are seeing faster pages.

In order to quantify gains in page rendering time we have taken the difference between the 50th percentiles of week1 and week2, and the difference between the 50th percentiles of week1 and week3. The ULSFO deployment happened during week2, so gains are likely greater in later weeks (week3). The problem with calculating latency differences against later weeks is that too many variables might be skewing our data: data is spotty on the 15th of February and atypical on the 14th. It is hard to quantify absolute gains, but it looks like gains in Japan, Korea and Indonesia are of several hundred milliseconds. Variability of weekly percentiles seems to be around 100 ms or less.
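The per-country calculation can be sketched as below. It is an illustration under assumptions, not the gist code: the `(country, week, load_ms)` tuple layout and the function name are hypothetical, and `statistics.median` stands in for the 50th percentile.

```python
from collections import defaultdict
from statistics import median

# From the text: only countries with at least 1000 data points per week.
MIN_SAMPLES_PER_WEEK = 1000


def weekly_gains(samples, min_samples=MIN_SAMPLES_PER_WEEK):
    """samples: iterable of (country, week, load_ms), with week in {1, 2, 3}.

    Returns {country: (week1 - week2, week1 - week3)} in milliseconds.
    A positive value means the page got faster after week1.
    """
    by_key = defaultdict(list)
    for country, week, load_ms in samples:
        by_key[(country, week)].append(load_ms)

    gains = {}
    for country in {c for c, _ in by_key}:
        weeks = [by_key.get((country, w), []) for w in (1, 2, 3)]
        # Skip countries that lack enough samples in any of the three weeks.
        if all(len(values) >= min_samples for values in weeks):
            m1, m2, m3 = (median(values) for values in weeks)
            gains[country] = (m1 - m2, m1 - m3)
    return gains
```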

Caveats
Improvements in Canada are really too small for such a diverse dataset; we probably should not mention them. If we use data for all countries with at least 1000 samples in total, countries like Palestine or Luxembourg also report drops of 300 ms, so how can we establish that these drops are due only to ULSFO? If we use data for countries that have at least 1000 samples per week, the data looks much more consistent and we no longer see changes in the range of 300 ms, other than for China (CN).

If we remove ULSFO countries we should see (in a controlled experiment) no changes in weekly percentiles for overall latency in our country dataset. This is not the case, which is expected since this was not a controlled experiment: variability of results among weeks is quite big. Normal variability among weeks seems to be capped at around 100 ms.

Data for all non-ULSFO countries is below. We list countries for which we have at least 1000 data points per week over the 3-week period.

ISO codes per country: http://userpage.chemie.fu-berlin.de/diverse/doc/ISO_3166.html

Reading
http://www.igvita.com/2012/04/04/measuring-site-speed-with-navigation-timing/

connectStart: the time immediately before the user agent starts establishing the connection to the server to retrieve the document.

connectEnd: the time immediately after the user agent finishes establishing the connection to the server to retrieve the current document.

requestStart: the time immediately before the user agent starts requesting the current document from the server.

responseStart: the time immediately after the user agent receives the first byte of the response from the server.

Code
Workflow to process data:
 * Produce a CSV file from the MySQL select referenced above.
 * Process the CSV file and convert second timestamps to day timestamps: https://gist.github.com/nuria/9052770#file-process-sql-data-change-timestamps-to-day-precision
 * Calculate daily percentiles per region: https://gist.github.com/nuria/9052770#file-calculate-and-plot-daily-percentiles
 * Calculate weekly percentiles per country: https://gist.github.com/nuria/9052770#file-calculate-weekly-percentiles-per-country

See: https://gist.github.com/nuria/9052770
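For readers who do not want to open the gist, the timestamp step of the workflow can be approximated as below. This is a sketch and not the gist code; it assumes the `timestamp` column uses the 14-digit `YYYYMMDDhhmmss` form seen in the select, and the function name is hypothetical.

```python
from datetime import datetime


def to_day(timestamp):
    """Truncate a YYYYMMDDhhmmss timestamp (str or int) to YYYYMMDD.

    Parsing with strptime also rejects malformed values with a ValueError.
    """
    parsed = datetime.strptime(str(timestamp), "%Y%m%d%H%M%S")
    return parsed.strftime("%Y%m%d")
```

Slicing the first eight characters would give the same result for valid input; parsing is used here only so that bad rows fail loudly.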

Times of ulsfo rollout
Oceania
 36d4233c 2014-02-04 08:51:55 -0600  OC => ulsfo

OC maps to these countries: AS AU CK FJ FM GU KI MH MP NC NF NR NU NZ PF PG PN PW SB TK TO TV UM VU WF WS

East/Southeast Asia
 1fb1dd5d 2014-02-06 13:57:01 +0200  BD => ulsfo, # Bangladesh
 43d8c957 2014-02-12 17:05:46 +0200  BT => ulsfo, # Bhutan
 43d8c957 2014-02-12 17:05:46 +0200  HK => ulsfo, # Hong Kong
 1fb1dd5d 2014-02-06 13:57:01 +0200  ID => ulsfo, # Indonesia
 5e704168 2014-02-05 07:36:13 -0600  JP => ulsfo, # Japan
 465877aa 2014-02-05 21:05:44 -0600  KH => ulsfo, # Cambodia
 5e704168 2014-02-05 07:36:13 -0600  KP => ulsfo, # Korea, Democratic People's Republic of
 5e704168 2014-02-05 07:36:13 -0600  KR => ulsfo, # Korea, Republic of
 1657beef 2014-02-06 14:39:23 +0200  MM => ulsfo, # Myanmar
 1fb1dd5d 2014-02-06 13:57:01 +0200  MN => ulsfo, # Mongolia
 43d8c957 2014-02-12 17:05:46 +0200  MO => ulsfo, # Macao
 465877aa 2014-02-05 21:05:44 -0600  MY => ulsfo, # Malaysia
 465877aa 2014-02-05 21:05:44 -0600  PH => ulsfo, # Philippines
 465877aa 2014-02-05 21:05:44 -0600  SG => ulsfo, # Singapore
 1657beef 2014-02-06 14:39:23 +0200  TH => ulsfo, # Thailand
 465877aa 2014-02-05 21:05:44 -0600  TW => ulsfo, # Taiwan, Province of China
 cfacc95a 2014-02-06 13:58:32 +0200  VN => ulsfo, # Viet Nam

US
 ba8e43dc 2014-02-06 14:40:02 +0200  AK => ulsfo, # Alaska
 7890e1fd 2014-02-06 15:53:26 +0200  AZ => ulsfo, # Arizona
 ba8e43dc 2014-02-06 14:40:02 +0200  CA => ulsfo, # California
 7890e1fd 2014-02-06 15:53:26 +0200  CO => ulsfo, # Colorado
 ba8e43dc 2014-02-06 14:40:02 +0200  HI => ulsfo, # Hawaii
 7890e1fd 2014-02-06 15:53:26 +0200  ID => ulsfo, # Idaho
 7890e1fd 2014-02-06 15:53:26 +0200  MT => ulsfo, # Montana
 7890e1fd 2014-02-06 15:53:26 +0200  NM => ulsfo, # New Mexico
 7890e1fd 2014-02-06 15:53:26 +0200  NV => ulsfo, # Nevada
 ba8e43dc 2014-02-06 14:40:02 +0200  OR => ulsfo, # Oregon
 7890e1fd 2014-02-06 15:53:26 +0200  UT => ulsfo, # Utah
 ba8e43dc 2014-02-06 14:40:02 +0200  WA => ulsfo, # Washington
 7890e1fd 2014-02-06 15:53:26 +0200  WY => ulsfo, # Wyoming

Canada
 7890e1fd 2014-02-06 15:53:26 +0200  AB => ulsfo, # Alberta
 ba8e43dc 2014-02-06 14:40:02 +0200  BC => ulsfo, # British Columbia
 7890e1fd 2014-02-06 15:53:26 +0200  NT => ulsfo, # Northwest Territories
 ba8e43dc 2014-02-06 14:40:02 +0200  YT => ulsfo, # Yukon Territory