Reading/Web/Projects/Performance/Removal of secondary content in production

Hypothesis
Certain content doesn't necessarily need to be shipped to the user upfront and sometimes not at all. A good example is the navbox content (this content also is not optimised for mobile but that's a secondary concern and out of scope for this test). We can remove this HTML from the initial page load and lazy load it if and when needed. Before introducing the necessary APIs for lazy loading such content, we wanted to gauge how impactful the removal was.

Despite previous experiments showing that this made little impact on performance, it is unclear whether this reflects the global audience. Currently our 2G tests run from Dulles (Washington, East coast USA), which is much closer to our data centers than, for example, a country like Indonesia. It is thus not clear whether the WebPageTest data we are collecting is a good indication for our global traffic. To understand whether reducing HTML size can make any impact on performance we'd need to view global traffic, specifically the navigation timing reports we collect from real end users.

Prediction
Based on previous experiments, removing navboxes for a quality page such as Barack Obama should:
 * drop the number of bytes we ship to users
 * make little to no difference to the fully loaded time
 * increase the time to first byte (TTFB) from a clear cache due to the time needed for the MobileFormatter to transform the parser output
 * make no difference to first render

There is potential for:
 * impact on global total page load time
 * impact on bytes out in the cluster summary for text cache eqiad
 * impact on global page views traffic due to more engaged visitors

Method
A config change was made to strip navboxes and content not designed for display on mobile (the nomobile class) on Wednesday 16th March around 00:11 PST (week 11 of the year).
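To make the effect of such a change concrete, the following is a minimal Python sketch of stripping elements by class and measuring the bytes saved. It is not the MobileFormatter implementation (which is not shown here); the names `ClassStripper`, `strip_classes` and `bytes_saved` are hypothetical, and the sketch assumes well-formed, properly nested HTML. Comments and doctype declarations are dropped for brevity.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ClassStripper(HTMLParser):
    """Re-emit HTML while skipping elements that carry a removable class."""

    def __init__(self, removable):
        # Keep character references as written so the output is comparable.
        super().__init__(convert_charrefs=False)
        self.removable = set(removable)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a removed element

    def _is_removable(self, attrs):
        classes = (dict(attrs).get("class") or "").split()
        return bool(self.removable.intersection(classes))

    def handle_starttag(self, tag, attrs):
        if self.skip_depth or self._is_removable(attrs):
            if tag not in VOID_TAGS:
                self.skip_depth += 1
            return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if not self.skip_depth and not self._is_removable(attrs):
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_depth:
            if tag not in VOID_TAGS:
                self.skip_depth -= 1
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skip_depth:
            self.out.append(f"&#{name};")


def strip_classes(html, removable=("navbox", "nomobile")):
    """Return html without any element whose class list matches removable."""
    parser = ClassStripper(removable)
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)


def bytes_saved(html, removable=("navbox", "nomobile")):
    """UTF-8 byte difference between the original and stripped HTML."""
    stripped = strip_classes(html, removable)
    return len(html.encode("utf-8")) - len(stripped.encode("utf-8"))
```

Calling `bytes_saved(page_html)` on a saved copy of an article gives a rough per-page estimate of the saving, before compression and caching effects are taken into account.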

A period of waiting time was left to account for cached pages being updated to respect the new setting and allow data to be collected.

Given the 30 day cache on Wikipedia, it was possible that results would not be visible until at least a month had passed. With that in mind, these results are live and will update as more data becomes available.

Using the webpage test reporter tool and data in Graphite, we were able to quickly get an idea of the impact on fully loaded and first render time on the Barack Obama page by comparing the 13 days prior to the change with the 11 days after it.

We looked at the 95th percentile of global total page load time before and after the change for anonymous users using the command:

We didn't look at beta, given that other experiments were running there that would impact results.

To be more confident of the data we were seeing we also analyzed the raw data in the NavigationTiming tables as collected by EventLogging. Some raw data was exported from the EventLogging tables for the time period before and after the change using the SQL queries below (amounting to approximately 2GB worth of data). Using scripts, this data was analysed to get a sense of the value of fully loaded time before and after the change. Any instances where the property being measured had a value of NULL, 0 or 1 were ignored. All data analysed was for anonymous users, for pages in the main namespace on the mobile stable site.
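The filtering and comparison described above can be sketched roughly as follows. This is an illustration, not the scripts that were actually used: the field name `loadEventEnd`, the representation of rows as dictionaries, and the nearest-rank percentile method are all assumptions.

```python
from math import ceil

# Values treated as invalid measurements, as described above (NULL, 0 or 1).
INVALID = {None, 0, 1}


def valid_values(rows, field):
    """Return the measurements for field, ignoring NULL, 0 and 1."""
    return [row[field] for row in rows if row.get(field) not in INVALID]


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def compare(before_rows, after_rows, field="loadEventEnd", pct=95):
    """Return (before, after, difference) for the chosen percentile.

    field is a hypothetical column name standing in for whichever
    property is being measured.
    """
    before = percentile(valid_values(before_rows, field), pct)
    after = percentile(valid_values(after_rows, field), pct)
    return before, after, after - before
```

A negative difference means the chosen percentile dropped after the change. With few rows, as on smaller wikis, a single slow page view can move the 95th percentile substantially.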

The impact on the bytes sent to users was monitored, but the graphs combine desktop and mobile data, and desktop traffic accounts for 50% of our page views, so we expected it to be difficult to get a sense of any impact there.

Analysis
As expected, after 26 days of analysis on the Graphite data, a positive impact could be seen on fully loaded time on both the Barack Obama article and global traffic, but it was not substantial. That said, the upper value of fully loaded time dropped considerably, which suggests that there is traffic on connections far slower than our simulated 2G connection that is hopefully benefiting from this change. When the raw data was consulted, similar, but not identical, patterns were seen. Although performance seemed to improve globally across all wikis by a minimal amount, the greatest impact was seen in enwiki, which also had the highest 95th percentile of all data. It's possible that the majority of our 2G traffic visits this domain and this is where we are likely to see performance gains.

Close analysis suggested that performance on Hebrew Wikipedia and German Wikipedia worsened after the change while performance on English Wikipedia improved. It's worth remembering that our wikis are continuously being edited and these spikes could be caused by any number of things. German Wikipedia does not appear to use the `.navbox` class (although it has a similar `.NavFrame` class), so the improvements here would not necessarily have benefited it. Hebrew Wikipedia does seem to use the `.navbox` class, but at a glance not nearly as much as the Japanese and English Wikipedias. It's impossible to say whether the changes slowed down these wikis. It's possible that users on slower connections are also gaining access to these sites and driving the numbers up, so the worsening performance is not necessarily a bad thing.

It would have been useful to have more data for smaller wikis. For example, when looking at Hebrew Wikipedia we only had 71 NavigationTiming entries to consult before and after the change. It's hard to draw conclusions from such small data sets. Even German Wikipedia had only 25% of the entries that English Wikipedia had.

The impact on bytes was clear to see on the Barack Obama article but we were unable to get any sense of impact on the global text cache eqiad.

No unusual spikes in page view traffic were witnessed, which would be expected given the low impact on fully loaded time at the 95th percentile.

Conclusions
It seems highly likely that the removal of navboxes had an impact on wikis where they are used frequently. Notable improvements were seen on English and Japanese Wikipedia. That said, it's unclear whether these changes have a negative impact on wikis that do not use them, for example German Wikipedia.

Measuring raw global data seems to be an accurate way of validating our performance changes. That said, performance can be impacted by many things, such as improved infrastructure on cellular networks or new traffic that previously wasn't there.

Using Graphite data can give an indication quickly of possible impacts, but should not be relied on given the large differences in values computed.

Next steps

 * We should increase sampling rates for smaller wikis. Right now our performance metrics are geared towards measuring performance on English Wikipedia.
 * Consider handling .NavFrame the same way as .navbox on German Wikipedia
 * We need better ways to gauge impact of bytes savings for our users. Bytes saved equates to money saved in many countries.
 * Investigate the perceived performance degradation on dewiki and hewiki