Reading/Web/Projects/Performance/Removal of secondary content in beta

Prediction
Removing navboxes for a quality page such as Barack Obama should:
 * drop the number of bytes we ship to users
 * drop the fully loaded time
 * increase the time to first byte (TTFB) from a cold cache, because the MobileFormatter needs time to transform the parser output
 * possibly drop the first render time due to the reduction in HTML, and in any case not make it worse.

Method
MobileFrontend has a library called MobileFormatter, which extends the HtmlFormatter in core. We used it to strip every element in the HTML that has the class navbox. On various good quality articles these elements make up a significant share of the HTML, e.g. they account for 15% of all HTML in the Barack Obama article.
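The stripping step can be illustrated with a minimal Python sketch. This is not the MobileFormatter code (which is PHP); it is a stand-in built on the standard library's html.parser, and the names NavboxStripper and strip_navboxes are hypothetical. It assumes the input HTML is well formed (every non-void element has a closing tag) and it drops comments and doctype declarations for brevity.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect depth tracking.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class NavboxStripper(HTMLParser):
    """Re-emits HTML while skipping every element carrying a given class."""

    def __init__(self, css_class="navbox"):
        # Keep character references untouched so byte counts stay comparable.
        super().__init__(convert_charrefs=False)
        self.css_class = css_class
        self.out = []
        self.skip_depth = 0  # > 0 while inside a removed element

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.skip_depth:
            if tag not in VOID_TAGS:
                self.skip_depth += 1
            return
        if self.css_class in classes:
            if tag not in VOID_TAGS:
                self.skip_depth = 1
            return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> never open a subtree.
        classes = (dict(attrs).get("class") or "").split()
        if not self.skip_depth and self.css_class not in classes:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_depth:
            if tag not in VOID_TAGS:
                self.skip_depth -= 1
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skip_depth:
            self.out.append(f"&#{name};")


def strip_navboxes(html, css_class="navbox"):
    """Return html with all elements of the given class removed."""
    parser = NavboxStripper(css_class)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


def bytes_saved_share(html, css_class="navbox"):
    """Fraction of the UTF-8 byte size removed by stripping."""
    before = len(html.encode("utf-8"))
    after = len(strip_navboxes(html, css_class).encode("utf-8"))
    return (before - after) / before
```

Calling bytes_saved_share on a saved copy of an article's HTML gives a rough figure for how much of the page the navboxes account for.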

The configuration was updated and we observed the first render and fully loaded time before and after the change.

A script was written to calculate the median and average of the values recorded during a specified period before the change and during a period of the same length after the change, for a specified article (Barack Obama) on an emulated 2G connection. For example, if the configuration change was made on 15th January 2016 at midnight, we would compare the values for the 7 day period before the change with those for the 7 day period after it, i.e. all test results between 8th-15th January 2016 compared with all test results between 15th-22nd January 2016.
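The comparison described above can be sketched roughly as follows. This is not the original script; the function name compare_before_after and the input shape (a list of timestamp and value pairs, for instance fully loaded time in milliseconds) are assumptions made for illustration.

```python
from datetime import datetime, timedelta
from statistics import mean, median


def compare_before_after(samples, change_time, window=timedelta(days=7)):
    """Compare a metric in equal-length windows either side of a change.

    samples: iterable of (datetime, value) pairs for one article and metric.
    change_time: the moment the configuration change went into effect.
    """
    before = [v for t, v in samples if change_time - window <= t < change_time]
    after = [v for t, v in samples if change_time <= t < change_time + window]
    if not before or not after:
        raise ValueError("need at least one sample on each side of the change")
    return {
        "median_before": median(before),
        "median_after": median(after),
        "median_delta": median(after) - median(before),
        "mean_before": mean(before),
        "mean_after": mean(after),
        "mean_delta": mean(after) - mean(before),
    }
```

A negative delta means the metric dropped after the change. Samples outside the two windows are ignored, so the same list of results can be reused for each deployment date.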

The config changes were made sequentially, and results were measured after each one, in the following order:
 * On the beta cluster in the mobile beta channel. The change went into effect on 3rd February 2016 at 21:00
 * On the beta cluster in the stable channel. The change went into effect on 4th February 2016 at 22:16
 * On the production cluster (in the beta channel). The change went into effect on 9th February 2016 at 22:00

Analysis
For HTML bytes the results were rather predictable. It's worth noting that the production version of the Barack Obama article is constantly changing (it's a wiki!), whereas the beta cluster article is mostly static. The results for bytes saved were relatively consistent.

TTFB, first render and fully loaded time all took a hit on the beta cluster: every one of them got worse after the change.

It was only in the production beta channel that results seemed to match the prediction: a minor reduction in fully loaded time was observed, first render seemed to improve, and TTFB increased for both average delta and median delta.

That said, on 12th February it came to my attention that another experiment, merged on Wednesday 10th February, was in progress and more than likely affected the results.

It's possible that the results on production stable may have been accurate. By reducing TTFB we may have also reduced render time, since the changes in milliseconds are so similar. The fact that fully loaded time increased, however, is hard to explain.

Conclusions

 * The webpage test scripts running on the beta cluster are good for predicting changes in bytes we ship to users.
 * These tests were not run in isolated conditions. The beta cluster constantly gets new code enabled on it, rather than once a week like the stable cluster (e.g. we were updating the language overlay in parallel). It is used for manual and automated testing of all of WMF's deployed extensions, and probably more, so there is never a time when the cluster isn't under some load. In contrast, on production the majority of requests should go through the cache. Given that fluctuations in stable and beta modes are generally similar, I would suggest running tests in both modes on the beta cluster at the same time rather than sequentially.
 * When measuring bytes, we should ensure the test article remains protected and unedited (we may want to set up a test article with the same content elsewhere to minimise changes to the article).
 * The longer the period of time we measure before and after the change, the more accurate our data will be.
 * The beta cluster is not a reliable testing environment for predicting exact values for improvements in production. It could, however, in theory be used to predict the extent of a change, e.g. big change or minor change.
 * We may want to use a dedicated test instance to assess changes in future.
 * The production cluster can only be a reliable testing environment if we coordinate changes better with other teams, particularly the performance team.
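The suggestion to run stable and beta modes at the same time could be analysed along the following lines. This is only a sketch: it assumes each mode's results are available as a mapping from run timestamp to a metric in milliseconds, and the function name paired_median_delta is hypothetical. Pairing runs taken at the same moment means load on the cluster affects both sides of each pair.

```python
from statistics import median


def paired_median_delta(control, treatment):
    """Median of per-run differences (treatment minus control).

    control, treatment: dicts mapping a run timestamp to a metric in ms.
    Only timestamps present in both dicts are compared.
    """
    shared = sorted(control.keys() & treatment.keys())
    if not shared:
        raise ValueError("no runs were recorded at the same time in both modes")
    return median(treatment[t] - control[t] for t in shared)
```

Here the mode without the change acts as the control, so a negative result suggests the change reduced the metric.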

Next steps

 * We should apply this test to the production stable environment to assess whether there is any correlation between beta cluster / production beta channel and production stable.
 * We should consider running tests on a dedicated instance outside the beta cluster and see if they give us better results.
 * We need a dedicated performance cluster.