6th-8th [FIRING:1] HighTimeToPreview (critical rweb)
Loss of data in analytics pipeline. Fixed by https://gerrit.wikimedia.org/r/c/operations/puppet/+/887762
Changes in the page previews extension led to a notable error spike, breaking the feature for older browsers. https://phabricator.wikimedia.org/T325113
Due to a bug (phab:T305442#7878587), the sampling rates in the web scroll schema changed unexpectedly:
- Logged-in users were sampled at 1% on French Wikipedia and 10% on all other projects, rather than 100%.
- Anonymous users were sampled at 0%, rather than 1% on French Wikipedia and 10% on all other projects.
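A regression like this can be caught by comparing the configured rates against the intended ones. The following is only a minimal sketch, assuming the rates read as described in the entry above; the dictionaries and the function names `sampling_rate` and `find_mismatches` are hypothetical and do not come from the actual instrumentation code.

```python
# Hypothetical illustration; rates are taken from the log entry above,
# all names are invented for this sketch.
INTENDED = {
    "logged_in": {"frwiki": 1.0, "default": 1.0},
    "anonymous": {"frwiki": 0.01, "default": 0.10},
}
OBSERVED = {
    "logged_in": {"frwiki": 0.01, "default": 0.10},
    "anonymous": {"frwiki": 0.0, "default": 0.0},
}


def sampling_rate(config, user_type, wiki):
    """Look up the rate for a wiki, falling back to the default rate."""
    rates = config[user_type]
    return rates.get(wiki, rates["default"])


def find_mismatches(intended, observed, wikis):
    """Return (user_type, wiki, intended_rate, observed_rate) for each difference."""
    mismatches = []
    for user_type in intended:
        for wiki in wikis:
            want = sampling_rate(intended, user_type, wiki)
            got = sampling_rate(observed, user_type, wiki)
            if want != got:
                mismatches.append((user_type, wiki, want, got))
    return mismatches


for row in find_mismatches(INTENDED, OBSERVED, ["frwiki", "enwiki"]):
    print(row)
```

Running this against the rates described above reports a mismatch for every combination of user type and wiki checked.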
[Vector 2022 skin] Several group 1 wikis were accidentally opted into the new Vector skin, leading to approximately 1000 users opting out within 24 hours. phab:T299927
The instrumentation for VirtualPageViews (which tracks page previews) was pointed to an old schema, resulting in a loss of data. E-mail alerts were set up to protect against this happening again. T288655
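The log does not say how the e-mail alerts decide when to fire. One plausible check, shown here purely as an assumption, is to compare the latest event count for a schema against its recent average and alert when it drops sharply. The function name `volume_dropped` and the `min_fraction` threshold are hypothetical.

```python
from statistics import mean


def volume_dropped(history, today, min_fraction=0.5):
    """Return True when today's count is below min_fraction of the trailing mean.

    history: recent daily event counts for one schema (hypothetical input).
    today: the latest daily event count.
    """
    if not history:
        # No baseline yet, so there is nothing to compare against.
        return False
    return today < mean(history) * min_fraction


# A schema that normally logs about 1000 events a day and suddenly logs none.
print(volume_dropped([1000, 1100, 900], 0))
```

A relative threshold is used rather than a fixed count so the same check can apply to schemas with very different traffic levels.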
We switched from localStorage to session storage for tracking open sections in the mobile site. The code for cleaning up localStorage entries had a bug, so when the deploy was rolled back it left the mobile site in an error state, and an error was thrown for every page view where the new code had been executed. This is recorded in phab:T272638. We backported a fix in case we needed to roll back again and resumed the deployment. The errors disappeared after the deploy.
We saw over half a million errors in our error logging pipeline during this time (usually we see under 10,000 in a given day). Amazingly nothing collapsed.
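To put those figures in perspective, a simple ratio against the usual daily volume is enough to flag the event. This is a minimal sketch; the function name `is_error_spike` and the default `factor` of 3 are arbitrary assumptions, not the thresholds used by any real alerting.

```python
def is_error_spike(count, baseline, factor=3.0):
    """Return True when count exceeds the baseline by more than the given factor."""
    return count > baseline * factor


# Figures from the entry above: over 500,000 errors against a usual
# ceiling of 10,000 a day, i.e. at least 50 times the normal volume.
print(is_error_spike(500_000, 10_000))
print(500_000 / 10_000)
```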
An edit to the Portuguese Wikipedia MediaWiki:Common.js led to a huge spike in client-side errors (over 20,000 of around 35,000 errors were caused by it). Luckily it was caught and fixed within the hour.
Performance regression noted, but it was due to a banner campaign. https://phabricator.wikimedia.org/T243105
Site scripts and styles (e.g. MediaWiki:Common.css) were loaded on mobile and swiftly reverted (caught via Grafana). Luckily this never hit production. https://phabricator.wikimedia.org/T237050#5800024
Speedindex 3G took a big hit on December 2. Fundraising campaign?
Performance regression noted on site. However, we believe this is likely a problem with the tooling or the Chrome browser, not the site itself. https://phabricator.wikimedia.org/T232174
MobileWebMainMenuClickTracking broken in deploy. Disabled shortly after.
Spike in EventLogging errors during a deploy of the broken main menu schema
MobileWebMainMenuClickTracking was broken in train deploy (T220016)
Error spikes 75k! Seems to be related to T219841
Error spikes 5-15k.
Error spikes to 12k-13k. The SWAT fix for T217820 shortly after had little to no impact, so this is likely a new error.
A new deploy causes errors to spike to around 8K an hour (a little less than the spike on the 15th). Some of these appear to relate to skins.minerva.top (I4db0551a7661eb5c41d7b2a27e78afb885bb9ce5), which probably should have been shipped in wmf.19, NOT 1.33.0-wmf.18.
Around 2-4k errors as bugs related to caching ceased.
An error spike in MinervaClientErrors (12K an hour), up from the usual 3k. The problem seems to mostly affect US and Japanese users on en.wiki and jp.wiki. 1395 errors occurred on a single page in a one-hour period on Android Chrome Mobile, but I couldn't replicate any issues even with the page and browser version available. Finally, I managed to replicate the problem: caching. Explanation here: https://phabricator.wikimedia.org/T208915#4958060. Shipped in 1.33.0-wmf.17, probably should have shipped in 18.
The regression https://phabricator.wikimedia.org/T217820 went live. No notable incident was recorded, so its impact was likely low.
A patch ("Explicitly pass in parseHTML") was deployed in the hope of dealing with many of the issues that appeared on the 17th.
Grafana is missing some events (most notably ReadingDepth, VirtualPageViews), although they were recorded correctly in the EL databases.
This was due to a PDU issue that affected
MinervaClientError jumps again from 30 to 120k on 2019-01-17 at 18:00:00 UTC(?). Not seeing anything obvious in https://wikitech.wikimedia.org/wiki/Server_Admin_Log or the Deployment calendar. Stephen saw a problematic banner, so this might also be related. 35% of errors come from iOS and 74% of traffic is on enwiki. The Steward nominations banner does appear to be throwing an error, and when looking at referrer traffic for client-side errors, the pages impacted do seem to coincide with places the banner is running. This was tracked and fixed but bugs were still at normal levels with the majority coming from iOS. Some of these bugs may be related to the page issues deploy so we are looking more closely...
MinervaClient errors jump from 4k a minute to 40k a minute with the 1.33.0-wmf.12 deploy. Ouch. It turned out to be due to QuickSurveys being disabled on English Wikipedia but some surveys still being active in cached HTML. We promptly pushed it back to normal levels.
Bug fix deployed: T211986
[MobileFrontend refactor bug]
Bug T208605 was squashed. Minerva.WebClientError returns to baseline.
[MobileFrontend refactor bug]