Reading/Web/Notable incidents

This page is similar to the Release timeline, but it specifically records trends and bugs on the site. For the dates on which different branches went live, see MediaWiki_1.33/Roadmap.

August
5th-11th

The instrumentation for VirtualPageViews was pointed at an old schema, resulting in a loss of data. E-mail alerts were set up to protect against this happening again. T288655

19th

A Japanese Wikipedia gadget caused a JavaScript error spike alert and was fixed.

January
21st

We switched from localStorage to session storage for tracking open sections in the mobile site. The code for cleaning up localStorage entries had a bug, so when the deploy was rolled back it left the mobile site in an error state, and an error was thrown on every page view where the new code had been executed. This is recorded in T272638. We backported a fix in case we needed to roll back again and resumed the deployment. The errors disappeared after the deploy.

We saw over half a million errors in our error logging pipeline during this time (usually we see under 10,000 in a given day). Amazingly nothing collapsed.
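The exact bug is described in T272638 rather than here, so the following is only a minimal sketch of one defensive pattern for this kind of storage migration: reads never throw on entries written by a different version of the code, and cleanup is best effort. The key prefix, function names, and the `MemoryStorage` stand-in are all hypothetical and are not taken from the actual mobile site code.

```typescript
// Minimal subset of the Web Storage API, so the sketch also runs outside a browser.
interface StorageLike {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
  removeItem(key: string): void;
}

// Hypothetical key prefix; the real keys used by the mobile site may differ.
const KEY_PREFIX = "expandedSections-";

// In-memory stand-in for localStorage/sessionStorage (for illustration and testing).
class MemoryStorage implements StorageLike {
  private data: Record<string, string> = {};
  getItem(key: string): string | null {
    return Object.prototype.hasOwnProperty.call(this.data, key) ? this.data[key] : null;
  }
  setItem(key: string, value: string): void {
    this.data[key] = value;
  }
  removeItem(key: string): void {
    delete this.data[key];
  }
  keys(): string[] {
    return Object.keys(this.data);
  }
}

// Read the open sections for a page. Tries the current store first, then the
// legacy one. Malformed or unexpected entries are ignored instead of throwing,
// so a rollback cannot turn leftover data into an error on every page view.
function readOpenSections(page: string, current: StorageLike, legacy: StorageLike): string[] {
  const key = KEY_PREFIX + page;
  for (const store of [current, legacy]) {
    try {
      const raw = store.getItem(key);
      if (raw === null) {
        continue;
      }
      const parsed: unknown = JSON.parse(raw);
      if (Array.isArray(parsed) && parsed.every((s) => typeof s === "string")) {
        return parsed as string[];
      }
    } catch (e) {
      // Malformed JSON or unavailable storage: fall through to the next store.
    }
  }
  return [];
}

// Remove legacy entries that carry our prefix and return how many were removed.
// Failures are swallowed: cleanup is best effort and must not break the page.
function cleanUpLegacy(legacy: StorageLike, keys: string[]): number {
  let removed = 0;
  for (const key of keys) {
    if (key.indexOf(KEY_PREFIX) !== 0) {
      continue;
    }
    try {
      legacy.removeItem(key);
      removed++;
    } catch (e) {
      // Ignore and continue with the remaining keys.
    }
  }
  return removed;
}
```

In a browser, the two stores would presumably be `window.sessionStorage` and `window.localStorage`. The point of the design is that both old and new code can encounter each other's data during a rollback, so neither the read path nor the cleanup path should be able to throw.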

October
5th

An edit to the Portuguese Wikipedia MediaWiki:Common.js led to a huge spike in client-side errors (more than 20,000 of around 35,000 errors were caused by it). Luckily it was caught and fixed within the hour.

April
28th

OOUI triggered a small performance regression (T252844) due to a growth quick survey campaign.

February
11th

The Wiki Loves Folklore banner campaign triggered a noticeable spike in image payload.

January
9th

A performance regression was noted, but it was due to a banner campaign. https://phabricator.wikimedia.org/T243105

13th

Site scripts and styles (e.g. MediaWiki:Common.css) were loaded on mobile; the change was swiftly reverted (caught via Grafana). Luckily it never hit production. https://phabricator.wikimedia.org/T237050#5800024

December
2nd

Speedindex 3G took a big hit on December 2. Fundraising campaign?

September
5th

Performance regression noted on site. However, we believe this is likely a problem with the tooling or the Chrome browser, not the site itself. https://phabricator.wikimedia.org/T232174

August
1st

MobileWebMainMenuClickTracking broken in deploy. Disabled shortly after.

July
30th

Spike in EventLogging errors during a deploy of the broken main menu schema

25th

MobileWebMainMenuClickTracking was broken in train deploy (T220016)

April
8th (1.33.0-wmf.25)

Errors spike to 75k! Seems to be related to T219841.

March
29th (1.33.0-wmf.23)

Errors spike to 5-15k.

7th (1.33.0-wmf.20)

Errors spike to 12k-13k. A SWAT fix for T217820 shortly after has little to no impact, so this is likely a new error.

February
21st (1.33.0-wmf.18)

A new deploy causes errors to spike to around 8k an hour (a little less than the spike on the 15th). Some of these appear to relate to skins.minerva.top (I4db0551a7661eb5c41d7b2a27e78afb885bb9ce5), which probably should have been shipped in 1.33.0-wmf.19, not 1.33.0-wmf.18.

20th

Around 2-4k errors as bugs related to caching ceased.

15th (1.33.0-wmf.17)

An error spike in MinervaClientError (12k an hour), up from the usual 3k. The problem seems to mostly affect US and Japanese users on en.wiki and jp.wiki. 1395 errors occurred on a single page within one hour on Android Chrome Mobile, but I couldn't replicate any issues even with the page and browser version available. Finally, I managed to replicate the problem: caching. Explanation here: https://phabricator.wikimedia.org/T208915#4958060. Shipped in 1.33.0-wmf.17, probably should have shipped in 18.

7th (1.33.0-wmf.16)

The regression https://phabricator.wikimedia.org/T217820 went live. No notable incident was recorded, so its impact was likely low.

January
23rd

A patch ("Explicitly pass in parseHTML") was deployed in the hope of dealing with many of the issues that appeared on the 17th.

15th-17th

Grafana is missing some events (most notably ReadingDepth, VirtualPageViews), although they were recorded correctly in the EL databases.

This was due to a PDU issue that affected

17th

MinervaClientError jumps again from 30 to 120k on 2019-01-17 at 18:00:00 UTC(?). Not seeing anything obvious in https://wikitech.wikimedia.org/wiki/Server_Admin_Log or the Deployment calendar. Stephen saw a problematic banner, so this might also be related. 35% of errors come from iOS and 74% of traffic is on enwiki. The Steward nominations banner does appear to be throwing an error, and when looking at referrer traffic for client-side errors, the pages impacted do seem to coincide with places the banner is running. This was tracked and fixed but bugs were still at normal levels with the majority coming from iOS. Some of these bugs may be related to the page issues deploy so we are looking more closely...

9th

MinervaClient errors jump from 4k a minute to 40k a minute with the 1.33.0-wmf.12 deploy. Ouch. It turned out to be due to QuickSurveys being disabled on English Wikipedia while some surveys were still active in cached HTML. We promptly pushed it back to normal levels.

December
20th

Bug fix deployed: T211986

November
5th

[MobileFrontend refactor bug]

Bug T208605 was squashed. Minerva.WebClientError returns to baseline.

October
19th

[MobileFrontend refactor bug]

A suspected iOS Safari bug caused a huge spike in the number of errors in Minerva.WebClientError (~30k to ~120k). Error in MediaWiki_1.33/wmf.1 (T208605).