Michael Jackson effect

The Michael Jackson effect (also Michael Jackson problem) is a technical term used in the Wikimedia movement to refer to a cache stampede. A cache stampede is the system failure that results when there is high demand for a computed object that is presently uncomputed.

Event
The term was coined [when?][by whom?] after the death of Michael Jackson on 25 June 2009, which resulted in an unprecedented amount of combined page view and edit traffic.

The article received a record-breaking 5.9 million visits in a single day (26 June), of which one million occurred during a single hour.

The article received 1.2 million visits on the day of his death, 25 June 2009, which caused several server overloads that made Wikipedia intermittently unavailable to the public.

Background
When an edit is saved in MediaWiki, it is allocated a revision ID, and the page record is updated to point to this as the "current" revision of that page. During the edit save, the submitted wikitext is parsed. For large and complex biographies, this was a costly and time-intensive operation, involving a large amount of CPU work in PHP for processing text markup, templates, and citation references. At the time it was not uncommon for such large articles to take over 30 seconds to process. After the save operation and wikitext conversion are completed, the page's entry in the ParserCache is overwritten with the new article body and associated metadata describing the revision ID for which it was computed. At this time, the article URL is also purged from the edge cache (Wikimedia CDN).
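The save path can be modelled as a minimal sketch. All names here (parser_cache, cdn_cache, parse_wikitext, save_edit) are hypothetical stand-ins chosen for illustration; they do not correspond to MediaWiki's actual PHP API:

```python
# Hypothetical model of the save path: parse, overwrite ParserCache, purge CDN.
parser_cache = {}                                # page title -> (revision ID, parsed HTML)
cdn_cache = {"Michael Jackson": "<cached html>"} # edge cache keyed by article URL/title

def parse_wikitext(wikitext):
    # Stand-in for the expensive PHP parse (markup, templates, citations).
    return "<html>" + wikitext + "</html>"

def save_edit(title, new_revision_id, wikitext):
    html = parse_wikitext(wikitext)                # costly, possibly 30+ seconds
    parser_cache[title] = (new_revision_id, html)  # overwrite the ParserCache entry
    cdn_cache.pop(title, None)                     # purge the article from the CDN
    return new_revision_id
```

Note that every save both overwrites the ParserCache entry and purges the CDN, which is what drove the subsequent page views back to the web servers.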

When a page is viewed by URL and there is no entry in the CDN, MediaWiki looks in the ParserCache for the then-current revision ID of the requested article. If the ParserCache contains no entry for the article, or the entry is not for the expected revision ID, this is considered a "cache miss": the wiki markup is fetched from the database and processed on demand, similar to what would happen during a save operation.
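The view path, with its revision-ID check and on-demand fallback, can be sketched in the same hypothetical model (parser_cache, view_page, and the helper functions are illustrative names, not MediaWiki's API):

```python
# Hypothetical model of the view path: hit only if the cached entry
# matches the current revision ID; otherwise parse on demand.
parser_cache = {"Michael Jackson": (1, "<old html>")}

def parse_wikitext(wikitext):
    # Stand-in for the expensive parse.
    return "<html>" + wikitext + "</html>"

def fetch_wikitext_from_db(title, revision_id):
    # Stand-in for the database lookup of the raw wiki markup.
    return f"rev {revision_id} of {title}"

def view_page(title, current_revision_id):
    cached = parser_cache.get(title)
    if cached is not None and cached[0] == current_revision_id:
        return cached[1]                      # cache hit: entry is for the current revision
    # Cache miss: parse on demand, just like the save path would.
    html = parse_wikitext(fetch_wikitext_from_db(title, current_revision_id))
    parser_cache[title] = (current_revision_id, html)
    return html
```

With no coordination between servers, every concurrent view that misses triggers its own full parse, which is the stampede described below.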

Incident
The rapid editing of the article resulted in its repeated purging from the CDN, which sent a lot of traffic to the MediaWiki servers asking for the "current" version of the article. The definition of "current" of course kept changing, and there were also race conditions in which a server processing a page view could find the ParserCache entry either already outdated (overwritten by a newer revision) or not yet written (the entry still described an older revision whose replacement had not finished processing). Together these resulted in the MediaWiki web servers essentially all being busy doing the exact same thing: parsing the wikitext content of the Michael Jackson article, often even the exact same revision.

This overload exceeded the combined CPU capacity of the web servers and resulted in reduced availability of Wikipedia overall.

Solution
Shortly after this incident, Tim Starling developed the PoolCounter extension (together with its associated MediaWiki core interface and a server daemon written in C). It is designed to protect Wikimedia Foundation servers against massive spikes in views like this, and to avoid massive wastage of CPU capacity on parallel parsing and recomputation of the same cache value after it is invalidated.

PoolCounter provides a mutex-like functionality used by MediaWiki to request a lock before it attempts to parse an article.

If the server is the first and only one in line to parse this article, PoolCounter responds immediately with a success message ("LOCKED"); the server then parses the article and releases the lock once the result is saved to the ParserCache.

If another server is already busy doing this, PoolCounter holds the server on the line for a while to allow the first one to complete its work. If the first server completes and releases the lock within the timeout threshold, PoolCounter responds after this delay with "DONE", indicating that the work is already done and the result can be found in the ParserCache.

If there are too many servers waiting in line ("QUEUE_FULL"), or if the lock-holding server takes too long to complete its work ("TIMEOUT"), then MediaWiki will not permit itself to parse the article and will instead return a known-stale version of the article. If there is nothing in the ParserCache at all, it displays an error message asking the user to try again later.
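The four outcomes above can be condensed into a small decision function. This is a sketch of the decision logic only, with all parameter names assumed; it does not model the real PoolCounter daemon, its wire protocol, or its timing behaviour:

```python
# Hypothetical model of PoolCounter's four responses to a lock request.
def acquire(lock_held, queue_len, max_queue, work_finished_in_time):
    if not lock_held:
        return "LOCKED"       # caller parses the article, then releases the lock
    if queue_len >= max_queue:
        return "QUEUE_FULL"   # too many waiters: serve a stale ParserCache entry
    if work_finished_in_time:
        return "DONE"         # result is now in the ParserCache; reuse it
    return "TIMEOUT"          # fall back to stale content, or an error if none
```

The key property is that at most one server per article (per revision) ever does the expensive parse; everyone else either waits briefly for the result or is served stale content.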