Michael Jackson effect

The Michael Jackson effect (also Michael Jackson problem) is a technical term used in the Wikimedia movement to refer to a cache stampede. A cache stampede is the system failure that results when there is high demand for a computed object that is presently uncomputed.

Event
The term was coined [when?][by whom?] after the death of Michael Jackson on 25 June 2009, which resulted in an unprecedented amount of combined page-view and edit traffic.

The article received 1.2 million visits on the day of Jackson's death (25 June), which caused several server overloads that made Wikipedia intermittently unavailable to the public.

The following day (26 June), the article received a record-breaking 5.9 million visits, one million of them during a single hour.

Background
When an edit is saved in MediaWiki, it is allocated a revision ID, and the page record is updated to point to this as the "current" revision of that page. During the save, the submitted wikitext is parsed. For large and complex biographies, parsing was a costly and time-intensive operation, involving a large amount of CPU work in PHP to process text markup, templates, and citation references. In 2009, it was not uncommon for such large articles to take over 30 seconds to parse. After the save operation and wikitext conversion are completed, the page's entry in the ParserCache is overwritten with the new article body and associated metadata describing the revision ID for which it was computed. After the ParserCache is updated, the article URL is also purged from the edge cache (the Wikimedia CDN).
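The save pipeline described above can be sketched roughly as follows. This is a toy, in-memory illustration, not MediaWiki's actual code; the dictionary stores, the `save_edit` function, and the `parse` stand-in are all hypothetical names for illustration.

```python
# Hypothetical in-memory stand-ins for MediaWiki's stores.
database = {}       # page title -> {"rev_id": int, "wikitext": str}
parser_cache = {}   # page title -> {"rev_id": int, "html": str}
cdn_cache = {}      # URL -> rendered HTML at the edge

def parse(wikitext):
    """Stand-in for the expensive wikitext-to-HTML parse."""
    return "<p>%s</p>" % wikitext

def save_edit(title, wikitext, rev_id):
    # 1. Record the new revision and mark it as the page's current one.
    database[title] = {"rev_id": rev_id, "wikitext": wikitext}
    # 2. Parse the submitted wikitext (the costly step).
    html = parse(wikitext)
    # 3. Overwrite the ParserCache entry, tagged with the revision ID
    #    for which it was computed.
    parser_cache[title] = {"rev_id": rev_id, "html": html}
    # 4. Only now purge the article URL from the CDN edge cache.
    cdn_cache.pop("/wiki/" + title, None)

save_edit("Michael_Jackson", "Some text", 42)
```

Note the ordering: the CDN purge happens last, so the edge cache is never purged before a fresh ParserCache entry exists to serve subsequent views.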

When a page is viewed by URL and there is no entry in the CDN, MediaWiki looks up the then-current revision ID of the requested article and checks the ParserCache for it. If the ParserCache contains no entry for the article, or if the entry is not for the expected revision ID, this is considered a "cache miss": the wiki markup is fetched from the database and parsed on demand, much as it would be during a save operation.
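The read path can be sketched the same way. Again this is a minimal illustration with made-up names, assuming the same toy stores as above; the key detail is that a cache hit requires both an entry to exist and its revision ID to match the database's notion of "current".

```python
# Toy stores: the ParserCache entry below is stale (rev 6 vs current rev 7).
database = {"Example": {"rev_id": 7, "wikitext": "Hello"}}
parser_cache = {"Example": {"rev_id": 6, "html": "<p>Old</p>"}}

def parse(wikitext):
    """Stand-in for the expensive wikitext-to-HTML parse."""
    return "<p>%s</p>" % wikitext

def view_page(title):
    current_rev = database[title]["rev_id"]
    entry = parser_cache.get(title)
    if entry is None or entry["rev_id"] != current_rev:
        # Cache miss: parse on demand, as a save operation would.
        html = parse(database[title]["wikitext"])
        parser_cache[title] = {"rev_id": current_rev, "html": html}
        return html
    return entry["html"]

print(view_page("Example"))  # stale entry is detected and re-parsed
```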

Incident
Any given edit request (correctly) did not purge the URL from the CDN until after it had both saved the database metadata and generated a ParserCache entry. However, the rapid editing of the article resulted in repeated purging of the article URL from the CDN, which in turn drove a great deal of traffic to the MediaWiki servers asking for the "current" version of the article.

The definition of "current", of course, kept changing, with race conditions in which servers processing a page view could perceive the database as referring to a "current" revision ID that was already outdated (the ParserCache having meanwhile been overwritten with a newer revision by another edit), or too new (the ParserCache entry still pointing to a previous revision). As a result, the MediaWiki web servers were essentially all busy doing the exact same thing: parsing the wikitext content of the Michael Jackson article, often even the exact same revision.

This overload exceeded the combined CPU capacity of the web servers and resulted in reduced availability of Wikipedia overall.
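The duplicated work at the heart of the stampede can be demonstrated with a small simulation, assuming the same toy cache-miss check as above. Each thread stands in for one web server; because the parse is slow, every server typically sees the miss before the first one finishes, so all of them parse the same revision.

```python
import threading
import time

parser_cache = {}
parse_count = 0
count_lock = threading.Lock()

def parse(rev_id):
    """Stand-in for the expensive parse; the sleep mimics a 30-second job."""
    global parse_count
    with count_lock:
        parse_count += 1
    time.sleep(0.05)
    return "<html rev=%d>" % rev_id

def handle_view(rev_id):
    # Every "server" checks the cache, finds nothing usable, and
    # parses on demand -- no coordination between them.
    entry = parser_cache.get("Michael_Jackson")
    if entry is None or entry["rev_id"] != rev_id:
        html = parse(rev_id)
        parser_cache["Michael_Jackson"] = {"rev_id": rev_id, "html": html}

threads = [threading.Thread(target=handle_view, args=(42,)) for _ in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(parse_count)  # typically well above 1: redundant parses of one revision
```

One parse would have sufficed; the other nineteen are pure waste, which is exactly the CPU burn that exceeded the servers' combined capacity.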

Solution
Shortly after this incident, Tim Starling developed the PoolCounter extension (together with its associated MediaWiki core interface and a server daemon written in C). It is designed to protect Wikimedia Foundation servers against massive spikes in views like this one, and to avoid the massive waste of CPU capacity caused by parallel parsing and re-computation of the same cached value after it is invalidated.

PoolCounter provides a mutex-like functionality used by MediaWiki to request a lock before it attempts to parse an article.

If the server is the first and only one in line to parse this article, PoolCounter responds immediately with a success message ("LOCKED"), and the server goes ahead and parses the article, releasing the lock once the result is saved to the ParserCache.

If another server is already busy doing this, PoolCounter holds the requesting server on the line for a while to allow the first one to complete its work. If the first server completes and releases the lock within the timeout threshold, PoolCounter responds to the held client with "DONE", indicating that the other server has finished and its result can be found in the ParserCache.

If there are too many servers waiting in line ("QUEUE_FULL"), or if the lock-holding server takes too long to complete its work ("TIMEOUT"), MediaWiki will not permit itself to parse the article and will instead return a known-stale version of it. If there is nothing in the ParserCache at all, it displays an error message asking the user to try again later.
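The client-side decision flow described above can be sketched as follows. This is an in-process toy, not the real protocol: actual PoolCounter is a separate C daemon speaking a network protocol, and the names here (`pooled_view`, `MAX_QUEUE`, the stores) are illustrative assumptions. The response names in the comments ("LOCKED", "DONE", "QUEUE_FULL", "TIMEOUT") are the ones from the text.

```python
import threading

# Toy stand-ins for the PoolCounter lock and the ParserCache.
_mutex = threading.Lock()
_waiters = 0           # unsynchronized counter; fine for a sketch
MAX_QUEUE = 5
TIMEOUT = 1.0          # seconds

parser_cache = {}

def parse(rev_id):
    """Stand-in for the expensive parse."""
    return "<html rev=%d>" % rev_id

def pooled_view(title, rev_id, stale_html=None):
    global _waiters
    if _mutex.acquire(blocking=False):
        # "LOCKED": we are first in line -- do the parse ourselves.
        try:
            html = parse(rev_id)
            parser_cache[title] = {"rev_id": rev_id, "html": html}
            return html
        finally:
            _mutex.release()
    if _waiters >= MAX_QUEUE:
        # "QUEUE_FULL": too many servers already waiting; serve a
        # known-stale copy if we have one, otherwise an error.
        return stale_html if stale_html else "Error: try again later"
    _waiters += 1
    try:
        if _mutex.acquire(timeout=TIMEOUT):
            # "DONE": the lock holder finished; its result should now
            # be in the ParserCache.
            _mutex.release()
            entry = parser_cache.get(title)
            if entry is not None:
                return entry["html"]
        # "TIMEOUT" (or an empty cache): fall back to stale or error.
        return stale_html if stale_html else "Error: try again later"
    finally:
        _waiters -= 1

print(pooled_view("Michael_Jackson", 42))
```

The design choice worth noting: under overload the system degrades to serving stale content rather than letting every server repeat the same parse, which caps the total CPU spent on any one hot article at roughly one parse per revision.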