Reading/Multimedia/Cache Invalidation Misses

There are only two hard things in Computer Science: cache invalidation and naming things. —Phil Karlton

Problem Summary
The Varnish cache servers sometimes present stale content to end users.

The existing cache invalidation mechanism relies on Multicast UDP packets sent from various MediaWiki backend servers to inform the varnish layer that content should be purged. These packets are sent in response to various file change events but can all be traced to SquidUpdate::purge and/or SquidUpdate::HTCPPurge.

This can manifest as stale images for anyone (especially thumbnails) and stale pages for anon users. Generally if the user is aware of the issue they can issue a manual purge request to correct, but this uses the same unreliable signaling mechanism as the original edit triggered purge.

See:
 * https://bugzilla.wikimedia.org/show_bug.cgi?id=49362
 * https://bugzilla.wikimedia.org/show_bug.cgi?id=43449

Possible Causes

 * Packet loss
 * Borked listener (operations/software/varnish/vhtcpd)
 * Borked Multicast relay to EU
 * Gremlins and/or Solar Flares

Possible Fixes

 * Replace Multicast UDP with PGM such as 0mq
 * Change asset URL structure to include timestamp (only solves image cache issues?)
 * Give purge packets a monotonically increasing sequence number and create a mechanism to request missing packets at the listener end (eventually consistent)

Recent Change Monitoring
Robla and Bawolff are interested in adding sample based monitoring to the stack in order to determine extent of the problem (and thus importance of fix). This may also have a benefit for operational monitoring of the vhtcpd listener. The general idea would be to poll the database for the last N most recently overwritten files and verify that Varnish returns appropriate Age headers when queried for the files.

For the records, in case somebody considers working on this:  andre__: just purge a URL, request it, and check its Age header  it should be less than some threshold  http://tools.ietf.org/rfcmarkup?doc=2616#section-5.1.2 43449#c2

Calling  requires being authenticated, so the script will need to handle sessions.