Reading/Multimedia/Cache Invalidation Misses

There are only two hard things in Computer Science: cache invalidation and naming things. —Phil Karlton

Problem Summary
The Varnish cache servers sometimes present stale content to end users.

The existing cache invalidation mechanism relies on multicast UDP packets sent from various MediaWiki backend servers to inform the Varnish layer that content should be purged. These packets are sent in response to various file change events but can all be traced back to SquidUpdate::purge and/or SquidUpdate::HTCPPurge.

This can manifest as stale images (especially thumbnails) for anyone and stale pages for anonymous users. Generally, if users are aware of the issue they can issue a manual purge request to correct it, but that purge uses the same unreliable signaling mechanism as the original edit-triggered purge.

See:
 * https://bugzilla.wikimedia.org/show_bug.cgi?id=49362
 * https://bugzilla.wikimedia.org/show_bug.cgi?id=43449
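For context, the Python sketch below shows roughly what that mechanism does: build an HTCP CLR request for a single URL and fire it at a multicast group over UDP. The group address, port, TTL, and exact field layout here are illustrative assumptions modeled loosely on SquidUpdate::HTCPPurge, not the production configuration.

    import random
    import socket
    import struct

    # Hypothetical values for illustration; the real group/port come from
    # MediaWiki configuration in production.
    HTCP_GROUP = "239.128.0.112"
    HTCP_PORT = 4827        # registered HTCP port
    MULTICAST_TTL = 8

    def build_htcp_clr(url):
        """Build an HTCP CLR request for one URL.

        The field layout is a simplified reading of what
        SquidUpdate::HTCPPurge packs; treat it as illustrative, not
        wire-exact.
        """
        url_bytes = url.encode("utf-8")
        # OP-DATA specifier: length-prefixed METHOD, URI, VERSION, empty REQ-HDRS.
        specifier = (
            struct.pack("!H4sH", 4, b"HEAD", len(url_bytes))
            + url_bytes
            + struct.pack("!H8sH", 8, b"HTTP/1.0", 0)
        )
        data_len = 8 + 2 + len(specifier)
        total_len = 4 + data_len + 2
        trans_id = random.getrandbits(32)
        # Packet header plus DATA header; opcode 4 = CLR.
        header = struct.pack("!HxxHBxIxx", total_len, data_len, 4, trans_id)
        return header + specifier + struct.pack("!H", 2)

    def send_purge(url):
        """Fire and forget: nothing confirms that any cache received this."""
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, MULTICAST_TTL)
        sock.sendto(build_htcp_clr(url), (HTCP_GROUP, HTCP_PORT))

    if __name__ == "__main__":
        send_purge("http://upload.wikimedia.org/wikipedia/commons/thumb/a/ab/Example.jpg")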

Possible Causes

 * Packet loss
 * This is intrinsic to the current UDP delivery mechanism. In the general case packet loss should be low and cause a relatively small number of failures. However, since UDP is a "fire and forget" messaging protocol, there is no good way to tell when delivery is working and when it isn't.


 * Borked listener (operations/software/varnish/vhtcpd)
 * The new listener collects statistics that should allow this to be monitored, but no monitoring solution is in place yet.


 * Borked Multicast relay to EU
 * This has apparently happened in the past. Instrumentation and trend monitoring on the new listener should be able to reveal this (several listeners would all drop their inbound rates to zero at the same time).


 * Gremlins and/or Solar Flares
 * Non-deterministic failure is a failure of the engineer's imagination in connecting apparently unrelated events. :)

Possible Fixes

 * Replace multicast UDP with a reliable multicast protocol (PGM), e.g. via 0MQ
 * In theory PGM was designed to solve exactly this sort of problem, but implementation has non-trivial costs for both development and operations. This may be swatting flies with an elephant gun unless the infrastructure engineering and maintenance costs can be spread across a larger collection of features. If it were implemented, the 0MQ network would likely benefit from the addition of a RabbitMQ "device" to provide message permanence and backbone transport.


 * Give purge packets a monotonically increasing sequence number and create a mechanism to request missing packets at the listener end (eventually consistent)
 * There is some work towards this feature in place, but packet IDs are not guaranteed to be universally unique. Fully implementing this idea would come dangerously close to recreating PGM or some other queueing mechanism from scratch; see the sketch after this list for the listener-side gap detection.


 * Change asset URL structure to include timestamp (only solves image cache issues?)
 * Eliminating the need for invalidation by providing versioned URLs would make Roy Fielding and the HATEOAS army very happy. However, another solution would still be needed for stale anonymous pages.
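
A minimal sketch of the sequence-number idea from the second fix above, assuming a hypothetical message format in which each purge carries a (sender, sequence) pair (today's purge packets carry no such field):

    from collections import defaultdict

    class PurgeSequenceTracker:
        """Track per-sender purge sequence numbers and report gaps.

        Purely illustrative: assumes each purge message carries a
        (sender_id, seq) pair, which today's HTCP purges do not.
        """

        def __init__(self):
            self.last_seen = defaultdict(lambda: None)

        def observe(self, sender_id, seq):
            """Return a list of missing sequence numbers to re-request."""
            last = self.last_seen[sender_id]
            self.last_seen[sender_id] = max(seq, last) if last is not None else seq
            if last is None or seq <= last:
                return []                       # first message, duplicate, or reordered packet
            return list(range(last + 1, seq))   # the gap, if any

    tracker = PurgeSequenceTracker()
    assert tracker.observe("mw1001", 1) == []
    assert tracker.observe("mw1001", 2) == []
    assert tracker.observe("mw1001", 5) == [3, 4]   # packets 3 and 4 were lost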

Recent Change Monitoring
Robla and Bawolff are interested in adding sample-based monitoring to the stack in order to determine the extent of the problem (and thus the importance of a fix). This may also benefit operational monitoring of the vhtcpd listener. The general idea would be to poll the database for the N most recently overwritten files and verify that Varnish returns appropriate Age headers when queried for those files.

For the record, in case somebody considers working on this, andre__'s suggestion (bug 43449, comment 2) was: just purge a URL, request it, and check its Age header; it should be less than some threshold. See http://tools.ietf.org/rfcmarkup?doc=2616#section-5.1.2 for the definition of the Age header.

Issuing the purge requires being authenticated, so the script will need to handle sessions.
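
A minimal sketch of the Age-header check, assuming the requests library and an arbitrary freshness threshold; the purge step itself is omitted because, as noted above, it requires an authenticated session:

    import requests

    MAX_AGE_SECONDS = 300   # arbitrary threshold for "recently purged"

    def check_purged(url, threshold=MAX_AGE_SECONDS):
        """Return True if the cache reports an Age below the threshold.

        Assumes the URL was purged shortly before this check; issuing the
        purge itself (which needs a session) is not shown.
        """
        resp = requests.get(url)
        age = int(resp.headers.get("Age", 0))
        return age < threshold

    if __name__ == "__main__":
        url = "https://upload.wikimedia.org/wikipedia/commons/thumb/a/ab/Example.jpg"
        print(url, "OK" if check_purged(url) else "STALE?")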

vhtcpd Trend Monitoring
Getting the vhtcpd logs parsed and collected into Graphite or a similar time-series system could allow development of a trend monitoring solution to identify servers or regions that are not receiving purge requests at the expected rate.
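
A possible starting point, assuming vhtcpd exposes its counters as space-separated key:value pairs in a stats file (the path, field format, and Graphite host below are assumptions, not the daemon's documented interface):

    import socket
    import time

    # Assumed locations/names; check the vhtcpd documentation for the real ones.
    STATS_FILE = "/tmp/vhtcpd.stats"
    CARBON_HOST = "graphite.example.org"
    CARBON_PORT = 2003          # Carbon plaintext protocol
    METRIC_PREFIX = "vhtcpd"

    def read_stats(path=STATS_FILE):
        """Parse a line of space-separated key:value counters into a dict."""
        with open(path) as f:
            line = f.read().strip()
        stats = {}
        for field in line.split():
            key, _, value = field.partition(":")
            if value.isdigit():
                stats[key] = int(value)
        return stats

    def send_to_graphite(stats, host=CARBON_HOST, port=CARBON_PORT):
        """Emit each counter as 'prefix.hostname.key value timestamp'."""
        now = int(time.time())
        node = socket.gethostname().split(".")[0]
        lines = [
            "%s.%s.%s %d %d" % (METRIC_PREFIX, node, key, value, now)
            for key, value in stats.items()
        ]
        sock = socket.create_connection((host, port))
        try:
            sock.sendall(("\n".join(lines) + "\n").encode("ascii"))
        finally:
            sock.close()

    if __name__ == "__main__":
        send_to_graphite(read_stats())

With per-host counters in Graphite, a sudden drop of inbound purge rates to zero on one server or one data center becomes visible as a trend rather than something discovered only after users report stale content.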