Reading/Multimedia/Cache Invalidation Misses

There are only two hard things in Computer Science: cache invalidation and naming things. —Phil Karlton

Problem Summary
The Varnish cache servers sometimes present stale content to end users.

The existing cache invalidation mechanism relies on Multicast UDP packets sent from various MediaWiki backend servers to inform the varnish layer that content should be purged. These packets are sent in response to various file change events but can all be traced to SquidUpdate::purge and/or SquidUpdate::HTCPPurge.

This can manifest as stale images for anyone (especially thumbnails) and stale pages for anon users. Generally if the user is aware of the issue they can issue a manual purge request to correct, but this uses the same unreliable signaling mechanism as the original edit triggered purge.

See:
 * 49362
 * 43449

Possible Causes

 * Packet loss
 * This is intrinsic to the current UDP delivery mechanism. In the general case packet loss should be low and cause a relatively small number of failures. Since UDP is a "fire and forget" messaging protocol there is really not a great way to tell when delivery is working and when it isn't.


 * Borked listener (operations/software/varnish/vhtcpd)
 * The new listener collects statistics that should allow this to be monitored but no solution is in place yet.


 * Borked Multicast relay to EU
 * This has apparently happened in the past. Instrumentation and trend monitoring on the new listener should be able to reveal this (several listeners would all drop their inbound rates to zero)


 * Borked PHP
 * "There was a bug a while back where mediawiki didn't send purges for files with lots of old versions." -- bawolff


 * Gremlins and/or Solar Flares
 * Non-deterministic failure is a failure of the engineer's imagination in connecting apparently unrelated events. :)

Possible Fixes

 * Replace Multicast UDP with PGM such as 0MQ
 * In theory PGM was designed to solve exactly this sort of problem, but implementation has non-trivial costs for both development and operations. This is may be swatting flies with an elephant gun unless the infrastructure engineering and maintenance costs can be spread across a larger collection of features. If it was implemented the 0MQ network would likely benefit from the addition of a RabbitMQ "device" to provide message permanence and backbone transport.


 * Give purge packets a monotonically increasing sequence number and create a mechanism to request missing packets at the listener end (eventually consistent)
 * There is some work towards this feature in place but packet id's are not guaranteed to by universally unique. Fully implementing this idea would be dangerously close to recreating PGM or some other queueing mechanism from scratch.


 * Change asset URL structure to include timestamp (only solves image cache issues?)
 * Eliminating the need for invalidation by providing versioned URLs would make Roy Fielding and the HATEOAS army very happy. Another solution would need to be found for stale anon pages however.

Importance of Image Purge Working
The issue of missed purges really effects images strongly due to thumbnail mechanics. Thumbs are kept locally to compute sizes to purge. On send of the purge packet these local copies are deleted. After that the system doesn't know what thumb sizes to send purges for if re-requested. This contrasts with an article which can always be purged via the  handle and will be purged regardless on the next edit.

Additionally article purge misses only effect non-auth users. This isn't an excuse to give them poor service, but it is a reason that we don't get complaints as often from that group. Image purge misses often effect authors. When they know they have just changed an image but still see the old content it creates dissonance in their editing experience. This is a classic problem of eventually consistent architectures that requires careful thought to minimize.

Proactive Monitoring
One of the most painful points related to stale cached content at this point is that we generally only become aware of an issue due to end-user reported glitches. We would like to take a more proactive approach in looking for (and monitoring the rate of) these sorts of issues as waiting for an end-user to complain is poor customer service and difficult to rely on for trending information.

Until more is known about the current rate of failure it will be difficult to determine the reasonable spend on correction or evaluate the benefits of any changes implemented.

Recent Change Monitoring
Robla and Bawolff are interested in adding sample based monitoring to the stack in order to determine extent of the problem (and thus importance of fix). This may also have a benefit for operational monitoring of the vhtcpd listener. The general idea would be to poll the database for the last N most recently overwritten files and verify that Varnish returns appropriate Age headers when queried for the files.

Purge and Check
For the records, in case somebody considers working on this:  andre__: just purge a URL, request it, and check its Age header  it should be less than some threshold  http://tools.ietf.org/rfcmarkup?doc=2616#section-5.1.2 43449

Here's a really simple POC of that sort of activity:

Verify Overwritten Files
Instead of triggering a purge as shown above, we could lookup a list of recently overwritten files via db query or api call: select log_title from logging where log_type = 'upload' and log_action = 'overwrite' order by log_timestamp DESC limit 10; 43449

Armed with this list of candidates to examine, we would request the URLs that should have been purged via http from the frontend cache servers. The Age headers returned should present a value indicating that the content has not been in cache longer than the time since the overwrite event occured. See https://tools.ietf.org/rfcmarkup?doc=2616#section-13.2.3 for a more thorough description of Age header calculation.

After this step we will have a list of URLs, the time they should have been purged and the time that contents have been in cache. We can easily derive a boolean indicating if the purge request was successful or not to augment the data. For the purposes of logging we can then record the total number of files checked and the number of failures for graphite trending. We can also record details about specific requests that failed for forensic investigation. If the number (or percentage?) of failures crosses some predefined threshold we could trigger an alert to Ops.

Additional monitoring of trends could be done via the collected data in graphite. This could result in alerts if the number/percentage of failures breaks an arbitrary ceiling or is out of bounds with respect to historical performance trends.

Issues with polling

 * Paravoid points out that polling the front-end cache will distort the normal LRU expiration by making the polled URLs hotter.

vhtcpd Trend Monitoring
Getting the vhtcpd logs parsed and collected into Graphite or a similar time-series system could allow development of a trend monitoring solution to identify servers or regions that are not receiving purge requests at the expected rate.

Brandon explained the current stats output on #wikimedia-labs : root@cp1040:~# cat /tmp/vhtcpd.stats start:1375268516 uptime:106410 inpkts_recvd:35459921 inpkts_sane:35459921 inpkts_enqueued:35459921 inpkts_dequeued:35459921 queue_overflows:0


 * start: unix gmt timestamp
 * uptime: how long the daemon has been alive, the file is overwritten every so often (30s I think?)
 * inpkts_recvd: number of multicast packets that hit the daemon
 * inpkts_sane: number of multicast packets that parsed correctly and wasn't garbage
 * inpkts_enqueued: number of multicast packets that made it into the sending queue
 * inpkts_dequeued: number of multicast packets that sent as TCP PURGE to all applicable varnish instances
 * queue_overflows: number of times the queue backlogged by such a huge volume of memory that the daemon gave up and wiped out all queued requests
 * queue_size: number of packets in queue
 * queue_max_size: maximum number of packets in queue since startup/overflow


 * A preliminary gmond module was created by bd808. This module is currently hosted at https://github.com/bd808/python-vhtcpd-gmond.
 * bd808's module was used as the basis for a patch submitted as 77975
 * The module was merged into operations / puppet as aab3b88