Does anyone have a good candidate for a tracking bug on this issue? My bugzilla fu is weak enough that I couldn't really find an obvious one.
- Yes, this needs a Phabricator task in the MediaWiki-RfCs project. phab:T46428 "Handle thumbnail purges when thumbnails are in cache but not in the backend" and phab:T46310 "Thumbnail cache should be automatically discarded after 6 months" seem related. -- SPage (WMF) (talk) 02:01, 26 March 2015 (UTC)
TTL Based Strategies
Would this force the URLs for the thumbnails to include the time at which the source was uploaded or the version of the source? Do they do that already? NEverett (WMF) (talk) 12:40, 8 October 2013 (UTC)
- Thumbnail URLs do not currently contain any version information. At this time there is no provision in this plan to change the URL structure. --BDavis (WMF) (talk) 23:06, 9 October 2013 (UTC)
Similar to the sliding window, could we bump the TTL on a percentage of varnish hits? That'd mostly keep the TTL on popular items high. We could also combine the ideas in either order. The random check should be quick. I don't know about the bump the TTL process or the sliding window check process. NEverett (WMF) (talk) 12:40, 8 October 2013 (UTC)
- This is a great question. I don't know if it would be possible or reasonable to have Varnish talk to the backend to announce that a resource had been served from cache, but it should be possible to add something that watches the Varnish log stream and queues "touch" jobs as a result. --BDavis (WMF) (talk) 23:06, 9 October 2013 (UTC)
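To make the sampling idea in this thread concrete, here is a minimal sketch of how a log-watching process might pick which cache hits to turn into "touch" jobs. It does not read a real Varnish log or talk to a real job queue: `select_touch_jobs`, its parameters, and the example URLs are all hypothetical names used only for illustration, and the 5% rate is an arbitrary assumption.

```python
import random
from typing import Iterable, Optional, Set


def select_touch_jobs(hit_urls: Iterable[str], sample_rate: float,
                      rng: Optional[random.Random] = None) -> Set[str]:
    """Return the URLs whose TTL should be bumped (hypothetical helper).

    Each cache hit is sampled independently with probability sample_rate,
    so a URL that is hit n times is selected with probability
    1 - (1 - sample_rate) ** n. Popular items are therefore almost always
    refreshed, while rarely requested items are mostly left to expire.
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    if rng is None:
        rng = random.Random()
    selected: Set[str] = set()
    for url in hit_urls:
        if url in selected:
            continue  # already queued once; no need to queue it again
        if rng.random() < sample_rate:
            selected.add(url)
    return selected


if __name__ == "__main__":
    # Illustrative input: one hot thumbnail and one rarely requested one.
    hits = ["/thumb/hot.jpg"] * 1000 + ["/thumb/cold.jpg"]
    print(select_touch_jobs(hits, sample_rate=0.05))
```

Deduplicating within a batch keeps the number of queued jobs bounded by the number of distinct URLs rather than the number of hits, which seems to matter most for the very hot items.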
Cold varnish layer
Four out of five of the drawbacks you list for the "CDN only" solution could be addressed by having an extra layer of varnish servers as parents to the current caches, taking the place of Swift's thumbnail store. It would be like Swift's thumbnail store in terms of hardware, but it would have LRU eviction and wouldn't rely on special support in MediaWiki for purging -- it would share the HTCP stream with the frontend cache.
Reducing cache size increases image scaler CPU, increases latency due to reduced hit rate, and increases the rate at which originals are read, which I am told is likely to require increased read capacity in Swift. So I can understand that it is not necessarily a good idea to reduce cache size. My question is whether an ad-hoc combination of Swift and MediaWiki is really the best software to use for HTTP caching, and whether some purpose-built HTTP cache would be better at the job.
- I added an interpretation of this as an alternate strategy on the RFC. It seems to me that using Varnish backed by spinning disks would be a simpler implementation path than the TTL+LRU in Swift options. I don't think it addresses the failure tolerance or vcl_hash issues. I'm of the opinion that the unknowns of (ab)using vcl_hash will be unavoidable in any implementation that gets rid of a listing of all possible thumb URLs. Fault tolerance may be a larger concern. --BDavis (WMF) (talk) 16:47, 17 October 2013 (UTC)
Updates from 2013-12-04 RFC review
The variation previously known as "option 5" is now the primary recommendation of the proposal with an optional variation that would increase the current "backend" Varnish storage capacity rather than adding a new Varnish cluster.
Currently awaiting Ops approval
Varnish persistent storage too unreliable, proposal for an nginx based solution
We've been thinking about this RFC at the Ops off site meeting last week. I mostly agree with it, but I have a few (additional) concerns:
Relying on Varnish persistent storage is suboptimal at the moment. The persistent backend we use in Varnish is of beta quality, and has recently been made proprietary (only available to Plus subscribers). Varnish 4 doesn't offer an Open Source version of the persistent backend either. This means that right now we have no upgrade path, likely will have to fork it or rewrite it ourselves, and also fix any bugs in the current version.
Therefore, I'm not comfortable using Varnish in its current state for this with no thumbnail storage behind it. Squid would actually be better for this, but it doesn't see as much active development as it used to now that Varnish has gained traction in the industry.
We've been thinking about using a simple nginx based caching solution for this instead, a bit similar to how we've solved thumbnail storage before. nginx can be configured as a cache, with the image scalers as parents. Scaled images can be saved to local disk with arbitrary file naming, so the same directory structure as the upload.wikimedia.org URL paths can be used, allowing for purging of all thumbnails related to an original by wiping the entire directory. LRU eviction is not supported by nginx in this setup, but would be straightforward to implement using a separate daemon that follows file access using Linux's inotify system. (Brandon Black has been looking at possibly implementing such a daemon.) We may or may not be able to do without (backend, disk based) Varnish instances this way, which would avoid the VCL hacks and possible inefficiencies due to Varnish's Vary hashing scheme. Frontend Varnish caches would still be useful to take care of the very hot objects, and to do consistent URL hashing to the nginx backends. -- Mark Bergsma (talk) 13:25, 16 April 2014 (UTC)
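As a rough illustration of the two operations described above, the following sketch purges every thumbnail of one original by removing its directory and evicts the least recently used files once the cache exceeds a size limit. It is not the proposed daemon: the function names and the example path are hypothetical, and it swaps in a periodic scan of file access times (`st_atime`) for the inotify-based tracking mentioned in the comment. Access times are only as reliable as the filesystem's mount options (for example `noatime` would break this approach).

```python
import os
import shutil
from pathlib import Path
from typing import List


def purge_original(cache_root: Path, original_rel_path: str) -> bool:
    """Delete all cached thumbnails of one original by wiping its directory."""
    root = cache_root.resolve()
    target = (root / original_rel_path).resolve()
    # Refuse paths that escape the cache root (e.g. "../..") or are the root.
    if root not in target.parents:
        raise ValueError("path is outside the cache root")
    if not target.is_dir():
        return False  # nothing cached for this original
    shutil.rmtree(target)
    return True


def evict_lru(cache_root: Path, max_bytes: int) -> List[Path]:
    """Delete least recently accessed files until the cache fits max_bytes.

    Assumption: st_atime stands in for the access tracking that the
    discussion proposes to collect with inotify.
    """
    entries = []
    total = 0
    for dirpath, _dirs, files in os.walk(cache_root):
        for name in files:
            path = Path(dirpath) / name
            st = path.stat()
            entries.append((st.st_atime, st.st_size, path))
            total += st.st_size
    entries.sort(key=lambda entry: entry[0])  # oldest access first
    removed = []
    for _atime, size, path in entries:
        if total <= max_bytes:
            break
        path.unlink()
        total -= size
        removed.append(path)
    return removed
```

With a layout that mirrors the URL path, a call such as `purge_original(Path("/srv/thumbs"), "wikipedia/commons/thumb/a/ab/Foo.jpg")` would remove every size of that image in one step; both the cache root and the file name here are made up for the example.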
- As long as some method of sending the wildcard PURGE is devised and added into nginx and Brandon is willing to do the purge daemon then that looks fine. I've started playing around with the MediaWiki side in https://gerrit.wikimedia.org/r/#/c/126210/. The nginx approach looks fine, but if we put Varnish in front for the "very hot objects" we will need to purge those. That will reintroduce the same problem we have now, so would we also be using vcl_hash hacks there too? Aaron (talk) 21:13, 16 April 2014 (UTC)
- I found some documentation on configuring proxy_cache_purge settings for nginx but the docs include the ominous statement: "This functionality is available as part of our commercial subscription." That sounds a lot like the Varnish "open core" problem popping up again. --BDavis (WMF) (talk) 21:36, 16 April 2014 (UTC)
- If we do use a second LRU layer, I'd be worried about the problem of the hot items being in the memory tier while long-tail requests keep hitting the second persistent tier (with only a trickle of hot item requests). From the second tier's perspective, none of the "hot" items would be seen as "hot", so they might get evicted. It would be nice to know that the hottest items (as defined by the first layer) are roughly a subset of the persistent tier. I'd assume the power outage scenario would be better handled with such a guarantee. Aaron (talk) 21:22, 26 April 2014 (UTC)
- Maybe a low-revalidate time would solve this. It would result in semi-frequent If-Modified-Since requests, but those would be back to the local nginx instance on localhost, so it would still be fast. It would effectively notify nginx of the usage of items, so that any LRU could account for that to some extent. Aaron (talk) 00:59, 27 April 2014 (UTC)
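The concern in the last two comments can be reproduced with a toy simulation. The sketch below models two LRU tiers with made-up capacities and a made-up request pattern (one hot item interleaved with a stream of one-off long-tail items). All names are hypothetical, and `notify_back=True` is a simplification in which every front-tier hit reaches the second tier, standing in for the semi-frequent If-Modified-Since revalidations suggested above.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny LRU set: most recently used keys sit at the end."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return False
        self.items.move_to_end(key)  # mark as recently used
        return True

    def put(self, key):
        self.items[key] = True
        self.items.move_to_end(key)
        while len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict least recently used

    def __contains__(self, key):
        return key in self.items


def request(key, front, back, notify_back=False):
    """Serve key from the front (memory) tier, else the back (disk) tier."""
    if front.get(key):
        if notify_back:
            back.get(key)  # stand-in for a revalidation reaching the back tier
        return "front-hit"
    result = "back-hit" if back.get(key) else "miss"
    if result == "miss":
        back.put(key)  # a miss is rendered and stored in the back tier
    front.put(key)
    return result


def simulate(notify_back, tail_requests=10):
    """Return (hot in front tier, hot in back tier) after the request mix."""
    front, back = LRUCache(2), LRUCache(3)
    for i in range(tail_requests):
        request("hot", front, back, notify_back)
        request(f"tail-{i}", front, back, notify_back)
    return "hot" in front, "hot" in back


if __name__ == "__main__":
    print("without revalidation:", simulate(notify_back=False))
    print("with revalidation:   ", simulate(notify_back=True))
```

In this toy setup the hot item stays in the front tier either way, but without the revalidation signal it is pushed out of the second tier by the long-tail traffic, which is the situation that would hurt after a power outage wipes the memory tier.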