Requests for comment/Simplify thumbnail cache


 * TODO
 * ✅ Add more data/options from meeting notes
 * ✅ Send link to Aaron & Faidon
 * Find a tracking bug to tie this to
 * Move to RFC namespace proper
 * Announce on wikitech-l, etc

This is a request for comment about changing the thumbnail storage and caching pipeline for Wikimedia projects.

Background
Managing scaled media files (thumbnails) in the Wikimedia projects involves significant complexity for both software developers and operations engineers. The current implementation tightly couples backend storage with frontend caching, to the detriment of both systems. This topic has been discussed in the past but remains unresolved.

Problem

 * Issuing Varnish/Squid purge messages from PHP in response to a media file change or deletion requires enumerating all potentially cached thumbnails
 * A large number of Varnish purge messages may be needed to clean up the thumbnails for a single media delete
 * Swift has been configured somewhat awkwardly to support wildcard listing of stored thumbnails for enumeration
 * Thumbnails take up 60% (FIXME: triple check number with Faidon) of the on-disk storage footprint in Swift
 * The PHP layer has extra complexity to hash a thumbnail's path into the right Swift collection (FIXME: proper term? True/False?)

FIXME: discuss hashed image URLs and versioned image URLs as other aspects of, or solutions to, the same problem, as mentioned by Faidon?

Treat thumbnails as a CDN only concern

 * 1) Configure Varnish so that a single purge message drops all variants of a given media file's thumbnails
 * 2) Stop storing generated thumbnails in Swift
 * 3) Generate individual thumbs in real-time in response to cache misses

Tim, Asher and Mark have all weighed in on this general idea in the past.

As Mark pointed out, Varnish currently tracks items mapped to the same hash key in a linked list. This could become a bottleneck for media such as multi-page TIFF or PDF files that have page variants as well as size variants. Research would be needed to determine a reasonable upper limit for variants to collapse into a single hash and/or find a more efficient data structure to implement in Varnish itself.
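The idea of collapsing all variants of a file onto one hash key can be sketched as a path normalization. The following is a minimal Python illustration, not Varnish configuration. The URL layout it parses (`/<project>/<site>/thumb/<h>/<hh>/<name>/<variant>`) is an assumption that this page does not specify, and `purge_key` and `group_by_key` are hypothetical names introduced only for this example.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Optional

# Assumed thumbnail path layout (hypothetical, not taken from this page):
#   /<project>/<site>/thumb/<h>/<hh>/<original name>/<variant file name>
THUMB_RE = re.compile(
    r"^(?P<prefix>/[^/]+/[^/]+)/thumb/"
    r"(?P<shard>[0-9a-f]/[0-9a-f]{2})/"
    r"(?P<name>[^/]+)/(?P<variant>[^/]+)$"
)


def purge_key(path: str) -> Optional[str]:
    """Return the path of the original file for a thumbnail path.

    Every size or page variant of one file yields the same key, so a
    single purge on that key could cover all of them. Returns None when
    the path does not look like a thumbnail.
    """
    m = THUMB_RE.match(path)
    if m is None:
        return None
    return f"{m['prefix']}/{m['shard']}/{m['name']}"


def group_by_key(paths: Iterable[str]) -> Dict[str, List[str]]:
    """Group thumbnail paths by shared key, e.g. to count variants per file."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for p in paths:
        key = purge_key(p)
        if key is not None:
            groups[key].append(p)
    return dict(groups)
```

Grouping a sample of request paths with `group_by_key` and looking at the largest group would be one rough way to estimate how many variants end up on a single key, which is the quantity behind the linked-list concern above.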

Benefits

 * Only one HTCP purge message needed
 * Simplifies PHP code by removing a list generation and traversal
 * Reduces Swift load by eliminating the wildcard enumeration request
 * No need to delete superseded thumbnail files from Swift
 * Reduces Swift load
 * Removes a potential point of failure for a delete/move operation on the base file
 * Lots of disk reclaimed from Swift
 * Reduces hardware cost of the Swift cluster
 * Reduces maintenance cost of the Swift cluster

Drawbacks

 * Increased utilization of image scalers
 * Faidon estimates that image scaler jobs would grow from the current ~50/s to ~300/s to handle request volume
 * Increased latency for CDN misses
 * The ~250/s requests that are currently satisfied by fetching generated thumbnails from Swift would instead require a fetch of the original media and a scaling transformation
 * May not be reasonable for media types that have high thumbnail generation costs or a potentially huge number of thumbnails

Other strategies
The increased baseline load on the image scalers, and the processing capacity that would have to sit idle in the scaler cluster in order to absorb traffic spikes, may be too significant to ignore. These concerns might be addressable via slightly more complex Swift caching strategies.


 * 1) Rather than storing generated thumbnails as permanent media, add a header that specifies a TTL for the stored file.
   * This would allow Swift to purge files after some reasonable time rather than holding them indefinitely.
   * The right TTL would be one that strikes a balance between the cost of capacity for generating new thumbnails and the cost of storage for keeping previously generated ones.
 * 2) Use TTLs in Swift, but make them shorter and refresh the TTL on hit to simulate LRU cache deletion.
   * Similar benefits to the prior option, but TTL tuning may be easier.
 * 3) Only store generated thumbnails in Swift for certain thumbnail sizes that are determined to be in widespread/suggested use. This option has also been presented in the Standardized thumbnails sizes RfC.
   * Most similar to current behavior, but attempts to minimize long-term storage costs by selective caching.
 * 4) Store "standard" thumbnails permanently and others with a TTL (and possibly last-use updating).
   * This variant would make update calls to the Swift layer less common than other TTL-based approaches and still have LRU-ish purge characteristics.
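
The TTL-based options above can be illustrated with a small in-memory model. This is only a sketch of the policy, not a Swift client: `ThumbStore`, its parameters, and the injected `clock` are hypothetical names for this example, and the mapping onto real Swift expiry headers is not shown. Thumbnails of a "standard" width are kept permanently, and all others get a TTL that is refreshed on each hit so that rarely used ones age out first.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Set


@dataclass
class Entry:
    data: bytes
    expires_at: Optional[float]  # None means "keep permanently"


class ThumbStore:
    """In-memory stand-in for thumbnail storage with TTL refresh on hit."""

    def __init__(
        self,
        ttl: float,
        standard_widths: Set[int],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.ttl = ttl
        self.standard_widths = standard_widths
        self.clock = clock  # injectable so the policy can be tested
        self._entries: Dict[str, Entry] = {}

    def put(self, key: str, width: int, data: bytes) -> None:
        # Standard sizes never expire; everything else gets a TTL.
        if width in self.standard_widths:
            expires_at = None
        else:
            expires_at = self.clock() + self.ttl
        self._entries[key] = Entry(data, expires_at)

    def get(self, key: str) -> Optional[bytes]:
        entry = self._entries.get(key)
        if entry is None:
            return None  # miss: the caller would have to re-scale
        now = self.clock()
        if entry.expires_at is not None:
            if entry.expires_at <= now:
                del self._entries[key]  # expired, treated as a miss
                return None
            entry.expires_at = now + self.ttl  # refresh on hit (LRU-like)
        return entry.data
```

In this model only hits on non-standard sizes cause a write (the refreshed expiry), which corresponds to the point in option 4 that update calls become less common when popular sizes are stored permanently.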