Requests for comment/Simplify thumbnail cache


 * TODO
 * Add more data/options from meeting notes
 * Send link to Aaron & Faidon
 * Find a tracking bug to tie this to
 * Move to RFC namespace proper
 * Announce on wikitech-l, etc

This is a request for comment about changing the thumbnail storage and caching pipeline for Wikimedia projects.

Background
Managing scaled media files (thumbnails) in the Wikimedia projects is a significant source of complexity for both software developers and operations engineers. The current implementation tightly couples backend storage with frontend caching, to the detriment of both systems. This topic has been discussed in the past but remains unresolved.

Things we are not happy about:


 * Issuing Varnish/Squid purge messages from PHP in response to a media file change or deletion requires enumerating all potentially cached thumbnails
 * A very large number of Varnish purge messages may be needed to clean up the thumbnails for a single media delete
 * Swift has been configured somewhat awkwardly to support wildcard listing of stored thumbnails for enumeration
 * Thumbnails take up 60% (FIXME: triple check number with Faidon) of the on-disk storage footprint in Swift
 * The PHP layer carries extra complexity to hash a thumbnail's path into the right Swift collection (FIXME: proper term? True/False?)

FIXME: Could also discuss hashed image URLs and versioned image URLs as other aspects of, or solutions to, the same problem.

Option 1: Treat thumbs as a CDN only concern

 * 1) Configure Varnish so that a single purge message drops all variants of a given media file's thumbnails
 * 2) Stop storing generated thumbnails in Swift
 * 3) Generate individual thumbs in real-time in response to cache misses
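A minimal sketch of how step 1 could work: if the cache key for every thumbnail is derived from the path of its original, then one purge of that key drops all variants. The URL layout below (`.../thumb/<hash dirs>/<File>/<params>-<File>`) and the names `cache_key` and `VariantCache` are assumptions for illustration, not part of this proposal or of Varnish.

```python
import re
from collections import defaultdict

# Assumed thumbnail layout (hypothetical for this sketch):
#   <prefix>/thumb/<h>/<hh>/<File.ext>/<params>-<File.ext>
# Dropping "/thumb" and the final segment yields the original's path.
THUMB_RE = re.compile(r"^(?P<prefix>/.+?)/thumb/(?P<rest>.+)/[^/]+$")


def cache_key(url_path: str) -> str:
    """Return the key a cache would hash on for this request path."""
    m = THUMB_RE.match(url_path)
    if m is None:
        return url_path  # originals hash on their own path
    return f"{m.group('prefix')}/{m.group('rest')}"


class VariantCache:
    """Toy cache that groups every variant under the original's key."""

    def __init__(self):
        self._by_key = defaultdict(dict)  # key -> {url_path: body}

    def store(self, url_path: str, body: bytes) -> None:
        self._by_key[cache_key(url_path)][url_path] = body

    def purge(self, original_path: str) -> int:
        """Drop all variants of one original; return how many were dropped."""
        return len(self._by_key.pop(original_path, {}))
```

With this grouping, the number of entries sharing one key grows with the number of size and page variants of a file, which is the structure whose lookup cost would need to be measured.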

Tim, Asher and Mark have all weighed in on this general idea in the past.

As Mark pointed out, Varnish currently tracks items mapped to the same hash key in a linked list. This could become a bottleneck for media such as multi-page TIFF or PDF files that have page variants as well as size variants. Research would be needed to determine a reasonable upper limit for variants to collapse into a single hash and/or find a more efficient data structure to implement in Varnish itself.

Benefits

 * Only one HTCP purge message needed
 * Simplifies PHP code by removing list generation and traversal
 * Reduces Swift load by eliminating the wildcard enumeration request
 * No need to delete superseded thumbnail files from Swift
 * Reduces Swift load from thumbnail delete requests
 * Removes a potential point of failure for a delete/move operation on the base file
 * Lots of disk space reclaimed from Swift
 * Reduces hardware cost of the Swift cluster
 * Reduces maintenance cost of the Swift cluster

Drawbacks

 * Increased utilization of image scalers
 * Faidon estimates that image scaler jobs would grow from the current ~50/s to ~300/s to handle request volume
 * Increased latency for CDN misses
 * The ~250/s requests that are currently satisfied by Swift fetches of generated thumbnails would instead require a fetch of the original media and a scaling transformation
 * May not be reasonable for media types that have high thumbnail generation costs or a potentially huge number of thumbnails
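The two rates quoted above are consistent with each other, as a quick calculation shows. This sketch assumes the CDN hit rate stays unchanged, so that every miss formerly served from a stored thumbnail becomes exactly one scaler job; the variable and function names are illustrative.

```python
# Rough figures quoted in the drawbacks list (both are estimates).
current_scaler_jobs_per_s = 50   # thumbnails rendered by the scalers today
swift_thumb_hits_per_s = 250     # CDN misses served from stored thumbnails


def projected_scaler_load(scaler_jobs: float, swift_hits: float) -> float:
    """Assume each miss formerly served from Swift becomes a scaler job."""
    return scaler_jobs + swift_hits


projected = projected_scaler_load(current_scaler_jobs_per_s, swift_thumb_hits_per_s)
growth = projected / current_scaler_jobs_per_s
print(f"~{projected:.0f} jobs/s, about {growth:.0f}x the current rate")
# prints "~300 jobs/s, about 6x the current rate"
```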

Option 2: CDN + Swift expiration
Make the hash changes suggested in Option 1, but continue to store generated thumbnails in Swift. However, rather than storing them as permanent media, add an expiration header that specifies a TTL for the stored file. This would allow Swift to purge files after some reasonable time rather than holding them indefinitely.

Option 3: CDN + Swift expiration with TTL updates
Same as Option 2 but refresh the TTL on hit to simulate LRU cache deletion.
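The difference between a fixed TTL and a TTL refreshed on hit can be sketched with a toy in-memory store. This is not Swift code: `ExpiringThumbStore`, its parameters, and the explicit `now` argument are hypothetical, chosen only to make the expiry behaviour easy to follow.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class _Entry:
    body: bytes
    expires_at: float


class ExpiringThumbStore:
    """Toy stand-in for a thumbnail container with per-object expiry."""

    def __init__(self, ttl_seconds: float, refresh_on_hit: bool = False):
        self.ttl = ttl_seconds
        # False models a fixed TTL; True models TTL updates on hit.
        self.refresh_on_hit = refresh_on_hit
        self._objects: Dict[str, _Entry] = {}

    def put(self, name: str, body: bytes, now: float) -> None:
        self._objects[name] = _Entry(body, now + self.ttl)

    def get(self, name: str, now: float) -> Optional[bytes]:
        entry = self._objects.get(name)
        if entry is None:
            return None
        if now >= entry.expires_at:
            del self._objects[name]  # expired: caller must re-render
            return None
        if self.refresh_on_hit:
            entry.expires_at = now + self.ttl  # a hit extends the lifetime
        return entry.body
```

With a 10-second TTL and an object stored at `now=0`, a read at `now=5` followed by a read at `now=12` misses on the second read under the fixed TTL, but hits when `refresh_on_hit=True`, because the first read moved the expiry to 15. Frequently requested thumbnails therefore stay stored while rarely requested ones age out.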

Option 4: CDN + Swift for "common" sizes
Permanently store only the "common" sizes and generate all others on the fly.
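A minimal sketch of the store-or-render decision, assuming a configured set of "common" widths. The widths in `COMMON_WIDTHS` are placeholders rather than measured data, and `serve_thumbnail`, `stored` and `render` are hypothetical names standing in for the request handler, the Swift container and the image scaler.

```python
from typing import Callable, Dict, FrozenSet, Tuple

# Placeholder values: the real set would come from request statistics.
COMMON_WIDTHS: FrozenSet[int] = frozenset({120, 220, 320, 640, 800, 1024})


def should_store(width_px: int, common: FrozenSet[int] = COMMON_WIDTHS) -> bool:
    """Only thumbnails at a common width are kept permanently."""
    return width_px in common


def serve_thumbnail(
    name: str,
    width_px: int,
    stored: Dict[Tuple[str, int], bytes],
    render: Callable[[str, int], bytes],
) -> bytes:
    """Serve from storage when possible; otherwise render on the fly."""
    key = (name, width_px)
    if key in stored:
        return stored[key]
    body = render(name, width_px)
    if should_store(width_px):
        stored[key] = body  # uncommon sizes are never written back
    return body
```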