Requests for comment/Media file request counts

Background
Since 2008 Wikimedia has collected pageview counts for most pages on nearly all wikis. A longstanding request of stakeholders (editors, researchers, GLAM advocates) has been to publish similar counts for media files: images, sounds, videos. A major obstacle was the existing traffic data collection software: Webstatscollector simply couldn't be scaled up further without incurring huge costs. Even page view counts weren't complete: mobile counts were available only per wiki, not per page.

In 2014 WMF engineers rolled out a new Hadoop-based infrastructure, which makes it possible to collect raw request counts for media files. So a few months after releasing extended pageview counts (with mobile/zero added), the time has come to produce similar data dumps for media files.

Problem
What will be the exact specifications of the new media file request dumps?

Definition of media files
Media files are all images, sound files and videos on the WMF upload servers (url '..//upload.wikimedia.org/..'). These files are mostly embedded in articles, but can be requested separately. Images which are served to users from other servers, e.g. the bits servers (navigation icons and the like), are therefore out of scope.

Location

The primary defining criterion is therefore the path where the file resides; the secondary criterion is the file extension.

Currently three folder hierarchies on the upload servers are included:
 *  //upload.wikimedia.org/ -project code- / -language code- / [thumb] / [archive] / @ / @@ / -image file name- 
 * e.g. http://upload.wikimedia.org/wikipedia/commons/4/4a/Commons-logo.svg


 *  //upload.wikimedia.org/math/ @ / @ / @ / -math image file name- 
 * e.g. http://upload.wikimedia.org/math/f/f/f/fffffd30a4febac3dab210ae1537419e.png


 *  //upload.wikimedia.org/ -project code- / -language code- /timeline/ -timeline image file name- 
 * e.g. https://upload.wikimedia.org/wikipedia/en/timeline/d2d8e00fd34c75c9be9ce74f63be3517.png

(!! may be incomplete, check)

Legend:
 * each @ stands for one character in range 0-9a-f
 * -xxx- generic description
 * [..] = optional segment
 * language code includes strings like 'meta', 'commons', etc.
 * ignore spaces
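
The three hierarchies above can be told apart with a few simplified patterns. The following is a minimal sketch only: the class name `MediaPathKind`, its method and its patterns are assumptions made for this page, and they are far looser than the production pattern quoted under File extensions (whose segment order, thumb before archive, is followed here). Timestamps in archive names and thumbnail suffixes are not validated.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: classifies an upload.wikimedia.org path by folder hierarchy. */
final class MediaPathKind {
    // One hex character, written as '@' in the legend.
    private static final String HEX = "[0-9a-f]";

    // /-project code-/-language code-/[thumb]/[archive]/@/@@/-image file name-...
    // The second hash folder must start with the character of the first one (\3).
    private static final Pattern UPLOADED = Pattern.compile(
        "/[^/]+/[^/]+(/thumb)?(/archive)?/(" + HEX + ")/\\3" + HEX + "/.+");

    // /math/@/@/@/-math image file name-
    private static final Pattern MATH = Pattern.compile(
        "/math/" + HEX + "/" + HEX + "/" + HEX + "/[^/]+");

    // /-project code-/-language code-/timeline/-timeline image file name-
    private static final Pattern TIMELINE = Pattern.compile(
        "/[^/]+/[^/]+/timeline/[^/]+");

    /** Returns "math", "timeline", "uploaded" or "other" for a path without host. */
    static String classify(String path) {
        if (MATH.matcher(path).matches()) return "math";
        if (TIMELINE.matcher(path).matches()) return "timeline";
        if (UPLOADED.matcher(path).matches()) return "uploaded";
        return "other"; // e.g. /robots.txt or /favicon.ico
    }
}
```

With this sketch, `MediaPathKind.classify("/wikipedia/commons/4/4a/Commons-logo.svg")` yields "uploaded", while a path such as `/robots.txt` falls into "other" and could be filtered out.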

Packaging
As with pageviews, all requests per time unit (here: day) will be packaged into one file. Unlike with pageviews, little download speed would be gained by splitting projects/wikis into separate files (still under consideration), as most media files are stored in one wiki: Commons.

Update frequency
The proposed update frequency is daily dump files. Reasons to choose daily updates instead of, say, hourly updates (as with page views) are:
 * Generation is more cost-effective than hourly updates (one daily job, to be scheduled at the most convenient time of day, in terms of overall system activity)
 * Requires less post-processing for aggregation.
 * More convenient to download (due to much smaller per-day size)
 * Also there will be less need for hour-to-hour analysis of traffic after e.g. major world events, as images tend to appear online with more delay

File paths
File paths other than in Location (see above) will be filtered out? (!! currently e.g. robots.txt and favicon.ico are included)

File extensions
File extensions will be accepted from a hard coded 'white list', see (!! show simple white list, derived from code)

/**
 * Pattern to match urls for plain uploaded media files
 */
private static Pattern uploadedPattern = Pattern.compile(
        "(/[^/]*/" + wikiPattern.pattern + ")(/thumb|/transcoded)?(/archive|/temp)?"
        + "(/([0-9-a-f])/\\5[0-9-a-f])"
        + "/(?:([12][0-9]{3}[01][0-9][0-3][0-9][0-2][0-9][0-5][0-9][0-6][0-9])(?:!|%21))?"
        + "([^/]*)" // <-- this is the main file name
        + "(/(lossy-)?(?:lossless-)?(page[0-9]+-)?(lang[a-z-]*-)?((?:qlow|mid)-)?(?:0*([1-9]+[0-9]*)px-)?(seek(?:=|%3D)[0-9]+(?:\\.[0-9]*)?)?-?(?:\\7|thumbnail(?:\\.(?:djvu|ogv|pdf|svg|tiff?))?)(?:\\.(?:jpe?g|gif|png|ogg|0*([1-9][0-9]*)p\\.(?:webm|ogv)))?)?");

Status codes
(!! to do)

by project/wiki
This happens implicitly, as -project code- and -language code- are part of the file path. Note that language code can also be e.g. 'commons' or 'meta'.

file extension
Media files can be uploaded in many formats. We consider files with different file extensions as totally unrelated entities. E.g. blob.png and blob.jpg may show the same picture in the same size, etc, but that is not relevant here.

PNG renditions of SVG files are also considered separate entities; any matching up is left to client software which processes the data dumps. Reason: this avoids needless complication of Hive scripts and further aggregation scripts. Keeping these entities in separate rows also allows more detail to be shown without the field count exploding.

by rendered size
Let's make a distinction between streamed and non-streamed content. Streamed content can be viewed or listened to partially (even with skip backs/rewinds).

Non-streamed content can be requested in a variety of prerendered/custom rendered versions. This goes particularly for images where thumbs can be requested in any size.

(!! not sure about sound files, and whether this is relevant yet)

(!! show two choices: 1 original vs large (arbitrary) vs small thumbs 2 original vs thumbs ) (!! large is relative to image type, e.g. diagrams and texts may scale down poorly) (!! large may be something that is seen differently in 3 years from now) (!! large may be different in different device context)

by revision
(!! treat blob.png and ../archive/../blob.png as one?) (!! how about revisions of timelines ?)

by http referer
One aggregation level could be: source of request (aka http referer): internal/external/unknown.

In Oct 2014 a quick scan showed referers for media files on upload servers were
 * ~92.5% internal (known)
 * 4.3% unknown
 * 3.2% external (known)

Reason to include external referers separately: half of the bandwidth in Oct 2014 was consumed by external sites which either include Wikimedia's media files directly in their pages (which, incidentally, is not forbidden by Wikimedia's policies) or provide a direct link to a Wikimedia media file. The vast majority of internally requested media files are thumb images with negligible impact on bandwidth, hence the different breakdown for file counts and bytes sent.
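
The internal/external/unknown split could be derived from the host of the referer. The sketch below is an illustration under assumptions, not the implemented logic: the class name `RefererClass` is hypothetical, the list of internal domains is deliberately incomplete, and treating a missing, '-' or unparsable referer as unknown is a guess at reasonable behaviour.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper: buckets a raw http referer into internal/external/unknown. */
final class RefererClass {
    // Assumed, incomplete list of Wikimedia project domains (illustration only).
    private static final List<String> INTERNAL_DOMAINS = Arrays.asList(
        "wikipedia.org", "wikimedia.org", "wiktionary.org", "wikibooks.org");

    static String classify(String referer) {
        if (referer == null || referer.isEmpty() || referer.equals("-")) {
            return "unknown";
        }
        String host;
        try {
            host = new URI(referer).getHost();
        } catch (URISyntaxException e) {
            return "unknown"; // unparsable referer header
        }
        if (host == null) {
            return "unknown";
        }
        host = host.toLowerCase(Locale.ROOT);
        for (String domain : INTERNAL_DOMAINS) {
            // Match the domain itself or any subdomain, but not look-alikes.
            if (host.equals(domain) || host.endsWith("." + domain)) {
                return "internal";
            }
        }
        return "external";
    }
}
```

Matching on a leading dot keeps a host such as `notwikipedia.org` out of the internal bucket.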

Encoding
File name encoding will be standardized, doing away with many browser artifacts (browsers may encode a URL differently). This reduces the number of meaningless variations in actual file names, which otherwise would be counted separately and have to be merged in post-processing (an issue that exists for pageview counts, where all requested URLs are counted as they are received).

File names will be decoded, then encoded (to avoid double encoding) with a custom function IdentifyMediaFileUrl.java
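
The decode-then-encode idea can be sketched with standard library calls. This is not the code of IdentifyMediaFileUrl.java: the class name `MediaPathEncoding` is hypothetical, and `java.net.URLDecoder` plus `java.net.URI` are used here as stand-ins whose escaping rules may differ in detail from the custom function. The path is assumed to start with a single slash.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLDecoder;

/** Hypothetical helper: brings differently encoded variants of a path to one form. */
final class MediaPathEncoding {
    static String normalize(String requestedPath) {
        try {
            // URLDecoder treats '+' as an encoded space (form encoding), which does
            // not apply to URL paths, so literal plus signs are protected first.
            String decoded = URLDecoder.decode(requestedPath.replace("+", "%2B"), "UTF-8");
            // The multi-argument URI constructor quotes illegal characters and every
            // '%'; toASCIIString() also percent-encodes non-ASCII characters as UTF-8.
            return new URI(null, null, decoded, null).toASCIIString();
        } catch (UnsupportedEncodingException | URISyntaxException e) {
            throw new IllegalArgumentException("Cannot normalize path: " + requestedPath, e);
        }
    }
}
```

Because the input is decoded first, a path that arrives already encoded is not encoded a second time: `/a/My File.png` and `/a/My%20File.png` both come out as `/a/My%20File.png`, and lower-case escapes such as `%c3%a9` are rewritten in upper case.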

Field separator
Field separator will be space, as in pageview dump files?
 * (Alternative being tabs)

Empty values
Nearly all fields are counts, and 'none found' will just be '0'

Other empty values (if any?) will be shown as '-' (dash)?
 * (Alternative being NULL)
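
To make the separator and empty-value proposals concrete, here is a small sketch of how one row could be written and read back. Everything in it is an assumption for illustration: the class name `DumpLine`, the order and number of count fields, and the sample values used below are made up. It relies on file names being encoded, so that they contain no spaces.

```java
import java.util.StringJoiner;

/** Hypothetical helper: writes and reads one row of a media file request dump. */
final class DumpLine {
    static final String SEPARATOR = " "; // proposed separator; tab is the alternative
    static final String EMPTY = "-";     // proposed marker for empty non-count values

    /** Joins the media path and its counts; 'none found' is simply a count of 0. */
    static String format(String mediaPath, long... counts) {
        StringJoiner line = new StringJoiner(SEPARATOR);
        line.add(mediaPath == null || mediaPath.isEmpty() ? EMPTY : mediaPath);
        for (long count : counts) {
            line.add(Long.toString(count));
        }
        return line.toString();
    }

    /** Returns all count fields of a row, i.e. every field after the first. */
    static long[] parseCounts(String line) {
        String[] fields = line.split(SEPARATOR);
        long[] counts = new long[fields.length - 1];
        for (int i = 1; i < fields.length; i++) {
            counts[i - 1] = Long.parseLong(fields[i]);
        }
        return counts;
    }
}
```

For example, `DumpLine.format("wikipedia/commons/4/4a/Commons-logo.svg", 10L, 0L, 7L)` produces `wikipedia/commons/4/4a/Commons-logo.svg 10 0 7`.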

File compression
Files will be published as .gz or .bz2 (preferred for better compression). The choice between the two formats depends on implementation issues (.gz is the default now).

A test file for one day in Oct 2014 was approx. 108 MB in bz2 format.

Fields
Some fields are (based on) native fields in the Hive table webrequests. Some fields are (based on) derived fields via a user-defined function (UDF), see gerrit patch https://gerrit.wikimedia.org/r/#/c/169346/.

Set of fields to show
The actual set of fields is still under discussion, please chime in. Two (emerging) proposals exist:
 * Thumbs all counted in one column:
   * mediafile/base_name
   * total requests
   * total requests internal referer, full size
   * total requests internal referer, thumbs, any size
   * total requests external referer, any size
   * total requests unknown referer, any size
   * total byte count


 * Thumbs broken down by thumb size:
   * mediafile/base_name
   * total requests
   * total requests internal referer, full size
   * thumb requests internal referer, thumbs at or above size x
   * thumb requests internal referer, thumbs below size x
   * total requests external referer, any size
   * total requests unknown referer, any size
   * total byte count


 * The current proposal for x is:

// Images with at least that width are considered high quality
private final static int HIGH_QUALITY_PIXEL_BOUNDARY_IMAGE = 1024;
// Movies with at least that height are considered high quality
private final static int HIGH_QUALITY_PIXEL_BOUNDARY_MOVIE = 480;

The rationale for distinguishing between large and small thumbs is that small thumbs often serve as placeholders in an article: just large enough to recognize an image and maybe make out some large features, but not large enough to make out many details or appreciate its aesthetic value. The problem, however, is that this virtual line between the two classes of thumbnails differs per type of image (painting, photo, diagrams/maps with small type). Even for one specific image, small vs large will be appreciated differently by a user with a large monitor than by a user with a smartphone. And the threshold may evolve over time as all devices gain higher resolution and bandwidth becomes less of a bottleneck.
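
As an illustration of the second proposal, the width of an image thumb can be read from the 'NNNpx-' prefix of its file name and compared with the proposed boundary of 1024 pixels. The sketch below is an assumption-laden simplification: the class name `ThumbSize`, the bucket labels and the prefix pattern are invented for this page, and movies (boundary 480, by height) are not handled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: buckets an image request as original, large or small thumb. */
final class ThumbSize {
    // Proposed value of x for images (width in pixels).
    static final int HIGH_QUALITY_PIXEL_BOUNDARY_IMAGE = 1024;

    // Thumb names start with optional markers such as "page1-", then "<width>px-".
    private static final Pattern WIDTH_PREFIX =
        Pattern.compile("^(?:[a-z]+[0-9]*-)*0*([1-9][0-9]{0,5})px-");

    static String bucket(String path) {
        if (!path.contains("/thumb/")) {
            return "original";
        }
        String lastSegment = path.substring(path.lastIndexOf('/') + 1);
        Matcher m = WIDTH_PREFIX.matcher(lastSegment);
        if (!m.find()) {
            return "thumb-unknown-size";
        }
        int width = Integer.parseInt(m.group(1)); // at most 6 digits, cannot overflow
        return width >= HIGH_QUALITY_PIXEL_BOUNDARY_IMAGE ? "thumb-large" : "thumb-small";
    }
}
```

Keeping the boundary in a single constant makes it easy to rerun the aggregation with a different x if the threshold is revisited later.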

(see talk page for examples from test runs).

Field specifics

 * mediafile (aka base_name)
 * The actual url under which the file is available, minus 'http(s)://upload.wikimedia.org/'

(!! base_name? mediafile? it contains part of the path, so it's not a filename, yet base_name is rather nondescript, what about media_path? media_base_path? )