Requests for comment/Media file request counts

Background
Since 2008 Wikimedia has collected pageview counts for most pages on nearly all wikis. A longstanding request from stakeholders (editors, researchers, GLAM advocates) has been to publish similar counts for media files: images, sounds, videos. A major obstacle was the existing traffic data collection software: webstatscollector simply couldn't be scaled up further without incurring huge costs. Even page view counts weren't complete: there were no per-page mobile counts, just per-wiki totals.

In 2014 WMF engineers rolled out a new Hadoop-based infrastructure, which makes it possible to collect raw request counts for media files. So a few months after the release of extended pageview counts (with mobile/zero added), the time has come to produce similar data dumps for media files.

Problem
What will be the exact specifications of the new media file request dumps?

Definition of media files
Media files are all images, sound files and videos on the WMF upload servers (url '..//upload.wikimedia.org/..'). These files are mostly embedded in articles, but can be requested separately. Out of scope are therefore images which are served to users from e.g. the bits servers (e.g. navigation icons).

The primary (and only?) defining criterion is therefore the location:

Currently three folder hierarchies on the upload servers are included:
 *  //upload.wikimedia.org/ -project code- / [archive] / -language code- / [thumb] / @ / @@ / -image file name- 
 * e.g. http://upload.wikimedia.org//wikipedia/commons/4/4a/Commons-logo.svg


 *  //upload.wikimedia.org/math/ @ / @ / @ / -math image file name- 
 * e.g. http://upload.wikimedia.org/math/f/f/f/fffffd30a4febac3dab210ae1537419e.png


 *  //upload.wikimedia.org/ -project code- / @@ /timeline/ -timeline image file name- 
 * e.g. https://upload.wikimedia.org/wikipedia/en/timeline/d2d8e00fd34c75c9be9ce74f63be3517.png

(!! may be incomplete, check)

Legend:
 * each @ stands for one character in range 0-9a-f
 * -xxx- generic description
 * [..] = optional segment
 * language code includes strings like 'meta', 'commons', etc.
 * ignore spaces
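The three hierarchies above can be sketched as path classifiers. This is a minimal illustration, not the production Hive logic: the regexes follow the segment order as listed in this RFC (which is flagged as possibly incomplete), and the character classes for project/language codes are assumptions.

```python
import re

# Regexes approximating the three folder hierarchies listed above, applied
# to the URL path after the '//upload.wikimedia.org' host part.
MEDIA = re.compile(
    r"^/(?P<project>[\w-]+)"    # -project code-
    r"(?:/archive)?"            # [archive] (position as listed in this RFC)
    r"/(?P<language>[\w-]+)"    # -language code- (incl. 'meta', 'commons')
    r"(?:/thumb)?"              # [thumb]
    r"/[0-9a-f]/[0-9a-f]{2}"    # @ / @@
    r"/(?P<file>.+)$"           # -image file name-
)
MATH = re.compile(
    r"^/math/[0-9a-f]/[0-9a-f]/[0-9a-f]/(?P<file>[0-9a-f]+\.\w+)$"
)
# NB: the listing shows '@@' for the second segment, but the example URL
# has a language code ('en'); a generic segment is used here.
TIMELINE = re.compile(
    r"^/(?P<project>[\w-]+)/(?P<language>[\w-]+)/timeline/(?P<file>[0-9a-f]+\.\w+)$"
)

def classify(path):
    """Return the hierarchy a request path belongs to, or None (out of scope)."""
    for name, rx in (("math", MATH), ("timeline", TIMELINE), ("media", MEDIA)):
        if rx.match(path):
            return name
    return None
```

For example, `/robots.txt` falls through all three patterns and is classified as out of scope, matching the filtering question below.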

(question: do we filter out other file paths, or perhaps some extensions? e.g. http://upload.wikimedia.org/robots.txt)

Update frequency
The proposed update frequency is daily dump files. Reasons to choose daily updates instead of, say, hourly updates (as with page views) are:
 * Generation is more cost-effective than hourly updates (one daily job, to be scheduled at the most convenient time of day, in terms of overall system activity)
 * Requires less post-processing for aggregation.
 * More convenient to download (due to much smaller per-day size)

Aggregation level
by wiki
(!! happens implicitly, as -project code- and -language code- are part of the file path)

by file extension
Media files can be uploaded in many formats. We consider files with different file extensions as totally unrelated entities. E.g. blob.png and blob.jpg may show the same picture in the same size, etc., but that is not relevant here.

Also, png renditions of svg files are considered separate entities; any matching up is to be done by client software which processes the data dumps. Reasons: this avoids needless complication of the Hive scripts and further aggregation scripts, and keeping these entities as separate rows preserves more detail (otherwise the field count would explode).

by rendered size
Let's make a distinction between streamed and non-streamed content. Streamed content can be shown/listened to partially (even with skip backs/rewinds).

Non-streamed content can be requested in a variety of prerendered/custom rendered versions. This goes particularly for images where thumbs can be requested in any size.

(!! not sure about sound files, and whether this is relevant yet)

(!! show two choices: 1 original vs large (arbitrary) vs small thumbs 2 original vs thumbs ) (!! large is relative to image type, e.g. diagrams and texts may scale down poorly) (!! large may be something that is seen differently in 3 years from now) (!! large may be different in different device context)
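To make choice 1 above concrete, here is a sketch of a size-bucketing function. It assumes the common `<N>px-` filename prefix convention for thumb renditions (an assumption, not specified in this RFC), and the 300px boundary is arbitrary, since, as noted, 'large' depends on image type, device context and era.

```python
import re

# Thumb renditions are assumed to carry a '<N>px-' filename prefix;
# originals have no such prefix.
THUMB_SIZE = re.compile(r"^(\d+)px-")

def size_bucket(filename, small_max=300):
    """Classify a request as 'original', 'small thumb' or 'large thumb'.

    small_max (300px here) is an arbitrary illustrative boundary.
    """
    m = THUMB_SIZE.match(filename)
    if not m:
        return "original"
    return "small thumb" if int(m.group(1)) <= small_max else "large thumb"
```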

by revision
(!! treat blob.png and ../archive/../blob.png as one?) (!! how about revisions of timelines?)

by http referer
One aggregation level could be the source of the request (aka http referer): internal/external/unknown.

In Oct 2014 a quick scan showed referers for media files on the upload servers were:
 * ~92.5% internal (known)
 * 4.3% unknown
 * 3.2% external (known)

Reason to include external referers separately: half of the bandwidth in Oct 2014 was consumed by external sites which either embed Wikimedia's media files directly in their pages (BTW not forbidden by Wikimedia's policies) or link directly to a Wikimedia media file. The vast majority of internally referred requests are for thumb images with negligible impact on bandwidth, hence the different breakdown for file counts and bytes sent.
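A sketch of the internal/external/unknown breakdown follows. The list of Wikimedia domains and the treatment of empty or '-' referers are assumptions for illustration, not the logic of the actual UDF.

```python
from urllib.parse import urlparse

# Assumed set of Wikimedia project domains; incomplete by design.
WIKIMEDIA_SUFFIXES = (
    ".wikipedia.org", ".wikimedia.org", ".wiktionary.org", ".wikibooks.org",
    ".wikiquote.org", ".wikisource.org", ".wikinews.org", ".wikiversity.org",
    ".wikivoyage.org", ".wikidata.org", ".mediawiki.org",
)

def referer_class(referer):
    """Classify a referer header as 'internal', 'external' or 'unknown'."""
    if not referer or referer == "-":   # missing or placeholder referer
        return "unknown"
    host = urlparse(referer).hostname or ""
    if host.endswith(WIKIMEDIA_SUFFIXES) or host in ("wikipedia.org", "wikimedia.org"):
        return "internal"
    return "external"
```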

Fields

 * native fields in the hive table webrequests
 * derived fields based on user defined function (UDF)
 * See patch https://gerrit.wikimedia.org/r/#/c/169346/