Requests for comment/Media file request counts

Background
Since 2008 Wikimedia has collected pageview counts for most pages on nearly all wikis. A longstanding request of stakeholders (editors, researchers, GLAM advocates) has been to publish similar counts for media files: images, sounds, videos. A major obstacle was the existing traffic data collection software: Webstatscollector simply couldn't be scaled up further without incurring huge costs. Even the page view counts weren't complete: mobile counts were available per wiki only, not per page.

In 2014 WMF engineers rolled out a new Hadoop based infrastructure, which makes it possible to collect raw request counts for media files. So a few months after releasing extended pageview counts (with mobile/zero added), the time has come to produce similar data dumps for media files.

Problem
What will be the exact specifications of the new media file request dumps?

Definitions

 * Media files:
 * All images, sound files and videos on the WMF upload servers (url '..//upload.wikimedia.org/..') which are stored in a 'white listed' folder (see File paths below) and have a 'white listed' extension (see File extensions below).


 * Note 1: These images are mostly embedded in articles, but can be requested separately.
 * Note 2: Images which are served to users from e.g. the bits servers (e.g. navigation icons) are therefore out of scope.


 * File requests:
 * (!! to be defined, not 'views', how about 304's ?)


 * File rendition modifier
 * Part of the url, which shows at which size and in which file format a media file was re-rendered (best term here?).
 * This will be the last part of the url, if applicable, after the file name for the original file (see the examples below).


 * If this part of the url is not available, the image request will be considered to be for the original file.
 * If it is available the (vertical) image size will be harvested from the string and used for counting this request as high or low quality image/video.


 * e.g. https://upload.wikimedia.org/wikipedia/commons/thumb/4/4a/Commons-logo.svg/762px-Commons-logo.svg.png (modifier: 762px-Commons-logo.svg.png)
 * e.g. http://upload.wikimedia.org/wikipedia/commons/thumb/5/58/Stevejobs_Macworld2005.jpg/qlow-440px-Stevejobs_Macworld2005.jpg (modifier: qlow-440px-Stevejobs_Macworld2005.jpg)
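The harvesting of the size from the rendition modifier could look roughly like the sketch below. It is only an illustration: the class name RenditionSize, the method pixelSize and the simplified regular expression are assumptions made for this example, and the production pattern handles many more cases (pages, languages, seek offsets, transcoded videos).

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RenditionSize {
    // Simplified, hypothetical pattern: optional "qlow-"/"mid-" prefix, then "<digits>px-".
    // The digit count is capped so that Integer.parseInt cannot overflow.
    private static final Pattern SIZE =
        Pattern.compile("^(?:(?:qlow|mid)-)?0*([1-9][0-9]{0,5})px-");

    /** Returns the pixel size from the rendition modifier, or empty for the original file. */
    static OptionalInt pixelSize(String url) {
        if (!url.contains("/thumb/")) {
            return OptionalInt.empty(); // no rendition modifier: request for the original
        }
        String lastSegment = url.substring(url.lastIndexOf('/') + 1);
        Matcher m = SIZE.matcher(lastSegment);
        return m.find() ? OptionalInt.of(Integer.parseInt(m.group(1))) : OptionalInt.empty();
    }

    public static void main(String[] args) {
        System.out.println(pixelSize(
            "https://upload.wikimedia.org/wikipedia/commons/thumb/4/4a/Commons-logo.svg/762px-Commons-logo.svg.png"));
    }
}
```

A request without a size in its last path segment is treated here as a request for the original file, matching the rule stated above.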

Packaging
As with pageviews, all requests per time unit (here: day) will be packaged into one file. Unlike with pageviews, not much download speed would be gained from splitting projects/wikis into separate files (this is still under consideration), as most media files are stored in one wiki: Commons.

Update frequency
The proposed update frequency is one dump file per day. Reasons to choose daily updates instead of, say, hourly updates (as with page views) are:
 * Although generating one daily file is less efficient than 24 hourly updates, it requires less post-processing for aggregation (getting from hourly to daily, then monthly, totals).
 * More convenient to download (due to much smaller per-day size).
 * The drawback of having only daily files seems acceptable, as there will be less need for hour-to-hour analysis of traffic than with page views. E.g. with major world events, images tend to appear online with more delay (read: asynchronously from the events as they develop).

File paths
Only requests with webrequest field "uri_host='upload.wikimedia.org'" are taken into consideration.

Only media files (inferred from file extension) will be included. Hence favicon.ico will be included, robots.txt will not.

Records for which the url does not correspond to a predefined pattern (as laid down in regular expressions) will be discarded. A separate background task will monitor sampled squid logs for these outliers, so that new patterns will be caught early on. Currently four folder hierarchies on the upload servers are included:
 *  //upload.wikimedia.org/ -project code- / -language code- / [thumb / ] [archive / ] @ / @@ / -image file name- [ / -file rendition modifier-] 
 * e.g. http://upload.wikimedia.org/wikipedia/commons/4/4a/Commons-logo.svg
 * e.g. https://upload.wikimedia.org/wikipedia/commons/thumb/4/4a/Commons-logo.svg/762px-Commons-logo.svg.png (-file rendition modifier- is 762px-Commons-logo.svg.png)
 * e.g. http://upload.wikimedia.org/wikipedia/commons/thumb/5/58/Stevejobs_Macworld2005.jpg/qlow-440px-Stevejobs_Macworld2005.jpg (-file rendition modifier- is qlow-440px-Stevejobs_Macworld2005.jpg)


 *  //upload.wikimedia.org/math/ @ / @ / @ / -math image file name- 
 * e.g. http://upload.wikimedia.org/math/f/f/f/fffffd30a4febac3dab210ae1537419e.png


 *  //upload.wikimedia.org/ -project code- / -language code- /timeline/ -timeline image file name- 
 * e.g. https://upload.wikimedia.org/wikipedia/en/timeline/d2d8e00fd34c75c9be9ce74f63be3517.png


 *  //upload.wikimedia.org/score/ @ / @ / -score folder name- / -score image file name- 
 * e.g. https://upload.wikimedia.org/score/7/a/7aem9jwwirkhn0ucbewj9gs7aofzc2b/7aem9jww.png

Legend:
 * each @ stands for one character in range 0-9a-f, sometimes 0-9a-z
 * -xxx- generic description
 * [..] = optional segment
 * language code includes strings like 'meta', 'commons' etc.
 * ignore spaces
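As an illustration of how the four folder hierarchies above could be told apart, here is a minimal sketch. The class UploadPathClassifier, its Kind enum and the four simplified regular expressions are assumptions made for this example; they are much looser than the production patterns and only mirror the path shapes listed above.

```java
import java.util.regex.Pattern;

final class UploadPathClassifier {
    enum Kind { UPLOADED, MATH, TIMELINE, SCORE, UNKNOWN }

    // Simplified, hypothetical patterns, one per folder hierarchy.
    // In UPLOADED the second hash folder must start with the character of the first one.
    private static final Pattern UPLOADED = Pattern.compile(
        "/[^/]+/[^/]+(?:/thumb)?(?:/archive)?/([0-9a-f])/\\1[0-9a-f]/[^/]+(?:/[^/]+)?");
    private static final Pattern MATH =
        Pattern.compile("/math/[0-9a-f]/[0-9a-f]/[0-9a-f]/[^/]+");
    private static final Pattern TIMELINE =
        Pattern.compile("/[^/]+/[^/]+/timeline/[^/]+");
    private static final Pattern SCORE =
        Pattern.compile("/score/[0-9a-z]/[0-9a-z]/[^/]+/[^/]+");

    /** Classifies the path part of an upload.wikimedia.org url (starting with '/'). */
    static Kind classify(String uriPath) {
        if (MATH.matcher(uriPath).matches()) return Kind.MATH;
        if (SCORE.matcher(uriPath).matches()) return Kind.SCORE;
        if (TIMELINE.matcher(uriPath).matches()) return Kind.TIMELINE;
        if (UPLOADED.matcher(uriPath).matches()) return Kind.UPLOADED;
        return Kind.UNKNOWN; // such records would be discarded and monitored separately
    }
}
```

Paths that match none of the patterns come back as UNKNOWN, which corresponds to the records that are discarded and left to the background monitoring task.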

File extensions
File extensions will be accepted from a hard coded 'white list', see (!! show simple white list, derived from code)

 /**
  * Pattern to match urls for plain uploaded media files
  */
 private static Pattern uploadedPattern = Pattern.compile(
     "(/[^/]*/" + wikiPattern.pattern + ")(/thumb|/transcoded)?(/archive|/temp)?"
     + "(/([0-9-a-f])/\\5[0-9-a-f])"
     + "/(?:([12][0-9]{3}[01][0-9][0-3][0-9][0-2][0-9][0-5][0-9][0-6][0-9])(?:!|%21))?"
     + "([^/]*)" // <-- this is the main file name
     + "(/(lossy-)?(?:lossless-)?(page[0-9]+-)?(lang[a-z-]*-)?((?:qlow|mid)-)?(?:0*([1-9]+[0-9]*)px-)?(seek(?:=|%3D)[0-9]+(?:\\.[0-9]*)?)?-?(?:\\7|thumbnail(?:\\.(?:djvu|ogv|pdf|svg|tiff?))?)(?:\\.(?:jpe?g|gif|png|ogg|0*([1-9][0-9]*)p\\.(?:webm|ogv)))?)?");

Hive source
Only requests flagged in Hive table webrequest as "webrequest_source='upload'" are taken into account (other valid values are 'bits', 'text', 'mobile')

Status codes

 * The following status codes will be on the 'white list':


 * 200 OK
 * 206 Partial content (only for first chunk of movie file)
 * 304 Not modified (under serious discussion)

For discussion on 206/304 see talk page.

by project/wiki
This happens implicitly, as -project code- and -language code- are part of the file path. Note that language code can also be e.g. 'commons' or 'meta'.

by file extension
Media files can be uploaded in many formats. We consider files with different file extensions as totally unrelated entities. E.g. blob.png and blob.jpg may show the same picture in the same size, etc, but that is not relevant here.

Exception: png renditions of svg files are not considered separate entities, as the -file rendition modifier- is stripped from the path. Example: https://upload.wikimedia.org/wikipedia/commons/thumb/4/4a/Commons-logo.svg/762px-Commons-logo.svg.png (-file rendition modifier- is 762px-Commons-logo.svg.png)

by rendered size
Let's make a distinction between streamed and non-streamed content. Streamed content can be viewed or listened to partially (even with skips back or rewinds).

Non-streamed content can be requested in a variety of prerendered/custom rendered versions. This applies particularly to images, where thumbs can be requested in any size.

(!! not sure about sound files, and whether this is relevant yet)

by revision
(!! treat blob.png and ../archive/../blob.png as one?) (!! how about revisions of timelines?)

by http referer
One aggregation level could be: source of request (aka http referer): internal/external/unknown.

In Oct 2014 a quick scan showed that referers for media files on the upload servers were:
 * ~92.5% internal (known)
 * 4.3% unknown
 * 3.2% external (known)

Reason to include external referers separately: half of the bandwidth in Oct 2014 was consumed by external sites which either embed Wikimedia's media files directly in their pages (BTW not forbidden by Wikimedia's policies) or link directly to a Wikimedia media file. The vast majority of internally requested media files are thumb images with negligible impact on bandwidth, hence the different breakdown for file counts and bytes sent.
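A rough sketch of the internal/external/unknown split is given below. Everything in it is an assumption for illustration: the class RefererClassifier, the list of domains treated as internal, and the choice to treat a missing, '-' or unparsable referer as unknown. The rules actually used for the Oct 2014 scan may differ.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.List;
import java.util.Locale;

final class RefererClassifier {
    enum Source { INTERNAL, EXTERNAL, UNKNOWN }

    // Assumption: referers on these domains (or their subdomains) count as internal.
    private static final List<String> INTERNAL_DOMAINS = List.of(
        "wikipedia.org", "wikimedia.org", "wiktionary.org", "wikibooks.org",
        "wikiquote.org", "wikisource.org", "wikinews.org", "wikiversity.org",
        "wikivoyage.org", "wikidata.org", "mediawiki.org");

    static Source classify(String referer) {
        if (referer == null || referer.isEmpty() || referer.equals("-")) {
            return Source.UNKNOWN;
        }
        String host;
        try {
            host = new URI(referer).getHost();
        } catch (URISyntaxException e) {
            return Source.UNKNOWN; // unparsable referer
        }
        if (host == null) {
            return Source.UNKNOWN; // e.g. a relative or opaque URI
        }
        host = host.toLowerCase(Locale.ROOT);
        for (String domain : INTERNAL_DOMAINS) {
            if (host.equals(domain) || host.endsWith("." + domain)) {
                return Source.INTERNAL;
            }
        }
        return Source.EXTERNAL;
    }
}
```

Matching on the domain suffix with a leading dot avoids counting look-alike hosts such as notwikipedia.org as internal.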

Encoding
File name encoding will be standardized, doing away with many browser artifacts (browsers may encode a url differently). This reduces the number of meaningless variations in actual file names, which would otherwise be counted separately and have to be merged in post-processing (an issue that exists for pageview counts, where all requested urls are counted as they are received).

File names will be decoded, then encoded (to avoid double encoding) with a custom function in IdentifyMediaFileUrl.java.
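The decode-then-encode step could be sketched as follows. This is not the code from IdentifyMediaFileUrl.java: the class UrlNormalizer, the choice of characters left unencoded, and the use of uppercase hex digits are all assumptions made for this example.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class UrlNormalizer {
    // Assumption: these characters stay as-is, everything else is percent-encoded.
    private static final String UNRESERVED =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~/";

    /** Percent-decodes the path once, then re-encodes it in one canonical form. */
    static String normalize(String path) {
        StringBuilder out = new StringBuilder();
        for (byte b : decode(path)) {
            int v = b & 0xFF;
            if (UNRESERVED.indexOf(v) >= 0) {
                out.append((char) v);
            } else {
                out.append('%').append(String.format("%02X", v));
            }
        }
        return out.toString();
    }

    // Turns "%XX" escapes into raw bytes; a malformed escape is kept as a literal '%'.
    private static byte[] decode(String path) {
        byte[] in = path.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        for (int i = 0; i < in.length; i++) {
            if (in[i] == '%' && i + 2 < in.length) {
                int hi = Character.digit((char) (in[i + 1] & 0xFF), 16);
                int lo = Character.digit((char) (in[i + 2] & 0xFF), 16);
                if (hi >= 0 && lo >= 0) {
                    bytes.write(hi * 16 + lo);
                    i += 2;
                    continue;
                }
            }
            bytes.write(in[i]);
        }
        return bytes.toByteArray();
    }
}
```

With this approach a name that one browser sends as A_(b).jpg and another as A_%28b%29.jpg ends up under the same key, so the two are counted together.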

Headers
It would be helpful to have column headers on the first line of the data file. But generating these directly from the Hive query may be difficult; this is to be investigated. If the file has to be post-processed (uncompress, add line, compress) just for this header line, that seems not worth it.

Field separator
Tabs as field separators seem to conform best to the new WMF operations conventions, and are less inclined to break than e.g. commas or spaces. But chime in if you think differently, and please explain why.

Empty values
Nearly all fields are counts, and 'none found' will just be '0'

Other empty values (if any?) will be shown as '-' (dash), rather than Hive default NULL.
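To make the separator and empty-value conventions concrete, here is a small sketch of how one output line could be built. The class TsvLine and its format method are hypothetical helpers for this example, not part of the actual job.

```java
import java.util.StringJoiner;

final class TsvLine {
    /** Joins the fields with tabs; a missing (null) value is written as '-'. */
    static String format(Object... fields) {
        StringJoiner line = new StringJoiner("\t");
        for (Object field : fields) {
            line.add(field == null ? "-" : field.toString());
        }
        return line.toString();
    }
}
```

Counts are passed in as numbers, so 'none found' naturally appears as 0, while only genuinely missing values become a dash.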

File compression
Unlike page view dumps this file will be published as .bz2 (test shows 20% further reduction in size compared to .gz).

A test file for one day in Oct 2014 was approx. 108 MB in bz2 format.

Sort order
Sensible options for sort seem to be:
 * alphabetical by first field (will compress best)
 * numerical by total request (lots of tiny navigation icons will top the list)
 * numerical by high quality requests (advised)

The last option has the small added benefit of showing the most popular images of decent size first, thus taking away the need for a sort step for e.g. top 100 lists.
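The advised sort order could be sketched like this. The Row class and its field names are assumptions for this example; ties are broken alphabetically by file name, which is a choice made here and not something the proposal specifies.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class MediaCountSorter {
    /** Hypothetical, minimal record of one output line. */
    static final class Row {
        final String mediafile;
        final long totalRequests;
        final long highQualityRequests;

        Row(String mediafile, long totalRequests, long highQualityRequests) {
            this.mediafile = mediafile;
            this.totalRequests = totalRequests;
            this.highQualityRequests = highQualityRequests;
        }
    }

    /** Returns a copy sorted by high quality requests, highest first. */
    static List<Row> sortByHighQuality(List<Row> rows) {
        List<Row> sorted = new ArrayList<>(rows);
        sorted.sort(Comparator.comparingLong((Row r) -> r.highQualityRequests)
            .reversed()
            .thenComparing(r -> r.mediafile));
        return sorted;
    }
}
```

With the rows in this order, a top 100 list is simply the first 100 lines of the file.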

Fields
Some fields are (based on) native fields in Hive table webrequest. Some fields are (based on) derived fields via a user defined function (UDF), see the Gerrit patch at https://gerrit.wikimedia.org/r/#/c/169346/.

Set of fields to show
The actual set of fields is still under discussion, please chime in. Two (emerging) variation proposals exist:
 * Thumbs all counted in one column
   * mediafile/base_name
   * total requests
   * total requests internal referer, full size
   * total requests internal referer, thumbs, any size
   * total requests external referer, any size
   * total requests unknown referer, any size
   * total byte count

 * Thumbs broken down by thumb size
   * mediafile/base_name
   * total requests
   * total requests internal referer, full size
   * thumb requests internal referer, thumbs at or above size x
   * thumb requests internal referer, thumbs below size x
   * total requests external referer, any size
   * total requests unknown referer, any size
   * total byte count

The current proposal for x is:

 // Images with at least that width are considered high quality
 private final static int HIGH_QUALITY_PIXEL_BOUNDARY_IMAGE = 1024;
 // Images with at least that height are considered high quality (!! read Movies)
 private final static int HIGH_QUALITY_PIXEL_BOUNDARY_MOVIE = 480;
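A sketch of how these boundaries might be applied when counting a request is shown below. The class QualityClassifier, the Quality enum and the convention that a missing size (null) means "original file" are assumptions for this example; only the two boundary values are taken from the proposal above.

```java
final class QualityClassifier {
    enum Quality { ORIGINAL, HIGH, LOW }

    // Boundary values as currently proposed.
    static final int HIGH_QUALITY_PIXEL_BOUNDARY_IMAGE = 1024; // width in pixels
    static final int HIGH_QUALITY_PIXEL_BOUNDARY_MOVIE = 480;  // height in pixels

    /** Classifies an image request; widthPx is null when there is no rendition modifier. */
    static Quality classifyImage(Integer widthPx) {
        return classify(widthPx, HIGH_QUALITY_PIXEL_BOUNDARY_IMAGE);
    }

    /** Classifies a movie request; heightPx is null when there is no rendition modifier. */
    static Quality classifyMovie(Integer heightPx) {
        return classify(heightPx, HIGH_QUALITY_PIXEL_BOUNDARY_MOVIE);
    }

    private static Quality classify(Integer sizePx, int boundary) {
        if (sizePx == null) {
            return Quality.ORIGINAL;
        }
        return sizePx >= boundary ? Quality.HIGH : Quality.LOW;
    }
}
```

Keeping the boundaries as named constants makes it easy to revisit the threshold later, which the notes below suggest may become necessary.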

Note 1: The rationale for distinguishing between large and small thumbs is that small thumbs often serve as placeholders in an article, just large enough to recognize an image and maybe make out some large features, but not large enough to make out many details and/or appreciate the aesthetic value. The problem, however, is that this virtual line between the two kinds of thumbnails differs per type of image (painting, photo, diagram or map with small type). Even for one specific image, small vs large will be appreciated differently by a user with a large monitor and a user with a smartphone. And the threshold may evolve over time as all devices gain larger resolution and bandwidth becomes less of a bottleneck.

Note 2a: Will a breakdown into internal / external / unknown referers meet actual use cases?

Note 2b: On the other hand, if this distinction is useful, a further breakdown into original, high quality, low quality for all 3 types of referers would be feasible, but probably huge overkill? Most focus will be on the majority of internally referred content.

Note 3: For Movies we have no thumbs, so separate high_quality or low_quality counts are questionable?

(see talk page for examples from test runs).

Field specifics
 * mediafile (aka base_name)
Definition: The actual url under which the original media file can be accessed, minus 'http(s)://upload.wikimedia.org/'

(!! base_name? mediafile? It contains part of the path, so it's not a filename, yet base_name is rather nondescript. What about media_path? media_base_path?)

The choice not to break this field down further into project code, language code, and file name is for two reasons:
 * The data dumps will be processed by scripts only
 * Now it is easy to manually check the actual content of the file by prefixing the field content with 'http(s)://upload.wikimedia.org/' and pasting it into a browser address bar.

 * bytes
Definition: 'The number of bytes sent resulting from all requests for an image or sound file, regardless of size or referer'


 * Note 1: Transport size does not have to match file size; there may be transport encoding (but most files are compressed already).
 * Note 2: For video files, where data are mostly streamed and the user can skip back or rewind, the exact definition, or even the usefulness, of this field is under discussion.
 * Note 3: Whereas field 'bytes sent' in the page views data file hasn't seen much (or any) usage over the years, chances are this will be different for media files, as these tend to be very much larger, and resource/bandwidth analysis/monitoring might be more relevant here.