Talk:Requests for comment/Media file request counts

NARA pilot
At the Zürich Hackathon in May 2014 a 'NARA' pilot was run to nail down relevant Hive queries for GLAM. It is relevant here, but no one-to-one translation of the requirements surfaced at that time. For one, a geographical breakdown of requests is not (yet) available: breaking down the rows by country in the dataset proposed here would explode the file size.

Test runs
Incomplete examples from two test runs (ignore extra whitespace; the layout is column-based).

First run (all thumbs in one count):

mediafile                                           req_total  req_int_full  req_int_thumbs  req_ext          bytes
/wikipedia/commons/4/4a/Commons-logo.svg            41569730           954         41556404    12372    59198663369
/wikipedia/en/4/4a/Commons-logo.svg                 41382929          1010         41373778     8141    66620759690
/wikipedia/commons/f/fa/Wikiquote-logo.svg          28020764           543         28016206     4015    47132189012
/wikipedia/en/b/bc/Wiki.png                         26643030      26341441             1021   300568   523179005167
/wikipedia/commons/2/23/Icons-mini-file_acrobat.gif 22808095      22521938           221214    64943     6803053997
/wikipedia/commons/4/4c/Wikisource-logo.svg         21153210           234         21149420     3556    39723888186
/wikipedia/en/9/99/Question_book-new.svg            16232299           929         16225236     6134    80921974777
(!! add req_unknown)

Second run (large and small thumbs counted separately):

base_name                                              total   original  high_quality    low_quality
/wikipedia/commons/4/4a/Commons-logo.svg            46950570      13380         27129       46910061
/wikipedia/en/4/4a/Commons-logo.svg                 45997381       9261          3133       45984987
/wikipedia/commons/f/fa/Wikiquote-logo.svg          31759104       4587          1981       31752536
/wikipedia/en/b/bc/Wiki.png                         29846080   29844995             0           1085
/wikipedia/foundation/2/20/CloseWindow19x19.png     26264640   26264640             0              0
/wikipedia/commons/2/23/Icons-mini-file_acrobat.gif 24394138   24143535             0         250603
/wikipedia/commons/4/4c/Wikisource-logo.svg         23027261       3805          1811       23021645
/wikipedia/en/9/99/Question_book-new.svg            17954807       7092          1747       17945968
(!! no distinction made by referer in this test run for new UDF)

On status codes
About 206: this applies to timed media, i.e. video and sound files.
 * 'Partial content' gets sent when a browser's request was interrupted and it restarts it, signalling that it only needs a certain part of the content. For media files, this is often due to seeking in a video or audio stream.
 * When seeking in audio or video files, new requests for only part of the file are made; the varnishes answer these with HTTP status 206. Also, when playing a file directly within a web page, the OGV viewer issues one request for the file that is answered with a 200, and two requests that are answered with a 206. Since many 206s are just additional requests for the same file, it seems right not to count all of them.
 * On closer inspection: the 200 need not be for the media file itself; it may be (and mostly is) for the thumbnail of the media file. So the GET for the actual media file ends up being answered with a 206. But not all 206 responses are requests for the full file, so we end up counting only 206s for the first chunk of data (a chunk of roughly 9k). A few of those may be re-requests for the first chunk, if that chunk got interrupted by the user; for video files 9k is just a few seconds, for sound files it may occur a bit more often. (paraphrased from comments by Chris) Erik Zachte (WMF) (talk) 17:32, 14 November 2014 (UTC)
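To make the counting rule above concrete, here is a minimal sketch of the decision: count 200s, count a 206 only when it delivers the first chunk of the file, and skip 304s. The field names are illustrative assumptions, not the actual varnish/Hive log schema.

```python
def should_count(status, range_header):
    """Decide whether one logged response counts as a media file request.

    status: HTTP status code of the response (int).
    range_header: the request's Range header, or None if absent.
    Field names are illustrative; the real log schema may differ.
    """
    if status == 200:
        return True
    if status == 206:
        # Only the 206 that delivers the start of the file is counted;
        # seek-induced 206s for later byte ranges are follow-up requests
        # for the same file and would inflate the count.
        if range_header is None:
            return True
        return range_header.startswith("bytes=0-")
    # 304 'not modified' pings and errors are not counted.
    return False
```

So a request with `Range: bytes=0-9215` would count once, while a later seek to `Range: bytes=51200-` on the same file would not.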

On which file events to count
Several approaches could be taken, some radically different. Please specify which approach you'd prefer, and why.

1 Count every occurrence where a request is issued to a Wikimedia server. This approach comes close to 'file requests'. Even when most images are served from the local cache, the WMF server gets pinged to check whether the file has changed (and in most cases answers with a 304 'not modified'). This approach is the most inclusive, but could lead to tens of extra counts for one media file within one user session (especially for recurring navigation icons and the like), and could count many files the user never got to see.

2 Count every occurrence where a media file is rendered on a page (thus ignoring prefetched MediaViewer images, which never get rendered). This approach comes close to 'file views'.

3 Count actual downloads of a media file, including those downloads which never get rendered on an HTML page. This approach comes close to 'file transfers'.

4 Count actual downloads of a media file, minus those downloads which never get rendered on an HTML page. This approach comes close to 'file transfers which reached the user'. It aims to count each media file only once per session (session length depending on browser cache settings), and only when a user really benefited from a prefetch. Never-shown images are discarded as not effective (though defensible on technical grounds).
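As a toy illustration of how the four definitions diverge, consider classifying the requests of one hypothetical session (assuming the logs let us tell 304s, renders, and prefetches apart, which is exactly what is under discussion):

```python
# Each event: (status, was_rendered) for one media file request in a session.
# Hypothetical data: a cache-revalidation 304, two real transfers, one of
# which (a prefetch) is never rendered on a page.
session = [
    (304, False),  # cache revalidation ping, file unchanged
    (200, True),   # actual download, shown to the user
    (200, False),  # prefetched download, never rendered
]

requests = len(session)                                   # approach 1
views = sum(1 for s, r in session if r)                   # approach 2
transfers = sum(1 for s, r in session if s == 200)        # approach 3
effective = sum(1 for s, r in session if s == 200 and r)  # approach 4
```

For this session the four approaches would report 3, 1, 2, and 1 respectively, which is the whole point of the question: the same traffic yields very different numbers depending on the definition chosen.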


 * Erik Zachte (WMF) (talk) 16:04, 19 November 2014 (UTC) (this approach shows the lowest counts (I would prefer to say: the least inflated counts), and more than the other approaches attempts to assess where servers were actively involved in feeding relevant content to the user). The fact that a disabled or very short-lived browser cache leads to multiple actual file transfers during one user session is unavoidable, and will make the numbers slightly harder to interpret. But in the vast majority of cases each size of each image is counted only once (or not at all, for never-rendered prefetches). This seems the kind of approach a museum director would prefer: "How large an audience did we reach with this donated image?" Approaches 1 and 2 are relatively unambiguous (not depending on browser cache settings), but who wants those numbers except WMF ops (who have other information sources)?

"Even when most images are served from the local cache, the WMF server gets pinged to check whether the file has changed (and in most cases answers with a 304 'not modified')" - in my testing this doesn't seem to be true. The browser sends If-Modified-Since requests for some images (which are typically answered with a 304), and doesn't send any request for some others. I would guess the difference is in how long ago the image was cached by the browser. --Tgr (WMF) (talk) 00:02, 13 December 2014 (UTC)
 * Thanks, Tgr. That is good to know. But as we propose not to count 304's anyway, this will not affect our counts. Cheers, Erik Zachte (talk) 12:53, 15 December 2014 (UTC)

Context
How about file embedding from 3rd parties? MediaViewer and the StockPhoto gadget generate embedding code, an analytics parameter could be added there. --Tgr (WMF) (talk) 22:22, 10 January 2015 (UTC)
 * There will be one column for external requests; breaking those down further (not sure if that's what you propose) seems overkill. Erik Zachte (WMF) (talk) 21:17, 27 January 2015 (UTC)

MediaViewer / Prefetched images
Recap: There has been discussion since the 2014 Amsterdam hackathon about whether and how prefetched images which are never shown to the user can be sanitized out of our metrics.
 * At first it seemed best to ignore all downloads by MediaViewer, and to generate a new 'dummy' event at the client the moment an image actually gets presented to the user. This would be accurate, but as it turns out rather complex, more so than we initially thought.
 * The next best thing seemed to be to count only the images which are explicitly requested by a user and which cause MediaViewer to fire up, and to fully ignore follow-up (prefetched) images (which the user may or may not browse later). This would result in undercounting actually shown images, but probably not by that much. It would require adding a label to the URL to flag follow-up images as such. However, this extra parameter would adversely affect our cache. See.

So now we are back to square one, and need to weigh our options. I can see three:
 * 1) Somehow fix the overcount in Hadoop by detecting which images were last-in-chain for some IP address within a given time window, and not counting these (ignoring the finer point that ip:human is not 1:1; at least it would make our lists of most requested images make more sense).
 * 2) Count all images downloaded by MediaViewer, regardless of whether these were ever shown to the user.
 * 3) Count no images downloaded by MediaViewer at all, until a better solution is available.
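For what it's worth, option 1 could be sketched roughly as below. The 60-second window, the event fields, and the per-IP grouping are all illustrative assumptions, not a worked-out design:

```python
from collections import defaultdict

WINDOW = 60  # seconds; hypothetical session window


def drop_last_in_chain(events):
    """events: iterable of (ip, timestamp, url) tuples, sorted by timestamp.

    Returns the events to count, omitting the final request of each
    per-IP burst, on the assumption that it was a prefetched image the
    user never browsed to.
    """
    bursts = defaultdict(list)
    counted = []
    for ip, ts, url in events:
        chain = bursts[ip]
        if chain and ts - chain[-1][0] > WINDOW:
            # Burst ended: count all but its final (presumed prefetched) event.
            counted.extend(e for _, e in chain[:-1])
            chain.clear()
        chain.append((ts, (ip, ts, url)))
    for chain in bursts.values():
        counted.extend(e for _, e in chain[:-1])
    return counted
```

Even this toy version has to keep per-IP state across the whole window, which hints at why option 1 looks complex and potentially a major performance overhead at the scale of the request logs.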

The first seems rather complex, and possibly a major performance overhead. The last two are both simple, but they differ totally in their basic assumptions. Can I put it this way: would you rather have incomplete but reliable counts, or overcomplete and fuzzy counts? Sure, the latter aren't fuzzy in some technical sense, but they are to the metrics consumer. Opting for overcomplete would result in almost 50% of the entries in 'Wikipedia most requested images' being totally out of place there. Those falsely top-rated images might actually be shown to almost no user, just because they happen to be the next image on a page after a hugely popular image. See the RFC example where the image of the Mona Lisa is followed by a rather nondescript piece of handwriting.

From a tactical perspective I see two advantages to 3:
 * There will be more incentive later to add something important which is missing than to filter out something unimportant which merely obfuscates.
 * Even more importantly, incomplete but reliable counts wouldn't give the new data dump a false start, where the quality of the data would rightfully be questioned.


 * I agree, option 3 is the best. For "complete" (i.e. wildly overestimated) information we already have the pageviews data and glamtools/baglama2/. Here we must necessarily aim for accuracy, for this exercise to be meaningful. --Nemo 22:24, 27 January 2015 (UTC)