Analytics/Archive/Mobile metrics

Mobile metrics were refined in the first half of 2012, thanks primarily to the work of Andre Engels. That work focused on reports derived from the sampled Squid logs, which are listed under Special => Server Requests on stats.wikimedia.org.

These Squid-based reports include:


 * mobile devices by browser type, apps, and device type
 * traffic that originates from mobile devices, whether it goes to the main site or the mobile site; this allows us to measure tablet traffic, which is currently not redirected to the mobile site
 * mobile site traffic from all sources (roughly equivalent to the MonthlyMobile traffic)
 * traffic from mobile sources by country

These reports also exclude bot traffic, which in theory results in a more accurate view of page traffic.

However, the Squid logs are sampled at 1:1000, whereas the traditional MonthlyMobile report is not. On the other hand, the MonthlyMobile report includes bot traffic, which may add roughly 3% to the page view numbers.
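To make the sampling concrete: a 1:1000 sampled count is scaled back up by the sampling factor to estimate actual volume. A minimal sketch (the 1:1000 rate comes from the text above; the function name and the example count are made up for illustration):

```python
# Scaling a 1:1000 sampled Squid log count to an estimated total.
# The sampling factor is from the text; the example count is hypothetical.
SAMPLING_FACTOR = 1000

def estimate_total_requests(sampled_count: int, factor: int = SAMPLING_FACTOR) -> int:
    """Estimate the actual request volume from a sampled log count."""
    return sampled_count * factor

# A hypothetical 4,200 sampled mobile hits would suggest roughly 4.2 million requests.
print(estimate_total_requests(4200))
```

Note that this scaling introduces statistical noise for low-traffic slices (e.g. small countries or rare user agents), which is the core trade-off against the unsampled MonthlyMobile counts.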

Here are the relevant links:


 * MonthlyMobile - the current standard for monthly page view metrics

Squid-based reports:
 * Clients
 * User Agents
 * Country
 * Device
 * OS

The mobile team has proposed using the new Squid reports as the main basis of monthly metrics. The following compares the two types of reports and provides a rationale for making a decision.

Overview
The following graphic shows the two ways that data analysis reports are generated internally, and that comScore is a separate source of metrics:



Pros and cons
Here is an overview of the pros and cons of each type of report:

MonthlyMobile
 * Pros
   * Consistent with the monthly report for the main site
   * Data is not sampled
 * Cons
   * Includes bot traffic (will eventually be removed)
   * Cannot be analyzed flexibly, for example by user agent
   * Fluctuates considerably month-to-month, possibly due to data loss

Squid reports
 * Pros
   * Useful analysis already done
   * Does not include bot traffic
   * Relatively consistent month-to-month
 * Cons
   * Heavily sampled (1:1000), which limits precision

Different data sources and logic
Definition according to webstatscollector
Derived from webstatscollector, which produces hourly aggregates based on incoming Squid logs.

Relevant filter criteria:
 * The URL in a log line contains /wiki/. This excludes /w/index.php? and special pages.
 * Not all public Wikimedia projects are counted (e.g. the Foundation wiki).
 * Any article namespace qualifies.
 * Bot hits are not filtered.
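A minimal sketch of the webstatscollector criterion in Python, assuming URL substring matching (the function name and the explicit special-page check are illustrative assumptions; the criteria themselves follow the list above):

```python
# Hypothetical sketch of the webstatscollector page view filter described above.
def webstatscollector_counts(url: str) -> bool:
    """Return True if a request URL would be counted as a page view."""
    if "/wiki/" not in url:            # excludes /w/index.php? requests
        return False
    if "/wiki/Special:" in url:        # special pages are excluded
        return False
    return True

print(webstatscollector_counts("/wiki/Amsterdam"))           # True
print(webstatscollector_counts("/w/index.php?title=Foo"))    # False
print(webstatscollector_counts("/wiki/Special:Search"))      # False
```

Note that, as stated above, this filter does not consider bot traffic or the project being hit; those limitations are what the normalization proposal below this section addresses.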

Definition according to stats.wikimedia.org
Derived from the sampled Squid log files.

Relevant filter criteria:
 * The mime type of the log line is text/html. This includes /w/index.php? and special pages, but it also means that 404s and other error codes are counted as article views.
 * All public Wikimedia projects are sampled.
 * Any article namespace qualifies.
 * Bot hits are not filtered.

To normalize these findings, if we want to, we have to make the following two changes:
 * For stats.wikimedia.org: the filter criterion should be a combination of mime type and return code (200, 301, 302).
 * For webstatscollector: the filter criteria should also include /w/index.php?
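The normalized stats.wikimedia.org criterion combines the mime type with the return code. A sketch under the assumptions above (function name is made up; the mime type and the 200/301/302 codes are from the text):

```python
# Hypothetical sketch of the normalized stats.wikimedia.org filter:
# count a log line only if it is text/html AND has a success or redirect status,
# so that 404s and other errors are no longer counted as article views.
OK_STATUSES = (200, 301, 302)

def stats_counts_normalized(mime_type: str, status: int) -> bool:
    """Return True if a log line would be counted under the normalized rule."""
    return mime_type == "text/html" and status in OK_STATUSES

print(stats_counts_normalized("text/html", 200))   # True
print(stats_counts_normalized("text/html", 404))   # False: error pages excluded
print(stats_counts_normalized("image/png", 200))   # False: not an HTML page
```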

There used to be a different emphasis: stats via webstatscollector [1] are traditionally known as page view stats (and we are going to make that definition more applicable by introducing a separate set with crawler requests excluded). Stats via sampled Squid logs [2] were a breakdown of traffic, a.k.a. server requests (and they do what they say: among other things, count HTML pages). Over time there came more overlap in usage (starting with regional breakdowns, and recently with the breakdown per mobile platform) and a shift in interpretation, with the traffic reports being read as a way to gauge our reach in certain functional areas. Thus the term page views took hold there as well (in the regional reports), but not consistently. In fact, the traffic reports are a hybrid between ops-oriented stats and usage/reach stats.