Analytics/Pageviews/Webstatscollector

From mediawiki.org

Webstatscollector is a legacy tool that takes sampled logs and generates page view counts for all Wikimedia projects. Until 2015, these page view counts were used in the monthly report card: http://reportcard.wmflabs.org/graphs/pageviews
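As a rough illustration of turning sampled logs into per-project counts, here is a minimal Python sketch. It is not webstatscollector's actual code: the line format ("<project> <page_title>"), the sampling factor of 1000, and the function name count_pageviews are all assumptions made up for this example.

```python
from collections import Counter

# Hypothetical sampling factor: assume one request in every 1000 is logged.
SAMPLING_FACTOR = 1000


def count_pageviews(sampled_lines, sampling_factor=SAMPLING_FACTOR):
    """Estimate per-project page view totals from sampled log lines.

    Each line is assumed to look like "<project> <page_title>", a made-up
    format used only for this illustration.
    """
    sampled = Counter()
    for line in sampled_lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip lines that do not match the assumed format
        project, _title = parts
        sampled[project] += 1
    # Scale each sampled count up to an estimate of the full traffic.
    return {project: n * sampling_factor for project, n in sampled.items()}


example = [
    "en Main_Page",
    "en Albert_Einstein",
    "de Wikipedia:Hauptseite",
]
estimates = count_pageviews(example)
```

Because the counts come from a sample, the scaled totals are estimates, and projects with little traffic will have noisier numbers than large ones.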

Webstatscollector's Pageview Definition

This diagram shows what webstatscollector counts as a page view: https://github.com/wikimedia/analytics-metrics/blob/master/pageviews/webstatscollector/pageview_definition.png

Webstatscollector has a limited definition of what constitutes a page view: it over-counts some actions and under-counts others. After research in 2014/15, a new standard definition was established at https://meta.wikimedia.org/wiki/Research:Page_view . The WMF Research team has also studied the differences between the new pageview counts and webstatscollector's counts.

Vital Signs

For a brief period (December 2014), Vital Signs displayed pageview data using webstatscollector's definition. The definition was implemented on Refinery's Hadoop cluster using Hive, processing raw webrequest logs. You can tell which definition is displayed in Vital Signs by clicking on the "Daily Pageviews" title of the graph. The link takes you to the pageview definition used to generate the data in the graph.

Architecture

See https://wikitech.wikimedia.org/wiki/Analytics/Webstatscollector

Storage

Hive is involved. See wikitech:Analytics/Cluster/Hive