Analytics/Hypercube

Background
This page sketches an API for querying pageview data. We intend it to be extensible so that it can address many use cases. Rome wasn't built in a day and neither will this be: the functionality will be built out incrementally.

The codename for the project is Hypercube -- a hypercube is an 'n-dimensional cube'. After an initial release, our intention is to:


 * expand current dimensions: adding namespaces to pages, adding categories to pages, etc.
 * add new dimensions: country, browser family, etc.
 * improve the datasources: webstatscollector data is inaccurate in a few key ways
   * crawler and regular traffic are currently counted together in one metric
   * traffic to the Wikimedia mobile site is not counted at all yet
   * etc.
 * allow longer time horizons to be queried

Doing this without breaking the API requires that we have a general understanding of most potential features.

Questions that the API should (eventually) be able to answer
This is a non-exhaustive list of questions that we have collected over time.


 * Top 10 articles by pageviews
 * Top 10 articles by pageviews, only in English Wikipedia
 * Top 10 countries by pageviews, accessing English and German Wikipedia
 * Top 10 countries by pageviews, accessing English and German Wikipedia, in a monthly time series from 2012-01-01 to 2012-12-31
 * Top N dimension X by metric Y, filtered by dimensions X1, X2, X3, by time series T for period S to E.
   * N: any number
   * possible dimensions X, X1, X2, X3: articles, countries, categories, referrer (categorise referrer as direct/searchengine/webpage), etc.
     * Suggestion: for search engine referrals it would be important to have access to the "q=" parameter (i.e. the query string).
   * possible metrics Y: pageviews, bytes served, etc.
   * time series T: hourly, daily, monthly, yearly
   * dates S and E specified as year[month[day[hour]]]
 * Daily pageviews by article, filtered by a finite article list
 * Pageviews for a month or monthly time series by article, filtered by a finite article list, optionally with daily views
 * Total pageviews per namespace per project per time series
 * Total pageviews per namespace per project per time series mobile only
 * Total pageviews per namespace per project per time series broken down by country/region code
 * Suggestion: pageviews for any given article revision.
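
The generic "Top N dimension X by metric Y" question above could be modelled roughly as follows. This is only a sketch: the names TopNQuery, parse_period_bound and GRANULARITIES are hypothetical, and it assumes that year[month[day[hour]]] means concatenated digits without separators (for example 2012042313), which this page does not specify.

```python
from dataclasses import dataclass, field
from datetime import datetime

# Time series granularities listed on this page.
GRANULARITIES = ("hourly", "daily", "monthly", "yearly")

# Assumption: bounds are written as plain digits, e.g. "2012", "201204",
# "20120423" or "2012042313". The key is the length of the string.
_BOUND_FORMATS = {4: "%Y", 6: "%Y%m", 8: "%Y%m%d", 10: "%Y%m%d%H"}


def parse_period_bound(text: str) -> datetime:
    """Parse a year[month[day[hour]]] bound into a datetime."""
    if not text.isdigit() or len(text) not in _BOUND_FORMATS:
        raise ValueError(f"expected year[month[day[hour]]], got {text!r}")
    return datetime.strptime(text, _BOUND_FORMATS[len(text)])


@dataclass(frozen=True)
class TopNQuery:
    """Top N of `dimension` by `metric`, filtered, per time series bucket."""
    n: int
    dimension: str       # e.g. "article", "country"
    metric: str          # e.g. "pageviews", "bytes"
    timeseries: str      # one of GRANULARITIES
    start: datetime
    end: datetime
    filters: dict = field(default_factory=dict)  # dimension -> allowed values

    def __post_init__(self):
        if self.n < 1:
            raise ValueError("n must be a positive number")
        if self.timeseries not in GRANULARITIES:
            raise ValueError(f"unknown time series {self.timeseries!r}")
        if self.start > self.end:
            raise ValueError("start must not be after end")


# "Top 10 countries by pageviews, accessing English and German Wikipedia,
# in a monthly time series" for the year 2012.
query = TopNQuery(
    n=10,
    dimension="country",
    metric="pageviews",
    timeseries="monthly",
    start=parse_period_bound("201201"),
    end=parse_period_bound("201212"),
    filters={"project": ["enwiki", "dewiki"]},
)
```

Keeping the filters as a mapping from dimension to allowed values means that new dimensions could be added later without changing the shape of the query.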

First Release and Future Plans
We will build the first prototype against the datastream as generated by webstatscollector. This means that we will only have the following dimensions available:
 * time
 * project
 * article title

And we will only have the following metrics available:
 * pageview count per article (excluding mobile and commons pageview counts)
 * bytes sent per article
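
As an illustration of how little the first release has to work with, the sketch below turns one line of webstatscollector-style output into a record carrying the three dimensions and two metrics listed above. It assumes the space-separated layout "project title count bytes" and that the hour is known from the file being read. HourlyCount and parse_line are hypothetical names.

```python
from datetime import datetime
from typing import NamedTuple


class HourlyCount(NamedTuple):
    hour: datetime     # time dimension (assumed to come from the file name)
    project: str       # project dimension, e.g. "en"
    title: str         # article title dimension, as found in the line
    pageviews: int     # pageview count metric
    bytes_sent: int    # bytes sent metric


def parse_line(line: str, hour: datetime) -> HourlyCount:
    """Parse one 'project title count bytes' line (assumed layout)."""
    # Unpacking raises ValueError if the line does not have four fields.
    project, title, views, size = line.rstrip("\n").split(" ")
    return HourlyCount(hour, project, title, int(views), int(size))


record = parse_line("en Main_Page 42 123456\n", datetime(2012, 4, 23, 13))
```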

We want to release a prototype as soon as possible, provided that it is still useful to the community and architecturally viable.

API
The exact format of the API is not yet determined. Potential formats for a query could be:

GET https://hypercube.wikimedia.org/v1/json/?metric=pageviews&timeseries=daily&order=desc&filters=category:A,B,C;project:enwiki&limit=10&start-timestamp=2012-04-23T00:00:00.000Z&end-timestamp=2012-09-30T00:00:00.000Z

Or:

GET https://hypercube.wikimedia.org/v1/json/?/pageviews/daily/desc/category:A,B,C;project:enwiki/10/2012-04-23T00:00:00.000Z/2012-09-30T00:00:00.000Z


 * Vote for #1. --Magnus Manske (talk) 14:13, 10 October 2013 (UTC)
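
For the first (query-string) format, a client could assemble the URL along these lines. This is a sketch only: the API format is not yet determined, build_query_url is a hypothetical helper, and the "dimension:value,value;dimension:value" filter syntax is read off the example above rather than specified anywhere.

```python
from urllib.parse import urlencode

# Endpoint taken from the example on this page; it may change.
BASE_URL = "https://hypercube.wikimedia.org/v1/json/"


def build_query_url(metric, timeseries, filters, limit, start, end, order="desc"):
    """Build a query URL in the first proposed format.

    `filters` maps a dimension name to a list of allowed values.
    """
    # Serialize filters as "dim:a,b,c;dim2:x", following the example.
    filter_text = ";".join(
        f"{dimension}:{','.join(values)}" for dimension, values in filters.items()
    )
    params = [
        ("metric", metric),
        ("timeseries", timeseries),
        ("order", order),
        ("filters", filter_text),
        ("limit", limit),
        ("start-timestamp", start),
        ("end-timestamp", end),
    ]
    # Keep ":", "," and ";" unescaped so the URL matches the example.
    return BASE_URL + "?" + urlencode(params, safe=":,;")


url = build_query_url(
    metric="pageviews",
    timeseries="daily",
    filters={"category": ["A", "B", "C"], "project": ["enwiki"]},
    limit=10,
    start="2012-04-23T00:00:00.000Z",
    end="2012-09-30T00:00:00.000Z",
)
```

Some query-string parsers treat ";" as a parameter separator, so a real implementation might prefer to percent-encode it or to choose a different filter delimiter.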

Output
The output of a query could look like:

Alternative suggestion (trying to reduce redundancy while pre-grouping data):

Design Questions

 * Should we convert the non-ASCII titles to Unicode?
 * The titles should match page_title in the page table (with or without underscores)


 * Should we expose the bytes sent metric? Doing so would require considerably more storage space, depending on how far back we want to go historically.
 * GET option?


 * Is access to the underlying database useful? This may be delayed depending on how big the database is and whether or not we can put it in a cluster of labs machines or it has to live in production.
 * For mass-querying, yes. --Magnus Manske (talk) 14:14, 10 October 2013 (UTC)