Analytics/Hypercube

=Background=

We are sketching a vision for an API to query pageview data that we believe is extensible and will let us address many use cases. Rome wasn't built in a day, and neither will this be: the functionality will be built out incrementally.

The codename for the project is Hypercube -- a hypercube is an 'n-dimensional cube'. After an initial release, our intention is to:


 * expand current dimensions: adding namespaces to pages, adding categories to pages, etc.
 * add new dimensions: country, browser family, etc.
 * improve the datasources: webstatscollector data is inaccurate in a few key ways:
 ** crawler and regular traffic are currently counted in one metric
 ** traffic to the Wikimedia mobile site is not counted at all yet
 ** etc.
 * allow longer time horizons to be queried

Doing this without breaking the API requires that we have a general understanding of most potential features.

=Questions that the API should (eventually) be able to answer=

This is a non-exhaustive list of questions that we have collected over time.


 * Top 10 articles by pageviews
 * Top 10 articles by pageviews, only in English Wikipedia
 * Top 10 countries by pageviews, accessing English and German Wikipedia
 * Top 10 countries by pageviews, accessing English and German Wikipedia, in a monthly time series from 2012-01-01 to 2012-12-31
 * Top N dimension X by metric Y, filtered by dimensions X1, X2, X3, by time series T for period S to E (see the parameter sketch after this list)
 ** N: any number
 ** possible dimensions X, X1, X2, X3: articles, countries, categories, referrer (categorise referrer as direct/searchengine/webpage), etc.
 *** Suggestion: in case of searchengine referrals it would be important to have access to the "q=" parameter (i.e. the query string).
 ** possible metrics Y: pageviews, bytes served, etc.
 ** time series T: hourly, daily, monthly, yearly
 ** dates S and E specified as year[month[day[hour]]]
 * Daily pageviews by article, filtered by a finite article list
 * Pageviews for a month or monthly time series by article, filtered by a finite article list, optionally with daily views
 * Total pageviews per namespace per project per time series
 * Total pageviews per namespace per project per time series, mobile only
 * Total pageviews per namespace per project per time series, broken down by country/region code
 * Suggestion: pageviews for any given article revision.
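To make the generic template concrete: the question "Top 10 countries by pageviews, accessing English and German Wikipedia, in a monthly time series from 2012-01-01 to 2012-12-31" instantiates it roughly as follows. The parameter names below are hypothetical and chosen only to mirror the template; none of them belong to any implemented API.

<syntaxhighlight lang="python">
# Hypothetical instantiation of "Top N dimension X by metric Y, filtered by
# dimensions X1..X3, by time series T for period S to E". All names are
# illustrative placeholders, not a specified query format.
query = {
    "top_n": 10,                                   # N
    "dimension": "country",                        # X
    "metric": "pageviews",                         # Y
    "filters": {"project": ["enwiki", "dewiki"]},  # X1
    "timeseries": "monthly",                       # T
    "start": "2012-01-01",                         # S
    "end": "2012-12-31",                           # E
}
</syntaxhighlight>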

=First Release and Future Plans=

We will build the first prototype against the data stream generated by webstatscollector. This means that we will only have the following dimensions available:
 * time
 * project
 * article title

And we will only have the following metrics available:
 * pageview count per article (excluding mobile and Commons pageview counts)
 * bytes sent per article
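These dimensions and metrics map directly onto the fields of the hourly webstatscollector output. A minimal parsing sketch, assuming the field order of the public pagecounts files (project code, article title, view count, bytes sent):

<syntaxhighlight lang="python">
from typing import NamedTuple

class PagecountRow(NamedTuple):
    project: str     # e.g. "en" for English Wikipedia
    title: str       # percent-encoded article title, underscores for spaces
    views: int       # pageviews in the hour
    bytes_sent: int  # bytes sent in the hour

def parse_line(line: str) -> PagecountRow:
    # A pagecounts line looks like: "en Main_Page 242332 4737756101"
    project, title, views, bytes_sent = line.split()
    return PagecountRow(project, title, int(views), int(bytes_sent))
</syntaxhighlight>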

We want to release a prototype as soon as possible, but one that is still useful to the community and viable from an architectural point of view.

=API=

The exact format of the API is not yet determined. Potential formats for a query could be:

 GET https://hypercube.wikimedia.org/v1/json/
     ?metric=pageviews
     &timeseries=daily
     &order=desc
     &filters=category:A,B,C;project:enwiki
     &limit=10
     &start-timestamp=2012-04-23T00:00:00.000Z
     &end-timestamp=2012-09-30T00:00:00.000Z

Or:

 GET https://hypercube.wikimedia.org/v1/json/?/pageviews/daily/desc/category:A,B,C;project:enwiki/10/2012-04-23T00:00:00.000Z/2012-09-30T00:00:00.000Z


 * Vote for #1. --Magnus Manske (talk) 14:13, 10 October 2013 (UTC)
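For illustration, a client could call format #1 like this. The endpoint is not live and the parameter names are only the proposal above, so this is a sketch, not a working call:

<syntaxhighlight lang="python">
import requests

# Proposed endpoint and parameters from format #1; nothing here exists yet.
response = requests.get(
    "https://hypercube.wikimedia.org/v1/json/",
    params={
        "metric": "pageviews",
        "timeseries": "daily",
        "order": "desc",
        "filters": "category:A,B,C;project:enwiki",
        "limit": 10,
        "start-timestamp": "2012-04-23T00:00:00.000Z",
        "end-timestamp": "2012-09-30T00:00:00.000Z",
    },
)
response.raise_for_status()
data = response.json()
</syntaxhighlight>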

==Output==
The output of a query could look like:
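No output format has been fixed yet. As a sketch, the parsed JSON response to a daily pageviews query might look like the following Python literal, where every field name is illustrative:

<syntaxhighlight lang="python">
# Illustrative parsed response: one row per (timestamp, article) pair.
# All field names are hypothetical; nothing here is specified yet.
{
    "metric": "pageviews",
    "timeseries": "daily",
    "rows": [
        {"timestamp": "2012-04-23T00:00:00.000Z", "article": "A", "count": 1234},
        {"timestamp": "2012-04-23T00:00:00.000Z", "article": "B", "count": 987},
        {"timestamp": "2012-04-24T00:00:00.000Z", "article": "A", "count": 1199},
        {"timestamp": "2012-04-24T00:00:00.000Z", "article": "B", "count": 1005},
    ],
}
</syntaxhighlight>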

Alternative suggestion (trying to reduce redundancy while pre-grouping data):
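For example, the same data could share one timestamp axis so that each timestamp and article name appears only once (again purely hypothetical):

<syntaxhighlight lang="python">
# Same data as the sketch above, pre-grouped per article against a shared
# timestamp axis to avoid repeating timestamps and article names per row.
{
    "metric": "pageviews",
    "timeseries": "daily",
    "timestamps": ["2012-04-23T00:00:00.000Z", "2012-04-24T00:00:00.000Z"],
    "articles": {
        "A": [1234, 1199],
        "B": [987, 1005],
    },
}
</syntaxhighlight>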

=Design Questions=


 * Should we convert the non-ASCII titles to Unicode?
 ** The titles should be as the page_title in the page table (with or without underscores); see the decoding sketch below.
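Whichever convention is chosen, the conversion is mechanical: titles in the webstatscollector stream are percent-encoded, while page.page_title stores decoded UTF-8 text with underscores instead of spaces. A sketch, with a helper name of our own:

<syntaxhighlight lang="python">
from urllib.parse import unquote

def to_page_title(raw_title: str) -> str:
    # Hypothetical helper: decode a percent-encoded title from the
    # webstatscollector stream into the page.page_title form
    # (UTF-8 text, underscores kept as-is).
    return unquote(raw_title)

print(to_page_title("Caf%C3%A9"))        # Café
print(to_page_title("Albert_Einstein"))  # Albert_Einstein
</syntaxhighlight>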


 * Should we expose the bytes sent metric? This means quite a lot more storage space will be needed, depending on how far back we want to go historically.
 ** Make it a GET option?


 * Is access to the underlying database useful? This may be delayed depending on how big the database is, and on whether we can put it on a cluster of Labs machines or it has to live in production.
 ** For mass-querying, yes. --Magnus Manske (talk) 14:14, 10 October 2013 (UTC)