Analytics/Hypercube

=Background=

We are sketching a vision for an API to query pageview data that we think is extensible and will allow us to address many use cases. Rome wasn't built in a day, and neither will this be.

We call the project Hypercube -- a hypercube is an 'n-dimensional cube'. After an initial release, our intention is to:


 * expand the current dimensions: add namespaces to pages, add categories to pages, etc.
 * add new dimensions: country, browser family, etc.
 * improve the datasources: correct the ways in which webstatscollector data is inaccurate, add pageviews for mobile sites, etc.

Doing this without breaking the API requires that we have a general understanding of most potential features.

=Questions that the API should (eventually) be able to answer=

This is a non-exhaustive list of questions that we have collected over time.


 * Top 10 articles by pageviews
 * Top 10 articles by pageviews, only on the English Wikipedia
 * Top 10 countries by pageviews, accessing the English and German Wikipedias
 * Top 10 countries by pageviews, accessing the English and German Wikipedias, in a monthly timeseries from 2012-01-01 to 2012-12-31
 * Top N of dimension X by metric Y, filtered by dimensions X1, X2, X3, by timeseries T for period S to E, where:
 ** N: any number
 ** possible dimensions X, X1, X2, X3: articles, countries, categories, referrer, etc.
 ** possible metrics Y: pageviews, bytes served, etc.
 ** timeseries T: hourly, daily, monthly, yearly
 ** dates S and E specified as year[month[day[hour]]]
 * Daily pageviews by article, filtered by a finite article list
 * Total pageviews per namespace per project per timeseries
 * Total pageviews per namespace per project per timeseries, mobile only
 * Total pageviews per namespace per project per timeseries, broken down by country/region code
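As a sketch, the generalized "Top N of dimension X by metric Y" question above can be written down as a parameter set. The names here mirror the query format proposed in the API section below; none of them are final:

```python
# Hypothetical parameter set for: "Top 10 countries by pageviews,
# accessing the English and German Wikipedias, monthly for 2012".
# All parameter names are illustrative, not a committed interface.
query = {
    "metric": "pageviews",        # metric Y
    "dimension": "country",       # dimension X to rank by
    "filters": {"project": ["enwiki", "dewiki"]},  # filter dimensions X1..Xn
    "timeseries": "monthly",      # timeseries T
    "start-timestamp": "2012-01-01T00:00:00.000Z",  # period start S
    "end-timestamp": "2012-12-31T00:00:00.000Z",    # period end E
    "order": "desc",
    "limit": 10,                  # top N
}
```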

=Minimum Viable Prototype=

We will build the first prototype against the datastream generated by webstatscollector. This means that only the following dimensions will be available:
 * time
 * project
 * article title

Available Metrics (facts in data warehouse terminology):
 * hourly pageview count per article (excluding mobile and commons pageview counts)
 * hourly bytes sent per article
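The hourly webstatscollector files carry exactly these dimensions and metrics: each line holds a project code, an article title, a pageview count, and a bytes-sent total. A minimal parsing sketch (field order as in the public pagecounts dumps):

```python
def parse_pagecounts_line(line):
    """Parse one line of an hourly webstatscollector pagecounts file.

    Each line holds the four fields backing our dimensions and metrics:
    project, article title, hourly pageview count, hourly bytes sent.
    """
    project, title, views, byte_count = line.strip().split(" ")
    return {
        "project": project,
        "article": title,
        "pageviews": int(views),
        "bytes_sent": int(byte_count),
    }

record = parse_pagecounts_line("en Napoleon 1042 28731090")
```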

Initially, we will aggregate these metrics to daily values.
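A minimal sketch of that daily roll-up (the row layout here is illustrative, not a committed schema):

```python
from collections import defaultdict

def aggregate_daily(hourly_rows):
    """Roll hourly (day, hour, project, article, views) rows up to
    daily totals keyed by (day, project, article)."""
    daily = defaultdict(int)
    for day, hour, project, article, views in hourly_rows:
        daily[(day, project, article)] += views
    return dict(daily)

hourly = [
    ("2012-04-23", 0, "en", "Napoleon", 40),
    ("2012-04-23", 1, "en", "Napoleon", 55),
    ("2012-04-24", 0, "en", "Napoleon", 61),
]
totals = aggregate_daily(hourly)
# totals[("2012-04-23", "en", "Napoleon")] == 95
```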

We want to release a prototype as soon as possible, but it must still be useful to the community and viable from an architectural point of view.

=API=

The API will not be RESTful -- it will only support GET requests, and its output will not be compatible with the JSON output generated by stats.grok.se.

The proposed format for a query is:

 GET https://hypercube.wikimedia.org/v1/json/
     ?metric=pageviews
     &timeseries=daily
     &order=desc
     &filters=articles-category-or:A,B,C;articles:Main,Napoleon;project:enwiki
     &limit=10
     &start-timestamp=2012-04-23T00:00:00.000Z
     &end-timestamp=2012-09-30T00:00:00.000Z
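A sketch of building that query string programmatically. The hypercube.wikimedia.org host is this proposal's placeholder, not a live endpoint; `urlencode` is used so article titles and the `;`-separated filter list are escaped correctly:

```python
from urllib.parse import urlencode

# Assemble the example query from the proposed parameters.
# Parameter names follow the query format above and are not final.
params = {
    "metric": "pageviews",
    "timeseries": "daily",
    "order": "desc",
    "filters": "articles-category-or:A,B,C;articles:Main,Napoleon;project:enwiki",
    "limit": 10,
    "start-timestamp": "2012-04-23T00:00:00.000Z",
    "end-timestamp": "2012-09-30T00:00:00.000Z",
}
url = "https://hypercube.wikimedia.org/v1/json/?" + urlencode(params)
```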

==Output==

The output of a query could look like:
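The response schema is not yet specified; as a purely hypothetical sketch, a daily pageviews query might return something like the following, where every field name and value is an assumption:

```json
{
  "metric": "pageviews",
  "timeseries": "daily",
  "rows": [
    {"timestamp": "2012-04-23T00:00:00.000Z", "project": "enwiki", "article": "Main", "pageviews": 123456},
    {"timestamp": "2012-04-23T00:00:00.000Z", "project": "enwiki", "article": "Napoleon", "pageviews": 7890}
  ]
}
```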

==Parameters==

=Design Questions=


 * Should we convert the non-ASCII titles to Unicode?
 * Should we expose the bytes sent metric?