Analytics/Hypercube

'''NOTE: This RFC proposes a new Analytics project. This RFC is not proposing to add this functionality to MediaWiki -- either in core or as an extension.'''

=Background=

We are sketching a vision for an API to query pageview data that we think is extensible and will allow us to address many use cases. Rome wasn't built in a day, and neither will this be: the functionality will be built out incrementally.

The codename for the project is Hypercube -- a hypercube is an ''n''-dimensional cube. After an initial release, our intention is to:

 * expand current dimensions: adding namespaces to pages, adding categories to pages, etc.
 * add new dimensions: country, browser family, etc.
 * improve the datasources: webstatscollector data is inaccurate in a few key ways, add pageviews to mobile sites, etc.
 * allow longer time horizons to be queried

Doing this without breaking the API requires that we have a general understanding of most potential features.

=Questions that the API should (eventually) be able to answer=

This is a non-exhaustive list of questions that we have collected over time.


 * Top 10 articles by pageviews
 * Top 10 articles by pageviews, only on English Wikipedia
 * Top 10 countries by pageviews, accessing English and German Wikipedia
 * Top 10 countries by pageviews, accessing English and German Wikipedia, in a monthly timeseries from 2012-01-01 to 2012-12-31
 * Top N dimension X by metric Y, filtered by dimensions X1, X2, X3, by timeseries T for period S to E, where:
 ** N: any number
 ** possible dimensions X, X1, X2, X3: articles, countries, categories, referrer, etc.
 ** possible metrics Y: pageviews, bytes served, etc.
 ** timeseries T: hourly, daily, monthly, yearly
 ** dates S and E are specified as year[month[day[hour]]]
 * Daily pageviews by article, filtered by a finite article list
 * Total pageviews per namespace per project per timeseries
 * Total pageviews per namespace per project per timeseries, mobile only
 * Total pageviews per namespace per project per timeseries, broken down by country/region code

=First Release and Future Plans=

We will build the first prototype against the datastream as generated by webstatscollector. This means that we will only have the following dimensions available:
 * time
 * project
 * article title

And we will only have the following metrics available:
 * pageview count per article (excluding mobile and Commons pageview counts)
 * bytes sent per article
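For illustration, a webstatscollector line can be mapped onto these dimensions and metrics with a few lines of Python. This assumes the usual whitespace-delimited "project title count bytes" layout of the pagecounts files; the helper name is our own:

```python
def parse_pagecounts_line(line):
    """Split one webstatscollector pagecounts line into its fields.

    Assumes the whitespace-delimited 'project title count bytes'
    layout, e.g. 'en Main_Page 242332 4737756101'. Article titles
    use underscores, so they contain no spaces.
    """
    fields = line.strip().split(" ")
    if len(fields) != 4:
        raise ValueError("unexpected pagecounts line: %r" % line)
    project, title, count, nbytes = fields
    return {
        "project": project,       # dimension: project
        "article": title,         # dimension: article title
        "pageviews": int(count),  # metric: pageview count
        "bytes": int(nbytes),     # metric: bytes sent
    }
```

Everything Hypercube can serve in the first release is some aggregation of these four fields.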

We want to release a prototype as soon as possible, but one that is still useful to the community and viable from an architectural point of view.

=API=

The exact format of the API is not yet determined. Potential formats for a query could be:

 GET https://hypercube.wikimedia.org/v1/json/
     ?metric=pageviews
     &timeseries=daily
     &order=desc
     &filters=category:A,B,C;project:enwiki
     &limit=10
     &start-timestamp=2012-04-23T00:00:00.000Z
     &end-timestamp=2012-09-30T00:00:00.000Z

Or:

 GET https://hypercube.wikimedia.org/v1/json/?/pageviews/daily/desc/category:A,B,C;project:enwiki/10/2012-04-23T00:00:00.000Z/2012-09-30T00:00:00.000Z
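Either style can be assembled mechanically from the same parameters. As a sketch of the query-string variant in Python, where `build_query_url` and the `key:v1,v2;key:v3` filter serialization are our own illustration rather than a settled format:

```python
from urllib.parse import urlencode

# Endpoint named in this RFC; the service does not exist yet.
BASE = "https://hypercube.wikimedia.org/v1/json/"

def build_query_url(metric, timeseries, order, filters, limit, start, end):
    """Build a Hypercube query URL in the query-string style.

    `filters` is a dict of dimension -> list of values, e.g.
    {"category": ["A", "B", "C"], "project": ["enwiki"]},
    serialized as 'category:A,B,C;project:enwiki'.
    """
    filter_str = ";".join(
        "%s:%s" % (dim, ",".join(values)) for dim, values in filters.items()
    )
    params = [
        ("metric", metric),
        ("timeseries", timeseries),
        ("order", order),
        ("filters", filter_str),
        ("limit", limit),
        ("start-timestamp", start),
        ("end-timestamp", end),
    ]
    # Keep ':', ',' and ';' literal so the URL matches the example above.
    return BASE + "?" + urlencode(params, safe=":,;")
```

A usage sketch: `build_query_url("pageviews", "daily", "desc", {"category": ["A", "B", "C"], "project": ["enwiki"]}, 10, "2012-04-23T00:00:00.000Z", "2012-09-30T00:00:00.000Z")` reproduces the first example URL.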

==Output==

The output of a query could look like:
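Since no response schema has been designed yet, the following JSON is purely a hypothetical shape for a daily pageviews result; every field name here is an assumption:

```json
{
  "metric": "pageviews",
  "timeseries": "daily",
  "start-timestamp": "2012-04-23T00:00:00.000Z",
  "end-timestamp": "2012-09-30T00:00:00.000Z",
  "rows": [
    {"timestamp": "2012-04-23T00:00:00.000Z", "project": "enwiki", "pageviews": 12345},
    {"timestamp": "2012-04-24T00:00:00.000Z", "project": "enwiki", "pageviews": 11234}
  ]
}
```

Echoing the query parameters back in the response would let clients pair results with the request that produced them.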

=Design Questions=


 * Should we convert non-ASCII titles to Unicode?
 * Should we expose the bytes sent metric?
 * Is access to the underlying database useful?
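To make the first design question concrete: article titles in the raw data arrive percent-encoded (an assumption about the webstatscollector format), and answering "yes" would mean decoding them before results are returned, e.g. with Python's `urllib.parse.unquote`:

```python
from urllib.parse import unquote

# %C3%BC is the percent-encoded UTF-8 form of 'ü', so decoding
# 'Z%C3%BCrich' yields the human-readable title 'Zürich'.
title = unquote("Z%C3%BCrich")
```

Answering "no" would push this decoding step onto every API consumer instead.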