Analytics/Hypercube

Request for comment
Hypercube
Component: General
Creation date: 2013-10-09
Author(s): Diederik van Liere, Dan Andreescu
Document status: draft

Background

We are sketching a vision for an API to query pageview data that we think is extensible and will allow us to address many use cases. Rome wasn't built in a day, and neither will this API be: the functionality will be built out incrementally.

The codename for the project is Hypercube (a hypercube is an n-dimensional cube). After an initial release, our intention is to:

  • expand current dimensions: adding namespaces to pages, adding categories to pages, etc.
  • add new dimensions: country, browser family, etc.
  • improve the data sources: webstatscollector data is inaccurate in a few key ways
    • crawler and regular traffic are currently counted together in one metric
    • traffic to the Wikimedia mobile site is not counted at all yet
    • etc.
  • allow longer time horizons to be queried

Doing this without breaking the API requires that we have a general understanding of most potential features.

Questions that the API should (eventually) be able to answer

This is a non-exhaustive list of questions that we have collected over time.

  • Top 10 articles by pageviews
  • Top 10 articles by pageviews, only in English Wikipedia
  • Top 10 countries by pageviews, accessing English and German Wikipedia
  • Top 10 countries by pageviews, accessing English and German Wikipedia, in a monthly time series from 2012-01-01 to 2012-12-31
  • Top N dimension X by metric Y, filtered by dimensions X1, X2, X3, by time series T for period S to E (a worked example follows this list).
    • N: any number
    • possible dimensions X, X1, X2, X3: articles, countries, categories, referrer (categorised as direct/search engine/webpage), etc.
      • Suggestion: in the case of search engine referrals it would be important to have access to the "q=" parameter (i.e. the query string).
    • possible metrics Y: pageviews, bytes served, etc.
    • time series T: hourly, daily, monthly, yearly
    • dates S and E specified as year[month[day[hour]]]
  • Daily pageviews by article, filtered by a finite article list
  • Pageviews for a month or monthly time series by article, filtered by a finite article list, optionally with daily views
  • Total pageviews per namespace per project per time series
  • Total pageviews per namespace per project per time series, mobile only
  • Total pageviews per namespace per project per time series broken down by country/region code
  • Suggestion: pageviews for any given article revision.
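
To illustrate the generic form in the list above, the question about the top countries by pageviews on English and German Wikipedia maps onto the template roughly as follows. This is a sketch in Python; the slot names come from the template, and the filter notation is illustrative only.

# Illustrative only: one of the questions above expressed in the slots of the
# generic "Top N dimension X by metric Y ..." template; the notation is not final.
top_countries_query = {
    "N": 10,                                  # number of results
    "X": "country",                           # dimension to rank
    "Y": "pageviews",                         # metric
    "X1": {"project": ["enwiki", "dewiki"]},  # filter: English and German Wikipedia
    "T": "monthly",                           # time series granularity
    "S": "2012-01-01",                        # start of the period
    "E": "2012-12-31",                        # end of the period
}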

First Release and Future Plans

The parameters below list their description, whether they are required, and the values supported in the first release versus in the future:

  • metric: The measurement(s) to consider. Required: yes. First release: pageviews. Future: pageviews, bytes served, etc.
  • filter: How to reduce the scope of the question on a per-dimension basis. Required: no. First release: article:..., project:.... Future: article:..., project:..., category:..., country:..., etc.
  • timeseries: At what level to aggregate the results. Required: no. First release: none, daily. Future: none, hourly, daily, monthly, yearly.
  • start-timestamp: The beginning of the period we're interested in. Required: yes. First release: day resolution. Future: hour resolution.
  • end-timestamp: The end of the period we're interested in. Required: yes. First release: day resolution. Future: hour resolution.
  • order: How to sort results by the metric(s) specified. Required: no. First release: asc, desc. Future: asc, desc.
  • limit: The maximum number of results to return. Required: no. First release: a number. Future: a number.
  • pretty: Whether to make the output human readable. Required: no. First release: no, yes. Future: no, yes.
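
For illustration, a query that stays within first-release capabilities could be expressed with the parameters above as the following Python mapping; the filter syntax and parameter spellings follow the list but are not final.

# Illustrative only: a query restricted to first-release capabilities,
# expressed as a mapping of the parameters listed above.
first_release_query = {
    "metric": "pageviews",                         # only metric in the first release
    "filter": "project:enwiki;article:Main_Page",  # assumed per-dimension filter syntax
    "timeseries": "daily",                         # none or daily initially
    "start-timestamp": "2012-04-23",               # day resolution in the first release
    "end-timestamp": "2012-09-30",
    "order": "desc",
    "limit": 10,
    "pretty": "no",
}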

We will build the first prototype against the data stream generated by webstatscollector. This means that we will only have the following dimensions available:

  • time
  • project
  • article title

And we will only have the following metrics available:

  • pageview count per article (excluding mobile and Commons pageview counts)
  • bytes sent per article

We want to release a prototype as soon as possible, but one that is still useful to the community and viable from an architectural point of view.
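
To make that mapping concrete, the sketch below (Python) shows how the dimensions and metrics could be read from a single webstatscollector pagecounts line; it assumes the usual format of project code, article title, view count, and bytes, with the hour taken from the file the line comes from.

# A minimal sketch, assuming the usual webstatscollector pagecounts format:
#   "<project> <article_title> <view_count> <bytes_sent>"
# The time dimension comes from the hourly file the line was read from.
def parse_pagecounts_line(line: str) -> dict:
    project, article, views, bytes_sent = line.split(" ")
    return {
        "project": project,        # dimension: project
        "article": article,        # dimension: article title
        "pageviews": int(views),   # metric: pageview count
        "bytes": int(bytes_sent),  # metric: bytes sent
    }

# parse_pagecounts_line("en Main_Page 1341 9999")
#   -> {"project": "en", "article": "Main_Page", "pageviews": 1341, "bytes": 9999}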

API

The exact format of the API is not yet determined. Potential formats for a query could be:

GET https://hypercube.wikimedia.org/v1/json/
  ?metric=pageviews
  &timeseries=daily
  &order=desc
  &filters=category:A,B,C;project:enwiki
  &limit=10
  &start-timestamp=2012-04-23T00:00:00.000Z
  &end-timestamp=2012-09-30T00:00:00.000Z

Or:

GET https://hypercube.wikimedia.org/v1/json/?/pageviews/daily/desc/category:A,B,C;project:enwiki/10/2012-04-23T00:00:00.000Z/2012-09-30T00:00:00.000Z
Vote for #1. --Magnus Manske (talk) 14:13, 10 October 2013 (UTC)
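
If the query-string form (#1) is adopted, a client call might look roughly like the Python sketch below; the endpoint is the proposed one above and is not live, and the parameter names simply mirror the example.

# A sketch of calling the proposed query-string format (#1).
import requests

params = {
    "metric": "pageviews",
    "timeseries": "daily",
    "order": "desc",
    "filters": "category:A,B,C;project:enwiki",
    "limit": 10,
    "start-timestamp": "2012-04-23T00:00:00.000Z",
    "end-timestamp": "2012-09-30T00:00:00.000Z",
}
response = requests.get("https://hypercube.wikimedia.org/v1/json/", params=params)
result = response.json()  # expected to follow the output format in the next section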

Output

The output of a query could look like:

{
    "start_timestamp": "2012-04-23T00:00:00.000Z",
    "end_timestamp": "2012-09-30T00:00:00.000Z",
    "limit": 10,
    "order": "desc",
    "timeseries": "daily",
    "filters": "category:A,B,C;project:enwiki",
    "metric": {
        "pageviews": [
            {"project": "enwiki", "namespace": 0, "article":"Main", "timestamp": "2012-04-23T00:00:00.000Z", "count": 1341},
            {"project": "enwiki", "namespace": 0, "article":"Main", "timestamp": "2012-04-24T00:00:00.000Z", "count": 3415}
            ...
        ]
    }
}
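
A consumer of this format could, for example, regroup the flat "pageviews" rows into per-article time series. The Python sketch below assumes the structure shown above.

# A sketch that regroups the flat "pageviews" rows into per-article time
# series keyed by (project, namespace, article).
from collections import defaultdict

def group_by_article(result: dict) -> dict:
    series = defaultdict(dict)
    for row in result["metric"]["pageviews"]:
        key = (row["project"], row["namespace"], row["article"])
        series[key][row["timestamp"]] = row["count"]
    return dict(series)

# group_by_article(result)[("enwiki", 0, "Main")]
#   -> {"2012-04-23T00:00:00.000Z": 1341, "2012-04-24T00:00:00.000Z": 3415, ...}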


Alternative suggestion (trying to reduce redundancy while pre-grouping data):

...
        "pageviews": {
            "enwiki": [
              {"namespace": 0, "article":"Main", "timestamp":{"2012-04-23T00:00:00.000Z":[1341,9999]}}, // count, bytes (optional, determined by GET)
              ...
            ],
            ...
        }
...

Design Questions

  • Should we convert the non-ASCII titles to Unicode?
The titles should match page_title in the page table (with or without underscores)
  • Should we expose the bytes sent metric? This means quite a lot more storage space will be needed, depending on how far back we want to go historically.
GET option?
  • Is access to the underlying database useful? This may be delayed depending on how big the database is and whether or not we can put it in a cluster of labs machines or it has to live in production.
For mass-querying, yes. --Magnus Manske (talk) 14:14, 10 October 2013 (UTC)