Analytics/Wikistats/Database API

Request for comment (RFC)
Database API
Component General
Creation date
Author(s) Erik Zachte, Diederik van Liere
Document status draft

Reportcard API

This document outlines the new API for the Reportcard. Please leave your thoughts at the Talk page. This proposal is incomplete.

Parameter Description Syntax
The MediaWiki API is very rich, and the Reportcard API builds on it. Every call to the Reportcard API must include action=analytics; each further parameter is appended with an ampersand (&). action=analytics
theme The different metrics are grouped in 8 themes:
  1. readers: pageviews, visitor-related data, and offline usage. Data sources: squid pageviews and comScore.
  2. editors: Data source: SQL database.
  3. devices: Data source: squid logs.
  4. articles: Data source: SQL database.
  5. diversity: Data sources: custom scripts and SQL database.
  6. ecosystem: Data source: squid logs (API usage).
  7. media: Data source: squid logs.
  8. context: Data source: third-party statistics.
metric The metric indicates for which measure you want to fetch data.

Open question: How do we expose the available metrics?

country_code We use ISO 3166-1 country codes. See About ISO 3166-1 Standard. country_code=GB
project The project for which you want to query the data. The default choice is Wikipedia (wp). Valid choices are:

* wb = Wikibooks
* wk = Wiktionary
* wn = Wikinews
* wp = Wikipedia
* wq = Wikiquote
* ws = Wikisource
* wv = Wikiversity
* co = Commons
* wx = Other projects

language_code The language of the project that you want to query. By default no language is specified, so data for all languages of the project are returned. language_code=FR
from A yyyy-mm-dd string that marks the start of the timeframe. Returned data include this date. HH:MM:SS is not currently supported. from=2012-01-01
to A yyyy-mm-dd string that marks the end of the timeframe. Returned data include this date. HH:MM:SS is not currently supported. to=2012-06-01
format The only output format currently supported is JSON. It is the default and does not need to be specified explicitly. format=json
meta Fetch information about relevant actions for the Reportcard API. This action cannot be combined with any of the other parameters. meta=list_metrics, meta=list_geographies
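
As a rough illustration of how the parameters in the table might combine into a single call, the sketch below assembles a query string in Python. The endpoint `https://example.org/w/api.php` and the helper name `build_reportcard_url` are assumptions made for this example only; the proposal does not name an endpoint. Parameters are joined with `&`, as in other MediaWiki API calls.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; the proposal does not specify one.
API_ENDPOINT = "https://example.org/w/api.php"


def build_reportcard_url(endpoint=API_ENDPOINT, **params):
    """Return a Reportcard API URL with action=analytics as first parameter."""
    if "meta" in params and len(params) > 1:
        # The table states that meta cannot be combined with other parameters.
        raise ValueError("meta cannot be combined with other parameters")
    query = [("action", "analytics")]
    for name, value in params.items():
        # 'from' is a Python keyword, so callers pass it as 'from_'.
        query.append((name.rstrip("_"), value))
    return endpoint + "?" + urlencode(query)


url = build_reportcard_url(
    theme="readers",
    project="wp",
    language_code="FR",
    from_="2012-01-01",
    to="2012-06-01",
)
print(url)
```

A `meta` request would be built the same way, for instance `build_reportcard_url(meta="list_metrics")`.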

Example JSON output

Please fill in.

Old API call for data analysis metrics

concept documentation for new API call
* action=analytics *
Collect data from the analytics database.
  metric       - Type of data to collect. 
                 About metric names: these include source of data, to allow for alternate sources of similar metrics, which likely are defined differently or have other intrinsic issues (e.g. precision/reliability).
                 One value:
                     definition: Unique persons that visited one of the Wikimedia wikis at least once in a certain month
                     filters: selectregions, selectcountries
                     [implementation: table comscore, field visitors] 
                     definition: Percentage of total unique visitors to any web property which also visited a Wikimedia wiki
                     filters: selectregions, selectcountries
                     [implementation: table comscore, field reach] 
                     definition: Total articles (htm component) requested from nearly all Wikimedia wikis (exceptions are mostly special purpose wikis, e.g. wikimania wikis)
                                 Totals are based on the archived 1:1000 sampled squid logs.
                     filters: selectregions, selectcountries, selectwebproperties, selectprojects, selectwikis, selectplatform
                     [implementation: table page_views, field views_non_mobile_raw, views_mobile_raw, views_non_mobile_normalized, or views_mobile_normalized, depending on normalized and selectplatform]
                     definition: All namespace 0 pages which contain an internal link minus redirect pages (for some projects extra namespaces qualify)
                     filters: selectprojects, selectwikis
                     [implementation: table comscore, field reach]
                     definition: All binary files (nearly all of which are multimedia files) available for download/article inclusion on a wiki
                     filters: selectprojects, selectwikis
                     [implementation: table , field ]
                     definition: All edits on articles (as defined by dumparticlecount)
                     filters: selectprojects, selectwikis
                     [implementation: table wikistats, field edits]
                     definition: All registered editors who, in a certain month, crossed the threshold of 10 edits since signing up for the first time
                     filters: selectprojects, selectwikis
                     [implementation: table wikistats, field editors_new] 
                     definition: All registered editors that made 5 or more edits in a certain month
                     filters: selectprojects, selectwikis
                     [implementation: table wikistats, field editors_ge_5] 
                     definition: All registered editors that made 100 or more edits in a certain month
                     filters: selectprojects, selectwikis
                     [implementation: table wikistats, field editors_ge_100] 
                     definition: People who access Wikipedia through an offline reader
                     [implementation: table offline, field readers] 
                 other metrics which are likely to follow at some stage (for now included for brainstorm purposes only)
                Parameter is always required  
  startmonth      - First month to include in the time series, or the single month to include
                One value:
                  single month as yyyy-mm-dd
                Parameter is always required

  endmonth      - Last month to include in time series
                One value:
                  single month as yyyy-mm-dd
  select...  - Return data per month per qualifying row of data
                For each select parameter, specify the criteria in one of four ways (only cB and cC can be combined):
                  cA: * for all known values, e.g. selectregions=*
                  cB: one or more codes separated by pipe. e.g. selectregions=NA|SA
                  cC: one or more codes separated by plus sign, which returns required data totalled for all specified codes, e.g. selectregions=NA+SA
                  cD: highest n (number) occurrences, using values for the most recent selected month for ranking, e.g. selectcountries=top:12
                Available select..  parameters: 
                  selectregions cA cB cC
                    for valid region codes see Filter select_regions below
                  selectcountries cB cC cD
                    for valid country codes see Filter select_countries below
                  selectwebproperties cC cD
                    This parameter requires extra authorisation 
                    Example: selectwebproperties=top:10
                  selectprojects cC
                    for valid project codes see Filter select_projects below
                  selectwikis cC
                    specify each wiki code as project:language, e.g. wp:en for English Wikipedia, wq:de for German Wikiquote
                    Example: selectwikis=wp:en|wp:de  
                  selecteditors cB cC  
                    A for anonymous user, R for registered user, B for bot
                    Example: selecteditors=R|A|R+A|B
                  selectedits cB cC  
                    M for manual, B for bot-induced
                    Example: selectedits=M|B
                  selectplatform cB cC (only squidpageviews) 
                    M for mobile, N for non-mobile (does anyone know a better term?)
                    Example: selectplatform=M|N|M+N
  normalized   - Y or N
                 Only applies to squidpageviews, where data for each month are recalculated to 30 days (other metrics may follow)
                 Default: N (WMF Report Card will use normalized time series when available)
  data        - One or more types of data to be returned, separated by commas
                    returns ordered list of value pairs, one for each month within range
                    like timeseries, but each month's value will be relative to oldest month's value which is always 100
                    growth percentages are relative to oldest value (80->100=25%)
                    although trivial, requesting these metrics through API ensures all clients use same calculation
                 Default: timeseries 
  reportlanguage - Language code, used to expand region and country codes into region and country name
                 Default: en
                 Supported: en  
  format       - (csv,json,... see elsewhere)
    returns four sets of metrics (time series plus two percentages): one each for United States/mobile, United States/non-mobile, United Kingdom/mobile, and United Kingdom/non-mobile
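
The normalized and data options described above amount to small calculations, which might be sketched in Python as follows. This is an illustration under stated assumptions, not the server's implementation: the function names are invented for this example, and the 30-day normalization is assumed to be a simple linear rescaling by the number of days in the month.

```python
import calendar


def normalize_to_30_days(value, year, month):
    """Rescale a monthly total to a 30-day month (assumed linear scaling)."""
    days_in_month = calendar.monthrange(year, month)[1]
    return value * 30 / days_in_month


def indexed_series(values):
    """Express each value relative to the oldest value, which becomes 100."""
    oldest = values[0]
    return [value / oldest * 100 for value in values]


def growth_percentages(values):
    """Percentage growth of each value relative to the oldest value."""
    oldest = values[0]
    return [(value - oldest) / oldest * 100 for value in values]


# January 2012 has 31 days, so the monthly total is scaled by 30/31.
print(normalize_to_30_days(3100, 2012, 1))
# The document's own example: going from 80 to 100 is 25% growth.
print(indexed_series([80, 100]))
print(growth_percentages([80, 100]))
```

Both derived series divide by the oldest value, so a real implementation would need to decide what to return when that value is zero.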

Further details on value ranges

Filter select_regions

For comscore_... filters:

AS = Asia Pacific
C  = China
EU = Europe
I  = India
LA = Latin America
MA = Middle East/Africa
NA = North America
US = United States
W  = World

Filter select_countries

Valid country codes are ISO 3166-1 codes.

Filter select_projects

wb = Wikibooks
wk = Wiktionary
wn = Wikinews
wp = Wikipedia
wq = Wikiquote
ws = Wikisource
wv = Wikiversity
co = Commons 
wx = Other projects 

Filter select_wikis

Specify as project:language
For valid project codes see select_projects above
For full lists of valid language codes per project see Wikimedia projects 
The following overview of exported wiki databases (aka dumps) can also be useful: lists of Wikimedia dumps
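
To make the select... value syntax concrete, here is a minimal Python sketch of how a client or server might parse the four criteria forms (`*`, pipe-separated codes, plus-combined codes, and `top:n`) and split a project:language wiki code. The function names and the returned tuple shapes are assumptions for illustration; the project codes are the ones listed under Filter select_projects.

```python
import re

# Project codes as listed under Filter select_projects.
PROJECT_CODES = {"wb", "wk", "wn", "wp", "wq", "ws", "wv", "co", "wx"}


def parse_select(value):
    """Parse a select... value.

    Returns ("all", None) for "*", ("top", n) for "top:n", and otherwise
    ("groups", [[...], ...]): one inner list per pipe-separated item, holding
    the codes joined by "+" whose data are to be totalled.
    """
    if value == "*":
        return ("all", None)
    match = re.fullmatch(r"top:(\d+)", value)
    if match:
        return ("top", int(match.group(1)))
    groups = [group.split("+") for group in value.split("|")]
    if any(code == "" for group in groups for code in group):
        raise ValueError(f"empty code in select value: {value!r}")
    return ("groups", groups)


def split_wiki_code(code):
    """Split a wiki code such as 'wp:en' into (project, language)."""
    project, language = code.split(":", 1)
    if project not in PROJECT_CODES:
        raise ValueError(f"unknown project code: {project!r}")
    return project, language


print(parse_select("R|A|R+A|B"))
print(parse_select("top:12"))
print(split_wiki_code("wq:de"))
```

Representing every item as a group keeps the pipe and plus forms uniform, which matches the statement that only cB and cC can be combined.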

Return value

The outermost group of the return value is named after the metric and carries the API call that generated the object, the start date of any time series returned, and the granularity. The objects inside the metric list all filters that were used to obtain the data, along with any constraints on the data.
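
The time_start and granularity fields in this section's example suggest how a client could attach dates to the values of a time series. The Python sketch below assumes that the values are evenly spaced, that the first value starts at time_start, and that time_start uses the 14-digit MediaWiki timestamp format; the function name `dated_points` is invented for this example.

```python
from datetime import datetime, timedelta


def dated_points(time_start, granularity, data):
    """Pair each data value with the start of the interval it covers."""
    start = datetime.strptime(time_start, "%Y%m%d%H%M%S")
    step = timedelta(seconds=int(granularity))
    return [(start + i * step, value) for i, value in enumerate(data)]


# Field values taken from the example in this section.
points = dated_points("20110213000000", "2592000", [14, 16, 72, 9034])
for when, value in points:
    print(when.date().isoformat(), value)
```

A granularity of 2592000 seconds is exactly 30 days, so the dates drift away from calendar month boundaries over a long series.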

  "comscore_views": {  // metric name
      {
          "country_code": "us",  // filters that apply to this data, needed to interpret it properly
          "language_code": "en",
          "language_name": "English",
          "normalized": "false",
          "modality": "indexed",
          "data_type": "time_series",  // the data type
          "data": [14, 16, 72, 9034],  // the data itself
          "comments": "COMMENT STRING"  // any additional comments
      },
      { "country_code": "gb", "language_code": "en", ...... },
      { "country_code": "fr", "language_code": "fr", ...... },
      "generated_by": "APICALL",  // API call which generated this data
      "time_start": "20110213000000",  // start date of all the data in this object, in MW timestamp format
      "granularity": "2592000",  // granularity of this data in seconds
      "report_language": "en"
  }