Analytics/Wikistats/Database API

WMF has started automating the generation of the Monthly Report Card. WMF commissioned the Report Card in May 2009, primarily to provide key metrics for Wikimedia board and staff, but it is also available to the general public. It comprises about 40 charts, drawn from several internal and external (comScore) data sources. Although most of these sources are generated fully automatically, a considerable part of the Report Card itself is still produced manually, with a large spreadsheet as intermediary between CSV files and charts.

Note: the project is currently in an exploratory/brainstorming phase. Any feedback is highly appreciated, but please understand that no commitment can be made yet on timing or functionality. The primary goal of the project is to automate the provision of the key monthly metrics in the current report card, which is based on highly aggregated data. Extensibility and wider reuse of the data are important but secondary objectives for now. The MySQL database could be replaced by some other storage solution later. The focus right now is on the API, which ideally should remain stable.

In the new setup several layers are envisioned, which together will make the process more robust, more flexible, and more open:
 * Current CSV files will be streamlined for import into a database.
 * A new MySQL database will provide permanent and better structured data storage.
 * A new MediaWiki API call will allow internal and external processes to extract data from this database.
 * Possibly a new, highly modular presentation layer will extract and format data from the database and allow flexible grouping of charts (each chart a widget?).

Participants from WMF (May 2011):
 * Nimish Gautam: Design and implementation
 * Rob Lanphier: Coordination
 * Erik Moeller: Product Ownership
 * Erik Zachte: Design and implementation
 * Mani Pande: Research Advisor

New API call for data analysis metrics

 * concept documentation for new API call

* action=analytics *
  Collect data from the analytics database.

Parameters:
  metric      - Type of data to collect.
                About metric names: these include the source of the data, to allow for alternate sources of similar metrics, which are likely defined differently or have other intrinsic issues (e.g. precision/reliability).
                One value:
                  comscore_unique_visitors
                    definition: Unique persons that visited one of the Wikimedia wikis at least once in a certain month
                    filters: select_regions, select_countries
                  comscore_reach_percentage
                    definition: Percentage of total unique visitors to any web property which also visited a Wikimedia wiki
                    filters: select_regions, select_countries
                  squid_page_views
                    definition: Total articles (htm component) requested from nearly all Wikimedia wikis (exceptions are mostly special purpose wikis, e.g. wikimania wikis). Totals are based on the archived 1:1000 sampled squid logs.
                    filters: select_regions, select_countries, select_web_properties, select_projects, select_wikis, select_platform
                  dump_article_count
                    definition: All namespace 0 pages which contain an internal link, minus redirect pages (for some projects extra namespaces qualify)
                    filters: select_projects, select_wikis
                  dump_binary_count
                    definition: All binary files (nearly all of which are multimedia files) available for download/article inclusion on a wiki
                    filters: select_projects, select_wikis
                  dump_edits
                    definition: All edits on articles (as defined by dump_article_count)
                    filters: select_projects, select_wikis
                  dump_new_registered_editors
                    definition: All registered editors that in a certain month for the first time crossed the threshold of 10 edits since signing up
                    filters: select_projects, select_wikis
                  dump_active_editors_5
                    definition: All registered editors that made 5 or more edits in a certain month
                    filters: select_projects, select_wikis
                  dump_active_editors_100
                    definition: All registered editors that made 100 or more edits in a certain month
                    filters: select_projects, select_wikis
                Other metrics which are likely to follow at some stage (for now included for brainstorm purposes only):
                  squid_page_edits
                  worldbank_population_per_country
                  worldbank_internet_users_per_country
                Parameter is always required
  months      - First and last month to include in time series
                One value:
                  single month as yyyy-mm
                  month range as yyyy-mm;yyyy-mm
                Parameter is always required
  select_...  - Return data per month per qualifying row of data
                Specify the criteria per select parameter in any of four ways (only cB and cC can be combined):
                  cA: * for all known values, e.g. select_regions=*
                  cB: one or more codes separated by comma, e.g. select_regions=NA,SA
                  cC: one or more codes separated by plus sign, which returns the requested data totalled for all specified codes, e.g. select_regions=NA+SA
                  cD: highest n (number) occurrences, using values for the most recent selected month for ranking, e.g. select_countries=top:12
                Available select_... parameters:
                  select_regions cA cB cC
                    for valid region codes see Filter select_regions below
                  select_countries cB cC cD
                    for valid country codes see Filter select_countries below
                  select_web_properties cC cD
                    This parameter requires extra authorisation
                    Example: select_web_properties=top:10
                  select_projects cC
                    for valid project codes see Filter select_projects below
                  select_wikis cC
                    specify each wiki code as project:language, e.g. wp:en for English Wikipedia, wq:de for German Wikiquote
                    Example: select_wikis=wp:en,wp:de
                  select_editors cB cC
                    A for anonymous user, R for registered user, B for bot
                    Example: select_editors=R,A,R+A,B
                  select_edits cB cC
                    M for manual, B for bot-induced
                    Example: select_edits=M,B
                  select_platform cB cC (only squid_page_views)
                    M for mobile, N for non-mobile (does anyone know a better term?)
                    Example: select_platform=M,N,M+N
  normalized  - Y or N
                Only applies to squid_page_views, where data for each month are recalculated to 30 days (other metrics may follow)
                Default: N (WMF Report Card will use normalized time series when available)
  data        - One or more types of data to be returned, separated by comma
                Values:
                  time_series
                    returns ordered list of value pairs, one for each month within range
                  time_series_indexed
                    like time_series, but each month's value will be relative to the oldest month's value, which is always 100
                  percentage_growth_last_month, percentage_growth_last_year, percentage_growth_full_period
                    growth percentages are relative to the oldest value (80->100=25%)
                    although trivial, requesting these metrics through the API ensures all clients use the same calculation
                Default: time_series
  lang        - Language code, used for region and country names
                Default: en
                Supported: en
  format      - (csv,json,... see elsewhere)

Examples:
  api.php?action=analytics&months=2008-03;2011-03&metric=squid_page_views&select_countries=US,GB&select_platform=M,N&normalized=Y&data=time_series,percentage_growth_last_month,percentage_growth_last_year&format=xml
    returns four sets of metrics (time series plus two percentages): one for United States/mobile, one for United States/non-mobile, one for United Kingdom/mobile, one for United Kingdom/non-mobile
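
The derived values offered by the data and normalized parameters are simple to compute, which makes it easy to check what a client should expect back. The following Python sketch is illustrative only and not part of the proposed API: the function names are made up here, normalization is assumed to mean scaling a monthly total by 30 / (days in that month), and "last month" and "last year" growth are assumed to compare the newest value with the value 1 and 12 months earlier.

```python
import calendar


def normalize_to_30_days(year, month, value):
    """Rescale a monthly total to a 30-day month (assumed meaning of normalized=Y)."""
    days_in_month = calendar.monthrange(year, month)[1]
    return value * 30 / days_in_month


def index_series(values):
    """time_series_indexed: each value relative to the oldest one, which becomes 100."""
    base = values[0]
    return [value / base * 100 for value in values]


def percentage_growth(old, new):
    """Growth relative to the older value, so 80 -> 100 gives 25 (percent)."""
    return (new - old) / old * 100


def growth_summary(values):
    """Growth figures for monthly values ordered oldest to newest (assumed semantics)."""
    summary = {
        "percentage_growth_full_period": percentage_growth(values[0], values[-1])
    }
    if len(values) >= 2:
        # Newest month compared with the month before it.
        summary["percentage_growth_last_month"] = percentage_growth(values[-2], values[-1])
    if len(values) >= 13:
        # Newest month compared with the same month one year earlier.
        summary["percentage_growth_last_year"] = percentage_growth(values[-13], values[-1])
    return summary


series = [80, 90, 100]
print(index_series(series))                  # [100.0, 112.5, 125.0]
print(percentage_growth(80, 100))            # 25.0
print(normalize_to_30_days(2011, 1, 3100))   # 3000.0
```

Serving these figures from the API rather than recomputing them in each client keeps the rounding and base-month conventions in one place.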

Filter select_regions
For comscore_... filters:
 * AS = Asia Pacific
 * C = China
 * EU = Europe
 * I = India
 * LA = Latin-America
 * MA = Middle-East/Africa
 * NA = North-America
 * US = United States
 * W = World
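
As an illustration of the four ways to specify a select_... value (cA to cD), here is a minimal Python sketch that parses a value and validates it against the region codes above. The function name parse_select, its return shape, and the decision to accept a single plain code when either cB or cC is allowed are assumptions made for this example, not part of the proposal.

```python
REGION_CODES = {"AS", "C", "EU", "I", "LA", "MA", "NA", "US", "W"}


def parse_select(value, valid_codes, allowed_forms):
    """Parse a select_... value into ("all", None), ("top", n) or ("groups", [[code, ...], ...]).

    Each inner list of a "groups" result holds codes whose data would be totalled (cC);
    a list with a single code is a plain selection (cB).
    """
    def require(*forms):
        if not any(form in allowed_forms for form in forms):
            raise ValueError(f"{'/'.join(forms)} is not allowed for this parameter")

    if value == "*":
        require("cA")
        return ("all", None)

    if value.startswith("top:"):
        require("cD")
        n = int(value[len("top:"):])  # raises ValueError for non-numeric input
        if n < 1:
            raise ValueError("top:n needs n >= 1")
        return ("top", n)

    items = value.split(",")
    if len(items) > 1:
        require("cB")
    groups = []
    for item in items:
        codes = item.split("+")
        if len(codes) > 1:
            require("cC")
        else:
            require("cB", "cC")
        for code in codes:
            if code not in valid_codes:
                raise ValueError(f"unknown code: {code!r}")
        groups.append(codes)
    return ("groups", groups)


# select_regions accepts cA, cB and cC according to the parameter list.
REGION_FORMS = {"cA", "cB", "cC"}
print(parse_select("NA,LA", REGION_CODES, REGION_FORMS))     # ('groups', [['NA'], ['LA']])
print(parse_select("NA+LA,EU", REGION_CODES, REGION_FORMS))  # ('groups', [['NA', 'LA'], ['EU']])
```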

Filter select_countries
Valid country codes are ISO 3166-1
 * List of ISO 3166-1 country codes and English name
 * About 3166-1 standard

Filter select_projects
 * wb = Wikibooks
 * wk = Wiktionary
 * wn = Wikinews
 * wp = Wikipedia
 * wq = Wikiquote
 * ws = Wikisource
 * wv = Wikiversity
 * co = Commons
 * wx = Other projects

Filter select_wikis
Specify as project:language.
 * For valid project codes see select_projects above.
 * For full lists of valid language codes per project see Wikimedia projects.
 * The following overview of exported wiki databases (aka dumps) can also be useful: lists of Wikimedia dumps
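
To show how the pieces fit together, the following Python sketch assembles a request for the proposed call from a metric, a month range and a list of wikis. It is a sketch under assumptions: the helper names are invented here, the month range uses the semicolon form from the months parameter description, wikis are joined with commas as in the select_wikis example, and language codes are not validated because their lists live elsewhere.

```python
from urllib.parse import urlencode

PROJECT_CODES = {"wb", "wk", "wn", "wp", "wq", "ws", "wv", "co", "wx"}


def wiki_code(project, language):
    """Build a project:language wiki code, checking the project part only."""
    if project not in PROJECT_CODES:
        raise ValueError(f"unknown project code: {project!r}")
    return f"{project}:{language}"


def build_query(metric, first_month, last_month, wikis, fmt="json"):
    """Return a relative api.php URL for action=analytics (hypothetical helper)."""
    if first_month == last_month:
        months = first_month
    else:
        months = f"{first_month};{last_month}"
    params = {
        "action": "analytics",
        "metric": metric,
        "months": months,
        "select_wikis": ",".join(wiki_code(p, lang) for p, lang in wikis),
        "format": fmt,
    }
    # Keep ':', ';' and ',' readable instead of percent-encoding them.
    return "api.php?" + urlencode(params, safe=":;,")


print(build_query("dump_edits", "2010-01", "2010-12", [("wp", "en"), ("wp", "de")]))
# api.php?action=analytics&metric=dump_edits&months=2010-01;2010-12&select_wikis=wp:en,wp:de&format=json
```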