Analytics/Wikistats/Database API

WMF has started a process to automate the generation of the Monthly Report Card. The Report Card was commissioned in May 2009 by WMF, primarily to provide key metrics for the Wikimedia board and staff, but it is also available to the general public. It comprises about 40 charts, drawn from several internal and external (comScore) data sources. Although most of these sources are generated fully automatically, the Report Card itself is currently produced largely by hand, with a large spreadsheet as intermediary between csv files and charts.

Note: the project/process is currently in an exploratory/brainstorming phase. Any feedback is highly appreciated, but please understand that no commitment can be made yet on timing or functionality. The primary goal of the project is to automate the provision of the key monthly metrics in the current report card, which are based on highly aggregated data. Extensibility and wider reuse of the data are important but secondary objectives for now. The MySQL database could be replaced by some other storage solution later; the focus right now is on the API, which ideally should remain stable whenever possible.

In the new setup several layers are envisioned which together will make the process more robust, more flexible, and more open:
 * Current csv files will be streamlined for import in a database.
 * A new MySQL database will provide permanent and better structured data storage.
 * A new Mediawiki API call will allow internal and external processes to extract data from this database.
 * Possibly a new highly modular presentation layer will extract and format data from the database and allow flexible grouping of charts (each chart a widget?)
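The first two layers above can be sketched as follows. This is only an illustration of the intended flow, not the actual implementation: sqlite3 stands in for the planned MySQL database, and the table layout, column names, and sample values are hypothetical.

```python
# Sketch of layers 1 and 2: streamlined csv rows loaded into a database.
# sqlite3 stands in for MySQL here; the table layout is hypothetical.
import csv
import io
import sqlite3

# A "streamlined" csv as envisioned in layer 1 (sample values invented).
csv_data = io.StringIO(
    "metric,month,value\n"
    "squidpageviews,2011-03,14000000000\n"
    "dumpactiveeditors5,2011-03,86000\n"
)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics (metric TEXT, month TEXT, value INTEGER)")

# Layer 2: permanent, better structured storage.
reader = csv.DictReader(csv_data)
conn.executemany(
    "INSERT INTO metrics VALUES (:metric, :month, :value)", reader
)
conn.commit()

# Layer 3 (the API) would then serve queries like this one.
row = conn.execute(
    "SELECT value FROM metrics"
    " WHERE metric = 'squidpageviews' AND month = '2011-03'"
).fetchone()
print(row[0])
```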

Participants from WMF (May 2011):
 * Nimish Gautam: Design and implementation
 * Rob Lanphier: Coordination
 * Erik Moeller: Product Ownership
 * Erik Zachte: Design and implementation
 * Mani Pande: Research Advisor

New API call for data analysis metrics

 * concept documentation for new API call

* action=analytics
   Collect data from the analytics database.

Parameters:

metric - Type of data to collect. Metric names include the source of the data, to allow for alternate sources of similar metrics, which are likely defined differently or have other intrinsic issues (e.g. precision/reliability). This parameter is always required. Values:
 * comscoreuniquevisitors
   definition: Unique persons that visited one of the Wikimedia wikis at least once in a certain month
   filters: selectregions, selectcountries
 * comscorereachpercentage
   definition: Percentage of total unique visitors to any web property who also visited a Wikimedia wiki
   filters: selectregions, selectcountries
 * squidpageviews
   definition: Total articles (htm component) requested from nearly all Wikimedia wikis (exceptions are mostly special-purpose wikis, e.g. wikimania wikis). Totals are based on the archived 1:1000 sampled squid logs.
   filters: selectregions, selectcountries, selectwebproperties, selectprojects, selectwikis, selectplatform
 * dumparticlecount
   definition: All namespace 0 pages which contain an internal link, minus redirect pages (for some projects extra namespaces qualify)
   filters: selectprojects, selectwikis
 * dumpbinarycount
   definition: All binary files (nearly all of which are multimedia files) available for download/article inclusion on a wiki
   filters: selectprojects, selectwikis
 * dumpedits
   definition: All edits on articles (as defined by dumparticlecount)
   filters: selectprojects, selectwikis
 * dumpnewregisterededitors
   definition: All registered editors who in a certain month for the first time crossed the threshold of 10 edits since signing up
   filters: selectprojects, selectwikis
 * dumpactiveeditors5
   definition: All registered editors who made 5 or more edits in a certain month
   filters: selectprojects, selectwikis
 * dumpactiveeditors100
   definition: All registered editors who made 100 or more edits in a certain month
   filters: selectprojects, selectwikis

Other metrics are likely to follow at some stage (for now included for brainstorming purposes only): squidpageedits, worldbankpopulationpercountry, worldbankinternetuserspercountry.

months - First and last month to include in the time series. This parameter is always required. Values:
 * single month as yyyy-mm
 * month range as yyyy-mm;yyyy-mm

select... - Return data per month per qualifying row of data. Specify the criteria for each select parameter in any of four ways (only cB and cC can be combined):
 * cA: * for all known values, e.g. selectregions=*
 * cB: one or more codes separated by a pipe, e.g. selectregions=NA|SA
 * cC: one or more codes separated by a plus sign, which returns the requested data totalled for all specified codes, e.g. selectregions=NA+SA
 * cD: highest n (number of) occurrences, using values for the most recent selected month for ranking, e.g. selectcountries=top:12

Available select... parameters:
 * selectregions (cA, cB, cC) - for valid region codes see the filter sections below
 * selectcountries (cB, cC, cD) - for valid country codes see the filter sections below
 * selectwebproperties (cC, cD) - this parameter requires extra authorisation. Example: selectwebproperties=top:10
 * selectprojects (cC) - for valid project codes see the filter sections below
 * selectwikis (cC) - specify each wiki code as project:language, e.g. wp:en for English Wikipedia, wq:de for German Wikiquote. Example: selectwikis=wp:en|wp:de
 * selecteditors (cB, cC) - A for anonymous user, R for registered user, B for bot. Example: selecteditors=R|A|R+A|B
 * selectedits (cB, cC) - M for manual, B for bot-induced. Example: selectedits=M|B
 * selectplatform (cB, cC; only for squidpageviews) - M for mobile, N for non-mobile (anyone know a better term?). Example: selectplatform=M|N|M+N

normalized - Y or N. Only applies to squidpageviews, where data for each month are recalculated to 30 days (other metrics may follow). Default: N. (The WMF Report Card will use normalized time series when available.)

data - One or more types of data to be returned, separated by commas. Default: timeseries. Values:
 * timeseries - returns an ordered list of value pairs, one for each month within the range
 * timeseriesindexed - like timeseries, but each month's value is relative to the oldest month's value, which is always 100
 * percentagegrowthlastmonth, percentagegrowthlastyear, percentagegrowthfullperiod - growth percentages are relative to the oldest value (80->100 = 25%); although trivial, requesting these through the API ensures all clients use the same calculation

reportlanguage - Language code, used to expand region and country codes into region and country names. Default: en. Supported: en.

format - (csv, json, ... see elsewhere)

Example:

api.php?action=analytics&months=2008-03;2011-03&metric=squidpageviews&selectcountries=US|UK&selectplatform=M|N&normalized=Y&data=timeseries,percentagegrowthlastmonth,percentagegrowthlastyear&format=xml

This returns four sets of metrics (a time series plus two growth percentages each): one for United States/mobile, one for United States/non-mobile, one for United Kingdom/mobile, and one for United Kingdom/non-mobile.

Filter select_regions
For comscore_... filters:
 * AS = Asia Pacific
 * C = China
 * EU = Europe
 * I = India
 * LA = Latin America
 * MA = Middle East/Africa
 * NA = North America
 * US = United States
 * W = World

Filter select_countries
Valid country codes follow ISO 3166-1:
 * List of ISO 3166-1 country codes and English name
 * About 3166-1 standard

Filter select_projects
 * wb = Wikibooks
 * wk = Wiktionary
 * wn = Wikinews
 * wp = Wikipedia
 * wq = Wikiquote
 * ws = Wikisource
 * wv = Wikiversity
 * co = Commons
 * wx = Other projects

Filter select_wikis
Specify as project:language. For valid project codes see select_projects above. For full lists of valid language codes per project see Wikimedia projects. The following overview of exported wiki databases (aka dumps) can also be useful: lists of Wikimedia dumps.
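Composing a selectwikis value from project and language codes is straightforward; a small sketch (the chosen wikis are just examples):

```python
# Hedged sketch: composing a selectwikis value as project:language pairs,
# joined with pipes (cB style). Project codes follow the select_projects
# list above; the language codes chosen here are arbitrary examples.
wikis = [("wp", "en"), ("wp", "de"), ("wq", "de")]
selectwikis = "|".join(f"{project}:{lang}" for project, lang in wikis)
print(selectwikis)  # wp:en|wp:de|wq:de
```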

Return value
The return value will have as its outermost group the name of the metric, along with the API call used to generate the object, the start date of any time series returned, and the granularity. The objects inside the main metric group will carry all filters that were applied to obtain the data, along with any constraints on the data.
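The structure described above might look something like the following. All field names here are assumptions for illustration only; no return format has been committed.

```python
# Hypothetical illustration of the return-value structure: the outermost
# group is the metric name, carrying the generating API call, the start
# date of the time series, and the granularity; inner objects carry the
# filters that were applied. Every field name below is an assumption.
response = {
    "squidpageviews": {
        "apicall": "api.php?action=analytics&metric=squidpageviews&...",
        "start": "2008-03",
        "granularity": "month",
        "results": [
            {
                "filters": {"selectcountries": "US", "selectplatform": "M"},
                "timeseries": [101, 104, 109],  # sample values, invented
            },
        ],
    },
}

# A client would unwrap the outermost metric group like this:
metric, body = next(iter(response.items()))
print(metric, body["granularity"])
```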