Analytics/Wikistats

Work in progress 

Wikistats is an informal but widely recognized name for a set of reports developed by Erik Zachte since 2003, which provide monthly trend information for all Wikimedia projects and wikis. The term is sometimes also used for other reports that target wikis.

Wikistats Portal
For a quick introduction to wikistats, see the portal stats.wikimedia.org. The portal gives access to:
 * Report Card (see below)
 * Monthly wikistats reports, which track trends for a myriad of metrics per Wikimedia wiki (700+) (see below), e.g. for all Wikipedia projects
 * 'Special' reports: some are one-time publications, some are produced monthly in the wikistats job (see below), but unlike the previous set these focus on cross-wiki comparisons
 * External reports (meaning externally produced) with varying update intervals (not further detailed here)
 * Wikipedia visualizations: a gallery of some of the best visualizations produced anywhere on the web (not further detailed here)
 * About info, which introduces the perl scripts and intermediate public csv files
 * Help info, which shows a breakdown of monthly wikistats reports, and introduces UI elements

General characteristics

 * All scripts are coded in perl.
 * There is hardly any formal documentation (this page is a start), for several reasons:
 * Lack of time: there is always some new report or other data/stats request that takes priority
 * Lack of motivation: for 6 years wikistats was a private volunteer project, with reusability of the code as a lower priority (still Wikia and other sites figured out how to use the scripts)
 * Lack of confidence in inline comments: too often these are out of sync with what the code really does, or are riddles in themselves, or mere generalities (this is no comment on Wikimedia sources, I wouldn't even know). Erik favors self-documenting code; descriptive variable and article names are crucial here. Caveat! Some of his scripts, even major ones, are in blatant violation of this rule, and therefore hard to maintain, esp. certain parts of the WikiReports scripts.
 * No scripts are in Subversion yet (a major target for this budget year is to reorganize/streamline parts of the code, then check in on a per project basis)
 * Some of the scripts date back to 2003 and have been overhauled many times, often to accommodate non-trivial database restructuring (pre xml dump era), but even more importantly: to refactor the scripts when processing time and limited resources (memory) again and again called for new measures. In 2003 the full history of the English Wikipedia could be rebuilt in 10 minutes. In 2010 complete processing of the English Wikipedia takes 20 days, even with much more efficient scripts. This is why in early 2010 a pragmatic decision was taken to parse only stub dumps on a monthly basis and omit some less important or even trivial metrics, with the intention to update these at least once a year. The most significant loss is word count, which featured as one of the few basic metrics in the yearly tax report ('Wikipedia now has x times as many words as Britannica').
 * Some of the optimizations that were needed years ago, with the standard server configuration of the time, might not be needed on a well-configured, resource-rich server of today (this is debatable and not substantiated).
 * There are references in the code to Erik's test setup (e.g. local path names on Erik's PC). This needs to be generalized some day. Of course these only come into play on test runs.
 * Some parts of the code are quick hacks or incomprehensible in some other way, shame on you Erik.
 * Some parts of the code are quite readable, rich in functionality and well tuned, kudos Erik ;-)
 * Most code is tested in a test environment (Erik's PC) rather than on our production systems. Regression tests with pre/post modification comparison are often part of the procedure.

Report Card

 * Update frequency: monthly
 * Process: Considerable manual intervention needed. See WMF office wiki for details (private wiki).
 * Synopsis: A compact overview of recent trends, primarily targeting WMF management and staff, but publicly available. All charts cover exactly one year.
 * Input:


 * Unique visitor counts and reach percentages per global region: comScore monthly downloads (csv files)
 * Other metrics: output from several wikistats jobs, partly reprocessed with scripts
 * Procedure: several scripts collect, reprocess and format input, followed by a lot of copy & paste into a huge spreadsheet which draws tens of charts. These are saved in a two-step process (Paintshop as intermediary to keep the best output quality). A script produces three variations of reports from a custom html template (home-grown syntax). Manual analysis and crafting of the synopsis. Manual publication of files in several sets.
 * Comment: All in all quite a time-intensive process. Some day it should be further automated and made more flexible. But the challenge for this year is to improve information density and to make a version for higher management which is even more concise.

Wikistats Monthly Reports
Here is an overview of reports, clustered by scope, data source and execution platform

Background
In summer 2003 Erik Zachte started to produce reports on Wikipedia trends. The first objective was to produce uniform article counts for the entirety of Wikipedia's (short) history. In the early days the definition of what constituted an article changed several times. Parsing SQL dumps made it possible to trace the history of Wikipedia from its inception and build trend reports using the latest definition of what constitutes an article.

Over the years parsing SQL dumps became less and less practical. The table structure changed all too often, and reverse engineering database changes was sometimes a challenge. So when XML dumps arrived this was a huge relief.

In 2003 the script that recreates all counts for the English Wikipedia ran in about 10 minutes. In Jan 2010 processing the full xml archive for the English Wikipedia took 20 days, and is consequently no longer considered feasible on a monthly basis. Processing has shifted to so-called stub dumps (meta info, no article content), with some loss of functionality.

Major Reports

 * Monthly wikistats trend reports, based on xml and sql dumps
 * Platform: wikistats server bayes
 * Purpose: show historic trends for a myriad of metrics that can be inferred from the dumps, for 700+ wikis
 * Input: Wikimedia XML dumps, and some SQL dumps
 * Update frequency: monthly


 * Scripts:


 * WikiCounts.pl + WikiCounts*.pm process a variety of dumps and produce a set of intermediate csv files (publicly available)
 * WikiReports.pl + WikiReports*.pm process csv files and produce tens of different reports, some for 280 wikis, many in 25 languages. All in all thousands of reports per month.
 * Perl and shell scripts reside on bayes: /home/ezachte/wikistats
 * A major consideration for wikistats is to provide equal information for all Wikimedia projects and all language wikis that constitute a project. All 700+ wikis get the same treatment (except wikis with fewer than 10 articles). Many reports are published in 25+ languages. But most translations are far from complete, and some reports are only available in English. There has been little maintenance on the language-specific message files in recent years. Some day language support should be overhauled and replaced by an interface to Siebrand's TranslateWiki.
 * On every run all csv files are rebuilt from scratch. Reusing previous months' data was of course considered, but new functionality is often added to the WikiCounts job, and of course there is the occasional bug fix. Wherever possible new metrics are supplied for all months since Wikipedia's inception in 2001. Facilities for reusing old counts would thus add complication with little benefit.


 * Monthly wikistats trend reports, indirectly based on squid logs
 * Data source: Hourly page view files, per article title, and per project, indirectly based on squid logs after aggregation with Domas Mituzas' scripts.
 * Platform: wikistats server bayes
 * Update frequency: daily
 * Input: udp2log on locke collects udp messages from all squids, writes sampled data to disk for 3 months of storage, and produces two hourly files, both downloadable from dammit.lt (in fact these are now produced on a WMF server, so the dammit.lt server can be taken out of the process some day).
 * Hourly files with total page views per wiki are used for report card and monthly page view stats. The reports are available in 4 variations: for desktop and mobile traffic, normalized (for months of 30 days) or original.
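The normalization to 30-day months mentioned above can be illustrated with a small sketch. It is written in Python for brevity and is not taken from the wikistats Perl scripts. The function name `normalize_to_30_days` is hypothetical, and the linear scaling (raw total times 30, divided by the actual number of days in the month) is an assumption about how the normalized variants are derived.

```python
import calendar


def normalize_to_30_days(views: int, year: int, month: int) -> float:
    """Scale a monthly page view total to a notional 30-day month.

    Assumption: plain linear scaling by the number of calendar days.
    """
    days_in_month = calendar.monthrange(year, month)[1]
    return views * 30 / days_in_month


# A 31-day month is scaled down, a 28-day month is scaled up.
january = normalize_to_30_days(3100, 2010, 1)
february = normalize_to_30_days(2800, 2010, 2)
```

Under this assumption, months of different lengths become directly comparable in a trend chart, at the cost of no longer showing the totals that were actually counted, which may be why both variants are published.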


 * Monthly wikistats reports, providing breakdowns of traffic directly based on (sampled) squid logs
 * Platform: squid log aggregation server locke
 * Update frequency: manually, ideally monthly (automation is in progress)
 * tbc

Minor Reports

 * Progress report on sql/xml dump production
 * Platform: Erik's server account at infodisiac.com (hosted by Lunarpages)
 * Objective: Provide a concise, at-a-glance overview of dump progress, esp. whether the dumps are ready for processing by wikistats
 * Highlights: color coding, clustering per major project, very concise notation for an at-a-glance overview (not meant for the general public)
 * Update frequency: Report is built on the fly on user access


 * Progress report on wikistats data gathering
 * Platform: wikistats server bayes
 * Highlights: color coding, clustering per major project, very concise notation for an at-a-glance overview (not meant for the general public)
 * Update frequency: Report is refreshed 4 times per hour


 * Mailing list activity
 * Platform: Erik's account at infodisiac.com
 * Objective: provide monthly activity stats on Wikimedia mailing lists
 * Input: zipped mailing list archives (e.g. from )
 * Update frequency: Reports are refreshed daily


 * Total monthly posts per mailing list
 * Major contributors to the mailing lists, with some profiling
 * Monthly activity per mailing list per contributor, e.g. foundation-l


 * Rankings and 12 months trends for major web properties and global reference sites, based on comScore data.
 * Platform: Erik's home PC
 * Input: csv files downloaded from comScore site. Explicitly confidential. Thanks to comScore for providing us important web analytics data for free.
 * Synopsis: rankings and unique visitors counts for last 12 months, enriched with trending parameters
 * Procedure: See WMF office wiki for details (private wiki).

Ideas for improvement

 * Split the WikiCounts job into two jobs to improve restartability. Currently a bug in the processing stage could mean that days or even weeks of data gathering are lost.
 * 1) Data gathering (parsing the dumps),
 * 2) Data sort / processing / export to csv files
 * Some intermediate data sets are huge: many gigabytes. In order to sort these files with limited internal memory, multiple bins are used. Consider using merge sort instead (if there is a cross-platform solution).
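
As an illustration of the merge sort idea in the last item, here is a minimal sketch of an external merge sort. It is written in Python rather than Perl and is not part of the wikistats code base. The names `external_sort` and `max_lines_in_memory` are hypothetical, and it assumes line-oriented records (such as csv rows) that can be ordered as plain strings.

```python
import heapq
import os
import tempfile
from typing import Iterable, Iterator, List


def _write_run(sorted_lines: List[str], tmp_dir: str) -> str:
    """Write one sorted chunk (a 'run') to a temporary file, return its path."""
    fd, path = tempfile.mkstemp(dir=tmp_dir, suffix=".run", text=True)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.writelines(sorted_lines)
    return path


def _read_run(path: str) -> Iterator[str]:
    """Stream the lines of one run without loading the file into memory."""
    with open(path, encoding="utf-8") as handle:
        yield from handle


def external_sort(lines: Iterable[str], max_lines_in_memory: int,
                  tmp_dir: str) -> Iterator[str]:
    """Yield all input lines in sorted order using bounded memory."""
    run_paths: List[str] = []
    chunk: List[str] = []
    for line in lines:
        # Ensure every record ends with a newline so runs stay line-oriented.
        chunk.append(line if line.endswith("\n") else line + "\n")
        if len(chunk) >= max_lines_in_memory:
            chunk.sort()
            run_paths.append(_write_run(chunk, tmp_dir))
            chunk = []
    if chunk:
        chunk.sort()
        run_paths.append(_write_run(chunk, tmp_dir))
    try:
        # heapq.merge keeps only one pending line per run in memory.
        yield from heapq.merge(*(_read_run(path) for path in run_paths))
    finally:
        for path in run_paths:
            os.remove(path)
```

A caller would pass an open file handle as `lines` and write the yielded lines to the output file. The sketch uses only the standard library, so it should behave the same on Linux and Windows. With very many runs the number of simultaneously open files can become a limit, in which case the runs would have to be merged in several passes.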