Analytics/Reportcard

From MediaWiki.org

The beta of the new Reportcard dashboard is at reportcard.wmflabs.org. The Reportcard is powered by Limn. For status tracking purposes, the "Report card" as a project separate from Limn is deprecated; to find out what is going on with the report card, look at the Limn status.

Goals

This project aims to replace the current stats.wikimedia.org/reportcard. It should deliver the following high-level objectives:

  • Fully automated
  • Metrics themed by the following groups: readers, editors, devices, articles, ecosystem, files
  • A nicer and more intuitive look and feel
  • Fully packaged as an extension, so that graphs can be invoked from any Wikimedia wiki

Documentation

  • (tbd)

Notes

Data gathering

monthly comScore metrics

Intro: WMF has access to certain comScore metrics (MyMetrix domain) by kind donation. All international metrics are published roughly 3 weeks after a month closes. Several WMF employees have a sign-in code (among others Erik Zachte and Diederik van Liere).

  • Sign in on comScore
  • Click MyMetrix (top left)
  • Depending on your sign-in code, pre-generated reports will be available (click 'Ready for you' at the bottom right, ignore duplicates)
  • Download one copy of the following (download button is double arrow top left):
    • Multi-Country Media Trend UV's
    • Multi-Country Media Trend, Pages Viewed
    • Multi-Country Media Trend, UV's by region
    • Multi-Country Media Trend, % reach by region
    • Top 1000 properties, UV trend
    • Reference sites UV
  • These files will be vital for future reference, as each report only contains the last 14 months. Therefore edit each file name to make the range of months contained in the report self-evident, e.g. Multi-Country Media Trend, Pages Viewed_1711864.csv becomes Multi-Country Media Trend, Pages Viewed_(Jan 12- Jan 13) (beware: the number of months differs per report!)
  • Add the files to the git repository: 'git add -f ../analytics/csv/comscore' (-f forces the add, overruling .gitignore)
  • Send a copy of 'Multi Country Media Trend Page Views ...' and of the corresponding 'UV's ...' report to Global Dev (Jessie Wild, Evan Rosen, Anasuya Sengupta)
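
The renaming step lends itself to a small helper. The following is only a sketch: the function name comscore_name is hypothetical, it assumes the downloaded name ends in an underscore plus a report id as in the example above, and it keeps the .csv extension (an assumption; the example in the list omits it). The month range is passed in by hand because the number of months differs per report.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build a self-describing name for a comScore download.
# $1 = downloaded file name, $2 = month range label, e.g. "Jan 12- Jan 13"
comscore_name() {
  local file="$1" range="$2" base
  base="${file%.csv}"     # drop the extension
  base="${base%_*}"       # drop everything after the last underscore (report id)
  printf '%s_(%s).csv\n' "$base" "$range"
}

# Example: print the new name without touching any file.
comscore_name "Multi-Country Media Trend, Pages Viewed_1711864.csv" "Jan 12- Jan 13"

# To rename for real (assumption: run inside the download directory):
#   mv -- "$f" "$(comscore_name "$f" "Jan 12- Jan 13")"
```

Printing the name first, and only then running mv, makes it easy to check the month range against the report contents before anything is renamed.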

monthly Wikistats dump metrics

Intro: at least once a month all roughly 800 Wikimedia wikis are dumped to XML files. Wikistats parses these dumps and rebuilds almost all stats from scratch. Some of the metrics collected are input for the monthly report card. Dumps for large wikis in particular are sometimes only available near the end of the month, which makes the whole process time-critical every month. When one or more large wikis (say, one or more of the 25 largest Wikipedias by number of articles) have not been processed, generating a report card is pointless and should be cancelled: it would give a distorted view of our project-wide metrics, e.g. the total number of editors.
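
The go/no-go rule above can be expressed as a simple list comparison. This is a minimal sketch under stated assumptions: the function name missing_wikis is hypothetical, both lists are passed as space-separated language codes, and the three codes used here are illustrative stand-ins for the 25 largest wikis. How the lists are obtained from the status pages is not covered.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every required wiki code that is absent
# from the processed list. Both arguments are space-separated code lists.
missing_wikis() {
  local required="$1" processed="$2" code
  for code in $required; do          # unquoted on purpose: split on spaces
    case " $processed " in
      *" $code "*) ;;                # already processed
      *) printf '%s\n' "$code" ;;    # still missing
    esac
  done
}

# Illustrative data only, not real status information.
largest="en de fr"
done_codes="en fr"
missing=$(missing_wikis "$largest" "$done_codes")
if [ -n "$missing" ]; then
  # $missing is unquoted on purpose so that several codes end up on one line
  echo "Report card should be cancelled; not yet processed:" $missing
else
  echo "All large wikis processed; the report card can be generated."
fi
```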

To monitor dump generation and dump processing there are two status pages, which are constantly updated: dump generation status and dump processing status. The color of each language code indicates how up to date the input/output of Wikistats is. Superscripted numbers behind each language code show the age of the dump, or how long ago the dump was processed (depending on the status report).

As a rule of thumb, dump generation should be completed around the 20th; if not, ask Ariel Glenn to manually start the missing dump jobs. Note that this schedule is too time-sensitive and leaves little room for error correction; Ariel is studying how the process can be sped up. Near the end of the month all codes on both status reports should be green (don't be confused when they are recolored dark blue from the start of the next month). For large wikis the stub dump is processed (no article content); for smaller wikis, the full archive dump.

More on the dump processing jobs, and manual steps needed to finalize the process, later.

monthly Page View metrics

A daily cron job on server stat1 (account ezachte) runs /a/wikistats_git/dumps/bash/pageviews_monthly.sh. This job updates all reports listed at [1], and updates the file wikilytics_in_pageviews.csv for the report card. This file can actually be more up to date than other input files for the report card, depending on when the job is run. The Metrics Meeting is always scheduled on the first Thursday of the month (except in January, when it is the second Thursday). For the most up-to-date page view metrics in the report card, run prep_csv.sh shortly before the Metrics Meeting.
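
To decide when to run prep_csv.sh, the meeting date can be computed from the rule above. The sketch below is not part of Wikistats: the function name metrics_meeting_date is hypothetical, and the weekday is derived with Sakamoto's method, a standard calendar formula, so that no external date tool is needed. The actual meeting may of course be rescheduled by hand.

```shell
#!/usr/bin/env bash
# Hypothetical helper: date of the Metrics Meeting for a given year and month.
# Rule from the text: first Thursday, except January (second Thursday).
metrics_meeting_date() {
  local year="$1" month="${2#0}"   # strip a leading zero so 08/09 are not read as octal
  local y="$year" t dow first
  # Sakamoto's method: weekday of the 1st of the month (0 = Sunday).
  if [ "$month" -lt 3 ]; then y=$((y - 1)); fi
  set -- 0 3 2 5 0 3 5 1 4 6 2 4   # per-month offsets, January first
  shift $((month - 1)); t=$1
  dow=$(( (y + y/4 - y/100 + y/400 + t + 1) % 7 ))
  first=$(( 1 + (4 - dow + 7) % 7 ))   # day of month of the first Thursday
  if [ "$month" -eq 1 ]; then first=$((first + 7)); fi
  printf '%04d-%02d-%02d\n' "$year" "$month" "$first"
}

metrics_meeting_date 2013 04   # first Thursday of April 2013
metrics_meeting_date 2013 01   # January: second Thursday
```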

generation of input files for Report Card

  • Run ../analytics/bash/prep_csv.sh (until the bash file is fixed, set the variables yyyymm, yyyymm2 and yyyymm_rc manually):
yyyymm=2013-02    # last month to report on
yyyymm2=2013-01   # previous month for comparisons
yyyymm_rc=2013-04 # month of RC meeting

This will generate wikilytics....csv files as input for Limn, and two comparison... text files that list discrepancies with the previously published metrics for quick sanity checking. All files relevant to Limn are packaged in stat1:/a/wikistats_git/analytics/csv/[yyyymm]/rc-[yyyymm_rc].zip
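
Until prep_csv.sh sets its own variables, part of the manual step can be derived rather than typed. The sketch below is an illustration only: prev_month is a hypothetical helper, not something prep_csv.sh provides, and it assumes that the [yyyymm] and [yyyymm_rc] placeholders in the zip path take the same yyyy-mm values as the variables. The month of the RC meeting is still set by hand.

```shell
#!/usr/bin/env bash
# Hypothetical helper: previous month for a yyyy-mm value.
prev_month() {
  local y="${1%-*}" m="${1#*-}"
  m="${m#0}"                       # avoid octal parsing of 08 and 09
  if [ "$m" -eq 1 ]; then
    y=$((y - 1)); m=12             # January rolls back to December
  else
    m=$((m - 1))
  fi
  printf '%04d-%02d\n' "$y" "$m"
}

yyyymm=2013-02                     # last month to report on
yyyymm2=$(prev_month "$yyyymm")    # previous month for comparisons
yyyymm_rc=2013-04                  # month of RC meeting, still set by hand

# Assumed location of the package, following the path pattern in the text.
echo "/a/wikistats_git/analytics/csv/${yyyymm}/rc-${yyyymm_rc}.zip"
```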

About the comparison files
As all Wikistats data are regenerated each month, and each new dump has more articles deleted for every historic month, discrepancies are expected, but mostly for the second-newest month (i.e. on its second appearance). After two months the numbers are pretty stable. This choice to regenerate all metrics each month has been a repeated source of confusion and debate. For the pros and cons of refreshing all metrics each month, see also this post on the analytics mailing list.
Also, some languages will be flagged as new in the comparison files. This is because only the 25 largest wikis are listed in the input, and that set may vary from month to month.