Analytics

From MediaWiki.org

Howdy from the Wikimedia Analytics team! We're currently building a Data Services Platform (codenamed Kraken) to power the next generation of intelligence and analytics, as well as a new and wondrous Reportcard.

If you're enthusiastic about analytics, you might want to meet the team, or subscribe to the Analytics Mailing List (don't worry, it's low-traffic). If you want to see Kraken in action, have a look at the Hive tutorial.



Team

Planning


Stakeholder Corner

  • Product Codes: application identifiers required to get data into the cluster.
  • Metrics Definitions: canonical definitions of metrics.
  • Dreams: features, metrics/queries, and visualizations people would like to see someday. Importantly, don't worry about whether these things are reasonable, well-scoped, or in the right place. Just add anything here that you think would be useful.


Projects

Kraken

  • "Kraken" is the codename for the cluster and software that power the Data Services Platform.
  • Project Info
  • Product Codes
  • Cluster Dataflow Diagram
  • Latest status: 2013-05-monthly:
    We continued our efforts to increase monitoring coverage of the different webrequest dataflows. On the udp2log side, we added monitoring per DC/server role. Every month, we work on improving the robustness and security of the analytics-related servers that we run: we moved the multicast relay from Oxygen to Gadolinium, we upgraded Oxygen to Ubuntu Precise, and we moved all the Limn-based dashboards from the Kripke labs instance to the Limn0 labs instance. Continuous integration for webstatscollector, wikistats and udp-filters now works. The puppet module for Hadoop has been merged into the Operations repository; this is a big step forward in moving Kraken from beta to production status. Magnus Edenhill demonstrated varnishkafka based on Kafka 0.8; on a local machine varnishkafka was able to process 140k msgs/s, and we are planning to do production testing in mid-June. Finally, we separated the Kraken machines from the other production servers by installing network ACLs.

Limn

  • Limn is a drop-in GUI toolkit for building visualizations. It powers the WMF Monthly Reportcard.
  • Project Info
  • Source: https://github.com/wikimedia/limn
  • Issues: https://github.com/wikimedia/limn/issues
  • Latest status: 2013-05-monthly:
    For the mobile team, we started collecting pageview counts for both official and non-official Wikipedia apps. We changed our Kafka import configuration so that the raw webrequest folders are directly queryable using Hive. The decision was made to re-platform the UMAPI codebase; we spent quite some time specifying user stories and had productive discussions about the architecture during the Amsterdam Hackathon. On the development side, the 'page count' metric was introduced. We adapted Ori Livneh's MediaWiki Vagrant VM to also support UMAPI in combination with test data. This will make it much easier to debug issues and will open development up to community members. We also fixed numerous stability bugs.

Reportcard

Wikistats

Logging Infrastructure

Data Releases

See Also

Research & Notes

Management