Analytics/Archive/Roadmap

David Schoonover, Diederik van Liere (March 2012)

Disclaimer: This document is a work in progress; as a sketch of our roadmap, it provides a good understanding of our thinking, our goals, and our needs for the coming year for budgetary purposes. We welcome feedback and suggestions. By ‘our’ we mean the Analytics Team; this is not official policy of the Wikimedia Foundation.

Our overarching vision is to give the wiki movement a true data services platform: a cluster capable of providing realtime insight into community activity and a new view of humanity's knowledge to power applications, mash up into websites, and stream to devices. It must be powerful enough to keep pace with our ample institutional motivation and energy, and robust enough to service needs that are as yet hidden from view.

Executive Summary
The Analytics Team has two overarching goals for the fiscal year of 2013:

A Data Services Platform
The construction of a 40-node compute cluster to store, analyze, and query all incoming data of interest to the community, built so as to keep pace with our ample institutional motivation and energy.

This platform includes the following:
 * Traffic Data: Structured queries including typical web analytics (with geo) for visitor data, anonymized editor data, wiki page, media, and search views.
 * Application Instrumentation: Structured queries on usage data about our mobile applications, click tracking, conversion funnel tracking, and revision tagging.
 * Self-Service Data Access: Unified access to data services infrastructure under a self-service model, via a dashboard, a public API, and an internal query console.
 * Spare capacity to accommodate traffic growth and to respond quickly to new, unforeseen data needs.

See Hardware Planning for more detail on the technology, hardware needs, and concrete goals of the platform, specifically Tranche A for a hardware profile that can accommodate all these features.

Actionable Intelligence on Institutional Goals
Intermediate solutions for high-priority institutional goals and urgent needs:
 * Reportcard automation
 * Editors by Geography
 * Pageviews by Mobile Carrier
 * Tracking of WMF Mobile apps
 * Support the new Editor Engagement Experiments team

Mission
The Analytics Team sees as its primary responsibility making Wikimedia-related data available for querying and analysis to both WMF and the different wiki communities and stakeholders. We want all our users to be as self-sufficient as possible, although we can conduct analysis ourselves if need be.

Starting Point
Our current analytics capabilities are underdeveloped: we do not have an infrastructure to capture visitor, editor, clickstream and device data in a way that is easily accessible for analysis; our analytics efforts are distributed among different departments; our data is fragmented over different systems and databases; many of our analytics tools are ad hoc and not embedded within the Foundation; and we have very few developers who can work on this full-time.

The Challenge
Our institutional short-term and long-term goals are ambitious: we want to retain more editors, we want to grow our mobile readership, we want to diversify our communities and we want to roll out new features to the MediaWiki platform. These goals have been formulated but we cannot monitor the progress towards these goals in a timely manner. The data that shows our progress towards these goals should be accessible as fast as possible.

We need to abstract the extraction, transformation, anonymization, enriching and loading (ETAEL) of data away from our customers. There are important reasons why they should not have to worry about this:
 * It does not answer their questions; it is a necessary evil.
 * It makes it harder to compare different analyses because we cannot be sure that different analyses used the same data and cleaned the data in the same way.
 * It is scaffolding: every analysis needs to go through this, and hence there are scale efficiencies to be achieved.
 * It makes ad-hoc analysis vulnerable to upstream changes like db schema changes, changes to the XML files, etc.

Toward a Data Services Platform
The Analytics team will develop a data services platform that consists of:
 * Extraction, transformation, aggregation, anonymization and storage of visitor, anonymous editor and clickstream data
 * Extraction, transformation, aggregation and storage of registered editors, article and device data
 * Enriching visitor and editor data with geographic data (and possibly demographic data)
 * A RESTful API through which both internal and external customers can query this data
 * A system for authentication of both humans and computers (OAuth), quotas and throttling of requests, and availability information about the RESTful interface
 * A web-based presentation layer that provides a high-level overview of the health of the different Wikimedia communities
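
As an illustration of the quota and throttling component listed above, the following is a minimal sketch of a token-bucket limiter, one common way to throttle API requests. The roadmap does not specify a mechanism; the class name `TokenBucket` and its parameters (`rate`, `capacity`, `clock`) are assumptions made for this example only.

```python
import time


class TokenBucket:
    """Hypothetical per-client throttle: `rate` requests per second on
    average, with bursts of up to `capacity` requests."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.clock = clock  # injectable so the limiter can be tested
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self, cost=1.0):
        """Return True and spend `cost` tokens if the request fits the quota."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

In a real deployment one bucket would typically be kept per authenticated client, so that a heavy consumer exhausts only its own quota.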

Anonymization: this means that the data repository should never store raw IP addresses or other information that would identify visitors and editors (with the exception of registered editors).
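
One possible way to honor this constraint at ingestion time is to replace each IP address with a keyed hash before the record reaches storage. The sketch below is an assumption, not the team's chosen design: the function names, the `ip` and `visitor_token` field names, and the use of HMAC-SHA256 are all illustrative. Strictly speaking a keyed hash is pseudonymization; how strong the guarantee is depends on keeping the key secret and rotating or discarding it.

```python
import hashlib
import hmac
import ipaddress


def pseudonymize_ip(raw_ip, secret_key):
    """Replace an IP address with a keyed hash so the raw value is never stored."""
    # Normalise first so equivalent spellings of an address hash identically.
    canonical = ipaddress.ip_address(raw_ip).compressed
    digest = hmac.new(secret_key, canonical.encode("ascii"), hashlib.sha256)
    return digest.hexdigest()


def anonymize_record(record, secret_key):
    """Return a copy of a log record with the hypothetical `ip` field
    replaced by an opaque `visitor_token`."""
    cleaned = dict(record)
    cleaned["visitor_token"] = pseudonymize_ip(cleaned.pop("ip"), secret_key)
    return cleaned
```

For example, `anonymize_record({"ip": "192.0.2.7", "page": "Main_Page"}, key)` returns a record that still contains `page` but no `ip` field, and the same address always maps to the same token under the same key.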

Tall Trees Have Deep Roots
We have an amazing opportunity to develop an analytics infrastructure that will service the WMF and the community for many years to come. To make sure that we are able to deliver timely, accurate and reliable data, we need to take time to develop a robust infrastructure. By making our analytics infrastructure open from day one, we ensure that we can maximize community involvement, crowdsource some of our analytics needs and adapt to changes in our strategic priorities.

The data services platform will also better serve the current organizational design of WMF: data analysts from the different departments will be able to query the data and generate the reports that are required. We envision this platform to be as self-service as possible, which will also lead to fewer conflicting priorities.

Transition
We currently have a number of tools that create analytics: wikistats.org for overall community health, wikipride / editor trends study software for analyzing editor retention rates across different projects, and other ad hoc tools. These tools need to be used in the short term while the platform is built; once the platform matures, we have to transition these tools to the platform and start fetching data from the cluster.

As long as we have a distributed organizational design, the primary responsibility for transitioning our current toolset lies with the customers (with the exception of the ED and Board of Directors). The development of an overall dashboard for the ED / Board of Directors falls squarely within the responsibility of the Analytics Team. To make the data platform a success, we need to make it the primary and overwhelming target of our organizational investment in analytics. As a result of this responsibility, the Analytics Team needs to be highly responsive to the analytics needs of the organization.

The web-based presentation layer allows people to interactively inspect the data and generate charts; this could include dashboards that present the current health of the Wikimedia communities. If customers do not have the resources to develop this themselves then it can become a priority for the Analytics Team, but the primary responsibility lies with the customer.

Querying, Not Just Counting
The purpose of this platform is not just to count raw numbers but to give data analysts, community members and other interested folks the possibility to analyze our data and help us find “somewhere, something, incredible is waiting to be known” (Carl Sagan). This also means that sampling is ultimately not a long-term solution. Sampling can be used for single point estimates over large datasets, but structured queries involving joins on sampled data compound the error and result in an exponential loss of confidence in the result. As unreliable data cannot serve as a solid basis for decision-making, the platform will use advanced methods of cardinality estimation to provide bounded-error counts that can still be queried structurally.
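
To make "cardinality estimation" concrete, here is a minimal sketch of one well-known estimator, the k-minimum-values (KMV) sketch. It is offered only as an example of the family of techniques; the roadmap does not name a specific method, and the function names and the default `k=1024` are assumptions. The idea: hash every item to a pseudo-uniform number between 0 and 1, keep only the `k` smallest distinct hashes, and infer the distinct count from how small the k-th hash is.

```python
import hashlib
import heapq


def _hash01(item):
    """Map an item to a pseudo-uniform float in the unit interval via SHA-1."""
    digest = hashlib.sha1(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2.0 ** 64


def estimate_distinct(items, k=1024):
    """Estimate the number of distinct items from the k smallest hash values."""
    kept = set()  # the hashes currently held, for duplicate detection
    heap = []     # the same hashes, negated, so heap[0] is the largest kept
    for item in items:
        h = _hash01(item)
        if h in kept:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            kept.add(h)
        elif h < -heap[0]:
            # New hash is smaller than the largest kept one: swap them.
            evicted = -heapq.heappushpop(heap, -h)
            kept.discard(evicted)
            kept.add(h)
    if len(heap) < k:
        return len(heap)  # fewer than k distinct items seen: count is exact
    kth_smallest = -heap[0]
    return int((k - 1) / kth_smallest)
```

Memory use is fixed by `k` regardless of input size, and the relative error shrinks roughly in proportion to one over the square root of `k`. Sketches of this kind built on separate data streams can also be merged to estimate the size of their union, which is what makes them usable in structured queries rather than only in single counts.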

Analytics Stack Experimentation
Currently, the Analytics Team has 10 Cisco nodes to compare different analytics stacks. Once these nodes are up and running we intend to create two five-node clusters, one powered by HBase, the other by Cassandra.

This will enable us to compare the stacks in terms of:
 * Performance
 * Deployment
 * Maintainability
 * User-friendliness on the front-end for data analysts

The results of this experiment will help us in making a final recommendation about which analytics stack we want to use.

Hardware Planning
A detailed discussion of our hardware needs can be found in the 2012-2013 Analytics Hardware Planning document.