Wikimedia Product/Better use of data

The Better Use of Data program aims to ensure a more reliable, efficient, and accessible means of collecting, interpreting, and sharing data.

Annual plans

 * FY 2018-2019

Artifacts

 * Instrumentation DACI
 * Data Dictionary

Subprojects

 * Data collection: Instrumentation
 * Report stewardship: New Content Program Metrics Reports

October 2018

 * 1-October-2018: Further review of roles and responsibilities regarding event logging was carried out over the past month and is now available as an Instrumentation DACI. Additionally, an offer has been accepted for the Software Engineer who will work on data-oriented components in the Infrastructure team in Readers Engineering, as well as for the management position in Product Analytics.

August 2018

 * 28-August-2018: Better Use of Data memo discussed between Audiences management and Analytics Engineering - figuring out division of labor, timelines, etc. Audiences is standardizing for FY 2018-2019 on Superset, Jupyter notebooks (SWAP/PAWS), and Turnilo. The working group has further iterated on the DACI for instrumentation. Data engineer recruiting continues. Wiki segmentation ideas being generated (see T188391).
 * 7-August-2018: Working group met. DACI going through final review. Provisional FY 18-19 reporting tool approach identified; this helps clarify part of the training curriculum for Output 2.1: Measurement expectations.

July 2018

 * 30-July-2018: DACI shared more broadly. Not setting up specific Phabricator board yet. Work group participants identified, first meeting scheduled.
 * 3-July-2018: DACI shared with managers for first review. Looking into setting up a Phabricator board. Work group participants still being identified.

June 2018

 * Annual plan updated
 * Marshall Miller is writing a memo with recommendations
 * Request for participants on working group issued
 * Roadmap formed

October 2018

 * Kate Zimmerman hired to run Better Use of Data program as manager of Product Analytics
 * Jason Linehan hired to assist Better Use of Data program as software engineer in Reading Infrastructure

December 2018

 * Better Use of Data Phabricator board created

Introduction
Originally called the Modern Event Platform (MEP) because the project was motivated by the adoption and integration of modern event systems and standards, the Event Platform is composed of technologies and conventions that allow us to do more with events than we previously could. Our original system, EventLogging, was not designed to scale or to be used in the ways we wanted to use it. It was initially intended for instrumenting features for telemetry – tracking interactions and recording measurements to give us insights into how features were used by actual users. You couldn't, for example, build a system that responded to events – one that took actions based on the information it received.

Today, events logged via EventLogging no longer go to a MySQL database but to Hadoop, which allows a greater volume of events to be stored. Similarly, the pipeline for processing received events has been replaced by the Kafka-based EventGate, which is one of Event Platform's core components and can ingest a much greater volume of events much faster.

This ability to process a greater volume of incoming events is required for:

 * EventBus, which produces events from changes on MediaWiki, and
 * the Change Propagation system, which reacts to those change events.

To be as robust and fast as it is, EventGate requires a different way of specifying schemas for events. The legacy system, EventLogging, uses schemas stored as JSON on Meta wiki, and those schemas are edited like any wiki page. The new system uses schemas stored in Git repositories, which allows us to handle development, continuous integration (CI), versioning, and deployment for schemas the same way we do for any code project. This core component of Event Platform employs the JSON Schema vocabulary, which allows us to annotate schemas and validate events in a way that is more robust and more controlled. Furthermore, this system allows schema fragments to be re-used (imported) and shared, which promotes consistency and standardization.
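As a rough illustration of what validating an event against a schema involves, the sketch below checks an event against a small definition written in the style of JSON Schema. The schema, its field names, and the `validate_event` helper are all hypothetical, and the check only covers the `required` and `type` keywords; it is not how EventGate is implemented and covers far less than the full JSON Schema vocabulary.

```python
# Hypothetical schema in the style of JSON Schema; not a real Wikimedia schema.
EXAMPLE_SCHEMA = {
    "type": "object",
    "required": ["$schema", "action"],
    "properties": {
        "$schema": {"type": "string"},
        "action": {"type": "string"},
        "duration_ms": {"type": "integer"},
    },
}

# Maps a subset of JSON Schema type names to Python types.
_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "object": dict,
}


def validate_event(event, schema):
    """Return a list of error messages; an empty list means the event passed
    this simplified check (required fields present, declared types match)."""
    if not isinstance(event, dict):
        return ["event is not an object"]
    errors = []
    for name in schema.get("required", []):
        if name not in event:
            errors.append(f"missing required field: {name}")
    for name, rule in schema.get("properties", {}).items():
        if name not in event:
            continue
        value = event[name]
        type_name = rule["type"]
        # bool is a subclass of int in Python, so reject it for numeric fields.
        is_bool_as_number = isinstance(value, bool) and type_name in ("integer", "number")
        if is_bool_as_number or not isinstance(value, _TYPES[type_name]):
            errors.append(f"field {name} should be of type {type_name}")
    return errors
```

Calling `validate_event({"$schema": "/example/1.0.0"}, EXAMPLE_SCHEMA)` would report the missing `action` field rather than silently accepting the event.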

Finally, Event Platform introduces the concept of streams. Inside the Event Platform, a stream is a contiguous collection of events. In the legacy system, instrumentation declares which specific revision of a schema its events conform to (since the legacy system stores schemas on wiki), and that revision of the schema is used for validation. After being validated, those events flow into a table named after that schema. There is no way to simply re-use a schema without making a differently named duplicate. In the new system, we declare a stream and the schema it uses, and the events logged to each stream end up in separate tables – one table per stream.

The stream configuration – made available through EventStreamConfig – allows us to declare and configure those streams. Besides specifying which schema events adhere to, we can also specify the stream's sampling settings. These configurable settings are used by the Event Platform Clients to change how data is collected, without requiring any changes to the instrumentation and thus no separate deployments of new versions of clients. Practically, this means we can adjust sampling rates without building a new version of a mobile app and going through the process of releasing it on an app store. On the MediaWiki side, we can change sampling rates without going through the process of patching WikimediaEvents, waiting for it to be reviewed, and waiting for it to be deployed along with other patches to MediaWiki Core and other MediaWiki extensions.
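To make the idea of stream configuration more concrete, here is a minimal sketch of two streams that share one schema but carry different sampling settings. The stream names, the `schema_title` and `sampling_rate` keys, and the helper functions are assumptions made for illustration; the actual format served by EventStreamConfig is not described on this page.

```python
# Hypothetical stream configuration: two streams re-use the same schema,
# each with its own sampling rate. Key names are illustrative assumptions.
STREAM_CONFIG = {
    "example.button_click.desktop": {
        "schema_title": "example/button_click",
        "sampling_rate": 0.1,
    },
    "example.button_click.mobile": {
        "schema_title": "example/button_click",
        "sampling_rate": 1.0,
    },
}


def schema_for(stream_name, config=STREAM_CONFIG):
    """Look up which schema the events of a stream must conform to."""
    return config[stream_name]["schema_title"]


def sampling_rate_for(stream_name, config=STREAM_CONFIG):
    """Look up a stream's sampling rate. Defaulting to 1.0 (log everything)
    when no rate is configured is an assumption of this sketch."""
    return config[stream_name].get("sampling_rate", 1.0)
```

In this sketch, lowering the desktop rate would only require editing the `sampling_rate` value in the configuration; the instrumentation code that logs to `example.button_click.desktop` would stay unchanged.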

Instrumentation
In the same way that the Event Platform is an evolution of the server-side component of EventLogging, the Event Platform Client (EPC) is an evolution of the client-side component of EventLogging. EPC is:


 * a specification of how clients work with the Event Platform
 * a standardization of algorithms, behaviors, and basic necessities
 * a set of libraries for different platforms adhering to that specification

Previously, different teams implemented their own EventLogging-based analytics solutions, isolated from each other. EPC is an effort to unify that previous work and to establish consistency across platforms. That uniformity and consistency make it possible to leverage data from multiple platforms to yield insights into how our users use our whole ecosystem of products in unison. It also enables analysts to support teams which are not their primary teams – to be more portable. The legacy system, in which every instrumentation has its own quirks and naming is inconsistent, places a heavy burden on each analyst to learn and remember the specifics of their assigned teams' data; and if another analyst had to come in as back-up, they too would need to learn those specifics. EPC takes care of:


 * obtaining stream configuration
 * using the stream config to control data collection
 * determining which streams are in-sample or out-of-sample based on the specific identifier (pageview, session, device) using a standardized algorithm
 * cc'ing streams
 * attaching the necessary metadata to logged events, such as a client-side timestamp recording when the event was generated
 * standardized session ID generation, consistent across MediaWiki, Android, and iOS
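One way to picture the in-sample determination listed above is a deterministic, hash-based check: the same identifier always yields the same decision for a given rate, so a session or device stays consistently in or out of the sample. The sketch below is an illustration only. The `in_sample` function and the use of SHA-256 are assumptions and should not be read as the standardized algorithm the EPC specification defines.

```python
import hashlib


def in_sample(identifier: str, rate: float) -> bool:
    """Decide deterministically whether an identifier (for example a
    pageview, session, or device ID) falls inside a sample of the given rate.

    Hypothetical approach: hash the identifier, scale the hash to the unit
    interval, and compare it with the sampling rate."""
    if rate <= 0:
        return False
    if rate >= 1:
        return True
    digest = hashlib.sha256(identifier.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer and scale it
    # to the unit interval.
    fraction = int.from_bytes(digest[:8], "big") / 2**64
    return fraction < rate
```

Because the decision depends only on the identifier and the rate, clients on different platforms that share an identifier scheme would reach the same decision without coordinating with each other.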