Wikimedia Product/Better use of data/Event Platform Clients

Introduction
The Event Platform was originally called the Modern Event Platform. Today, events logged via EventLogging no longer go to a MySQL database but to Hadoop, which allows a greater volume of events to be stored. Similarly, the pipeline for processing received events has been replaced by the Kafka-based EventGate, one of the Event Platform's core components, which can ingest a much greater volume of events much faster.

This ability to process a greater volume of incoming events is required for:

 * EventBus, which produces events from changes on MediaWiki, and
 * the Change Propagation system, which reacts to those change events.

In order to be as robust and fast as it is, EventGate requires a different way of specifying schemas for events. The legacy system, EventLogging, uses schemas stored as JSON on Meta-Wiki, and those schemas are edited like any wiki page. The new system stores schemas in Git repositories, which allows us to handle development, continuous integration (CI), versioning, and deployment for schemas the same way we do for any code project. This core component of the Event Platform employs the JSON Schema vocabulary, which allows us to annotate schemas and validate events in a more robust and more controlled way. Furthermore, this system allows schema fragments to be re-used (imported) and shared, which promotes consistency and standardization.
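As a rough illustration of what schema-based validation involves, the sketch below checks an event against a small JSON Schema-style document. The schema title, the field names, and the `validate_event` helper are all hypothetical, and the checker handles only the `required` keyword and per-property `type`; it is not a full JSON Schema validator and does not reflect how EventGate is implemented.

```python
# Hypothetical schema in JSON Schema style; title and fields are invented for illustration.
EXAMPLE_SCHEMA = {
    "title": "analytics/example/button_click",
    "type": "object",
    "required": ["$schema", "dt", "button_name"],
    "properties": {
        "$schema": {"type": "string"},
        "dt": {"type": "string"},
        "button_name": {"type": "string"},
        "click_count": {"type": "integer"},
    },
}

# Maps JSON Schema type names to the Python types accepted for them.
_JSON_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "object": dict,
    "array": list,
}


def validate_event(event, schema):
    """Return a list of validation errors (empty if the event passes).

    Only `required` and per-property `type` are checked.
    """
    errors = []
    for name in schema.get("required", []):
        if name not in event:
            errors.append(f"missing required field: {name}")
    for name, rules in schema.get("properties", {}).items():
        if name not in event:
            continue
        type_name = rules["type"]
        value = event[name]
        # bool is a subclass of int in Python, so reject it for numeric types explicitly.
        bool_as_number = isinstance(value, bool) and type_name in ("integer", "number")
        if bool_as_number or not isinstance(value, _JSON_TYPES[type_name]):
            errors.append(f"field {name} should be of type {type_name}")
    return errors
```

An event that satisfies the schema yields an empty list; an event with a missing or mistyped field yields one message per problem, which is the kind of feedback that lets invalid events be rejected before they are stored.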

Finally, the Event Platform introduces the concept of streams. Inside the Event Platform, a stream is a contiguous collection of events. In the legacy system, instrumentation declares which specific revision of a schema its events conform to (since the legacy system stores schemas on wiki), and that revision of the schema is used for validation. Once validated, those events flow into a table named after that schema, so there is no way to re-use a schema without making a differently named duplicate. In the new system, we declare a stream and the schema it uses, and the events logged to each stream end up in separate tables, one table per stream.
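The relationship between streams, schemas, and tables can be sketched as follows. The stream names, the schema title, and the table-naming rule are made up for illustration; the point is only that two streams can share one schema while their events land in separate tables.

```python
from collections import defaultdict

# Hypothetical stream declarations: each stream names the schema its events use.
# Both streams re-use the same schema without duplicating it.
STREAMS = {
    "example.search_clicks": {"schema_title": "analytics/example/click"},
    "example.menu_clicks": {"schema_title": "analytics/example/click"},
}


def table_for_stream(stream_name):
    """Derive a table name from a stream name (an invented naming rule)."""
    return stream_name.replace(".", "_").replace("-", "_")


def route_events(events):
    """Group (stream_name, event) pairs into per-stream tables.

    Events addressed to undeclared streams are dropped.
    """
    tables = defaultdict(list)
    for stream_name, event in events:
        if stream_name not in STREAMS:
            continue
        tables[table_for_stream(stream_name)].append(event)
    return dict(tables)
```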

The stream configuration, made available through EventStreamConfig, allows us to declare and configure those streams. Besides specifying which schema a stream's events adhere to, we can also specify the stream's sampling settings. The Event Platform Clients use these settings to change how data is collected without requiring any changes to the instrumentation, and therefore without deploying new versions of the clients. In practice, this means we can adjust sampling rates without building a new version of a mobile app and releasing it through an app store. On the MediaWiki side, we can change sampling rates without patching WikimediaEvents, waiting for the patch to be reviewed, and waiting for it to be deployed along with other patches to MediaWiki Core and other MediaWiki extensions.
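A toy model of this configuration-driven behavior is sketched below. The `SketchClient` class, the `sample_rate` setting name, and the stream name are hypothetical and do not reflect the actual EventStreamConfig format or any real client library. For brevity it makes a random draw per event, rather than the identifier-based in-sample determination that the clients standardize.

```python
import random


class SketchClient:
    """Toy client whose data collection is driven entirely by stream configuration."""

    def __init__(self, fetch_config, rng=random.random):
        self._fetch_config = fetch_config  # callable returning {stream_name: settings}
        self._rng = rng                    # injectable so tests can be deterministic
        self.sent = []                     # stands in for events sent over the network

    def log(self, stream_name, event):
        """Send the event if the stream is configured and the sampling draw passes."""
        config = self._fetch_config().get(stream_name)
        if config is None:
            return False  # undeclared stream: nothing is collected
        rate = config.get("sample_rate", 1.0)  # hypothetical setting name
        if self._rng() >= rate:
            return False  # out-of-sample for this draw
        self.sent.append((stream_name, event))
        return True
```

The instrumentation only ever calls `log()`. Raising or lowering `sample_rate` in the configuration that `fetch_config` returns changes how much data is collected, with no change to the calling code.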

Instrumentation
In the same way that the Event Platform is an evolution of the server-side component of EventLogging, the Event Platform Client (EPC) is an evolution of the client-side component of EventLogging. EPC is:

 * a specification of how clients work with the Event Platform
 * a standardization of algorithms, behaviors, and basic necessities
 * a set of libraries for different platforms adhering to that specification

Previously, different teams implemented their own EventLogging-based analytics solutions, isolated from each other. EPC is an effort to unify that previous work and to establish consistency across platforms. That uniformity and consistency makes it possible to leverage data from multiple platforms to yield insights into how our users use our whole ecosystem of products in unison.

It also makes analysts more portable, enabling them to support teams other than their primary ones. The legacy system, in which every instrumentation has its own quirks and naming is inconsistent, places a heavy burden on each analyst to learn and remember the specifics of their assigned teams' data; if another analyst had to step in as back-up, they too would need to learn those specifics.

EPC takes care of:

 * obtaining stream configuration
 * using the stream config to control data collection
 * determining which streams are in-sample or out-of-sample based on the specific identifier (pageview, session, device) using a standardized algorithm
 * cc'ing streams (copying an event logged to one stream to other streams as well)
 * attaching the necessary metadata to logged events, such as a client-side timestamp recording when the event was generated
 * standardized session ID generation, consistent across MediaWiki, Android, and iOS
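Some of the responsibilities listed above can be sketched in a few lines. This is one plausible scheme under stated assumptions, not necessarily what the EPC specification prescribes. The identifier length, the normalization used for the in-sample decision, the helper names, and the metadata field names (`$schema`, `meta.stream`, `dt`) are all assumptions made for illustration.

```python
import secrets
from datetime import datetime, timezone


def generate_id():
    """Return a random identifier as 20 lowercase hex characters (80 bits).

    The length is an assumption chosen for this sketch.
    """
    return secrets.token_hex(10)


def in_sample(identifier, rate):
    """Deterministically decide whether an identifier is in-sample.

    The first 8 hex characters are read as a 32-bit integer and scaled to
    [0, 1]; the identifier is in-sample when that value is below `rate`.
    The same identifier always gives the same answer for the same rate.
    """
    normalized = int(identifier[:8], 16) / 0xFFFFFFFF
    return normalized < rate


def prepare_event(stream_name, schema_uri, data, now=None):
    """Return a copy of `data` with schema, stream, and client timestamp attached.

    `now` should be a timezone-aware datetime; it defaults to the current time.
    """
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    event = dict(data)
    event["$schema"] = schema_uri
    event["meta"] = {"stream": stream_name}
    # ISO 8601 timestamp in UTC with millisecond precision.
    event["dt"] = now.strftime("%Y-%m-%dT%H:%M:%S.") + f"{now.microsecond // 1000:03d}Z"
    return event
```

Because the decision depends only on the identifier and the rate, a session that is in-sample for a stream stays in-sample for every event in that session. Implementing the same function on MediaWiki, Android, and iOS would give the same answer for the same identifier.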