Wikimedia Product/Analytics Infrastructure/Stream configuration

This page describes features of the Event Platform Client (EPC) as they pertain to stream configuration in the Modern Event Platform (MEP). The stream config is defined in YAML and is made available through Extension:EventStreamConfig. It is loaded via ResourceLoader on the web and is available as JSON through the MediaWiki API, so that non-MediaWiki clients such as the Wikipedia mobile apps can download it.

Required components
The following fields are required for each stream:


 * : the path to the schema the stream uses and the specific version of that schema
 * : the URI of the EventGate endpoint to which events are sent via HTTP POST

The following fields are optional:


 * : specifies the sampling logic, see below

More fields may potentially be added in the future.

Sampling logic
The  field of a stream specifies how events are determined to be in-sample (sent) or out-of-sample (thrown away).

The primary point to remember here is that this controls data collection, NOT data generation. It is up to the individual instrumentation to determine when to  events; this is the data generation part. The sampling field controls which -ed events actually get sent to be processed and put into the database; this is the data collection part. It is very important to be aware of this distinction. When your team instruments an A/B test, the sampling logic in the stream configuration cannot be used to determine which clients get one UX/feature variant or the other. It is up to the instrumentation to make that determination based on session ID, user ID, a special flag in the user properties table, etc.

The two main knobs to turn are:


 * : the ID to use for determination
 * "session" by default, meaning if a session is determined to be in-sample (as specified by ), all events generated by that session will be in-sample and sent to
 * web-specific streams can be configured to use "pageview" token for determination
 * will cause the determination to be made on a page-by-page basis
 * can be useful for getting a random sample of page views, not sessions
 * mobile app-specific streams can be configured to use "device" (app_install_id on iOS and Android)
 * will cause the determination to be made on a device-by-device basis
 * if a device is determined to be in-sample, all of their sessions and events will be in-sample
 * useful for retention metrics, cohort & longitudinal analyses, cross-session analysis
 * : proportion of identifiers that are considered in-sample
 * 1.0 (100%) by default, can be overridden in individual streams
 * set to 0.0 to disable the stream (if you want to keep the stream in the config but prevent events from being sent to it)
 * uses "widening the net" approach: IDs determined to be in-sample at lower rates will be determined to be in-sample at higher rates
 * : specific exemptions, see below

Sampling rate
For example: Suppose we have 4 streams: A, B, C, and D with sampling rates 0.01, 0.1, 0.25, 0.5, respectively. Those streams could be using the same schema or different ones. But specifically, those streams use the same identifier – let's say it's the session token. Remember, in the MEP paradigm streams map to tables inside the database. Here's what you should expect to see in those tables for any time period:


 * Table A will have data from approximately 1% of active sessions in that time period
 * Table B will have data from approx. 10% of active sessions at that time, but definitely all of the sessions found in table A
 * Table C will have data from approx. 25% of active sessions at that time, but definitely all of the sessions found in tables A & B
 * Table D will have data from approx. 50% of active sessions at that time, but definitely all of the sessions found in tables A, B, and C
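
The "widening the net" behavior above can be sketched as follows. This is only an illustration under stated assumptions, not EPC's actual implementation: the source does not say how an identifier is mapped to a number, so the SHA-256 mapping and the names id_to_unit_interval and in_sample are hypothetical. The idea is that each identifier gets a fixed position in [0, 1) and is in-sample whenever that position is below the stream's rate, so raising the rate can only add identifiers, never drop them.

```python
import hashlib


def id_to_unit_interval(identifier: str) -> float:
    """Map an identifier deterministically to a number in [0, 1).

    Assumption: SHA-256 is used here only for illustration; the real
    client may derive the number from the token in a different way.
    """
    digest = hashlib.sha256(identifier.encode("utf-8")).digest()
    # 48 bits fit exactly in a float, so the result is always below 1.0.
    return int.from_bytes(digest[:6], "big") / 2**48


def in_sample(identifier: str, rate: float) -> bool:
    """An identifier is in-sample when its fixed position is below the rate."""
    return id_to_unit_interval(identifier) < rate


# The four streams from the example, all keyed on the session token.
rates = {"A": 0.01, "B": 0.1, "C": 0.25, "D": 0.5}
sessions = [f"session-{i}" for i in range(10_000)]  # made-up session tokens
tables = {
    name: {s for s in sessions if in_sample(s, rate)}
    for name, rate in rates.items()
}

# Every session in a lower-rate table also appears in each higher-rate table.
print(tables["A"] <= tables["B"] <= tables["C"] <= tables["D"])
```

With this scheme a rate of 1.0 keeps every identifier and a rate of 0.0 keeps none, which matches the default and the "disable the stream" setting described earlier.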

Specifying exemptions with rules
The  field can be used to override the stream's   in specific situations. The core use-cases we wanted to support are:


 * being able to specify per- sampling rates, for example:
 * to decrease volume of events sent from English Wikipedia
 * to increase volume of events sent from Czech and Korean Wikipedias
 * being able to specify per- sampling rates, for example:
 * to enable a stream on desktop but not mobile web
 * to disable a stream on desktop and mobile web, but not mobile apps
 * being able to specify sampling rates based on key-value pairs in persistent storage
 * to only enable a stream if a  has a specific   (assuming the key exists at all in the persistent storage)
 * to only enable a stream if a  has one of several values
 * to only enable a stream for specific combinations of

The various ways to specify rules can be combined together, resulting in very specific sampling logic.

Examples
Here are some examples that illustrate how the streams can be configured to have specific sampling behaviors.

Per-wiki sampling rates
In this case we have a "reading_depth" stream and we know that only web-based instrumentation logs to it, so we don't have to specify platform in the rules. First, this stream overrides the default sampling rate of 1.0 by setting it to 0.5. This means that, in general, 50% of sessions (the default identifier for in-sample/out-of-sample determination) that generate (log) data for this stream will have their data collected (sent). Then we start checking specific conditions:


 * For Wikipedia sites in general, collect data from 25% of sessions
 * For the top 10 Wikipedia languages not including English, collect data from 10% of sessions
 * For English Wikipedia specifically, collect data from 1% of sessions
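
One way to picture how such conditions resolve to a single rate is the sketch below. Everything about the rule representation is an assumption made for illustration: the dictionary layout, the "project" and "wiki" keys, the short list standing in for the top 10 languages, and the choice that later, more specific rules override earlier ones are all hypothetical and are not taken from the actual stream config format.

```python
# Hypothetical stream-level default for "reading_depth" (from the example).
DEFAULT_RATE = 0.5

# Hypothetical rule list, ordered from most general to most specific.
RULES = [
    {"match": {"project": ["wikipedia"]}, "rate": 0.25},
    # Stand-in for "top 10 Wikipedia languages not including English".
    {"match": {"wiki": ["dewiki", "frwiki", "jawiki"]}, "rate": 0.10},
    {"match": {"wiki": ["enwiki"]}, "rate": 0.01},
]


def effective_rate(context, rules=RULES, default=DEFAULT_RATE):
    """Return the sampling rate that applies to a given client context.

    Assumption: every matching rule is applied in order, so the last
    (most specific) matching rule determines the final rate.
    """
    rate = default
    for rule in rules:
        conditions = rule["match"].items()
        if all(context.get(key) in allowed for key, allowed in conditions):
            rate = rule["rate"]
    return rate


print(effective_rate({"project": "wikipedia", "wiki": "enwiki"}))
print(effective_rate({"project": "wikipedia", "wiki": "cswiki"}))
print(effective_rate({"project": "wiktionary", "wiki": "enwiktionary"}))
```

Under these assumptions, English Wikipedia resolves to 0.01, any other Wikipedia outside the listed languages to 0.25, and a non-Wikipedia site falls back to the stream's own rate of 0.5.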

Per-platform enabling/disabling
For this example, you may want to acquaint yourself with the notion of stream cc-ing (described in the dedicated section below). Basically, stream cc-ing allows you to log events to multiple streams without writing multiple log calls yourself. Instrumentation simply needs to log an event to the top-level stream and EPC will take care of logging it to any derivative streams.

Suppose we have the following stream configuration: There's a lot going on here at first glance, so let's break down what precisely this configuration is doing:


 * First, in this case all instrumentation directly logs only to the "reading_depth" stream. Data generated by approximately half of all sessions, across all platforms will be collected (sent to ).
 * The "reading_depth.web" stream (cc'd when events are logged to "reading_depth") allows us to collect data from all sessions on desktop and mobile web sites, and none of the sessions on mobile apps.
 * The "reading_depth.apps" stream (cc'd when events are logged to "reading_depth") allows us to collect data from all mobile apps sessions, and none of the web sessions.

In the case of the web and apps sub-streams, we've set a 0% sampling rate and then specified the conditions under which the sampling rate should be 100%.

But it also works the other way – we could have just as easily set a default rate of 100% and then specified the opposite conditions in which it should be 0%.

Key-based enabling/disabling
As with the previous example, you may want to acquaint yourself with the notion of stream cc-ing (described in the dedicated section below) first. Suppose we have the following stream configuration: As with the previous example, there is also a lot going on here at first glance. In this scenario, the Editing team wants to test a new user experience related to section editing on mobile web. Instead of modifying instrumentation to log events directly to different streams if the user is enrolled in the A/B test, they want to piggy-back on existing VisualEditor instrumentation and have editing activity data for users in the test go into separate tables that are easier to query.

Let's break down what precisely this configuration is doing:


 * First, in this case all VisualEditor instrumentation logs directly to the "visual_editor" stream.
 * Data generated by approximately 10% of all sessions will be collected (sent to ).
 * Editing activity data from users in the control group of the A/B test still goes to the "visual_editor" table, so their data contributes to any metrics calculated from that table.
 * For users in the test group, their data does not end up in this table to make sure that any KPIs are unaffected by changes in behavior that result from the UX/workflow being tested.


 * The "visual_editor.mobile_section_test" stream
 * cc'd when events are logged to "visual_editor"
 * We disable the stream by default and only enable it when (1) there is a key in persistent storage called, and (2) when the value associated with that key is either "control group" or "test group".
 * The "visual_editor.mobile_section_test.new_user" stream
 * cc'd when events are logged to "visual_editor.mobile_section_test", which is to say after they've been cc'd to that stream from being logged to "visual_editor"
 * We disable the stream by default and only enable it when both of the following are true:
 * (1) there is a key in persistent storage called, and (2) when the value associated with that key is either "control group" or "test group".
 * (1) there is a key in persistent storage called  which has been set by the instrumentation to flag newly registered users, and (2) when the value associated with that key is.
 * Sessions which do not have new_user flag do not end up in the "visual_editor_mobile_section_test_new_user" table in the database.

In all of these cases, EPC needs to check for the presence of a key in persistent storage and check its associated value. This will be different for different platforms (e.g. session cookie storage on MediaWiki), so it is up to the instrumentation to make any key-value pairs available in the persistent storage that is accessible to EPC.
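
The key-and-value check described above could look roughly like the following sketch. The key names "ab_test_group" and "new_user", the value "yes", and the function name rate_from_storage are hypothetical placeholders; only the group values "control group" and "test group" come from the example. Persistent storage is modelled as a plain dictionary, whereas a real client would read from whatever storage the platform provides.

```python
def rate_from_storage(storage, required, default_rate=0.0, matched_rate=1.0):
    """Enable a stream only when every required key has an allowed value.

    storage:  key-value pairs the instrumentation has made available.
    required: maps each required key to the values that enable the stream.
    A missing key counts as a failed condition, so the default rate applies.
    """
    for key, allowed_values in required.items():
        if key not in storage or storage[key] not in allowed_values:
            return default_rate
    return matched_rate


# Conditions for the two test streams (key names are hypothetical).
in_test = {"ab_test_group": ["control group", "test group"]}
in_test_and_new = {**in_test, "new_user": ["yes"]}

enrolled_new_user = {"ab_test_group": "test group", "new_user": "yes"}
enrolled_old_user = {"ab_test_group": "control group"}
not_enrolled = {}

print(rate_from_storage(enrolled_new_user, in_test_and_new))
print(rate_from_storage(enrolled_old_user, in_test_and_new))
print(rate_from_storage(not_enrolled, in_test))
```

Here the enrolled new user gets a rate of 1.0 for the new-user stream, while the enrolled user without the flag and the user with no enrollment key both fall back to 0.0.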

On the web, EPC does not have access to things like the MediaWiki user properties table, where a team might store a user's enrollment in the A/B test (for example, to keep the experience consistent for the duration of the test, so the same user does not get different experiences from session to session). Nor does it have access to the user table to find out when the user registered. It is up to the instrumentation to work out both: whether the user is in the A/B test and which group they are randomly assigned to, and whether the user can be considered newly registered.

Stream cc-ing
EPC supports cc-ing to streams sharing prefixes. This makes it possible to direct or copy events to different streams without additional instrumentation work. Specifically, the cc-ing feature lets engineers log events to additional streams without having to perform multiple  calls and without having to specify the stream.

To illustrate this concept, suppose we have an  schema and the following stream: As detailed in the sampling logic section above, this is a stream where events are determined to be in-sample (and are sent) for 10% of sessions (since this stream uses the default identifier, the session token), with the exception of English Wikipedia, where approximately 1% of sessions are in-sample.

The Growth team wants to collect editing behavior data on Czech and Korean Wikipedias, but sampled at a higher rate. They have a couple of options:


 * 1) Add a rule to rules, specifying the rates to use for those wikis in the same way that an enwiki-specific rate is specified.
 * 2) Create a new stream (e.g. "edit.growth") that events should be cc'd to.

Remember, in the MEP paradigm streams map to tables. The second option would give the Growth team a separate table, "edit_growth", to work with, and one to which they can apply a different retention policy. For example, if data in the "edit" table is stored for 90 days maximum but the Growth team has an exemption from Legal to retain data for 270 days, that exemption can be applied to the "edit_growth" table.

It might look something like: With this feature, no changes to the instrumentation need to be made or deployed. Users' editing activities are tracked using the already deployed instrumentation, and when that activity happens on Czech and Korean Wikipedias on the desktop and mobile sites, 100% of sessions' generated data is collected into a separate "edit_growth" table.

cc'd streams
The streams to be cc'd are determined by shared prefixes separated by dots, starting at the beginning and up to a maximum depth of 1 level (to prevent duplication).

Suppose we have 4 streams in a (loaded) stream configuration:


 * a
 * a.b
 * a.b.c
 * b.c

and that we log 4 separate events in the instrumentation, one to each stream. Here's what happens:


 * data1 is posted to stream "a" depending on its
 * data1 is cc'd to the only derivative stream for "a" ("a.b", NOT "a.b.c") via
 * data1 is posted to stream "a.b" depending on its
 * data1 is cc'd to the only derivative stream for "a.b" ("a.b.c") via
 * data1 is posted to stream "a.b.c" depending on its
 * data2 is posted to stream "a.b" depending on its
 * data2 is cc'd to the only derivative stream for "a.b" ("a.b.c") via
 * data2 is posted to stream "a.b.c" depending on its
 * data3 is posted to stream "a.b.c" depending on its
 * data4 is posted to stream "b.c" depending on its

Assuming the sampling logic evaluates to TRUE in all cases, here's what data ends up in each table:


 * a:
 * data1 (logged directly)
 * a_b:
 * data1 (per cc when logging to "a")
 * data2 (logged directly)
 * a_b_c:
 * data1 (per cc when logging to "a" and cc-ing to "a.b")
 * data2 (per cc when logging to "a.b")
 * data3 (from direct logging)
 * b_c
 * data4 (from direct logging)
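
The walk-through above can be reproduced with a small sketch. The function names direct_children and log_event are hypothetical, the sampling check is assumed to pass for every stream, and the mapping from stream name to table name (dots replaced by underscores) is inferred from the table names used on this page. This is an illustration of the one-level prefix rule, not EPC's actual code.

```python
from collections import defaultdict

# The four streams in the (loaded) stream configuration.
STREAMS = ["a", "a.b", "a.b.c", "b.c"]


def direct_children(stream, all_streams):
    """Streams that extend `stream` by exactly one dot-separated level."""
    prefix = stream + "."
    return [
        s for s in all_streams
        if s.startswith(prefix) and "." not in s[len(prefix):]
    ]


def log_event(stream, data, all_streams, tables):
    """Post data to a stream, then cc it to that stream's direct children.

    Assumption: sampling evaluates to TRUE everywhere, so every post
    results in a row in the corresponding table.
    """
    if stream not in all_streams:
        return
    tables[stream.replace(".", "_")].append(data)
    for child in direct_children(stream, all_streams):
        log_event(child, data, all_streams, tables)


tables = defaultdict(list)
events = [("a", "data1"), ("a.b", "data2"), ("a.b.c", "data3"), ("b.c", "data4")]
for stream, data in events:
    log_event(stream, data, STREAMS, tables)

for table, rows in sorted(tables.items()):
    print(table, rows)
```

Because each stream only cc's to streams one level deeper, data1 reaches "a.b.c" exactly once (through "a.b") rather than twice, which is the duplication that the depth limit is meant to prevent.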

Stream cc-ing is a powerful feature, but with great power comes great responsibility.