Wikimedia Product/Analytics Infrastructure/Stream configuration

This page describes features of the Event Platform Client (EPC) as it pertains to stream configuration in the Modern Event Platform (MEP), which – for analytics purposes – is set in PHP with  in wmf-config/InitialiseSettings.php (from EventStreamConfig).

Required components
The following fields are required for each stream:


 * : the name of the stream
 * : just the title of the schema the stream uses (note: schema name and schema version should be specified by the instrumentation)
 * : which EventGate instance to send events to. For analytics events logged with EventLogging's  or Event Platform Client on Android/iOS, this will be "eventgate-analytics-external" and those events will be sent to intake-analytics.wikimedia.org/v1/events

The following fields are optional:


 * the  setting defines the rules for determining whether an event is sent or thrown away (see below)
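As a rough illustration of the required and optional fields, one stream's entry can be checked for completeness as sketched below. The key names (`stream`, `schema_title`, `destination_event_service`, `sampling`) and the schema title are assumptions made for this example, and the sketch is written in Python rather than the PHP used in wmf-config/InitialiseSettings.php.

```python
# Hypothetical key names for illustration; the real configuration is set in PHP.
REQUIRED_FIELDS = ("stream", "schema_title", "destination_event_service")


def missing_fields(stream_config):
    """Return the required fields that are absent from one stream's entry."""
    return [field for field in REQUIRED_FIELDS if field not in stream_config]


example_stream = {
    "stream": "edit",
    "schema_title": "analytics/example",  # hypothetical schema title
    "destination_event_service": "eventgate-analytics-external",
    # the optional sampling settings are omitted, so defaults would apply
}

print(missing_fields(example_stream))  # prints []
```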

Sampling settings
The  field of a stream specifies how events are determined to be in-sample (sent) or out-of-sample (thrown away).

The primary point to remember here is that this controls data collection, NOT data generation. It is up to the individual instrumentation to determine when to  events – this is the data generation part. The sampling field controls which -ed events actually get sent to be processed and put into the database – this is the data collection part. It is very important to be aware of this distinction. When your team instruments an A/B test, the sampling logic in the stream configuration cannot be used to determine which clients get one UX/feature variant or the other; it is up to the instrumentation to make that determination based on session ID, user ID, a special flag in the user properties table, etc.

The two main knobs to turn are:


 * : the identifier to use for determination
 * "session" by default, meaning if a session is determined to be in-sample (as specified by ), all events generated by that session will be in-sample and sent to
 * On mobile apps, the session ID is generated by EPC using the same algorithm as  in mediawiki.user.js and EPC manages it based on state change notifications from the application.
 * On the web – in the initial rollout of EPC –  is used as the session ID and its persistence managed by MediaWiki Core, not EPC.
 * web-specific streams can be configured to use "pageview" token for determination
 * will cause the determination to be made on a page-by-page basis
 * can be useful for getting a random sample of page views, not sessions
 * mobile app-specific streams can be configured to use "device" (app_install_id on iOS and Android)
 * will cause the determination to be made on a device-by-device basis
 * if a device is determined to be in-sample, all of its sessions and events will be in-sample
 * useful for retention metrics, cohort & longitudinal analyses, cross-session analysis
 * : proportion of identifiers that are considered in-sample
 * 1.0 (100%) by default, can be overridden in individual streams
 * set to 0.0 to disable the stream (if you want to keep the stream in the config but prevent events from being sent to it)
 * uses a "widening the net" approach: IDs determined to be in-sample at lower rates will also be determined to be in-sample at higher rates
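The two knobs above can be sketched as a single deterministic check. This is a minimal sketch, not EPC's actual implementation: it assumes the chosen identifier is hashed (SHA-256 here, purely for illustration) to a fixed position in [0, 1) and compared against the rate. The function names and the `identifier` and `rate` keys are hypothetical.

```python
import hashlib


def in_sample(identifier_value, rate):
    """Map an identifier to a fixed position in [0, 1) and compare it to the rate."""
    digest = hashlib.sha256(identifier_value.encode("utf-8")).hexdigest()
    position = int(digest[:8], 16) / 2**32  # always strictly below 1.0
    return position < rate


def stream_in_sample(sampling, ids):
    """Apply a stream's sampling settings, falling back to the defaults."""
    identifier = sampling.get("identifier", "session")  # "session" by default
    rate = sampling.get("rate", 1.0)                    # 100% by default
    return in_sample(ids[identifier], rate)


# Hypothetical identifier values available to the client at logging time.
ids = {"session": "abc123", "pageview": "pv-42", "device": "install-7"}

print(stream_in_sample({}, ids))             # defaults: always in-sample
print(stream_in_sample({"rate": 0.0}, ids))  # rate 0.0 disables the stream
```

Because each identifier always maps to the same position, raising the rate can only add identifiers to the sample, which is the "widening the net" behavior.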

Sampling rate
For example, suppose we have 4 streams (A, B, C, and D) with sampling rates 0.01, 0.1, 0.25, and 0.5, respectively. Those streams could be using the same schema or different ones, but they all use the same identifier; let's say it's the session token. Remember, in the MEP paradigm streams map to tables inside the database. Here's what you should expect to see in those tables for any time period:


 * Table A will have data from approximately 1% of active sessions in that time period
 * Table B will have data from approximately 10% of active sessions in that time period, including all of the sessions found in table A
 * Table C will have data from approximately 25% of active sessions in that time period, including all of the sessions found in tables A and B
 * Table D will have data from approximately 50% of active sessions in that time period, including all of the sessions found in tables A, B, and C
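The nesting of tables A through D can be checked with a small simulation. This is a sketch under the assumption that each session token is mapped deterministically to a position in [0, 1) (a SHA-256 hash is used here for illustration only); the session tokens are made up.

```python
import hashlib


def in_sample(identifier_value, rate):
    """Deterministically decide whether an identifier is in-sample at a rate."""
    digest = hashlib.sha256(identifier_value.encode("utf-8")).hexdigest()
    return int(digest[:8], 16) / 2**32 < rate


rates = {"A": 0.01, "B": 0.1, "C": 0.25, "D": 0.5}
sessions = [f"session-{i}" for i in range(10_000)]  # made-up session tokens

# One "table" per stream: the set of sessions whose events would be sent.
tables = {
    name: {s for s in sessions if in_sample(s, rate)}
    for name, rate in rates.items()
}

for name, members in tables.items():
    print(name, len(members) / len(sessions))  # close to the configured rate

# Every session in a lower-rate table also appears in each higher-rate table.
assert tables["A"] <= tables["B"] <= tables["C"] <= tables["D"]
```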

Stream cc-ing
EPC supports cc-ing to streams sharing prefixes. This makes it possible to direct or copy events to different streams without additional instrumentation work. Specifically, the cc-ing feature lets engineers and analysts log events to additional streams without having to perform multiple  calls manually.

To illustrate this concept, suppose we have an  schema and the following stream: As detailed in the sampling logic section above, this is a stream where events are determined to be in-sample (and are sent) for 10% of sessions, since this stream uses the default identifier (session token).

The Growth team wants to collect editing behavior data on Czech and Korean Wikipedias, but sampled at a higher rate. They can create a new stream (e.g. "edit.growth") for those wikis (which will use the default 100% rate) and when the instrumentation logs events to the "edit" stream, those events would be logged to the "edit.growth" stream automatically, without the need for a separate  call (  on MediaWiki): Remember, in the MEP paradigm streams map to tables. The second option would give the Growth team a separate table, "edit_growth", to work with, and one to which they can apply a different retention policy. For example, if data in the "edit" table is stored for 90 days maximum but the Growth team has an exemption from Legal to retain data for 270 days, that can be applied to the "edit_growth" table.

cc'd streams
The child streams to be cc'd are determined by shared prefixes separated by dots, starting at the beginning and up to a maximum depth of 1 level (direct child). To prevent duplication, only direct children are cc'd. The parent stream does not need to exist in the stream configuration for its children to be cc'd. See the example below for clarification.

Suppose we have 4 streams in a (loaded) stream configuration:


 * a
 * a.b
 * a.b.c
 * b.c

and that we log 5 separate events in the instrumentation: data1 to "a", data2 to "a.b", data3 to "a.b.c", data4 to "b" (which is not in the configuration), and data5 to "b.c". Here's what happens:


 * data1 is posted to stream "a" depending on its
 * data1 is cc'd to the only child stream for "a" ("a.b", NOT "a.b.c") via
 * data1 is posted to stream "a.b" depending on its
 * data1 is cc'd to the only child stream for "a.b" ("a.b.c") via
 * data1 is posted to stream "a.b.c" depending on its
 * data2 is posted to stream "a.b" depending on its
 * data2 is cc'd to the only child stream for "a.b" ("a.b.c") via
 * data2 is posted to stream "a.b.c" depending on its
 * data3 is posted to stream "a.b.c" depending on its
 * data4 is NOT posted to stream "b" because there's no stream by that name in the configuration
 * HOWEVER, data4 IS cc'd to the only child stream for "b" ("b.c") via
 * data4 is posted to stream "b.c" depending on its
 * data5 is posted to stream "b.c" depending on its

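Assuming the sampling logic evaluates to TRUE in all cases, here's what we end up with:

 * stream "a" receives data1
 * stream "a.b" receives data1 and data2
 * stream "a.b.c" receives data1, data2, and data3
 * stream "b.c" receives data4 and data5

There's no table "b" because there is no stream "b" in the configuration, even though data4 was logged to that stream.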

Notice that when logging to parent stream "a", only its direct child ("a.b") is cc'd. The stream "a.b.c" is not a direct child of "a". Imagine if all levels of children were considered: data1 would have been cc'd to "a.b.c" twice – once from "a" and once from "a.b". Also notice that even though b.c's parent stream "b" does not exist in the stream config, "b.c" still got cc'd.
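The cc-ing rules can be sketched as a short recursive routine. This is an illustrative model rather than EPC's code: the function names are hypothetical, sampling is assumed to evaluate to TRUE everywhere, and each cc'd child is treated like a fresh post so that its own direct children are cc'd in turn.

```python
def direct_children(stream, configured):
    """Configured streams exactly one dot-separated level below `stream`."""
    prefix = stream + "."
    return [
        name for name in configured
        if name.startswith(prefix) and "." not in name[len(prefix):]
    ]


def dispatch(stream, data, configured, tables):
    """Post to `stream` if it is configured, then cc to its direct children."""
    if stream in configured:
        tables.setdefault(stream, []).append(data)
    for child in direct_children(stream, configured):
        dispatch(child, data, configured, tables)


configured = ["a", "a.b", "a.b.c", "b.c"]
tables = {}
events = [("a", "data1"), ("a.b", "data2"), ("a.b.c", "data3"),
          ("b", "data4"), ("b.c", "data5")]
for stream, data in events:
    dispatch(stream, data, configured, tables)

# data1 reaches "a.b.c" exactly once, via "a.b"; nothing is stored for "b".
print(tables["a.b.c"])  # prints ['data1', 'data2', 'data3']
print("b" in tables)    # prints False
```

Restricting the cc step to direct children is what keeps data1 from arriving at "a.b.c" twice.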

Stream cc-ing is a powerful feature, but with great power comes great responsibility.

Specifying exemptions
In a later version of the Modern Event Platform Client Libraries we'd like a more detailed, more sophisticated targeting solution. One way we could achieve that is by adding a new configurable to :

An  field which can be used to override the stream's   in specific situations. The core use-cases we wanted to support are:
 * being able to specify per- sampling rates, for example:
 * to decrease volume of events sent from English Wikipedia
 * to increase volume of events sent from Czech and Korean Wikipedias
 * being able to specify per- sampling rates, for example:
 * to enable a stream on desktop but not mobile web
 * to disable a stream on desktop and mobile web, but not mobile apps
 * being able to specify sampling rates based on key-value pairs in persistent storage
 * to only enable a stream if a  has a specific   (assuming the key exists at all in the persistent storage)
 * to only enable a stream if a  has one of several values
 * to only enable a stream for specific combinations of

The various ways to specify exemptions can be combined together, resulting in very specific sampling logic. Here are some examples that illustrate how the streams can be configured to have specific sampling behaviors.