Wikimedia Product/Analytics Infrastructure/Stream configuration

From mediawiki.org

This page describes features of the Event Platform Client (EPC) that pertain to stream configuration in the Modern Event Platform (MEP). For analytics purposes, stream configuration is set in PHP with wgEventStreams in wmf-config/InitialiseSettings.php (from EventStreamConfig).

Required components

The following fields are required for each stream:

  • stream: the name of the stream
  • schema_title: the title of the schema the stream uses (note: the schema name and schema version should be specified by the instrumentation)
  • destination_event_service: which EventGate instance to send events to. For analytics events logged with EventLogging's mw.eventLog.submit() or Event Platform Client on Android/iOS, this will be "eventgate-analytics-external" and those events will be sent to intake-analytics.wikimedia.org/v1/events

The following fields are optional:

  • sampling: defines the rules for determining whether an event is sent or thrown away (see below)
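
The required fields above lend themselves to a quick sanity check before a stream entry is deployed. The following is a minimal sketch, not part of EPC or EventStreamConfig: the function name missingRequiredFields and the constant REQUIRED_FIELDS are hypothetical, and the key names are assumed to match the ones used in the configuration examples on this page.

```javascript
// Hypothetical validator; key names follow the configuration examples on this page.
const REQUIRED_FIELDS = ['stream', 'schema_title', 'destination_event_service'];

// Return the names of required fields that are absent or not non-empty strings.
function missingRequiredFields(streamConfig) {
  return REQUIRED_FIELDS.filter(
    (field) => typeof streamConfig[field] !== 'string' || streamConfig[field] === ''
  );
}

// Example entry mirroring the "edit" stream used later on this page.
const editStream = {
  stream: 'edit',
  schema_title: 'analytics/editing/attempt-step-v2',
  destination_event_service: 'eventgate-analytics-external',
  sampling: { rate: 0.1 }, // optional
};

console.log(missingRequiredFields(editStream));
console.log(missingRequiredFields({ stream: 'edit' }));
```

An empty result means every required field is present; the optional sampling field is deliberately not checked.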

Sampling settings

The sampling field of a stream specifies how events are determined to be in-sample (sent) or out-of-sample (thrown away).

The primary point to remember is that this controls data collection, NOT data generation. It is up to the individual instrumentation to determine when to log() events: this is the data generation part. The sampling field controls which log()-ed events actually get sent to be processed and put into the database: this is the data collection part. It is very important to be aware of this distinction. When your team instruments an A/B test, the sampling logic in the stream configuration cannot be used to determine which clients get one UX/feature variant or the other. It is up to the instrumentation to make that determination based on session ID, user ID, a special flag in the user properties table, etc.

The two main knobs to turn are:

  • unit: the identifier to use for determination
    • "session" by default, meaning if a session is determined to be in-sample (as specified by rate), all events generated by that session will be in-sample and sent to destination
      • On mobile apps, the session ID is generated by EPC using the same algorithm as generateRandomSessionId() in mediawiki.user.js and EPC manages it based on state change notifications from the application.
      • On the web – in the initial rollout of EPC – mw.user.sessionId() is used as the session ID and its persistence managed by MediaWiki Core, not EPC.
    • web-specific streams can be configured to use "pageview" token for determination
      • will cause the determination to be made on a page-by-page basis
      • can be useful for getting a random sample of page views, not sessions
    • mobile app-specific streams can be configured to use "device" (app_install_id on iOS and Android)
      • will cause the determination to be made on a device-by-device basis
      • if a device is determined to be in-sample, all of its sessions and events will be in-sample
      • useful for retention metrics, cohort & longitudinal analyses, cross-session analysis
  • rate: proportion of identifiers that are considered in-sample
    • 1.0 (100%) by default, can be overridden in individual streams
    • set to 0.0 to disable the stream (if you want to keep the stream in the config but prevent events from being sent to it)
    • uses "widening the net" approach: IDs determined to be in-sample at lower rates will be determined to be in-sample at higher rates
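
One way to obtain the "widening the net" property is to map each identifier deterministically to a number in [0, 1) and compare it against rate. The sketch below illustrates that idea under stated assumptions; it is not the EPC source. The names idToUnitInterval and isInSample are hypothetical, identifiers are assumed to start with at least 8 hexadecimal characters, and treating an unavailable unit as out-of-sample is an assumption made here for illustration.

```javascript
// Hypothetical helper names; a sketch of the idea, not the EPC implementation.
const UINT32_RANGE = 0x100000000; // 2^32

// Map the first 8 hex characters of an identifier to a number in [0, 1).
function idToUnitInterval(id) {
  const prefix = id.slice(0, 8);
  if (!/^[0-9a-f]{8}$/i.test(prefix)) {
    throw new Error('identifier must start with 8 hex characters');
  }
  return parseInt(prefix, 16) / UINT32_RANGE;
}

// Decide whether events for the configured unit are sent (true) or dropped (false).
// Defaults follow the text above: unit "session", rate 1.0.
function isInSample(streamConfig, identifiers) {
  const sampling = streamConfig.sampling || {};
  const unit = sampling.unit === undefined ? 'session' : sampling.unit;
  const rate = sampling.rate === undefined ? 1.0 : sampling.rate;
  const id = identifiers[unit];
  if (id === undefined) {
    return false; // assumption: a unit with no identifier sends nothing
  }
  // The same id always maps to the same number, so raising the rate can only
  // add identifiers to the sample, never remove them.
  return idToUnitInterval(id) < rate;
}

// Made-up identifiers for illustration.
const identifiers = {
  session: '80000000aaaaaaaaaaaa',
  pageview: '00000001aaaaaaaaaaaa',
};
console.log(isInSample({ sampling: { rate: 0.1 } }, identifiers));
console.log(isInSample({ sampling: { unit: 'pageview', rate: 0.1 } }, identifiers));
```

Because the comparison is strict, a rate of 0.0 never admits any identifier and a rate of 1.0 admits all of them, which matches the two edge cases described in the list.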

Sampling rate

For example: Suppose we have 4 streams: A, B, C, and D with sampling rates 0.01, 0.1, 0.25, 0.5, respectively. Those streams could be using the same schema or different ones. But specifically, those streams use the same identifier – let's say it's the session token. Remember, in the MEP paradigm streams map to tables inside the database. Here's what you should expect to see in those tables for any time period:

  • Table A will have data from approximately 1% of active sessions in that time period
  • Table B will have data from approximately 10% of active sessions in that time period, and definitely all of the sessions found in table A
  • Table C will have data from approximately 25% of active sessions in that time period, and definitely all of the sessions found in tables A and B
  • Table D will have data from approximately 50% of active sessions in that time period, and definitely all of the sessions found in tables A, B, and C
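
The nesting of tables A through D can be checked with a small simulation. This is a sketch under assumptions, not the EPC implementation: inSample uses a hypothetical rule that compares the first 8 hex characters of the session ID against the rate, and the 1000 session IDs are synthetic, evenly spread values chosen so the result is reproducible.

```javascript
// Hypothetical sampling rule: first 8 hex characters mapped to [0, 1).
function inSample(id, rate) {
  return parseInt(id.slice(0, 8), 16) / 0x100000000 < rate;
}

const rates = { A: 0.01, B: 0.1, C: 0.25, D: 0.5 };

// Build 1000 synthetic session IDs spread evenly over the ID space.
const sessions = [];
for (let i = 0; i < 1000; i++) {
  const prefix = Math.floor(((i + 0.5) / 1000) * 0x100000000)
    .toString(16)
    .padStart(8, '0');
  sessions.push(prefix + '000000000000');
}

// Each stream's "table" is the set of sessions that are in-sample at its rate.
const tables = {};
for (const [stream, rate] of Object.entries(rates)) {
  tables[stream] = new Set(sessions.filter((id) => inSample(id, rate)));
}

// True when every member of `small` is also in `big`.
function isSubset(small, big) {
  return [...small].every((id) => big.has(id));
}

console.log(Object.entries(tables).map(([stream, set]) => [stream, set.size]));
console.log(isSubset(tables.A, tables.B), isSubset(tables.B, tables.C), isSubset(tables.C, tables.D));
```

With real, randomly generated session IDs the table sizes would only approximate the configured rates, but the subset relationships would hold exactly, because each session maps to the same number for every stream.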

Future developments

Stream cc-ing

EPC supports cc-ing to streams sharing prefixes. This makes it possible to direct or copy events to different streams without additional instrumentation work. Specifically, the cc-ing feature lets engineers and analysts log events to additional streams without having to perform multiple log() calls manually.

To illustrate this concept, suppose we have an analytics/editing/attempt-step-v2 schema and the following stream:

'wgEventStreams' => [
    'default' => [
        [
            'stream' => 'edit',
            'schema_title' => 'analytics/editing/attempt-step-v2',
            'destination_event_service' => 'eventgate-analytics-external',
            'sampling' => [
                'rate' => 0.1,
            ],
        ],
    ],
]

As detailed in the sampling settings section above, events in this stream are determined to be in-sample (and are sent) for 10% of sessions, since this stream uses the default identifier (session token). Suppose the Growth team wants to collect editing behavior data on Czech and Korean Wikipedias, but sampled at a higher rate. They can create a new stream (e.g. "edit.growth") for those wikis, which will use the default 100% rate. When the instrumentation logs events to the "edit" stream, those events are also logged to the "edit.growth" stream automatically, without the need for a separate log() call (mw.eventLog.submit() on MediaWiki):

'wgEventStreams' => [
    'default' => [
        [
            'stream' => 'edit',
            'schema_title' => 'analytics/editing/attempt-step-v2',
            'destination_event_service' => 'eventgate-analytics-external',
            'sampling' => [
                'rate' => 0.1,
            ],
        ],
    ],
    'cswiki' => [
        [
            'stream' => 'edit.growth',
            'schema_title' => 'analytics/editing/attempt-step-v2',
            'destination_event_service' => 'eventgate-analytics-external',
        ],
    ],
    'kowiki' => [
        [
            'stream' => 'edit.growth',
            'schema_title' => 'analytics/editing/attempt-step-v2',
            'destination_event_service' => 'eventgate-analytics-external',
        ],
    ],
]

Remember, in the MEP paradigm streams map to tables. This second configuration gives the Growth team a separate table, "edit_growth", to which they can apply a different retention policy. For example, if data in the "edit" table is stored for 90 days maximum but the Growth team has an exemption from Legal to retain data for 270 days, that exemption can be applied to the "edit_growth" table.

cc'd streams

The child streams to be cc'd are determined by shared prefixes separated by dots, starting at the beginning of the stream name and up to a maximum depth of 1 level (direct child). To prevent duplication, only direct children are cc'd. The parent stream does not need to exist in the stream configuration for its children to be cc'd. See the example below for clarification.

Suppose we have 4 streams in a (loaded) stream configuration:

  • a
  • a.b
  • a.b.c
  • b.c

and that we log 5 separate events in the instrumentation: one to each configured stream, plus one to "b", which is not in the configuration. Here's what happens:

  • log("a", "/analytics/example/1.0.0", data1)
    • data1 is posted to stream "a" depending on its sampling
    • data1 is cc'd to the only child stream for "a" ("a.b", NOT "a.b.c") via log("a.b", "/analytics/example/1.0.0", data1)
      • data1 is posted to stream "a.b" depending on its sampling
      • data1 is cc'd to the only child stream for "a.b" ("a.b.c") via log("a.b.c", "/analytics/example/1.0.0", data1)
        • data1 is posted to stream "a.b.c" depending on its sampling
  • log("a.b", "/analytics/example/1.0.0", data2)
    • data2 is posted to stream "a.b" depending on its sampling
    • data2 is cc'd to the only child stream for "a.b" ("a.b.c") via log("a.b.c", "/analytics/example/1.0.0", data2)
      • data2 is posted to stream "a.b.c" depending on its sampling
  • log("a.b.c", "/analytics/example/1.0.0", data3)
    • data3 is posted to stream "a.b.c" depending on its sampling
  • log("b", "/analytics/example/1.0.0", data4)
    • data4 is NOT posted to stream "b" because there's no stream by that name in the configuration
    • HOWEVER, data4 IS cc'd to the only child stream for "b" ("b.c") via log("b.c", "/analytics/example/1.0.0", data4)
      • data4 is posted to stream "b.c" depending on its sampling
  • log("b.c", "/analytics/example/1.0.0", data5)
    • data5 is posted to stream "b.c" depending on its sampling

Assuming the sampling logic evaluates to TRUE in all cases, here's what we end up with:

What data ends up in which tables inside database

Table | Event data | In instrumentation                              | Explanation
a     | data1      | log("a", "/analytics/example/1.0.0", data1)     | Logged directly
a_b   | data1      | log("a", "/analytics/example/1.0.0", data1)     | Logged via cc
a_b   | data2      | log("a.b", "/analytics/example/1.0.0", data2)   | Logged directly
a_b_c | data1      | log("a", "/analytics/example/1.0.0", data1)     | Logged via cc
a_b_c | data2      | log("a.b", "/analytics/example/1.0.0", data2)   | Logged via cc
a_b_c | data3      | log("a.b.c", "/analytics/example/1.0.0", data3) | Logged directly
b_c   | data4      | log("b", "/analytics/example/1.0.0", data4)     | Logged via cc
b_c   | data5      | log("b.c", "/analytics/example/1.0.0", data5)   | Logged directly

There's no table "b" because there is no stream "b" in the configuration, even though data4 was logged to that stream.

Notice that when logging to parent stream "a", only its direct child ("a.b") is cc'd. The stream "a.b.c" is not a direct child of "a". Imagine if all levels of children were considered: data1 would have been cc'd to "a.b.c" twice – once from "a" and once from "a.b". Also notice that even though b.c's parent stream "b" does not exist in the stream config, "b.c" still got cc'd.
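
The walkthrough above can be reproduced with a short recursive sketch. It is an illustration under assumptions rather than the EPC source: directChildren, log, and post are hypothetical names, the schema argument is omitted for brevity, sampling is assumed to always evaluate to TRUE, and the table name is assumed to be the stream name with dots replaced by underscores, as in the table above.

```javascript
// Hypothetical sketch of cc-ing; not the EPC implementation.
// Direct children of "a" are configured streams named "a.<one segment>".
function directChildren(streamName, configuredStreams) {
  const prefix = streamName + '.';
  return configuredStreams.filter(
    (name) => name.startsWith(prefix) && !name.slice(prefix.length).includes('.')
  );
}

// Post to the stream if it is configured, then cc to each direct child.
// Grandchildren are reached through the recursion, so each stream gets
// a given event at most once.
function log(streamName, data, configuredStreams, post) {
  if (configuredStreams.includes(streamName)) {
    post(streamName, data);
  }
  for (const child of directChildren(streamName, configuredStreams)) {
    log(child, data, configuredStreams, post);
  }
}

const streams = ['a', 'a.b', 'a.b.c', 'b.c'];
const tables = {};

// Stand-in for "sample, then send": always stores the event in the stream's table.
const post = (stream, data) => {
  const table = stream.replace(/\./g, '_');
  (tables[table] = tables[table] || []).push(data);
};

log('a', 'data1', streams, post);
log('a.b', 'data2', streams, post);
log('a.b.c', 'data3', streams, post);
log('b', 'data4', streams, post);
log('b.c', 'data5', streams, post);

console.log(tables);
```

Restricting the lookup to direct children is what keeps data1 from arriving in "a.b.c" twice: it gets there only through the cc from "a.b".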

Stream cc-ing is a powerful feature, but with great power comes great responsibility.

Specifying exemptions

In a later version of the Modern Event Platform Client Libraries we'd like a more detailed, more sophisticated targeting solution. One way we could achieve that is by adding a new configurable field to sampling: an exemptions field, which can be used to override the stream's rate in specific situations. The core use cases we want to support are:

  • being able to specify per-wiki sampling rates, for example:
    • to decrease volume of events sent from English Wikipedia
    • to increase volume of events sent from Czech and Korean Wikipedias
  • being able to specify per-platform sampling rates, for example:
    • to enable a stream on desktop but not mobile web
    • to disable a stream on desktop and mobile web, but not mobile apps
  • being able to specify sampling rates based on key-value pairs in persistent storage
    • to only enable a stream if a key has a specific value (assuming the key exists at all in the persistent storage)
    • to only enable a stream if a key has one of several values
    • to only enable a stream for specific combinations of keys

The various ways to specify exemptions (wiki, platform, key, keys) can be combined together, resulting in very specific sampling logic. Here are some examples that illustrate how the streams can be configured to have specific sampling behaviors.