Wikimedia Product/Analytics Infrastructure/Standard


READ THIS FIRST

This document is a draft. Don't take it too seriously. It's a way to imagine the complete system as we work on it together.

  • Most names are placeholders of some kind and need to be discussed in a group name-a-thon.
  • Parts of other documentation were used or adapted to fill out certain sections in order to imagine a complete document.
  • No example or anything in here is authoritative, and some of it is pure speculation!


Currently supported platforms
Spec version | Release date | MediaWiki | Android | iOS | KaiOS
0.3          | Q2 FY20-21   |           |         |     |
0.2 BETA     | Q1 FY20-21   | X         |         |     |
0.1 BETA     | Q4 FY19-20   |           |         |     |
No support   | n/a          |           | X       | X   | X

The data acquisition system (actual name TBD) is a set of standardized APIs for designing, implementing, and maintaining software instrumentation and controlled experiments. It allows data scientists and engineers to create high-quality data sets that are continually updated with new data from actual software running across millions of devices in live production. Using the data acquisition system, instrumentation can be centrally coordinated and iterated without the need for a software release. Its practices promote software re-use, and make instrumentation more adaptable, rigorous, and safe.

The data acquisition system is a middleware layer. It provides native device environments (web browsers, mobile apps, etc.) with software libraries that extend the capabilities of the Wikimedia Event Platform. The libraries perform operations like randomness generation, timestamp generation, statistical sampling, and more, to create complex events that are specialized for data science and instrumentation. The libraries follow a single cross-platform standard that ensures statistical coherence and portability of events. Relying on the library makes instrumentation code easier to write, easier to debug, and less prone to statistical bias.

This document serves as the data acquisition system living standard (DAS-LS). It describes the three principal resource types of the event system (event, event schema, and event stream), the event lifecycle, and the client libraries' algorithms and APIs. It should be updated regularly in accordance with software changes. It is maintained by the Product Data Engineering team.

System Description

The data acquisition system is used to create data sets with events streamed from live software running in production. In addition to writing instrumentation code, developers have the ability to design their own instrumentation events using an event schema, collect events from one or more instruments together into different event streams, and manage their instruments remotely using an event stream configuration.

Creating a new data set is organized as a project called a study. Running a study involves a number of stakeholders, described in this DACI. Data scientists, engineers, and product owners specify the study question and the data needed to answer it. Together they produce an instrumentation plan, which describes how the data will be collected on the particular product(s). The instrumentation plan undergoes reviews for performance and privacy. Once the instrumentation plan is approved, implementation of the instrument can proceed.

In the data acquisition system, an instrument is a purpose-built unit of application code that exists to detect a specific situation of interest, such as an error or a button click. When detection occurs, the instrument produces a set of data properties that form the core of what will become the instrumentation event. It passes these data properties to the instrumentation platform library, along with the name of the event stream to which it wants the event sent.

The instrumentation platform library performs various checks on behalf of the stream, including client configuration and statistical sampling. If the submitted event passes these checks, the library will assign it additional properties and values, constructing what will become the final event. The final event is then scheduled for transmission over the network. If network transmission succeeds, the event is received by EventGate, the stream intake service of the event platform. EventGate will run additional checks, and verify the event's schema. If the event passes these checks, it will be inserted into a distributed queue, and be loaded into a database table or other storage backend within a few hours.

EXAMPLE: Instrumenting a basic application.

Application (before instrumentation)

To start using the data acquisition system, you must have a software application you want to instrument, and that application should have access to an instrumentation platform library. Consider this simple website with embedded JavaScript, which renders a clickable button, and will print a message to the browser console when the button is clicked. Perhaps we want to instrument that click, so that the message is also sent as an instrumentation event.

<html>
<head>
<script>
function handleClick() {
  console.log( 'Thanks for the click!' );
}
</script>
</head>
<body>
<button onclick="handleClick()">Click Me!</button>
</body>
</html>

Event schema

Our instrument will need to make an event. Start by defining the event. Events are defined with event schema. To carry the click message, our event will need a property called message that takes values of type string, so let's make a schema with that.

title: analytics/clickMessageSchema
description: >
    Sends a message when the button gets clicked.
$id: /analytics/clickMessageSchema/1.0.0
$schema: https://json-schema.org/draft-07/schema#
type: object
allOf:
  - properties:
      message:
        type: string 
        description: >
            Message from the click

Event stream

Where will the events go? We need to configure a new event stream where we can send the events. The event stream is what will become our data set. By default, event streams will be directed to a Hive table in the data lake, with a table name that matches the name of the stream. See Where is my data?

'clickMessageStream': {
  stream_name: 'clickMessageStream',
  schema_title: 'clickMessageSchema',
}

Application (after instrumentation)

Once we have a schema and a stream, we just need to write the instrumentation code to detect the event and .submit() the message data to the instrumentation platform library.

<html>
<head>
<script>
window.addEventListener( 'load', function() {
   InstrumentationLibrary.initialize();
} );
function handleClick() {
   console.log( 'Thanks for the click!' );

   InstrumentationLibrary.submit( 'clickMessageStream', {
      '$schema': '/analytics/clickMessageSchema/1.0.0#',
      'message': 'Thanks for the click!'
   } );
}
</script>
</head>
<body>
<button onclick="handleClick()">Click Me!</button>
</body>
</html>

Output instrumentation event

If all goes well, the library will transmit the following event, which consists of the manual properties we passed in, along with a number of automatic properties.

{
  meta: {
    stream: 'clickMessageStream'
  },
  client_dt: '2020-01-05T13:30:03.33',
  pageview_id: '27d4918051a51b381234',
  session_id: '5a3138c61384d2910000',
  $schema: '/analytics/clickMessageSchema/1.0.0#',
  message: 'Thanks for the click!'
}


Events

An instrumentation event.
{
  meta: {
    stream: 'example'
  },
  client_dt: '2020-01-05T13:30:03.33',
  pageview_id: '27d4918051a51b381234',
  session_id: '5a3138c61384d2910000',
  $schema: '/instrumentation/example/1.0.0#',
  message: 'nice example!'
}

An instrumentation event (sometimes abbreviated iev or simply event) is an event platform event that is specialized for instrumentation and data analysis. These events follow enhanced standards for portability and statistical coherence, and have a number of features aimed at making them easier to author and deploy safely, including a catalog of pre-defined schema fragments, automatically managed properties, and a strict observational policy. All events produced by the instrumentation platform library are instrumentation events.

An instrumentation event is encoded as a JSON[1] string. It supports all native JSON value types, i.e.

  • object
  • array
  • string
  • number
  • true
  • false
  • null

Every event must be an instance of a pre-defined event schema. It must include a reference to this event schema (see the $schema property, below). Every event must be addressed to an event stream, and must be an instance of that stream's allowed schema. If the event's schema does not match the event stream's allowed schema, or if the event is not a valid instance of that schema, the event will be rejected by the stream.

Events are constructed on a client (IPL), and transmitted to a server (EventGate). Construction begins with the originating instrument, which passes initial data to the IPL; the IPL filters, appends, and schedules the event for transmission over the network. Once the event is transmitted, EventGate may make minor alterations to the event before validating its stream and schema and moving it into the data backend. A full explanation can be found in the event lifecycle.

Automatic properties

Automatic properties have their values assigned automatically by either the IPL or EventGate. All automatic properties are defined in schema fragments. The rules for where and how the values of automatic properties are assigned are laid out in the event lifecycle, along with the operations by which values may change: assignment, replacement, and censorship.

Manual properties

Manual properties are not set or modified by any software in the instrumentation platform. They are the "essential data" of a schema: the properties that make it what it is. They are to be set exclusively by the originating instrument. They typically have low generalizability and vary the most between schemas.

EXAMPLE: Producing an instrumentation event using the IPL.

We call the IPL.submit() method with the name of our stream, 'exampleStream', and any manual properties.

IPL.submit( 'exampleStream', {
  $schema: '/instrumentation/example/1.0.0#',
  message: 'nice example!'
} );

If all goes well, the IPL will transmit the following event, which consists of the manual properties we passed in, along with a number of automatic properties.

{
  meta: {
    stream: 'exampleStream'
  },
  client_dt: '2020-01-05T13:30:03.33',
  pageview_id: '27d4918051a51b381234',
  session_id: '5a3138c61384d2910000',
  $schema: '/instrumentation/example/1.0.0#',
  message: 'nice example!'
}
Can I produce an instrumentation event without the IPL?
It is possible (but inconvenient) to create an instrumentation event without using the instrumentation platform library. This practice is not recommended except in specific cases that have been reviewed and approved by the maintainers. Instrumentation event metadata is used for statistical analysis, and the IPL has strict controls to ensure coherence of these values.
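For illustration only, a direct submission might look like the following sketch, assuming a browser environment; INTAKE_URL is a hypothetical placeholder (the real EventGate endpoint is deployment-specific), and properties the IPL would normally manage must be fabricated by hand, while identifiers such as pageview_id and session_id are simply absent.

// A minimal, non-authoritative sketch of direct event submission
// without the IPL. INTAKE_URL is a hypothetical placeholder.
var INTAKE_URL = 'https://intake.example.org/v1/events';

function submitWithoutIPL( streamName, eventData ) {
    // The sender must hand-construct everything the IPL would
    // normally manage, e.g. meta.stream and client_dt.
    eventData.meta = { stream: streamName };
    eventData.client_dt = new Date().toISOString();
    navigator.sendBeacon( INTAKE_URL, JSON.stringify( eventData ) );
}

submitWithoutIPL( 'exampleStream', {
    '$schema': '/instrumentation/example/1.0.0#',
    'message': 'nice example!'
} );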
Why do instrumentation events have to be observational?
This property allows instrumentation to be enabled or disabled at will, and ensures that regressions or interruptions in instrumentation do not degrade the actual software under instrumentation. This makes it safer for instrumentation to be managed independently, and allows systems which carry instrumentation events to be held to a lower service tier.
Caution
Manually assigning a value to an automatic property will result in undefined behavior.


Schema

An event schema.
title: analytics/clickMessageSchema
description: >
    Sends a message when the button gets clicked.
$id: /analytics/clickMessageSchema/1.0.0
$schema: https://json-schema.org/draft-07/schema#
type: object
allOf:
  - $ref: /fragment/analytics/identifiers/1.0.0#
  - $ref: /fragment/analytics/page/1.0.0#
  - properties:
      message:
        type: string 
        description: >
            Message from the click

A schema (also event schema) is a JSONSchema[2] file. Schema are written to define a certain kind of JSON object. Given a schema file, a JSONSchema validator can check whether a JSON object is an instance of that schema. This is called schema validation. Schema are what allow events, which are loosely-typed JSON strings, to map to datastores such as a database table, which is strongly typed. To avoid database table migrations, schema must follow a backwards compatibility convention.

Schema are authored in YAML[3] and automatically converted into JSONSchema by the jsonschema-tools npm package. Schema can reference other schema, called fragments, in order to make it easy to reuse commonly-defined properties. Converting from YAML to JSONSchema recursively resolves all references, resulting in a single JSONSchema file. This process is called materialization. Schema have a name, a version, and a URI. The URI is the schema's address in the public schema repositories, which can be browsed at schema.wikimedia.org.

Schema are stored in two repositories. The primary repository contains schema for events that are used by the application to drive behavior or application logic. Changing these schema requires higher access privileges due to the potential for affecting production behavior of software products. The secondary repository contains schema for instrumentation events. Because of the observational policy, regressions in these schema will degrade data collection but not application behavior, and therefore require a lower privilege level.

To create a schema, you must

  1. Clone the relevant repository
  2. Create a branch
  3. Create the YAML file
  4. Materialize it with jsonschema-tools
  5. Commit the changes
  6. Submit the patch for code review in Gerrit.

Unlike in the legacy system, event schema define structure and do not define routing. Multiple event streams may accept events of the same schema, and direct those events into different data backends.

Fragments

A schema fragment is a JSONSchema[2] file that can be included into another JSONSchema file. As the name implies, they are not intended to be used as standalone schema, but as "building blocks" for data scientists to choose what properties they need out of a standard menu. This standardizes commonly-used sets of properties across schema. By doing this, we make database field names more predictable, and allow privacy and performance engineers to analyze the properties and their interactions in advance, making approval of a schema simpler and faster for all parties involved.

Like other JSONSchema files, schema fragments have their own URI, and are included into a JSONSchema file using a $ref property. The reference is resolved when the jsonschema-tools program is run to materialize the schema. A fragment typically defines a set of properties. This set of properties will appear in every schema which includes the fragment. In the data acquisition system, the properties of schema fragments are typically automatic properties, that is, their values are assigned automatically by the IPL. Referencing fragments in a schema therefore does not create extra work for your instrument.

Because the schema fragments introduce event properties which are controlled by the IPL library, development of fragments is connected to the development of the IPL. The value of these properties is sometimes tricky to compute consistently, and certain platforms may not be able to support certain fields. This is noted where appropriate. In general, the system aims for wide support, and all diagnostically-relevant properties should eventually be covered on all platforms.

EXAMPLE: Authoring a schema using schema fragment references.

As an example, we can modify the clickMessageSchema from #example 1 to use a couple of fragments.

title: analytics/clickMessageSchema
description: >
    Sends a message when the button gets clicked.
$id: /analytics/clickMessageSchema/1.0.0
$schema: https://json-schema.org/draft-07/schema#
type: object
allOf:
  - $ref: /fragment/analytics/identifiers/1.0.0#
  - $ref: /fragment/analytics/page/1.0.0#
  - properties:
      message:
        type: string 
        description: >
            Message from the click
Table of schema fragments and IPL automatic value support, by platform.

Fragment             | $ref                                      | MediaWiki | Android | iOS | KaiOS
Identifiers          | /fragment/analytics/identifiers/1.0.0#    | No        | No      | No  | No
Activity sequencing  | /fragment/analytics/activity_seq/1.0.0#   | No        | No      | No  | No
User                 | /fragment/analytics/user/1.0.0#           | No        | No      | No  | No
Page                 | /fragment/analytics/page/1.0.0#           | No        | No      | No  | No
User Interface       | /fragment/analytics/ui/1.0.0#             | No        | No      | No  | No
A/B Testing          | /fragment/analytics/ab_testing/1.0.0#     | No        | No      | No  | No
Campaign attribution | /fragment/analytics/utm_parameters/1.0.0# | No        | No      | No  | No
Caution:
The definition of instrumentation schema fragments is controlled by this specification. To add or modify a fragment, you should contact the team to help you out.

Identifiers

Provides core identifiers that are used to associate events that are part of the same scope. All properties specified in this fragment are automatic properties, and reserved for the use of the instrumentation platform library. Within the library, they are serviced by the association controller.

pageview_id

Identifies a page view. Only available on web browsers or platforms with a concrete notion of pageview.

session_id

Identifies a session. On MediaWiki, a session lasts for the lifetime of the browser process (refer to T223931 for additional information). On iOS and Android apps, where the app is allowed to enter a background state, sessions expire after 15 minutes of inactivity. If the app returns to the foreground after 15 minutes, a new session ID is generated.

device_id

Identifies a device. Only available on app platforms with an "app install ID". Enables calculation of retention metrics for anonymous users since we do not have a user ID for those.
$ref: /fragment/analytics/identifiers/1.0.0#

Sequence

Provides identifiers for associating events that are part of the same activity group or funnel. All properties specified in this fragment are automatic properties, and reserved for the use of the instrumentation platform library. Within the library, they are serviced by the association controller.

activity_id

Identifies a sequence of actions in the same context or funnel. Useful for grouping together impressions with corresponding clicks, and for grouping together steps in a process such as making an edit. The activity identifier can be randomly generated or derived from a counter.

sequence_id

Starting at 1, this is a counter for reconstructing the order of events in the same activity. Rather than the timestamp of the event, this sequence_id can be used to establish the exact sequence of events.
$ref: /fragment/analytics/sequence/1.0.0#

Validation

Schema validation is performed by EventGate, the stream intake service. In order to pass through EventGate, an event must identify a schema (done with its required $schema property), the schema specified in $schema must match the schema supported by the stream identified in meta.stream, and finally the event must be classified as a valid instance of the schema by the schema validator. Events which do not pass validation are discarded immediately.

ALGORITHM: Schema validation

To intake an event, run the following algorithm:

  1. Verify the event is well-formed
    1. If event.meta.stream is NULL or undefined, return.
    2. If event.$schema is NULL or undefined, return.
  2. Verify that the stream and schema exist
    1. If there is no stream configuration for event.meta.stream, return.
    2. If event.$schema cannot be resolved into a JSONSchema file, return.
  3. Fetch the stream configuration and schema file
    1. Let streamConfig equal the stream configuration for event.meta.stream.
    2. Let schemaFile equal the file at event.$schema.
  4. Verify that the stream configuration and schema file are well-formed
    1. If streamConfig.schema_title does not exist, return.
    2. If schemaFile.title does not exist, return.
  5. Verify that the schemas match in name (not necessarily in version)
    1. If streamConfig.schema_title does not equal schemaFile.title, return.
  6. Return the (boolean) result of running the JSONSchema validation algorithm on event and schemaFile.
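Expressed in JavaScript, the algorithm might look like the sketch below. The helpers getStreamConfigFor, fetchSchemaFile, and validate are hypothetical stand-ins for EventGate's stream configuration lookup, schema resolution, and JSONSchema validation; they are not part of any published API.

// A non-authoritative sketch of the intake check above.
function intakeEvent( event ) {
    // 1. Verify the event is well-formed.
    if ( event.meta == null || event.meta.stream == null ) return false;
    if ( event.$schema == null ) return false;

    // 2-3. Verify that the stream and schema exist, and fetch them.
    var streamConfig = getStreamConfigFor( event.meta.stream );
    if ( streamConfig == null ) return false;
    var schemaFile = fetchSchemaFile( event.$schema );
    if ( schemaFile == null ) return false;

    // 4. Verify the stream configuration and schema file are well-formed.
    if ( !streamConfig.schema_title || !schemaFile.title ) return false;

    // 5. Titles must match by name; versions may differ.
    if ( streamConfig.schema_title !== schemaFile.title ) return false;

    // 6. Run JSONSchema validation.
    return validate( event, schemaFile );
}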


Streams

An event stream configuration.
'exampleStream': {
  name: 'exampleStream',
  accept: {
    schema: 'example',
    sample: {
      per: 'session',
      random: .01,
    },
  },
}

A stream (also event stream) is a named, globally-writable destination for events. Events addressed to a stream will flow to a common datastore defined by that stream. Events may be submitted to a stream from any instrument, instance, or platform. Inside a stream, events are ordered like a queue, using a first-in, first-out (FIFO) policy.

Streams are defined by a stream configuration file. This file defines the stream's name, which uniquely identifies it, an expected schema, which defines the type of events it will accept, and an additional set of rules for accepting or rejecting events. These rules are loaded by the IPL during its capability negotiation phase. When an event is submitted to the stream, the IPL will execute the accept/reject rules specified by the stream configuration in situ.

Stream configuration is a cornerstone of releaseless iteration. The stream, and its configuration, provides global control over all instruments and platform clients that seek to produce events into a data set. Changes made to stream configuration can go live the same day, and require no application software deployment.

Stream configuration

Stream configuration is a collection of properties that define a stream. At a minimum, a stream configuration must specify a name, which uniquely identifies the stream, and a schema, which identifies the one type of schema that the stream will accept. In addition to these basic requirements, stream configuration supports a rich set of properties to control the flow of events to the stream.

Its canonical format is a JSON[1] string. There is no single source for stream configuration, but there are a number of interfaces available for products to use with their custom fetch algorithms.

Accept

The accept section contains rules that must be met in order for an event to be accepted for processing. Being accepted means that the event will be produced to the output buffer. The accept check is computed on the client, inside the IPL. The computation takes place at the sampling step of the event lifecycle. The check computes the logical AND of the accept conditions. If the result is true, then the accept check succeeds. Otherwise, it fails. If the accept check fails, the event is discarded. The accept conditions are enforced by the IPL, and will not be computed or honored if the event is submitted directly.
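As a sketch of the logical-AND semantics only (not the normative rule set), the check can be pictured as below; evaluators is a hypothetical map from accept-rule names (e.g. 'schema', 'sample') to boolean condition functions, and is not part of any published API.

// A sketch of the accept check: the logical AND of all accept
// conditions. acceptRules comes from the stream configuration's
// accept section; each evaluator returns a boolean.
function acceptCheck( acceptRules, eventData, evaluators ) {
    return Object.keys( acceptRules ).every( function ( ruleName ) {
        return evaluators[ ruleName ]( acceptRules[ ruleName ], eventData );
    } );
}

If acceptCheck returns false at the sampling step, the event is discarded.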

Copy

Most streams will have events addressed to them by an instrument.

Example: Copying events from one stream to another.

The instrument code might resemble something like:

IPL.submit( "exampleStream", { 
  '$schema': '/instrumentation/example/1.0.0#',
  'message':'Hello, world'
} );

Suppose this instrument runs on a lot of projects, so it is sampled quite severely. Perhaps it looks something like

'exampleStream': {
  name: 'exampleStream',
  accept: {
    schema: 'example',
    sample: {
      per: 'session',
      random: .01,
    },
  },
}

If a team wanted to take a closer look at the events from exampleStream, they can do so without creating a new instrument. All they need to do is create a new stream, and set it to copy 'exampleStream'. This means that whenever the IPL receives event data addressed to 'exampleStream', it will also process a copy for 'exampleStream2'.

'exampleStream2': {
  name: 'exampleStream2',
  copy: 'exampleStream',
  accept: {
    schema: 'example',
    domain: 'hw.wikipedia.org',
    sample: {
      per: 'session',
      random: 1,
    },
  },
}
Note that 'exampleStream2' is not a child of 'exampleStream'. They are not in a cascading relationship. It is not like a Unix pipeline. The events in 'exampleStream2' are not necessarily a subset of the events in 'exampleStream'. The events are processed independently, and each stream is given a chance to run through its accept and reject logic. They may both accept the event, both reject the event, or one may accept and the other reject.


Event lifecycle

The instrumentation event pipeline defines a sequence of processing stages that an instrumentation event will pass through in its journey from its originating instrument to its ultimate datastore. As it progresses from stage to stage, the event may be discarded or duplicated, and its automatic properties will be changed in various ways by different software handlers.

Instrumentation event lifecycle, stage by stage. Time flows down.

Submission (client): The event data is all that the instrument has provided.

{
    '$schema': 'example.0.1',
    'message': 'hello!'
}

Config check (client): Events stop here for misconfigured clients.

CC-ing (client): The event is copied to additional streams, as specified.

Sampling (client): The event is assessed to be either in- or out-sample. Out-sample events stop here.

Production (client): The event data has been decorated by the instrumentation platform library into a full event.

{
    'meta': {
        'stream': 'example'
    },
    'client_dt': '2020-01-05T13:30:03.33',
    'pageview_id': '27d4918051a51b381234',
    'session_id': '5a3138c61384d2910000',
    '$schema': 'example.0.1',
    'message': 'hello!'
}

Transmission (client): Events are sent to the network. Events that can't be sent stop here.

Transit (network): Events are sent across the internet. Events lost in transit stop here.

Receipt (server): Events are received by the server. Events rejected by the server stop here.

Intake (server): The event is decorated with information gathered at the server intake point.

{
    'http': {
        'method': 'POST',
        'status_code': 200,
        'client_ip': '10.3.3.101',
        'has_cookies': true,
        'request_headers': 'user-agent: Mozilla/5.0',
        'response_headers': ''
    },
    'meta': {
        'stream': 'example'
    },
    'client_dt': '2020-01-05T13:30:03.33',
    'pageview_id': '27d4918051a51b381234',
    'session_id': '5a3138c61384d2910000',
    '$schema': 'example.0.1',
    'message': 'hello!'
}

Validation (server): The event is validated against its schema. Invalid events stop here.

Processing (server): Identifying information is removed, and transforms are applied to bucket or categorize variables.

{
    'geolocation': 'tasmania',
    'user_agent': 'Mozilla/5.0',
    'meta': {
        'stream': 'example'
    },
    'client_dt': '2020-01-05T13:30:03.33',
    'pageview_id': '27d4918051a51b381234',
    'session_id': '5a3138c61384d2910000',
    '$schema': 'example.0.1',
    'message': 'hello!'
}

Event rejection

Event rejection refers to the deliberate discarding of events (as opposed to event loss due to error). Event rejection is the major controlling factor in the event pipeline. Imagined as a flow of events, the pipeline resembles a rather harsh funnel: the vast majority of events are rejected at some stage.

An event will be rejected by the IPL (the client) if:

  1. The configuration check algorithm returns false.
  2. The sampling check algorithm returns false (the event is out-sample).

An event will be rejected by EventGate (the server) if:

  1. The schema validation algorithm returns false.

Event duplication

An event bound for one stream can be copied to other streams in a prescribed fashion. This makes it convenient to perform multiple simultaneous studies using the same instrument for data collection, allowing an entire study to be implemented without altering application code.

Event loss due to error

Event loss due to error (as opposed to rejection), refers to the loss of events due to factors beyond the control of the event pipeline. It can occur due to error at any point of the pipeline, but in practice is concentrated in the network interface between the client and the server.

The following conditions are classified as event loss due to error:

  • Network connectivity loss
  • Network route failure (firewall, packet loss, proxy, etc.)
  • Intake server timeout (handling too many concurrent requests)
  • Intake server misconfiguration
  • Client malformed HTTP POST request
  • Client malformed JSON object in POST body
  • Failure to fetch schema for validation
  • Failure to compute validation
  • Failure to perform property value replacement

Event gain due to error

Event gain due to error (as opposed to event duplication) refers to the duplication of events due to factors beyond the control of the event pipeline. It is only known to occur during transmission of an event to the network, and only in some legacy browsers.

The following conditions are classified as event gain due to error:

  • Duplicate events sent as a result of browser error

Automatic property value assignment

Automatic property value assignment refers to the assignment of values to automatic properties by the IPL and (less frequently) the EventGate. Assignment of automatic property values must be completed prior to schema validation. After schema validation, property values are treated as immutable by policy. This is because validation is the last reliable time to verify that insertion into the destination storage will be successful.

Automatic property value replacement

Automatic property value replacement refers to the replacement of an automatic property's existing value with a different value. The replacement value is often computed from the original value. Replacement is most commonly done as a form of cardinality reduction. Cardinality reduction is a catch-all term for any procedure that maps values from a large set to values from a smaller (typically much smaller) set. A common example of cardinality reduction in web analytics is mapping the (large) set of user agent strings to the (much smaller) set of browsers.

If a property value will be replaced, the replacement must happen prior to schema validation. To help the instrumentation writer, the property should be flagged as 'expected to be replaced' or 'dynamic' or 'to-be-reduced' somehow (TODO), and the schema must provide sufficient criteria so that schema validation fails if the reduction has not been successful. In other words, if the motivation for replacement is cardinality reduction, mapping members of a large set to members of a smaller set, then the schema must be written so that the only valid values for the property in question are members of the small set.
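A sketch of what such a constraint might look like, shown as a materialized JSONSchema property expressed as a JavaScript object; the property name browser_family and the enum values are illustrative only, not part of any published fragment.

// Illustrative only: the raw user agent string is expected to be
// replaced with a browser family before validation. The enum makes
// validation fail if the reduction did not happen.
var examplePropertySketch = {
    properties: {
        browser_family: {
            type: 'string',
            description: 'Reduced from the raw user agent string.',
            enum: [ 'chrome', 'firefox', 'safari', 'edge', 'other' ]
        }
    }
};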

Automatic property value censorship

Automatic property value censorship refers to the intentional destruction of a property and its value, typically for the purposes of privacy. Some property values may contain sensitive information (or information that could become sensitive if stored with other values). This need not only refer to PII as legally defined. In general, care should be taken collecting property values that will need to be censored. If possible, this information should be handled on the server.

Instrumentation platform library (IPL)

The instrumentation platform library (IPL) (name TBD) is a multiplatform family of library software packages implementing the common instrumentation interface (CII) (name TBD). Instruments which use the IPL are in conformance with the Wikimedia Instrumentation Standard (WIS) (name TBD). The IPL ensures statistical #coherence across platforms by providing:

  • Automatic, portable assignment of the managed event property values defined in standard schema fragments.
  • Identical algorithms for randomness, sampling, targeting, event rejection, event gain, error handling, etc.
  • Identical response to event stream configuration directives.
  • Identical transmission behavior.

It also manages certain hard problems related to instrumentation such as

  • Battery life preservation
  • Loss of network connectivity
  • Overload of events

Library organization

The library is constructed in modules, called controllers. Each controller is driven by #instrumentation event stream configuration controls which are dynamically loaded at runtime, allowing the instrumentation to be changed remotely without additional software deployment.

Stream controller
Responsible for loading the #event stream configuration controls, and providing them to the library at runtime.
Sampling controller
Responsible for sampling-based event rejection. Uses the information from the stream controller to determine whether an event or a stream is in- or out- sample.
Automatic property controllers
Each of these controllers manages a set of properties defined in a schema fragment with corresponding name.
Identifiers
Controls the assignment of random and unique tokens that associate events belonging to a common #scope (consider scope controller?).
Activity
User
Page


Library core algorithms

Submit

Submit is the main algorithm of the library and is the only driver of runtime behavior post-initialization. Its steps are carried out in a particular order.

Algorithm: Submit

To submit an instrumentation event carrying data eventData to event stream streamName, the submit method must run these steps:

  1. Verify the event stream streamName is recognized
    1. Let config be the result of running getStreamConfig on streamName.
    2. If config is NULL, return void.
      The StreamController could not recognize streamName.
      We assume the client is misconfigured, i.e., the instrumentor has failed to create an event stream configuration, make an event stream configuration available for loading, or has specified the wrong streamName. Rather than produce potentially inconsistent data, the event submission does not proceed.
  2. Verify eventData is well-formed
    1. If eventData is NULL, undefined (implementation-dependent), or empty, return.
      Every event must define at least its #schema reference, therefore an empty eventData can never be valid.
    2. If eventData.$schema is NULL, undefined, or the empty string, return.
      Every event must define its #schema reference. See #why does the instrument need to say the schema?.
  3. Assign the event's #stream name.
    1. Let eventData.meta.stream be streamName.
  4. Assign the event's #submission time. Jump to the first appropriate substep:
    1. TODO: Actually doesn't this need to be done for ALL of the properties that are set by the library, not just this one? Yes, I think this is the case.
      If eventData.client_dt is not set
      1. Let eventData.client_dt be the current #client datetime to millisecond resolution, formatted as an #ISO-8601 datetime string.
      If eventData.client_dt is set
      1. Do nothing. Assume that this is a #Stream CC event, which should have the same timestamp as its original.
  5. Dispatch copies of eventData to any designated copy target streams
    1. For each copyRecipientStreamName in config.copyToStreamNames[ streamName ], run the following steps:
      1. Let copyOfEventData be a copy of eventData.
      2. Run submit( copyRecipientStreamName, copyOfEventData ).
  6. Schedule the event for transmission
    1. If streamIntakeServiceURI is not defined, return.
    2. Let eventDataSerialized be the result of running a JSON string serialization algorithm on eventData.
    3. Run #schedule event transmission on streamIntakeServiceURI and eventDataSerialized.
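Condensed into JavaScript, the steps above might look like the following non-authoritative sketch; streamController, networkController, and streamIntakeServiceURI are assumed to be provided by the library's other components.

// A condensed sketch of submit(), following the steps above.
function submit( streamName, eventData ) {
    var config = streamController.getStreamConfig( streamName );
    if ( config === null ) {
        return; // unrecognized stream: assume misconfiguration, do not proceed
    }
    if ( !eventData || !eventData.$schema ) {
        return; // malformed event data: a $schema reference is mandatory
    }

    eventData.meta = { stream: streamName };

    if ( !eventData.client_dt ) {
        // Fresh events are stamped now; stream CC copies keep their
        // original timestamp.
        eventData.client_dt = new Date().toISOString();
    }

    // Dispatch copies to any designated copy target streams.
    ( streamController.getCopyTargets( streamName ) || [] ).forEach( function ( target ) {
        submit( target, JSON.parse( JSON.stringify( eventData ) ) );
    } );

    if ( typeof streamIntakeServiceURI === 'undefined' ) {
        return;
    }
    networkController.schedule( JSON.stringify( eventData ), streamIntakeServiceURI );
}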

Configuration check algorithm

Configuration conflict is a form of involuntary sampling. Clients which do not meet certain requirements are not able to produce events to certain streams. This is not classified as loss due to error, since it is defined behavior, but it does not reflect the specification of the instrument developer.

Algorithm: Configuration check

Given an event stream streamName, the client's configuration check fails if any of the following hold:

  1. The event stream was not provided.
  2. The client's clock, randomness, or another compatibility issue has caused the library to disable itself.
  3. The client's submission rate has climbed so high that the library has disabled itself.

Out-of-sample algorithm

Sampling is interpreted broadly to mean any determination made about whether or not to send an event on a particular stream on a particular client. Fine-grained control of sampling and targeting is performed by the IPL via the stream configuration. The IPL itself can also place the entire client out-sample, independent of any particular stream.

Algorithm: Sampling check

Given an event stream streamName, the event is out-sample if any of the following hold:

  1. The stream is disabled.
  2. The user has opted out.
  3. The user is not in sample (according to sampling logic).
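One plausible deterministic implementation of the per-session case is sketched below: the session token is mapped to a fraction in [0, 1) and compared against the configured rate, so that a given session is consistently in- or out-sample for a stream. The hashing scheme shown is illustrative, not the normative algorithm.

// A sketch of a deterministic in-sample check for
// { per: 'session', random: rate } sampling. Illustrative only.
function inSample( sessionId, rate ) {
    // Interpret the first 8 hex digits of the session token as a
    // fraction of the 32-bit range.
    var fraction = parseInt( sessionId.slice( 0, 8 ), 16 ) / 0x100000000;
    return fraction < rate;
}

// e.g. inSample( '5a3138c61384d2910000', 0.01 )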

Network controller

Schedule

Algorithm: Schedule

To schedule an instrumentation event eventDataSerialized to a destination streamIntakeServiceURI, the network controller must run these steps:

  1. Enqueue eventDataSerialized and streamIntakeServiceURI
  2. When the queue wakes, run the #send algorithm on each item in the queue

Send

Algorithm: Send

To send an instrumentation event eventDataSerialized to a destination streamIntakeServiceURI, the network controller must run these steps:

  1. Create an HTTP POST request with URL streamIntakeServiceURI and request body eventDataSerialized
  2. Issue the HTTP POST request asynchronously and do not wait for or handle response or timeout ("fire-and-forget")
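In a browser integration, fire-and-forget transmission might be realized as in this sketch; navigator.sendBeacon and fetch with keepalive are standard browser APIs, and other platforms would substitute their own HTTP POST request manager.

// A sketch of fire-and-forget transmission for a browser client.
function send( eventDataSerialized, streamIntakeServiceURI ) {
    if ( navigator.sendBeacon ) {
        // Queues the POST; the browser sends it without blocking
        // and without exposing the response.
        navigator.sendBeacon( streamIntakeServiceURI, eventDataSerialized );
    } else {
        fetch( streamIntakeServiceURI, {
            method: 'POST',
            body: eventDataSerialized,
            keepalive: true
        } );
    }
}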

Stream controller

The stream controller is responsible for making the #event stream configuration available for use in the library. The library uses the event stream configuration to drive various aspects of event submission.

type streamConfigItem;

interface StreamController {
  private readonly dict streamConfig;
  public streamConfigItem getStreamConfig( string streamName );
  public [string]? getCopyTargets( string streamName );
  constructor( streamConfigItem fetchedStreamConfig );
}

Constructor

Algorithm: Construct stream configuration

The constructor loads fetched stream configuration into a read-only property. After the configuration has been fetched as fetchedStreamConfig, using an #event stream configuration fetch procedure, run the following steps:

  1. Initialize streamConfig to the empty dict
  2. For each key streamName in fetchedStreamConfig, let StreamController.streamConfig[ streamName ] be fetchedStreamConfig[ streamName ].
    Note that the creation of stream CC routes has been delegated to the #stream configuration service, so we don't need to have extra work here constructing them.

getStreamConfig

Algorithm: getStreamConfig

To retrieve a stream configuration for stream streamName, run the following steps:

  1. If a key matching streamName exists in StreamController.streamConfig, and StreamController.streamConfig[ streamName ] has type dict, return StreamController.streamConfig[ streamName ].
    TODO, how to verify it's a real stream config? Do it in submit() then.
    The value under the key may be an empty dict, but NULL or undefined values are not acceptable.
  2. Return NULL
    The stream streamName is not recognized.

getCopyTargets

Algorithm: getCopyTargets

To retrieve the copy targets for stream streamName, run the following steps:

  1. If a key matching streamName exists in StreamController.streamConfig, and StreamController.streamConfig[ streamName ] has type dict, and StreamController.streamConfig[ streamName ].copyTargets exists and has type array of string, return StreamController.streamConfig[ streamName ].copyTargets.
    The value may be an empty array, but NULL or undefined values are not acceptable.
  2. Return NULL.
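A compact JavaScript sketch of the stream controller, following the interface and the two algorithms above; it is an approximation, not the reference implementation.

// A sketch of the stream controller.
function StreamController( fetchedStreamConfig ) {
    // Load fetched stream configuration into an internal dict.
    this.streamConfig = {};
    for ( var streamName in fetchedStreamConfig ) {
        this.streamConfig[ streamName ] = fetchedStreamConfig[ streamName ];
    }
}

// Returns the configuration dict for streamName, or NULL if the
// stream is not recognized. An empty dict is acceptable.
StreamController.prototype.getStreamConfig = function ( streamName ) {
    var value = this.streamConfig[ streamName ];
    return ( value && typeof value === 'object' ) ? value : null;
};

// Returns the array of copy target stream names, or NULL.
StreamController.prototype.getCopyTargets = function ( streamName ) {
    var value = this.streamConfig[ streamName ];
    if ( value && Array.isArray( value.copyTargets ) ) {
        return value.copyTargets;
    }
    return null;
};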

Association controller

type tokenString

interface AssociationController {
	tokenString PAGEVIEW_ID 
	tokenString SESSION_ID
	[tokenString] ACTIVITY_TABLE
	integer ACTIVITY_COUNT
	const string activityTableStorageKey
	const string activityCountStorageKey
	const string sessionIdStorageKey
}

The association controller has a PAGEVIEW_ID, initially NULL, which represents the pageview token for the current page.

The association controller has a SESSION_ID, initially NULL, which represents the session token for the current session.

The association controller has an ACTIVITY_TABLE, initially NULL, which holds activity tokens for each stream.

The association controller has an ACTIVITY_COUNT, initially NULL, which holds a monotonically increasing integer counter, incremented for each new activity started.

pageview_id

Algorithm: pageview_id

To retrieve the current pageview id, run the following steps:

  1. If PAGEVIEW_ID is NULL
    1. Let newPageviewId be the result of running #generate random id.
    2. Let PAGEVIEW_ID equal newPageviewId
  2. Return PAGEVIEW_ID

session_id

Algorithm: session_id

To retrieve the current session id, run the following steps:

  1. If SESSION_ID is NULL, run the following steps
    1. Let sessionIdPersisted be the result of running #get persisted value on the sessionIdStorageKey.
    2. If sessionIdPersisted is not NULL, let SESSION_ID equal sessionIdPersisted, otherwise run the following steps
      1. Let sessionIdGenerated be the result of running #generate random id.
      2. Run #set persisted value on the sessionIdStorageKey and sessionIdGenerated.
      3. Let SESSION_ID equal sessionIdGenerated.
  2. Return SESSION_ID.
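The two algorithms above might be realized in JavaScript as in this sketch. get_persisted_value and set_persisted_value are the integration functions defined later in this document; generate_random_id stands in for the #generate random id algorithm, and the storage key value is illustrative.

// A sketch of pageview and session token management.
var PAGEVIEW_ID = null;
var SESSION_ID = null;
var sessionIdStorageKey = 'sessionId'; // illustrative key name

function pageview_id() {
    if ( PAGEVIEW_ID === null ) {
        // Pageview tokens are never persisted.
        PAGEVIEW_ID = generate_random_id();
    }
    return PAGEVIEW_ID;
}

function session_id() {
    if ( SESSION_ID === null ) {
        var persisted = get_persisted_value( sessionIdStorageKey );
        if ( persisted !== null ) {
            SESSION_ID = persisted;
        } else {
            SESSION_ID = generate_random_id();
            set_persisted_value( sessionIdStorageKey, SESSION_ID );
        }
    }
    return SESSION_ID;
}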

activity_id

Algorithm: activity_id

To retrieve the current activity id for stream streamName and activity activityName, run the following steps:

  1. If ACTIVITY_COUNT or ACTIVITY_TABLE is NULL, run the following steps:
    1. Let ACTIVITY_COUNT be the result of running #get persisted value on the activityCountStorageKey
    2. Let ACTIVITY_TABLE be the result of running #get persisted value on the activityTableStorageKey
    3. If ACTIVITY_COUNT or ACTIVITY_TABLE are NULL, run the following steps:
      1. Let ACTIVITY_COUNT equal 1
      2. Let ACTIVITY_TABLE equal the empty object
      3. Run #set persisted value on activityCountStorageKey and ACTIVITY_COUNT
      4. Run #set persisted value on activityTableStorageKey and ACTIVITY_TABLE
  2. If streamName is NULL, undefined, or the empty string, return.
  3. If ACTIVITY_TABLE[ streamName ] is not set, run the following steps:
    1. Let ACTIVITY_TABLE[ streamName ] equal ACTIVITY_COUNT + 1.
    2. Let ACTIVITY_COUNT equal ACTIVITY_COUNT + 1.
    3. Run #set persisted value on activityCountStorageKey and ACTIVITY_COUNT.
    4. Run #set persisted value on activityTableStorageKey and ACTIVITY_TABLE.
  4. Let currentCount equal ACTIVITY_TABLE[ streamName ].
  5. Return activityName+(currentCount+0x10000).toString(16).slice(1).
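The return value in the final step concatenates the activity name with a fixed-width, four-digit hexadecimal rendering of the count, as this sketch shows (the activity name is illustrative):

// The 0x10000 offset pads the count to four hex digits; slice(1)
// drops the leading '1'.
function formatActivityId( activityName, currentCount ) {
    return activityName + ( currentCount + 0x10000 ).toString( 16 ).slice( 1 );
}

formatActivityId( 'edit_attempt_', 1 );   // 'edit_attempt_0001'
formatActivityId( 'edit_attempt_', 255 ); // 'edit_attempt_00ff'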

begin_new_session

Algorithm: begin_new_session

To begin a new session, run the following steps:

  1. Let PAGEVIEW_ID equal NULL
    Pageviews are nested in sessions; a change of session necessitates a change of pageview.
    Pageviews are not persisted, so they do not need to be removed from persistent storage.
  2. Let SESSION_ID equal NULL
  3. Run #delete persisted value on sessionIdStorageKey.
  4. Let ACTIVITY_TABLE equal NULL
  5. Let ACTIVITY_COUNT equal NULL
  6. Run #delete persisted value on activityTableStorageKey.
  7. Run #delete persisted value on activityCountStorageKey.

begin_new_activity

Algorithm: begin_new_activity

To begin a new activity for stream streamName, run the following steps:

  1. Run activity_id().
    This ensures ACTIVITY_TABLE and ACTIVITY_COUNT are loaded from the persistent store, or generated.
  2. If ACTIVITY_TABLE[ streamName ] is set, run the following steps:
    1. Unset ACTIVITY_TABLE[ streamName ], i.e., delete the key streamName and its value from ACTIVITY_TABLE.
    2. Run #set persisted value on activityTableStorageKey and ACTIVITY_TABLE.

Sampling controller

in_sample

in_sample( streamName, ??? )

.

Network controller

TODO: write out logic of the output queue and when it is flushed, with illustrations from state diagrams of each platform.

schedule

send

send_all_scheduled

Library Integration Functions

The core algorithms make use of a number of platform-specific functions that are defined in the integration layer. These functions and their contracts are outlined below.

get_persisted_value

get_persisted_value( key )

.

set_persisted_value

set_persisted_value( key, value )

.

del_persisted_value

del_persisted_value( key )

.

serializeJSON

unserializeJSON

HTTP POST request manager

Visibility state change

PRNG

Time

Input buffer

Output buffer

Fetch

void fetch( void )

Fetching of stream configuration is a platform-dependent process that may be tailored to the needs of the platform and its requirements. Stream configuration data may be injected as configuration (e.g. via ResourceLoader on MediaWiki), fetched via one or more HTTP requests to the #Stream configuration service API (e.g. on the apps), or locally provisioned (e.g. during development or testing).

Stream configuration fetch can consist of any number of steps, and should not be confused with #Load stream configuration, which can be run once and only once per application runtime.
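For a browser integration, one possible fetch procedure is sketched below; the endpoint URL is a hypothetical placeholder, and the hand-off to a StreamController mirrors the sketch in the stream controller section.

// A sketch of one possible stream configuration fetch.
// CONFIG_URL is a hypothetical placeholder, not a real endpoint.
var CONFIG_URL = 'https://example.org/v1/stream-config';

function fetch_stream_config() {
    return fetch( CONFIG_URL )
        .then( function ( response ) { return response.json(); } )
        .then( function ( fetchedStreamConfig ) {
            // Hand off to #Load stream configuration, which runs
            // once and only once per application runtime.
            return new StreamController( fetchedStreamConfig );
        } );
}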

References

  1. ↑ ECMA-404: The JSON Data Interchange Syntax
  2. ↑ JSON Schema specification
  3. ↑ YAML Ain't Markup Language, Version 1.2, 3rd ed.