Wikimedia Product/Analytics Infrastructure/Standard

The data acquisition system (actual name TBD) is a set of standardized APIs for designing, implementing, and maintaining software instrumentation and controlled experiments. It allows data scientists and engineers to create high-quality data sets that are continually updated with new data from software running across millions of devices in production. Using the data acquisition system, instrumentation can be centrally coordinated and iterated without the need for a software release. Its practices promote software re-use, and make instrumentation more adaptable, rigorous, and safe.

The data acquisition system is a middleware layer. It provides native device environments (web browsers, mobile apps, etc.) with software libraries that extend the capabilities of the Wikimedia Event Platform. The libraries perform operations like randomness generation, timestamp generation, statistical sampling, and more, to create complex events that are specialized for data science and instrumentation. The libraries follow a single cross-platform standard that ensures statistical coherence and portability of events. Relying on the library makes instrumentation code easier to write, easier to debug, and less prone to statistical bias.

This document serves as the data acquisition system living standard (DAS-LS). It describes the event system's three principal resource types (event, event schema, and event stream), the event lifecycle, and the client library's algorithms and API. It should be updated regularly in accordance with software changes. It is maintained by the Product Data Engineering team.

System Description
The data acquisition system is used to create data sets with events streamed from live software running in production. In addition to writing instrumentation code, developers can design their own instrumentation events using an event schema, collect events from one or more instruments into different event streams, and manage their instruments remotely using an event stream configuration.

Creating a new data set is organized as a project called a study. Running a study involves a number of stakeholders, described in this DACI. Data scientists and product owners specify the study question and the data needed to answer it, and work with engineers to produce an instrumentation plan, which describes how the data will be collected on the particular product(s). The instrumentation plan undergoes reviews for performance and privacy. Once the instrumentation plan is approved, implementation of the instrument can proceed.

In the data acquisition system, an instrument is a purpose-built unit of application code that exists to detect a specific situation of interest, such as an error or a button click. When detection occurs, the instrument produces a set of data properties that form the core of what will become the instrumentation event. It passes these data properties to the instrumentation platform library, along with the name of the event stream to which it wants the event sent.

The instrumentation platform library performs various checks on behalf of the stream, including client configuration and statistical sampling. If the submitted event passes these checks, the library will assign it additional properties and values, constructing what will become the final event. The final event is then scheduled for transmission over the network. If network transmission succeeds, the event is received by EventGate, the stream intake service of the event platform. EventGate will run additional checks, and verify the event's schema. If the event passes these checks, it will be inserted into a distributed queue, and be loaded into a database table or other storage backend within a few hours.
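To make this flow concrete, the sketch below shows roughly what an instrument's side of the exchange can look like. It is written in TypeScript for illustration; the module name, the submit() entry point, and the stream name are assumptions, not the library's actual API.

<syntaxhighlight lang="typescript">
// Hypothetical instrument code. The submit() entry point and the stream
// name are illustrative assumptions, not the actual IPL API.
import { submit } from 'instrumentation-platform-library'; // assumed module name

function onSaveButtonClick(buttonId: string): void {
  // The instrument supplies only the manual data properties and the name
  // of the event stream; the library performs the stream's checks and
  // constructs the final event.
  submit('product.button_click', {
    action: 'click',
    element_id: buttonId,
  });
}
</syntaxhighlight>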

Events
An instrumentation event (sometimes abbreviated iev or simply event) is an event platform event that is specialized for instrumentation and data analysis. These events follow enhanced standards for portability and statistical coherence, and have a number of features aimed at making them easier to author and deploy safely, including a catalog of pre-defined schema fragments, automatically managed properties, and a strict observational policy. All events produced by the instrumentation platform library are instrumentation events.

An instrumentation event is encoded as a JSON string. It supports all native JSON value types:
 * object
 * array
 * string
 * number
 * boolean (true, false)
 * null

Every event must be an instance of a pre-defined event schema, and must include a reference to this event schema (see the $schema property, below). Every event must be addressed to an event stream, and must be an instance of that stream's allowed schema. If the event's schema does not match the event stream's allowed schema, or if the event is not a valid instance of that schema, the event will be rejected by the stream.

Events are constructed on a client (IPL) and transmitted to a server (EventGate). Construction begins with the originating instrument, which passes initial data to the IPL; the IPL filters the event, appends properties to it, and schedules it for transmission over the network. Once the event is transmitted, EventGate may make minor alterations to it before validating its stream and schema and moving it into the data backend. A full explanation can be found in the event lifecycle.
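Purely as an illustration of shape, the object below sketches what a constructed instrumentation event can look like just before transmission. The $schema reference and the stream address follow the Event Platform convention; the remaining field names are assumptions made for this example.

<syntaxhighlight lang="typescript">
// Illustrative final event, shown as the object the client would serialize.
// Field names other than $schema and meta.stream are assumptions.
const exampleEvent = {
  $schema: '/analytics/button_click/1.0.0', // reference to the event schema
  meta: {
    stream: 'product.button_click',         // event stream the event is addressed to
  },
  dt: '2021-03-15T18:25:43.000Z',           // client-side timestamp (automatic property)
  // Manual properties supplied by the originating instrument:
  action: 'click',
  element_id: 'save-button',
};
</syntaxhighlight>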

Automatic properties
Automatic properties have their values assigned automatically by either IPL or EventGate. All automatic properties are defined in schema fragments. The rules for where and how the values of automatic properties are assigned are laid out in the event lifecycle. Values are changed according to one of three operations:
 * assignment
 * replacement
 * censorship

Manual properties
Manual properties are not set or modified by any software in the instrumentation platform. They are the "essential data" of a schema: the properties that make it what it is. They are set exclusively by the originating instrument. They typically have low generalizability and vary the most between schemas.

Schema
A schema (also event schema) is a JSONSchema file. Schema are written to define a certain kind of JSON object. Given a schema file, a JSONSchema validator can check whether a JSON object is an instance of that schema. This is called schema validation. Schema are what allow events, which are loosely-typed JSON strings, to map to datastores such as a database table, which is strongly typed. To avoid database table migrations, schema must follow a backwards compatibility convention.

Schema are authored in YAML and automatically converted into JSONSchema by the jsonschema-tools npm package. Schema can reference other schema, called fragments, in order to make it easy to reuse commonly-defined properties. Converting from YAML to JSONSchema recursively resolves all references, resulting in a single JSONSchema file. This process is called materialization. Schema have a name, a version, and a URI. The URI is the schema's address in the public schema repositories, which can be browsed at schema.wikimedia.org.

Schema are stored in two repositories. The primary repository contains schema for events that are used by the application to drive behavior or application logic. Changing these schema requires higher access privileges due to the potential for affecting production behavior of software products. The secondary repository contains schema for instrumentation events. Because of the observational policy, regressions in these schema will degrade data collection but not application behavior, and therefore require a lower privilege level.

To create a schema, you must
 * 1) Clone the relevant repository
 * 2) Create a branch
 * 3) Create the YAML file
 * 4) Materialize it with jsonschema-tools
 * 5) Commit the changes
 * 6) Submit the patch for code review in Gerrit.

Unlike in the legacy system, event schema define structure and do not define routing. Multiple event streams may accept events of the same schema, and direct those events into different data backends.

Fragments
A schema fragment is a JSONSchema file that can be included into another JSONSchema file. As the name implies, fragments are not intended to be used as standalone schema, but as "building blocks" from which data scientists choose the properties they need out of a standard menu. This standardizes commonly-used sets of properties across schema. By doing this, we make database field names more predictable, and allow privacy and performance engineers to analyze the properties and their interactions in advance, making approval of a schema simpler and faster for all parties involved.

Like other JSONSchema files, schema fragments have their own URI, and are included into a JSONSchema file using a $ref property. The reference is resolved when jsonschema-tools is run to materialize the schema. A fragment typically defines a set of properties, and this set of properties will appear in every schema which includes the fragment. In the data acquisition system, the properties of schema fragments are typically automatic properties, that is, their values are assigned automatically by the IPL. Referencing fragments in a schema therefore does not require your instrument to do more work.
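As a sketch of the mechanism, the unmaterialized schema below (shown as a plain object rather than YAML) pulls in a hypothetical identifiers fragment with $ref. The fragment URI and property names are assumptions for illustration.

<syntaxhighlight lang="typescript">
// Sketch of a schema that includes a fragment via $ref. At materialization
// time the reference is resolved and the fragment's properties are copied
// into the resulting JSONSchema file.
const buttonClickSchema = {
  title: 'analytics/button_click',
  $id: '/analytics/button_click/1.0.0',
  type: 'object',
  allOf: [
    // Hypothetical identifiers fragment; its (automatic) properties appear
    // in every schema that references it.
    { $ref: '/fragment/analytics/identifiers/1.0.0#' },
  ],
  properties: {
    action: { type: 'string' },
    element_id: { type: 'string' },
  },
  required: ['action'],
};
</syntaxhighlight>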

Because the schema fragments introduce event properties which are controlled by the IPL library, development of fragments is connected to the development of the IPL. The value of these properties is sometimes tricky to compute consistently, and certain platforms may not be able to support certain fields. This is noted where appropriate. In general, the system aims for wide support, and all diagnostically-relevant properties should eventually be covered on all platforms.

Identifiers
Provides core identifiers that are used to associate events that are part of the same scope. All properties specified in this fragment are automatic properties, and reserved for the use of the instrumentation platform library. Within the library, they are serviced by the association controller.


 * Identifies a page view. Only available on web browsers or platforms with a concrete notion of pageview.


 * Identifies a session. On MediaWiki, a session lasts for the lifetime of the browser process (refer to T223931 for additional information). On the iOS and Android apps, where the app is allowed to enter a background state, sessions expire after 15 minutes of inactivity: if the app returns to the foreground after more than 15 minutes in the background, a new session ID is generated (see the sketch at the end of this list).


 * Identifies a device. Only available on app platforms with an "app install ID". Enables calculation of retention metrics for anonymous users since we do not have a user ID for those.
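The app-platform session expiry rule described above can be sketched as follows. The names and the token generator are assumptions; only the 15-minute background rule comes from this standard.

<syntaxhighlight lang="typescript">
// Sketch: regenerate the session ID when the app returns to the foreground
// after more than 15 minutes in the background. Illustrative only.
const SESSION_TIMEOUT_MS = 15 * 60 * 1000;

let sessionId: string = newToken();
let backgroundedAt: number | null = null;

function onAppBackgrounded(now: number): void {
  backgroundedAt = now;
}

function onAppForegrounded(now: number): void {
  if (backgroundedAt !== null && now - backgroundedAt > SESSION_TIMEOUT_MS) {
    sessionId = newToken(); // new session after 15 minutes of inactivity
  }
  backgroundedAt = null;
}

function newToken(): string {
  // Placeholder token generator for the sketch.
  return Math.random().toString(16).slice(2);
}
</syntaxhighlight>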

Sequence
Provides identifiers for associating events that are part of the same activity group or funnel. All properties specified in this fragment are automatic properties, and reserved for the use of the instrumentation platform library. Within the library, they are serviced by the association controller.


 * Identifies a sequence of actions in the same context or funnel. Useful for grouping impressions with their corresponding clicks, and for grouping the steps in a process such as making an edit. The activity identifier can be randomly generated or implemented as a counter.


 * Starting at 1, this is a counter for reconstructing the order of events in the same activity. Rather than the timestamp of the event, this sequence_id can be used to establish the exact sequence of events.

Validation
Schema validation is performed by EventGate, the stream intake service. In order to pass through EventGate, an event must identify a schema (done with its required $schema property), the schema it identifies must match the schema supported by the stream the event is addressed to, and finally the event must be classified as a valid instance of the schema by the schema validator. Events which do not pass validation are discarded immediately.
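The validation step itself is ordinary JSONSchema validation. The sketch below illustrates the decision using the Ajv validator; it does not describe EventGate's actual implementation, and the schema shown is a stand-in.

<syntaxhighlight lang="typescript">
import Ajv from 'ajv';

// Sketch of the validation decision made at intake. In the real service the
// schema is resolved from the event's $schema reference and checked against
// the stream's allowed schema before this point.
const ajv = new Ajv();

const standInSchema = {
  type: 'object',
  properties: { action: { type: 'string' } },
  required: ['action'],
  additionalProperties: true,
};

const validate = ajv.compile(standInSchema);

function acceptOrDiscard(event: object): boolean {
  // Events which do not pass validation are discarded immediately.
  return validate(event) === true;
}
</syntaxhighlight>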

Streams
A stream (also event stream) is a named, globally-writable destination for events. Events addressed to a stream will flow to a common datastore defined by that stream. Events may be submitted to a stream from any instrument, instance, or platform. Inside a stream, events are ordered like a queue, using a first-in, first-out (FIFO) policy.

Streams are defined by a stream configuration file. This file defines the stream's name, which uniquely identifies it, an expected schema, which defines the type of events it will accept, and an additional set of rules for accepting or rejecting events. These rules are loaded by the IPL during its capability negotiation phase. When an event is submitted to the stream, the IPL will execute the accept/reject rules specified by the stream configuration in situ.

Stream configuration is a cornerstone of releaseless iteration. The stream, and its configuration, provides global control over all instruments and platform clients that seek to produce events into a data set. Changes made to stream configuration can go live in the same day, and require no application software deployment.

Stream configuration
Stream configuration is a collection of properties that define a stream. At a minimum, a stream configuration must specify a name, which uniquely identifies the stream, and a schema, which identifies the one type of schema that the stream will accept. In addition to these basic requirements, stream configuration supports a rich set of properties to control the flow of events to the stream.

Its canonical format is a JSON string. There is no single source for stream configuration, but there are a number of interfaces available for products to use with their custom fetch algorithms.
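As an illustration only, a single stream's configuration entry might look roughly like the object below. Apart from the stream name and the schema reference, the property names here are assumptions; the authoritative set of supported properties is defined by the stream configuration system itself.

<syntaxhighlight lang="typescript">
// Illustrative stream configuration entry. Property names beyond the stream
// name and schema title are assumptions for this example.
const streamConfig = {
  'product.button_click': {
    schema_title: 'analytics/button_click', // the one schema this stream accepts
    sample: {
      unit: 'session', // hypothetical sampling unit (e.g. per session)
      rate: 0.01,      // hypothetical rate: accept roughly 1% of units
    },
  },
};
</syntaxhighlight>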

Accept
The accept section contains rules that must be met in order for an event to be accepted for processing. Being accepted means that the event will be produced to the output buffer. The accept check is computed on the client, inside the IPL. The computation takes place at the sampling step of the event lifecycle. The check computes the logical AND of the accept conditions. If the result is true, then the accept check succeeds. Otherwise, it fails. If the accept check fails, the event is discarded. The accept conditions are enforced by the IPL, and will not be computed or honored if the event is submitted directly.
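A minimal sketch of the accept check, assuming the accept conditions can be represented as boolean-valued predicates over the client's context, is shown below.

<syntaxhighlight lang="typescript">
// Sketch of the accept check: the logical AND of all accept conditions.
// The representation of a condition is an assumption for illustration.
type AcceptCondition = (context: Record<string, unknown>) => boolean;

function passesAcceptCheck(
  conditions: AcceptCondition[],
  context: Record<string, unknown>
): boolean {
  // Any failing condition causes the event to be discarded; an empty list
  // of conditions accepts by default.
  return conditions.every((condition) => condition(context));
}
</syntaxhighlight>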

Copy
Most streams will have events addressed to them directly by an instrument. A stream can also be configured to receive copies of events that were addressed to a different stream (see event duplication, below).

Event lifecycle
The instrumentation event pipeline defines a sequence of processing stages that an instrumentation event will pass through in its journey from its originating instrument to its ultimate datastore. As it progresses from stage to stage, the event may be discarded or duplicated, and its automatic properties will be changed in various ways by different software handlers.

Event rejection
Event rejection refers to the deliberate discarding of events (as opposed to event loss due to error). Event rejection is the major controlling factor in the event pipeline. Imagined as a flow of events, the pipeline resembles a funnel, and a rather harsh one: the vast majority of events are rejected at some stage.

An event will be rejected by IPL (the client) if:
 * 1) The client configuration algorithm returns false.
 * 2) The in_sample algorithm returns false.

An event will be rejected by EventGate (the server) if:
 * 1) The schema validation algorithm returns false.

Event duplication
An event bound for one stream can be copied to other streams in a prescribed fashion. This makes it convenient to perform multiple simultaneous studies using the same instrument for data collection, and allows an entire study to be implemented without altering application code.

Event loss due to error
Event loss due to error (as opposed to rejection) refers to the loss of events due to factors beyond the control of the event pipeline. It can occur due to error at any point of the pipeline, but in practice is concentrated in the network interface between the client and the server.

The following conditions are classified as event loss due to error:
 * Network connectivity loss
 * Network route failure (firewall, packet loss, proxy, etc.)
 * Intake server timeout (handling too many concurrent requests)
 * Intake server misconfiguration
 * Client malformed HTTP POST request
 * Client malformed JSON object in POST body
 * Failure to fetch schema for validation
 * Failure to compute validation
 * Failure to perform property value replacement

Event gain due to error
Event gain due to error (as opposed to event duplication) refers to the duplication of events due to factors beyond the control of the event pipeline. It is only known to occur during transmission of events over the network, and only in some legacy browsers.

The following conditions are classified as event gain due to error:
 * Duplicate events sent as a result of browser error

Automatic property value assignment
Automatic property value assignment refers to the assignment of values to automatic properties by the IPL and (less frequently) the EventGate. Assignment of automatic property values must be completed prior to schema validation. After schema validation, property values are treated as immutable by policy. This is because validation is the last reliable time to verify that insertion into the destination storage will be successful.

Automatic property value replacement
Automatic property value replacement refers to the replacement of an automatic property's existing value with a different value. The replacement value is often computed from the original value. Replacement is most commonly done as a form of cardinality reduction. Cardinality reduction is a catch-all term for any procedure that maps values from a large set to values from a smaller (typically much smaller) set. A common example of cardinality reduction in web analytics is mapping the (large) set of user agent strings to the (much smaller) set of browsers, for example:
 * "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6" => "firefox"
 * "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36 OPR/54.0.2952.64" => "opera"

In data science, cardinality reduction is typically done for properties whose values will be used to partition the whole dataset. If the cardinality of the value set is too large, the database cannot keep an efficient index and will not partition efficiently, particularly for large datasets. It is also done as a form of censorship: the mapping from a large set to a small set involves a loss of information, which can benefit privacy (for example, mapping an IP address to a country code). It makes sense to perform mappings like these prior to insertion, rather than at query time or as part of a regular table turnover, because the large cost of the operation is amortized.


 * All property value replacement must occur prior to schema validation (and none after validation)
 * The originating instrument may perform property value replacement
 * Property value replacement should not be performed between the originating instrument and the instrumentation platform library. Additional layers of pre-processing which take place outside the scope of this documented standard will lead to a loss of the developer's ability to reason about how an event evolves.
 * Property value replacement must not be performed by the instrumentation client library (it does perform property value assignment). Properties and property values that the originating instrument provides to the submit function are faithfully handled.
 * Property value replacement at the server (pre-validation) is indicated only when
 * The property value that will be reduced is assigned at the server (pre-validation), OR
 * The cardinality reduction is too complicated to be computed on the client

The mapping of user agent strings to browsers is an example of a property value replacement that is best sited at the server (pre-validation), for both reasons. First, the client is not very good at reporting its own user agent string, but the string can be found in an HTTP request header, which is easily accessible to the server. Second, computing the mapping requires a high degree of sophistication due to the number of cases involved (see e.g. UA Parser). Performing this computation on the client is undesirable because it would require shipping a large amount of code for this purpose alone, which harms performance.
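A toy version of the mapping, with only a handful of hard-coded cases, is sketched below. A production implementation would use a full parser such as UA Parser rather than simple substring checks.

<syntaxhighlight lang="typescript">
// Toy cardinality reduction: map a high-cardinality user agent string to a
// small set of browser families. Illustrative only.
function browserFamily(userAgent: string): string {
  // Opera's user agent also contains "Chrome/", so it must be checked first.
  if (userAgent.includes('OPR/') || userAgent.includes('Opera')) {
    return 'opera';
  }
  if (userAgent.includes('Firefox/')) {
    return 'firefox';
  }
  if (userAgent.includes('Chrome/')) {
    return 'chrome';
  }
  return 'other';
}
</syntaxhighlight>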

{{Hint|Policy: annotating replaced properties in schema

Up until schema validation, an event can have any property with any value that it wants. All properties and values get checked for schema conformance once and only once, at schema validation.

If a property value will be replaced, the replacement must happen prior to schema validation. To help the instrumentation writer understand, the property should be flagged as 'expected to be replaced' or 'dynamic' or 'to-be-reduced' somehow (TODO), and the schema must provide sufficient criteria so that if the reduction has not been successful, schema validation will fail. In other words, assume the motivation for property value replacement is cardinality reduction: we want to map members of a large set to members of a smaller set, and if this mapping is unsuccessful, we want schema validation to fail. We must therefore write the schema so that the only valid values for the property in question are members of the small set.}}
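For example, if a raw user agent string is to be replaced by a browser family before validation, the schema can restrict the property to the small set of families, so that an event still carrying an unreplaced value fails validation. The property and family names below are illustrative.

<syntaxhighlight lang="typescript">
// Sketch: constrain the property so that only members of the small set are
// valid. An event whose value was not replaced would fail schema validation.
const browserFamilyProperty = {
  browser_family: {
    type: 'string',
    enum: ['chrome', 'firefox', 'safari', 'edge', 'opera', 'other'],
  },
};
</syntaxhighlight>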

Automatic property value censorship
Automatic property value censorship refers to the intentional destruction of a property and its value, typically for the purposes of privacy. Some property values may contain sensitive information (or information that could become sensitive if stored with other values). This need not only refer to PII as legally defined. In general, care should be taken collecting property values that will need to be censored. If possible, this information should be handled on the server.

Instrumentation platform library (IPL)
The instrumentation platform library (IPL) (name TBD) is a multiplatform family of library software packages implementing the common instrumentation interface (CII) (name TBD). Instruments which use the IPL are in conformance with the Wikimedia Instrumentation Standard (WIS) (name TBD). The IPL ensures statistical coherence across platforms by providing:
 * Automatic, portable assignment of the managed event property values defined in standard schema fragments.
 * Identical algorithms for randomness, sampling, and targeting, as well as identical handling of event rejection, event gain, and errors.
 * Identical response to event stream configuration directives.
 * Identical transmission behavior.

It also manages certain hard problems related to instrumentation, such as:
 * Battery life preservation
 * Loss of network connectivity
 * Overload of events

Library organization
The library is constructed in modules, called controllers. Each controller is driven by stream configuration directives, which are dynamically loaded at runtime, allowing the instrumentation to be changed remotely without an additional software deployment.


 * Stream controller
 * Responsible for loading the stream configurations and providing them to the library at runtime.


 * Sampling controller
 * Responsible for sampling-based event rejection. Uses the information from the stream controller to determine whether an event or a stream is in- or out-of-sample.


 * Automatic property controllers
 * Each of these controllers manages a set of properties defined in a schema fragment with the corresponding name.
 * Identifiers
 * Controls the assignment of random and unique tokens that associate events belonging to a common scope (consider renaming to scope controller?).
 * Activity
 * User
 * Page

<!--

Supporting portability
The instrumentation library concentrates the majority of the unportable code in itself, where care and attention can be focused on ensuring that portability is working properly. Instruments become simpler, and data quality and portability are improved.

Portability is important, because the goal of the instrumentation platform is to allow as many client platforms as possible to access a common set of instrumentation capabilities. This set of capabilities is defined in the common instrumentation definition [WIP].

Each target platform receives an instrumentation library that implements the CID[WIP]. Most target platforms will not use the same programming language or runtime environment. To ensure portability, core algorithms are built with a small but well-chosen set of primitives that can be implemented in a transparent style, avoiding the use of language-specific abstractions. This makes it easier to verify critical behavior in a new target language. This core is then wrapped by an integration layer that implements platform-specific functions according to the specified contract. -->

Submit
Submit is the main algorithm of the library and is the only driver of runtime behavior post-initialization. Its steps are carried out in a particular order.
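A rough sketch of that ordering follows. The helper signatures are assumptions made for the sketch; only the sequence of checks is taken from this standard.

<syntaxhighlight lang="typescript">
// Sketch of the ordering inside submit. Helper names and signatures are
// assumptions; only the order of the steps reflects this standard.
interface SubmitDeps {
  getStreamConfig(stream: string): object | undefined;
  clientMeetsRequirements(config: object): boolean;
  inSample(stream: string, config: object): boolean;
  assignAutomaticProperties(stream: string, data: object): object;
  enqueueForTransmission(event: object): void;
}

function submit(deps: SubmitDeps, streamName: string, data: object): void {
  const config = deps.getStreamConfig(streamName);

  // 1) Configuration check: unknown streams or unmet requirements reject the event.
  if (config === undefined || !deps.clientMeetsRequirements(config)) {
    return;
  }

  // 2) Sampling check: out-of-sample clients or events are rejected.
  if (!deps.inSample(streamName, config)) {
    return;
  }

  // 3) Automatic property assignment, then scheduling for network transmission.
  const event = deps.assignAutomaticProperties(streamName, data);
  deps.enqueueForTransmission(event);
}
</syntaxhighlight>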

Configuration check algorithm
Configuration conflict is a form of involuntary sampling. Clients which do not meet certain requirements are not able to submit events to certain streams. This is not classified as loss due to error, since it is defined behavior, but it does not reflect the specification of the instrument developer.

Out-of-sample algorithm
Sampling is interpreted broadly to mean any determination made about whether or not to send an event on a particular stream on a particular client. Fine-grained control of sampling and targeting is performed by the IPL via the stream configuration. The IPL itself can also place the entire client out of sample, independent of any particular stream.
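One common way to make such a determination deterministically is to hash a stable token (for example the session or pageview ID, depending on the configured sampling unit) into the unit interval and compare it against the configured rate. The sketch below illustrates that idea; it is not the library's actual algorithm.

<syntaxhighlight lang="typescript">
// Sketch of a deterministic in-sample decision: hash a stable token into
// [0, 1) and compare against the configured rate. Illustrative only.
function inSample(token: string, rate: number): boolean {
  let hash = 0;
  for (let i = 0; i < token.length; i++) {
    hash = (hash * 31 + token.charCodeAt(i)) >>> 0; // simple 32-bit hash
  }
  return hash / 2 ** 32 < rate;
}

// The same token always yields the same decision, so all events in the same
// session (or pageview) are kept or dropped together.
const sampled = inSample('0abf193cce8d', 0.01);
</syntaxhighlight>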

Stream controller
The stream controller is responsible for making the stream configuration available for use in the library. The library uses the event stream configuration to drive various aspects of event submission.

Association controller
The association controller has a PAGEVIEW_ID, which represents the pageview token for the current page.

The association controller has a SESSION_ID, which represents the session token for the current session.

The association controller has an ACTIVITY_TABLE, initially empty, which holds an activity token for each stream.

The association controller has an ACTIVITY_COUNT, a monotonically increasing counter that is incremented for each new activity started.
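A compact sketch of this state, and of one way an activity token could be issued per stream, follows. The token format and the way ACTIVITY_COUNT is combined into the token are assumptions.

<syntaxhighlight lang="typescript">
// Sketch of the association controller's state. Token formats and the way
// ACTIVITY_COUNT is folded into activity tokens are assumptions.
class AssociationController {
  private pageviewId: string = this.newToken();      // PAGEVIEW_ID
  private sessionId: string = this.newToken();       // SESSION_ID
  private activityTable = new Map<string, string>(); // ACTIVITY_TABLE
  private activityCount = 1;                          // ACTIVITY_COUNT

  getPageviewId(): string {
    return this.pageviewId;
  }

  getSessionId(): string {
    return this.sessionId;
  }

  // Return the activity token for a stream, starting a new activity if none exists.
  getActivityId(streamName: string): string {
    let id = this.activityTable.get(streamName);
    if (id === undefined) {
      id = this.newToken() + String(this.activityCount++).padStart(4, '0');
      this.activityTable.set(streamName, id);
    }
    return id;
  }

  private newToken(): string {
    // Placeholder token generator for the sketch.
    return Math.random().toString(16).slice(2, 10);
  }
}
</syntaxhighlight>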

in_sample
.

Network controller
TODO: write out logic of the output queue and when it is flushed, with illustrations from state diagrams of each platform.

Library Integration Functions
The core algorithms make use of a number of platform-specific functions that are defined in the integration layer. These functions and their contracts are outlined below.

get_persisted_value
.

set_persisted_value
.

del_persisted_value
.
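Pending a full specification of each function, their contracts can be summarized as a small interface. The signatures below are assumptions.

<syntaxhighlight lang="typescript">
// Sketch of the persistence contract expected from the integration layer.
// Signatures are assumptions pending specification.
interface PersistenceIntegration {
  // Return the value stored under `key`, or undefined if nothing is persisted.
  get_persisted_value(key: string): string | undefined;

  // Persist `value` under `key`, surviving page loads or app restarts where
  // the platform allows it.
  set_persisted_value(key: string, value: string): void;

  // Remove any value persisted under `key`.
  del_persisted_value(key: string): void;
}
</syntaxhighlight>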

Fetch
Fetching of stream configuration is a platform-dependent process that may be tailored to the needs of the platform and its requirements. Stream configuration data may be injected as page configuration (e.g. via ResourceLoader), fetched via one or more HTTP requests to a stream configuration endpoint (e.g. on the apps), or provisioned locally (e.g. during development or testing).

Stream configuration fetch can consist of any number of steps, and should not be confused with initialization, which can be run once and only once per application runtime.