Wikimedia Product/Analytics Infrastructure/Specification

The instrumentation platform library (IPL) (name TBD) is a multiplatform family of library software packages implementing the common instrumentation interface (CII) (name TBD). Instruments that use the IPL conform to the Wikimedia Instrumentation Standard (WIS) (name TBD). The IPL ensures statistical consistency across platforms through:
 * Automatic, portable assignment of the managed event property values defined in standard schema fragments.
 * Identical algorithms for randomness, sampling, targeting, event rejection, event gain, error handling, etc.
 * Identical response to event stream configuration directives.
 * Identical transmission behavior.

It also manages certain hard problems related to instrumentation, such as:
 * Battery life preservation
 * Loss of network connectivity
 * Overload of events

Design principles
Most developers are familiar with examples of a particular kind of cross-platform library -- POSIX, HTML DOM, and OpenGL, to name a few. While self-identified as "interface", "model", and "library", respectively, they are all examples of a metalibrary, specified in a metalanguage (a language-independent specification), sometimes with a reference implementation in a real programming language but otherwise implemented across various programming languages and system environments. A pervasive example of this pattern that sometimes escapes attention is a programming language itself - the language is the specification, and the implementation is the compiler or interpreter for that language.

The library is constructed in modules, called controllers. Each controller is driven by stream configurations, which are dynamically loaded at runtime, allowing the instrumentation to be changed remotely without additional software deployment.


 * Stream controller
   * Responsible for loading the stream configurations and providing them to the library at runtime.
 * Sampling controller
   * Responsible for sampling-based event rejection. Uses the information from the stream controller to determine whether an event or a stream is in-sample or out-sample.
 * Automatic property controllers
   * Each of these controllers manages a set of properties defined in a schema fragment with the corresponding name.
   * Identifiers
     * Controls the assignment of random and unique tokens that associate events belonging to a common scope (consider scope controller?).
   * Activity
   * User
   * Page

<!--

Supporting portability
The instrumentation library concentrates the majority of the unportable code in itself, where care and attention can be focused on ensuring that portability is working properly. Instruments become simpler, and data quality and portability are improved.

Portability is important, because the goal of the instrumentation platform is to allow as many client platforms as possible to access a common set of instrumentation capabilities. This set of capabilities is defined in the common instrumentation definition [WIP].

Each target platform receives an instrumentation library that implements the CID[WIP]. Most target platforms will not use the same programming language or runtime environment. To ensure portability, core algorithms are built with a small but well-chosen set of primitives that can be implemented in a transparent style, avoiding the use of language-specific abstractions. This makes it easier to verify critical behavior in a new target language. This core is then wrapped by an integration layer that implements platform-specific functions according to the specified contract. -->

Submit
Submit is the main algorithm of the library and is the only driver of runtime behavior post-initialization. Its steps are carried out in a particular order.

Configuration check algorithm
Configuration conflict is a form of involuntary sampling. Clients that do not meet certain requirements are not able to receive events from certain streams. This is not classified as loss due to error, because it is defined behavior, but it does not reflect what the instrument developer specified.

Out-of-sample algorithm
Sampling is interpreted broadly to mean any determination made about whether or not to send an event on a particular stream on a particular client. Fine-grained control of sampling and targeting is performed by the IPL via the stream configuration. The IPL itself can also place the entire client out-sample, independent of any particular stream.
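
As an illustration of one way such a determination could be made, the sketch below maps a token deterministically onto the interval [0, 1) and compares it with a sampling rate. This is not the specified algorithm: the use of SHA-256, the function names `token_in_sample` and `should_send`, and the `sampling`/`rate` configuration keys are all assumptions made for the example.

```python
import hashlib


def token_in_sample(token: str, rate: float) -> bool:
    """Return True if `token` falls in-sample at the given rate.

    Hypothetical helper: hashing the token makes the decision repeatable,
    so the same token always yields the same answer on every platform.
    """
    if rate <= 0.0:
        return False
    if rate >= 1.0:
        return True
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and scale it to [0, 1).
    value = int.from_bytes(digest[:8], "big") / 2**64
    return value < rate


def should_send(stream_config: dict, token: str, client_in_sample: bool = True) -> bool:
    """Combine the client-wide decision with the per-stream sampling rate.

    The "sampling" and "rate" keys are assumed names, not taken from the
    specification. A stream without a rate is treated as fully sampled.
    """
    if not client_in_sample:
        # The whole client is out-sample, regardless of the stream.
        return False
    rate = stream_config.get("sampling", {}).get("rate", 1.0)
    return token_in_sample(token, rate)
```

Because the decision depends only on the token and the rate, two implementations in different languages agree as long as they share the hash function and the byte-to-number mapping.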

Stream controller
The stream controller is responsible for making the stream configuration available for use in the library. The library uses the event stream configuration to drive various aspects of event submission.
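
A minimal sketch of this responsibility, assuming the stream configuration is a mapping from stream name to a dictionary of settings. The class name, method names, and the shape of the configuration are hypothetical and are not defined by this specification.

```python
from typing import Dict, Optional


class StreamController:
    """Holds stream configurations and hands them to the rest of the library."""

    def __init__(self) -> None:
        self._configs: Dict[str, dict] = {}

    def load(self, configs: Dict[str, dict]) -> None:
        # Copy the mapping so later changes by the caller do not leak in.
        self._configs = dict(configs)

    def get(self, stream_name: str) -> Optional[dict]:
        # Returns None when no configuration exists for the stream.
        return self._configs.get(stream_name)

    def is_configured(self, stream_name: str) -> bool:
        return stream_name in self._configs
```

Other controllers would then ask the stream controller for a stream's settings rather than reading configuration themselves.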

Association controller
The association controller has a PAGEVIEW_ID, initially, which represents the pageview token for the current page.

The association controller has a SESSION_ID, initially, which represents the session token for the current session.

The association controller has an ACTIVITY_TABLE, initially, which holds activity tokens for each stream.

The association controller has an ACTIVITY_COUNT, initially, which holds a monotonically increasing sequence of integers, incremented for each new activity started.
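
The state described above could be sketched as follows. The initial values are not given in the text, so this example assumes unset tokens (`None`), an empty table, and a count of zero. The token format, the lazy creation of tokens, and the method names are likewise assumptions for illustration only.

```python
import secrets
from typing import Dict, Optional


class AssociationController:
    """Tracks pageview, session, and per-stream activity tokens."""

    def __init__(self) -> None:
        # Assumed initial values; the specification text does not state them.
        self.pageview_id: Optional[str] = None
        self.session_id: Optional[str] = None
        self.activity_table: Dict[str, str] = {}
        self.activity_count: int = 0

    @staticmethod
    def _new_token() -> str:
        # 80 random bits rendered as 20 hex characters (an arbitrary choice).
        return secrets.token_hex(10)

    def get_pageview_id(self) -> str:
        if self.pageview_id is None:
            self.pageview_id = self._new_token()
        return self.pageview_id

    def get_session_id(self) -> str:
        if self.session_id is None:
            self.session_id = self._new_token()
        return self.session_id

    def begin_activity(self, stream_name: str) -> str:
        # Each new activity increments the counter and gets a fresh token.
        self.activity_count += 1
        token = self._new_token()
        self.activity_table[stream_name] = token
        return token
```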

in_sample
.

Network controller
TODO: write out logic of the output queue and when it is flushed, with illustrations from state diagrams of each platform.

Library Integration Functions
The core algorithms make use of a number of platform-specific functions that are defined in the integration layer. These functions and their contracts are outlined below.

get_persisted_value
.

set_persisted_value
.

del_persisted_value
.
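
The contracts for the three persistence functions above are not yet written out. As a placeholder, the sketch below shows one possible shape for them, backed by an in-memory dictionary. The signatures, the return of `None` for a missing key, and the silent no-op when deleting a missing key are assumptions. A real integration layer would use the platform's own storage.

```python
from typing import Any, Dict, Optional

# Stand-in for platform storage; assumed for illustration only.
_STORE: Dict[str, Any] = {}


def get_persisted_value(key: str) -> Optional[Any]:
    """Return the stored value for `key`, or None if nothing is stored."""
    return _STORE.get(key)


def set_persisted_value(key: str, value: Any) -> None:
    """Store `value` under `key`, replacing any previous value."""
    _STORE[key] = value


def del_persisted_value(key: str) -> None:
    """Remove `key` if present; do nothing otherwise."""
    _STORE.pop(key, None)
```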

Fetch
Fetching of stream configuration is a platform-dependent process that may be tailored to the needs of the platform and its requirements. Stream configuration data may be injected as configuration (e.g. ResourceLoader), retrieved via one or more HTTP requests (e.g. apps), or provisioned locally (e.g. during development or testing).

Stream configuration fetch can consist of any number of steps, and should not be confused with, which can be run once and only once per application runtime.
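
Since a fetch may consist of any number of steps, one way to model it is as a sequence of callables whose results are merged. The sketch below assumes each step returns a mapping from stream name to settings, and that later steps override earlier ones. Both the merge rule and the name `fetch_stream_config` are hypothetical.

```python
from typing import Callable, Dict, Iterable

# A fetch step is any zero-argument callable that returns stream configs.
ConfigStep = Callable[[], Dict[str, dict]]


def fetch_stream_config(steps: Iterable[ConfigStep]) -> Dict[str, dict]:
    """Run each fetch step in order and merge the results."""
    merged: Dict[str, dict] = {}
    for step in steps:
        # Assumed rule: a later step replaces an earlier entry of the same name.
        merged.update(step())
    return merged


def injected_config() -> Dict[str, dict]:
    # Stands in for configuration injected by the platform.
    return {"edit": {"sampling": {"rate": 0.1}}}


def local_config() -> Dict[str, dict]:
    # Stands in for locally provisioned data used in development or testing.
    return {"edit": {"sampling": {"rate": 1.0}}, "debug": {}}


config = fetch_stream_config([injected_config, local_config])
```

An HTTP-based step would fit the same shape: a callable that performs the request and returns the parsed mapping.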