Wikimedia Product/Analytics Infrastructure/Specification

From mediawiki.org

READ THIS FIRST

This document is a draft. Don't take it too seriously.

  • Most names are placeholders of some kind.
  • Some parts are going to be wrong.
  • Some parts might be not even wrong.


The instrumentation platform library (IPL) (name TBD) is a multiplatform family of library software packages implementing the common instrumentation interface (CII) (name TBD). Instruments that use the IPL conform to the Wikimedia Instrumentation Standard (WIS) (name TBD). The IPL ensures statistical #coherence across platforms by providing:

  • Automatic, portable assignment of the managed event property values defined in standard schema fragments.
  • Identical algorithms for randomness, sampling, targeting, event rejection, event gain, error handling, etc.
  • Identical response to event stream configuration directives.
  • Identical transmission behavior.

It also manages certain hard problems related to instrumentation, such as:

  • Battery life preservation
  • Loss of network connectivity
  • Overload of events


Design principles

Most developers are familiar with examples of a particular kind of cross-platform library -- POSIX, HTML DOM, and OpenGL, to name a few. While self-identified as "interface", "model", and "library", respectively, they are all examples of a metalibrary, specified in a metalanguage (a language-independent specification), sometimes with a reference implementation in a real programming language but otherwise implemented across various programming languages and system environments. A pervasive example of this pattern that sometimes escapes attention is a programming language itself - the language is the specification, and the implementation is the compiler or interpreter for that language.

The library is constructed in modules, called controllers. Each controller is driven by #instrumentation event stream configuration controls which are dynamically loaded at runtime, allowing the instrumentation to be changed remotely without additional software deployment.

  • Stream controller: Responsible for loading the #event stream configuration controls and providing them to the library at runtime.
  • Sampling controller: Responsible for sampling-based event rejection. Uses the information from the stream controller to determine whether an event or a stream is in-sample or out-of-sample.
  • Automatic property controllers: Each of these controllers manages a set of properties defined in a schema fragment with the corresponding name.
    • Identifiers: Controls the assignment of random and unique tokens that associate events belonging to a common #scope (consider scope controller?).
    • Activity
    • User
    • Page


Library core algorithms

Submit

Submit is the main algorithm of the library and is the only driver of runtime behavior post-initialization. Its steps are carried out in a particular order.

Algorithm: Submit

To submit an instrumentation event carrying data eventData to event stream streamName, the submit method must run these steps:

  1. Verify the event stream streamName is recognized
    1. Let config be the result of running getStreamConfig on streamName.
    2. If config is NULL, return.
      The StreamController could not recognize streamName.
      We assume the client is misconfigured, i.e., the instrumentor has failed to create an event stream configuration, make an event stream configuration available for loading, or has specified the wrong streamName. Rather than produce potentially inconsistent data, the event submission does not proceed.
  2. Verify eventData is well-formed
    1. If eventData is NULL, undefined (implementation-dependent), or empty, return.
      Every event must define at least its #schema reference, therefore an empty eventData can never be valid.
    2. If eventData.$schema is NULL, undefined, or the empty string, return.
      Every event must define its #schema reference. See #why does the instrument need to say the schema?.
  3. Assign the event's #stream name.
    1. Let eventData.meta.stream be streamName.
  4. Assign the event's #submission time. Jump to the first appropriate substep:
      TODO: Actually doesn't this need to be done for ALL of the properties that are set by the library, not just this one? Yes, I think this is the case.
    1. If eventData.client_dt is not set, let eventData.client_dt be the current #client datetime to millisecond resolution, formatted as an #ISO-8601 datetime string.
    2. If eventData.client_dt is set, do nothing. Assume that this is a #Stream CC event, which should have the same timestamp as its original.
  5. Dispatch copies of eventData to any designated copy target streams
    1. For each copyRecipientStreamName in config.copyToStreamNames[ streamName ]:
      1. Let copyOfEventData be a copy of eventData.
      2. Run submit( copyRecipientStreamName, copyOfEventData ).
  6. Schedule the event for transmission
    1. If STREAM_INTAKE_SERVICE_URL is not defined, return.
    2. Let eventDataSerialized be the result of running a JSON string serialization algorithm on eventData.
    3. Run #schedule event transmission on STREAM_INTAKE_SERVICE_URL and eventDataSerialized.
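
The Submit steps above can be sketched in TypeScript as follows. This is only an illustration under stated assumptions, not a reference implementation: the `StreamConfig` and `EventData` shapes, the `streamConfigs` map, the `scheduled` array standing in for the network controller, the stream names, and the intake URL are all hypothetical. The copy targets are read from a `copyTargets` property, which is one of the two names the draft uses for this setting.

```typescript
// Hypothetical shapes; the draft leaves the configuration format open.
type StreamConfig = { copyTargets?: string[] };
type EventData = {
  $schema?: string;
  client_dt?: string;
  meta?: { stream?: string };
  [key: string]: unknown;
};

// Assumed stand-ins for the stream controller and the network controller.
const streamConfigs = new Map<string, StreamConfig>([
  ["edit.main", { copyTargets: ["edit.audit"] }],
  ["edit.audit", {}],
]);
const STREAM_INTAKE_SERVICE_URL: string | undefined = "https://intake.example/v1/events";
const scheduled: Array<{ url: string; body: string }> = [];

function getStreamConfig(streamName: string): StreamConfig | null {
  return streamConfigs.get(streamName) ?? null;
}

function submit(streamName: string, eventData: EventData | null | undefined): void {
  // 1. Verify the event stream is recognized.
  const config = getStreamConfig(streamName);
  if (config === null) return;

  // 2. Verify eventData is well-formed and names its schema.
  if (eventData == null || Object.keys(eventData).length === 0) return;
  if (!eventData.$schema) return;

  // 3. Assign the stream name.
  eventData.meta = { ...eventData.meta, stream: streamName };

  // 4. Assign the submission time unless it is already set (a copied event).
  if (eventData.client_dt === undefined) {
    eventData.client_dt = new Date().toISOString();
  }

  // 5. Dispatch a copy to each copy target stream.
  for (const copyRecipientStreamName of config.copyTargets ?? []) {
    const copyOfEventData: EventData = JSON.parse(JSON.stringify(eventData));
    submit(copyRecipientStreamName, copyOfEventData);
  }

  // 6. Schedule the serialized event for transmission.
  if (STREAM_INTAKE_SERVICE_URL === undefined) return;
  scheduled.push({ url: STREAM_INTAKE_SERVICE_URL, body: JSON.stringify(eventData) });
}
```

Because the copy is taken after step 4, a copied event keeps the timestamp of its original, while step 3 overwrites its stream name on the recursive call.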

Configuration check algorithm

Configuration conflict is a form of involuntary sampling. Clients that do not meet certain requirements are not able to receive events from certain streams. This is not classified as loss due to error, because it is defined behavior, but it does not reflect the intent specified by the instrument developer.

Algorithm: Configuration check

Given an event stream streamName, the client's configuration check fails if any of the following holds:

  1. Event stream not provided
  2. The client's clock, randomness, or other compatibility has caused the library to disable itself
  3. The client's submission rate has climbed so high that the library has disabled itself

Out-of-sample algorithm

Sampling is interpreted broadly to mean any determination made about whether or not to send an event on a particular stream on a particular client. Fine-grained control of sampling and targeting is performed by the IPL via the stream configuration. The IPL itself can also place the entire client out-sample, independent of any particular stream.

Algorithm: Sampling check

Given an event stream streamName, an event is out-of-sample if any of the following holds:

  1. Stream disabled
  2. User opt-out
  3. User not in sample (according to sampling logic)

Network controller

Schedule

Algorithm: Schedule

To schedule an instrumentation event eventDataSerialized to a destination streamIntakeServiceURI, the network controller must run these steps:

  1. Enqueue eventDataSerialized and streamIntakeServiceURI
  2. When the queue wakes, run the #send algorithm on each item in the queue

Send

Algorithm: Send

To send an instrumentation event eventDataSerialized to a destination streamIntakeServiceURI, the network controller must run these steps:

  1. Create an HTTP POST request with URL streamIntakeServiceURI and request body eventDataSerialized
  2. Issue the HTTP POST request asynchronously and do not wait for or handle response or timeout ("fire-and-forget")
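
A minimal TypeScript sketch of the Schedule and Send algorithms above, under stated assumptions. The class name `NetworkController`, the `Transport` callback type, and the explicit `sendAllScheduled` call that stands in for "the queue wakes" are hypothetical; the draft does not yet define when the queue is flushed. The transport is injected so that the sketch does not depend on any particular HTTP client.

```typescript
// Hypothetical callback that issues the HTTP POST and ignores the outcome.
type Transport = (url: string, body: string) => void;

class NetworkController {
  private queue: Array<{ url: string; body: string }> = [];
  private readonly transport: Transport;

  constructor(transport: Transport) {
    this.transport = transport;
  }

  // Schedule, step 1: enqueue the serialized event with its destination.
  schedule(streamIntakeServiceURI: string, eventDataSerialized: string): void {
    this.queue.push({ url: streamIntakeServiceURI, body: eventDataSerialized });
  }

  // Schedule, step 2: when the queue wakes, send every queued item.
  sendAllScheduled(): void {
    const items = this.queue;
    this.queue = [];
    for (const item of items) {
      this.send(item.url, item.body);
    }
  }

  // Send: hand the request to the transport without waiting for a result.
  send(streamIntakeServiceURI: string, eventDataSerialized: string): void {
    this.transport(streamIntakeServiceURI, eventDataSerialized);
  }
}
```

On a web client the transport could, for example, wrap `fetch` with the `POST` method and discard the returned promise; that choice belongs to the platform integration layer and is not prescribed here.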

Stream controller

The stream controller is responsible for making the #event stream configuration available for use in the library. The library uses the event stream configuration to drive various aspects of event submission.

type streamConfigItem

interface StreamController {
  private readonly dict streamConfig;
  public streamConfigItem? getStreamConfig( string streamName );
  public [string]? getCopyTargets( string streamName );
  constructor( dict fetchedStreamConfig );
}

Constructor

Algorithm: Construct stream configuration

The constructor loads the fetched stream configuration into a read-only property. After the configuration has been fetched as fetchedStreamConfig using an #event stream configuration fetch procedure, run the following steps:

  1. Initialize streamConfig to the empty dict
  2. For each key streamName in fetchedStreamConfig, let StreamController.streamConfig[ streamName ] be fetchedStreamConfig[ streamName ].
    Note that the creation of stream CC routes has been delegated to the #stream configuration service, so we don't need to have extra work here constructing them.

getStreamConfig

Algorithm: getStreamConfig

To retrieve a stream configuration for stream streamName, run the following steps:

  1. If a key matching streamName exists in StreamController.streamConfig, and StreamController.streamConfig[ streamName ] has type dict, return StreamController.streamConfig[ streamName ].
    TODO, how to verify it's a real stream config? Do it in submit() then.
    The value under the key may be an empty dict, but NULL or undefined values are not acceptable.
  2. Return NULL
    The stream streamName is not recognized.

getCopyTargets

Algorithm: getCopyTargets

To retrieve the copy targets for stream streamName, run the following steps:

  1. If a key matching streamName exists in StreamController.streamConfig, and StreamController.streamConfig[ streamName ] has type dict, and StreamController.streamConfig[ streamName ].copyTargets exists and has type array of string, return StreamController.streamConfig[ streamName ].copyTargets.
    The value under the key may be an empty dict, but NULL or undefined values are not acceptable.
  2. Return NULL
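
The constructor and the two lookup algorithms above could look roughly like this in TypeScript. This is a sketch under assumptions: the helper `isDict` and the use of a `Map` for the read-only property are illustrative choices, and the stream names in the usage note are invented.

```typescript
// Helper (hypothetical): true for plain dict-like values, false for NULL and arrays.
function isDict(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

class StreamController {
  private readonly streamConfig: Map<string, unknown>;

  // Copy every key of the fetched configuration into the read-only property.
  constructor(fetchedStreamConfig: Record<string, unknown>) {
    this.streamConfig = new Map(Object.entries(fetchedStreamConfig));
  }

  // Returns the configuration for streamName, or null if it is not recognized.
  // An empty dict is acceptable; NULL or undefined values are not.
  getStreamConfig(streamName: string): Record<string, unknown> | null {
    const item = this.streamConfig.get(streamName);
    return isDict(item) ? item : null;
  }

  // Returns copyTargets only if it exists and is an array of strings.
  getCopyTargets(streamName: string): string[] | null {
    const item = this.getStreamConfig(streamName);
    if (item === null) return null;
    const targets = item["copyTargets"];
    if (Array.isArray(targets) && targets.every((t) => typeof t === "string")) {
      return targets as string[];
    }
    return null;
  }
}
```

For example, constructing the controller with `{ "edit.main": { copyTargets: ["edit.audit"] }, "edit.audit": {} }` makes `getStreamConfig("edit.audit")` return the empty dict, while `getCopyTargets("edit.audit")` and `getStreamConfig("missing")` both return null.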

Association controller

type tokenString

interface AssociationController {
  tokenString? PAGEVIEW_ID;
  tokenString? SESSION_ID;
  dict? ACTIVITY_TABLE;
  integer? ACTIVITY_COUNT;
  const string activityTableStorageKey;
  const string activityCountStorageKey;
  const string sessionIdStorageKey;
}

The association controller has a PAGEVIEW_ID, initially NULL, which represents the pageview token for the current page.

The association controller has a SESSION_ID, initially NULL, which represents the session token for the current session.

The association controller has an ACTIVITY_TABLE, initially NULL, which holds activity tokens for each stream.

The association controller has an ACTIVITY_COUNT, initially NULL, which holds a monotonically increasing integer, incremented for each new activity started.

pageview_id

Algorithm: pageview_id

To retrieve the current pageview id, run the following steps:

  1. If PAGEVIEW_ID is NULL
    1. Let newPageviewId be the result of running #generate random id.
    2. Let PAGEVIEW_ID equal newPageviewId
  2. Return PAGEVIEW_ID

session_id

Algorithm: session_id

To retrieve the current session id, run the following steps:

  1. If SESSION_ID is NULL, run the following steps
    1. Let sessionIdPersisted be the result of running #get persisted value on the sessionIdStorageKey.
    2. If sessionIdPersisted is not NULL, let SESSION_ID equal sessionIdPersisted, otherwise run the following steps
      1. Let sessionIdGenerated be the result of running #generate random id.
      2. Run #set persisted value on the sessionIdStorageKey and sessionIdGenerated.
      3. Let SESSION_ID equal sessionIdGenerated.
  2. Return SESSION_ID.
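
The pageview_id and session_id algorithms can be illustrated with the following TypeScript sketch. The `PersistentStore` interface, the `MemoryStore` class, the `SessionTokens` class, and the storage key value are assumptions made for the example; they stand in for #get persisted value, #set persisted value, and #generate random id, whose contracts the draft has not yet written down.

```typescript
// Assumed contract: get returns null when nothing is stored under the key.
interface PersistentStore {
  get(key: string): string | null;
  set(key: string, value: string): void;
}

// In-memory stand-in for platform storage, for illustration only.
class MemoryStore implements PersistentStore {
  private readonly values = new Map<string, string>();
  get(key: string): string | null {
    return this.values.get(key) ?? null;
  }
  set(key: string, value: string): void {
    this.values.set(key, value);
  }
}

class SessionTokens {
  private pageviewId: string | null = null;
  private sessionId: string | null = null;
  private readonly sessionIdStorageKey = "sessionId"; // hypothetical key
  private readonly store: PersistentStore;
  private readonly generateRandomId: () => string;

  constructor(store: PersistentStore, generateRandomId: () => string) {
    this.store = store;
    this.generateRandomId = generateRandomId;
  }

  // Pageview ids live only in memory and are generated on first use.
  getPageviewId(): string {
    if (this.pageviewId === null) {
      this.pageviewId = this.generateRandomId();
    }
    return this.pageviewId;
  }

  // Session ids are read from persistent storage, or generated and persisted.
  getSessionId(): string {
    if (this.sessionId !== null) return this.sessionId;
    const sessionIdPersisted = this.store.get(this.sessionIdStorageKey);
    if (sessionIdPersisted !== null) {
      this.sessionId = sessionIdPersisted;
      return sessionIdPersisted;
    }
    const sessionIdGenerated = this.generateRandomId();
    this.store.set(this.sessionIdStorageKey, sessionIdGenerated);
    this.sessionId = sessionIdGenerated;
    return sessionIdGenerated;
  }
}
```

Two instances that share one store (for instance, two pages in the same session) would report the same session id but different pageview ids.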

activity_id

Algorithm: activity_id

To retrieve the current activity id for stream streamName and activity activityName, run the following steps:

  1. If ACTIVITY_COUNT or ACTIVITY_TABLE is NULL, run the following steps:
    1. Let ACTIVITY_COUNT be the result of running #get persisted value on the activityCountStorageKey
    2. Let ACTIVITY_TABLE be the result of running #get persisted value on the activityTableStorageKey
    3. If ACTIVITY_COUNT or ACTIVITY_TABLE is NULL, run the following steps:
      1. Let ACTIVITY_COUNT equal 1
      2. Let ACTIVITY_TABLE equal the empty object
      3. Run #set persisted value on activityCountStorageKey and ACTIVITY_COUNT
      4. Run #set persisted value on activityTableStorageKey and ACTIVITY_TABLE
  2. If streamName is NULL, undefined, or the empty string, return.
  3. If ACTIVITY_TABLE[ streamName ] is not set, run the following steps:
    1. Let ACTIVITY_TABLE[ streamName ] equal ACTIVITY_COUNT + 1.
    2. Let ACTIVITY_COUNT equal ACTIVITY_COUNT + 1.
    3. Run #set persisted value on activityCountStorageKey and ACTIVITY_COUNT.
    4. Run #set persisted value on activityTableStorageKey and ACTIVITY_TABLE.
  4. Let currentCount equal ACTIVITY_TABLE[ streamName ].
  5. Return activityName+(currentCount+0x10000).toString(16).slice(1).
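
A TypeScript sketch of the activity_id steps, under assumptions: the class name `ActivityTokens`, the storage key values, and the use of a `Map<string, string>` holding serialized values as the persistent store are all hypothetical. The last line is the draft's own expression, which renders the counter as four zero-padded hexadecimal digits for counts below 0x10000.

```typescript
// Hypothetical storage key values; the draft only names the constants.
const activityCountStorageKey = "activityCount";
const activityTableStorageKey = "activityTable";

class ActivityTokens {
  private activityCount: number | null = null;
  private activityTable: Record<string, number> | null = null;
  private readonly store: Map<string, string>;

  constructor(store: Map<string, string>) {
    this.store = store;
  }

  activityId(streamName?: string, activityName: string = ""): string | undefined {
    // Step 1: load the counter and table from storage, or create them.
    let count = this.activityCount;
    let table = this.activityTable;
    if (count === null || table === null) {
      const persistedCount = this.store.get(activityCountStorageKey);
      const persistedTable = this.store.get(activityTableStorageKey);
      if (persistedCount === undefined || persistedTable === undefined) {
        count = 1;
        table = {};
        this.persist(count, table);
      } else {
        count = Number(persistedCount);
        table = JSON.parse(persistedTable) as Record<string, number>;
      }
    }
    this.activityCount = count;
    this.activityTable = table;

    // Step 2: without a stream name there is nothing to look up.
    if (!streamName) return undefined;

    // Step 3: assign the next counter value to a stream with no activity yet.
    const existing: number | undefined = table[streamName];
    let currentCount: number;
    if (existing === undefined) {
      currentCount = count + 1;
      table[streamName] = currentCount;
      this.activityCount = currentCount;
      this.persist(currentCount, table);
    } else {
      currentCount = existing;
    }

    // Steps 4 and 5: append the counter as zero-padded hexadecimal.
    return activityName + (currentCount + 0x10000).toString(16).slice(1);
  }

  private persist(count: number, table: Record<string, number>): void {
    this.store.set(activityCountStorageKey, String(count));
    this.store.set(activityTableStorageKey, JSON.stringify(table));
  }
}
```

Since the counter starts at 1 and is incremented before assignment, the first activity on a fresh client is numbered 2, so `activityId("search", "search")` yields `"search0002"`.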

begin_new_session

Algorithm: begin_new_session

To begin a new session, run the following steps:

  1. Let PAGEVIEW_ID equal NULL
    Pageviews are nested in sessions; a change of session necessitates a change of pageview.
    Pageviews are not persisted, so they do not need to be removed from persistent storage.
  2. Let SESSION_ID equal NULL
  3. Run #delete persisted value on sessionIdStorageKey.
  4. Let ACTIVITY_TABLE equal NULL
  5. Let ACTIVITY_COUNT equal NULL
  6. Run #delete persisted value on activityTableStorageKey.
  7. Run #delete persisted value on activityCountStorageKey.

begin_new_activity

Algorithm: begin_new_activity

To begin a new activity for stream streamName, run the following steps:

  1. Run activity_id().
    This ensures ACTIVITY_TABLE and ACTIVITY_COUNT are loaded from the persistent store, or generated.
  2. If ACTIVITY_TABLE[ streamName ] is set, run the following steps:
    1. Unset ACTIVITY_TABLE[ streamName ]
      I.e., delete or remove the key streamName and its value from ACTIVITY_TABLE.
    2. Run #set persisted value on activityTableStorageKey and ACTIVITY_TABLE
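
The effect of begin_new_activity and begin_new_session on the persisted state can be sketched as below. This is a simplified illustration, not the specified design: it omits the in-memory PAGEVIEW_ID, SESSION_ID, ACTIVITY_TABLE, and ACTIVITY_COUNT caches and reads the store on every call. The key values, the `Store` alias, and the helper names `loadActivities`, `saveActivities`, and `activityNumber` are hypothetical.

```typescript
// Hypothetical storage key values.
const SESSION_KEY = "sessionId";
const COUNT_KEY = "activityCount";
const TABLE_KEY = "activityTable";

type Store = Map<string, string>;
type Activities = { count: number; table: Map<string, number> };

function saveActivities(store: Store, state: Activities): void {
  store.set(COUNT_KEY, String(state.count));
  store.set(TABLE_KEY, JSON.stringify(Array.from(state.table.entries())));
}

// Loads the persisted counter and table, or creates them with count 1.
function loadActivities(store: Store): Activities {
  const count = store.get(COUNT_KEY);
  const table = store.get(TABLE_KEY);
  if (count === undefined || table === undefined) {
    const fresh: Activities = { count: 1, table: new Map<string, number>() };
    saveActivities(store, fresh);
    return fresh;
  }
  return { count: Number(count), table: new Map(JSON.parse(table) as Array<[string, number]>) };
}

// Returns the activity number for a stream, assigning the next one if unset.
function activityNumber(store: Store, streamName: string): number {
  const state = loadActivities(store);
  const existing = state.table.get(streamName);
  if (existing !== undefined) return existing;
  state.count += 1;
  state.table.set(streamName, state.count);
  saveActivities(store, state);
  return state.count;
}

// begin_new_activity: forget the stream's entry so the next lookup assigns a new number.
function beginNewActivity(store: Store, streamName: string): void {
  const state = loadActivities(store);
  if (state.table.delete(streamName)) {
    saveActivities(store, state);
  }
}

// begin_new_session: remove the persisted session id, table, and counter.
function beginNewSession(store: Store): void {
  store.delete(SESSION_KEY);
  store.delete(TABLE_KEY);
  store.delete(COUNT_KEY);
}
```

Beginning a new activity leaves the counter untouched, so the stream receives a number that has never been used in the session; beginning a new session resets the numbering.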

Sampling controller

in_sample

in_sample( streamName, ??? )

.

Network controller

TODO: write out logic of the output queue and when it is flushed, with illustrations from state diagrams of each platform.

schedule

send

send_all_scheduled

Library Integration Functions

The core algorithms make use of a number of platform-specific functions that are defined in the integration layer. These functions and their contracts are outlined below.

get_persisted_value

get_persisted_value( key )

.

set_persisted_value

set_persisted_value( key, value )

.

del_persisted_value

del_persisted_value( key )

.

serializeJSON

unserializeJSON

HTTP POST request manager

Visibility state change

PRNG

Time

Input buffer

Output buffer

Fetch

void fetch( void )

Fetching of stream configuration is a platform-dependent process that may be tailored to the needs of the platform and its requirements. Stream configuration data may be injected as configuration (e.g. ResourceLoader), fetched via one or more HTTP requests to the #Stream configuration service API (e.g. apps), or provisioned locally (e.g. during development or testing).

Stream configuration fetch can consist of any number of steps, and should not be confused with #Load stream configuration, which can be run once and only once per application runtime.
