Reading/Web/Quantitative Testing

The following plan is the Reading Web Team's default for quantitative testing of new features. Details will be determined on a per-feature basis. Development of a feature should not begin before a testing plan has been communicated.

Reading Web Quantitative Testing Requirements

Identify relevant metrics [PM, DA, DSGN]
1. List the questions to be answered by data (e.g. does the new feature or UI change improve reader retention? Does it attract enough usage?), perhaps drawing on earlier generative research
2. Maybe: do retrospective analysis (observational analysis) of existing data from status quo
3. Maybe: coordinate with qualitative testing
4. Decide on use of exploratory data analysis vs. A/B testing
Define metrics
1. Maybe: Define success criteria and/or practical significance levels
2. Add quantitative testing tasks to EPIC of the feature
Implementation
1. Note: Work on the feature must not begin prior to establishing metrics. Much of the implementation can happen in parallel.
2. Define testing framework
  1. Define how the metrics will be measured and which users will be subject to this (sampling, A/B test bucketing)
  2. Build out framework for testing infrastructure (bucketing+sampling). Ensure proposal is feasible.
  3. Analytics/engineers sync before continuing, updating infrastructure if necessary [eng, PM, DA]
3. Build feature and data collection
  1. Define an EventLogging schema and data to be logged.
  2. Spec instrumentation, create/update schema page on Meta, privacy considerations, define purging strategy [DA, engineers, PM]
  3. Confirm sampling method, sampling rates, test venues (wikis) and duration [DA, PM, eng]
  4. Implement the feature-flagged feature and its instrumentation within the framework defined above [eng]
  5. Test and sync on instrumentation [eng, DA] - DA and engineers must meet and review prior to deploying new instrumentation.
  6. Seek review from Analytics Engineering/Research about privacy considerations and purging strategy [DA, PM]
4. Maybe: Update cookie documentation if needed [eng]
Notify communities about test [CL, PM]
Roll out feature to beta mode (unless stated otherwise)
Roll out feature to 1-2 wikis for a pre-determined test
Launch test - default scope to the experimental sample (if not defined otherwise)
1. Monitor live data, sanity checks [DA]
Deactivate test [eng]
1. Run test for X months (default to 1 month)
2. Analyze results after first X weeks (default to 2 weeks), publish results, communicate
Iterate when necessary
Graduate feature
Cake
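
For the optional step of defining success criteria and practical significance levels, and for choosing a test duration, it can help to estimate how many observations each group needs. The following is a minimal sketch, assuming the metric is a simple proportion (for example, the share of sessions that use the feature) and using the standard normal-approximation formula for a two-sided two-proportion test. The function name, the defaults (5% significance level, 80% power) and the example numbers are illustrative assumptions, not team policy.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_group(p_control, min_detectable_diff, alpha=0.05, power=0.8):
    """Approximate observations needed in EACH group (test and control).

    p_control: assumed baseline proportion in the control group.
    min_detectable_diff: smallest absolute change considered practically
        significant (hypothetical value chosen by PM/DA).
    """
    p_test = p_control + min_detectable_diff
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p_control + p_test) / 2  # pooled proportion under the null
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p_control * (1 - p_control) + p_test * (1 - p_test))
    ) ** 2
    return ceil(numerator / min_detectable_diff ** 2)


if __name__ == "__main__":
    # Assumed example: 10% baseline, 1 percentage point minimum effect.
    print(sample_size_per_group(0.10, 0.01))
```

Dividing the result by the expected number of observations per day in one bucket gives a rough lower bound on how long the test needs to run.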

Sampling and bucketing

Some notes on expressions we use in A/B tests and other experiments:

The sampling rate (more correctly sampling ratio or sampling fraction) in an experiment is the ratio of the size of the sample (the set for which data is being collected) to the size of the total population. For example, in an A/B test we might randomly assign browser sessions to the new treatment with 4% probability, to the control group (cf. below) also with 4% probability, and not collect data for the remaining 92%. In that case, the sampling ratio would be 8%. The sampling ratio must be large enough to answer the given research questions with sufficient accuracy or statistical significance from the data collected during the experiment's duration, and small enough that the resulting event rate (see below) is manageable by the analytics infrastructure. When the instrumentation involves sensitive data, a lower sampling ratio may be preferable from a privacy standpoint.
In A/B tests and other comparative experiments, the control group is that part of the sample (in the above sense, i.e. the part of the population for which data is being collected) that continues to receive the old/default design (or no treatment). The test group is the part of the sample that receives the new design (or the treatment). For example, the test group may be labelled "A" and the control group "B". In the example above, the control group consists of 4% of the population (not 92% or 96%).
The test and control groups may also be called "buckets", alongside the remainder of the population outside the sample. I.e. in the above example, we would have three buckets of sizes 4%, 4% and 92%. A/B tests as used by the team always collect data for two buckets of equal size. (As of 2018 we are not employing more complicated methods such as multivariate tests or multi-armed bandit experiments.) We sometimes denote this common size of the test and control group (4% in the example) as "bucket size" or "group size". This is referred to in variable names such as $wgPopupsAnonsExperimentalGroupSize.
The event rate of an EventLogging instrumentation (or more generally an online experiment) is usually given as the number of events per second received via that instrumentation. Schemas with average event rates above a certain limit may need to be blacklisted from being recorded in MariaDB tables, i.e. their events will only be stored in the more robust Hadoop cluster (accessible via Hive).
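
The terms above can be illustrated with a small sketch of deterministic bucketing. It assumes that sessions are identified by a string token and that buckets are assigned by hashing that token with SHA-256; both the hashing scheme and all function names here are hypothetical and are not taken from the team's actual bucketing code. The group size of 0.04 mirrors the 4% example above.

```python
import hashlib


def token_to_unit_interval(token, experiment):
    """Map a token to a reproducible number in [0, 1).

    The experiment name is mixed in so that different experiments
    bucket the same session independently (an assumption of this sketch).
    """
    digest = hashlib.sha256(f"{experiment}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:6], "big") / 2**48


def assign_bucket(token, experiment, group_size=0.04):
    """Return 'test', 'control' or 'unsampled' for the given token."""
    x = token_to_unit_interval(token, experiment)
    if x < group_size:
        return "test"
    if x < 2 * group_size:
        return "control"
    return "unsampled"  # no data is collected for this bucket


def sampling_ratio(group_size):
    """Two equal-sized groups are sampled, so the ratio is twice the group size."""
    return 2 * group_size


def expected_event_rate(population_events_per_second, group_size):
    """Rough event rate if every sampled session logs at the population rate."""
    return population_events_per_second * sampling_ratio(group_size)


if __name__ == "__main__":
    counts = {"test": 0, "control": 0, "unsampled": 0}
    for i in range(100_000):
        counts[assign_bucket(f"session-{i}", "demo-experiment")] += 1
    print(counts)
    print(sampling_ratio(0.04))
```

Because the bucket depends only on the token and the experiment name, a session stays in the same bucket for as long as its token is stable, which is what allows events from one session to be grouped together at analysis time.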

Notes on tokens

pageToken: Several of our schemas, including Schema:ReadingDepth and Schema:PageIssues, use a token whose value remains constant for all events occurring during one pageview and is consistent across these schemas. It is based on getPageviewToken() (more details: phab:T201124).
sessionToken: .... (cf. phab:T118063#4547178 ff.)

We often sample by sessionToken, but sometimes also want to measure per-pageview metrics, which should normally be calculated from data that is sampled per pageview instead. This can introduce inaccuracies, because the underlying events or measures may not be statistically independent across the pageviews within one session; however, we assume that such errors will often be very small, given the short average session length.
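
One way to check whether the within-session dependence matters for a given per-pageview metric is to resample whole sessions rather than individual pageviews when estimating uncertainty. The following is a minimal sketch of such a session-level (cluster) bootstrap, assuming the data is available as (sessionToken, outcome) pairs with a 0/1 outcome per pageview. This is an illustrative technique, not a description of how the team's analyses are actually run, and all names are hypothetical.

```python
import random
from collections import defaultdict


def per_pageview_rate(pageviews):
    """pageviews: list of (session_token, outcome) pairs, outcome is 0 or 1."""
    return sum(outcome for _, outcome in pageviews) / len(pageviews)


def session_bootstrap_interval(pageviews, n_resamples=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for the per-pageview rate.

    Whole sessions are resampled with replacement, so pageviews from the
    same session always stay together and their dependence is preserved.
    """
    by_session = defaultdict(list)
    for session_token, outcome in pageviews:
        by_session[session_token].append(outcome)
    sessions = list(by_session.values())

    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        resample = rng.choices(sessions, k=len(sessions))
        total = sum(len(outcomes) for outcomes in resample)
        hits = sum(sum(outcomes) for outcomes in resample)
        estimates.append(hits / total)

    estimates.sort()
    lower_index = int((1 - level) / 2 * n_resamples)
    upper_index = int((1 + level) / 2 * n_resamples) - 1
    return estimates[lower_index], estimates[upper_index]


if __name__ == "__main__":
    # Synthetic data: outcomes are identical within a session (strong dependence).
    data = []
    for i in range(200):
        outcome = 1 if i % 4 == 0 else 0
        data.extend((f"s{i}", outcome) for _ in range(i % 3 + 1))
    print(per_pageview_rate(data))
    print(session_bootstrap_interval(data))
```

If the interval obtained this way is close to the one from a calculation that treats pageviews as independent, the assumption that the error is small is probably reasonable for that metric.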