Reading/Web/Quantitative Testing

The following plan is the Reading Web team's default approach to quantitative testing for all upcoming features. Details will be determined on a per-feature basis. Development of a feature should not begin before a testing plan has been communicated.

Reading Web Quantitative Testing Requirements

 * 1) Identify relevant metrics [PM, DA, DSGN]
 * 2) List questions to be answered by data (e.g. does the new feature or UI change improve reader retention? Does it attract enough usage?), perhaps drawing on earlier generative research
 * 3) Maybe: do a retrospective (observational) analysis of existing data from the status quo
 * 4) Maybe: coordinate with qualitative testing
 * 5) Decide on use of exploratory data analysis vs. A/B testing
 * 6) Define metrics
 * 7) Maybe: Define success criteria and/or practical significance levels
 * 8) Add quantitative testing tasks to EPIC of the feature
 * 9) Implementation
 * 10) Note: Work on the feature must not begin prior to establishing metrics. Much of implementation can happen in parallel
 * 11) Define testing framework
 * 12) Define how the metrics will be measured and which users will be subject to this (sampling, A/B test bucketing)
 * 13) Build out framework for testing infrastructure (bucketing+sampling). Ensure proposal is feasible.
 * 14) Analytics/engineers sync before continuing, updating infrastructure if necessary [eng, PM, DA]
 * 15) Build feature and data collection
 * 16) Define an EventLogging schema and data to be logged.
 * 17) Spec instrumentation, create/update schema page on Meta, privacy considerations, define purging strategy [DA, engineers, PM]
 * 18) Confirm sampling method, sampling rates, test venues (wikis) and duration [DA, PM, eng]
 * 19) Implement the feature behind a feature flag, together with its instrumentation, within the framework defined above [eng]
 * 20) Test and sync on instrumentation [eng, DA] - DA and engineers must meet and review prior to deploying new instrumentation.
 * 21) Seek review from Analytics Engineering/Research about privacy considerations and purging strategy [DA, PM]
 * 22) Maybe: Update cookie documentation if needed [eng]
 * 23) Notify communities about test [CL, PM]
 * 24) Roll out feature to beta mode (unless stated otherwise)
 * 25) Roll out feature to 1-2 wikis for a pre-determined test
 * 26) Launch test - default scope to the experimental sample (if not defined otherwise)
 * 27) Monitor live data, sanity checks [DA]
 * 28) Deactivate test [eng]
 * 29) Run test for X months (default to 1 month)
 * 30) Analyze results after first X weeks (default to 2 weeks), publish results, communicate
 * 31) Iterate when necessary
 * 32) Graduate feature
 * 33) Cake

Sampling and bucketing
Some notes on expressions we use in A/B tests and other experiments:
 * The sampling rate (more correctly, sampling ratio or sampling fraction) in an experiment is the ratio of the sample size (the size of the set for which data is being collected) to the size of the total population. For example, in an A/B test we might randomly assign browser sessions to the new treatment with 4% probability, to the control group (cf. below) also with 4% probability, and not collect data for the remaining 92%. In that case, the sampling ratio would be 4% + 4% = 8%.
 * In A/B tests and other comparative experiments, the control group is that part of the sample (in the above sense, i.e. the part of the population for which data is being collected) that continues to receive the old/default design (or no treatment), and the test group is the part of the sample that receives the new design (or the treatment). E.g. test is "A", control is "B". In the example above, the control group consists of 4% of the population (not 92% or 96%).
 * The test and control groups may also be regarded as "buckets", alongside the remainder of the population outside the sample. I.e. in the above example, we would have three buckets of sizes 4%, 4% and 92%. A/B tests as used by the team always collect data for two buckets of equal size (as of 2018 we are not employing more complicated methods such as multivariate tests or multi-armed bandit experiments). We sometimes denote this common size of the test and control group (4% in the example) as "bucket size" or "group size". This is referred to in variables such as.
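
The three-bucket example above can be illustrated with a minimal sketch. It is not the team's actual bucketing code: the names `assign_bucket`, `sampling_ratio` and `BUCKET_SIZE`, and the choice of hashing a session token with SHA-256, are assumptions made purely for illustration. The sketch only shows how a single "bucket size" of 4% yields test, control and unsampled buckets of 4%, 4% and 92%, and a sampling ratio of 8%.

```python
import hashlib

# Hypothetical constant: the common size of the test and control groups (4%).
BUCKET_SIZE = 0.04


def assign_bucket(session_id: str, bucket_size: float = BUCKET_SIZE) -> str:
    """Deterministically map a session token to 'test', 'control' or 'unsampled'.

    Illustrative only: hashing the token gives a stable, roughly uniform
    position in the unit interval, so the same session always lands in
    the same bucket.
    """
    if not 0 <= bucket_size <= 0.5:
        raise ValueError("bucket_size must be between 0 and 0.5")
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and scale it to the unit interval.
    position = int.from_bytes(digest[:8], "big") / 2**64
    if position < bucket_size:
        return "test"       # receives the new design
    if position < 2 * bucket_size:
        return "control"    # keeps the old/default design, data is collected
    return "unsampled"      # outside the sample, no data collected


def sampling_ratio(bucket_size: float = BUCKET_SIZE) -> float:
    """Share of the population for which data is collected (test + control)."""
    return 2 * bucket_size


if __name__ == "__main__":
    counts = {"test": 0, "control": 0, "unsampled": 0}
    for i in range(100_000):
        counts[assign_bucket(f"session-{i}")] += 1
    print(counts)
    print(f"sampling ratio: {sampling_ratio():.0%}")
```

Running the script over 100,000 made-up session tokens should put roughly 4% of them in each of the test and control buckets. Note that the control group is the 4% bucket, not the 92% remainder.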