Introductory Online Controlled Experiments (2016/04/08)

Introductory Online Controlled Experiments

Bowen Li, Staff Data Scientist @Vpon

2016/04/08


Outline

1. Introduction
2. Online Experiment Procedure
3. Experiment Designs
4. Experiment Analytics
5. Further on Online Experiments
6. Discussions




Introduction

Applications
- Validate segments for advertising
- Enhance algorithms: search, ads, personalization, recommendation
- Change apps, UI, content management systems
- Among many others

Motivations
- Scientifically verify the hypothesis: if a specific change is introduced, will it improve key metrics?
- Establish a causal relationship, unlike data mining techniques, which find correlation patterns



Why Online Experiment?

Intuition for assessing the value of an idea is not reliable
- Most ideas fail to improve key metrics:
  - Google: only about 10% of experiments led to business changes
  - Netflix: considers 90% of what they try to be wrong
- Even small gains matter once aggregated across millions of users and events

Getting trustworthy results is hard
- Shared pitfalls and puzzling results: Kohavi et al. (2010, 2012); Kohavi & Longbotham (2010)



Experiment Basics

- Factor: a controlled variable thought to influence the metric
  - Test factor: its effect is of interest
  - Non-test factor: its effect is of no interest
- A/B test: a single factor with two levels
  - A vs. B
  - Control vs. Treatment
  - Existing vs. New
- A/B/n test: one factor with more than two levels
- Multivariable test: more than one factor
- Variant: a version under test, e.g. an A/B test has 2 experimental variants
- Randomization unit: the unit assigned at random, chosen so that units can be assumed independent





Online Experiment Procedure (1/2)


Online Experiment Procedure (2/2)

A/B test procedure
1. Define
   - Overall Evaluation Criterion (OEC): used to make the decision
   - Metrics of interest: used to find insights
2. Sample size calculation
3. Random assignment to Treatment & Control
4. Log data collection
5. Online monitoring
6. Experiment analytics
7. Decision making based on the OEC


Online Experiment Procedure: Details (1/5)

Step 1.1: Define the OEC for decision making
- Single metric: incorporates the tradeoff between metrics
  - Frequently, an experiment will improve one metric but hurt another
- Must be decided in advance
  - Otherwise it induces familywise Type-I error (later)
- Guideline:
  - Bad OEC: short-term profit that ignores the long term
  - Good OEC: drivers of lifetime value, e.g. sessions per user, repeat visits, conversion rates

Step 1.2: Define metrics for finding insights
- Compute many metrics
- Must control the False Discovery Rate (later)
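The slides do not say how the False Discovery Rate is controlled. One widely used option is the Benjamini-Hochberg procedure; the sketch below assumes one p-value per insight metric, and the function name, the example p-values, and the target rate `q` are illustrative only.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the metric is declared a discovery."""
    m = len(p_values)
    # Sort metric indices by ascending p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= (k / m) * q.
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            largest_k = rank
    # Declare the largest_k smallest p-values as discoveries.
    rejected = [False] * m
    for idx in order[:largest_k]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values for eight insight metrics.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
discoveries = benjamini_hochberg(p_values, q=0.05)
```

Testing each metric at 5% without correction would flag five of these eight metrics; the procedure keeps only the two strongest.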






Step 3: Calculate sample size
- Sample size is based on a 50%/50% Treatment/Control split
  - For maximum testing power
- Determines how long the experiment runs
  - See later for details



Step 4: Randomly assign users to Treatment or Control
- George Box: "Block what you can control and randomize what you cannot"
  - Blocking (later): if some non-test factors can be controlled
  - Randomization (later): if these non-test factors cannot be controlled
- In a consistent manner: a user gets the same experience on repeated visits

Step 5: Collect log data
- Collect logs for online monitoring and experiment analytics
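A common way to get random yet consistent assignment in Step 4 is to hash the user ID together with an experiment ID. This is a minimal sketch, not the platform's actual method; the function name, the MD5 choice, and the basis-point parameter are assumptions for illustration.

```python
import hashlib


def assign_variant(user_id: str, experiment_id: str, treatment_bp: int = 5000) -> str:
    """Deterministically map a user to a variant.

    treatment_bp is the Treatment share in basis points (5000 = 50%, 10 = 0.1%).
    """
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    # The hash spreads users roughly uniformly over 10,000 buckets.
    bucket = int(hashlib.md5(key).hexdigest(), 16) % 10000
    return "treatment" if bucket < treatment_bp else "control"


# The same user always lands in the same variant of a given experiment.
first_visit = assign_variant("user-123", "exp-checkout")
repeat_visit = assign_variant("user-123", "exp-checkout")
```

Including the experiment ID in the key keeps assignments in different experiments independent of each other, and raising `treatment_bp` only moves users from Control to Treatment, never back.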






Step 6: Online monitoring
- Treatment ramp-up
  - Initiate Treatment with a 0.1%/99.9% split
  - Ramp up from 0.1% to 0.5%, 2.5%, 10%, 50%
  - At each step (lasting hours), analyze data to catch egregious problems, which can be detected quickly on small samples
- Sample ratio mismatch (SRM) graph:
  - Monitor (1) # users, (2) OEC/metrics, etc., in each variant, over time
- Interactions between overlapping experiments (later)
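Besides plotting user counts over time, an SRM can be checked numerically. The sketch below assumes a chi-square goodness-of-fit test with one degree of freedom on the two user counts; the function name and the example counts are illustrative, and the slides themselves only mention the graph.

```python
import math


def srm_p_value(n_control: int, n_treatment: int, expected_treatment_share: float = 0.5) -> float:
    """P-value for the observed split against the designed split."""
    total = n_control + n_treatment
    expected_t = total * expected_treatment_share
    expected_c = total - expected_t
    chi2 = ((n_treatment - expected_t) ** 2 / expected_t
            + (n_control - expected_c) ** 2 / expected_c)
    # Survival function of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


# Hypothetical counts from a 50%/50% experiment.
p_ok = srm_p_value(5000, 5050)
p_suspicious = srm_p_value(5000, 5300)
```

A very small p-value suggests the assignment or logging pipeline is broken, so the experiment results should not be trusted until the cause is found.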



Step 7: Experiment analytics
- Compare Treatment's and Control's OEC distributions
  - Hypothesis testing for the experiment effect
  - Estimation of the experiment effect
- The experiment unit and the analysis unit may differ; for example
  - Experiment unit: User
  - Analysis unit: User-Session
  - Apply the bootstrapping technique, among others (later)
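When users are randomized but the metric is computed per session, sessions of the same user are not independent. A bootstrap that resamples whole users respects this. The sketch below is one possible setup: a per-session mean metric, a percentile interval, and synthetic data, all of which are assumptions.

```python
import random


def sessions_mean(users):
    """Mean metric value per session, pooled over a list of users."""
    total = sum(sum(sessions) for sessions in users)
    count = sum(len(sessions) for sessions in users)
    return total / count


def bootstrap_diff_ci(control_users, treatment_users, n_boot=1000, alpha=0.05, seed=0):
    """Percentile CI for Treatment minus Control, resampling users (not sessions)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        c = rng.choices(control_users, k=len(control_users))
        t = rng.choices(treatment_users, k=len(treatment_users))
        diffs.append(sessions_mean(t) - sessions_mean(c))
    diffs.sort()
    return diffs[int(n_boot * alpha / 2)], diffs[int(n_boot * (1 - alpha / 2)) - 1]


def make_users(rng, n_users, mean_value):
    """Synthetic data: each user has 1 to 5 sessions with a noisy metric value."""
    return [[rng.gauss(mean_value, 0.5) for _ in range(rng.randint(1, 5))]
            for _ in range(n_users)]


data_rng = random.Random(1)
control_users = make_users(data_rng, 200, 1.0)
treatment_users = make_users(data_rng, 200, 1.5)
lo, hi = bootstrap_diff_ci(control_users, treatment_users)
```

Resampling sessions directly would treat them as independent and typically understate the variance.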








Statistics for Experiments (1/2)

Hypotheses for testing
- Null hypothesis $H_0$: Treatment and Control do not differ
  - Any observed differences are due to random fluctuations
- Alternative hypothesis $H_1$: Treatment is different from, or better than, Control

Testing the null hypothesis: $H_0: O_B = O_A$
- $O_X$: OEC for Treatment ($X = B$) and Control ($X = A$)
- $\hat{O}_X$: estimated OEC






Hypothesis testing basics
- Type-I error: $\Pr(H_1 \mid H_0) = \alpha$
  - Probability of rejecting $H_0$ when $H_0$ is true (common: 5%)
- Type-II error: $\Pr(H_0 \mid H_1) = \beta$
  - Probability of not rejecting $H_0$ when $H_0$ is false
- Confidence level: $\Pr(H_0 \mid H_0) = 1 - \alpha$
  - Probability of not rejecting $H_0$ when $H_0$ is true (common: 95%)
- Power: $\Pr(H_1 \mid H_1) = 1 - \beta$
  - Probability of rejecting $H_0$ when $H_0$ is false (common: 80-95%)

| Decision \ Condition | $H_0$ is true ($H_0$) | $H_0$ is false ($H_1$) |
|----------------------|-----------------------|------------------------|
| Reject $H_0$ ($H_1$) | Type-I error | Power |
| Not reject $H_0$ ($H_0$) | Confidence level | Type-II error |


Sample Size Calculation

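Hypothesis testing:
- $H_0: O_B = O_A$, with desired confidence level $1 - \alpha$
- $H_1: O_B - O_A = \Delta$, with desired power $1 - \beta$

Minimum sample size:

$$0 + z_{1-\alpha/2}\,\sigma\sqrt{\frac{2}{n}} = \Delta - z_{1-\beta}\,\sigma\sqrt{\frac{2}{n}} \;\Rightarrow\; n = \frac{\sigma^2}{\Delta^2}\,2\,(z_{1-\alpha/2} + z_{1-\beta})^2$$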
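The sample size formula above can be evaluated directly. In this sketch $n$ is read as the number of units per variant, which is what the $\sigma\sqrt{2/n}$ term implies; the function name and the example inputs are illustrative.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(sigma: float, delta: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Minimum n per variant: n = 2 * (sigma / delta)^2 * (z_{1-alpha/2} + z_{1-beta})^2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)  # power = 1 - beta
    return math.ceil(2 * (sigma / delta) ** 2 * (z_alpha + z_beta) ** 2)


# Hypothetical inputs: detect a shift of 0.1 standard deviations.
n = sample_size_per_variant(sigma=1.0, delta=0.1)
```

With $\alpha = 5\%$ and 80% power, the multiplier $2(z_{1-\alpha/2} + z_{1-\beta})^2$ is about 15.7, which is where the common "$16\sigma^2/\Delta^2$" rule of thumb comes from. Halving $\Delta$ roughly quadruples the required sample.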





Experiment Effect Testing & Estimation (1/2)

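Absolute effect: $O_B - O_A$

95% confidence interval (CI) for $O_B - O_A$:

$$\hat{O}_B - \hat{O}_A \pm 1.96\,\hat{\sigma}_d$$

- $\hat{\sigma}_d$: estimated standard deviation of $\hat{O}_B - \hat{O}_A$
- See Appendix for derivations

Hypothesis testing for $O_B - O_A$: based on the CI (reject $H_0$ if the CI excludes 0)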
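A small sketch of the absolute-effect interval, assuming the OEC is a mean of independent per-user values so that $\hat{\sigma}_d^2$ is the sum of the two squared standard errors. The function name and the toy data are illustrative.

```python
import math
from statistics import mean, variance


def absolute_effect_ci(control, treatment, z=1.96):
    """CI for O_B - O_A when each OEC is a mean of independent observations."""
    diff = mean(treatment) - mean(control)
    # Standard deviation of the difference of two independent sample means.
    sigma_d = math.sqrt(variance(treatment) / len(treatment)
                        + variance(control) / len(control))
    return diff - z * sigma_d, diff + z * sigma_d


# Toy per-user values for Control (A) and Treatment (B).
control = [1, 2, 3, 4, 5]
treatment = [2, 3, 4, 5, 6]
lower, upper = absolute_effect_ci(control, treatment)
# H0 is rejected only when the interval excludes zero.
reject_h0 = not (lower <= 0 <= upper)
```

Here the estimated difference is 1 with $\hat{\sigma}_d = 1$, so the interval contains 0 and $H_0$ is retained.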


Experiment Effect Testing & Estimation (2/2)

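Percent effect: $\left(\frac{O_B - O_A}{O_A} \cdot 100\%\right)$

95% confidence interval (CI):

$$\left(\frac{\hat{O}_B - \hat{O}_A}{\hat{O}_A} + 1\right) \cdot \frac{1 \pm 1.96\sqrt{\widehat{CV}_A^2 + \widehat{CV}_B^2 - 1.96^2\,\widehat{CV}_A^2\,\widehat{CV}_B^2}}{1 - 1.96^2\,\widehat{CV}_A^2} - 1$$

- $\widehat{CV}_B = \hat{\sigma}_B / \hat{O}_B$: estimated coefficient of variation (CV)
- $\hat{\sigma}_B$: estimated standard deviation of $\hat{O}_B$
- See Appendix for derivations

Hypothesis testing: based on the CI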
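The percent-effect interval can be computed as follows. This sketch assumes each OEC is a positive mean of independent per-user values, so that $\hat{\sigma}_X$ is the standard error of that mean; the function name and the toy data are illustrative.

```python
import math
from statistics import mean, variance


def percent_effect_ci(control, treatment, z=1.96):
    """Fieller-type CI for (O_B - O_A) / O_A, returned as fractions (0.10 = 10%)."""
    o_a, o_b = mean(control), mean(treatment)
    # Squared coefficients of variation of the estimated OECs.
    cv2_a = variance(control) / len(control) / o_a ** 2
    cv2_b = variance(treatment) / len(treatment) / o_b ** 2
    denom = 1 - z ** 2 * cv2_a
    if denom <= 0:
        raise ValueError("Control estimate is too noisy for a bounded interval")
    half = z * math.sqrt(cv2_a + cv2_b - z ** 2 * cv2_a * cv2_b)
    ratio = o_b / o_a
    return ratio * (1 - half) / denom - 1, ratio * (1 + half) / denom - 1


# Toy per-user values for Control (A) and Treatment (B).
control = [10, 12, 11, 9, 13, 10, 11, 12]
treatment = [12, 13, 14, 11, 15, 12, 13, 14]
lower, upper = percent_effect_ci(control, treatment)
```

The interval is not symmetric around the point estimate $\hat{O}_B/\hat{O}_A - 1$, because the uncertainty in the denominator $\hat{O}_A$ is taken into account.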


Further Experiment Analytics

To reduce variance and thereby increase power
- Increase sample size: will increase experiment length
- Adjust analysis units by features: may shorten experiment length
  - Pre-experiment user metrics
  - User demographics: gender, age, location
  - User behavior analytics: device, app
  - Among many others
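The slides do not name a specific adjustment method. One simple option is linear adjustment by a pre-experiment metric: subtract the part of each user's metric that the pre-experiment value predicts. The sketch below assumes a single numeric covariate; the function name and the toy data are illustrative.

```python
from statistics import mean, variance


def covariate_adjust(metric, covariate):
    """Return metric values adjusted by a pre-experiment covariate.

    The adjusted values keep the same mean but have lower variance
    whenever the covariate is correlated with the metric.
    """
    x_bar, y_bar = mean(covariate), mean(metric)
    cov_xy = sum((x - x_bar) * (y - y_bar)
                 for x, y in zip(covariate, metric)) / (len(metric) - 1)
    theta = cov_xy / variance(covariate)
    return [y - theta * (x - x_bar) for x, y in zip(covariate, metric)]


# Toy data: the same users' metric before and during the experiment.
pre_experiment = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
in_experiment = [1.2, 1.9, 3.1, 4.2, 4.8, 6.1]
adjusted = covariate_adjust(in_experiment, pre_experiment)
```

In practice `theta` and the covariate mean would be computed once on the pooled Treatment and Control data, so the same adjustment is applied to both variants. The covariate must be measured before the experiment so that the Treatment cannot affect it.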





Validation of Experiments

A/A test (null test):
- Tests the experimental and randomization setups
- Assign users to variant groups, but expose them to the same experience
- If the system is working properly, $H_0$ should be retained, i.e. rejected only about 5% of the time (at $\alpha = 5\%$)
- Other application: software migration
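The "about 5%" claim can be checked by simulation. This sketch runs many synthetic A/A tests in which both groups are drawn from the same distribution, and counts how often a two-sided test at $\alpha = 5\%$ rejects $H_0$. The distribution, sample sizes, and seed are arbitrary assumptions.

```python
import math
import random
from statistics import mean, variance


def aa_rejection_rate(n_tests=1000, n_users=100, z=1.96, seed=42):
    """Fraction of simulated A/A tests in which H0 is (falsely) rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_tests):
        # Both groups get the same experience: identical distributions.
        a = [rng.gauss(10.0, 2.0) for _ in range(n_users)]
        b = [rng.gauss(10.0, 2.0) for _ in range(n_users)]
        sigma_d = math.sqrt(variance(a) / n_users + variance(b) / n_users)
        if abs(mean(b) - mean(a)) > z * sigma_d:
            rejections += 1
    return rejections / n_tests


rate = aa_rejection_rate()
```

On a real platform the two groups would come from the actual assignment and logging pipeline; a rejection rate far from 5% points to a problem in the setup rather than in the feature.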


Limitations of Experiments (1/2)

- Quantitative metrics, but no explanations: possible to know which is better and by how much, but not why
- Long-term effects: online tests are typically run for short periods, e.g. a few days or weeks
  - Find good OEC metrics that predict long-term effects
  - Run experiments longer: hard in practice due to survivorship bias in online cohorts, since many cookies churn, especially in anonymous settings
- Primacy effect & newness effect: run the experiment longer, or compute the OEC only for new users
  - Primacy effect: experienced users may be less efficient until they get used to the Treatment
  - Newness effect: when the Treatment is introduced, some users click everywhere


Limitations of Experiments (2/2)

- Feature must be implemented: in early stages, use paper prototyping for quick feedback and refinement
- Consistency: users need a consistent experience
- Overlapping experiments:
  - Previous experience: strong interactions are rare in practice(?)
  - Initially, avoid tests that could interact
  - Perform pairwise tests: flag interactions automatically
- Launch events: all users need to see the launch, so an experiment cannot be run


Other Practical Concerns (1/2)

Triggering
- Example: a change to the checkout page, which only 10% of users reach
- Analyze only users who were exposed to the variants (checkout pages)
  - Reduces the variance of treatment effect estimates

Automatic optimization
- Run experiments to optimize areas amenable to automated search
  - Once an organization has a clear OEC
- Multi-Armed Bandit Algorithm / Hoeffding Races (later)
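The slides do not say which bandit algorithm is meant. As one illustration, the sketch below uses Thompson sampling with Beta posteriors on a binary OEC such as conversion. The conversion rates, round count, and seed are made up.

```python
import random


def thompson_sampling(true_rates, n_rounds=5000, seed=7):
    """Simulate Thompson sampling; return how often each variant was shown."""
    rng = random.Random(seed)
    successes = [0] * len(true_rates)
    failures = [0] * len(true_rates)
    for _ in range(n_rounds):
        # Draw a plausible conversion rate per variant from its Beta posterior.
        samples = [rng.betavariate(s + 1, f + 1)
                   for s, f in zip(successes, failures)]
        arm = samples.index(max(samples))
        # Show the chosen variant and observe a simulated conversion.
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return [s + f for s, f in zip(successes, failures)]


# Hypothetical conversion rates of three variants.
pulls = thompson_sampling([0.05, 0.10, 0.20])
```

Unlike a fixed 50%/50% split, traffic shifts toward the better variant as evidence accumulates, which is why a clear OEC is a prerequisite.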






Robot removal
- Their activity can severely bias results
  - Call Treatment assignment from JavaScript (client-side), not server-side
  - Exclude unidentified requests, which removes robots that reject cookies
  - Exclude robots that do not delete cookies and have many actions
- Robot removal approach:
  - List of known robots
  - Heuristics (Kohavi & Parekh, 2003)
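A toy filter combining the two approaches above. The marker list, the action threshold, and the record fields are hypothetical placeholders; they are not the heuristics from Kohavi & Parekh (2003).

```python
# Hypothetical substrings that identify known robots in a user-agent string.
KNOWN_ROBOT_MARKERS = ("bot", "crawler", "spider")
# Hypothetical heuristic: more actions per day than a human plausibly makes.
MAX_ACTIONS_PER_DAY = 1000


def is_probable_robot(user_agent: str, actions_per_day: int) -> bool:
    """Flag a user as a robot via the known-robot list or the activity heuristic."""
    ua = user_agent.lower()
    if any(marker in ua for marker in KNOWN_ROBOT_MARKERS):
        return True
    return actions_per_day > MAX_ACTIONS_PER_DAY


def remove_robots(users):
    """Keep only the users that are not flagged as robots."""
    return [u for u in users
            if not is_probable_robot(u["user_agent"], u["actions_per_day"])]


users = [
    {"id": "u1", "user_agent": "Mozilla/5.0 (iPhone)", "actions_per_day": 40},
    {"id": "u2", "user_agent": "ExampleBot/2.1", "actions_per_day": 12},
    {"id": "u3", "user_agent": "Mozilla/5.0 (Windows NT 10.0)", "actions_per_day": 25000},
]
humans = remove_robots(users)
```

The same filter should be applied to both Treatment and Control before computing the OEC, so that it cannot create a difference between the variants by itself.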





Discussions

- Online experiments are extremely important for building data products across various applications
- For fast iteration, we will build an online experiment platform with
  - Random assignment to Treatment or Control
  - Online monitoring for ramp-up, SRM, and interactions
  - Experiment analytics with data query, ETL, and statistical inference
- Next: Segment Validation SOP as the first application


References

- Box et al. (2005). Statistics for Experimenters: Design, Innovation, and Discovery
- Kohavi & Longbotham (Encyclopedia of MLDM, 2015). Online controlled experiments and A/B tests
- Kohavi et al. (DMKD, 2009). Controlled experiments on the web: survey and practical guide
- van Belle (2002). Statistical Rules of Thumb
- Willan & Briggs (2006). Statistical Analysis of Cost-Effectiveness Data


Thank you for listening!


Appendix: Derivations of CI for Absolute Effect

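- Under $H_0: O_B = O_A$, $E(\hat{O}_B - \hat{O}_A) = 0$
- $\mathrm{Var}(\hat{O}_B - \hat{O}_A)$ can be estimated by $\hat{\sigma}_d^2$
- As the sample size is large, by the Central Limit Theorem

$$\frac{\hat{O}_B - \hat{O}_A}{\hat{\sigma}_d} \xrightarrow{d} N(0, 1)$$

- Thus

$$\Pr\left(\left|\frac{\hat{O}_B - \hat{O}_A}{\hat{\sigma}_d}\right| \le 1.96\right) = 95\%$$

- CI for absolute effect:

$$\hat{O}_B - \hat{O}_A \pm 1.96\,\hat{\sigma}_d$$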


Appendix: Derivations of CI for Percent Effect (1/2)

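Fieller (1954):
- Define $R = \frac{O_B}{O_A}$
- Obtain the CI for $R$ based on $\hat{O}_B - R\,\hat{O}_A$
- Apply the Central Limit Theorem

$$\hat{O}_B - R\,\hat{O}_A \xrightarrow{d} N\left(0, \mathrm{Var}[\hat{O}_B - R\,\hat{O}_A]\right)$$

- $\mathrm{Var}[\hat{O}_B - R\,\hat{O}_A] = \hat{\sigma}_B^2 + R^2\hat{\sigma}_A^2$ (since $\mathrm{Cov}(\hat{O}_B, \hat{O}_A) = 0$)
- Thus

$$\frac{\hat{O}_B - R\,\hat{O}_A}{\sqrt{\hat{\sigma}_B^2 + R^2\hat{\sigma}_A^2}} \xrightarrow{d} N(0, 1)$$

$$\Pr\left(\left|\frac{\hat{O}_B - R\,\hat{O}_A}{\sqrt{\hat{\sigma}_B^2 + R^2\hat{\sigma}_A^2}}\right| \le 1.96\right) = 95\%$$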


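Appendix: Derivations of CI for Percent Effect (2/2)

CI for $R$: obtained by solving the quadratic equation in $R$

$$\left(\frac{\hat{O}_B - R\,\hat{O}_A}{\sqrt{\hat{\sigma}_B^2 + R^2\hat{\sigma}_A^2}}\right)^2 = 1.96^2$$

$$R = \frac{\hat{O}_B}{\hat{O}_A} \cdot \frac{1 \pm 1.96\sqrt{\widehat{CV}_A^2 + \widehat{CV}_B^2 - 1.96^2\,\widehat{CV}_A^2\,\widehat{CV}_B^2}}{1 - 1.96^2\,\widehat{CV}_A^2}$$

Note: $\frac{O_B - O_A}{O_A} = \frac{O_B}{O_A} - 1$

CI for percent effect:

$$\frac{\hat{O}_B}{\hat{O}_A} \cdot \frac{1 \pm 1.96\sqrt{\widehat{CV}_A^2 + \widehat{CV}_B^2 - 1.96^2\,\widehat{CV}_A^2\,\widehat{CV}_B^2}}{1 - 1.96^2\,\widehat{CV}_A^2} - 1$$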

