Introductory Online Controlled Experiments
Bowen Li, Staff Data Scientist @Vpon
2016/04/08
Bowen Li, Staff Data Scientist @Vpon Introductory Online Controlled Experiments 2016/04/08 1 / 35
Outline
1 Introduction
2 Online Experiment Procedure
3 Experiment Designs
4 Experiment Analytics
5 Further on Online Experiments
6 Discussions
Introduction
Applications
  Validate segments for advertising
  Enhance algorithms: search, ads, personalization, recommendation
  Change apps, UI, content management system
  Among many others
Motivations
  Verify the hypothesis scientifically:
    If a specific change is introduced, will it improve key metrics?
  Establish a causal relationship:
    Unlike data mining techniques, which find correlation patterns
Why Online Experiment?
Intuition for assessing idea value is not reliable
  Most ideas fail to improve key metrics:
    Google: Only about 10% of experiments led to business changes
    Netflix: Considers 90% of what they try to be wrong
  Even small gains are aggregated across millions of users & events
Getting trustworthy results is hard
  Shared pitfalls and puzzling results:
    Kohavi et al. (2010, 2012); Kohavi & Longbotham (2010)
Experiment Basics
Factor: Controlled variable thought to influence a metric
  Test factor: Its effect is of interest
  Non-test factors: Their effects are of no interest
A/B test: Single factor with two levels
  A vs. B
  Control vs. Treatment
  Existing vs. New
A/B/n test: One factor with more than two levels
Multivariable test: More than one factor
Variant: E.g. an A/B test has 2 experimental variants
Randomization unit: Based on the independence assumption
Online Experiment Procedure (1/2)
Online Experiment Procedure (2/2)
A/B test procedure
1 Define
    Overall Evaluation Criterion (OEC): Make decision
    Metrics of interest: Find insights
2 Sample size calculation
3 Random assignment to Treatment & Control
4 Log data collection
5 Online monitor
6 Experiment analytics
7 Decision making based on OEC
Online Experiment Procedure: Details (1/5)
Step 1.1: Define OEC for decision making
  Single metric: Incorporate tradeoff between metrics
    Frequently, an experiment will improve one metric but hurt another
  Must be decided in advance
    Otherwise induces Familywise Type-I Error (later)
  Guideline:
    Bad OEC: Short-term profit, but not long-term
    Good OEC: Drivers of lifetime value
    E.g. sessions per user, repeat visits, conversion rates, etc.
Step 1.2: Define metrics for finding insights
  Compute many metrics
  Must control False Discovery Rate (later)
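The slides do not say which procedure is used to control the False Discovery Rate. As one common option, the Benjamini-Hochberg step-up procedure can be sketched as follows; the function name `benjamini_hochberg` and the 5% default are illustrative assumptions, not part of the original deck.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Flag discoveries among many metrics (hypothetical helper).

    Returns a list of booleans aligned with p_values: True means the
    metric is declared significant while keeping the expected share of
    false discoveries at or below `fdr` (assuming independent tests).
    """
    m = len(p_values)
    # Rank the p-values from smallest to largest, remembering positions.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest 1-based rank k such that p_(k) <= (k / m) * fdr.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected
```

With one p-value per insight metric, only the metrics flagged `True` would be reported as findings; the OEC itself is still judged on its own pre-declared test.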
Online Experiment Procedure: Details (2/5)
Step 2: Calculate sample size
  Sample size based on a 50%/50% split of Treatment/Control
    For maximum testing power
  How long the experiment runs
    See later for details
Online Experiment Procedure: Details (3/5)
Step 3: Randomly assign users to Treatment or Control
  George Box: "Block what you can control and randomize what you cannot"
    Blocking: (later)
      If some non-test factors can be controlled
    Randomization: (later)
      If these non-test factors cannot be controlled
  In a consistent manner:
    Same experience across a user’s repeated visits
Step 4: Collect log data
  Collect logs for online monitor & experiment analytics
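One common way to get random yet consistent assignment is to hash the user identifier, so that a returning user always lands in the same variant without any stored state. The sketch below is a minimal illustration under that assumption; `assign_variant`, the `experiment_id` salt, and the SHA-256 choice are hypothetical and not taken from the deck.

```python
import hashlib


def assign_variant(user_id, experiment_id, treatment_share=0.5):
    """Deterministically map a user to 'treatment' or 'control'.

    The experiment id is mixed into the hash key so that different
    experiments split the same users independently (an assumption).
    """
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Use the first 32 bits of the digest as a bucket in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "treatment" if bucket < treatment_share else "control"
```

Because the mapping depends only on the two identifiers, repeated visits by the same user give the same experience, and changing `treatment_share` during ramp-up only moves users from Control into Treatment, never the reverse.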
Online Experiment Procedure: Details (4/5)
Step 5: Online monitor
  Treatment ramp-up
    Initiate Treatment with a 0.1%/99.9% split
    Ramp up from 0.1% to 0.5%, 2.5%, 10%, 50%
    At each step (for hours), analyze data to prevent egregious problems
      These can be detected quickly on small samples
  Sample ratio mismatch (SRM) graph:
    Monitor (1) # users, (2) OEC/metrics, etc., in each variant, over time
  Interactions between overlapping experiments (later)
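A sample ratio mismatch is usually checked with a chi-square goodness-of-fit test of the observed user counts against the configured split. The sketch below assumes two variants and uses only the standard library; the function name `srm_p_value` and any alert threshold are illustrative choices, not something the slides specify.

```python
import math


def srm_p_value(n_control, n_treatment, expected_treatment_share=0.5):
    """p-value of a chi-square test (1 degree of freedom) for SRM.

    Assumes 0 < expected_treatment_share < 1 and at least one user.
    A very small p-value suggests the assignment or logging is broken.
    """
    total = n_control + n_treatment
    expected_t = total * expected_treatment_share
    expected_c = total - expected_t
    chi2 = ((n_treatment - expected_t) ** 2 / expected_t
            + (n_control - expected_c) ** 2 / expected_c)
    # Survival function of chi-square with 1 df: P(Z^2 > chi2).
    return math.erfc(math.sqrt(chi2 / 2.0))
```

In a monitor, this could be evaluated at each ramp-up step with the split in force at that step, flagging the experiment when the p-value falls below a strict threshold chosen in advance.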
Online Experiment Procedure: Details (5/5)
Step 6: Experiment analytics
  Compare Treatment’s & Control’s OEC distributions
    Hypothesis testing for experiment effect
    Estimation for experiment effect
      See later for details
  Experiment and analysis units may be defined differently; for example
    Experiment unit: User
    Analysis unit: User-Session
  Apply Bootstrapping Technique, among others (later)
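When the analysis unit (session) is finer than the experiment unit (user), sessions of the same user are not independent, so one reasonable approach is to resample whole users. The sketch below is a minimal percentile bootstrap under that assumption; the data layout (a dict from user id to a non-empty list of session values) and the name `bootstrap_diff_ci` are hypothetical.

```python
import random


def bootstrap_diff_ci(control, treatment, n_boot=2000, seed=0):
    """95% percentile CI for the difference in session-level means.

    control / treatment: dict mapping user id -> list of per-session
    values (each list assumed non-empty). Users, not sessions, are
    resampled, matching the randomization unit.
    """
    rng = random.Random(seed)

    def session_mean(user_sessions):
        total = sum(sum(s) for s in user_sessions)
        count = sum(len(s) for s in user_sessions)
        return total / count

    c_users = list(control.values())
    t_users = list(treatment.values())
    diffs = []
    for _ in range(n_boot):
        # Draw users with replacement, keeping all of their sessions.
        c_sample = rng.choices(c_users, k=len(c_users))
        t_sample = rng.choices(t_users, k=len(t_users))
        diffs.append(session_mean(t_sample) - session_mean(c_sample))
    diffs.sort()
    lower = diffs[int(0.025 * n_boot)]
    upper = diffs[int(0.975 * n_boot) - 1]
    return lower, upper
```

If the resulting interval excludes zero, the session-level effect would be judged significant at roughly the 5% level.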
Statistics for Experiments (1/2)
Hypotheses for testing
  Null hypothesis: $H_0$
    Treatment and Control do not differ
      Any observed differences are due to random fluctuations
  Alternative hypothesis: $H_1$
    Treatment is different from or better than Control
Testing null hypothesis: $H_0: O_B = O_A$
  $O_X$: OEC for Treatment & Control, for $X = B$ & $A$ respectively
  $\hat{O}_X$: Estimated OEC
Statistics for Experiments (2/2)
Hypothesis testing basics
  Type-I error: $\Pr(H_1 \mid H_0) = \alpha$
    Probability of rejecting $H_0$ when $H_0$ is true (common: 5%)
  Type-II error: $\Pr(H_0 \mid H_1) = \beta$
    Probability of not rejecting $H_0$ when $H_0$ is false
  Confidence level: $\Pr(H_0 \mid H_0) = 1 - \alpha$
    Probability of not rejecting $H_0$ when $H_0$ is true (common: 95%)
  Power: $\Pr(H_1 \mid H_1) = 1 - \beta$
    Probability of rejecting $H_0$ when $H_0$ is false (common: 80-95%)

Decision / Condition        $H_0$ is true ($H_0$)    $H_0$ is false ($H_1$)
Reject $H_0$ ($H_1$)        Type-I error             Power
Not reject $H_0$ ($H_0$)    Confidence level         Type-II error
Sample Size Calculation
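Hypothesis testing:
  $H_0: O_B = O_A$, with desired confidence level $1 - \alpha$
  $H_1: O_B - O_A = \Delta$, with desired power $1 - \beta$
Minimum sample size:
$$0 + z_{1-\alpha/2}\,\sigma\sqrt{\frac{2}{n}} = \Delta - z_{1-\beta}\,\sigma\sqrt{\frac{2}{n}} \;\Rightarrow\; n = \frac{\sigma^2}{\Delta^2}\,2\,(z_{1-\alpha/2} + z_{1-\beta})^2$$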
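The minimum sample size formula above can be evaluated directly. The sketch below assumes that $n$ is the number of users per variant, that $\sigma$ is the standard deviation of the per-user OEC and is the same in both variants, and that the test is two-sided; the function name `required_sample_size` is illustrative.

```python
import math
from statistics import NormalDist


def required_sample_size(sigma, delta, alpha=0.05, power=0.80):
    """Users needed in EACH variant to detect an absolute effect delta.

    Implements n = 2 * (sigma / delta)^2 * (z_{1-alpha/2} + z_{1-beta})^2
    with power = 1 - beta, rounded up to a whole user.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    n = 2 * (sigma / delta) ** 2 * (z_alpha + z_beta) ** 2
    return math.ceil(n)
```

For example, with `sigma=1.0` and `delta=0.1` at the defaults, the multiplier `2 * (1.96 + 0.84)^2` is close to 16, which gives the familiar rule of thumb of about `16 * sigma^2 / delta^2` users per variant. Dividing the total by daily eligible traffic gives a rough experiment length.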
Experiment Effect Testing & Estimation (1/2)
Absolute effect: $O_B - O_A$
  95% confidence interval (CI) for $O_B - O_A$:
$$\hat{O}_B - \hat{O}_A \pm 1.96\,\hat{\sigma}_d$$
    $\hat{\sigma}_d$: Estimated standard deviation of $\hat{O}_B - \hat{O}_A$
    See Appendix for derivations
  Hypothesis testing for $O_B - O_A$: Based on CI
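As a minimal sketch of the interval above, assume the OEC is the mean of independent per-user values, so that $\hat{\sigma}_d^2$ can be estimated as the sum of the two squared standard errors. That estimator and the name `absolute_effect_ci` are assumptions for illustration; the slides do not state how $\hat{\sigma}_d$ is computed.

```python
import math
from statistics import mean, variance


def absolute_effect_ci(control, treatment, z=1.96):
    """CI for O_B - O_A from per-user values of Control (A) and Treatment (B)."""
    diff = mean(treatment) - mean(control)
    # Standard error of the difference of two independent sample means.
    sigma_d = math.sqrt(variance(treatment) / len(treatment)
                        + variance(control) / len(control))
    return diff - z * sigma_d, diff + z * sigma_d
```

The CI-based test then rejects $H_0$ at the 5% level exactly when the returned interval does not contain zero.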
Experiment Effect Testing & Estimation (2/2)
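Percent effect: $\frac{O_B - O_A}{O_A} \cdot 100\%$
  95% confidence interval (CI):
$$\left(\frac{\hat{O}_B - \hat{O}_A}{\hat{O}_A} + 1\right) \cdot \frac{1 \pm 1.96\sqrt{\widehat{CV}_A^2 + \widehat{CV}_B^2 - 1.96^2\,\widehat{CV}_A^2\,\widehat{CV}_B^2}}{1 - 1.96^2\,\widehat{CV}_A^2} - 1$$
    $\widehat{CV}_B = \hat{\sigma}_B / \hat{O}_B$: Estimated coefficient of variation (CV)
    $\hat{\sigma}_B$: Estimated standard deviation of $\hat{O}_B$
    See Appendix for derivations
  Hypothesis testing: Based on CI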
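The percent-effect interval above can be computed from the two estimated OECs and their standard deviations. The sketch below follows the formula term by term; it assumes both OEC estimates are positive and that $1 - 1.96^2\,\widehat{CV}_A^2 > 0$ (otherwise the interval is unbounded). The function name `percent_effect_ci` is hypothetical.

```python
import math


def percent_effect_ci(o_a, o_b, sigma_a, sigma_b, z=1.96):
    """Fieller-type CI for (O_B - O_A) / O_A, returned in percent.

    o_a, o_b: estimated OECs of Control and Treatment (assumed > 0).
    sigma_a, sigma_b: estimated standard deviations of those estimates.
    """
    cv_a = sigma_a / o_a
    cv_b = sigma_b / o_b
    denom = 1 - z ** 2 * cv_a ** 2
    if denom <= 0:
        raise ValueError("Control estimate too noisy: interval is unbounded")
    half_width = z * math.sqrt(cv_a ** 2 + cv_b ** 2
                               - z ** 2 * cv_a ** 2 * cv_b ** 2)
    ratio = o_b / o_a  # equals (o_b - o_a) / o_a + 1
    lower = ratio * (1 - half_width) / denom - 1
    upper = ratio * (1 + half_width) / denom - 1
    return 100 * lower, 100 * upper
```

As with the absolute effect, the percent effect would be called significant when the interval excludes 0%.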
Further Experiment Analytics
To reduce variance for increasing power
  Increase sample size: Will increase experiment length
  Adjust analysis units by features: May shorten experiment length
    Pre-experiment user metrics
    User demographics: gender, age, location
    User behavior analytics: device, App
    Among many others
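The slides do not name a specific adjustment method. One widely used option is a linear adjustment with a pre-experiment metric (often called CUPED in the online-experiment literature), sketched below under the assumption that the covariate was measured before exposure and is therefore unaffected by the Treatment. The name `adjust_with_pre_metric` is illustrative.

```python
from statistics import mean


def adjust_with_pre_metric(y, x):
    """Return y adjusted by a pre-experiment covariate x (same users, same order).

    The adjusted values keep the mean of y but have lower variance
    whenever x and y are correlated. x must not be constant.
    """
    mean_x, mean_y = mean(x), mean(y)
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    theta = cov_xy / var_x  # least-squares slope of y on x
    return [yi - theta * (xi - mean_x) for xi, yi in zip(x, y)]
```

In an experiment, the slope would typically be fitted on the pooled users of both variants, after which the adjusted values are compared between Treatment and Control as usual.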
Validation of Experiments
A/A test (null test):
  To test experimental & randomization setups
  Assign users to variant groups, but expose them to the same experience
  If the system is working properly, $H_0$ should be retained
    Rejected only about 5% of the time
  Other application: Software migration
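The 5% figure can be illustrated with a small simulation: run many A/A tests on data drawn from the same distribution and count how often the 95% CI-based test rejects. This is a sketch on synthetic normal data, not the author's validation procedure; `aa_rejection_rate` and its parameters are hypothetical.

```python
import math
import random
from statistics import mean, variance


def aa_rejection_rate(n_tests=1000, n_users=100, seed=42):
    """Share of simulated A/A tests in which H0 is (wrongly) rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_tests):
        # Both groups receive the same experience: identical distributions.
        a = [rng.gauss(0.0, 1.0) for _ in range(n_users)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n_users)]
        diff = mean(b) - mean(a)
        sigma_d = math.sqrt(variance(a) / n_users + variance(b) / n_users)
        if abs(diff) > 1.96 * sigma_d:
            rejections += 1
    return rejections / n_tests
```

On a real platform the same idea is applied to logged A/A experiments: a rejection rate far from 5% points to a problem in assignment, logging, or the variance estimate.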
Limitations of Experiments (1/2)
Quantitative metrics, but no explanations:
  Possible to know which is better and by how much, but not why
Long-term effects:
  Online tests are typically run for short periods, e.g. a few days/weeks
    Find good OEC metrics predicting long-term effects
    Run experiments longer: Hard in practice due to Survivorship Bias in online cohorts:
      Many cookies churn, especially in anonymous settings
Primacy effect & newness effect:
  Run the experiment longer or compute OEC only for new users
    Primacy effect:
      Experienced users may be less efficient until they get used to Treatment
    Newness effect:
      When Treatment is introduced, some users click everywhere
Limitations of Experiments (2/2)
Feature must be implemented:
  In early stages, use paper prototyping for quick feedback/refinements
Consistency:
  Need a consistent experience for users
Overlapping experiments:
  Previous experiences: Strong interactions are rare in practice(?)
  Initially, avoid tests that could interact
  Perform Pairwise Tests: Flag interactions automatically
Launch event:
  All users need to see it, so we cannot run an experiment
Other Practical Concerns (1/2)
Triggering
  Example: Change to checkout page, which only 10% of users reach
  Analyze only users who were exposed to the variants (checkout pages)
    Reduces variance of treatment effect estimates
Automatic optimization
  Run experiments to optimize areas amenable to automated search
    Once an organization has a clear OEC
  Multi-Armed Bandit Algorithm / Hoeffding Races (later)
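The deck does not specify which bandit algorithm is meant. As a minimal illustration of the idea, the sketch below uses epsilon-greedy, one of the simplest multi-armed bandit strategies, on simulated conversion rates; the function name, the `true_rates` input, and all parameter values are assumptions for the example, and Hoeffding Races are not covered here.

```python
import random


def epsilon_greedy(true_rates, n_rounds=5000, epsilon=0.1, seed=0):
    """Simulate epsilon-greedy traffic allocation across variants.

    true_rates: simulated conversion probability of each variant
    (unknown to the algorithm). Returns (pulls, successes) per variant.
    """
    rng = random.Random(seed)
    n_arms = len(true_rates)
    pulls = [0] * n_arms
    successes = [0] * n_arms
    for _ in range(n_rounds):
        if rng.random() < epsilon:
            # Explore: pick a variant uniformly at random.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pick the variant with the best observed rate so far.
            estimates = [s / p if p else 0.0 for s, p in zip(successes, pulls)]
            arm = estimates.index(max(estimates))
        reward = 1 if rng.random() < true_rates[arm] else 0
        pulls[arm] += 1
        successes[arm] += reward
    return pulls, successes
```

Unlike a fixed 50%/50% A/B test, the allocation shifts toward the better-performing variant while the experiment runs, which is why a single clear OEC is needed as the reward.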
Other Practical Concerns (2/2)
Robots removal
  Their activity can severely bias results
    Call Treatment assignment by JavaScript (client-side), not server-side
    Exclude robots that reject cookies with unidentified requests
    Exclude robots that do not delete cookies and have many actions
  Robots removal approach:
    List of known robots
    Heuristics (Kohavi & Parekh, 2003)
Discussions
Online experiments are extremely important for building data products of various applications
For fast iteration, we will build an online experiments platform with
  Random assignment to Treatment or Control
  Online monitor for ramp-up, SRM, and interactions
  Experiment analytics with data query, ETL and statistical inference
Next: Segment Validation SOP as the 1st application
References
Box et al. (2005). Statistics for Experimenters: Design, Innovation, and Discovery
Kohavi & Longbotham (Encyclopedia of MLDM, 2015). Online controlled experiments and A/B tests
Kohavi et al. (DMKD, 2009). Controlled experiments on the web: survey and practical guide
van Belle (2002). Statistical Rules of Thumb
Willan & Briggs (2006). Statistical Analysis of Cost-Effectiveness Data
Thank you for listening!
Appendix: Derivations of CI for Absolute Effect
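Under $H_0: O_B = O_A$,
  $E(\hat{O}_B - \hat{O}_A) = 0$
  $\mathrm{Var}(\hat{O}_B - \hat{O}_A)$ can be estimated by $\hat{\sigma}_d^2$
As sample size is large, by Central Limit Theorem
$$\frac{\hat{O}_B - \hat{O}_A}{\hat{\sigma}_d} \xrightarrow{d} N(0, 1)$$
Thus
$$\Pr\left(\left|\frac{\hat{O}_B - \hat{O}_A}{\hat{\sigma}_d}\right| \le 1.96\right) = 95\%$$
CI for absolute effect:
$$\hat{O}_B - \hat{O}_A \pm 1.96\,\hat{\sigma}_d$$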
Appendix: Derivations of CI for Percent Effect (1/2)
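Fieller (1954):
  Define $R = \frac{O_B}{O_A}$
  Obtain CI for $R$ based on $\hat{O}_B - R\hat{O}_A$
Apply Central Limit Theorem
$$\hat{O}_B - R\hat{O}_A \xrightarrow{d} N\left(0, \mathrm{Var}[\hat{O}_B - R\hat{O}_A]\right)$$
  $\mathrm{Var}[\hat{O}_B - R\hat{O}_A] = \hat{\sigma}_B^2 + R^2\hat{\sigma}_A^2$ (since $\mathrm{Cov}(\hat{O}_B, \hat{O}_A) = 0$)
Thus
$$\frac{\hat{O}_B - R\hat{O}_A}{\sqrt{\hat{\sigma}_B^2 + R^2\hat{\sigma}_A^2}} \xrightarrow{d} N(0, 1)$$
$$\Pr\left(\left|\frac{\hat{O}_B - R\hat{O}_A}{\sqrt{\hat{\sigma}_B^2 + R^2\hat{\sigma}_A^2}}\right| \le 1.96\right) = 95\%$$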
Appendix: Proof of CI for Percent Effect (2/2)
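CI for $R$: By solving the quadratic equation in $R$
$$\left(\frac{\hat{O}_B - R\hat{O}_A}{\sqrt{\hat{\sigma}_B^2 + R^2\hat{\sigma}_A^2}}\right)^2 = 1.96^2$$
$$R = \frac{\hat{O}_B}{\hat{O}_A} \cdot \frac{1 \pm 1.96\sqrt{\widehat{CV}_A^2 + \widehat{CV}_B^2 - 1.96^2\,\widehat{CV}_A^2\,\widehat{CV}_B^2}}{1 - 1.96^2\,\widehat{CV}_A^2}$$
Note: $\frac{\hat{O}_B - \hat{O}_A}{\hat{O}_A} = \frac{\hat{O}_B}{\hat{O}_A} - 1$
CI for percent effect:
$$\frac{\hat{O}_B}{\hat{O}_A} \cdot \frac{1 \pm 1.96\sqrt{\widehat{CV}_A^2 + \widehat{CV}_B^2 - 1.96^2\,\widehat{CV}_A^2\,\widehat{CV}_B^2}}{1 - 1.96^2\,\widehat{CV}_A^2} - 1$$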