Testing with Stats Engine: Why it’s harder to find winners with Optimizely
Pete Koomen, Co-founder & CTO, Optimizely
#opticon2015
TL;DL
1. Why we built Stats Engine
   - …to redesign statistics to match the way our customers actually use it
2. Best practices for testing with Stats Engine
   - Start with a clear hypothesis
   - Use expected ROI to decide when to stop your experiments
   - Set a primary goal
   - Avoid “random” goals and variations for maximum test velocity
The “t-test” (a.k.a. “NHST”, a.k.a. “Student’s t-test”)
The T-test in a nutshell
1. Run your experiment until you have reached the required sample size, and then stop.
2. Ask “What are the chances I’d have gotten these results in an A/A test?” (p-value)
3. If p-value < 5%, your results are significant.
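As a concrete illustration of this fixed-horizon procedure, here is a small Python sketch. The conversion data is simulated, and the 10% and 11% rates are made-up numbers, not anything from the talk.

```python
# A sketch of the fixed-horizon t-test procedure; conversion data is simulated
# and the 10% / 11% rates are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control   = rng.binomial(1, 0.10, size=10_000)  # step 1: run until the required sample size
variation = rng.binomial(1, 0.11, size=10_000)

# Step 2: "What are the chances I'd have gotten these results in an A/A test?"
t_stat, p_value = stats.ttest_ind(variation, control)

# Step 3: if p-value < 5%, the result is significant.
print(f"p-value = {p_value:.4f}")
print("Significant!" if p_value < 0.05 else "Inconclusive.")
```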
1908: Data is expensive. Data is slow. Practitioners are trained.
2015: Data is cheap. Data is real-time. Practitioners are everyone.
The t-test was designed for the world of 1908, not the world of 2015.
Pitfalls of traditional statistics:
1. Stopping early
2. Peeking
3. Mistaking “False positive rate” for “False discovery rate”
“Thank goodness a third person didn't die, or public health authorities would be banning jogging.” – Alex Hutchinson, Runner’s World
1. Stopping early
(Chart: reaching a statistically significant result requires a minimum sample size.)

Minimum Detectable Effect:
• Must be set before the experiment starts
• Too high → you may miss an effect
• Too low → experiments take too long
Traditional stats best practice: Use an online calculator to set your Minimum Detectable Effect and Minimum Sample Size before your test.
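For concreteness, here is roughly the calculation such an online calculator performs for conversion rates, using the standard two-proportion sample-size formula. The baseline rate, MDE, and power below are illustrative assumptions.

```python
# A rough sample-size calculation of the kind an online calculator performs.
# Baseline rate, MDE, and power below are illustrative assumptions.
from scipy.stats import norm

def min_sample_size(baseline, mde, alpha=0.05, power=0.80):
    """Visitors needed per variation to detect a relative lift of `mde`."""
    p1 = baseline
    p2 = baseline * (1 + mde)
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided 5% false-positive rate
    z_beta = norm.ppf(power)            # 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return int((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2) + 1

print(min_sample_size(baseline=0.10, mde=0.10))  # 10% baseline, 10% relative MDE
```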
Stats Engine best practice: While your test is running, use expected ROI to decide when to stop your experiments.
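The talk doesn't give a formula for this, so treat the following as one back-of-envelope way to weigh expected ROI against the cost of letting a test keep running. Every number in it is an assumption, not Optimizely guidance.

```python
# A back-of-envelope sketch (not Optimizely's formula) of weighing expected ROI
# against the cost of continuing a test. All numbers are assumptions.
visitors_per_month = 100_000
value_per_conversion = 50.0          # dollars
baseline_rate = 0.10
observed_lift = 0.03                 # 3% relative improvement observed so far
cost_of_testing_per_month = 2_000.0  # traffic held back, engineering time, etc.

expected_monthly_gain = (visitors_per_month * baseline_rate
                         * observed_lift * value_per_conversion)
print(f"Expected gain from shipping now: ${expected_monthly_gain:,.0f}/month")
print(f"Cost of continuing the test:     ${cost_of_testing_per_month:,.0f}/month")
# If the expected gain clearly dwarfs the cost of waiting for more certainty,
# that is a signal to stop the experiment and ship.
```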
2. Peeking

(Timeline: after the experiment starts, repeated peeks show “p-value > 5%, inconclusive” at some checks and “p-value < 5%, significant!” at others, all before the minimum sample size is reached.)

4 peeks → ~18% chance of seeing a false positive
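A quick simulation makes the inflation concrete: in an A/A test with no real difference between variations, stopping at the first peek that shows p < 5% produces false positives far more often than 5%, and the rate keeps climbing the more often you peek. The traffic numbers and peek schedule below are assumptions, and the exact figure depends on how many peeks you take and when.

```python
# Simulating the peeking problem: A/A tests (no real difference), stopping at the
# first interim look with p < 0.05. Traffic numbers are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_simulations = 2_000
peeks = [2_500, 5_000, 7_500, 10_000]      # visitors per variation at each peek

false_positives = 0
for _ in range(n_simulations):
    a = rng.binomial(1, 0.10, peeks[-1])   # both variations convert at 10%
    b = rng.binomial(1, 0.10, peeks[-1])
    for n in peeks:
        if stats.ttest_ind(a[:n], b[:n]).pvalue < 0.05:
            false_positives += 1
            break                          # experimenter stops at the first "winner"

print(f"False positive rate with {len(peeks)} peeks: "
      f"{false_positives / n_simulations:.0%}")   # well above the nominal 5%
```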
Stats Engine calculates an “always-valid” p-value. Optimizely reports (one minus) this value as “Statistical Significance.”

(Timeline: after the experiment starts, peeks show “Statistical Significance < 90%, inconclusive” until the result crosses 90% and becomes conclusive, with no minimum sample size required before looking.)
Weak evidence: Stats Engine is conservative (+trust). Strong evidence: Stats Engine will call it sooner (+velocity).
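The slides don't spell out how the always-valid p-value is computed. As one illustration of the general idea, here is a sketch of a mixture sequential probability ratio test (mSPRT) of the kind described in the always-valid-inference literature; the mixture variance `tau2` and the simulated conversion rates are assumptions, and this is not Optimizely's implementation.

```python
# A sketch of an always-valid p-value via a mixture sequential probability ratio
# test (mSPRT). Illustrative only; not Optimizely's code.
import numpy as np

def msprt_p_values(control, variation, sigma2, tau2=0.01):
    """Always-valid p-values after each paired observation (equal sample sizes)."""
    n = np.arange(1, len(control) + 1)
    diff = np.cumsum(variation - control) / n          # running estimate of the lift
    lam = np.sqrt(2 * sigma2 / (2 * sigma2 + n * tau2)) * np.exp(
        (n ** 2) * tau2 * diff ** 2 / (4 * sigma2 * (2 * sigma2 + n * tau2))
    )
    # The p-value only ever decreases, so it stays valid no matter when you peek.
    return np.minimum.accumulate(np.minimum(1.0, 1.0 / lam))

rng = np.random.default_rng(1)
p_base = 0.10
control   = rng.binomial(1, p_base, 50_000)
variation = rng.binomial(1, p_base * 1.10, 50_000)     # a real 10% relative lift
p = msprt_p_values(control, variation, sigma2=p_base * (1 - p_base))
significance = 1 - p                                   # what Optimizely displays

hits = np.nonzero(significance > 0.90)[0]
if hits.size:
    print(f"Reached >90% statistical significance after ~{hits[0] + 1} visitors per variation")
else:
    print("Still inconclusive after 50,000 visitors per variation")
```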
Traditional stats best practice: Wait until you reach your minimum sample size, then stop your test as soon as you look at the results.
Say I run an experiment. After I reach my minimum sample size, I stop the experiment and see two of my variations beating the control and one variation losing to the control:
Winner
Winner
Loser
T-test guarantees < 5% false positives
How many of my winners/losers are false positives?
1 original page, 5 variations, 6 goals = 30 “A/B tests”

(Grid of 30 goal-variation results: 2 winners, 1 loser, 27 inconclusive.)
30 A/B tests × 5% = 1.5 expected false positives!
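A short simulation shows why that matters: if only a few of the 30 goal-variation pairs have a genuine effect, a 5% per-test false positive rate translates into a much larger share of reported winners being false. The number of real effects and the assumed power below are illustrative.

```python
# False positive rate vs. false discovery rate across 30 goal-variation pairs.
# The counts of real effects and the assumed power are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(2)
n_experiments = 10_000
tests_per_experiment = 30
real_effects = 3            # assume only 3 of the 30 pairs have a genuine effect
power = 0.80                # chance a real effect is detected
alpha = 0.05                # per-test false positive rate

false_wins = rng.binomial(tests_per_experiment - real_effects, alpha, n_experiments)
true_wins = rng.binomial(real_effects, power, n_experiments)
total = false_wins + true_wins
fdr = false_wins[total > 0].sum() / total[total > 0].sum()

print(f"Per-test false positive rate: {alpha:.0%}")
print(f"Share of reported winners that are false (FDR): {fdr:.0%}")
```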
Think of each goal as a signal.
A goal that is positively or negatively affected by a variation is a “strong signal.”
A goal that is not affected by a variation is a “weak signal.”
Stats Engine is more conservative when weak signals are present, even when strong signals are also present. So adding a lot of “random” goals will slow down your experiment.
Traditional stats best practice: Divide your “false positive” (p-value) threshold by the number of goal-variation pairs in your experiment.
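For example, here is a sketch of that Bonferroni-style correction applied to the 30-test experiment above; the p-values are made-up placeholders.

```python
# A Bonferroni-style correction for 30 goal-variation pairs.
# The p-values below are made-up placeholders, not real results.
alpha = 0.05
n_tests = 30                         # 5 variations x 6 goals
corrected_alpha = alpha / n_tests    # 0.05 / 30 ≈ 0.0017

p_values = [0.004, 0.030, 0.0009] + [0.5] * 27
winners = [p for p in p_values if p < corrected_alpha]
print(f"Corrected threshold: {corrected_alpha:.4f}")
print(f"Results that remain significant: {len(winners)} of {n_tests}")
```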
TL;DL
1. Why we built Stats Engine
   - …to redesign statistics to match the way our customers actually use it
2. Best practices for testing with Stats Engine
   - Start with a clear hypothesis
   - Use expected ROI to decide when to stop your experiments
   - Set a primary goal
   - Avoid “random” goals and variations for maximum test velocity
Learn more: optimizely.com/statistics
Q&A
Pete Koomen, Co-founder & CTO, Optimizely