A framework for Multi-A(rmed)/B(andit) testing with online FDR control
Fanny Yang, Aaditya Ramdas, Kevin Jamieson, Martin J. Wainwright
UC Berkeley
Spotlight, NIPS Conference, December 2017
Traditional A/B Testing
A (control): 50%  vs.  B (alternative): 75%
H0: A is at least as good as B
Hypothesis test
accept H0: keep using A
reject H0: switch to B
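The slide does not say which hypothesis test is used. As a minimal sketch, assuming the 50% and 75% figures are observed success rates and that each variant was shown to 100 users (both are assumptions, not stated in the talk), a one-sided pooled two-proportion z-test for H0 could look like this. The function name is hypothetical.

```python
import math


def one_sided_two_proportion_p_value(success_a: int, n_a: int,
                                     success_b: int, n_b: int) -> float:
    """p-value for H0: rate_A >= rate_B against H1: rate_B > rate_A.

    Uses the pooled two-proportion z-test (normal approximation).
    """
    rate_a = success_a / n_a
    rate_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0.0:
        # All outcomes identical: no evidence against H0.
        return 1.0
    z = (rate_b - rate_a) / std_err
    # Upper tail of the standard normal: P(Z >= z) = 0.5 * erfc(z / sqrt(2)).
    return 0.5 * math.erfc(z / math.sqrt(2))


# Assumed sample sizes: 100 users per variant.
p_value = one_sided_two_proportion_p_value(50, 100, 75, 100)
decision = "switch to B" if p_value < 0.05 else "keep using A"
```

With these assumed counts the p-value is far below 0.05, so the sketch rejects H0 and switches to B.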
In reality: many alternatives, many tests
Sequence of tests, each comparing a control (default) against several alternatives:
January: Phone App Layout
April: Website Layout
August: Teaser picture
…
Goal I (A/B testing)
Tests: January Phone App Layout, April Website Layout, May TV ad, June Email ads, August Teaser picture, Dec. NIPS booth
Null hypothesis true: control is indeed better
Null hypothesis wrong: at least one alternative is better
Each test is either accepted or rejected; rejections are discoveries, and rejections of a true null hypothesis are false discoveries
Control the expected ratio #false discoveries / #discoveries (FDR)
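Written out, the ratio on this slide is the false discovery proportion after the first J tests, and the FDR is its expectation. The max(·, 1) in the denominator is the usual convention for the case of no discoveries and is an assumption here, as is the symbol J; α is the desired FDR level.

```latex
\mathrm{FDP}(J) =
  \frac{\#\{\text{false discoveries among tests } 1,\dots,J\}}
       {\max\bigl(\#\{\text{discoveries among tests } 1,\dots,J\},\ 1\bigr)},
\qquad
\mathrm{FDR}(J) = \mathbb{E}\bigl[\mathrm{FDP}(J)\bigr] \le \alpha
\quad \text{for every } J .
```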
Goal II (power and best alternative)
Tests: January Phone App Layout, April Website Layout, May TV ad, June Email ads, August Teaser picture, Dec. NIPS booth
Null hypothesis true: control is indeed better
Null hypothesis wrong: at least one alternative is better
Each test is either accepted or rejected; rejected tests also report a best alternative (Alternative 3, Alternative 4, Alternative 2)
Maximize # true discoveries, find best alternative for each discovery
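Both goals can be checked in a simulation, where the truth of each null hypothesis is known. The following sketch counts true discoveries and computes the false discovery proportion for one run; the function name and the six example decisions are made up for illustration.

```python
from typing import List, Tuple


def summarize_discoveries(null_is_true: List[bool],
                          rejected: List[bool]) -> Tuple[float, int]:
    """Return (false discovery proportion, number of true discoveries)."""
    discoveries = sum(rejected)
    # A false discovery is a rejected test whose null hypothesis is true.
    false_discoveries = sum(r and h for r, h in zip(rejected, null_is_true))
    true_discoveries = discoveries - false_discoveries
    fdp = false_discoveries / max(discoveries, 1)
    return fdp, true_discoveries


# Illustrative ground truth and decisions for six tests (made-up values).
null_is_true = [True, False, False, True, False, True]
rejected = [False, True, True, True, False, False]
fdp, true_discoveries = summarize_discoveries(null_is_true, rejected)
```

Here three tests are rejected, one of them wrongly, so the run has two true discoveries and a false discovery proportion of one third. Averaging the proportion over many simulated runs estimates the FDR.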
Our framework: MAB-FDR
MAB-FDR meta algorithm, given a desired FDR level α:
Test j: the online FDR procedure supplies the test level α_j; a best-arm MAB returns the p-value p_j and the best alternative; the test p_j < α_j gives the reject/accept decision, which goes back to the online FDR procedure
Test j+1: the same steps with α_{j+1} and p_{j+1}
…
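The wiring of the meta algorithm can be sketched in a few lines. Everything named here is an assumption for illustration: `AlphaSpending` is a simple online Bonferroni rule swapped in as a stand-in for the online FDR procedure (it controls the FDR only because it controls the stricter family-wise error rate, and it ignores past decisions), and each test is a hypothetical callable that takes its level and returns a p-value and the index of its best alternative.

```python
import math
from typing import Callable, Dict, List, Tuple


class AlphaSpending:
    """Stand-in online procedure: level alpha * 6 / (pi^2 * j^2) for test j.

    The levels sum to alpha over all j (online Bonferroni). This is NOT the
    procedure from the talk; it is only a placeholder with the same interface.
    """

    def __init__(self, alpha: float) -> None:
        self.alpha = alpha
        self.j = 0

    def next_level(self) -> float:
        self.j += 1
        return self.alpha * 6.0 / (math.pi ** 2 * self.j ** 2)

    def record(self, rejected: bool) -> None:
        # Feedback hook: a feedback-driven procedure would update its
        # budget from the reject/accept decision here. This rule does not.
        pass


# Hypothetical test interface: alpha_j -> (p_j, index of best alternative).
BanditTest = Callable[[float], Tuple[float, int]]


def mab_fdr(tests: List[BanditTest], alpha: float) -> List[Dict]:
    """Run the tests in order, one level alpha_j per test."""
    procedure = AlphaSpending(alpha)
    results = []
    for j, run_test in enumerate(tests, start=1):
        alpha_j = procedure.next_level()
        p_j, best_alternative = run_test(alpha_j)
        rejected = p_j < alpha_j
        procedure.record(rejected)
        results.append({
            "test": j,
            "alpha_j": alpha_j,
            "p_j": p_j,
            "rejected": rejected,
            # A best alternative is only reported for discoveries.
            "best_alternative": best_alternative if rejected else None,
        })
    return results
```

For example, `mab_fdr([lambda a: (1e-6, 3), lambda a: (0.5, 1)], alpha=0.05)` rejects the first test and reports alternative 3, and accepts the second.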
Our framework…
1. Uses online FDR procedures to control the FDR at any test
2. Uses a best-arm MAB algorithm to test each hypothesis
and find the best alternative,
while sampling only as much as needed
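To illustrate what "sampling only as much as needed" means, here is a sketch of successive elimination, a generic best-arm algorithm. It is not necessarily the algorithm used in the talk. It assumes rewards in [0, 1], uses Hoeffding confidence intervals, and stops pulling an arm once that arm is clearly worse than the current leader. The function name, `pull` callback, and `max_rounds` cap are assumptions.

```python
import math
import random
from typing import Callable, Tuple


def successive_elimination(pull: Callable[[int], float], n_arms: int,
                           delta: float = 0.05,
                           max_rounds: int = 100_000) -> Tuple[int, int]:
    """Return (index of the estimated best arm, total number of pulls).

    Assumes rewards lie in [0, 1]; delta is the allowed error probability.
    """
    active = list(range(n_arms))
    sums = [0.0] * n_arms
    total_pulls = 0
    for t in range(1, max_rounds + 1):
        # Pull every arm that is still in the running once per round.
        for arm in active:
            sums[arm] += pull(arm)
            total_pulls += 1
        # Hoeffding radius, union-bounded over arms and rounds.
        radius = math.sqrt(math.log(4 * n_arms * t * t / delta) / (2 * t))
        means = {arm: sums[arm] / t for arm in active}
        best_lower = max(means.values()) - radius
        # Drop arms whose upper bound is below the leader's lower bound.
        active = [arm for arm in active if means[arm] + radius >= best_lower]
        if len(active) == 1:
            break
    # Active arms share the same pull count, so sums are comparable.
    best = max(active, key=lambda arm: sums[arm])
    return best, total_pulls


# Simulated Bernoulli arms with made-up success rates (arm 0 plays the control).
rng = random.Random(0)
rates = [0.5, 0.55, 0.75, 0.6]
best_arm, pulls = successive_elimination(
    lambda arm: 1.0 if rng.random() < rates[arm] else 0.0, n_arms=4)
```

Arms far from the best are eliminated after relatively few pulls, so most of the sampling budget goes to separating the close contenders.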
"A framework for Multi-A(rmed)/B(andit) testing with online FDR control"
Fanny Yang, Aaditya Ramdas, Kevin Jamieson, Martin Wainwright
Come and learn more at Poster #2