
A/B Testing Data-Driven Algorithms in the Cloud - Webinar


A/B Testing Data-Driven Algorithms in the Cloud

cloudacademy.com, 7/25/2016

About us

Roberto Turrin, Sr. Data Scientist (PhD), @robytur
Luca Baroffio, Data Scientist (PhD), @lucabaroffio

Agenda

Data-driven algorithms

Evaluation

A/B testing

Challenges in A/B testing data-driven algorithms

A/B testing in the cloud

Data-driven A/B testing

Conclusions

Q&A

Data-driven algorithms

Decision problems that can be modeled from data

Data-driven problems - I

Image recognition

Document classification

Speech-to-text

Spam/fraud detection

Stock price prediction

Content personalization

Market basket

Search suggestion

Playlist generation

Document clustering

User segmentation

Targeted advertising

Data-driven problems - II


The same problems fall into a few task types:

Supervised: classification (e.g., spam detection), regression (e.g., predicting a value such as 170 cm)

Unsupervised: clustering (e.g., splitting items into group A and group B), rule extraction (e.g., A, B → C)

Data-driven algorithm pipeline

Pipeline: feature extraction (batch) → training (batch) → prediction (real-time). The collected data set is turned into features, the features are used to train ML models, and the models score real-time data to produce information.
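To make the pipeline concrete, here is a minimal sketch of batch feature extraction and training followed by real-time prediction. It assumes scikit-learn; the toy data set, the feature names, and the choice of logistic regression are illustrative, not part of the webinar.

```python
# Minimal sketch of the batch-training / real-time-prediction pipeline.
# Assumes scikit-learn; the data, features, and model are illustrative.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# --- Batch: feature extraction over the collected data set ---
raw_data = [
    {"country": "IT", "visits": 12, "device": "mobile"},
    {"country": "US", "visits": 3,  "device": "desktop"},
    {"country": "US", "visits": 25, "device": "mobile"},
]
labels = [1, 0, 1]  # e.g., did the user convert?

vectorizer = DictVectorizer()
features = vectorizer.fit_transform(raw_data)

# --- Batch: training the ML model ---
model = LogisticRegression()
model.fit(features, labels)

# --- Real time: score incoming data with the trained model ---
incoming = {"country": "IT", "visits": 7, "device": "desktop"}
score = model.predict_proba(vectorizer.transform([incoming]))[0, 1]
print(f"conversion probability: {score:.2f}")
```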

Evaluation: offline vs online

Offline evaluation - I

(Pipeline diagram as above: in offline evaluation, the prediction stage is fed from a snapshot of the collected data set rather than from live real-time data.)

Offline experiments are run on a snapshot of the collected data set.

Offline evaluation - II

PROS:
• Quick
• Large number of solutions can be compared
• No impact on business
• Applicable in most scenarios

CONS:
• They use past data
• Risk of promoting imitation
• They do not consider the impact of the algorithm on the user context
• Not suitable for "unpredictable" data (e.g., stock price)

Online evaluation

(Pipeline diagram as above: in online evaluation, the prediction stage is fed with live real-time data from real users.)

Online experiments use live user feedback

Online: human-subject experiments - I

Controlled experiment: A or B?

Human-subject experiments work in a controlled environment

Online: human-subject experiments - II

PROS:
• Feedback from real users, affected by the actual context
• Multiple KPIs can be measured

CONS:
• A controlled environment (back-end + front-end) must be implemented
• The environment is simulated
• Recruiting non-biased users is hard
• Does not scale: limited number of users
• Few solutions can be tested
• Users must be motivated
• Medium running time

Online: live A/B testing - I

Live testing works in production

Online: live A/B testing - II

PROS:
• Captures the real, full impact of the data-driven solution

CONS:
• Very few solutions can be tested
• Long running time
• Large traffic required
• May affect business
• Some KPIs are hard to measure

A/B testing: under the hood

Statistical hypothesis testing:

1. Formulate a hypothesis
2. Set up a testing campaign
3. Use statistics to evaluate the hypothesis

A/B testing: real-world similarities

Clinical trials

Product comparison

Quality assurance

Decision making

A/B testing: UI examples

Example UI variations:

• "ADD TO CART" button: blue vs red
• "Register" vs "Register (it's FREE!)"
• Landing page (placeholder copy) with a "Start free trial" call to action

A ("Control") vs B ("Variation")

A/B testing: ingredients

• Hypothesis formulation: everything starts with an idea

• Define metrics: how do we measure whether something is "successful"?

• Run a test, collect data, and compute the metrics

• Compare the two alternatives

A/B testing: 1) hypothesis formulation

A red button is clicked more often than a blue button

Statistics lingo:

Null hypothesis:

There is no difference between the red and the blue buttons

GOAL: reject the null hypothesis

If the null hypothesis holds (the buttons really perform the same), we fail to reject it; we never "accept" it.

A/B testing: 2) define a metric

Choose a measure that reflects your goals

Examples:

Click Through Rate (CTR)

Open rate, click rate

Conversion rate (# subscriptions / # visitors)

Customer satisfaction

Returning-user rate

A/B testing: 3) run a test

It may affect your business!

1. Create the two alternatives

2. Assign a subset of users to each alternative

3. Collect data and compute the metrics

A/B testing: 4) compare the two alternatives

A: 1 view, 0 clicks → 0% CTR
B: 1 view, 1 click → 100% CTR

100% > 0%, so the red button is better, right?

Not so fast…

A/B testing: confidence

What is the variability of our measure? How confident are we in the outcome of the test?

Model the measure with a statistical distribution, e.g., a Gaussian distribution

E.g., the average click-through rate for the blue button is 20% ± 7% (a confidence interval)

A/B testing: confidence interval

A confidence interval is a range defined so that there is a given probability that the value of your measure falls within that range.

The confidence interval depends on the confidence level

The higher the confidence level, the larger the confidence interval

E.g., the average click-through rate for the blue button is 20% ± 7% at a 90% confidence level
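As a rough sketch of how such an interval can be obtained (using the normal approximation to the binomial; the click and view counts below are illustrative, not from the slides):

```python
# Normal-approximation confidence interval for a click-through rate.
# Numbers are illustrative; only the standard library is used.
from statistics import NormalDist
from math import sqrt

def ctr_confidence_interval(clicks, views, confidence=0.90):
    """Return (ctr, margin) at the given confidence level."""
    ctr = clicks / views
    # z-score for a two-sided interval, e.g. ~1.645 at 90% confidence
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(ctr * (1 - ctr) / views)
    return ctr, margin

ctr, margin = ctr_confidence_interval(clicks=200, views=1000, confidence=0.90)
print(f"CTR = {ctr:.0%} ± {margin:.1%} at 90% confidence")
# -> 20% ± 2.1%; the slide's ±7% corresponds to roughly 90 views
```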

A/B testing: comparing distributions

(Plot: the estimated distribution p(CTR) for the blue "ADD TO CART" button is centered at 20%, the one for the red button at 40%.)

At a 90% confidence level, the blue button's average CTR is 20% ± 7%.
At a 95% confidence level, it is 20% ± 10%.

A/B testing: rejecting the null hypothesis

(Plot: the variation's average CTR of 40% lies outside the control's 95% confidence interval of 20% ± 10%.)

The average CTR for the variation falls outside the confidence interval → null hypothesis rejected!
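A minimal sketch of this comparison as a two-proportion z-test (the slides do not prescribe a specific test, and the counts below are illustrative):

```python
# Two-proportion z-test: is the variation's CTR different from the control's?
# Counts are illustrative; only the standard library is used.
from statistics import NormalDist
from math import sqrt

def two_proportion_z_test(clicks_a, views_a, clicks_b, views_b):
    p_a, p_b = clicks_a / views_a, clicks_b / views_b
    # pooled proportion under the null hypothesis "no difference"
    p_pool = (clicks_a + clicks_b) / (views_a + views_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / views_a + 1 / views_b))
    z = (p_b - p_a) / se
    # two-sided p-value
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

z, p_value = two_proportion_z_test(clicks_a=200, views_a=1000,
                                    clicks_b=400, views_b=1000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
if p_value < 0.05:
    print("Null hypothesis rejected at the 95% confidence level")
```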

A/B testing: errors

Null hypothesis: there is no difference between the red and the blue buttons.

• Null hypothesis TRUE, test accepts it → True negative: the buttons are the same, and we acknowledge it.
• Null hypothesis TRUE, test rejects it → Type I error: the buttons are the same, but we say the red one is better.
• Null hypothesis FALSE, test accepts it → Type II error: the red button is better, but we say they are the same.
• Null hypothesis FALSE, test rejects it → True positive: the red button is better, and we acknowledge it.


A/B testing: comparing distributions

(Plot: the overlap between the two CTR distributions, with the blue button at 20% ± 7% at a 90% confidence level, illustrates the two error rates.)

α: type-I error rate
β: type-II error rate
power = 1 - β
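To see how α and power translate into test length, here is a sketch of the required sample size per variation, using the standard normal-approximation formula for two proportions; the baseline CTR, the uplift, and the error rates are illustrative assumptions:

```python
# Required sample size per variation for a two-proportion test,
# using the standard normal approximation. Inputs are illustrative.
from statistics import NormalDist
from math import ceil

def sample_size_per_variation(p_a, p_b, alpha=0.05, power=0.80):
    """Visitors needed in each group to detect p_a vs p_b."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # controls the type-I error rate
    z_beta = NormalDist().inv_cdf(power)            # power = 1 - beta
    variance = p_a * (1 - p_a) + p_b * (1 - p_b)
    n = variance * (z_alpha + z_beta) ** 2 / (p_a - p_b) ** 2
    return ceil(n)

# Detecting a small lift needs far more traffic than detecting a large one.
print(sample_size_per_variation(0.20, 0.22))  # roughly 6,500 visitors per arm
print(sample_size_per_variation(0.20, 0.40))  # roughly 80 visitors per arm
```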

A/B testing: tips and common mistakes

DO NOT run the two variations under different conditions

DO NOT stop the test too early

Pay attention to external factors

DO NOT blindly test without a hypothesis

DO NOT stop after the first failures

Choose the right metric

Consider the impact on your business

Randomly split the population

Keep the assignment consistent (see the bucketing sketch below)
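One common way to get both a random split and a consistent assignment is deterministic hashing of a stable user identifier. A minimal sketch (the experiment name and the 50/50 split are illustrative):

```python
# Deterministic assignment: the same user always lands in the same variation,
# yet the split across users is effectively random. Illustrative sketch.
import hashlib

def assign_variation(user_id, experiment="red_vs_blue_button", ratio_a=0.5):
    """Return 'A' or 'B' for this user, stable across sessions."""
    key = f"{experiment}:{user_id}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 10_000
    return "A" if bucket < ratio_a * 10_000 else "B"

print(assign_variation("Tom"))   # always the same answer for Tom
print(assign_variation("Mike"))
```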

A/B testing data-driven algorithms - I

(Diagram: individual users such as Tom and Mike are split between two complete pipelines, A and B, each with its own feature extraction, training, and prediction, serving "People like you" recommendations, targeted ads, and recommended users.)

A/B testing data-driven algorithms - II

CTR is not always the right metric

Search engine: ideally no click at all

Tweet suggestions: what users click is not necessarily what they want

E-commerce recommendations: users may click just to find alternatives to the proposed product

Find long-term metrics:
• Retention/churn
• Returning users
• Time spent
• Upgrading users

A/B testing data-driven algorithms - III

Multiple goals are addressed:
• Relevance
• Transparency
• Diversity
• Novelty
• Coverage
• Robustness

Consider all the steps of the pipeline

Do not vary the UI and the data-driven algorithm simultaneously

A/B testing in the cloud - I

Cloud computing makes A/B testing simpler:

1. Create multiple environments/modules with different features
2. Split traffic
• e.g., Google App Engine's traffic splitting feature

The same can be done with the serverless paradigm

A/B testing in the cloud - II

If unsure, use a third-party service

A/B testing as a service:

• AWS A/B testing service
• Google Analytics A/B testing feature
• Optimizely, VWO

A/B testing libraries:

• Sixpack, Planout, Clutch.io, Alephbet

Build your own

Data-driven algorithms to support A/B testing: multi-armed bandit - I

(Diagram: classic A/B testing splits traffic evenly among a fixed set of variations (A, B, C, …) for the whole test; a multi-armed bandit reallocates traffic over time toward the better-performing variations and can introduce new ones (F, G), driven by the same feature extraction, training, and prediction pipeline.)

Data-driven algorithms to support A/B testing: multi-armed bandit - II

PROS:
• Increased average KPI

CONS:
• Longer time to reach statistical significance
• Harder implementation
• Harder to maintain consistency
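A minimal sketch of the exploitation/exploration idea behind a multi-armed bandit, here as epsilon-greedy over three hypothetical button variations with simulated clicks (the webinar does not prescribe a specific bandit strategy):

```python
# Epsilon-greedy multi-armed bandit over a set of variations.
# Rewards are simulated clicks; all numbers are illustrative.
import random

variations = {"A": 0.20, "B": 0.25, "C": 0.15}   # hidden "true" CTRs
clicks = {v: 0 for v in variations}
views = {v: 0 for v in variations}
epsilon = 0.10  # fraction of traffic kept for exploration

def choose_variation():
    if random.random() < epsilon or not any(views.values()):
        return random.choice(list(variations))   # explore
    # exploit: pick the variation with the best observed CTR so far
    return max(views, key=lambda v: clicks[v] / views[v] if views[v] else 0.0)

for _ in range(10_000):
    v = choose_variation()
    views[v] += 1
    clicks[v] += random.random() < variations[v]  # simulated user click

for v in variations:
    print(v, views[v], f"{clicks[v] / views[v]:.2%}")
# Most traffic drifts to the best-performing variation (B here),
# which is why the average KPI increases during the test.
```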

Main takeaways

Evaluate data-driven solutions both offline and online

Define the correct KPIs; prefer long-term metrics to short-term conversions

Do not forget that A/B testing is a statistical test; rely on cloud services if you are not "confident"

Exploitation/exploration approaches can be an alternative to A/B testing

Conversion rate is not the only metric

Thank you for attending :)

cloudacademy.com

Q & A