Chris Stucchio - Data science - Conversion Hotel 2015


BAYESIAN STATISTICS

Chris Stucchio @stucchio

https://chrisstucchio.com

https://github.com/stucchio

09 Oct 2015

How Frequentist A/B Tests Work

INGREDIENTS:

• A Null Hypothesis: Versions A and B have the same conversion rate.

• An Alternative Hypothesis: Version B’s conversion rate is 5% or more higher than A’s.

• A Test Statistic: a quantity which we expect to be close to 0 if the null hypothesis is true and far from 0 if it is false. For example:

T = (CONVERSIONS A / VISITORS A) - (CONVERSIONS B / VISITORS B)

TWO PIECES OF MATH

• If N (the number of samples per variation) is at least a certain size, then the probability of T exceeding a certain cutoff is less than 0.05 (the significance cutoff), assuming the null hypothesis is true.

• If N is at least a certain size, then the probability of T being smaller than a certain cutoff is less than 0.20 (the power cutoff), assuming the alternative hypothesis is true.


EXAMPLE

Suppose the control conversion rate is 5%, and we are seeking a 20% lift in an experiment.

• If we have at least 7,600 samples per variation, then there is a 5% chance of a false positive, assuming both variations are equal.

• There is also a 20% chance of a false negative, assuming B has at least a 6% conversion rate.
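The sample size in this example can be approximated with the textbook normal-approximation formula for comparing two proportions. This is a minimal sketch, not the calculator behind the slide: the function name `samples_per_variation` and the choice of formula are assumptions, and the result depends on whether the test is one- or two-sided, so expect figures in the same range as 7,600 rather than that exact number.

```python
from math import ceil, sqrt
from statistics import NormalDist


def samples_per_variation(base_rate, relative_lift, alpha=0.05, power=0.80, two_sided=True):
    """Approximate samples needed per variation for a two-proportion z-test."""
    p_a = base_rate
    p_b = base_rate * (1 + relative_lift)
    p_bar = (p_a + p_b) / 2
    z = NormalDist().inv_cdf
    # Cutoff for the significance level (false positive rate under the null).
    z_alpha = z(1 - alpha / 2) if two_sided else z(1 - alpha)
    # Cutoff for the power (1 - false negative rate under the alternative).
    z_beta = z(power)
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p_a * (1 - p_a) + p_b * (1 - p_b))) ** 2
    return ceil(numerator / (p_b - p_a) ** 2)


n_two_sided = samples_per_variation(0.05, 0.20)
n_one_sided = samples_per_variation(0.05, 0.20, two_sided=False)
print(n_two_sided, n_one_sided)
```

Smaller lifts push the required N up quickly, because the difference in rates appears squared in the denominator.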


P-VALUE: A probability of a false positive “at least as extreme” as the result you just saw in a hypothetical A/A test.

SIGNIFICANCE LEVEL (= 100% - P-VALUE): A probability of NOT seeing a false positive at least as extreme.

These numbers are highly dependent on your null and alternative hypotheses, so you have to choose them carefully.


(Many vendors, including VWO until recently, incorrectly referred to the significance level as “Chance to Beat Control”.)

You've run an A/B test. Your A/B testing software has given you a p-value of p=0.03 for a one-tailed test.

WHICH OF THE FOLLOWING IS TRUE?

(Note that several or none of the statements may be correct.)

• You have disproved the null hypothesis (that is, there is no difference between the variations).

• The probability of the null hypothesis being true is 0.03.

• You have proved your experimental hypothesis (that the variation is better than the control).

• The probability of the variation being better than control is 97%.

• You know, if you decide to reject the null hypothesis, the probability that you are making the wrong decision is 3%.

• You have a reliable experimental finding in the sense that if the experiment were repeated a great number of times, you would obtain a significant result on 97% of occasions.


ALL ARE FALSE

BUT TRY TELLING THAT TO CUSTOMERS

Study shows 100% of psychology graduates and 80% of professors get that question wrong.

Source: “Misinterpretations of Significance: A Problem Students Share with Their Teachers?”

A PRACTICAL QUIZ

An A/B test is run, and it is observed that B has a higher mean than A with a p-value of 4%. What is the probability that B is really better than A?

‣ 96%

‣ 95%

‣ 80%

‣ Cannot be determined from the information given

So how do we compute this probability?

OUR FIRST BAYESIAN CALCULATION

Make unrealistic assumptions to simplify the math. (This is a pedagogical exercise.)

ASSUME THERE ARE ONLY TWO POSSIBILITIES IN THE WORLD

• Null Hypothesis (Control and Variation are equal)

• Alternative Hypothesis (Variation beats Control by at least 20%)

WE WILL ASSUME EACH OF THESE OCCURS WITH FIXED PROBABILITY

CONSIDER A SPHERICAL COW: Physics phrase describing calculations that illustrate the point, but are ridiculously oversimplified.


We need to know the base rate - the fraction of A/B tests in which the variation really is better than the control.

Suppose base rate is 5% - i.e., 95% of ideas suck.

This means exactly 5% of tests have a variation which is 20% better than control, and 95% of tests have a variation identical to control.

Suppose base rate is 5% - i.e., 95% of ideas suck.

Consider 1000 A/B tests:

               TEST SAYS WIN     TEST SAYS LOSE    TOTAL
REAL WINNER    40 (80% of 50)    10                50
REAL LOSER     47 (5% of 950)    903               950
TOTAL          87                913               1000


PROBABILITY OF REAL WINNER: 40 / 87 = 46%



Suppose base rate is 30% - i.e., 70% of ideas suck.

Consider 1000 A/B tests:

               TEST SAYS WIN    TEST SAYS LOSE    TOTAL
REAL WINNER    240              60                300
REAL LOSER     35               665               700
TOTAL          275              725               1000

PROBABILITY OF REAL WINNER: 240 / 275 = 87%
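Both tables follow the same arithmetic, so it can be written once and reused for any base rate. A minimal sketch, assuming the 80% power and 5% false positive rate from the frequentist example; the function name `probability_real_winner` is illustrative, not from the talk.

```python
def probability_real_winner(base_rate, power=0.80, alpha=0.05, n_tests=1000):
    """P(variation is a real winner | test says win) for a given base rate."""
    real_winners = base_rate * n_tests
    real_losers = n_tests - real_winners
    true_positives = power * real_winners    # real winners the test detects
    false_positives = alpha * real_losers    # real losers the test flags anyway
    return true_positives / (true_positives + false_positives)


for base_rate in (0.05, 0.30):
    p = probability_real_winner(base_rate)
    print(f"base rate {base_rate:.0%}: P(real winner | test says win) = {p:.0%}")
```

The code keeps 47.5 false positives rather than rounding to 47 as the table does, so the first ratio is 40 / 87.5 instead of 40 / 87; both round to 46%.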


THE PRIOR: Our opinion before we have any data

“When events change, I change my mind. What do you do?”
- PAUL SAMUELSON


THE POSTERIOR: We’ve changed our opinion after seeing the data

BAYESIAN STATISTICS

‣ Come up with a subjective Prior opinion

‣ Gather evidence

‣ Change your opinion and form a Posterior

BAYES RULE

The mathematically optimal way to change your opinion
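For reference, Bayes rule for a hypothesis H and data D can be written as follows. The symbols H and D are notation introduced here, not taken from the slides; the numeric line plugs in the 5% base rate, 80% power and 5% false positive rate used in the 1000-test example.

```latex
P(H \mid D) = \frac{P(D \mid H)\, P(H)}{P(D \mid H)\, P(H) + P(D \mid \neg H)\, P(\neg H)}

P(\text{real winner} \mid \text{test says win})
  = \frac{0.80 \times 0.05}{0.80 \times 0.05 + 0.05 \times 0.95}
  = \frac{0.04}{0.0875} \approx 0.46
```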

IMPROVING THE ACCURACY OF OUR MODEL

Unrealistic Assumptions

• The only possible conversion rates are 5% and 6% - why not 4.3% or 5.5%?

• Ignores cost/benefit. If B has a 3% conversion rate, choosing it is very bad. If B has a 4.99% conversion rate, choosing it is almost harmless.

• The previous calculation assumes we look at the results only once, then make a decision. Our users check test results every day.

THERE ARE MORE THAN TWO POSSIBLE CONVERSION RATES

It’s not realistic to assume that conversion rates are either 5% or 6%. This is just not a useful picture of reality.

Conversion rate can be 4%, 5%, 5.34%, 6.21%, or any other value between 0 and 100%. We represent our opinion with continuous functions.

CREDIBLE INTERVALS: 99% probability that true conversion rate is at least 16.9% and not more than 23.3%.

THE PRIOR: We generally think conversion rates are low.

ONE VISITOR, ONE CLICK: Our opinion updates, and higher conversion rates are more likely.

6 VISITORS, 1 CLICK: We update our opinion; that first click was probably a fluke.

22 VISITORS, 1 CLICK: We update our opinion; that first click was probably a fluke.

207 VISITORS, 4 CLICKS: We are confident the CR is approximately 2%.
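The sequence above can be reproduced with a Beta distribution, a common choice for opinions about a conversion rate. This is a sketch under stated assumptions: the slides do not give the prior, so `Beta(1, 19)` (mean 5%) is a hypothetical stand-in for "conversion rates are low", and the credible interval is estimated by sampling with the standard library rather than computed exactly.

```python
import random

PRIOR_A, PRIOR_B = 1, 19  # assumed prior, mean 5%; not taken from the slides


def posterior(visitors, clicks, a=PRIOR_A, b=PRIOR_B):
    """Beta posterior parameters after observing `clicks` out of `visitors`."""
    return a + clicks, b + (visitors - clicks)


def credible_interval(a, b, mass=0.99, draws=20_000, seed=0):
    """Equal-tailed credible interval estimated from sorted Beta samples."""
    rng = random.Random(seed)
    samples = sorted(rng.betavariate(a, b) for _ in range(draws))
    tail = (1 - mass) / 2
    return samples[int(tail * draws)], samples[int((1 - tail) * draws) - 1]


for visitors, clicks in [(0, 0), (1, 1), (6, 1), (22, 1), (207, 4)]:
    a, b = posterior(visitors, clicks)
    low, high = credible_interval(a, b)
    print(f"{visitors:>3} visitors, {clicks} clicks: mean {a / (a + b):.1%}, "
          f"99% interval {low:.1%} to {high:.1%}")
```

With this prior the posterior mean jumps after the first click, falls back as more visitors fail to convert, and settles near 2% after 207 visitors, while the interval narrows.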

PRIORS ARE SUBJECTIVE: Bayesian Analysis starts by “pulling a prior out of your posterior”.

POSTERIORS CONVERGE: Theorem (stylized): Rational Bayesians never “agree to disagree” when sufficient data is available.

JOINT POSTERIORS - REPRESENTING ALL VARIATIONS

So far we have only formed opinions about the conversion rate of one variation.

We need to represent the probability of things like “conversion rate of A is 4.5% and conversion rate of B is 6.3%”.

THE SOLUTION IS CALLED THE JOINT POSTERIOR

TWO POSTERIORS ON TWO DIFFERENT CONVERSION RATES

COMBINE TO FORM JOINT POSTERIOR: Point (0.10, 0.15) represents “A has a conversion rate of 10%, B has a conversion rate of 15%”.
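One way to work with a joint posterior is to sample points from it and count how many fall in a region of interest, such as "B converts better than A". A minimal sketch with assumptions labeled: the visitor counts are hypothetical, the priors are uniform `Beta(1, 1)`, and the joint posterior is taken as the product of the two single-variation posteriors because A and B see separate visitors.

```python
import random


def chance_b_beats_a(post_a, post_b, draws=50_000, seed=1):
    """Fraction of joint-posterior samples that land in the region CR_B > CR_A."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # One point (cr_a, cr_b) drawn from the joint posterior.
        cr_a = rng.betavariate(*post_a)
        cr_b = rng.betavariate(*post_b)
        wins += cr_b > cr_a
    return wins / draws


# Hypothetical data: 100 of 1,000 visitors convert on A, 150 of 1,000 on B.
post_a = (1 + 100, 1 + 900)
post_b = (1 + 150, 1 + 850)
print(f"P(CR_B > CR_A) = {chance_b_beats_a(post_a, post_b):.3f}")
```

The same samples can answer other questions about the pair, for example how often B beats A by more than one percentage point.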

Opinions About the World

• Start with an uneducated opinion, the prior.

• Gather data.

• Change your opinion and end up educated with a posterior.

MAKING DECISIONS

Maximize Revenue, don’t Test for Truth

Hypothesis Testing

• Designed by and for scientists.

• Question: “Do jellybeans cause acne?”

• Run A/B test, give B group jellybeans. Measure amount of acne in both groups.

• If p < 0.05, publish paper in good journal - “Jellybeans cause acne.”

• If p >= 0.05, publish paper in bad journal - “Jellybeans don’t cause acne, but we did a good experiment to check.”

Goal of hypothesis testing is to avoid publishing false results.

Think Like a Trader

SCIENTISTS look for interesting phenomena, and publish papers when they find them.

TRADERS buy and sell stocks with the goal of making money.

• CRO is more like trading - the goal is to get more conversions = $.

• If A == B, thinking A > B is harmless; instead of getting a 5% conversion rate with B, you are stuck with a 5% conversion rate with A. Money lost: $0.

• If the CR of A is 4.9% and B is 5%, a wrong decision costs only 0.1%. If CR of A is 4%, a wrong decision costs 10x more!

ASYMMETRIC COSTS AND FALSE POSITIVES

            B > A (50% CHANCE)    B = A (50% CHANCE)
DEPLOY A    Lose                  Even
DEPLOY B    Win                   Even

Smart decision: Deploy B.

Heads you win, tails you don’t lose.

Cost of a Mistake

Suppose we choose variation x. The cost of this choice is:

Loss[x] = Max over all variations i of (CR[i] - CR[x])

This is simple opportunity cost - it’s the difference between the best choice and our choice.

Key point: bigger mistakes cost us more money.

EXAMPLE

Conversion rates: A. 5%   B. 6%   C. 4.5%

‣ Loss[A] = Max(5% - 5%, 6% - 5%, 4.5% - 5%) = 1%

‣ Loss[B] = Max(5% - 6%, 6% - 6%, 4.5% - 6%) = 0%

‣ Loss[C] = Max(5% - 4.5%, 6% - 4.5%, 4.5% - 4.5%) = 1.5%
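The loss function is short enough to write directly. A minimal sketch using the three conversion rates from the example; the function name `loss` and the dictionary layout are illustrative choices. Taking the maximum of CR[i] - CR[x] over all variations is the same as subtracting CR[x] from the best conversion rate.

```python
def loss(choice, conversion_rates):
    """Opportunity cost of deploying `choice`: best conversion rate minus the chosen one."""
    return max(conversion_rates.values()) - conversion_rates[choice]


rates = {"A": 0.05, "B": 0.06, "C": 0.045}
for name in rates:
    print(f"Loss[{name}] = {loss(name, rates):.1%}")
```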

Expected Loss

BEFORE HAVING ANY DATA: Only problem - we don’t know the true conversion rate. So we compute the expected value.

Loss from choosing A (probability of each cell is 1/9):

              CR A = 4%    CR A = 5%    CR A = 6%
CR B = 4%     0%           0%           0%
CR B = 5%     1%           0%           0%
CR B = 6%     2%           1%           0%

EXPECTED LOSS FOR A = (1/9) 1% + (1/9) 2% + (1/9) 1% = 0.44%

Loss from choosing B, before having any data (probability of each cell is 1/9):

              CR A = 4%    CR A = 5%    CR A = 6%
CR B = 4%     0%           1%           2%
CR B = 5%     0%           0%           1%
CR B = 6%     0%           0%           0%

EXPECTED LOSS FOR B = (1/9) 1% + (1/9) 2% + (1/9) 1% = 0.44%

No decision: loss > threshold of caring = 0.01%

AFTER GATHERING DATA, WE RULE OUT SOME POSSIBILITIES:

(All black cells have probability ¼, grey cells have probability 0. WILD OVERSIMPLIFICATION.)

Loss from choosing A:

              CR A = 4%    CR A = 5%    CR A = 6%
CR B = 4%     0%           0%           0%
CR B = 5%     1%           0%           0%
CR B = 6%     2%           1%           0%

EXPECTED LOSS FOR A = (1/4) 1% + (1/4) 2% + (1/4) 1% = 1%

Loss from choosing B, after gathering data and ruling out some possibilities:

              CR A = 4%    CR A = 5%    CR A = 6%
CR B = 4%     0%           1%           2%
CR B = 5%     0%           0%           1%
CR B = 6%     0%           0%           0%

EXPECTED LOSS FOR B = 0% < 0.01%

Smart decision: B’s expected loss is below the threshold of caring.
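The grid calculations can be checked with a few lines of code. This is a sketch with one labeled guess: the slide's cell shading is not recoverable, so the three surviving cells with B > A are inferred from the expected-loss arithmetic, and the fourth surviving cell (A = B = 5%) is an assumption. Any fourth cell with zero loss for both choices would give the same totals.

```python
from itertools import product

RATES = (0.04, 0.05, 0.06)


def expected_loss(choice, probabilities):
    """Average opportunity cost of `choice` over cells keyed by (cr_a, cr_b)."""
    total = 0.0
    for (cr_a, cr_b), prob in probabilities.items():
        chosen = cr_a if choice == "A" else cr_b
        total += prob * (max(cr_a, cr_b) - chosen)
    return total


# Before any data: all nine cells equally likely.
before = {cell: 1 / 9 for cell in product(RATES, RATES)}

# After data: four surviving cells with probability 1/4 each.
# The (0.05, 0.05) cell is an assumption; the other three follow from the slide.
after = {(0.04, 0.05): 0.25, (0.04, 0.06): 0.25, (0.05, 0.06): 0.25, (0.05, 0.05): 0.25}

for label, probs in (("before", before), ("after", after)):
    print(label, {v: f"{expected_loss(v, probs):.2%}" for v in "AB"})
```

Before data both choices carry the same 0.44% expected loss; after data, A's rises to 1% and B's drops to 0%.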

How to run a Bayesian A/B test

• Identify a threshold of caring - a value so small that if your conversion rate drops by less than this, you don’t care.

• Example: I sell $10,000 of product/week on a 2% conversion rate. A 0.05% threshold of caring corresponds to a $250/week change in revenue.

• Run A/B test.

• Periodically (not more than once a week!) compute the expected loss for each variation. If the expected loss for some variation drops below the threshold of caring, deploy that variation.

NOT NECESSARILY A WINNER, BUT IT WON’T LOSE.
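The steps above can be sketched as a small Monte Carlo routine. This is an illustration under assumptions, not the production method behind the talk: the visitor counts are hypothetical, each variation gets a uniform `Beta(1, 1)` prior, and the expected loss is estimated by averaging the opportunity cost over posterior samples.

```python
import random


def expected_losses(data, draws=20_000, seed=2):
    """Monte Carlo expected loss per variation; `data` maps name -> (visitors, conversions)."""
    rng = random.Random(seed)
    totals = dict.fromkeys(data, 0.0)
    for _ in range(draws):
        # Draw one plausible conversion rate per variation from its posterior.
        rates = {name: rng.betavariate(1 + conv, 1 + visitors - conv)
                 for name, (visitors, conv) in data.items()}
        best = max(rates.values())
        for name, rate in rates.items():
            totals[name] += best - rate
    return {name: total / draws for name, total in totals.items()}


THRESHOLD_OF_CARING = 0.0005  # 0.05 percentage points, as in the example above

# Hypothetical counts: (visitors, conversions) per variation.
data = {"A": (10_000, 200), "B": (10_000, 260)}
losses = expected_losses(data)
deployable = [name for name, value in losses.items() if value < THRESHOLD_OF_CARING]
print(losses, deployable)

# Threshold arithmetic: 0.05 points on a 2% conversion rate is a 2.5% relative change.
weekly_revenue = 10_000
print(weekly_revenue * 0.0005 / 0.02)  # dollars per week
```

Extra costs, such as a cost of switching, could be subtracted inside the loop, which is what makes the loss-function framing easy to extend.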

Advantages

• Bayesian tests are insensitive to peeking - it’s fine to stop a test early.

• “Chance to beat control” is really the chance that a variation is better than the control.

• Get additional numbers, e.g. chance to beat all - what is the probability that B is better than A, C and D?

• Credible intervals bound uncertainty - when a winner is deployed, you’ll be told “variation B is between 0.01% and 25% better than A”. (Confidence intervals do NOT provide this information.)

• Easy to understand and extend. Is there a cost of switching? Want to account for other factors? Just include it in the loss function. (Question asked by Denis @ booking.com, and in Bayesian framework answer was obvious.)

MORE ACCURATE CALCULATIONS: Central Limit Theorem with 10,000 data points

MORE ACCURATE CALCULATIONS: Central Limit Theorem with 100 data points

WHY THE WORLD DIDN’T GO BAYESIAN SOONER: Bayesian calculations are 10 million times slower than frequentist - Charles Pickering and his computers couldn’t handle it.

Thank You!

For any questions, you can talk to us at chris@wingify.com

@stucchio
