IMPLEMENTING AND ANALYZING ONLINE EXPERIMENTS SEAN J. TAYLOR 28 JUL 2015 MULTITHREADED DATA


WHO AM I?

• Core Data Science Team at Facebook

• PhD from NYU in Information Systems

• Four academic papers employing online field experiments

• Teach and consult on experimental design at Facebook

• http://seanjtaylor.com
• http://github.com/seanjtaylor
• http://facebook.com/seanjtaylor
• @seanjtaylor

I ASSUME YOU KNOW

• Why causality matters

• A little bit of Python and R

• Basic statistics + linear regression

SIMPLEST POSSIBLE EXPERIMENT

user_id version spent

123 B $10

596 A $0

456 A $4

991 B $9

def get_version(user_id):
    if user_id % 2:
        return 'A'
    else:
        return 'B'

> t.test(c(0, 4), c(10, 9))

	Welch Two Sample t-test

data:  c(0, 4) and c(10, 9)
t = -3.638, df = 1.1245, p-value = 0.1487
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -27.74338  12.74338
sample estimates:
mean of x mean of y
      2.0       9.5

FIN

COMMON PROBLEMS

• Type I errors from measuring too many effects

• Type II and M errors from lack of power

• Repeated use of the same population (“pollution”)

• Type I errors from violation of the i.i.d. assumption

• Composing many changes into one experiment

POWER

OR: THE ONLY WAY TO TRULY FAIL AT AN EXPERIMENT

OR: THE SIZE OF YOUR CONFIDENCE INTERVALS

ERRORS

• Type I: Thinking your metric changed when it didn’t. We usually bound this at 1 or 5%.

• Type II: Thinking your metric didn’t change when it did. You can control this through better planning.

HOW TO MAKE TYPE I ERRORS

Measure a ton of metrics ($ Spent, Time Spent, Survey Satisfaction).

Find a subgroup it works on (Male <25, Female <25, Male >=25, Female >=25).
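The multiple-comparisons problem on this slide can be demonstrated with a short simulation (my own illustrative sketch, not from the talk): even with no true effect anywhere, testing 20 independent metrics at alpha = 0.05 will often produce at least one "significant" result.

```python
import random
import statistics

random.seed(0)

def z_test_significant(a, b):
    """Two-sided two-sample z-test at roughly alpha = 0.05,
    using a normal approximation."""
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    z = (statistics.mean(a) - statistics.mean(b)) / se
    return abs(z) > 1.96

n_metrics = 20
false_positives = 0
for _ in range(n_metrics):
    # Both arms drawn from the SAME distribution: every rejection is a Type I error.
    a = [random.gauss(0, 1) for _ in range(200)]
    b = [random.gauss(0, 1) for _ in range(200)]
    false_positives += z_test_significant(a, b)

# On average, about one of the 20 null metrics comes out "significant".
print(false_positives)
```

Slicing by subgroups multiplies the number of tests the same way, which is why "find a subgroup it works on" is such a reliable Type I error generator.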

AVOID TYPE II ERRORS WITH POWER

1. Use enough subjects in your experiment.

2. Test a reasonably strong treatment. Remember: you care about the difference.

POWER ANALYSIS

The first step in designing an experiment is to determine how much data you'll need to answer your question.

Process:

• Set the smallest effect size you'd like to detect.

• Simulate your experiment 200 times at various sample sizes.

• Count the number of simulated experiments where you correctly reject the null of effect = 0.
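The process above can be sketched in a few lines (my own illustrative Python; the effect size, sigma, and candidate sample sizes are made-up numbers, and the test is a simple z-test rather than anything from the talk):

```python
import random
import statistics

random.seed(1)

MIN_EFFECT = 0.2  # smallest effect (in outcome units) we'd like to detect
SIGMA = 1.0       # assumed standard deviation of the outcome

def z_test_significant(a, b):
    """Two-sided two-sample z-test at roughly alpha = 0.05."""
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    return abs((statistics.mean(a) - statistics.mean(b)) / se) > 1.96

def power(n_per_arm, n_sims=200):
    """Fraction of simulated experiments that correctly reject effect = 0."""
    rejections = 0
    for _ in range(n_sims):
        control = [random.gauss(0.0, SIGMA) for _ in range(n_per_arm)]
        treated = [random.gauss(MIN_EFFECT, SIGMA) for _ in range(n_per_arm)]
        rejections += z_test_significant(control, treated)
    return rejections / n_sims

# Power grows with sample size; pick the smallest n that reaches your target.
for n in [50, 200, 800]:
    print(n, power(n))
```

You would then choose the smallest sample size whose simulated power clears your target (0.8 is a common choice).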

TYPE M ERRORS

• Magnitude error: reporting an effect size that is too large.

• Happens when your experiment is underpowered AND you only report the significant results.
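A quick simulation (mine, not from the talk) makes the mechanism concrete: in an underpowered experiment, the estimates that happen to clear the significance bar systematically overstate the true effect.

```python
import random
import statistics

random.seed(2)

TRUE_EFFECT = 0.2
N = 50  # deliberately too few subjects for an effect this small

significant_estimates = []
for _ in range(2000):
    control = [random.gauss(0.0, 1.0) for _ in range(N)]
    treated = [random.gauss(TRUE_EFFECT, 1.0) for _ in range(N)]
    diff = statistics.mean(treated) - statistics.mean(control)
    se = (statistics.variance(control) / N + statistics.variance(treated) / N) ** 0.5
    if abs(diff / se) > 1.96:  # only the "significant" results get reported
        significant_estimates.append(diff)

# Conditioning on significance inflates the reported effect above the true 0.2.
print(statistics.mean(significant_estimates))
```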

IMPLEMENTATION

PLANOUT: KEY IDEAS

• An experiment is just a pseudo-random mapping from (user, context) → parameters, and is serializable.

• Persistent randomization is implemented through hash functions; salts make experiments orthogonal.

• Always log exposures (parameter assignments) to improve precision and provide a randomization check.

• Namespaces make it possible to run sequential experiments on new blocks of users.

https://facebook.github.io/planout/
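The persistent-randomization idea can be sketched without PlanOut itself. This is my own minimal illustration using `hashlib`, not PlanOut's actual implementation: a salted hash gives each user a stable pseudo-random assignment, and distinct salts make different experiments' assignments independent of one another.

```python
import hashlib

def assign(user_id, salt, choices):
    """Deterministically map (salt, user_id) to one of `choices`."""
    digest = hashlib.sha1(f"{salt}.{user_id}".encode()).hexdigest()
    return choices[int(digest, 16) % len(choices)]

# Same user + same salt -> same assignment on every call (persistent).
print(assign(212, "button_text", ["Buy now!", "Buy later!"]))
# A different salt re-randomizes independently (orthogonal experiments).
print(assign(212, "button_color", ["blue", "orange"]))
```

No assignment table needs to be stored: the hash is the assignment, which is what makes the mapping serializable.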

A/B TESTING IN PLANOUT

from planout.ops.random import *
from planout.experiment import SimpleExperiment

class ButtonCopyExperiment(SimpleExperiment):
    def assign(self, params, user_id):
        # `params` is always the first argument.
        params.button_text = UniformChoice(
            choices=["Buy now!", "Buy later!"],
            unit=user_id
        )

# Later in your production code:
from myexperiments import ButtonCopyExperiment

e = ButtonCopyExperiment(user_id=212)
print(e.get('button_text'))

# Even later:
e = ButtonCopyExperiment(user_id=212)
e.log_event('purchase', {'amount': 9.43})

PLANOUT LOGS → DATA

{"inputs": {"user_id": 212}, "name": "ButtonCopyExperiment", "checksum": "646e69a5", "params": {"button_text": "Buy later!"}, "time": 1437952369, "salt": "ButtonCopyExperiment", "event": "exposure"}

{"inputs": {"user_id": 212}, "name": "ButtonCopyExperiment", "checksum": "646e69a5", "params": {"button_text": "Buy later!"}, "time": 1437952369, "extra_data": {"amount": 9.43}, "salt": "ButtonCopyExperiment", "event": "purchase"}

Exposures:

user_id  button_text
123      Buy later!
596      Buy later!
456      Buy now!
991      Buy later!

Purchases:

user_id  amount
123      $12
596      $9

ADVANCED DESIGN 1: FACTORIAL DESIGN

• Can use conditional logic as well as other random assignment operators: RandomInteger, RandomFloat, WeightedChoice, Sample.

class FactorialExperiment(SimpleExperiment):
    def assign(self, params, user_id):
        params.button_text = UniformChoice(
            choices=["Buy now!", "Buy later!"],
            unit=user_id
        )
        params.button_color = UniformChoice(
            choices=["blue", "orange"],
            unit=user_id
        )

ADVANCED DESIGN 2: INCREMENTAL CHANGES

## We're going to try two different button redesigns.
class FirstExperiment(SimpleExperiment):
    def assign(self, params, user_id):
        # ... set some params

class SecondExperiment(SimpleExperiment):
    def assign(self, params, user_id):
        # ... set some params differently

class ButtonNamespace(SimpleNamespace):
    def setup(self):
        self.name = 'button_experiment_sequence'
        self.primary_unit = 'user_id'
        self.num_segments = 1000

    def setup_experiments(self):
        # Allocate and deallocate experiments here.
        # Each experiment gets 100 out of 1000 segments.
        self.add_experiment('first', FirstExperiment, 100)
        self.add_experiment('second', SecondExperiment, 100)

ADVANCED DESIGN 3: WITHIN-SUBJECTS

The previous experiments persistently assigned the same treatment to each user, but the unit of analysis can be more complex:

class DiscountExperiment(SimpleExperiment):
    def assign(self, params, user_id, item_id):
        params.discount = BernoulliTrial(p=0.1, unit=[user_id, item_id])
        if params.discount:
            params.discount_amount = RandomInteger(
                min=5, max=15, unit=user_id
            )
        else:
            params.discount_amount = 0

e = DiscountExperiment(user_id=212, item_id=2)
print(e.get('discount_amount'))

ANALYSIS

THE IDEAL DATA SET

Subject/User  Gender  Age  Button Size  Button Text  Spent  Bounce
Erin          F       22   Large        Buy Now!     $20    0
Ashley        F       29   Large        Buy Later!   $4     0
Gary          M       34   Small        Buy Now!     $0     1
Leo           M       18   Large        Buy Now!     $0     1
Ed            M       46   Small        Buy Later!   $9     0
Sam           M       25   Small        Buy Now!     $5     0

Rows are independent observations. Gender and Age are pre-experiment covariates; Button Size and Button Text are randomly assigned; Spent and Bounce are metrics.

SIMPLEST CASE: OLS

> summary(lm(spent ~ button.size, data = df))

Call:
lm(formula = spent ~ button.size, data = df)

Residuals:
     1     2     3     4     5     6
  10.0  -0.5  -4.5 -10.0   4.5   0.5

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)            10.000      5.489   1.822    0.143
factor(button.size)s   -5.500      6.722  -0.818    0.459

Residual standard error: 7.762 on 4 degrees of freedom
Multiple R-squared:  0.1434,	Adjusted R-squared:  -0.07079
F-statistic: 0.6694 on 1 and 4 DF,  p-value: 0.4592

DATA REDUCTION

Subject  Xi  Di  Yi
Evan     M   0   1
Ashley   F   0   1
Greg     M   1   0
Leena    F   1   0
Ema      F   0   0
Seamus   M   1   1

X  D  Y  Cases
M  0  1  1
M  1  1  1
F  0  1  1
F  1  1  0
M  0  0  0
M  1  0  1
F  0  0  1
F  1  0  1

N rows reduce to (# treatments) × (# groups) × (# outcomes) rows.

source('css_stats.R')

reduced <- df %>%
  mutate(rounded.spent = round(spent, 0)) %>%
  group_by(button.size, rounded.spent) %>%
  summarise(n = n())

> lm(rounded.spent ~ button.size, data = reduced, weights = n) %>%
+   coeftest(vcov = sandwich.lm)

t test of coefficients:

             Estimate Std. Error t value  Pr(>|t|)
(Intercept)   7.43137    0.45162 16.4548 7.522e-14 ***
button.sizes -2.45178    0.59032 -4.1533 0.0004149 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

DATA REDUCTION + WEIGHTED OLS

FACTORIAL DESIGNS

• Factorial designs identify two types of effects: marginal effects and interactions. You need to fix one group as the baseline.

> coeftest(lm(spent ~ button.size * button.text, data = df))

t test of coefficients:

                           Estimate Std. Error t value  Pr(>|t|)
(Intercept)                 6.79643    0.62998 10.7884 < 2.2e-16 ***
button.sizes               -2.43253    0.86673 -2.8066  0.006064 **
button.textn                2.11611    0.86673  2.4415  0.016458 *
button.sizes:button.textn  -2.57660    1.27584 -2.0195  0.046219 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

USING COVARIATES TO GAIN PRECISION

• With simple random assignment, using covariates is not necessary for unbiased estimates.

• However, you can improve the precision of ATE estimates if covariates explain a lot of the variation in the potential outcomes.

• Covariates can be added to a linear model, and the SEs should shrink if they are helpful.
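To see why covariates help, here is an illustrative simulation of my own (the `pre_spend` covariate and all numbers are made up): subtracting the part of the outcome explained by a predictive pre-experiment covariate shrinks the standard error of the difference-in-means estimate without touching the treatment effect.

```python
import random
import statistics

random.seed(3)
N = 2000

# Pre-experiment covariate that strongly predicts the outcome.
pre_spend = [random.gauss(10, 3) for _ in range(N)]
treated = [i % 2 == 0 for i in range(N)]
# Outcome: mostly explained by the covariate, plus a small treatment effect.
spend = [x + (0.5 if d else 0.0) + random.gauss(0, 1)
         for x, d in zip(pre_spend, treated)]

def diff_in_means_se(y, d):
    """ATE estimate and its standard error from a simple difference in means."""
    y1 = [yi for yi, di in zip(y, d) if di]
    y0 = [yi for yi, di in zip(y, d) if not di]
    se = (statistics.variance(y1) / len(y1) + statistics.variance(y0) / len(y0)) ** 0.5
    return statistics.mean(y1) - statistics.mean(y0), se

# Regression-adjust: subtract the part of the outcome the covariate explains.
mx, my = statistics.mean(pre_spend), statistics.mean(spend)
slope = (sum((x - mx) * (y - my) for x, y in zip(pre_spend, spend))
         / sum((x - mx) ** 2 for x in pre_spend))
adjusted = [y - slope * x for x, y in zip(pre_spend, spend)]

_, raw_se = diff_in_means_se(spend, treated)
_, adj_se = diff_in_means_se(adjusted, treated)
print(raw_se, adj_se)  # the adjusted SE is substantially smaller
```

In a linear model this is exactly what adding the covariate as a regressor does for you.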

NON-IID DATA

• Repeated observations of the same user are not independent.

• Ditto if you're experimenting on certain items only.

• If you ignore dependent data, the true confidence intervals are larger than you think.

Subject/User  Item   Button Size  Spent
Erin          Shirt  Large        $20
Erin          Socks  Large        $4
Erin          Pants  Large        $0
Leo           Shirt  Large        $0
Ed            Shirt  Small        $9
Ed            Socks  Small        $5

THE BOOTSTRAP

1. Generate random sub-samples of all your data: R1, R2, ..., R500.

2. Compute statistics or estimate model parameters on each sub-sample: s1, s2, ..., s500.

3. Get a distribution over the statistic of interest (e.g. the treatment effect):

• take the mean
• CIs == 95% quantiles
• SEs == standard deviation
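The recipe on this slide can be sketched directly (my own illustrative Python on fake data, not code from the talk):

```python
import random
import statistics

random.seed(4)

# Fake outcome data; in practice this would be your experiment's metric.
data = [random.gauss(5, 2) for _ in range(500)]

def bootstrap(data, stat, n_replicates=500):
    """Resample with replacement and recompute `stat` on each replicate."""
    return [stat(random.choices(data, k=len(data))) for _ in range(n_replicates)]

reps = sorted(bootstrap(data, statistics.mean))
lo = reps[int(0.025 * len(reps))]   # 2.5% quantile -> lower CI bound
hi = reps[int(0.975 * len(reps))]   # 97.5% quantile -> upper CI bound
se = statistics.stdev(reps)         # SE == standard deviation of replicates
print(f"95% CI = ({lo:.2f}, {hi:.2f}), SE = {se:.2f}")
```

For the clustered (user or user-item) bootstraps on the next slide, you resample whole clusters instead of individual rows, so the dependence within each user is preserved in every replicate.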

USER AND USER-ITEM BOOTSTRAPS

source('css_stats.R')
library(broom)  ## for extracting model coefficients

fitter <- function(.data) {
  lm(summary ~ opposed, data = .data, weights = .weights) %>%
    tidy
}

iid.replicates    <- iid.bootstrap(df, fitter, .combine = bind_rows)
oneway.replicates <- clustered.bootstrap(df, c('user_id'), fitter, .combine = bind_rows)
twoway.replicates <- clustered.bootstrap(df, c('user_id', 'item_id'), fitter, .combine = bind_rows)

> head(iid.replicates)
          term   estimate  std.error statistic      p.value
1  (Intercept)  0.4700000 0.04795154  9.801561 6.296919e-18
2 button.sizes -0.2200000 0.08003333 -2.748855 6.695621e-03
3  (Intercept)  0.4250000 0.05307832  8.007036 5.768641e-13
4 button.sizes -0.1750000 0.08456729 -2.069358 4.049329e-02
5  (Intercept)  0.4137931 0.05141050  8.048805 4.118301e-13
6 button.sizes -0.1429598 0.08621804 -1.658119 9.965016e-02

DEPENDENCE CHANGES CONFIDENCE INTERVALS

DATA REDUCTION WITH DEPENDENT DATA

Subject  Di  Yij
Evan     1   1
Evan     1   0
Ashley   0   1
Ashley   0   1
Ashley   0   1
Greg     1   0
Leena    1   0
Leena    1   1
Ema      0   0
Seamus   1   1

1. Create bootstrap replicates (resampling whole subjects): R1, R2, R3.

2. Reduce the replicates as if they're i.i.d.: r1, r2, r3.

3. Compute statistics on the reduced data: s1, s2, s3.

THANKS! HERE ARE SOME RESOURCES:

• Me: http://seanjtaylor.com

• These slides: http://www.slideshare.net/seanjtaylor/implementing-and-analyzing-online-experiments

• Full Featured Tutorial: http://eytan.github.io/www-15-tutorial/

• “Field Experiments” by Gerber and Green