Sample Size and Power Calculations - Economics.ozier.comeconomics.ozier.com/owen/slides/ozier_powercalc_talk_20100914a.pdf · MotivationProbability basicsPower calculationExercises

Motivation Probability basics Power calculation Exercises

Sample Size and Power Calculations

IPA/JPAL/CMF Training

Limuru, Kenya28 July 2010

Owen OzierDepartment of Economics

University of California at Berkeley

Slides revised 14 September 2010

Owen Ozier Sample Size and Power Calculations


Thanks and Introduction

Thanks to everyone from JPAL/IPA who made this happen!

My background: randomized evaluations in Busia, Kenya.



Thanks and Introduction

Thanks to everyone from JPAL/IPA who made this happen!

My background: randomized evaluations in Busia, Kenya.



Motivation

Program evaluation:bringing the scientific method to social science

First steps:

Propose a hypothesis

Design an experiment to test the hypothesis

This involves gathering data......but how much data will we need?



Motivation


First steps:






Motivation


First steps:






Motivation


First steps:






Motivation


First steps:



This involves gathering data...

...but how much data will we need?



Motivation


First steps:






Usually a lot



How this will work

“Numerical data should be kept for eternity; it’s great stuff.”- Glenn Stevens, Boston University



Outline

1 Motivation

2 Probability basicsCoin tossing

3 Power calculationTerminology/ConceptsThe Basic CalculationClustersCovariatesDetails

4 Exercises



Outline

1 Motivation



4 Exercises



Outline

1 Motivation



4 Exercises



Outline

1 Motivation



4 Exercises


Motivation Probability basics Power calculation Exercises Coin tossing

A hypothesis and a kind of test

“Null” Hypothesis: the coin is fair50% chance of heads, 50% chance of tails.

Structure of the data:Toss the coin a number of times, count heads.

The test:“Accept” hypothesis if within some distance of the mean under the null;“Reject” otherwise.

If we only had 4 tosses of the coin, what distance cutoffs could we use?Could accept (A) never,(B) when exactly the mean (2 heads),(C) when within 1 (1, 2, or 3 heads),or (D) always.

We don’t want to reject the null when it is true, though;How much accidental rejection would each possible cutoff give us?































If we only had 4 tosses of the coin, what distance cutoffs could we use?

Could accept (A) never,(B) when exactly the mean (2 heads),(C) when within 1 (1, 2, or 3 heads),or (D) always.




















Distribution of possible results

0.06

0.25

0.38

0.25

0.06

0.1

.2.3

.4P

roba

bilit

y

0 1 2 3 4P(2)=.38; P(1...3)=.88; P(0...4)=1

Distribution of numbers of heads in 4 tosses of a fair coin



Not enough data.

There is no way* to create such a test with four coin tossesso that the chance of accidental rejection under the “null” hypothesis(sometimes written H0) is less than 5%, a standard in social science.

* (Except the “never reject, no matter what” rule. Not very useful.)

What about 20 coin tosses?



Not enough data.






Not enough data.






Not enough data.






Distribution of possible results

0.000.000.000.000.000.01

0.04

0.07

0.12

0.16

0.18

0.16

0.12

0.07

0.04

0.010.000.000.000.000.00

0.0

5.1

.15

.2P

roba

bilit

y

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20P(10)=.18; P(9...11)=.5; P(8...12)=.74; P(7...13)=.88; P(6...14)=.96




The normal distribution

0.1

.2.3

.4P

roba

bilit

y de

nsity

−4 −2 0 2 4Deviations from mean

Distribution (limiting) of any well−behaved residual error



As sample size increases more:

0.0

2.0

4.0

6.0

8P

roba

bilit

y

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100



Motivation Probability basics Power calculation Exercises Terminology/Concepts The Basic Calculation Clusters Covariates Details

Types of error

Test result

“Reject Null,” “Accept Null,”Find an effect! Conclude no effect.

Truth:There is an effect Great! “Type II Error”

(low power)Truth:

There is NO effect “Type I Error” Great!(test size)

The probability of Type I error is what we just discussed: the “size” of the test.By convention, we are usually interested in tests of “size” 0.05.

The probability of Type II error is also very important;If P(failure to detect an effect) = 1 − κ, then the power of the test is κ.Power depends on anticipated effect size; typical desired power is 80% or higher.



Types of error

Test result



(low power)Truth:






Types of error

Test result



(low power)Truth:



The probability of Type II error is also very important;If P(failure to detect an effect) = 1 − κ, then the power of the test is κ.

Power depends on anticipated effect size; typical desired power is 80% or higher.



Types of error

Test result



(low power)Truth:






Rejecting H0 in critical region

Area: α/2Area: α/2

0P

roba

bilit

y de

nsity

0

Null distn. False rejection probability α

Significance level (test size) α

Note: two-sided test.Owen Ozier Sample Size and Power Calculations


Under an alternative:

0P

roba

bilit

y de

nsity

0

Null Under 1 SE effect

Suppose true effect were 1 SE (Standard Error):




0P

roba

bilit

y de

nsity

0


Power would only be approximately 0.17




0P

roba

bilit

y de

nsity

0 1 2 3


Suppose true effect were 3 SE’s (Standard Errors):

Note: this example follows slides by Marc Shotland (JPAL).Owen Ozier Sample Size and Power Calculations



0P

roba

bilit

y de

nsity

0 1 2 3


Power would be approximately 0.85

Note: this example follows slides by Marc Shotland (JPAL).Owen Ozier Sample Size and Power Calculations


Power calculation, visually

Effect

0P

roba

bilit

y de

nsity

0

Null distn. Effect distn. t_α/2 size t_1−κ power

How the power calculation formula works

Note: see the related figure in the Toolkit paper.Owen Ozier Sample Size and Power Calculations


The formula: for power κ and size α,

Effect > (t1−κ + tα/2)SE (β̂) Notation: t1−p = pth percentile of the t dist’n.

MDE = (t1−κ + tα/2)

√1

P(1 − P)

√σ2

N≈ (z1−κ + zα/2)

√1

P(1 − P)

√σ2

N

In practice (Stata): sampsiNote: Stata uses the normal rather than t distribution (avoiding the D.O.F. issue).

Where do these numbers come from, σ2 and the effect size? Two basic options:

Consider standardized effect sizes in terms of standard deviations

Draw on existing data: What is available that could inform your project?





MDE = (t1−κ + tα/2)

√1

P(1 − P)

√σ2

N

≈ (z1−κ + zα/2)

√1

P(1 − P)

√σ2

N









MDE = (t1−κ + tα/2)

√1

P(1 − P)

√σ2

N≈ (z1−κ + zα/2)

√1

P(1 − P)

√σ2

N









MDE = (t1−κ + tα/2)

√1

P(1 − P)

√σ2

N≈ (z1−κ + zα/2)

√1

P(1 − P)

√σ2

N









MDE = (t1−κ + tα/2)

√1

P(1 − P)

√σ2

N≈ (z1−κ + zα/2)

√1

P(1 − P)

√σ2

N









MDE = (t1−κ + tα/2)

√1

P(1 − P)

√σ2

N≈ (z1−κ + zα/2)

√1

P(1 − P)

√σ2

N







What if treatment is assigned by groups?

We have been thinking here of randomizing at the individual level.But in practice, we often randomize larger units.

Examples:

Entire schools are assigned to treatment or comparison groups,and we observe outcomes at the level of the individual pupil

Classes within a school are assigned to treatment or comparison groups,and we observe outcomes at the level of the individual pupil

Households are assigned to treatment or comparison groups,and we observe outcomes at the level of the individual family member

Sub-district locations are assigned to treatment or comparison groups,and we observe outcomes at the level of the individual road

Microfinance branch offices are assigned to treatment or comparison groups,and we observe outcomes at the level of the individual borrower

What does this do?It depends on how much variation is explained by the group each individual is in.




We have been thinking here of randomizing at the individual level.But in practice, we often randomize larger units.Examples:










We have been thinking here of randomizing at the individual level.But in practice, we often randomize larger units.Examples:









Intuitive example

You want to know how close the upcoming Kenyan referendum will be.

Method 1: Randomly select 50 people from entire Kenyan population

Method 2: Randomly select 5 families, and ask ten members of each extendedfamily their opinion



Intuitive example






Intuitive example






The formula

Scale the effective standard error by:

DesignEffect =√

1 + (ngroupsize − 1)ρ

ρ (“rho”) is the intra-class correlation.

In practice (Stata): loneway and sampclus

Recall earlier formula:

MDE = (t1−κ + tα/2)

√1

P(1 − P)

√σ2

N

But where does this ρ number come from? Two basic options:

Consider what might be reasonable assumptions

Draw on existing data (again):What is available that could inform your project?



The formula


DesignEffect =√


ρ (“rho”) is the intra-class correlation.In practice (Stata): loneway and sampclus


MDE = (t1−κ + tα/2)

√1

P(1 − P)

√σ2

N






The formula


DesignEffect =√




MDE = (t1−κ + tα/2)

√1

P(1 − P)

√σ2

N






The formula


DesignEffect =√




MDE = (t1−κ + tα/2)

√1

P(1 − P)

√σ2

N






The formula


DesignEffect =√




MDE = (t1−κ + tα/2)

√1

P(1 − P)

√σ2

N






The formula


DesignEffect =√




MDE = (t1−κ + tα/2)

√1

P(1 − P)

√σ2

N






Intra-class correlations we have known

Data source ICC (ρ)Madagascar Math + Language 0.5

Busia, Kenya Math + Language 0.22Udaipur, India Math + Language 0.23Mumbai, India Math + Language 0.29

Vadodara, India Math + Language 0.28Busia, Kenya Math 0.62

Source: Marc Shotland (JPAL) slides



Baseline and other covariate data

Controlling for covariates reduces the SE, so increases power.

Baseline survey data on the outcome of interest is especially useful when thewithin-individual correlation of outcomes (in the absence of treatment) ishigh. Consider academic test scores, for example.

However, it may risk contaminating an otherwise clean randomized design.(randomizing was supposed to avoid omitted variable biases in OLS!)

Stratifying is a good option:

On baseline outcome valuesOn fixed observables that have high predictive power(parents’ education, age, ...)On observables that delineate subpopulations you may want to test within(you might want to do another power calculation on this subsample)
































On baseline outcome values

On fixed observables that have high predictive power(parents’ education, age, ...)On observables that delineate subpopulations you may want to test within(you might want to do another power calculation on this subsample)








On baseline outcome valuesOn fixed observables that have high predictive power(parents’ education, age, ...)

On observables that delineate subpopulations you may want to test within(you might want to do another power calculation on this subsample)












0.0

1.0

2.0

3K

erne

l den

sity

est

imat

e

100 120 140 160 180Height (cm)

5.1 cm average per additional year in age; overall SD=12.1cm;

Kernel density estimate




0.0

2.0

4.0

6.0

8K

erne

l den

sity

est

imat

e

100 120 140 160 180Height (cm)

Age 8Age 9

Age 10Age 11

Age 12Age 13

Age 14

5.1 cm average per additional year in age; overall SD=12.1cm; within SD=7.21cm



Some things to watch out for

Imperfect compliance with treatment:

The most straightforward approach is to decrease the effect size.

Multiple treatments:

Optimal treatment structure depends on several things;May be better to have control group larger than any of the treatment groups.To get started, calculate power for the smallest effect you hope to test.Multiple test correction may be needed later. (recently: Michael Anderson)

How randomization is done:

Alternative test based on statistical work of Fisher in early 20th centuryIssues with 1,2,3,1,2,3... if alphabetical, not a problem if order is random(recently: McKenzie and Bruhn)

Much more: check out Duflo, Glennerster, and Kremer:Using randomization in development economics research: a toolkit

“A first comment is that, despite all the precision of these formulas,power calculations involve substantial guess work in practice.”





























Optimal treatment structure depends on several things;May be better to have control group larger than any of the treatment groups.

To get started, calculate power for the smallest effect you hope to test.Multiple test correction may be needed later. (recently: Michael Anderson)











Optimal treatment structure depends on several things;May be better to have control group larger than any of the treatment groups.To get started, calculate power for the smallest effect you hope to test.

Multiple test correction may be needed later. (recently: Michael Anderson)



































Alternative test based on statistical work of Fisher in early 20th century

Issues with 1,2,3,1,2,3... if alphabetical, not a problem if order is random(recently: McKenzie and Bruhn)











Alternative test based on statistical work of Fisher in early 20th centuryIssues with 1,2,3,1,2,3... if alphabetical, not a problem if order is random

(recently: McKenzie and Bruhn)






































Exercises

First make sure you have the files:

Make sure sampclus is installed; either by using findit sampclusor by copying the two files into C:/ado/plus/s/ (typically).

Various .do and .dta files

type-I-error-reject-null-when-true.do

type-II-error-fail-to-reject-when-alt-is-true.do

sampsi-syntax.do

icc-example.do

SampsiExerciseB.do -here we need to adjust the directory in the .do file before running.



Exercises






sampsi-syntax.do

icc-example.do




Exercises






sampsi-syntax.do

icc-example.do




Exercises






sampsi-syntax.do

icc-example.do




Exercises






sampsi-syntax.do

icc-example.do




Exercises






sampsi-syntax.do

icc-example.do



Documents

Sample Size and Power Calculations - Economics.ozier.comeconomics.ozier.com/owen/slides/ozier_powercalc_talk_20100914a.pdf · MotivationProbability basicsPower calculationExercises