Stata code for Sampling - World Bank (siteresources.worldbank.org/INTPOVRES/Resources/477227...)

Stata code for Sampling

Introduction

This is a basic introduction to the Stata code used for design effect and sample size calculations.

There are other options: many samplers use SAS, Excel, or Optimal Design for their calculations.

Design Effects

• To calculate design effects for a particular dataset, you first need to declare the complex survey design to Stata.

• This example uses the common two-stage cluster sample, but other more complicated designs are also supported.

svyset clusterid [pweight = hh_weight_trimmed], strata(strataid)
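Before estimating anything, it can help to confirm that Stata registered the design as intended. A minimal sketch, assuming the svyset declaration has already been run on the dataset in memory:

```stata
* Redisplay the current survey design settings (weights, PSU, strata)
svyset

* Summarize the number of PSUs and observations within each stratum
svydescribe
```

If the stratum or PSU counts look wrong here, the design effects computed later will be wrong as well.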


To calculate the design effects for the overall sample, use the following commands:

svy: mean hhsize

(running mean on estimation sample)

Survey: Mean estimation

Number of strata =      16          Number of obs    =     3265
Number of PSUs   =     409          Population size  =  7245851
                                    Design df        =      393

--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
      hhsize |   5.166811   .0668891      5.035306    5.298316
--------------------------------------------------------------

estat effects

----------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.       DEFF      DEFT
-------------+--------------------------------------------
      hhsize |   5.166811   .0668891     1.76149   1.32721
----------------------------------------------------------
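DEFF compares variances and DEFT compares standard errors, so DEFT is the square root of DEFF. A quick check, with the DEFF value copied by hand from the estat effects table:

```stata
* DEFT = sqrt(DEFF); this should be close to the 1.32721 reported by estat effects
display sqrt(1.76149)

* Implied standard error under simple random sampling: SE / DEFT
display .0668891 / sqrt(1.76149)
```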


Over subpopulations:

svy: mean hhsize, over(rural)

(running mean on estimation sample)

Survey: Mean estimation

Number of strata =      16          Number of obs    =     3265
Number of PSUs   =     409          Population size  =  7245851
                                    Design df        =      393

            1: rural = 1
            2: rural = 2

--------------------------------------------------------------
             |             Linearized
        Over |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
hhsize       |
           1 |   5.436019   .0809182      5.276932    5.595105
           2 |    4.41353     .10837      4.200473    4.626587
--------------------------------------------------------------

estat effects, srssubpop

            1: rural = 1
            2: rural = 2

----------------------------------------------------------
             |             Linearized
        Over |       Mean   Std. Err.       DEFF      DEFT
-------------+--------------------------------------------
hhsize       |
           1 |   5.436019   .0809182     1.58701   1.25977
           2 |    4.41353     .10837     2.04029   1.42839
----------------------------------------------------------

Design Effects

Then to calculate ρ, use the following formula:

$$\mathit{deff} = 1 + \rho\,(m - 1)$$

where m is the cluster size. You will know the cluster size either from the survey documentation, or it can be calculated from the data:

* collapse replaces the data in memory, so preserve it first
preserve
gen n = 1
* one row per cluster, with n = number of observations in that cluster
collapse (sum) n, by(clusterid)
sum n

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
           n |        409    7.982885    .1298585          7          8

restore
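Solving the design-effect relationship for ρ gives ρ = (deff − 1)/(m − 1). A minimal sketch of that arithmetic for hhsize, with the DEFF and the mean cluster size copied by hand from the outputs in this section (the scalar names are illustrative):

```stata
* Assumed inputs: overall DEFF for hhsize and mean cluster size
scalar deff_hh = 1.76149
scalar m_bar   = 7.982885

* rho = (deff - 1) / (m - 1); roughly 0.11 for these values
display "rho = " (scalar(deff_hh) - 1) / (scalar(m_bar) - 1)
```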

Sample Size

Maize yield (kg/hect)

• Mean: 802.6 kg/hect
• St. Dev.: 1027.79
• Deff: 1.28 (ρ=0.03)

Irrigation Usage

• Mean: 0.15
• Deff: 5.37 (ρ=0.49)

Fertilizer Usage

• Mean: 0.553
• Deff: 3.63 (ρ=0.29)

T values for two-tailed test (large-sample critical values, i.e. standard normal z)

Confidence level    50%       80%        90%        95%       98%       99%
Critical value      0.67449   1.281552   1.644854   1.95996   2.32635   2.57583
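These critical values can be regenerated in Stata with the inverse standard normal function. A short sketch, assuming the percentages are two-sided confidence levels:

```stata
* Two-tailed critical value for confidence level c is invnormal(1 - (1 - c)/2)
foreach c of numlist 0.50 0.80 0.90 0.95 0.98 0.99 {
    display "confidence " `c' "   z = " invnormal(1 - (1 - `c')/2)
}
```

The same function gives the power term used in sample size formulas: invnormal(0.90) is the 1.281552 shown in the 80% column.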

Sample Size

• What is the required sample size to detect a 10% change in maize yields?

• Note that this is the same question as asking for a 10% difference between related questions in the same dataset (though the variance would probably be lower).

$$n = \frac{4\,\sigma^{2}\,\left(z_{1-\alpha/2} + z_{1-\beta}\right)^{2}}{D^{2}}\,\left[1 + \rho\,(m - 1)\right]$$
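The sample size formula can be checked by hand before turning to sampsi. The sketch below reads the formula as giving the total n across both groups (an interpretation, hence the division by 2) and uses the maize inputs with a difference of 882.6 − 802.6 = 80; the scalar names are illustrative.

```stata
* Inputs from the maize example (two-sided alpha = 0.05, power = 0.90)
scalar sd_y  = 1027.79
scalar delta = 882.6 - 802.6
scalar z_a   = invnormal(1 - 0.05/2)
scalar z_b   = invnormal(0.90)

* Total n across both groups, ignoring clustering (rho = 0)
scalar n_tot = 4 * scalar(sd_y)^2 * (scalar(z_a) + scalar(z_b))^2 / scalar(delta)^2

* Per-group n, rounded up; this should match the 3469 that sampsi reports
display "per group, no clustering: " ceil(scalar(n_tot) / 2)

* Same n inflated by 1 + rho*(m - 1), assuming rho = 0.03 and m = 10
display "per group, clustered:     " ceil(scalar(n_tot) / 2 * (1 + 0.03 * (10 - 1)))
```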

sampsi 802.6 882.6, sd1(1027.79)

Estimated sample size for two-sample comparison of means

Test Ho: m1 = m2, where m1 is the mean in population 1
                    and m2 is the mean in population 2

Assumptions:

         alpha =   0.0500  (two-sided)
         power =   0.9000
            m1 =    802.6
            m2 =    882.6
           sd1 =  1027.79
           sd2 =  1027.79
         n2/n1 =     1.00

Estimated required sample sizes:

            n1 =     3469
            n2 =     3469
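In Stata 13 and later, sampsi has been superseded by the power command. A rough equivalent of the calculation above is sketched below. Because power twomeans defaults to 80% power and a t test, power(0.9) and knownsds are set here to mimic sampsi's assumptions; the result may still differ slightly because of rounding.

```stata
* Two-sided test at alpha = 0.05 (the default), 90% power, common known sd
power twomeans 802.6 882.6, sd(1027.79) power(0.9) knownsds
```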

sampsi 802.6 882.6, sd1(1027.79) onesided

Estimated sample size for two-sample comparison of means

Test Ho: m1 = m2, where m1 is the mean in population 1
                    and m2 is the mean in population 2

Assumptions:

         alpha =   0.0500  (one-sided)
         power =   0.9000
            m1 =    802.6
            m2 =    882.6
           sd1 =  1027.79
           sd2 =  1027.79
         n2/n1 =     1.00

Estimated required sample sizes:

            n1 =     2828
            n2 =     2828

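sampsi 802.6 882.6, sd1(1027.79) power(0.8) onesided

Estimated sample size for two-sample comparison of means

Test Ho: m1 = m2, where m1 is the mean in population 1
                    and m2 is the mean in population 2

Assumptions:

         alpha =   0.0500  (one-sided)
         power =   0.8000
            m1 =    802.6
            m2 =    882.6
           sd1 =  1027.79
           sd2 =  1027.79
         n2/n1 =     1.00

Estimated required sample sizes:

            n1 =     2041
            n2 =     2041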

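sampsi 802.6 882.6, ratio(2) sd1(1027.79)

Estimated sample size for two-sample comparison of means

Test Ho: m1 = m2, where m1 is the mean in population 1
                    and m2 is the mean in population 2

Assumptions:

         alpha =   0.0500  (two-sided)
         power =   0.9000
            m1 =    802.6
            m2 =    882.6
           sd1 =  1027.79
           sd2 =  1027.79
         n2/n1 =     2.00

Estimated required sample sizes:

            n1 =     2602
            n2 =     5204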



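sampsi 802.6 1043.38, sd1(1027.79) onesided

Estimated sample size for two-sample comparison of means

Test Ho: m1 = m2, where m1 is the mean in population 1
                    and m2 is the mean in population 2

Assumptions:

         alpha =   0.0500  (one-sided)
         power =   0.9000
            m1 =    802.6
            m2 =  1043.38
           sd1 =  1027.79
           sd2 =  1027.79
         n2/n1 =     1.00

Estimated required sample sizes:

            n1 =      313
            n2 =      313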


sampsi 802.6 882.6, sd1(1027.79) onesided method(change) pre(1) post(1) r01(0.7)

Estimated sample size for two samples with repeated measures

Assumptions:

                                    alpha =   0.0500  (one-sided)
                                    power =   0.9000
                                       m1 =    802.6
                                       m2 =    882.6
                                      sd1 =  1027.79
                                      sd2 =  1027.79
                                    n2/n1 =     1.00
         number of follow-up measurements =        1
          number of baseline measurements =        1
 correlation between baseline & follow-up =    0.700

Method: CHANGE

 relative efficiency =    1.667
    adjustment to sd =    0.775
        adjusted sd1 =  796.123

Estimated required sample sizes:

                  n1 =     1697
                  n2 =     1697
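The repeated-measures gain comes entirely from the smaller standard deviation of change scores. A sketch of the arithmetic behind the CHANGE adjustment, assuming equal variances at baseline and follow-up (the scalar name is illustrative):

```stata
* Correlation between baseline and follow-up, as passed to r01()
scalar r01 = 0.7

* Variance of a change score is 2*sigma^2*(1 - r01), so the sd is scaled by:
display "adjustment to sd    = " sqrt(2 * (1 - scalar(r01)))
display "adjusted sd1        = " 1027.79 * sqrt(2 * (1 - scalar(r01)))

* Efficiency relative to a single cross-section is the inverse variance ratio
display "relative efficiency = " 1 / (2 * (1 - scalar(r01)))
```

These should land close to the 0.775, 796.123 and 1.667 printed by sampsi. The change method only helps when the correlation is above 0.5; below that, the adjustment factor exceeds 1.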

sampsi 802.6 907.6, sd1(1027.79) sd2(1456.85) onesided n1(1500) n2(1246)

Estimated power for two-sample comparison of means

Test Ho: m1 = m2, where m1 is the mean in population 1
                    and m2 is the mean in population 2

Assumptions:

         alpha =   0.0500  (one-sided)
            m1 =    802.6
            m2 =    907.6
           sd1 =  1027.79
           sd2 =  1456.85
sample size n1 =     1500
            n2 =     1246
         n2/n1 =     0.83

Estimated power:

         power =   0.6897
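The reported power can be reproduced with the normal approximation. A minimal sketch, assuming a one-sided test at alpha = 0.05 (the scalar name is illustrative):

```stata
* Standard error of the difference in means with unequal sds and group sizes
scalar se_diff = sqrt(1027.79^2/1500 + 1456.85^2/1246)

* Power = Phi(difference/se - z_alpha); this should be close to 0.6897
display "power = " normal((907.6 - 802.6)/scalar(se_diff) - invnormal(0.95))
```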

sampsi 802.6 907.6, sd1(1027.79) sd2(1456.85) onesided method(change) pre(1) post(1) r01(.7) n1(1500) n2(1246)

Estimated power for two samples with repeated measures

Assumptions:

                                    alpha =   0.0500  (one-sided)
                                       m1 =    802.6
                                       m2 =    907.6
                                      sd1 =  1027.79
                                      sd2 =  1456.85
                           sample size n1 =     1500
                                       n2 =     1246
                                    n2/n1 =     0.83
         number of follow-up measurements =        1
          number of baseline measurements =        1
 correlation between baseline & follow-up =    0.700

Method: CHANGE

 relative efficiency =    1.667
    adjustment to sd =    0.775

Estimated power:

               power =    0.868

Can’t forget the Deff…

No design effects    With design effects
2828                 3620
2041                 2612
313                  401
849                  1087

…where deff = 1.28
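The right-hand column is the left-hand column multiplied by the design effect. A short sketch that reproduces it, assuming the slide rounds to the nearest integer:

```stata
* Multiply each unadjusted n by deff = 1.28 and round to the nearest integer
foreach n of numlist 2828 2041 313 849 {
    display `n' " -> " round(`n' * 1.28)
}
```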


n1 = 313

n2 = 313

sampclus, obsclus(10) rho(0.03)

Sample Size Adjusted for Cluster Design

   n1 (uncorrected) = 313
   n2 (uncorrected) = 313

   Intraclass correlation     = .03
   Average obs. per cluster   = 10
   Minimum number of clusters = 80

Estimated sample size per group:

   n1 (corrected) = 398
   n2 (corrected) = 398
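sampclus appears to be a user-written add-on, not a built-in command, and it is run immediately after sampsi. Its adjustment can be checked by hand, as in the sketch below (the scalar name is illustrative). Note that ρ = 0.03 with 10 observations per cluster implies a deff of 1.27, not the rounded 1.28, which is why this gives 398 instead of 401.

```stata
* Design effect implied by rho = 0.03 and 10 observations per cluster
scalar deff_cl = 1 + 0.03 * (10 - 1)

* Corrected n per group, rounded up
display "n per group = " ceil(313 * scalar(deff_cl))

* Clusters needed for both groups combined, at 10 observations each
display "clusters    = " ceil(2 * ceil(313 * scalar(deff_cl)) / 10)
```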
