noah-weaver
Ch 4: Stratified Random Sampling (STS)
DEFN: A stratified random sample is obtained by separating the population units into non-overlapping groups, called strata, and then selecting a random sample from each stratum.
2
Procedure
Divide sampling frame into mutually exclusive and exhaustive strata
  Assign each SU to one and only one stratum
Select a random sample from each stratum
  Select random sample from stratum 1
  Select random sample from stratum 2
  …
[Figure: population divided into strata h = 1, h = 2, …, h = H]
3
Ag example
Divide 3078 counties into 4 strata corresponding to regions of the country
  Northeast (h = 1)
  North central (h = 2)
  South (h = 3)
  West (h = 4)
Select an SRS from each stratum
  In this example, stratum sample size is proportional to stratum population size
  300 is 9.75% of 3078
  Each stratum sample size is 9.75% of stratum population
4
Ag example – 2

Stratum (h)   Stratum size ($N_h$)   Sample size ($n_h$)
1 (NE)        220                    21
2 (NC)        1054                   103
3 (S)         1382                   135
4 (W)         422                    41
Total         3078                   300
5
Procedure – 2
Need to have a stratum value for each SU in the frame
Minimum set of variables in sampling frame: SU id, stratum assignment

Stratum (h)   SU (j)
1             1
1             2
1             3
2             1
2             2
…             …
6
Ag example – 3

Stratum (h)   SU (j)
1             1
1             2
1             3
…             …
1             220
2             1
2             2
…             …
4             421
4             422
7
Procedure – 3
Each stratum sample is selected independently of others
  New set of random numbers for each stratum
  Basis for deriving properties of estimators
Design within a stratum
  For Ch 4, we will assume an SRS is selected within each stratum
  Can use any probability design within a stratum
  Sample designs do not need to be the same across strata
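The selection procedure above can be sketched in Python. This is a minimal illustration, not part of the original slides: the function name `stratified_srs`, the dictionary layout of the frame, and the seed are assumptions chosen for the example. It draws an SRS without replacement in each stratum, with each draw using fresh random numbers from the generator.

```python
import random

def stratified_srs(frame, n_h, seed=None):
    """Draw an independent SRS (without replacement) within each stratum.

    frame: dict mapping stratum label h -> list of SU ids in that stratum
    n_h:   dict mapping stratum label h -> stratum sample size
    (Both layouts are hypothetical, chosen for this sketch.)
    """
    rng = random.Random(seed)
    # Each call to rng.sample consumes new random numbers, so the stratum
    # samples are selected independently of one another.
    return {h: rng.sample(units, n_h[h]) for h, units in frame.items()}

# Frame for the Ag example: SU ids 1..N_h within each of the 4 strata
N_h = {1: 220, 2: 1054, 3: 1382, 4: 422}
n_h = {1: 21, 2: 103, 3: 135, 4: 41}
frame = {h: list(range(1, N + 1)) for h, N in N_h.items()}

sample = stratified_srs(frame, n_h, seed=2024)
for h in sample:
    print(h, len(sample[h]), sorted(sample[h])[:5])
```

The only frame variables the sketch needs are the SU id and the stratum assignment, which matches the minimum set of frame variables listed earlier.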
8
Uses for STS
To improve representativeness of sample
  In SRS, can get ANY combination of n elements in the sample
  In SYS, we severely restricted the set to k possible samples
    Can get “bad” samples
    Less likely to get unbalanced samples if frame is sorted using a variable correlated with Y
9
Uses for STS – 2
To improve representativeness of sample - 2
  In STS, we also exclude samples
    Explicitly choose strata to restrict possible samples
    Improve chance of getting representative samples if strata are used to encourage spread across variation in population
10
Uses for STS – 3
To improve precision of estimates for population parameters
  Achieved by creating strata so that
    variation WITHIN stratum is small
    variation AMONG strata is large
  Uses same principle as “blocking” in experimental design
  Improve precision of estimate for population parameter by obtaining precise estimates within each stratum
11
Uses for STS – 4
To study specific subpopulations
  Define strata to be subpopulations of interest
  Examples
    Male v. female
    Racial/ethnic minorities
    Geographic regions
    Population density (rural v. urban)
    College classification
  Can establish sample size within each stratum to achieve desired precision level for estimates of subpopulations
12
Uses for STS – 5
To assist in implementing operational aspects of survey
  May wish to apply different sampling and data collection procedures for different groups
  Agricultural surveys (sample designs)
    Large farms in one stratum are selected using a list frame
    Smaller farms belong to a second stratum, and are selected using an area sample
  Survey of employers (data collection methods)
    Large firms: use mail survey because information is too voluminous to get over the phone
    Small firms: telephone survey
13
Estimation strategy
Objective: estimate population total
Obtain estimates for each stratum
  Estimate stratum population total
    Use SRS estimator for stratum total
  Estimate variance of estimator in each stratum
    Use SRS estimator for variance of estimated stratum total
Pool estimates across strata
  Sum stratum total estimates and variance estimates across strata
  Variance formula justified by independence of samples across strata
14
Ag example – 4

Stratum (h)   Stratum size ($N_h$)   Sample size ($n_h$)   Sample mean ($\bar{y}_h$)   Estimated stratum total ($\hat{t}_h$)
1 (NE)        220                    21                    97,630                      21,478,558
2 (NC)        1054                   103                   300,504                     316,731,379
3 (S)         1382                   135                   211,315                     292,037,391
4 (W)         422                    41                    662,295                     279,488,706
Total         3078                   300

$\bar{y}_h$: acres devoted to farms per county
$\hat{t}_h$: total farm acres for stratum
15
Ag example – 5
Estimated total farm acres in US

$$\hat{t}_{str} = \sum_{h=1}^{H} \hat{t}_h = \sum_{h=1}^{H} N_h \bar{y}_h$$
$$= 220(97{,}630) + 1054(300{,}504) + 1382(211{,}315) + 422(662{,}295)$$
$$= 909{,}736{,}034 \text{ farm acres in US}$$
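As a quick check of the arithmetic, the stratified total can be recomputed from the tabled stratum sizes and sample means. This is a sketch with hypothetical variable names. Because the tabled means are rounded to whole acres, the result differs from the slide's 909,736,034 by a few hundred acres.

```python
# Stratum sizes and (rounded) sample means from the Ag example table
N_h = [220, 1054, 1382, 422]
ybar_h = [97_630, 300_504, 211_315, 662_295]

# SRS estimator of each stratum total: t_hat_h = N_h * ybar_h
t_hat_h = [N * ybar for N, ybar in zip(N_h, ybar_h)]

# Stratified estimator of the population total: sum over strata
t_hat_str = sum(t_hat_h)
print(t_hat_h)
print(t_hat_str)
```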
16
Ag example – 6

Stratum (h)   Stratum size ($N_h$)   Sample size ($n_h$)   Sample variance ($s_h^2$)
1 (NE)        220                    21                    7,647,472,708
2 (NC)        1054                   103                   29,618,183,543
3 (S)         1382                   135                   53,587,487,856
4 (W)         422                    41                    396,185,950,266
Total         3078                   300
17
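Ag example – 7
Estimated variance for estimated total farm acres in US

$$\hat{V}(\hat{t}_{str}) = \sum_{h=1}^{H} \hat{V}(\hat{t}_h) = \sum_{h=1}^{H} \left(1 - \frac{n_h}{N_h}\right) N_h^2 \frac{s_h^2}{n_h}$$
$$= \left(1 - \frac{21}{220}\right) 220^2 \, \frac{7{,}647{,}472{,}708}{21} + (\ldots)1054^2(\ldots) + (\ldots)1382^2(\ldots) + (\ldots)422^2(\ldots)$$
$$= 2.5419 \times 10^{15}$$
$$SE(\hat{t}_{str}) = \sqrt{\hat{V}(\hat{t}_{str})} = 50{,}417{,}248 \text{ acres}$$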
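The variance calculation can be reproduced from the tabled sample variances. The sketch below uses a hypothetical helper `var_t_hat_str` and sums the SRS variance estimates of the stratum totals, including the stratum finite population corrections.

```python
import math

N_h = [220, 1054, 1382, 422]
n_h = [21, 103, 135, 41]
s2_h = [7_647_472_708, 29_618_183_543, 53_587_487_856, 396_185_950_266]

def var_t_hat_str(N_h, n_h, s2_h):
    """Sum over strata of (1 - n_h/N_h) * N_h^2 * s_h^2 / n_h."""
    return sum((1 - n / N) * N**2 * s2 / n
               for N, n, s2 in zip(N_h, n_h, s2_h))

v_hat = var_t_hat_str(N_h, n_h, s2_h)
se = math.sqrt(v_hat)
print(f"V-hat = {v_hat:.4e}, SE = {se:,.0f} acres")
```

Adding the stratum variances is valid because the stratum samples are selected independently.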
18
Ag example – 8
Compare with SRS estimates

$$N\bar{y} = 916{,}927{,}100 \text{ acres}$$
$$\hat{V}(N\bar{y}) = 3.38368 \times 10^{15}$$
$$SE(N\bar{y}) = \sqrt{\hat{V}(N\bar{y})} = 58{,}169{,}381 \text{ acres}$$
19
Estimation strategy - 2
Objective: estimate population mean
Divide estimated total by population size

$$\bar{y}_{str} = \frac{\hat{t}_{str}}{N}$$

OR equivalently,
Obtain estimates for each stratum
  Estimate stratum mean with stratum sample mean
Pool estimates across strata
  Use weighted average of stratum sample means with weights proportional to stratum sizes $N_h$
20
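Ag example – 9
Estimated mean farm acres / county

$$\bar{y}_{str} = \frac{\hat{t}_{str}}{N} = \frac{909{,}736{,}034}{3078} \approx 295{,}561 \text{ farm acres / county}$$

or

$$\bar{y}_{str} = \sum_{h=1}^{H} \frac{N_h}{N} \bar{y}_h = \frac{220}{3078}(97{,}630) + \frac{1054}{3078}(300{,}504) + \frac{1382}{3078}(211{,}315) + \frac{422}{3078}(662{,}295)$$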
21
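Ag example – 10
Estimate variance of estimated mean farm acres / county

$$\hat{V}(\bar{y}_{str}) = \frac{1}{N^2} \hat{V}(\hat{t}_{str})$$

or

$$\hat{V}(\bar{y}_{str}) = \sum_{h=1}^{H} \left(\frac{N_h}{N}\right)^2 \hat{V}(\bar{y}_h)$$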
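The first form of the variance formula gives a quick numerical answer for the Ag example. The sketch below takes the estimated total and its estimated variance from the earlier Ag example slides as inputs; the variable names are hypothetical.

```python
import math

N = 3078
t_hat_str = 909_736_034      # estimated total farm acres (Ag example – 5)
v_t_hat_str = 2.5419e15      # estimated variance of the total (Ag example – 7)

# Mean = total / N, and V-hat(mean) = V-hat(total) / N^2
ybar_str = t_hat_str / N
v_ybar_str = v_t_hat_str / N**2
se_ybar_str = math.sqrt(v_ybar_str)

print(f"ybar_str = {ybar_str:,.1f} acres/county")
print(f"SE(ybar_str) = {se_ybar_str:,.1f} acres/county")
```

Equivalently, the standard error of the mean is the standard error of the total divided by N.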
22
Notation
Index set for stratum h = 1, 2, …, H
  $U_h = \{1, 2, \ldots, N_h\}$
  $N_h$ = number of OUs in stratum h in the population
Partition sample of size n across strata
  $n_h$ = number of sample units from stratum h (fixed)
  $\mathcal{S}_h$ = index set for sample belonging to stratum h
[Figure: population divided into Stratum 1 (h = 1), h = 2, …, Stratum H (h = H)]
23
Notation – 2
Population sizes
  $N_h$ = number of OUs in stratum h in the population
  $N = N_1 + N_2 + \ldots + N_H$
Partition sample of size n across strata
  $n_h$ = number of sample units from stratum h
  $n = n_1 + n_2 + \ldots + n_H$
  The stratum sample sizes are fixed
    In domain estimation, they are random
For now, we will assume that the sampling unit (SU) is an observation unit (OU)
24
Notation – 3
Response variable
  $y_{hj}$ = characteristic of interest for OU j in stratum h
Population and stratum totals

$$t_h = \sum_{j=1}^{N_h} y_{hj} \quad \text{(population total in stratum } h\text{)}$$
$$t = \sum_{h=1}^{H} t_h \quad \text{(population total)}$$
25
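Notation – 4
Population and stratum means

$$\bar{y}_{hU} = \frac{\sum_{j=1}^{N_h} y_{hj}}{N_h} \quad \text{(population mean in stratum } h\text{)}$$
$$\bar{y}_U = \frac{t}{N} = \frac{\sum_{h=1}^{H} \sum_{j=1}^{N_h} y_{hj}}{N} \quad \text{(overall population mean)}$$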
26
Notation – 5
Population stratum variance

$$S_h^2 = \frac{\sum_{j=1}^{N_h} (y_{hj} - \bar{y}_{hU})^2}{N_h - 1} \quad \text{(population variance in stratum } h\text{)}$$
27
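Notation – 6
SRS estimators for stratum parameters

$$\bar{y}_h = \frac{\sum_{j \in \mathcal{S}_h} y_{hj}}{n_h}$$
$$\hat{t}_h = \frac{N_h}{n_h} \sum_{j \in \mathcal{S}_h} y_{hj} = N_h \bar{y}_h$$
$$s_h^2 = \frac{\sum_{j \in \mathcal{S}_h} (y_{hj} - \bar{y}_h)^2}{n_h - 1}$$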
28
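STS estimators
For population total

$$\hat{t}_{str} = \sum_{h=1}^{H} \hat{t}_h = \sum_{h=1}^{H} N_h \bar{y}_h$$
$$\hat{V}(\hat{t}_{str}) = \sum_{h=1}^{H} \hat{V}(\hat{t}_h) = \sum_{h=1}^{H} \left(1 - \frac{n_h}{N_h}\right) N_h^2 \frac{s_h^2}{n_h}$$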
29
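STS estimators – 2
For population mean

$$\bar{y}_{str} = \frac{\hat{t}_{str}}{N} = \sum_{h=1}^{H} \frac{N_h}{N} \bar{y}_h$$
$$\hat{V}(\bar{y}_{str}) = \frac{1}{N^2} \hat{V}(\hat{t}_{str}) = \sum_{h=1}^{H} \left(\frac{N_h}{N}\right)^2 \hat{V}(\bar{y}_h)$$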
30
STS estimators – 3 For population proportion
31
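Properties
STS estimators are unbiased
  Each estimate of stratum population mean or total is unbiased (from SRS)
  $\bar{y}_{str}$ is unbiased estimator of $\bar{y}_U$
  $\hat{t}_{str}$ is unbiased estimator of $t$
  $\hat{p}_{str}$ is unbiased estimator of $p$

$$E\left[\sum_{h=1}^{H} \frac{N_h}{N} \bar{y}_h\right] = \sum_{h=1}^{H} \frac{N_h}{N} E[\bar{y}_h] = \sum_{h=1}^{H} \frac{N_h}{N} \bar{y}_{hU} = \bar{y}_U$$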
32
Properties – 2
Inclusion probability for SU j in stratum h
  Definition in words:
  Formula: $\pi_{hj}$ =
33
Properties – 3
In general, for any stratification scheme, STS will provide a more precise estimate of the population parameters (mean, total, proportion) than SRS
  For example, $V(\bar{y}_{str}) \le V(\bar{y})$
Confidence intervals
  Same form (using $z_{\alpha/2}$)
  Different CLT
34
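Sampling weights
Note that

$$\hat{t}_{str} = \sum_{h=1}^{H} \hat{t}_h = \sum_{h=1}^{H} N_h \bar{y}_h = \sum_{h=1}^{H} \sum_{j \in \mathcal{S}_h} \frac{N_h}{n_h} y_{hj} = \sum_{h=1}^{H} \sum_{j \in \mathcal{S}_h} w_{hj} y_{hj}$$

Sampling weight for SU j in stratum h

$$w_{hj} = \frac{N_h}{n_h}$$

A sampling weight is a measure of the number of units in the population represented by SU j in stratum h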
35
Example
Note: weights for each OU within a stratum are the same

Stratum (h)   $N_h$   $n_h$   $w_{hj} = N_h / n_h$
h = 1         6       3       6/3 = 2
h = 2         2       2       2/2 = 1
h = 3         4       1       4/1 = 4
h = 4         5       3       5/3 = 1.67
Total         17      9
36
Example – 2
Dataset from study

Stratum (h)   $N_h$   $n_h$   $w_{hj}$   $y_{hj}$
1             6       3       2          53
1             6       3       2          107
1             6       3       2          83
2             2       2       1          34
2             2       2       1          22
3             4       1       4          90
4             5       3       1.67       12
4             5       3       1.67       34
4             5       3       1.67       15
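The weighted form of the estimator can be applied directly to a dataset laid out like this one. The sketch below is illustrative only: the tuple layout and variable names are assumptions, and it uses the exact weight $N_h / n_h$ instead of the rounded 1.67.

```python
# Each record: (stratum h, N_h, n_h, y_hj), taken from the example dataset
rows = [
    (1, 6, 3, 53), (1, 6, 3, 107), (1, 6, 3, 83),
    (2, 2, 2, 34), (2, 2, 2, 22),
    (3, 4, 1, 90),
    (4, 5, 3, 12), (4, 5, 3, 34), (4, 5, 3, 15),
]

# Sampling weight w_hj = N_h / n_h for each sampled unit
weights = [N / n for (_, N, n, _) in rows]

# The weights sum to the population size N = 17
N_total = sum(weights)

# Weighted sum of responses estimates the population total
t_hat_str = sum(w * y for w, (_, _, _, y) in zip(weights, rows))

# Weighted mean estimates the population mean
ybar_str = t_hat_str / N_total
print(N_total, round(t_hat_str, 2), round(ybar_str, 2))
```

Here the estimated total is 2(243) + 1(56) + 4(90) + (5/3)(61), which is about 1003.67.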
37
Sampling weights – 2
For STS estimators presented in Ch 4, sampling weight is the inverse inclusion probability

$$\pi_{hj} = \frac{n_h}{N_h}$$
$$w_{hj} = \frac{1}{\pi_{hj}} = \frac{N_h}{n_h}$$
38
Defining strata
Depends on purpose of stratification
  Improved representativeness
  Improved precision
  Subpopulation estimates
  Implementing operational aspects
If possible, use factors related to variation in characteristic of interest, Y
  Geography, political boundaries, population density
  Gender, ethnicity/race, ISU classification
  Size or type of business
Remember
  Stratum variable must be available for all OUs
39
Allocation strategies
Want to sample n units from the population
An allocation rule defines how n will be spread across the H strata and thus defines values for $n_h$
Overview for estimating population parameters

Stratum costs same?   Stratum variances same?   Allocation rule
No                    No                        Optimal
Yes                   No                        Neyman
Yes                   Yes                       Proportional

Neyman and proportional allocation are special cases of optimal allocation
40
Allocation strategies – 2
Focus is on estimating parameter for entire population
  We’ll look at subpopulations later
Factors affecting allocation rule
  Number of OUs in stratum
  Data collection costs within strata
  Within-stratum variance
41
Proportional allocation
Stratum sample size allocated in proportion to population size within stratum
Allocation rule

$$n_h = n \frac{N_h}{N}, \quad \text{equivalently} \quad \frac{n_h}{N_h} = \frac{n}{N}$$
42
Ag example – 11

Stratum h   Stratum total $N_h$   Stratum sample size $n_h$   $n_h = n (N_h / N)$
1 (NE)      220                   21                          .0975 (220) = 21.4
2 (NC)      1054                  103                         .0975 (1054) = 102.7
3 (S)       1382                  135                         .0975 (1382) = 134.7
4 (W)       422                   41                          .0975 (422) = 41.1
Total       N = 3078              300 = n
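The proportional allocation in this table can be reproduced with a few lines of Python. This is a sketch under the assumption that the final sizes come from simple rounding to the nearest integer; the function name `proportional_allocation` is hypothetical.

```python
def proportional_allocation(N_h, n):
    """Unrounded stratum sample sizes n_h = n * N_h / N."""
    N = sum(N_h)
    return [n * Nh / N for Nh in N_h]

N_h = [220, 1054, 1382, 422]   # NE, NC, S, W
exact = proportional_allocation(N_h, 300)
rounded = [round(x) for x in exact]

print([f"{x:.1f}" for x in exact])
print(rounded, sum(rounded))
```

Simple rounding happens to sum to n = 300 here. In general the rounded sizes may need a small adjustment to hit the target n.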
43
Proportional allocation – 2
Proportional allocation rule implies
  Sampling fraction for stratum h is constant across strata
    $$\frac{n_h}{N_h} = \frac{n}{N}$$
  Inclusion probability is constant for all SUs in population
    $$\pi_{hj} = \frac{n_h}{N_h} = \frac{n}{N}$$
  Sampling weight for each unit is constant
    $$w_{hj} = \frac{1}{\pi_{hj}} = \frac{N}{n}$$
44
Proportional allocation – 3
STS with proportional allocation leads to a self-weighting sample
What is a self-weighting sample?
  If $w_{hj}$ has the same value for every OU in the sample, the sample is said to be self-weighting
  Since each weight is the same, each sample unit represents the same number of units in the population
  For self-weighting samples, the estimator for the population mean $\bar{y}_{str}$ reduces to the sample mean $\bar{y}$
  The estimator for the variance of $\bar{y}_{str}$ does NOT necessarily reduce to the SRS estimator for the variance of $\bar{y}$
45
Proportional allocation – 4
Check to see that an STS with proportional allocation generates a self-weighting sample
  Is the sample weight $w_{hj}$ the same for each OU?
  Is the estimator for the population mean $\bar{y}_{str}$ equal to the sample mean $\bar{y}$?
  What happens to the variance of $\bar{y}_{str}$?
46
Ag example – 12

Stratum h   Stratum total $N_h$   Stratum sample size $n_h$   Sample weight $w_{hj}$
1 (NE)      220                   21                          220/21 = 10.5
2 (NC)      1054                  103                         1054/103 = 10.2
3 (S)       1382                  135                         1382/135 = 10.2
4 (W)       422                   41                          422/41 = 10.3
Total       N = 3078              n = 300

Even though we have used proportional allocation, rounding in setting sample sizes can lead to unequal (but approximately equal) weights
47
Neyman allocation
Suppose within-stratum variances $S_h^2$ vary across strata
Stratum sample size allocated in proportion to
  Population size within stratum $N_h$
  Population standard deviation within stratum $S_h$
Allocation rule

$$n_h = n \frac{N_h S_h}{\sum_{l=1}^{H} N_l S_l}$$
48
Caribou survey example

Stratum h   $N_h$     $S_h$    $N_h S_h$    $n \, N_h S_h / \sum_{l=1}^{H} N_l S_l$   $n_h$   $w_{hj}$
A           400       3,000    1,200,000    96.26                                     96      400/96 = 4.17
B           30        2,000    60,000       4.81                                      5       30/10 = 3.00
C           61        9,000    549,000      44.04                                     44      61/37 = 1.65
D           18        2,000    36,000       2.89                                      3       18/6 = 3.00
E           70        12,000   840,000      67.38                                     67      70/39 = 1.79
F           120       1,000    120,000      9.63                                      10      120/21 = 5.71
Total       N = 699            $\sum_{l=1}^{H} N_l S_l$ = 2,805,000                           n = 225
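The Neyman allocation column of the caribou table can be recomputed as follows. This is a minimal sketch with a hypothetical function name. It reproduces the unrounded allocations and their rounded values, and it does not address the sample weight column.

```python
def neyman_allocation(N_h, S_h, n):
    """Unrounded stratum sizes n_h = n * N_h * S_h / sum_l(N_l * S_l)."""
    products = [N * S for N, S in zip(N_h, S_h)]
    total = sum(products)
    return [n * p / total for p in products]

# Strata A..F from the caribou survey example
N_h = [400, 30, 61, 18, 70, 120]
S_h = [3000, 2000, 9000, 2000, 12000, 1000]

exact = neyman_allocation(N_h, S_h, n=225)
rounded = [round(x) for x in exact]

print([f"{x:.2f}" for x in exact])
print(rounded, sum(rounded))
```

Strata that are both large and highly variable (A and E) receive most of the sample, while small or homogeneous strata receive little.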
49
Optimal allocation
Suppose data collection costs $c_h$ vary across strata
Let
  C = total budget
  $c_0$ = fixed costs (office rental, field manager)
  $c_h$ = cost per SU in stratum h (interviewer time, travel cost)
Express budget constraints as

$$C = c_0 + \sum_{h=1}^{H} c_h n_h$$

and determine $n_h$
50
Optimal allocation – 2
Assume general case: stratum population sizes, stratum variances, and stratum data collection costs vary across strata
Sample size is allocated to strata in proportion to
  Stratum population size $N_h$
  Stratum standard deviation $S_h$
  Inverse square root of stratum data collection costs $1/\sqrt{c_h}$
Allocation rule

$$n_h = n \frac{N_h S_h / \sqrt{c_h}}{\sum_{l=1}^{H} N_l S_l / \sqrt{c_l}}$$
51
Optimal allocation – 3
Obtain this formula by finding $n_h$ such that $V(\bar{y}_{str})$ is minimized given cost constraints
The optimal stratum allocation will generate the smallest variance of $\bar{y}_{str}$ for a given stratification and cost constraint
Sample size for stratum h ($n_h$) is larger in strata where one or more of the following conditions exist
  Stratum size $N_h$ is large
  Stratum variance $S_h^2$ is large
  Stratum per-unit data collection costs $c_h$ are small
52
Welfare example
Objective
  Estimate fraction of welfare participant households in NE Iowa that have access to a reliable vehicle for work
Sample design
  Frame = welfare participant list
  Stratum 1: Phone
    $N_1$ = 4500 households, $p_1$ = 0.85, $c_1$ = $100
  Stratum 2: No phone
    $N_2$ = 500 households, $p_2$ = 0.50, $c_2$ = $300
  Sample size n = 500
53
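Welfare example – 2
Optimal allocation with phone strata

Table columns (cells left blank on the slide):
  Stratum h
  $N_h$
  $S_h^2 = p_h (1 - p_h)$
  $c_h$
  $N_h S_h / \sqrt{c_h}$
  $\dfrac{N_h S_h / \sqrt{c_h}}{\sum_{l=1}^{H} N_l S_l / \sqrt{c_l}}$
  $n_h$
  $w_{hj}$
Table rows:
  1: phone
  2: no phone
  Total: N = 5000, $\sum_{l=1}^{H} N_l S_l / \sqrt{c_l}$, n = 500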
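One way to fill in the blank table is to apply the optimal allocation rule to the welfare example inputs. The sketch below is not from the original slides. It assumes $S_h^2 = p_h(1 - p_h)$ as in the table header, and the function name is hypothetical.

```python
import math

def optimal_allocation(N_h, S_h, c_h, n):
    """Unrounded n_h proportional to N_h * S_h / sqrt(c_h)."""
    k = [N * S / math.sqrt(c) for N, S, c in zip(N_h, S_h, c_h)]
    total = sum(k)
    return [n * x / total for x in k]

N_h = [4500, 500]        # phone, no phone
p_h = [0.85, 0.50]
c_h = [100, 300]         # dollars per SU

# Stratum standard deviation for a proportion: S_h = sqrt(p_h * (1 - p_h))
S_h = [math.sqrt(p * (1 - p)) for p in p_h]

alloc = optimal_allocation(N_h, S_h, c_h, n=500)
n_h = [round(x) for x in alloc]
w_h = [N / n for N, n in zip(N_h, n_h)]

print([f"{x:.1f}" for x in alloc], n_h, [f"{w:.2f}" for w in w_h])
```

Under these assumptions roughly 459 of the 500 interviews go to the phone stratum. The no-phone stratum is more variable, but it is small and three times as costly per unit.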
54
Optimal allocation – 4
Proportional and Neyman allocation are special cases of optimal allocation
Neyman allocation
  Data collection costs per sample unit $c_h$ are approximately constant across strata
    Telephone survey of US residents with regional strata
  $c_h$ term cancels out of optimal allocation formula

$$n_h = n \frac{N_h S_h}{\sum_{l=1}^{H} N_l S_l}$$
55
Optimal allocation – 5
Proportional allocation
  Data collection costs per sample unit $c_h$ are approximately constant across strata
  Within-stratum variances $S_h^2$ are approximately constant across strata
    Y = number of persons per household is relatively constant across regions
  $c_h$ and $S_h$ terms drop out of allocation formula

$$n_h = n \frac{N_h}{N}$$
56
Subpopulation allocation
Suppose main interest is in estimating stratum parameters
  Subpopulation (stratum) mean, total, proportion
Define strata to be subpopulations
  Estimate stratum population parameters: $\bar{y}_{hU}$ or $t_h$ or $p_h$
Allocation rules derived from independent SRS within each stratum (subpopulation)
  Equal allocation for equal stratum costs, variances
  Stratum variances change across strata
57
Subpopulation allocation – 2
Equal allocation
  Assume
    Desired precision levels for each subpopulation (stratum) are constant across strata
    Stratum costs, stratum variances equal across strata
    Stratum FPCs near 1
  Allocation rule is to divide n equally across the H strata (subpopulations)

$$n_h = \frac{n}{H}$$

  If the $N_h$ vary much, equal allocation will lead to less precise estimates of parameters for full population
58
Welfare example – 3
Suppose we wanted to estimate the proportion of welfare households that have access to a car for households in each of three subpopulations in NE Iowa
  Metropolitan county
  Counties adjacent to metropolitan county
  Counties not adjacent to metro county
59
Welfare example – 4
Equal allocation with population density strata

Stratum h                  $N_h$      $n_h$     $\pi_{hj}$   $w_{hj}$
1: Metro                   3,800
2: Adjacent to metro       700
3: Not adjacent to metro   500
Total                      N = 5000   n = 500
60
Subpopulation allocation – 3
More complex settings: If $S_h$ vary across strata, can use SRS formulas for determining stratum sample sizes, e.g., for stratum mean

$$n_h = \frac{z_{\alpha/2}^2 S_h^2}{e_h^2 + \dfrac{z_{\alpha/2}^2 S_h^2}{N_h}}$$

Result is

$$n = \sum_{h=1}^{H} n_h$$

May get sample sizes ($n_h$) that are too large or small relative to budget
  Relax margin of error $e_h$ and/or confidence level $100(1-\alpha)\%$
  Recalibrate stratum sample sizes to get desired sample size
61
Welfare example – 5
95% CI, e = 0.10 for all pop density strata

Stratum h                  $N_h$      $p_h$   $S_h^2$   Initial $n_h$   Recalibrated $n_h$
1: Metro                   3,800      0.70    0.21
2: Adjacent to metro       700        0.80    0.16
3: Not adjacent to metro   500        0.90    0.09
Total                      N = 5000                                     n = 500
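The blank columns can be explored with the SRS sample size formula for each stratum. The sketch below uses z = 1.96 for a 95% CI. Reading "recalibrate" as scaling all initial sizes by a common factor so they sum to n = 500 is an assumption made here, not something the slides specify, and the function name is hypothetical.

```python
def srs_sample_size(N, S2, e, z=1.96):
    """SRS sample size with FPC: z^2 S^2 / (e^2 + z^2 S^2 / N)."""
    return z**2 * S2 / (e**2 + z**2 * S2 / N)

N_h = [3800, 700, 500]       # metro, adjacent, not adjacent
S2_h = [0.21, 0.16, 0.09]    # p_h * (1 - p_h)
e = 0.10

# Initial stratum sample sizes for the stated margin of error
initial = [srs_sample_size(N, S2, e) for N, S2 in zip(N_h, S2_h)]

# Assumed recalibration: proportional rescaling to the target n = 500
scale = 500 / sum(initial)
recalibrated = [x * scale for x in initial]

print([f"{x:.1f}" for x in initial], f"sum = {sum(initial):.1f}")
print([f"{x:.1f}" for x in recalibrated])
```

With these inputs the initial sizes sum to well under 500, so there is room to tighten the margin of error in each stratum.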
62
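Compromise allocations
[Figure: three panels plotting $n_h$ against $N_h$ for proportional, square root, and equal allocation]
Proportional allocation: $n_h = n N_h / N$
Square root allocation: $n_h = n \dfrac{\sqrt{N_h}}{\sum_{l=1}^{H} \sqrt{N_l}}$
Equal allocation: $n_h = n / H$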
63
Square root allocation
More SUs to small strata than proportional allocation
Fewer SUs to large strata than equal allocation
Variance for subpopulation estimates is smaller than proportional allocation
Variance for whole population estimates is smaller than equal allocation

$$n_h = n \frac{\sqrt{N_h}}{\sum_{l=1}^{H} \sqrt{N_l}}$$

[Figure: $n_h$ plotted against $N_h$ for square root allocation]
64
Compromise allocations – 2
May want to set
  Minimum number of SUs in a stratum
  Cap on max number of SUs in a stratum
Rule
  $n_h$ = min for $N_h < A$
  $n_h$ = max for $N_h > B$
  Apply rule in between A and B
    Square root
    Proportional
[Figure: two panels plotting $n_h$ against $N_h$, flat at min $n_h$ below A and at max $n_h$ above B]
65
Welfare example – 6
Comparing equal, proportional and square root allocation

Stratum h                  $N_h$      Equal allocation   Proportional allocation   Square root of $N_h$   Square root allocation
1: Metro                   3,800      167
2: Adjacent to metro       700        167
3: Not adjacent to metro   500        166
Total                      N = 5000   n = 500            n = 500                   Sum =                  n = 500
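The three rules can be compared side by side for the welfare strata. This is a sketch with hypothetical function names. It returns unrounded allocations and rounds only for display.

```python
import math

def equal_allocation(N_h, n):
    return [n / len(N_h) for _ in N_h]

def proportional_allocation(N_h, n):
    N = sum(N_h)
    return [n * Nh / N for Nh in N_h]

def sqrt_allocation(N_h, n):
    roots = [math.sqrt(Nh) for Nh in N_h]
    total = sum(roots)
    return [n * r / total for r in roots]

N_h = [3800, 700, 500]   # metro, adjacent, not adjacent
n = 500

equal = equal_allocation(N_h, n)
prop = proportional_allocation(N_h, n)
sqrt_alloc = sqrt_allocation(N_h, n)

for name, alloc in [("equal", equal), ("proportional", prop), ("square root", sqrt_alloc)]:
    print(f"{name:>12}: {[round(x) for x in alloc]}")
```

The square root allocation lands between the other two: it gives the small strata more than proportional allocation does, and gives the metro stratum more than equal allocation does.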
66
Other allocations
Certainty stratum is used to guarantee inclusion in sample
  Census (sample all) the units in a stratum
  For certainty stratum h
    Allocation: $n_h = N_h$
    Inclusion probability: $\pi_{hj} = 1$
Ad hoc allocations
  The sample allocation does not have to follow any of the rules mentioned so far
  However, you should determine the stratum allocation in relation to analysis objectives and operational constraints
67
Welfare example – 7
Ad hoc allocation

Stratum h                  $N_h$      Equal allocation   Square root allocation   Proportional allocation   Actual allocation
1: Metro                   3,800      167                279                      380                       200
2: Adjacent to metro       700        167                120                      70                        150
3: Not adjacent to metro   500        166                101                      50                        150
Total                      N = 5000   n = 500            n = 500                  n = 500                   n = 500
68
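Determining sample size n
Determine allocation using rule expressed in terms of relative sample size $n_h / n$

$$\frac{n_h}{n} = \frac{N_h S_h / \sqrt{c_h}}{\sum_{l=1}^{H} N_l S_l / \sqrt{c_l}}$$

Rewrite variance of $\hat{t}_{str}$ as a function of relative sample sizes (ignoring stratum FPCs)

$$V(\hat{t}_{str}) = \frac{1}{n} \sum_{h=1}^{H} \frac{n}{n_h} N_h^2 S_h^2 = \frac{\nu}{n}, \quad \text{where } \nu = \sum_{h=1}^{H} \frac{n}{n_h} N_h^2 S_h^2$$

Sample size calculation based on margin of error e for population total

$$n = \frac{z_{\alpha/2}^2 \, \nu}{e^2}$$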
69
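Determining sample size n – 2
Rewrite variance of $\bar{y}_{str}$ as a function of relative sample sizes (ignoring stratum FPCs)

$$V(\bar{y}_{str}) = \frac{1}{n N^2} \sum_{h=1}^{H} \frac{n}{n_h} N_h^2 S_h^2 = \frac{\nu}{n N^2}, \quad \text{where } \nu = \sum_{h=1}^{H} \frac{n}{n_h} N_h^2 S_h^2$$

Sample size calculation based on margin of error e for population mean

$$n = \frac{z_{\alpha/2}^2 \, \nu}{e^2 N^2}$$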
70
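Welfare example – 8
Relative sample size for equal allocation

$$\frac{n_h}{n} = \frac{1}{H}$$

Value of $\nu$

$$\nu = \sum_{h=1}^{H} \frac{n}{n_h} N_h^2 S_h^2 = H \sum_{h=1}^{H} N_h^2 S_h^2 = 3\,[3800^2(.21) + 700^2(.16) + 500^2(.09)] = 9{,}399{,}900$$

For 95% CI with e = 0.1

$$n = \frac{z_{\alpha/2}^2 \, \nu}{e^2 N^2} = \frac{4\,(9{,}399{,}900)}{.01\,(25{,}000{,}000)} \approx 150$$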
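The sample size calculation can be checked numerically. The sketch below uses hypothetical function names. It follows the slide in taking z ≈ 2 (so z² = 4) for a 95% CI, which is why it gives about 150 and not the roughly 144 that z = 1.96 would give.

```python
def nu(N_h, S2_h, rel_n_h):
    """nu = sum_h (n / n_h) * N_h^2 * S_h^2, with rel_n_h = n_h / n."""
    return sum(N**2 * S2 / r for N, S2, r in zip(N_h, S2_h, rel_n_h))

def sample_size_for_mean(nu_value, N, e, z=2.0):
    """n = z^2 * nu / (e^2 * N^2), ignoring stratum FPCs."""
    return z**2 * nu_value / (e**2 * N**2)

N_h = [3800, 700, 500]
S2_h = [0.21, 0.16, 0.09]
H = len(N_h)

# Equal allocation: n_h / n = 1 / H in every stratum
nu_equal = nu(N_h, S2_h, [1 / H] * H)
n = sample_size_for_mean(nu_equal, N=sum(N_h), e=0.1)

print(f"nu = {nu_equal:,.0f}, n = {n:.1f}")
```

Passing a different vector of relative sample sizes (for example, proportional or square root shares) to `nu` gives the total n needed under that allocation.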
71
STS Summary
Choose stratification scheme
  Scheme depends on objectives, operational constraints
  Must know stratum identifier for each SU in the frame
Set a design for each stratum
  Design for each stratum: SRS, SYS, …
  Determine n and $n_h$
Select sample independently within each stratum
Pool stratum estimates to get estimates of population parameters