STRATIFIED INVERSE SAMPLING Prayad Sangngam A Dissertation Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy (Statistics) School of Applied Statistics National Institute of Development Administration 2010
ABSTRACT
Author Mr. Prayad Sangngam
Year 2010
This dissertation is concerned with stratified inverse sampling and
four
different sampling schemes are considered, namely inverse random
sampling with
replacement, inverse random sampling without replacement, inverse
PPS sampling
with replacement and inverse PPS sampling without replacement.
Unbiased
estimators of the mean of a study variable in the whole population,
the number of
units in a class of interest and the prevalence of a characteristic
are given together
with their unbiased variance estimators. Estimation of the mean per
unit in the class of
rare units is also presented and the bound of its bias
derived.
A simulation study was used to examine the properties of these
sampling designs. The results indicate that inverse sampling
without replacement is
more efficient than inverse random sampling with replacement.
Inverse PPS sampling
with replacement gave higher efficiencies to estimates than inverse
random sampling
with replacement when the correlation coefficient between the
auxiliary and study
variables is large. In addition, inverse PPS sampling without
replacement is more
efficient than inverse random sampling without replacement when the
correlation
coefficient between the auxiliary and study variables is high. When
the number of
sampled units in a class of interest increases, the variance and
mean squared error of
the estimator decrease.
ACKNOWLEDGEMENTS
I would like to express my thanks and appreciation to everyone who
has given
me help in completing this dissertation. In particular, I am
indebted to my advisor,
Professor Dr. Prachoom Suwattee, for always giving invaluable
suggestions and
encouragement. With his guidance and support, I have completed the
dissertation.
Thank you very much for making it all possible.
I also gratefully acknowledge the committee members, Associate
Professor
Dr. Samruam Chongcharoen, Associate Professor Dr. Jirawan
Jitthavech and
Associate Professor Dr. Montip Tiensuwan, for their constructive
comments and
helpful advice. I would also like to thank Associate Professor Dr.
Vichit
Lorchirachoonkul and Associate Professor Dr. Pachitjanut Siripanich
for their
suggestions.
I sincerely offer appreciation to Silpakorn University and the
National Institute
of Development Administration for their financial support during
the study.
I am grateful to my friends and my colleagues for their
cheerfulness,
friendship and support during the study. I am very grateful to Dr.
John McMorris for
his kindness in editing my English, which has also made the
manuscript more
readable.
Finally, I would like to acknowledge my mother, my sisters and my
wife for
their help, support and cheerfulness. This dissertation is
dedicated to my beloved
father who passed away recently.
Prayad Sangngam January 2011
CHAPTER 2 LITERATURE REVIEW 2.1 Introduction
2.2 Review of Sampling Designs for Rare Populations
2.2.1 Two-Stage Cluster Sampling
2.3 Estimation of Parameters in Inverse Sampling
2.3.1 Estimation under Inverse Simple Random Sampling
2.3.2 Estimation under Inverse PPS Sampling
3.3.1 Stratified Inverse Random Sampling
3.3.2 Stratified Inverse PPS Sampling
3.4 Simulation Study
4.1 Introduction
4.2.1 The Sampling Scheme
4.3.1 The Sampling Scheme
5.1 Introduction
5.2.1 The Sampling Scheme
5.3.1 The Sampling Scheme
6.1 Simulation Study
6.2.1 Comparison of Estimates in Stratified Inverse
Random Sampling
Sampling
Random and PPS Sampling
CHAPTER 7 SUMMARY AND FUTURE RESEARCH
7.1 Summary and Conclusions
BIBLIOGRAPHY
APPENDICES
ESTIMATORS
BIOGRAPHY
Tables
4.1 The Values from the Units in the Stratified Inverse Random Sample with Replacement
4.2 The Values from the Units in the Stratified Inverse Random Sample without Replacement
5.1 The Values of a Study Variable and Initial Probabilities from a Stratified Inverse PPS Sample with Replacement
5.2 The Values of a Study Variable from a Stratified Inverse PPS Sample without Replacement
6.2 Averages of the Estimates from 10,000 Samples based on four Stratified Inverse Sampling Designs under the Population with ρ = 0.5
6.3 Estimates of Variances and Mean Square Errors of Estimators from four Stratified Inverse Sampling Designs under the Population with ρ = 0.9
6.4 Comparison of the Relative Efficiencies of the Estimates under Stratified Inverse Random Sampling with and without Replacement
6.5 Comparison of the Relative Efficiencies of the Estimates under Stratified Inverse PPS Sampling with and without Replacement
6.6 Comparison of the Relative Efficiencies of the Estimates under Stratified Inverse Random Sampling and Stratified Inverse PPS Sampling with Replacement
6.7 Comparison of the Relative Efficiencies of the Estimates under Stratified Inverse Random Sampling without Replacement and Stratified Inverse PPS Sampling without Replacement
6.8 Estimates of Variances and Mean Squared Errors of the Estimators under the Life Population Study
LIST OF FIGURES
3.1 The Study Population
B.1 Frequency Distribution of the Estimator sty under Stratified Inverse Random Sampling with Replacement
B.2 Frequency Distribution of the Estimator stP under Stratified Inverse Random Sampling with Replacement
B.3 Frequency Distribution of the Estimator Csty under Stratified Inverse Random Sampling with Replacement
B.4 Frequency Distribution of the Estimator sty under Stratified Inverse Random Sampling without Replacement
B.5 Frequency Distribution of the Estimator stP under Stratified Inverse Random Sampling without Replacement
B.6 Frequency Distribution of the Estimator Csty under Stratified Inverse Random Sampling without Replacement
B.7 Frequency Distribution of the Estimator sty under Stratified Inverse PPS Sampling with Replacement
B.8 Frequency Distribution of the Estimator stP under Stratified Inverse PPS Sampling with Replacement
B.9 Frequency Distribution of the Estimator Csty under Stratified Inverse PPS Sampling with Replacement
Inverse PPS Sampling without Replacement
B.11 Frequency Distribution of the Estimator stP under Stratified Inverse PPS Sampling without Replacement
B.12 Frequency Distribution of the Estimator Csty under Stratified Inverse PPS Sampling without Replacement
1.1 Background
In some situations, the purpose of a sample survey is to estimate
some
parameters that involve just a portion of the whole population. The
problem becomes
more interesting when the portion is small. Kalton and Anderson
(1986: 65) have
mentioned that the small portion might be as small as one tenth,
one hundredth, one
thousandth, or even less of the whole population. In this
situation, the population
might be subdivided into two classes, one of which is the
subpopulation containing
only a small number of units of interest and the other containing
the remaining units.
A population so described is called a rare population. A rare unit
is sometimes defined
as a unit that belongs to the small subpopulation. Usually a
sampling frame for
subpopulations does not exist, whereas a list of units in the whole
population might be
available.
The development of a proper sampling design for a rare population
is one of
the most challenging tasks confronting the statistician. Examples
of rare populations
are the incidence of the AIDS virus in the human population and the
counts of some
special plant in a quadrant within a given forest. Searching for
units with rare
characteristics from a sampling frame is called screening. The
sample size for
screening depends on the desired number of rare units and the
prevalence of rare units
within the population. If the prevalence is small, the sample size
for screening will be
large. Suppose that a sample containing 10 rare units is required
from a population of size 10,000. If the population contains 1,000
rare units (i.e., P = 0.1), the sample size for screening needs to be
approximately 100 units. On the other hand, if the population
contains only 100 rare units (P = 0.01), the sample size for
screening needs to be approximately 1,000 units. Therefore, if the
prevalence decreases from 0.1 to 0.01, the sample size for screening
increases from 100 to 1,000 units. A large screening sample increases
both the time and the cost of a sample survey.
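The screening arithmetic above can be checked with a short calculation. The sketch below assumes that, on average, about m/P = mN/M units must be screened to obtain m rare units (the mean number of draws when units are drawn with replacement); the function name `expected_screening_size` is introduced here for illustration only.

```python
def expected_screening_size(m, M, N):
    """Approximate number of units to screen to obtain m rare units.

    Assumes draws with replacement from a population of N units, M of
    which are rare, so the prevalence is P = M / N and the expected
    number of draws needed to observe m rare units is m / P = m * N / M.
    """
    if not 0 < M <= N:
        raise ValueError("need 0 < M <= N")
    return m * N / M


# The two cases discussed in the text: N = 10,000 and m = 10.
print(expected_screening_size(10, 1000, 10000))  # prevalence P = 0.1
print(expected_screening_size(10, 100, 10000))   # prevalence P = 0.01
```

The calculation makes the inverse relationship explicit: dividing the prevalence by ten multiplies the expected screening effort by ten.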
Usually, there are three main parameters associated with these
populations:
1) The prevalence of a characteristic which is used to define units
of interest is denoted by
$$P = \frac{M}{N}, \tag{1.1}$$
where $M$ is the number of units in the population possessing the
characteristic of interest and $N$ is the size of the population.
2) The mean of a study variable $y$ in a class of interest. This
parameter is denoted by
$$\bar{Y}_C = \frac{1}{M} \sum_{i \in C} y_i, \tag{1.2}$$
where $C$ is the set of all units of interest in the population.
3) The mean of a study variable $y$ in the whole population. This
parameter is represented by
$$\bar{Y} = \frac{1}{N} \sum_{i=1}^{N} y_i. \tag{1.3}$$
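As a small illustration of definitions (1.1) to (1.3), the following sketch computes the three parameters for a toy population. The data values and the names `population_parameters`, `y` and `is_rare` are hypothetical and are not taken from the dissertation.

```python
def population_parameters(y, is_rare):
    """Return (P, Y_C, Y) for a finite population.

    y       : list of study-variable values, one per unit
    is_rare : list of booleans, True if the unit belongs to the class C
    """
    N = len(y)
    rare_values = [yi for yi, flag in zip(y, is_rare) if flag]
    M = len(rare_values)
    if M == 0:
        raise ValueError("the class of interest is empty")
    P = M / N                      # prevalence, equation (1.1)
    Y_C = sum(rare_values) / M     # mean per unit in the class, (1.2)
    Y = sum(y) / N                 # mean in the whole population, (1.3)
    return P, Y_C, Y


# Hypothetical population of N = 10 units, M = 2 of which are rare.
y = [0, 0, 4, 0, 0, 0, 6, 0, 0, 0]
is_rare = [v > 0 for v in y]
P, Y_C, Y = population_parameters(y, is_rare)
print(P, Y_C, Y)
```

In a real survey M and the class membership are unknown in advance, which is why these quantities have to be estimated from a sample.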
In a sample survey of such a population, it might be of interest to
examine
several variables. A suitable sampling design to detect the
appropriate number of units
of interest in a sample which will give a small variance of a
corresponding estimator
is difficult to specify. It is a challenging area of research for
statisticians to specify the
optimum number of units of interest in a sample.
Many methods of sampling for rare populations have been reviewed
by
Sudman and Kalton (1986: 401-429). Some of these are two-phase,
cluster, multi-
frame, snowball, stratified, inverse and network sampling, and the
advantages and
disadvantages of these sampling methods are discussed in the paper.
In addition,
examples of the application of these sampling methods are also
given. Kalton (1993:
53-74) considered the applications of these methods in order to
sample groups at risk
of contracting AIDS. The methods have been oriented mostly towards
surveys
involving human populations (Sudman, 1985; Sudman et al. 1988;
Koegle et al.
1996). In natural resource surveys in which the rare units are
spatially clustered,
Christman (2000: 168-201) reviewed some sampling methods based on
selecting
quadrants in order to estimate the abundance of rare units; one of
these includes
adaptive cluster sampling. Kalton (2001: 9-14) considered sampling
methods for rare
and mobile populations, such as international travelers, car
passengers, visitors to
museums and shoppers at a shopping mall.
1.2 Statement of the Problem
In two-phase sampling, the classification of sample units from the
first-phase
into two strata yields high efficiencies of the estimates of
parameters when one
stratum contains a large number of rare units. The cost ratio for
identifying the sample
unit in the first-phase per rare unit should be small. In practice,
two-stage cluster
sampling is convenient to use but the estimates of parameters of
interest have high
variances.
A two-stage method suggested by Waksberg (1987: 40-46) may be
applicable in telephone surveys but usually creates problems in
face-to-face interview surveys. In multi-frame sampling, the cost of
constructing the sampling frames is sometimes too high and the
estimate of a parameter of interest is usually not satisfactorily
precise. In network sampling, a suitable multiplicity rule is
difficult to define and non-sampling errors usually arise from the
linked units. In snowball sampling, the sample is a non-probability
sample, so it might be useful only for constructing a sampling frame
of rare units. Christman (2000: 194) pointed out that
pointed out that
stratified random sampling is an appropriate design for the study
of a rare event when
a disproportional allocation of sample size is adopted. However,
the disproportional
allocation might increase the variances of estimators and some
samples from strata
with a low prevalence of rare units can give samples without a rare
unit present.
Inverse sampling has the advantage that each sample will contain
the desired number
of rare units, but the sample size is not fixed. When the
prevalence of rare units is
small, the inverse sampling scheme might give a large sample. In
adaptive cluster
sampling, the probability of detecting a rare unit depends on both
the definition of the
neighborhood and the condition for linking units to the
neighborhood. Sometimes, an
appropriate definition of the neighborhood cannot be given for the
linkage of units in
adaptive sampling.
Hence, it is of interest to develop an efficient sampling scheme so
that the number of sampled units in a class of interest and the
parameters will be estimated efficiently. The objectives of the study
are as follows:
1) To develop sampling designs identified as stratified inverse
sampling.
2) To give unbiased estimators of the parameters $\bar{Y}$, $M$ and
$P$, and a ratio estimator of $\bar{Y}_C$, for these sampling designs.
3) To derive the variances or the mean squared errors (MSE) of the
estimators under consideration.
4) To find appropriate estimators of the variances or the MSE of the
estimators obtained in 3).
1.3 Scope of the Study
This study develops suitable sampling designs for a population which
can be divided into two classes; the number of sampled units in a
class to be detected and the efficiencies of the estimators are of
interest. It is assumed that the population
$U = \{u_1, u_2, \ldots, u_N\}$ contains a finite number $N$ of
distinct and identifiable units. Furthermore, it is also assumed that
the population consists of $M$ units of interest and $N - M$
remaining units, where $N$ is known but $M$ is unknown. A sampling
unit cannot be specified as a unit of interest until the value of its
characteristic has been observed. The population is divided into $L$
strata, where $L \ge 2$. A stratum $h$ contains $N_h$ sampling units,
of which $M_h$ units are of interest. In addition, it is assumed that
$N_h$ is known but $M_h$ is unknown. For stratified inverse sampling,
a fixed number of sampled units in the class of interest, $m_h$, from
stratum $h$ is prespecified, and $1 < m_h \le M_h$ for
$h = 1, 2, \ldots, L$. In this study, the following points are
concentrated upon:
1) A sample of units is drawn one by one from stratum $h$ until the
sample contains $m_h$ units of interest, where $h = 1, 2, \ldots, L$.
Different sampling schemes from the strata are considered.
2) Unbiased estimators of $\bar{Y}$, $M$ and $P$, and a ratio
estimator of $\bar{Y}_C$, are developed in each sampling design.
3) For the unbiased estimators, their variances are obtained, and for
the ratio estimator of $\bar{Y}_C$, the MSE or its approximation is
given.
4) Unbiased estimators of the variances are given and estimators of
the MSE are obtained.
5) Numerical examples for calculation of the estimates and their
variance or MSE estimates are given. The sampling designs are
compared using a numerical study. Stratified inverse sampling with
replacement is compared to stratified inverse sampling without
replacement. Furthermore, the variances and MSEs of the estimates
under stratified inverse PPS sampling are compared to those under
stratified inverse random sampling.
CHAPTER 2
LITERATURE REVIEW
2.1 Introduction
In order to obtain a sufficient number of sampled units in a class
of interest,
screening a large number of units from a population is needed. If
the prevalence of a
rare characteristic in the population is small, the sample size for
screening is large.
Consequently, screening a large sample is costly in both time and
money.
Researchers usually want to minimize the time and money on a
survey
together with the specification of the degree of precision in the
results. When
conducting a household survey, telephone interviewing can be used
to reduce the cost
of screening. Although rare units can be identified during telephone
conversations, both the non-response rate and the accuracy of the
answers from the units can be adversely affected.
Researchers must pay particular attention to the screening
questions so as to minimize
the risk of non-response or wrong answers of the units in a sample.
Furthermore,
when telephone screening is applied, some households without a
telephone cannot be
included in a sample. In this case, telephone screening is biased.
When face-to-face
interviews are used for screening, greater field costs are incurred
because of travel
between the units. In this case, cluster sampling should be
considered in order to
capture more rare units within large areas, such as complete city
blocks or entire
villages.
Many researchers are interested in sampling methods which can quickly
discard some of the clusters without rare units. Sudman (1972:
335-339) suggested an optimum sampling design for use with very rare
human populations using a Bayesian optimum sampling procedure to
discard some of the clusters without rare units. However, some
conventional sampling designs can be applied to the sampling of a
rare population; for instance, Sudman (1978: 300-304) examined a
sampling method combining telephone screening and face-to-face
interviewing.
Kish (1980: 209-222) described sampling designs and estimation
methods for
parameters in domains, and considered stratified sampling, cluster
sampling and
network sampling. The parameters of interest consisted of the
population mean, the
total and the difference between the means from some of the
domains, and the
estimators of the parameters, their variances and variance
estimates were given.
Kalton and Anderson (1986) reviewed various sampling designs for rare
populations based on available sampling frames. One of these was
stratified random sampling, in which a disproportional allocation of
sample size to strata was proposed. The advantages and disadvantages
of each design were discussed, and examples of the application of
these sampling designs were given.
Sudman and Kalton (1986: 401-429) described alternative methods for
the careful
sampling of rare populations. Kalton (1993: 53-74) considered
sampling methods to
select members of the human population with HIV. Christman (2000:
168-201)
reviewed various sampling designs of rare geographic clusters of a
population in order
to estimate the prevalence of rare units. In addition, the sampling
designs were
compared using a simulation study. The results showed that stratified
sampling was more efficient than the other sampling designs, in that
its estimate had the smallest variance and its sampling distribution
was the most similar to a normal
distribution. Magnani et al. (2005: 67-72) reviewed alternative
sampling designs for a
hidden population with HIV. The methods discussed included
snowball, facility,
targeted, time-location, respondent-driven and cluster sampling.
Each of the sampling
designs suggested by various statisticians for the sampling of rare
populations is
subsequently described.
2.2 Review of Sampling Designs for Rare Populations
Many sampling schemes can be applied to rare populations, such as
two-stage
cluster, time-location, multiple frame, two-phase, snowball,
network, adaptive cluster,
stratified, inverse and probability proportional to size sampling.
These sampling
schemes are reviewed in this section. However, if a sampling frame
containing only rare units exists, then obviously it should be used,
which makes these sampling methods unnecessary.
2.2.1 Two-Stage Cluster Sampling
When a survey is conducted by face-to-face screening, high costs
for finding
rare units are incurred. To reduce this cost, units should be
grouped into clusters and
two-stage sampling will do well in these situations. In samples
from household
surveys, households are the sampling units and they may be grouped
into cities,
districts, etc. and regarded as primary units. During the first
stage, a sample of
primary units is selected followed by a subsample of secondary
units within each of
the selected primary sampling units during the second stage. When
households are
grouped into cities, the cities may be selected by stratified
random sampling. In the
second stage, households may be subsampled from each city already
drawn in the first
stage by simple random sampling. However, two-stage cluster
sampling leads to loss
of precision in the estimate of a parameter when compared with an
unclustered design
for the same sample size (Kish, 1965: 161-164).
Two-stage cluster sampling can reduce the cost of obtaining
sampling units by
observing a large number of subsample units. When rare units are
geographically
clustered, it is more efficient to sample the primary units where
the rare units are more
concentrated. A method for improving the efficiency of two-stage
inverse sampling,
as described by Waksberg (1987: 40-46), can be adopted in this
situation. This
procedure involves first selecting primary units with probability
proportional to a measure of size and then drawing a secondary unit
from each sampled primary unit. If the secondary unit is a rare unit,
then the primary unit is accepted into the sample and secondary units
are drawn from it until a fixed number ($m$) of rare units is
attained. If the secondary unit drawn is not a rare unit, then the
primary unit is rejected. The process is repeated until the required
number of primary units which contain rare units is attained.
However, the scheme has some disadvantages: primary units are often
rejected, and some clusters might contain fewer than the desired $m$
rare units. A modified procedure which addresses this issue is where
the secondary unit consists of more than one rare unit and the
clusters are large. Sudman (1985: 20-29) has given a formula for
calculating a good choice of the number of sampled rare units in a
cluster.
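The accept/reject logic described above can be sketched as follows. This is a simplified illustration only: it assumes that each primary unit is given as a non-empty list of rare/non-rare flags, that its length serves as the measure of size, and that primary units are selected with replacement. It is not Waksberg's published procedure in full, and the names `two_stage_inverse_sample`, `clusters`, `k` and `max_tries` are hypothetical.

```python
import random


def two_stage_inverse_sample(clusters, k, m, seed=0, max_tries=10_000):
    """Accept k primary units; within each, screen secondary units
    until m rare units are found (or the cluster is exhausted).

    clusters : list of non-empty lists of booleans (True = rare unit)
    Returns a list of (cluster_index, screened_flags) pairs. The same
    primary unit may be accepted more than once in this sketch.
    """
    rng = random.Random(seed)
    sizes = [len(c) for c in clusters]          # measure of size (assumed)
    accepted = []
    for _ in range(max_tries):
        if len(accepted) == k:
            break
        # First stage: one primary unit with probability proportional to size.
        i = rng.choices(range(len(clusters)), weights=sizes, k=1)[0]
        order = clusters[i][:]
        rng.shuffle(order)                      # random screening order
        if not order[0]:
            continue                            # first unit not rare: reject
        screened, found = [], 0
        for flag in order:                      # second stage: keep screening
            screened.append(flag)
            found += flag
            if found == m:
                break
        accepted.append((i, screened))
    return accepted


# Hypothetical clusters: the second contains no rare units at all.
clusters = [[True, False, True, False], [False] * 5, [True, True, False]]
result = two_stage_inverse_sample(clusters, k=3, m=2)
for i, screened in result:
    print(i, screened)
```

A cluster without rare units can never be accepted, which shows why the scheme saves screening effort when rare units are concentrated in a few primary units, and also why many first-stage draws may be rejected.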
2.2.2 Time-Location Sampling
Time-location sampling is a specific method used to sample
individuals who
visit certain locations, such as libraries, museums, shopping
centers, bars, bookstores
and polling places. Sampling is usually conducted either as the
visitors enter or leave
a location. These samples are generally obtained when convenient,
with the recruiting
being conducted at a time when the numbers of visits to the
locations are high.
Location sampling can readily produce a probability sample of
visits with known
selection probabilities and hence visits are easily analyzed.
Visits may be the
appropriate units of analysis for, say, a survey about satisfaction
with visits to a
museum. However, for many surveys using location sampling, the
visitor is the
appropriate unit of analysis.
In a clinical survey for AIDS testing, the sampling units are those
who visit
four clinics in a given 13-week period. Clinics and times are
separated as primary
units. Kalton (2001: 9-14) reviewed various sampling methods for
rare and mobile
populations. One of these includes location sampling. There were
two parameters of
interest including the number of rare units or prevalence of a rare
characteristic and
the mean of a study variable per rare unit. Karon (2005: 3180-3186)
proposed a
weighted analysis when time-location sampling was used to collect
information.
Time-location sampling is an application of two-stage sampling and
an important
feature of this procedure is that the sample units have an equal
probability of being
selected.
2.2.3 Two-Phase Sampling
In some situations, accurate identification of rare units is
difficult or
expensive, such as in a case where an expensive medical test is
needed for a firm
diagnosis of an illness. Suppose that there is an inexpensive
method for identifying
rare units but this method is imperfect, then it is cost-effective
to apply two-phase
sampling in this situation.
In the first phase of sampling, a large number of units are drawn
by simple
random sampling and the inexpensive method is used to divide the
sample units into
two strata. The first stratum contains units with a high likelihood
of being rare units
(the positive group) and the second stratum contains units with a
smaller likelihood of
being rare units (the negative group). During the second phase,
disproportional
stratified sampling is used with a high sampling fraction from the
positive group. An
accurate method for identifying rare units, which is difficult and
expensive, is used to
identify rare units in the subsample. The accurate measure of a
characteristic and the
values of a study variable are collected in the second phase.
Information from both
phases is used to make inferences about the parameters of interest.
In the first phase,
the imperfect method is used to identify units and both positive
and negative groups
may contain non-rare units.
Tenenbein (1970: 1350-1361) derived the maximum likelihood
estimator of
the prevalence of a characteristic under a double sampling scheme.
Optimum values
of the initial sample size and subsample sizes were found so that
the variance of the
estimator was minimized under a fixed cost. However, two-phase
sampling is useful only when the following two conditions are both
satisfied. The first condition is that the inexpensive method
classifies the units in the initial sample well enough for the first
stratum to contain a larger prevalence of rare units than the second
stratum. The second condition is that the ratio of the cost of
identifying a rare unit in the first phase to that in the second
phase is small. Deming (1977: 33) suggested that the cost ratio
should be lower than 1:6. If the cheap method used in the first phase
correctly classifies population units into the negative group, then
only units in the positive group are drawn in the second phase.
Mak and Li (1988: 105-111) considered the estimation of the mean of
each
subgroup when the sample comes from a double sampling scheme.
Shrout and
Newman (1989: 549-555) presented an optimal two-phase design for
estimating the
prevalence of a rare characteristic. The relative efficiency of
two-phase sampling
compared to simple random sampling was derived and a condition
under which two-
phase sampling was more efficient than simple random sampling was
found. Hughes-
Oliver and Rosenberger (2000: 315-327) developed a two-phase design
to select units
with multiple rare characteristics. Udofia (2002: 82-89) applied
two-phase sampling to the selection of sampling units with
probability proportional to a measure of size.
McNamee (2003: 1072-1078) derived a mathematical formula for the
optimal cost
ratio when choosing between two-phase or simple random
sampling.
In some cases, only sampling units in the positive group are
selected to
identify rare units and to observe study values. Biases can occur
in these cases since
the second stratum may contain rare units. Information on the
misclassification of population units can be used to adjust for this
bias. Alonzo and Pepe (2003:
313-326) suggested an
estimator which avoids bias. McNamee (2004: 783-792) proposed two
new
allocations of sample size to the subsamples for when two-phase
sampling is applied.
Morvan et al. (2007: 261-269) considered methods for assessing the
accuracy of an
inexpensive method to identify rare units in the first phase.
2.2.4 Snowball Sampling
According to the original idea of Goodman (1961: 572-579),
snowball
sampling refers to methods applied when an initial sample is asked
to identify a fixed
number of acquaintances, who in turn are asked to identify a fixed
number of
acquaintances, and so on until either a fixed number of waves is
reached or without
further cost being incurred. For instance, when taking a snowball
sample of homeless
persons, first select a small number of homeless persons. The
homeless persons in the
sample are asked to identify other homeless persons, and then each
of these persons is
asked to identify homeless persons, and so on until the desired
number of homeless
persons is reached. One advantage of snowball sampling is in
constructing a frame of
rare units. When the sampling frame has been compiled, a
probability sample can be
drawn from the frame. However, some rare units in the frame may be
omitted leading
to the survey estimate becoming biased. A disadvantage of using
snowball sampling
is that the rare units must know other rare units and this
condition may not hold for all
rare populations. Snowball samples are not probability samples.
Some weighting
methods cannot be applied to obtain an unbiased estimator of a
parameter of interest. In
order to increase the efficiency of the estimator, it may be
assumed that the population
has a probability distribution.
2.2.5 Multiple Frame Sampling
Although a complete sampling frame that contains all of the units
of interest
may not be available, sometimes incomplete frames exist which cover
every unit of
interest. In this situation, the frames may be combined in two
ways. In the first
method, duplicate units are deleted in order to construct a
complete frame. In the
second method, independent samples are selected from the incomplete
frames and the
sample estimates combined to infer the parameters of interest. For
the case of
independent samples, the original idea of multiple frames was
proposed by Hartley
(1962: 203-206). He considered two frames, called a dual frame.
Some rare units may
be listed in more than one frame. The units are then divided into
three subsets, as shown in Figure 2.1.
Figure 2.1 Three Disjoint Subsets of Two Frames
Let $A$ and $B$ be two given frames. The partitioning of $A \cup B$
into 3 subsets gives $A - B$, $B - A$ and $A \cap B$, with
$N_{A-B}$, $N_{B-A}$ and $N_{A \cap B}$ units, respectively. Let $p$
be the probability that a sampling unit lies in frame A and $q$ be
the probability that a sampling unit lies in frame B, under the
condition $p + q = 1$. An estimate of the population total is
$$\hat{Y} = N_{A-B}\,\bar{y}_{A-B} + N_{B-A}\,\bar{y}_{B-A} + N_{A \cap B}\,(p\,\bar{y}'_{A \cap B} + q\,\bar{y}''_{A \cap B}),$$
where $\bar{y}_{A-B}$ is the sample mean of a study variable from
sample units in frame A only, $\bar{y}_{B-A}$ is the sample mean of a
study variable from units in frame B only, $\bar{y}'_{A \cap B}$ is
the sample mean of a study variable from units sampled in frame A
which also lie in frame B, and $\bar{y}''_{A \cap B}$ is the sample
mean from units sampled in frame B which also lie in frame A. If
$N_{A-B}$, $N_{B-A}$ and $N_{A \cap B}$ are unknown, a new weighted
estimate can be applied.
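To make the structure of this estimate concrete, the sketch below evaluates it for hypothetical inputs. The function name `dual_frame_total` and all numerical values are illustrative assumptions, not data from any of the studies cited.

```python
def dual_frame_total(N_a_only, N_b_only, N_ab,
                     ybar_a_only, ybar_b_only,
                     ybar_ab_from_a, ybar_ab_from_b, p):
    """Dual-frame estimate of a population total.

    The overlap domain (units listed in both frames) is sampled from
    both frames; its two sample means are combined with weights p and
    q = 1 - p, and each domain mean is scaled by its domain size.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    q = 1.0 - p
    overlap_mean = p * ybar_ab_from_a + q * ybar_ab_from_b
    return (N_a_only * ybar_a_only
            + N_b_only * ybar_b_only
            + N_ab * overlap_mean)


# Hypothetical domain sizes and sample means.
total = dual_frame_total(N_a_only=600, N_b_only=300, N_ab=100,
                         ybar_a_only=2.0, ybar_b_only=4.0,
                         ybar_ab_from_a=3.0, ybar_ab_from_b=5.0, p=0.5)
print(total)
```

Choosing p = 1 or p = 0 uses only the frame A or only the frame B sample for the overlap domain; intermediate values pool the two.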
Casady and Sirken (1980: 601-605) gave a dual frame based on
household and
telephone frames. An allocation of sample size into frames was
considered so that the
variance of an estimator was minimized under a fixed cost when
units were drawn by
simple random sampling. Casady et al. (1981: 444-447) considered a
dual frame based on household and telephone frames when the units
were selected by cluster sampling from both frames. Lepkowski and
Groves (1984: 265-270) examined an optimal allocation of sample size
into frames when stratified multistage sampling was used to select
samples from both household and telephone frames. A model was also
developed for survey errors to include both the response error and
bias from dual frames. Iachan and Dennis (1993: 747-764) described
the use of multiple frames to sample homeless persons in Washington
DC. Four frames were considered, namely homeless shelters, soup
kitchens, encampments (such as vacant buildings and locations under
bridges) and streets.
Lohr (1999: 401) examined a dual frame to sample people with
Alzheimer’s
disease where the first sampling frame consisted of the entire
population that covered
every person with Alzheimer’s and the second sampling frame was
adult-care centers
with a high prevalence of people with the disease. In this situation,
the units in $A \cup B$ are divided into two subsets, as shown in
Figure 2.2. Let $A$ and $B$ be the two given frames. The partition of
$A \cup B$ into 2 disjoint subsets gives $A - B$ and $A \cap B$. In
this case, frame A represents the sampling frame from the entire
population and frame B refers to the sampling frame of adult-care
centers.
Figure 2.2 Frame B Belonging to Frame A
This is a special case of the partitioning of two frames shown in
Figure 2.1. An estimate of the population total can be similarly
derived.
Lohr and Rao (2000: 271-280) discussed various estimators of the
population total from a dual frame sample and derived a jackknife
variance estimator for them. The asymptotic variances of these
estimators were subsequently compared.
Srinath et al. (2004: 4424-4429) considered a dual frame where the
first frame was a
complete household frame and the other only the households of
interest. This dual
frame was adopted during the National Immunization Survey where the
households of
interest consisted of households with children aged between 19 and
35 months. Palit
(2006: 3508-3513) proposed an alternative estimator of the
population total when
multiple frames have been used to select a sample. Mecatti (2007:
151-157) proposed
a new estimator of the population total of a study variable when a
sample was made
up of an aggregation of samples from multiple frames. The variance
of the new
estimator was derived and an unbiased estimator of the variance was
also given.
2.2.6 Network Sampling
In conventional sampling, a sampling frame provides a single
listing for each
sampling unit. For sampling rare populations, the problem of
duplicate listings within
frames may arise. In an example of a medical survey, the
observation units are the
patients with certain diseases. A random sample of medical centers
is drawn and,
from the records of each medical center in the sample, records of
patients treated for
those diseases are obtained. However, some of the patients may have
been treated at
more than one medical center. Network sampling has been developed
to solve this
problem and refers to a sampling scheme where sampling units are
drawn by
conventional sampling; not only are the sampled units selected but
also other units
which are linked to them and the sampled and linked units are then
combined. A
network is defined as a set of units which obey a given linkage
rule. Network
sampling can improve the efficiency of the estimate of a parameter
whenever a rare
unit in the population can report other rare units and this
information may be used to
increase the number of rare units in a sample. It can be applied to
increase the number
of rare units where the number of population units in the initial
sample is small.
Birnbaum and Sirken (1965: 1-8) proposed three unbiased estimators
of the
number of units with rare characteristics based on stratified
network sampling. Each
estimator depends on the use of information from the observed units
in the sample.
Sirken (1970: 257-566) considered network sampling for household
surveys. In order
to compare the variances of the estimators, three statistical
models were adopted. The
results showed that the estimator from the network sample gave a
smaller sampling
variance than the estimators from a conventional design for the
same sample size.
Sirken (1972: 224-227) compared the variances of two estimators
from stratified
multiplicity sampling to an estimator from conventional sampling.
One estimator
from the multiplicity sample permits a sampling unit to be linked
to units from only
one stratum, whereas the second estimator permits a sampling unit
to be linked to any
unit. The results showed that none of the three estimators always
gave a smaller
sampling variance than the others. Levy (1977: 758-763) considered
an optimum
allocation of sample size to strata under stratified network
sampling. The cost
efficiency of network sampling was compared to that of conventional
sampling. The
results showed that network sampling can be more cost efficient
than conventional
sampling under certain conditions. Snowden (1983: 102-105) compared
the biases
and mean squared errors of four multiplicity rule estimators for
estimating the
prevalence of cancer. Czaja et al. (1987: 411-419) found that
network sampling could
improve the efficiency of the estimator of a rare characteristic in
a population. The
mean squared errors of two multiplicity rule estimators were
compared to the mean
squared error of a conventional estimator. Czaja (1988: 38-43)
applied network
sampling to a local crime victimization survey where six counting
rules were
considered. Sirken (1998: 1-6) reviewed a short history in the
application of network
sampling. Sirken (2006: 3664-3668) also described network sampling
methodology
and some of its advantages and disadvantages.
2.2.7 Adaptive Cluster Sampling
Adaptive sampling is a sampling design in which the scheme for
selecting units depends on the values of a study variable observed in
the sample during the survey (Thompson, 1990: 1050-1059). The
probability of getting a sample therefore depends on the values of the
study variable in the sample. The rationale behind adaptive sampling is
to obtain a more precise estimator of the prevalence of a
characteristic of interest or else to increase the number of sample
units with a certain characteristic. Adaptive cluster sampling is a
sampling design where an initial set of units is drawn using a
predetermined probability sampling scheme and, whenever the value of
the study variable of a unit in the initial sample satisfies a given
condition, the units in the neighborhood of this unit are added to the
sample, and so on, until no units which satisfy the condition are left
to add to the sample. The neighborhood refers to a set of units which
is identified by a set of rules. If the i-th unit is in the
neighborhood of the j-th unit, then unit j is also in the neighborhood
of unit i. This relationship is referred to as symmetric. In social
surveys, a neighborhood may be defined by the social relationship of
the units. The collection of all of the units observed under the design
as a result of selecting the i-th initial unit is called a cluster,
which may consist of the union of various neighborhoods. A network
refers to a collection of units with the property that if one unit in
the network is selected, then every unit in the network will be
sampled. A unit which does not satisfy the condition but is in the
neighborhood of one that does is called an edge unit. Thus, the
population units can be partitioned into mutually exclusive networks.
However, an inappropriate neighborhood definition leads to the
selection of a large sample without any units of interest.
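The expansion rule described above can be sketched in Python. This is only a minimal illustration, assuming a rectangular grid of units, a neighborhood made of the four adjacent cells and the condition y > 0; the function name `adaptive_cluster` and the example grid are hypothetical and are not taken from the cited papers.

```python
from collections import deque

def adaptive_cluster(grid, initial_units, condition):
    """Return every unit observed when the initial units are expanded.

    A unit that satisfies `condition` has its four neighbours added;
    a unit that does not satisfy it is observed but not expanded.
    """
    rows, cols = len(grid), len(grid[0])
    sampled = set(initial_units)
    queue = deque(initial_units)
    while queue:
        r, c = queue.popleft()
        if not condition(grid[r][c]):
            continue  # edge unit or empty initial unit
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and (nr, nc) not in sampled:
                sampled.add((nr, nc))
                queue.append((nr, nc))
    return sampled

# Hypothetical population: counts of a rare species on a 4 x 4 grid.
grid = [
    [0, 0, 0, 0],
    [0, 3, 2, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 5],
]
sample = adaptive_cluster(grid, [(1, 1), (3, 0)], lambda y: y > 0)
```

In this example the network {(1, 1), (1, 2)} is observed together with its six edge units, while the initial unit (3, 0) adds nothing because it does not satisfy the condition.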
Salehi and Smith (2005: 84-103) proposed two-stage adaptive
sampling. In the
first stage, a sample of primary sampling units was drawn by one
conventional
sampling scheme followed by the selection of initial secondary
units from the primary
sampled units. In the second stage, if the initial secondary units
satisfy a certain
condition, the secondary units in the primary unit are drawn and
added to the sample.
Thompson (2006: 1224-1223) proposed what is called adaptive web
sampling.
In the first stage, an initial sample is selected using a specified
sampling design, and
either it or its subset is used to construct an active set. The
active set depends on the
observed values of a study variable in the sample. In the k-th
stage, an additional
sample is selected which depends on the active set or the sampling
frame with
probabilities $p$ and $1-p$, respectively. The procedure for selection
is performed until a fixed number of stages k or a fixed sample size
has been obtained. Four estimators of the population mean were given,
two of which are unbiased and the remainder biased. The variances and
mean squared errors of the estimators were derived, and estimates of
the variances were also given. However, these estimators were obtained
by taking the expected value of an estimator conditioned on a minimal
sufficient statistic. One disadvantage of this technique is that if the
sample size is large, the estimators are cumbersome to compute.
2.2.8 Inverse Sampling
A problem of a fixed sample size sampling design for rare
populations is that a
sample may not contain any rare units. A method for solving this
problem is the use
of an inverse sampling scheme. Haldane (1945: 222-225) considered
inverse simple
random sampling when the parameter to be estimated is the
prevalence of a rare
characteristic. The units are drawn one by one with replacement
until the sample
contains m rare units. An unbiased estimator of the prevalence of a
rare unit was
found but an unbiased estimator of its variance was not given. Finney
(1949: 223-234) gave an unbiased estimator of the variance of Haldane’s
estimator. Sampford (1962: 27-40) considered inverse sampling with
probability proportional to size using clusters. The primary units or
clusters are selected one by one with replacement and with probability
proportional to size until the sample contains $m+1$ distinct clusters.
The total sample size is $n+1$ clusters after the final cluster draw.
In order to find an unbiased estimator of the population mean, he
suggested that the last cluster be rejected.
Pathak (1964: 158-192) showed that the method of inverse sampling
with a
probability proportional to size in cluster sampling of Sampford
(1962) is equivalent
to sampling with a probability proportional to size without
replacement. In addition,
he also pointed out that an unbiased estimator of the population
mean exists which is
uniformly better than the estimator given by Sampford (1962).
Mikulski and Smith (1976: 216-217) found the variance bounds of
Haldane’s unbiased estimator. Pathak (1976: 1012-1017) considered
inverse sampling for a fixed cost survey. The sampling units are drawn
sequentially until the fixed total cost of the survey is reached,
assuming that the cost of observing a population unit is unknown until
the information from this unit has been completely observed. This
method is preferable because it eliminates the randomness of the total
cost of a sample survey. Sathe (1977: 425-426) improved the upper bound
of the variance of Haldane’s unbiased estimator, as did Prasad and
Sahai (1982: 286).
Lan (1999), and Christman and Lan (2001: 1096-1105), developed what is
called inverse adaptive cluster sampling. An unbiased estimator of the
population total was given but an unbiased estimator of its variance
was not. Salehi and Seber (2001: 281-286) showed that Murthy’s
estimator can be applied
to inverse sampling. An unbiased estimator of the population total
and an unbiased
estimator of its variance were given when inverse adaptive sampling
was adopted.
Salehi and Seber (2002: 63-74) considered cases where networks were
selected
without replacement until the sample contains m networks with rare
units in adaptive
cluster samples. This sampling scheme is called restricted adaptive
cluster sampling.
An unbiased estimator of the population total and an unbiased
estimator of its
variance were derived. Salehi and Seber (2004: 483-493) found an
unbiased estimator
of the population total and an unbiased estimator of its variance
when samples come
from general inverse adaptive cluster sampling. Greco and Naddeo
(2007: 1039-1048)
considered inverse sampling when the population units were drawn
with a probability
proportional to size and with replacement. An unbiased estimator of
the population
total and an unbiased estimator of its variance were found. Espejo
et al. (2008: 133-
137) considered inverse sampling without replacement.
2.2.9 Stratified Random Sampling
Stratified random sampling may be used to increase the number of rare
units in a sample. In a simple case, assume that there are only two
strata such that one stratum has a higher prevalence of rare units than
the other. Since the first stratum contains a higher proportion of rare
units, a larger sample is selected from this stratum. In stratified
random sampling of a rare population, there are two problems, namely
the allocation of the sample size and the appropriate value of the
ratio of the prevalences in the two strata, $r = P_1/P_2$.
Kalton and Anderson (1986: 73) suggested a disproportional
allocation of
sample size to strata where the stratum with the largest number of
rare units should be
oversampled. Srinath (1996: 226-231) proposed a new allocation of
sample size in an
attempt to keep the sample size in a stratum close to the desirable
sample size. This allocation showed that the sample size should be
close to the initial sample size and the number of rare units in a
sample should be close to a desirable number. However, the efficiency
of the new allocation depends on the initial allocation. Srinath (1999:
351-354) considered a varying proportional allocation which minimizes
the increase of the variance and tries to minimize the cost of the
survey. Boyle and Kalsbeek (2005: 2738-2793) considered an optimum
value of $r = P_1/P_2$ when the population was divided into two
subpopulations. LeFauve et al. (2006) compared the sample size from a
disproportional allocation to that from proportional allocations under
the same variance of an estimator using a real life population. Kalton
and Anderson (1986: 73) pointed out that disproportionate stratified
sampling is preferable when two conditions are satisfied: the first is
that the prevalence of rare units in the first stratum is higher than
in the other stratum and the second is that the share of the rare units
in the population that fall in the first stratum must be large
($P_1^* = M_1/M$, where $M_1$ refers to the number of rare units in the
first stratum).
2.2.10 Probability Proportional to Size Sampling
Unequal probabilities are commonly used in agricultural surveys
when the
selection probabilities are evaluated at the moment when the units
are drawn. For
instance, previous census data of agricultural production are
important indicators
when studying production during the current year. In this case, the
study variable is
the total quantity of production by farmers in the current year
while the selection
probabilities may be proportional to the quantity of production in
previous years. In
order to estimate production of a particular commodity, knowledge
of the acreage
used for an agricultural commodity, obtained via analysis of
satellite pictures, might
be useful for evaluating the probabilities of unit selection. A
quick eye-estimate by an
expert might be useful in suggesting a suitable sampling
methodology for estimating
the volume of timber in a forest. Moreover, if it is believed that
the values of a study
variable are proportional to the value of an auxiliary variable,
then the selection
probabilities are usually chosen proportional to the value of the
auxiliary variable in
order to reduce the variance of the estimate.
Probability proportional to size (PPS) sampling is a sampling
technique used
in surveys in which the probability of selecting a sampling unit is
proportional to its
size at each draw. The use of unequal probabilities in sampling was
first proposed by
Hansen and Hurwitz (1943: 333-362). Before that, there had been
substantial
developments in sampling theory and practice, but all these had
been based on the
assumption that the probabilities of the selection of units within
each stratum should
be equal. Hansen and Hurwitz demonstrated that the use of unequal
selection
probabilities within a stratum would give a more efficient
estimator of the population
total. In many situations, it may be preferable to draw a sample
with unequal
probabilities and without replacement and there is a vast
collection of literature
examining this under fixed sample size sampling schemes.
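A small numerical check may make the gain from unequal probabilities concrete. The sketch below assumes PPS sampling with replacement and uses the standard Hansen-Hurwitz form of the estimator of the total, the average of y_i / z_i over the draws. The four-unit population, the draw probabilities and the function names are hypothetical. Every ordered sample of two draws is enumerated, so the expectation and variance are exact rather than simulated.

```python
from itertools import product

def hansen_hurwitz_total(ys, zs):
    """Average of y_i / z_i over the draws (PPS sampling with replacement)."""
    return sum(y / z for y, z in zip(ys, zs)) / len(ys)

def exact_moments(y, z, n):
    """Exact mean and variance of the estimator over all ordered samples."""
    mean = 0.0
    second = 0.0
    for draws in product(range(len(y)), repeat=n):
        prob = 1.0
        for i in draws:
            prob *= z[i]
        est = hansen_hurwitz_total([y[i] for i in draws], [z[i] for i in draws])
        mean += prob * est
        second += prob * est ** 2
    return mean, second - mean ** 2

y = [10.0, 20.0, 30.0, 40.0]          # hypothetical study values, total 100
equal = [0.25, 0.25, 0.25, 0.25]      # equal draw probabilities
pps = [0.12, 0.18, 0.32, 0.38]        # roughly proportional to y

mean_equal, var_equal = exact_moments(y, equal, 2)
mean_pps, var_pps = exact_moments(y, pps, 2)
```

Both designs give an unbiased estimator of the total, but the variance is much smaller when the draw probabilities roughly follow the study variable.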
Madow (1949: 333-345) considered the use of systematic sampling
with
unequal probabilities so as to avoid the possibility of units being
selected more than
once. This suggestion was followed by a large number of alternative
selection
procedures. Horvitz and Thompson (1952: 663-685) produced a general
theory of
sampling with unequal probabilities without replacement based on
the use of an
unbiased estimator of the population total. Midzuno (1952: 99-107)
suggested an unequal probability sampling scheme under a fixed sample
size. The first unit of the sample is selected with an initial
probability $z_i$ and the remaining units are drawn with equal
probabilities and without replacement.
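The fixed-size scheme just described can be sketched as follows. The code assumes the initial probabilities sum to one; the function names and the four-unit example are hypothetical. The probability of an unordered sample follows from the description: any of its units may have been the first draw, and the remaining n - 1 units are then one of the equally likely subsets of the other N - 1 units.

```python
import random
from itertools import combinations
from math import comb

def midzuno_sample(z, n, rng):
    """First unit with probability z[i]; the other n - 1 units by simple
    random sampling without replacement from the remaining units."""
    units = list(range(len(z)))
    first = rng.choices(units, weights=z, k=1)[0]
    rest = rng.sample([i for i in units if i != first], n - 1)
    return [first] + rest

def midzuno_sample_probability(sample, z):
    """Probability of obtaining the unordered sample under the scheme."""
    n = len(sample)
    return sum(z[i] for i in sample) / comb(len(z) - 1, n - 1)

z = [0.1, 0.2, 0.3, 0.4]      # hypothetical initial probabilities
n = 2
all_samples = list(combinations(range(len(z)), n))
total_probability = sum(midzuno_sample_probability(s, z) for s in all_samples)
inclusion_0 = sum(midzuno_sample_probability(s, z) for s in all_samples if 0 in s)
example = midzuno_sample(z, n, random.Random(1))
```

The enumeration confirms that the sample probabilities sum to one and that the inclusion probability of a unit is z_i + (1 - z_i)(n - 1)/(N - 1).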
2.3 Estimation of Parameters in Inverse Sampling
2.3.1 Estimation under Inverse Simple Random Sampling
Haldane (1945: 222) gave an unbiased estimator of the prevalence of
rare units, $P$, as
$$\hat{P} = \frac{m-1}{n-1}, \tag{2.1}$$
where m is the number of rare units in the sample and n refers to the
sample size drawn by inverse simple random sampling with replacement.
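The sampling rule and the estimator (m - 1)/(n - 1) can be illustrated with a short simulation. This is a sketch under stated assumptions: the population of 1,000 units with prevalence 0.05, the choice m = 5, the seed and the function names are all hypothetical.

```python
import random

def inverse_sample_with_replacement(population, m, rng):
    """Draw units with replacement until m rare units (coded 1) are seen;
    return the number of draws n that were needed."""
    n = 0
    rare = 0
    while rare < m:
        n += 1
        rare += rng.choice(population)
    return n

def haldane_estimate(m, n):
    """Haldane's estimator of the prevalence P."""
    return (m - 1) / (n - 1)

rng = random.Random(12345)
population = [1] * 50 + [0] * 950     # hypothetical population, P = 0.05
m = 5
estimates = [
    haldane_estimate(m, inverse_sample_with_replacement(population, m, rng))
    for _ in range(10000)
]
mean_estimate = sum(estimates) / len(estimates)
```

The average of the simulated estimates should lie close to the true prevalence, whereas the naive ratio m/n would overestimate it on average.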
However, the expression for the variance of $\hat{P}$ is complicated
and an unbiased estimator of the variance was not proposed. Finney
(1949: 223-234) showed that an unbiased estimator of the variance of
Haldane’s estimator is
$$\hat{V}(\hat{P}) = \frac{\hat{P}(1-\hat{P})}{n-2}. \tag{2.2}$$
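The unbiasedness of the estimators in (2.1) and (2.2) can be checked numerically. The sketch below assumes that, under sampling with replacement, the number of draws n follows a negative binomial distribution, and it sums over the support up to a truncation point where the remaining probability is negligible. The values P = 0.2 and m = 4, the truncation point and the function names are illustrative assumptions.

```python
from math import comb

def nb_pmf(n, m, p):
    """P(the m-th rare unit appears on draw n) under sampling with replacement."""
    return comb(n - 1, m - 1) * p ** m * (1 - p) ** (n - m)

def finney_variance_estimate(m, n):
    """Variance estimator (2.2) applied to Haldane's estimator (2.1)."""
    p_hat = (m - 1) / (n - 1)
    return p_hat * (1 - p_hat) / (n - 2)

p, m, n_max = 0.2, 4, 2000            # truncation at n_max leaves a negligible tail
support = range(m, n_max + 1)
mean_p_hat = sum((m - 1) / (n - 1) * nb_pmf(n, m, p) for n in support)
var_p_hat = sum(((m - 1) / (n - 1) - p) ** 2 * nb_pmf(n, m, p) for n in support)
mean_var_hat = sum(finney_variance_estimate(m, n) * nb_pmf(n, m, p) for n in support)
```

Here `mean_p_hat` reproduces P and `mean_var_hat` reproduces the true variance `var_p_hat`. The variance estimator needs m of at least 3 so that n - 2 is always positive.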
Mikulski and Smith (1976: 216-217) gave the upper and lower bounds
of the variance
of Haldane’s unbiased estimator as
≤ ≤ − +
, (2.3)
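where $Q = 1 - P$. When a sample is drawn by inverse random sampling
with replacement, Lan (1999) proved that an unbiased estimator of the
population mean $\bar{Y}$ of a study variable y in the whole population
is
$$\bar{y} = \hat{P}\,\bar{y}_C + (1-\hat{P})\,\bar{y}_{\bar{C}}, \tag{2.4}$$
with the variance as
$$V(\bar{y}) = (\bar{Y}_C - \bar{Y}_{\bar{C}})^2\,V(\hat{P}) + \frac{\sigma_C^2}{m}\,E(\hat{P}^2) + \frac{\sigma_{\bar{C}}^2}{m-1}\,E\left[\hat{P}(1-\hat{P})\right], \tag{2.5}$$
where $\bar{y}_C = \frac{1}{m}\sum_{i \in s_C} y_i$ and
$\bar{y}_{\bar{C}} = \frac{1}{n-m}\sum_{i \in s_{\bar{C}}} y_i$ are the
sample means of rare and non-rare units,
$\sigma_C^2 = \frac{1}{M}\sum_{i \in C}(y_i - \bar{Y}_C)^2$ is the
population variance of rare units and
$\sigma_{\bar{C}}^2 = \frac{1}{N-M}\sum_{i \in \bar{C}}(y_i - \bar{Y}_{\bar{C}})^2$
is the population variance of non-rare units.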
In addition, $C$ refers to the set of sampling units with rare
characteristics in the population, $\bar{C}$ is the set of sampling
units that do not possess rare characteristics in the population, $s_C$
is the set of sampled units with rare characteristics and $s_{\bar{C}}$
is the set of sampled units that do not possess rare characteristics.
An unbiased estimator of $V(\bar{y})$ was also given as
$$\hat{V}(\bar{y}) = (\bar{y}_C - \bar{y}_{\bar{C}})^2\,\hat{V}(\hat{P}) + \frac{s_C^2}{m}\,\hat{P}^{*2} + \frac{s_{\bar{C}}^2}{m-1}\left(\hat{P} - \frac{m-1}{m-2}\,\hat{P}^{*2}\right), \tag{2.6}$$
where $\hat{P}^{*2} = \frac{(m-1)(m-2)}{(n-1)(n-2)}$,
$s_C^2 = \frac{1}{m-1}\sum_{i \in s_C}(y_i - \bar{y}_C)^2$, the sample
variance of rare units, and
$s_{\bar{C}}^2 = \frac{1}{n-m-1}\sum_{i \in s_{\bar{C}}}(y_i - \bar{y}_{\bar{C}})^2$,
the sample variance of non-rare units.
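The estimator of the population mean in (2.4) weights the sample mean of the rare units by (m - 1)/(n - 1) and the sample mean of the non-rare units by the complement. The simulation below is a sketch of this estimator only (the variance estimator is not implemented). The population, the threshold defining a rare unit, the seed and the function names are hypothetical.

```python
import random

def lan_mean_estimate(rare_values, other_values):
    """Weighted combination of the two sample means, as in (2.4)."""
    m = len(rare_values)
    n = m + len(other_values)
    p_hat = (m - 1) / (n - 1)
    mean_rare = sum(rare_values) / m
    if not other_values:              # n == m, so p_hat equals 1
        return mean_rare
    mean_other = sum(other_values) / len(other_values)
    return p_hat * mean_rare + (1 - p_hat) * mean_other

def draw_until_m_rare(population, is_rare, m, rng):
    """Inverse sampling with replacement: stop at the m-th rare unit."""
    rare_values, other_values = [], []
    while len(rare_values) < m:
        y = rng.choice(population)
        (rare_values if is_rare(y) else other_values).append(y)
    return rare_values, other_values

rng = random.Random(2010)
population = [10.0] * 20 + [1.0] * 180    # hypothetical; rare units have y >= 5
true_mean = sum(population) / len(population)
estimates = [
    lan_mean_estimate(*draw_until_m_rare(population, lambda y: y >= 5, 3, rng))
    for _ in range(20000)
]
average_estimate = sum(estimates) / len(estimates)
```

The average of the simulated estimates should be close to the true mean of 1.9 for this population.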
For inverse random sampling without replacement, units are drawn one by
one with equal probabilities and without replacement. The sampling
continues until the sample contains m rare units, which is assumed to
occur at the n-th draw. In this case, Salehi and Seber (2001: 284) gave
an unbiased estimator of the prevalence of a characteristic, $P$, as
$$\hat{P} = \frac{m-1}{n-1}, \tag{2.7}$$
which is the same as (2.1). However, the variance of the estimator and
the estimator of that variance are different. An unbiased estimator of
the variance of $\hat{P}$ was also given as
$$\hat{V}(\hat{P}) = \left(1 - \frac{n-1}{N}\right)\frac{\hat{P}(1-\hat{P})}{n-2}. \tag{2.8}$$
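For a small population the properties of (2.7) and (2.8) can be verified exactly. The sketch below assumes that, without replacement, the draw on which the m-th rare unit appears follows a negative hypergeometric distribution. The values N = 10, M = 4 and m = 3 and the function names are illustrative assumptions.

```python
from math import comb

def neg_hypergeom_pmf(n, m, M, N):
    """P(the m-th rare unit appears on draw n) when sampling without
    replacement from N units of which M are rare."""
    return comb(n - 1, m - 1) * comb(N - n, M - m) / comb(N, M)

def prevalence_estimate(m, n):
    return (m - 1) / (n - 1)

def variance_estimate(m, n, N):
    """Variance estimator with the finite population factor 1 - (n - 1)/N."""
    p_hat = prevalence_estimate(m, n)
    return (1 - (n - 1) / N) * p_hat * (1 - p_hat) / (n - 2)

N, M, m = 10, 4, 3
support = range(m, N - M + m + 1)     # n can be at most N - M + m
pmf = {n: neg_hypergeom_pmf(n, m, M, N) for n in support}
mean_p_hat = sum(prevalence_estimate(m, n) * pmf[n] for n in support)
var_p_hat = sum((prevalence_estimate(m, n) - M / N) ** 2 * pmf[n] for n in support)
mean_var_hat = sum(variance_estimate(m, n, N) * pmf[n] for n in support)
```

The expectation of the prevalence estimator equals M/N = 0.4, and the expectation of the variance estimator equals the true variance.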
In addition, Salehi and Seber (2001: 284) also gave an unbiased
estimator of the population mean of a study variable, $\bar{Y}$, as
$$\bar{y} = \hat{P}\,\bar{y}_C + (1-\hat{P})\,\bar{y}_{\bar{C}}, \tag{2.9}$$
which has the same form as expression (2.4). The variance of $\bar{y}$
was also given as
( ) ( ) ( ) ( ) 2
2 2C C C
= − + −
m 1 N M − + − − − −
. (2.10)
They also proved that an unbiased estimator of the variance of y
is
( ) ( ) ( ) ( ) 22
2 * *C C C C
− = − + − + − − −
− + − − −
2.3.2 Estimation under Inverse PPS Sampling
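In inverse PPS sampling with replacement, the sampling units are drawn
with unequal probabilities $(z_i)$ and with replacement until the
sample contains m rare units. Greco and Naddeo (2007: 1041-1042) gave
an unbiased estimator of the total of a study variable in the whole
population,
$$\hat{Y} = \hat{P}\,\bar{y}'_C + (1-\hat{P})\,\bar{y}'_{\bar{C}}, \tag{2.12}$$
where $\hat{P} = \frac{m-1}{n-1}$, and $\bar{y}'_C$ and
$\bar{y}'_{\bar{C}}$ are the sample means of $y_i/z_i$ over the sampled
rare and non-rare units, respectively. The variance of $\hat{Y}$ for
the parameter $Y$ was also derived as
$$V(\hat{Y}) = (Y'_C - Y'_{\bar{C}})^2\,V(\hat{P}) + \frac{\sigma'^2_C}{m}\,E(\hat{P}^2) + \frac{\sigma'^2_{\bar{C}}}{m-1}\,E\left[\hat{P}(1-\hat{P})\right], \tag{2.13}$$
where $Y'_C = \sum_{i \in C} y_i \big/ \sum_{i \in C} z_i$,
$Y'_{\bar{C}} = \sum_{i \in \bar{C}} y_i \big/ \sum_{i \in \bar{C}} z_i$,
and $\sigma'^2_C$ and $\sigma'^2_{\bar{C}}$ are the corresponding
variances of $y_i/z_i$ among rare and non-rare units. An unbiased
estimator of the variance of the estimator $\hat{Y}$ was given as
$$\hat{V}(\hat{Y}) = (\bar{y}'_C - \bar{y}'_{\bar{C}})^2\,\hat{V}(\hat{P}) + \frac{s'^2_C}{m}\,\hat{P}^{*2} + \frac{s'^2_{\bar{C}}}{m-1}\left(\hat{P} - \frac{m-1}{m-2}\,\hat{P}^{*2}\right), \tag{2.14}$$
where $\hat{P}^{*2} = \frac{(m-1)(m-2)}{(n-1)(n-2)}$,
$s'^2_C = \frac{1}{m-1}\sum_{i \in s_C}\left(\frac{y_i}{z_i} - \bar{y}'_C\right)^2$,
$s'^2_{\bar{C}} = \frac{1}{n-m-1}\sum_{i \in s_{\bar{C}}}\left(\frac{y_i}{z_i} - \bar{y}'_{\bar{C}}\right)^2$
and $\hat{V}(\hat{P}) = \frac{\hat{P}(1-\hat{P})}{n-2}$.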
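The estimator of the total in (2.12) can be illustrated by simulation. The sketch assumes that the primed sample means are averages of y_i / z_i over the sampled rare and non-rare units, which is an interpretation rather than a quotation of the cited paper. Under that assumption (m - 1)/(n - 1) estimates the total draw probability of the rare class rather than M/N. The six-unit population, the probabilities, the seed and the function name are hypothetical.

```python
import random

def inverse_pps_total(ratios_rare, ratios_other):
    """Combine the mean of y/z over rare draws and over non-rare draws."""
    m = len(ratios_rare)
    n = m + len(ratios_other)
    p_hat = (m - 1) / (n - 1)
    mean_rare = sum(ratios_rare) / m
    if not ratios_other:              # n == m, so p_hat equals 1
        return mean_rare
    mean_other = sum(ratios_other) / len(ratios_other)
    return p_hat * mean_rare + (1 - p_hat) * mean_other

y = [12.0, 15.0, 2.0, 3.0, 1.0, 4.0]          # hypothetical values, total 37
z = [0.30, 0.20, 0.10, 0.15, 0.05, 0.20]      # draw probabilities, sum to 1
rare = [True, True, False, False, False, False]
m = 3
rng = random.Random(7)
estimates = []
for _ in range(20000):
    ratios_rare, ratios_other = [], []
    while len(ratios_rare) < m:               # stop at the m-th rare draw
        i = rng.choices(range(len(y)), weights=z, k=1)[0]
        (ratios_rare if rare[i] else ratios_other).append(y[i] / z[i])
    estimates.append(inverse_pps_total(ratios_rare, ratios_other))
average_total = sum(estimates) / len(estimates)
```

The average of the simulated estimates should be close to the population total of 37.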
In inverse PPS sampling without replacement, the sampling units are
drawn with unequal probabilities $(z_i)$ of selection among the
remaining units and without replacement until the sample contains m
units of interest. Usually, the Horvitz-Thompson estimator is used to
estimate the population total under PPS sampling without replacement.
The estimator depends on the first-order inclusion probabilities.
Unfortunately, under an inverse sampling scheme these probabilities
depend on an unknown parameter, $M$, the number of units of interest in
the population, so the Horvitz-Thompson estimator cannot easily be
applied in this sampling scheme. The problem of sampling with unequal
probabilities without replacement has received considerable attention
and Murthy’s estimator has also been used to derive an unbiased
estimator of the population total. Raj (1956: 269-284) gave an ordered
unbiased estimator of the population total. Murthy (1957: 379-390) used
an unordered process to obtain an unbiased estimator of the population
total under a fixed sample size sampling design.
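The probability of obtaining a sample s is denoted by $P(s)$. The
conditional probability of getting sample s, given that the i-th unit
is selected first, is given by $P(s \mid i)$. The notation
$P(s \mid i, j)$ refers to the probability of getting sample s given
that the i-th and j-th units are selected in any order in the first two
draws. The ratios $P(s \mid i)/P(s)$ and $P(s \mid i, j)/P(s)$ are used
to determine Murthy’s estimator and an unbiased estimator of its
variance. Murthy’s estimator of the population total is
$$\hat{Y}_M = \sum_{i=1}^{n} y_i\,\frac{P(s \mid i)}{P(s)}, \tag{2.15}$$
where n is the sample size. This estimator does not depend on the order
of selection. Its variance is
$$V(\hat{Y}_M) = \sum_{i=1}^{N}\sum_{j<i}\left(1 - \sum_{s \ni i,j}\frac{P(s \mid i)\,P(s \mid j)}{P(s)}\right)\left(\frac{y_i}{z_i} - \frac{y_j}{z_j}\right)^2 z_i z_j, \tag{2.16}$$
where $z_i$ is the probability of selecting unit i at the first draw,
and an unbiased estimator of the variance is
$$\hat{V}(\hat{Y}_M) = \frac{1}{P(s)^2}\sum_{i=1}^{n}\sum_{j<i}\left[P(s \mid i, j)\,P(s) - P(s \mid i)\,P(s \mid j)\right]\left(\frac{y_i}{z_i} - \frac{y_j}{z_j}\right)^2 z_i z_j. \tag{2.17}$$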
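A worked case may help to read Murthy's estimator (2.15). The sketch below assumes a fixed sample of two units drawn successively with probabilities proportional to z and without replacement, so that P(s | i) is the probability of drawing the other unit second. This is a simple fixed-size design chosen for illustration, not one of the inverse designs discussed in this chapter. The four-unit population and the function name are hypothetical.

```python
from itertools import combinations

def murthy_total_n2(i, j, y, z):
    """Murthy's estimator for the unordered sample {i, j}, together with
    the probability P(s) of obtaining that sample."""
    p_s_given_i = z[j] / (1 - z[i])       # P(s | i): unit i drawn first
    p_s_given_j = z[i] / (1 - z[j])       # P(s | j): unit j drawn first
    p_s = z[i] * p_s_given_i + z[j] * p_s_given_j
    estimate = (y[i] * p_s_given_i + y[j] * p_s_given_j) / p_s
    return estimate, p_s

y = [3.0, 8.0, 11.0, 18.0]    # hypothetical study values, total 40
z = [0.1, 0.2, 0.3, 0.4]      # hypothetical first-draw probabilities
expected_total = 0.0
probability_sum = 0.0
for i, j in combinations(range(len(y)), 2):
    estimate, p_s = murthy_total_n2(i, j, y, z)
    expected_total += p_s * estimate
    probability_sum += p_s
```

Summing over all six possible samples shows that the estimator is exactly unbiased for the total in this design.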
Salehi and Seber (2001) gave a direct proof that Murthy’s estimator
can be
applied in any sequential sampling design including inverse
sampling and some
adaptive sampling methods.
METHODOLOGY
In this chapter, the notation used is first set out. After this, some
definitions of probability sampling and properties of estimators are
given. Thirdly, the development of the sampling schemes and estimators
is described. The sampling designs considered are stratified inverse
random sampling and stratified inverse PPS sampling, both with and
without replacement. Finally, a method for comparing the properties of
the estimators under the proposed sampling designs is presented.
3.1 Notation
Let $U = \{u_1, u_2, \ldots, u_N\}$ denote a finite population of N
distinct and identifiable units. The elements $u_1, u_2, \ldots, u_N$
of U are called sampling units. Identifiability means that there exists
a one-to-one correspondence between the units and the integers
$1, 2, \ldots, N$. Let y be a study variable. The values of the study
variable in the population are denoted by $y_1, y_2, \ldots, y_N$. The
population is partitioned into two subpopulations $C$ and $\bar{C}$
with cardinality $M$ and $N - M$, respectively. It is assumed that $C$
and $\bar{C}$ are unknown before sampling, for example,
$C = \{u_i : y_i \geq b\}$ and $\bar{C} = \{u_i : y_i < b\}$, where b
is a given constant. The parameters to be estimated include the number
of units in $C$, the prevalence of units of interest, the mean of a
study variable of units in class $C$ and the mean of a study variable
of the whole population. The population is stratified into L strata.
Stratum h consists of $N_h$ sampling units and an unknown number
$M_h \geq 1$ of units of interest, for $h = 1, 2, \ldots, L$, so that
$N = \sum_{h=1}^{L} N_h$. Stratum h can be partitioned into $C_h$ and
$\bar{C}_h$ with cardinality $M_h$ and $N_h - M_h$, respectively, as in
figure 3.1.
Figure 3.1 The Study Population
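The following notation is used for stratified inverse sampling. For
stratum h:
$W_h = N_h / N$ : The stratum weight
$Y_{C_h} = \sum_{j \in C_h} y_{hj}$ : The total of a study variable of units in class $C_h$
$Y_{\bar{C}_h} = \sum_{j \in \bar{C}_h} y_{hj}$ : The total of a study variable of units in class $\bar{C}_h$
$\bar{Y}_{C_h} = \frac{1}{M_h}\sum_{j \in C_h} y_{hj}$ : The mean of a study variable in $C_h$
$\bar{Y}_{\bar{C}_h} = \frac{1}{N_h - M_h}\sum_{j \in \bar{C}_h} y_{hj}$ : The mean of a study variable in $\bar{C}_h$
$m_h$ : The number of units in a sample from class $C_h$
$n_h - m_h$ : The number of units in a sample from class $\bar{C}_h$
$s_h = (i_1, i_2, \ldots, i_{n_h})$ : The ordered sample from selection
The prevalence of units of interest is written as
$P = \sum_{h=1}^{L} M_h / N$. The mean of a study variable in the whole
population is $\bar{Y} = \frac{1}{N}\sum_{h=1}^{L}\sum_{j} y_{hj}$, and
the mean of a study variable of units in $C$ is denoted by
$\bar{Y}_C = \sum_{h=1}^{L}\sum_{j \in C_h} y_{hj} \Big/ \sum_{h=1}^{L} M_h$.
Note that the same notation is redefined in chapters 4 and 5, where it
carries different meanings.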
3.2 Definitions
Let $\mathcal{S}$ denote the collection of all possible samples from a
given finite population U. In order to make inferences about parameters
in the population, the sample must come from probability sampling, i.e.
it must be obtained by a known sampling design.
Definition 3.1 A sampling design¹, based on a population U, is a pair
$(\mathcal{S}, P)$ where P is a probability distribution on
$\mathcal{S}$ such that:
i) $P(s) > 0$ for all $s \in \mathcal{S}$
ii) For every unit in the population, there exists at least one sample
$s \in \mathcal{S}$ containing the unit
¹ The definition is due to Hedayat and Sinha (1991: 3)
Definition 3.2 Choosing a subset of the population according to a
probability sampling design is called probability sampling.
Definition 3.3 Let P be a probability distribution defined on
$\mathcal{S}$. An estimator $\hat{T}$ is unbiased for the parameter
$\theta$ with respect to a sampling design if
$$E(\hat{T}) = \sum_{s \in \mathcal{S}} \hat{T}(s)\,P(s) = \theta.$$
Definition 3.4 The mean squared error of the estimator $\hat{T}$ for
the parameter $\theta$ with respect to a sampling design is defined as
$$MSE(\hat{T}) = E\left(\hat{T}(s) - \theta\right)^2 = \sum_{s \in \mathcal{S}} \left(\hat{T}(s) - \theta\right)^2 P(s).$$
If $\hat{T}$ is an unbiased estimator of $\theta$ under the sampling
design, then the mean squared error of $\hat{T}$ is the variance.
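For a very small population the expectation and the mean squared error in Definitions 3.3 and 3.4 can be computed by listing every sample. The sketch below assumes simple random sampling without replacement of two units from a hypothetical three-unit population; the function names are illustrative.

```python
from itertools import combinations

def expectation(design, estimator):
    """Sum of estimator(s) * P(s) over all samples in the design."""
    return sum(prob * estimator(s) for s, prob in design.items())

def mean_squared_error(design, estimator, theta):
    """Sum of (estimator(s) - theta)^2 * P(s) over all samples."""
    return sum(prob * (estimator(s) - theta) ** 2 for s, prob in design.items())

y = {1: 1.0, 2: 2.0, 3: 6.0}                  # hypothetical population values
samples = list(combinations(y, 2))            # all samples of two distinct units
design = {s: 1 / len(samples) for s in samples}
theta = sum(y.values()) / len(y)              # population mean, equal to 3

def sample_mean(s):
    return sum(y[u] for u in s) / len(s)

bias = expectation(design, sample_mean) - theta
mse = mean_squared_error(design, sample_mean, theta)
```

The bias is zero, so the mean squared error coincides with the variance of the sample mean under this design.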
Definition 3.5 Two samples, $s_1$ and $s_2$, are said to be equivalent
if they both contain the same sampling units. For example,
$s_1 = \{u_1, u_1, u_2, u_3\}$ and $s_2 = \{u_1, u_2, u_3\}$ are
equivalent as they both lead to the inclusion of the first three
sampling units.
Definition 3.6 A partition of $\mathcal{S}$ into subsets of equivalent
samples is called a sufficient partition. Thus,
$\mathcal{S}_T = \{s_T\}$ is a sufficient partition if each $s_T$ of
$\mathcal{S}_T$ contains only equivalent samples. $s_T$ is called an
element of $\mathcal{S}_T$. It is desirable to express a sufficient
partition together with its probability measure as
$\mathcal{S}_T = \{s_T, P(s_T)\}$, where
$$P(s_T) = \sum_{s \in s_T} P(s).$$
Definition 3.7 A statistic $T(s)$ is said to be sufficient if the
partition, $\mathcal{S}_T$, induced by $T$ is sufficient.
Definition 3.8 Let $f(s)$ be a real-valued function defined on
$\mathcal{S}$, then the conditional expectation of $f(s)$, given a
sufficient partition $\mathcal{S}_T = \{s_T, P(s_T)\}$, is given by
$$E\left(f(s) \mid T\right) = \sum\nolimits_1 f(s)\,P(s) \Big/ \sum\nolimits_1 P(s),$$
where the summation, $\sum\nolimits_1$, is taken over the samples
$s \in s_T$. Note that $E(f(s) \mid T)$ is defined on $\mathcal{S}_T$.
This dissertation is concerned with stratified inverse sampling using
four different schemes, as shown in figure 3.2. For each sampling
scheme, the method of derivation of the estimators of the parameters of
interest is described.
Figure 3.2 Diagram of the four Study Sampling Schemes
In each sampling design, the parameters of interest are $P$, $\bar{Y}$
and $\bar{Y}_C$. When units are drawn with probability proportional to
size at each draw and without replacement, a PPS sample without
replacement is obtained. Since the selection probabilities change from
draw to draw, suitable estimators taking this aspect into account must
be devised. In order to estimate the total of the whole population, the
Horvitz-Thompson estimator can be used when the inclusion probabilities
of units are available. Unfortunately, inclusion probabilities under
inverse sampling depend on an unknown parameter $M$, the number of
units in class $C$, which means it is not easy to use the
Horvitz-Thompson estimator. However, Salehi and Seber (2001: 281-286)
proved that Murthy’s estimator can be applied to inverse sampling.
There are many sampling schemes developed for selecting units with PPS
sampling. The case of a sample size equal to 2 is of particular
interest because $\pi_i = 2z_i$. Not all sampling schemes are
appropriate for the development of inverse PPS sampling without
replacement. Of interest is the sampling scheme reported by Midzuno
(1952: 99-107) because it is easy to select units. With this sampling
scheme, Murthy’s estimators can be applied to get unbiased estimators
of the parameters of interest in each stratum.
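One reason this scheme combines conveniently with Murthy's estimator can be seen in the ordinary fixed-sample-size case, which is simpler than the inverse version developed in this dissertation. The short derivation below is a sketch under that fixed-size assumption, with $z_i$ denoting the initial selection probabilities and $n$ the fixed sample size.

```latex
% Fixed-size Midzuno scheme: first unit with probability z_i,
% remaining n-1 units by simple random sampling without replacement.
\begin{align*}
P(s \mid i) &= \binom{N-1}{n-1}^{-1}, \\
P(s) &= \sum_{k \in s} z_k \, P(s \mid k)
      = \binom{N-1}{n-1}^{-1} \sum_{k \in s} z_k, \\
\hat{Y}_M &= \sum_{i \in s} y_i \, \frac{P(s \mid i)}{P(s)}
           = \frac{\sum_{i \in s} y_i}{\sum_{k \in s} z_k}.
\end{align*}
```

In this fixed-size case Murthy's estimator reduces to a ratio of the sample sum of the study variable to the sample sum of the initial probabilities, so no inclusion probabilities are needed.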
3.3.1 Stratified Inverse Random Sampling
When a population is divided into L non-overlapping strata, stratum h
contains $N_h$ units and $M_h$ units of interest, where $N_h$ is
assumed known but $M_h$ is unknown, for $h = 1, 2, \ldots, L$. The
number of units in the sample from stratum h falling into $C_h$ is
denoted by $m_h$, which is assumed to be fixed in advance, with
$m = \sum_{h=1}^{L} m_h$.
Inverse random sampling with replacement is applied in each stratum
independently. Let $n_h$ be the number of units drawn in stratum h in
order to obtain $m_h$ units of interest. To find unbiased estimators of
$M$ and $P$, the estimator of $P_h$ given by Haldane (1945: 222) in
(2.1) is applied in each stratum. The results are combined using
weights so that the estimators of these parameters are unbiased. In
order to find an unbiased estimator of $\bar{Y}$, the estimator of
$\bar{Y}_h$ in (2.4) given by Lan (1999) is used in each stratum. The
results are combined using sampling weights to obtain an unbiased
estimator of the mean of a study variable in the whole population. To
find a ratio estimator of $\bar{Y}_C$, the parameter is written as the
ratio of two unknown parameters,
$Y_C = \sum_{h=1}^{L}\sum_{j \in C_h} y_{hj}$ and $M$, and is simply
estimated by the ratio of unbiased estimators of $Y_C$ and $M$. Since
sampling is carried out independently in the strata and the variances
of Haldane’s estimator and of Lan’s estimator are known for inverse
simple random sampling with replacement, the properties of the variance
of a sum of independent variables can be applied to derive the
variances of the estimators of $M$, $P$ and $\bar{Y}$. Unbiased
estimators of the variances are obtained by substituting the unbiased
estimators of variance given by expressions (2.2) and (2.6). The ratio
is a biased estimator of the parameter $\bar{Y}_C$. The bias and the
variance of the ratio estimator are obtained by using a linearization
method.
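The combination step for the prevalence can be sketched as follows. The code assumes the natural weighted form, in which the stratum estimates from (2.1) are weighted by N_h / N and the variance estimates from (2.2) by the squared weights. This form is implied by the independence of the strata but the exact expressions of the dissertation are given in later chapters. The stratum sizes, sample counts and function name are hypothetical.

```python
def stratified_prevalence(N_h, m_h, n_h):
    """Weighted combination of stratum estimates under inverse random
    sampling with replacement.

    Returns the estimate of P, the estimate of M and the estimated
    variance of the estimate of P.
    """
    N = sum(N_h)
    p_st = 0.0
    var_st = 0.0
    for Nh, mh, nh in zip(N_h, m_h, n_h):
        w = Nh / N                                   # stratum weight
        p_hat = (mh - 1) / (nh - 1)                  # estimator (2.1)
        v_hat = p_hat * (1 - p_hat) / (nh - 2)       # estimator (2.2)
        p_st += w * p_hat
        var_st += w ** 2 * v_hat                     # strata are independent
    return p_st, N * p_st, var_st

# Two hypothetical strata; five rare units were required in each.
p_st, m_hat, var_st = stratified_prevalence([1000, 4000], [5, 5], [41, 201])
```

With these inputs the stratum estimates are 0.10 and 0.02, giving an overall prevalence estimate of 0.036 and an estimated 180 units of interest.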
Inverse random sampling without replacement is also applied to draw the
units in a stratum, with selection carried out independently between
strata. Let $n_h$ be the final sample size in stratum h, where
$h = 1, 2, \ldots, L$. To obtain unbiased estimators of $M$ and $P$,
the estimator of $P_h$ in (2.7) given by Salehi and Seber (2001: 284)
is applied in each stratum. The results are then combined using weights
so that the estimators are unbiased for the parameters $M$ and $P$. To
find an unbiased estimator of $\bar{Y}$, the estimator of $\bar{Y}_h$
given in (2.9) is applied in every stratum and sampling weights are
used to combine the estimators. A ratio of two unbiased estimators of
the parameters $Y_C$ and $M$ is considered for estimating the mean of a
study variable of units in a class of interest. Since the samples from
distinct strata are independent and the variances of the estimators of
$P_h$ and $\bar{Y}_h$ under inverse simple random sampling without
replacement are known, the properties of the variance of a sum of
independent variables are applied in order to derive the variances of
the estimators of $M$, $P$ and $\bar{Y}$. Unbiased estimators of the
variances are obtained by substituting the unbiased estimators of the
variance in each stratum given by expressions (2.8) and (2.11). Since
the ratio is a biased estimator of the parameter $\bar{Y}_C$, both its
bias and its variance are obtained by using a linearization method.
Note that the estimators under inverse random sampling with and
without
replacement have similar expressions but the variances and
estimates of the variances
are distinct.
3.3.2 Stratified Inverse PPS Sampling
In stratified inverse PPS sampling with replacement, the units in stratum $h$ are drawn with unequal probabilities and with replacement until the sample contains $m_h$ units of interest, and the selections in different strata are independent, for $h = 1, 2, \ldots, L$. To obtain an unbiased estimator of $\bar{Y}$, the estimator of $\bar{Y}_h$ in (2.12), as given by Greco and Naddeo (2007: 1041-1042), is used in each stratum. The stratum estimates are combined using weights to obtain an unbiased estimator of the mean of the study variable in the whole population. The number of units in class $C$, $M$, is the total of a study variable that takes only the values 0 and 1. To make inferences about $M$ and $P$, we use

$$y'_{hj} = \begin{cases} 1, & j \in C_h \\ 0, & \text{otherwise.} \end{cases}$$

Replacing $y_{hj}$ in (2.9) by $y'_{hj}$ gives an estimate of $M_h$. The estimates of $M_h$ are combined using weights to obtain unbiased estimators of $M$ and $P$. A ratio of unbiased estimators of $Y_C$ and $M$ is considered for estimating $\bar{Y}_C$.
Since sampling is independent across strata and the variance of the estimator of $\bar{Y}_h$ under inverse PPS sampling with replacement is known, the variance of a sum of independent random variables can be used to derive the variance of the estimator of $\bar{Y}$. An unbiased estimator of this variance is obtained by substituting the unbiased variance estimator given by expression (2.14) in each stratum. To obtain the corresponding unbiased estimators for the parameters $M$ and $P$, we define

$$y'_{hj} = \begin{cases} 1, & j \in C_h \\ 0, & \text{otherwise} \end{cases}$$

and substitute it for $y_{hj}$. For the ratio estimator, both bias and variance are derived using a linearization method.
For stratified inverse PPS sampling without replacement, Midzuno's scheme is adapted to inverse sampling. Under the Midzuno scheme, the first unit of the sample is selected with an initial probability $z_{hj}$ and the remaining units are drawn with equal probabilities and without replacement. This scheme is applied in each stratum as stratified inverse PPS sampling without replacement. Under this method, Murthy's estimator in (2.15) can be applied to obtain unbiased estimators of the population total of the study variable in a stratum, and of the number and the total study value of the units in class $C_h$. These estimators are combined with weights to obtain unbiased estimators of $M$, $P$ and $\bar{Y}$. A ratio of unbiased estimators of $Y_C$ and $M$ is used for estimating $\bar{Y}_C$. In stratum $h$, unbiased variance estimators for the estimators of $M$, $P$ and $\bar{Y}$ are derived using Murthy's variance estimator in (2.17). Because the strata are sampled independently, unbiased estimators of the overall variances are obtained by substituting the unbiased variance estimator for each stratum. The bias and variance of the ratio estimator are derived using a linearization method.
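The selection step just described can be sketched in Python as follows. This is only an illustration under stated assumptions, not the dissertation's procedure: the function name `midzuno_inverse_draw`, its arguments, and the stopping rule (stop as soon as $m_h$ units of interest have been drawn) are hypothetical choices made here, and Murthy's estimator is not implemented.

```python
import random

def midzuno_inverse_draw(z, in_class, m, rng):
    """Draw unit indices from one stratum until m units of interest are found.

    z        : initial selection probabilities, one per unit (assumed to sum to 1)
    in_class : booleans, True when the unit belongs to the class of interest
    m        : required number of units of interest
    rng      : a random.Random instance
    """
    if m < 1 or m > sum(in_class):
        raise ValueError("m must be between 1 and the number of units of interest")
    units = list(range(len(z)))
    # First draw: unequal probabilities z (the Midzuno first step).
    first = rng.choices(units, weights=z, k=1)[0]
    sample = [first]
    found = int(in_class[first])
    # Remaining draws: equal probabilities, without replacement.
    remaining = [j for j in units if j != first]
    rng.shuffle(remaining)
    for j in remaining:
        if found >= m:  # assumed inverse-sampling stopping rule
            break
        sample.append(j)
        found += int(in_class[j])
    return sample
```

For example, `midzuno_inverse_draw([0.1, 0.2, 0.3, 0.4], [True, False, True, False], 2, random.Random(1))` returns a list of distinct unit indices whose last element is the second unit of interest drawn.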
3.4 Simulation Study
In order to compare the estimators from the sampling designs described above, both real-life and simulated data were used as populations for the study. The variance, mean squared error and squared bias of the estimators from stratified inverse sampling were considered. The procedure for the simulation study is as follows:
Step 1 The population of size $N$ is partitioned into $L$ strata, and the number of sampled units ($m_h$) from the class $C_h$ is predetermined.
Step 2 A stratified inverse sampling scheme is used to draw units from the population, and the study values are observed. An estimate of a parameter of interest $\theta$, denoted by $\hat{\theta}_j$, is calculated. In this case, estimates of $\bar{Y}$, $M$, $P$ and $\bar{Y}_C$ are of interest.
Step 3 Step 2 is repeated 10,000 times.
Step 4 The estimates of the variance, $V(\hat{\theta})$, the mean squared error, $MSE(\hat{\theta})$, and the squared bias of $\hat{\theta}$, $B^2(\hat{\theta})$, are calculated from the 10,000 estimates, where
$$\bar{\hat{\theta}} = \frac{1}{10{,}000} \sum_{j=1}^{10{,}000} \hat{\theta}_j$$
is the mean of the estimates of $\theta$ from the 10,000 samples.
Stratified inverse sampling with replacement is compared to
stratified inverse
random sampling without replacement. Following this, stratified
inverse PPS
sampling with replacement is compared to stratified inverse PPS
sampling without
replacement. Finally, a comparison of stratified inverse random
sampling and
stratified inverse PPS sampling is made.
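The summary measures of Step 4 can be sketched in Python as below. The function name `monte_carlo_summary` is hypothetical, and the divisor (the number of replications $r$, rather than $r - 1$) is an assumption made here, since the exact formulas are not reproduced above; with this divisor the mean squared error equals the variance plus the squared bias.

```python
def monte_carlo_summary(estimates, theta):
    """Return (variance, MSE, squared bias) of simulated estimates of theta.

    estimates : list of estimates, one per replication
    theta     : true parameter value computed from the population
    """
    r = len(estimates)
    mean_est = sum(estimates) / r
    # Assumed divisor r (not r - 1), so that MSE = variance + squared bias.
    variance = sum((e - mean_est) ** 2 for e in estimates) / r
    mse = sum((e - theta) ** 2 for e in estimates) / r
    sq_bias = (mean_est - theta) ** 2
    return variance, mse, sq_bias
```

In a full simulation, `estimates` would hold the 10,000 values produced by repeating Step 2.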
CHAPTER 4
4.1 Introduction
In stratified inverse sampling, the population of $N$ units with $M$ units of interest is first divided into subpopulations of $N_1, N_2, \ldots, N_L$ units containing $M_1, M_2, \ldots, M_L$ units of interest, respectively. These subpopulations are non-overlapping and together comprise the whole population; they are referred to as strata. The following symbols are used in constructing the theory of stratified inverse random sampling. For stratum $h$, $h = 1, 2, \ldots, L$:
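$\sigma^2_{Ch} = \frac{1}{M_h} \sum_{j \in C_h} (y_{hj} - \bar{Y}_{Ch})^2$ : The true variance of a unit in class $C_h$
$\sigma^2_{\bar{C}h} = \frac{1}{N_h - M_h} \sum_{j \in \bar{C}_h} (y_{hj} - \bar{Y}_{\bar{C}h})^2$ : The true variance of a unit in class $\bar{C}_h$
$S^2_{Ch} = \frac{1}{M_h - 1} \sum_{j \in C_h} (y_{hj} - \bar{Y}_{Ch})^2$ : The variance of a unit in class $C_h$
$S^2_{\bar{C}h} = \frac{1}{N_h - M_h - 1} \sum_{j \in \bar{C}_h} (y_{hj} - \bar{Y}_{\bar{C}h})^2$ : The variance of a unit in class $\bar{C}_h$
$\sigma_{Ch}(x, y) = \frac{1}{M_h} \sum_{j \in C_h} (y_{hj} - \bar{Y}_{Ch})(x_{hj} - \bar{X}_{Ch})$ : The true covariance between $x$ and $y$ of a unit in class $C_h$
$\sigma_{\bar{C}h}(x, y) = \frac{1}{N_h - M_h} \sum_{j \in \bar{C}_h} (y_{hj} - \bar{Y}_{\bar{C}h})(x_{hj} - \bar{X}_{\bar{C}h})$ : The true covariance between $x$ and $y$ of a unit in class $\bar{C}_h$
$S_{Ch}(x, y) = \frac{1}{M_h - 1} \sum_{j \in C_h} (y_{hj} - \bar{Y}_{Ch})(x_{hj} - \bar{X}_{Ch})$ : The covariance between $x$ and $y$ of a unit in class $C_h$
$S_{\bar{C}h}(x, y) = \frac{1}{N_h - M_h - 1} \sum_{j \in \bar{C}_h} (y_{hj} - \bar{Y}_{\bar{C}h})(x_{hj} - \bar{X}_{\bar{C}h})$ : The covariance between $x$ and $y$ of a unit in class $\bar{C}_h$
$\bar{y}_{Ch} = \frac{1}{m_h} \sum_{j \in s_{Ch}} y_{hj}$ : The sample mean of units in $s_{Ch}$
$\bar{y}_{\bar{C}h} = \frac{1}{n_h - m_h} \sum_{j \in s_{\bar{C}h}} y_{hj}$ : The sample mean of units in $s_{\bar{C}h}$
$s^2_{Ch} = \frac{1}{m_h - 1} \sum_{j \in s_{Ch}} (y_{hj} - \bar{y}_{Ch})^2$ : The sample variance of units in $s_{Ch}$
$s^2_{\bar{C}h} = \frac{1}{n_h - m_h - 1} \sum_{j \in s_{\bar{C}h}} (y_{hj} - \bar{y}_{\bar{C}h})^2$ : The sample variance of units in $s_{\bar{C}h}$
$s_{Ch}(x, y) = \frac{1}{m_h - 1} \sum_{j \in s_{Ch}} (y_{hj} - \bar{y}_{Ch})(x_{hj} - \bar{x}_{Ch})$ : The sample covariance of units in $s_{Ch}$
$s_{\bar{C}h}(x, y) = \frac{1}{n_h - m_h - 1} \sum_{j \in s_{\bar{C}h}} (y_{hj} - \bar{y}_{\bar{C}h})(x_{hj} - \bar{x}_{\bar{C}h})$ : The sample covariance of units in $s_{\bar{C}h}$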
Stratification is a common technique used with large populations. The principal reasons for using it are as follows (Cochran, 1977: 89): (1) If estimates of known precision are wanted for certain subpopulations, it is advisable to treat each subpopulation as a population in its own right. (2) Administrative convenience may dictate the use of stratification. (3) Sampling problems may differ markedly in distinct parts of a population. (4) Stratification may provide a gain in precision in the estimates of parameters of interest.
4.2.1 The Sampling Scheme
In each stratum, units are selected with equal probabilities and with replacement until the sample contains $m_h$ units from $C_h$, and the selections in distinct strata are independent. Let $n_h$ be the sample size in stratum $h$; note that $n_h$ is a random variable. With this sampling scheme, any unit may be drawn more than once. A sample can be represented as $s = (s_1, s_2, \ldots, s_L)$. The probability of obtaining each sample is

$$p(s) = \prod_{h=1}^{L} \left( \frac{1}{N_h} \right)^{n_h},$$

where $s_h$ contains the $m_h$-th unit from $C_h$ at the $n_h$-th draw. Sample $s_h$ can be divided into $s_{Ch}$ and $s_{\bar{C}h}$, the units drawn from classes $C_h$ and $\bar{C}_h$, respectively. This sampling scheme is called stratified inverse random sampling with replacement.
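The scheme can be illustrated with a short Python sketch. The function names and the representation of a stratum as a list of class-membership flags are assumptions made for illustration only.

```python
import random

def inverse_srswr(in_class, m, rng):
    """Draw unit indices from one stratum, with replacement and equal
    probabilities, until m units of interest have been drawn."""
    if m < 1 or not any(in_class):
        raise ValueError("need m >= 1 and at least one unit of interest")
    draws, found = [], 0
    while found < m:
        j = rng.randrange(len(in_class))  # each unit has probability 1/N_h
        draws.append(j)
        found += int(in_class[j])
    return draws

def stratified_inverse_srswr(strata, m_list, seed=0):
    """Apply the scheme independently in every stratum.

    strata : list of lists of booleans (True = unit belongs to class C_h)
    m_list : required number of units of interest per stratum
    """
    rng = random.Random(seed)
    return [inverse_srswr(c, m, rng) for c, m in zip(strata, m_list)]
```

The length of each returned list is the random sample size $n_h$, and its last element is always a unit of interest.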
Unbiased estimators of $Y$, $\bar{Y}$, $M$ and $P = M/N$, as well as a biased estimator of $\bar{Y}_C$, under stratified inverse random sampling with replacement are considered here.
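Theorem 4.1 Under stratified inverse random sampling with replacement, an unbiased estimator of the population total of a study variable, $Y$, is

$$\hat{Y}_{st} = \sum_{h=1}^{L} N_h \left[ \hat{P}_h \bar{y}_{Ch} + (1 - \hat{P}_h)\, \bar{y}_{\bar{C}h} \right], \tag{4.1}$$

where $\bar{y}_{Ch} = \frac{1}{m_h} \sum_{j \in s_{Ch}} y_{hj}$ and $\bar{y}_{\bar{C}h} = \frac{1}{n_h - m_h} \sum_{j \in s_{\bar{C}h}} y_{hj}$. The variance of $\hat{Y}_{st}$ is

$$V(\hat{Y}_{st}) = \sum_{h=1}^{L} N_h^2 \left[ (\bar{Y}_{Ch} - \bar{Y}_{\bar{C}h})^2 V(\hat{P}_h) + E(\hat{P}_h^2) \frac{\sigma^2_{Ch}}{m_h} + E\big(\hat{P}_h (1 - \hat{P}_h)\big) \frac{\sigma^2_{\bar{C}h}}{m_h - 1} \right], \tag{4.2}$$

and an unbiased estimator of the variance is

$$\hat{V}(\hat{Y}_{st}) = \sum_{h=1}^{L} N_h^2 \left[ (\bar{y}_{Ch} - \bar{y}_{\bar{C}h})^2 \hat{V}(\hat{P}_h) + \hat{P}^*_h \frac{s^2_{Ch}}{m_h} + \left( \hat{P}_h - \frac{m_h - 1}{m_h - 2} \hat{P}^*_h \right) \frac{s^2_{\bar{C}h}}{m_h - 1} \right], \tag{4.3}$$

where $\hat{V}(\hat{P}_h) = \frac{\hat{P}_h (1 - \hat{P}_h)}{n_h - 2}$ and $\hat{P}^*_h = \frac{(m_h - 1)(m_h - 2)}{(n_h - 1)(n_h - 2)}$.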
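Proof: The notations $E_2$ and $V_2$ denote the conditional expectation and variance given the sample size $n_h$, and $E_1$ and $V_1$ denote the unconditional expectation and variance over $n_h$, respectively. Since the mean of the study variable of units in stratum $h$, $\bar{Y}_h = \frac{1}{N_h} \sum_{j=1}^{N_h} y_{hj}$, can be written as $\bar{Y}_h = P_h \bar{Y}_{Ch} + (1 - P_h) \bar{Y}_{\bar{C}h}$,

$$\begin{aligned}
E(\hat{Y}_{st}) &= \sum_{h=1}^{L} N_h E_1 E_2 \left[ \hat{P}_h \bar{y}_{Ch} + (1 - \hat{P}_h) \bar{y}_{\bar{C}h} \mid n_h \right] \\
&= \sum_{h=1}^{L} N_h E_1 \left[ \hat{P}_h \bar{Y}_{Ch} + (1 - \hat{P}_h) \bar{Y}_{\bar{C}h} \right] \\
&= \sum_{h=1}^{L} N_h \left[ E(\hat{P}_h) \bar{Y}_{Ch} + (1 - E(\hat{P}_h)) \bar{Y}_{\bar{C}h} \right] \\
&= \sum_{h=1}^{L} N_h \left[ P_h \bar{Y}_{Ch} + (1 - P_h) \bar{Y}_{\bar{C}h} \right] = \sum_{h=1}^{L} N_h \bar{Y}_h = Y,
\end{aligned}$$

because $E_2(\bar{y}_{Ch} \mid n_h) = \bar{Y}_{Ch}$, $E_2(\bar{y}_{\bar{C}h} \mid n_h) = \bar{Y}_{\bar{C}h}$ and $E_1(\hat{P}_h) = P_h$.

Let $\bar{y}_h = \hat{P}_h \bar{y}_{Ch} + (1 - \hat{P}_h) \bar{y}_{\bar{C}h}$ and note that $E_1(\hat{P}^*_h) = P_h^2$. Since the strata are sampled independently,

$$\begin{aligned}
V(\hat{Y}_{st}) &= \sum_{h=1}^{L} N_h^2 V(\bar{y}_h) \\
&= \sum_{h=1}^{L} N_h^2 \left[ E_1 V_2(\bar{y}_h \mid n_h) + V_1 E_2(\bar{y}_h \mid n_h) \right] \\
&= \sum_{h=1}^{L} N_h^2 \left\{ E_1 \left[ \hat{P}_h^2 \frac{\sigma^2_{Ch}}{m_h} + (1 - \hat{P}_h)^2 \frac{\sigma^2_{\bar{C}h}}{n_h - m_h} \right] + V_1 \left[ \hat{P}_h \bar{Y}_{Ch} + (1 - \hat{P}_h) \bar{Y}_{\bar{C}h} \right] \right\} \\
&= \sum_{h=1}^{L} N_h^2 \left[ (\bar{Y}_{Ch} - \bar{Y}_{\bar{C}h})^2 V(\hat{P}_h) + E(\hat{P}_h^2) \frac{\sigma^2_{Ch}}{m_h} + E\big(\hat{P}_h (1 - \hat{P}_h)\big) \frac{\sigma^2_{\bar{C}h}}{m_h - 1} \right],
\end{aligned}$$

where the last step uses $(1 - \hat{P}_h)^2 / (n_h - m_h) = \hat{P}_h (1 - \hat{P}_h) / (m_h - 1)$, which holds for $\hat{P}_h = (m_h - 1)/(n_h - 1)$.

For the variance estimator,

$$E\big[\hat{V}(\hat{Y}_{st})\big] = \sum_{h=1}^{L} N_h^2 E_1 E_2 \left[ (\bar{y}_{Ch} - \bar{y}_{\bar{C}h})^2 \hat{V}(\hat{P}_h) + \hat{P}^*_h \frac{s^2_{Ch}}{m_h} + \left( \hat{P}_h - \frac{m_h - 1}{m_h - 2} \hat{P}^*_h \right) \frac{s^2_{\bar{C}h}}{m_h - 1} \,\Big|\, n_h \right].$$

When the sample size $n_h$ in stratum $h$ is given, $E_2(\bar{y}_{Ch} \bar{y}_{\bar{C}h} \mid n_h) = \bar{Y}_{Ch} \bar{Y}_{\bar{C}h}$, $E_2(\bar{y}^2_{\bar{C}h} \mid n_h) = \bar{Y}^2_{\bar{C}h} + \frac{\sigma^2_{\bar{C}h}}{n_h - m_h}$, $E_2(\bar{y}^2_{Ch} \mid n_h) = \bar{Y}^2_{Ch} + \frac{\sigma^2_{Ch}}{m_h}$, $E_2(s^2_{Ch} \mid n_h) = \sigma^2_{Ch}$ and $E_2(s^2_{\bar{C}h} \mid n_h) = \sigma^2_{\bar{C}h}$. It is known that $E\big[\hat{V}(\hat{P}_h)\big] = V(\hat{P}_h)$, then

$$\begin{aligned}
E\big[\hat{V}(\hat{Y}_{st})\big] &= \sum_{h=1}^{L} N_h^2 E_1 \left\{ (\bar{Y}_{Ch} - \bar{Y}_{\bar{C}h})^2 \hat{V}(\hat{P}_h) + \big[ \hat{P}^*_h + \hat{V}(\hat{P}_h) \big] \frac{\sigma^2_{Ch}}{m_h} + \left[ \left( \hat{P}_h - \frac{m_h - 1}{m_h - 2} \hat{P}^*_h \right) \frac{1}{m_h - 1} + \frac{\hat{V}(\hat{P}_h)}{n_h - m_h} \right] \sigma^2_{\bar{C}h} \right\} \\
&= \sum_{h=1}^{L} N_h^2 \left\{ (\bar{Y}_{Ch} - \bar{Y}_{\bar{C}h})^2 V(\hat{P}_h) + \big[ P_h^2 + V(\hat{P}_h) \big] \frac{\sigma^2_{Ch}}{m_h} + E_1 \left[ \left( \hat{P}_h - \frac{m_h - 1}{m_h - 2} \hat{P}^*_h \right) \frac{1}{m_h - 1} + \frac{\hat{V}(\hat{P}_h)}{n_h - m_h} \right] \sigma^2_{\bar{C}h} \right\} \\
&= \sum_{h=1}^{L} N_h^2 \left[ (\bar{Y}_{Ch} - \bar{Y}_{\bar{C}h})^2 V(\hat{P}_h) + E(\hat{P}_h^2) \frac{\sigma^2_{Ch}}{m_h} + E\big(\hat{P}_h (1 - \hat{P}_h)\big) \frac{\sigma^2_{\bar{C}h}}{m_h - 1} \right] = V(\hat{Y}_{st}).
\end{aligned}$$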
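As a numerical illustration of the estimator (4.1) and the variance estimator (4.3) in theorem 4.1, the following Python sketch computes both from the sampled values in each stratum. It is a sketch under assumptions: the function names are hypothetical, and $\hat{P}_h$ is taken to be $(m_h - 1)/(n_h - 1)$, the usual unbiased estimator under inverse sampling with replacement, since expression (2.7) is not reproduced here.

```python
def stratum_terms(y_c, y_cbar, N_h):
    """Contribution of one stratum to the total estimate and its variance estimate.

    y_c    : study values of the m_h sampled units of interest
    y_cbar : study values of the n_h - m_h other sampled units
    N_h    : stratum size
    """
    m = len(y_c)
    n = m + len(y_cbar)
    if m < 3 or len(y_cbar) < 2:
        raise ValueError("sketch needs m_h >= 3 and n_h - m_h >= 2")
    p = (m - 1) / (n - 1)                     # assumed form of P-hat_h
    v_p = p * (1 - p) / (n - 2)               # estimated variance of P-hat_h
    p_star = (m - 1) * (m - 2) / ((n - 1) * (n - 2))
    mean_c = sum(y_c) / m
    mean_cbar = sum(y_cbar) / (n - m)
    s2_c = sum((y - mean_c) ** 2 for y in y_c) / (m - 1)
    s2_cbar = sum((y - mean_cbar) ** 2 for y in y_cbar) / (n - m - 1)
    total = N_h * (p * mean_c + (1 - p) * mean_cbar)
    var = N_h ** 2 * (
        (mean_c - mean_cbar) ** 2 * v_p
        + p_star * s2_c / m
        + (p - (m - 1) / (m - 2) * p_star) * s2_cbar / (m - 1)
    )
    return total, var

def stratified_total(strata):
    """strata: list of (y_c, y_cbar, N_h) tuples; returns (total, variance estimate)."""
    parts = [stratum_terms(*s) for s in strata]
    return sum(t for t, _ in parts), sum(v for _, v in parts)
```

Dividing the total by $N$ and the variance estimate by $N^2$ gives the corresponding quantities for the population mean.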
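Corollary 4.1 Under stratified inverse random sampling with replacement, an unbiased estimator of the mean of a study variable in the whole population is

$$\bar{y}_{st} = \sum_{h=1}^{L} W_h \left[ \hat{P}_h \bar{y}_{Ch} + (1 - \hat{P}_h)\, \bar{y}_{\bar{C}h} \right], \tag{4.4}$$

where $W_h = N_h / N$. Its variance is

$$V(\bar{y}_{st}) = \sum_{h=1}^{L} W_h^2 \left[ (\bar{Y}_{Ch} - \bar{Y}_{\bar{C}h})^2 V(\hat{P}_h) + E(\hat{P}_h^2) \frac{\sigma^2_{Ch}}{m_h} + E\big(\hat{P}_h (1 - \hat{P}_h)\big) \frac{\sigma^2_{\bar{C}h}}{m_h - 1} \right], \tag{4.5}$$

and an unbiased estimator of the variance is

$$\hat{V}(\bar{y}_{st}) = \sum_{h=1}^{L} W_h^2 \left[ (\bar{y}_{Ch} - \bar{y}_{\bar{C}h})^2 \hat{V}(\hat{P}_h) + \hat{P}^*_h \frac{s^2_{Ch}}{m_h} + \left( \hat{P}_h - \frac{m_h - 1}{m_h - 2} \hat{P}^*_h \right) \frac{s^2_{\bar{C}h}}{m_h - 1} \right]. \tag{4.6}$$

Proof: Since $\bar{y}_{st} = \frac{1}{N} \hat{Y}_{st}$, we have $E(\bar{y}_{st}) = \frac{1}{N} E(\hat{Y}_{st}) = \frac{Y}{N} = \bar{Y}$, $V(\bar{y}_{st}) = \frac{1}{N^2} V(\hat{Y}_{st})$ and $E\big[\hat{V}(\bar{y}_{st})\big] = \frac{1}{N^2} E\big[\hat{V}(\hat{Y}_{st})\big]$.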
Under this sampling scheme, consider the problem of estimating the ratio of the totals of two study variables $y$ and $x$,

$$R = \frac{\sum_{h=1}^{L} \sum_{j=1}^{N_h} y_{hj}}{\sum_{h=1}^{L} \sum_{j=1}^{N_h} x_{hj}}.$$

A ratio of unbiased estimators of the numerator and the denominator is considered. In a similar manner, we obtain $\hat{X}_{st} = \sum_{h=1}^{L} N_h \left[ \hat{P}_h \bar{x}_{Ch} + (1 - \hat{P}_h)\, \bar{x}_{\bar{C}h} \right]$ as an unbiased estimator of $X = \sum_{h=1}^{L} \sum_{j=1}^{N_h} x_{hj}$.
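Theorem 4.2 Let $(x_{hj}, y_{hj})$ be the values of two variables $x$ and $y$ associated with unit $j$ in stratum $h$, for $h = 1, 2, \ldots, L$. If $\hat{Y}_{st} = \sum_{h=1}^{L} N_h \left[ \hat{P}_h \bar{y}_{Ch} + (1 - \hat{P}_h)\, \bar{y}_{\bar{C}h} \right]$ and $\hat{X}_{st} = \sum_{h=1}^{L} N_h \left[ \hat{P}_h \bar{x}_{Ch} + (1 - \hat{P}_h)\, \bar{x}_{\bar{C}h} \right]$, then under stratified inverse random sampling with replacement,

$$\mathrm{Cov}(\hat{X}_{st}, \hat{Y}_{st}) = \sum_{h=1}^{L} N_h^2 \left[ (\bar{Y}_{Ch} - \bar{Y}_{\bar{C}h})(\bar{X}_{Ch} - \bar{X}_{\bar{C}h}) V(\hat{P}_h) + E(\hat{P}_h^2) \frac{\sigma_{Ch}(x, y)}{m_h} + E\big(\hat{P}_h (1 - \hat{P}_h)\big) \frac{\sigma_{\bar{C}h}(x, y)}{m_h - 1} \right], \tag{4.7}$$

and an unbiased estimator of the covariance is

$$\widehat{\mathrm{Cov}}(\hat{X}_{st}, \hat{Y}_{st}) = \sum_{h=1}^{L} N_h^2 \left[ (\bar{y}_{Ch} - \bar{y}_{\bar{C}h})(\bar{x}_{Ch} - \bar{x}_{\bar{C}h}) \hat{V}(\hat{P}_h) + \hat{P}^*_h \frac{s_{Ch}(x, y)}{m_h} + \left( \hat{P}_h - \frac{m_h - 1}{m_h - 2} \hat{P}^*_h \right) \frac{s_{\bar{C}h}(x, y)}{m_h - 1} \right],$$

where $\hat{V}(\hat{P}_h) = \frac{\hat{P}_h (1 - \hat{P}_h)}{n_h - 2}$ and $\hat{P}^*_h = \frac{(m_h - 1)(m_h - 2)}{(n_h - 1)(n_h - 2)}$.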
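Proof: The notations $E_2$ and $\mathrm{Cov}_2$ refer to the conditional expectation and covariance given the sample size $n_h$, and $E_1$ and $\mathrm{Cov}_1$ to the unconditional expectation and covariance over the sample size $n_h$. Let $\bar{y}_h = \hat{P}_h \bar{y}_{Ch} + (1 - \hat{P}_h) \bar{y}_{\bar{C}h}$ and $\bar{x}_h = \hat{P}_h \bar{x}_{Ch} + (1 - \hat{P}_h) \bar{x}_{\bar{C}h}$. Since the strata are sampled independently,

$$\begin{aligned}
\mathrm{Cov}(\hat{X}_{st}, \hat{Y}_{st}) &= \sum_{h=1}^{L} N_h^2\, \mathrm{Cov}(\bar{x}_h, \bar{y}_h) \\
&= \sum_{h=1}^{L} N_h^2 \left\{ E_1 \big[ \mathrm{Cov}_2(\bar{x}_h, \bar{y}_h \mid n_h) \big] + \mathrm{Cov}_1 \big[ E_2(\bar{x}_h \mid n_h), E_2(\bar{y}_h \mid n_h) \big] \right\} \\
&= \sum_{h=1}^{L} N_h^2 \left\{ E_1 \left[ \hat{P}_h^2 \frac{\sigma_{Ch}(x, y)}{m_h} + (1 - \hat{P}_h)^2 \frac{\sigma_{\bar{C}h}(x, y)}{n_h - m_h} \right] + \mathrm{Cov}_1 \left[ \hat{P}_h \bar{Y}_{Ch} + (1 - \hat{P}_h) \bar{Y}_{\bar{C}h},\ \hat{P}_h \bar{X}_{Ch} + (1 - \hat{P}_h) \bar{X}_{\bar{C}h} \right] \right\} \\
&= \sum_{h=1}^{L} N_h^2 \left[ (\bar{Y}_{Ch} - \bar{Y}_{\bar{C}h})(\bar{X}_{Ch} - \bar{X}_{\bar{C}h}) V(\hat{P}_h) + E(\hat{P}_h^2) \frac{\sigma_{Ch}(x, y)}{m_h} + E\big(\hat{P}_h (1 - \hat{P}_h)\big) \frac{\sigma_{\bar{C}h}(x, y)}{m_h - 1} \right].
\end{aligned}$$

For the covariance estimator,

$$E\big[\widehat{\mathrm{Cov}}(\hat{X}_{st}, \hat{Y}_{st})\big] = \sum_{h=1}^{L} N_h^2 E_1 E_2 \left[ (\bar{y}_{Ch} - \bar{y}_{\bar{C}h})(\bar{x}_{Ch} - \bar{x}_{\bar{C}h}) \hat{V}(\hat{P}_h) + \hat{P}^*_h \frac{s_{Ch}(x, y)}{m_h} + \left( \hat{P}_h - \frac{m_h - 1}{m_h - 2} \hat{P}^*_h \right) \frac{s_{\bar{C}h}(x, y)}{m_h - 1} \,\Big|\, n_h \right].$$

When the sample size $n_h$ is given, $E_2(\bar{y}_{Ch} \bar{y}_{\bar{C}h} \mid n_h) = \bar{Y}_{Ch} \bar{Y}_{\bar{C}h}$, $E_2(\bar{x}_{Ch} \bar{x}_{\bar{C}h} \mid n_h) = \bar{X}_{Ch} \bar{X}_{\bar{C}h}$, $E_2(\bar{x}_{Ch} \bar{y}_{\bar{C}h} \mid n_h) = \bar{X}_{Ch} \bar{Y}_{\bar{C}h}$, $E_2(\bar{x}_{\bar{C}h} \bar{y}_{Ch} \mid n_h) = \bar{X}_{\bar{C}h} \bar{Y}_{Ch}$, $E_2(\bar{x}_{Ch} \bar{y}_{Ch} \mid n_h) = \bar{X}_{Ch} \bar{Y}_{Ch} + \frac{\sigma_{Ch}(x, y)}{m_h}$, $E_2(\bar{x}_{\bar{C}h} \bar{y}_{\bar{C}h} \mid n_h) = \bar{X}_{\bar{C}h} \bar{Y}_{\bar{C}h} + \frac{\sigma_{\bar{C}h}(x, y)}{n_h - m_h}$, $E_2\big(s_{Ch}(x, y) \mid n_h\big) = \sigma_{Ch}(x, y)$ and $E_2\big(s_{\bar{C}h}(x, y) \mid n_h\big) = \sigma_{\bar{C}h}(x, y)$. Consequently,

$$\begin{aligned}
E\big[\widehat{\mathrm{Cov}}(\hat{X}_{st}, \hat{Y}_{st})\big] &= \sum_{h=1}^{L} N_h^2 E_1 \left\{ (\bar{Y}_{Ch} - \bar{Y}_{\bar{C}h})(\bar{X}_{Ch} - \bar{X}_{\bar{C}h}) \hat{V}(\hat{P}_h) + \big[ \hat{P}^*_h + \hat{V}(\hat{P}_h) \big] \frac{\sigma_{Ch}(x, y)}{m_h} + \left[ \left( \hat{P}_h - \frac{m_h - 1}{m_h - 2} \hat{P}^*_h \right) \frac{1}{m_h - 1} + \frac{\hat{V}(\hat{P}_h)}{n_h - m_h} \right] \sigma_{\bar{C}h}(x, y) \right\} \\
&= \sum_{h=1}^{L} N_h^2 \left\{ (\bar{Y}_{Ch} - \bar{Y}_{\bar{C}h})(\bar{X}_{Ch} - \bar{X}_{\bar{C}h}) V(\hat{P}_h) + \big[ P_h^2 + V(\hat{P}_h) \big] \frac{\sigma_{Ch}(x, y)}{m_h} + E_1 \left[ \left( \hat{P}_h - \frac{m_h - 1}{m_h - 2} \hat{P}^*_h \right) \frac{1}{m_h - 1} + \frac{\hat{V}(\hat{P}_h)}{n_h - m_h} \right] \sigma_{\bar{C}h}(x, y) \right\} \\
&= \sum_{h=1}^{L} N_h^2 \left[ (\bar{Y}_{Ch} - \bar{Y}_{\bar{C}h})(\bar{X}_{Ch} - \bar{X}_{\bar{C}h}) V(\hat{P}_h) + E(\hat{P}_h^2) \frac{\sigma_{Ch}(x, y)}{m_h} + E\big(\hat{P}_h (1 - \hat{P}_h)\big) \frac{\sigma_{\bar{C}h}(x, y)}{m_h - 1} \right].
\end{aligned}$$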
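Theorem 4.3 The bias of the ratio estimator $\hat{R}_{st} = \hat{Y}_{st} / \hat{X}_{st}$ of $R = \sum_{h=1}^{L} \sum_{j=1}^{N_h} y_{hj} \Big/ \sum_{h=1}^{L} \sum_{j=1}^{N_h} x_{hj}$ satisfies

$$B(\hat{R}_{st}) = -\frac{\mathrm{Cov}(\hat{R}_{st}, \hat{X}_{st})}{X} \quad \text{and} \quad \frac{|B(\hat{R}_{st})|}{\sqrt{V(\hat{R}_{st})}} \le CV(\hat{X}_{st}),$$

where $\hat{Y}_{st}$ and $\hat{X}_{st}$ are defined as in theorem 4.2. An approximate mean squared error of $\hat{R}_{st}$ is given by

$$MSE(\hat{R}_{st}) \approx \frac{1}{X^2} \left[ R^2 V(\hat{X}_{st}) + V(\hat{Y}_{st}) - 2R\, \mathrm{Cov}(\hat{X}_{st}, \hat{Y}_{st}) \right]. \tag{4.9}$$

Proof:

$$\mathrm{Cov}(\hat{R}_{st}, \hat{X}_{st}) = E(\hat{R}_{st} \hat{X}_{st}) - E(\hat{R}_{st}) E(\hat{X}_{st}) = E(\hat{Y}_{st}) - X E(\hat{R}_{st}) = Y - X E(\hat{R}_{st}).$$

Hence, $E(\hat{R}_{st}) = R - \frac{\mathrm{Cov}(\hat{R}_{st}, \hat{X}_{st})}{X}$, from which we get

$$B(\hat{R}_{st}) = E(\hat{R}_{st}) - R = -\frac{\mathrm{Cov}(\hat{R}_{st}, \hat{X}_{st})}{X},$$

$$|B(\hat{R}_{st})| = \frac{|\rho(\hat{R}_{st}, \hat{X}_{st})| \sqrt{V(\hat{R}_{st})} \sqrt{V(\hat{X}_{st})}}{X},$$

$$\frac{|B(\hat{R}_{st})|}{\sqrt{V(\hat{R}_{st})}} \le \frac{\sqrt{V(\hat{X}_{st})}}{X} = CV(\hat{X}_{st}),$$

where $\rho(\hat{R}_{st}, \hat{X}_{st})$ is the correlation coefficient between $\hat{R}_{st}$ and $\hat{X}_{st}$.

The parameter $R$ can be written as a function of two parameters. Let $h(c, d) = \frac{d}{c}$, so $R = \frac{Y}{X} = h(X, Y)$ and $\hat{R}_{st} = \frac{\hat{Y}_{st}}{\hat{X}_{st}} = h(\hat{X}_{st}, \hat{Y}_{st})$, where $\hat{Y}_{st}$ and $\hat{X}_{st}$ are unbiased estimators of the totals of variables $y$ and $x$ in the whole population, respectively. The partial derivatives are $\frac{\partial h}{\partial c} = -\frac{d}{c^2}$ and $\frac{\partial h}{\partial d} = \frac{1}{c}$. Evaluating them at $c = X$ and $d = Y$, we get $-\frac{Y}{X^2}$ and $\frac{1}{X}$, so the first-order Taylor expansion is

$$h(\hat{X}_{st}, \hat{Y}_{st}) \approx h(X, Y) + \frac{\partial h}{\partial c} \Big|_{(X, Y)} (\hat{X}_{st} - X) + \frac{\partial h}{\partial d} \Big|_{(X, Y)} (\hat{Y}_{st} - Y).$$

We have $\hat{R}_{st} - R \approx -\frac{Y}{X^2} (\hat{X}_{st} - X) + \frac{1}{X} (\hat{Y}_{st} - Y)$. An approximate mean squared error of $\hat{R}_{st}$ is

$$\begin{aligned}
MSE(\hat{R}_{st}) = E(\hat{R}_{st} - R)^2 &\approx E \left[ -\frac{Y}{X^2} (\hat{X}_{st} - X) + \frac{1}{X} (\hat{Y}_{st} - Y) \right]^2 \\
&= \frac{Y^2}{X^4} V(\hat{X}_{st}) + \frac{1}{X^2} V(\hat{Y}_{st}) - \frac{2Y}{X^3} \mathrm{Cov}(\hat{X}_{st}, \hat{Y}_{st}) \\
&= \frac{1}{X^2} \left[ R^2 V(\hat{X}_{st}) + V(\hat{Y}_{st}) - 2R\, \mathrm{Cov}(\hat{X}_{st}, \hat{Y}_{st}) \right].
\end{aligned}$$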
In order to estimate the mean squared error of $\hat{R}_{st}$, we can substitute the estimates $\hat{X}_{st}$, $\hat{Y}_{st}$, $\hat{V}(\hat{X}_{st})$, $\hat{V}(\hat{Y}_{st})$ and $\widehat{\mathrm{Cov}}(\hat{X}_{st}, \hat{Y}_{st})$ into the expression for $MSE(\hat{R}_{st})$. An estimator of $MSE(\hat{R}_{st})$ is then

$$\widehat{MSE}(\hat{R}_{st}) = \frac{1}{\hat{X}_{st}^2} \left[ \hat{R}_{st}^2 \hat{V}(\hat{X}_{st}) + \hat{V}(\hat{Y}_{st}) - 2 \hat{R}_{st}\, \widehat{\mathrm{Cov}}(\hat{X}_{st}, \hat{Y}_{st}) \right], \tag{4.10}$$

where $\hat{V}(\hat{X}_{st})$ and $\hat{V}(\hat{Y}_{st})$ are defined in theorem 4.1 and $\widehat{\mathrm{Cov}}(\hat{X}_{st}, \hat{Y}_{st})$ in theorem 4.2. The bias of $\hat{R}_{st}$ is negligible relative to its standard error provided that $CV(\hat{X}_{st})$ is less than 0.1 (Cochran, 1977: 158).
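A minimal Python sketch of the plug-in estimator (4.10) follows. The function name and argument names are hypothetical; the inputs are assumed to be the estimated totals together with their estimated variances and covariance, computed elsewhere.

```python
def ratio_mse_estimate(x_hat, y_hat, v_x, v_y, cov_xy):
    """Return the ratio estimate and its linearization-based MSE estimate.

    x_hat, y_hat : estimated totals of x and y
    v_x, v_y     : estimated variances of x_hat and y_hat
    cov_xy       : estimated covariance of x_hat and y_hat
    """
    if x_hat == 0:
        raise ValueError("the estimated total of x must be non-zero")
    r_hat = y_hat / x_hat
    mse_hat = (r_hat ** 2 * v_x + v_y - 2 * r_hat * cov_xy) / x_hat ** 2
    return r_hat, mse_hat
```

Because the formula is a first-order approximation, the resulting estimate can be negative in small samples when the estimated covariance is large; that case is not handled in this sketch.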
Next, estimators of the prevalence of units of interest, $P = \frac{M}{N}$, and of the mean of a study variable of the units in $C$, $\bar{Y}_C$, are given.
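Corollary 4.2 An unbiased estimator of the total of a study variable of units in $C$, $Y_C$, is $\hat{Y}_{Cst} = \sum_{h=1}^{L} N_h \hat{P}_h \bar{y}_{Ch}$ with variance

$$V(\hat{Y}_{Cst}) = \sum_{h=1}^{L} N_h^2 \left[ \bar{Y}_{Ch}^2 V(\hat{P}_h) + E(\hat{P}_h^2) \frac{\sigma^2_{Ch}}{m_h} \right].$$

An unbiased estimator of the variance of $\hat{Y}_{Cst}$ is

$$\hat{V}(\hat{Y}_{Cst}) = \sum_{h=1}^{L} N_h^2 \left[ \bar{y}_{Ch}^2 \hat{V}(\hat{P}_h) + \hat{P}^*_h \frac{s^2_{Ch}}{m_h} \right],$$

where $\hat{V}(\hat{P}_h) = \frac{\hat{P}_h (1 - \hat{P}_h)}{n_h - 2}$ and $\hat{P}^*_h = \frac{(m_h - 1)(m_h - 2)}{(n_h - 1)(n_h - 2)}$.
Proof: A simple proof is obtained by using a new variable

$$y'_{hj} = \begin{cases} y_{hj}, & j \in C_h \\ 0, & \text{otherwise.} \end{cases}$$

By substituting $y'_{hj}$ for $y_{hj}$ in theorem 4.1, the results are obtained.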
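Corollary 4.3 An unbiased estimator of the total number of units in class $C$, $M$, is $\hat{M}_{st} = \sum_{h=1}^{L} N_h \hat{P}_h$ with variance $V(\hat{M}_{st}) = \sum_{h=1}^{L} N_h^2 V(\hat{P}_h)$. An unbiased estimator of the variance of $\hat{M}_{st}$ is

$$\hat{V}(\hat{M}_{st}) = \sum_{h=1}^{L} N_h^2 \frac{\hat{P}_h (1 - \hat{P}_h)}{n_h - 2}.$$

Proof: Let $y_{hj} = 1$ if $j \in C_h$ and $y_{hj} = 0$ otherwise. By substituting these new values for $y_{hj}$ in expressions (4.1) to (4.3) in theorem 4.1, the results are obtained.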
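Corollary 4.4 An unbiased estimator of the prevalence, $P = \frac{M}{N}$, is given by $\hat{P}_{st} = \sum_{h=1}^{L} W_h \hat{P}_h$ with variance

$$V(\hat{P}_{st}) = \sum_{h=1}^{L} W_h^2 P_h^2 \left[ \frac{Q_h}{m_h} + \frac{2 Q_h^2}{m_h (m_h + 1)} + \frac{6 Q_h^3}{m_h (m_h + 1)(m_h + 2)} + \ldots \right], \tag{4.12}$$

where $Q_h = 1 - P_h$. An unbiased estimator of the variance of $\hat{P}_{st}$ is

$$\hat{V}(\hat{P}_{st}) = \sum_{h=1}^{L} W_h^2 \frac{\hat{P}_h (1 - \hat{P}_h)}{n_h - 2}.$$

Proof: Since $\hat{P}_h$ is an unbiased estimator of $P_h$, $E(\hat{P}_{st}) = \sum_{h=1}^{L} W_h E(\hat{P}_h) = \sum_{h=1}^{L} W_h P_h = P$. The variance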
h h h h h
+