STRATIFIED INVERSE SAMPLING

Fulfillment of the Requirements for the Degree of
Doctor of Philosophy (Statistics)
School of Applied Statistics
2010
ABSTRACT
Author Mr. Prayad Sangngam
Year 2010
This dissertation is concerned with stratified inverse sampling. Four
sampling schemes are considered, namely inverse random sampling with
replacement, inverse random sampling without replacement, inverse PPS sampling
with replacement and inverse PPS sampling without replacement. Unbiased
estimators of the mean of a study variable in the whole population, the number of
units in a class of interest and the prevalence of a characteristic are given together
with their unbiased variance estimators. Estimation of the mean per unit in the class of
rare units is also presented and the bound of its bias derived.
A simulation study was used to examine the properties of these sampling
designs. The results indicate that inverse sampling without replacement is
more efficient than inverse random sampling with replacement. Inverse PPS sampling
with replacement gave higher efficiencies to estimates than inverse random sampling
with replacement when the correlation coefficient between the auxiliary and study
variables is large. In addition, inverse PPS sampling without replacement is more
efficient than inverse random sampling without replacement when the correlation
coefficient between the auxiliary and study variables is high. When the number of
sampled units in a class of interest increases, the variance and mean squared error of
the estimator decrease.
ACKNOWLEDGEMENTS
I would like to express my thanks and appreciation to everyone who has given
me help in completing this dissertation. In particular, I am indebted to my advisor,
Professor Dr. Prachoom Suwattee, for always giving invaluable suggestions and
encouragement. With his guidance and support, I have completed the dissertation.
Thank you very much for making it all possible.
I also gratefully acknowledge the committee members, Associate Professor
Dr. Samruam Chongcharoen, Associate Professor Dr. Jirawan Jitthavech and
Associate Professor Dr. Montip Tiensuwan, for their constructive comments and
helpful advice. I would also like to thank Associate Professor Dr. Vichit
Lorchirachoonkul and Associate Professor Dr. Pachitjanut Siripanich for their
suggestions.
I sincerely offer appreciation to Silpakorn University and the National Institute
of Development Administration for their financial support during the study.
I am grateful to my friends and my colleagues for their cheerfulness,
friendship and support during the study. I am very grateful to Dr. John McMorris for
his kindness in editing my English, which has also made the manuscript more
readable.
Finally, I would like to acknowledge my mother, my sisters and my wife for
their help, support and cheerfulness. This dissertation is dedicated to my beloved
father who passed away recently.
Prayad Sangngam January 2011
CHAPTER 2 LITERATURE REVIEW
2.1 Introduction
2.2 Review of Sampling Designs for Rare Populations
2.2.1 Two-Stage Cluster Sampling
2.3 Estimation of Parameters in Inverse Sampling
2.3.1 Estimation under Inverse Simple Random Sampling
2.3.2 Estimation under Inverse PPS Sampling
3.3.1 Stratified Inverse Random Sampling
3.3.2 Stratified Inverse PPS Sampling
3.4 Simulation Study
4.1 Introduction
4.2.1 The Sampling Scheme
5.1 Introduction
6.2.1 Comparison of Estimates in Stratified Inverse
Random Sampling
Sampling
Random and PPS Sampling
CHAPTER 7 SUMMARY AND FUTURE RESEARCH
7.1 Summary and Conclusions
BIBLIOGRAPHY
APPENDICES
ESTIMATORS
BIOGRAPHY
Tables
4.1 The Values from the Units in the Stratified Inverse Random
Sample with Replacement
4.2 The Values from the Units in the Stratified Inverse Random
Sample without Replacement
5.1 The Values of a Study Variable and Initial Probabilities from
a Stratified Inverse PPS Sample with Replacement
5.2 The Values of a Study Variable from a Stratified Inverse PPS
Sample without Replacement
6.2 Averages of the Estimates from 10,000 Samples based on
four Stratified Inverse Sampling Designs under the Population
with $\rho = 0.5$
6.3 Estimates of Variances and Mean Square Errors of Estimators
from four Stratified Inverse Sampling Designs under the
Population with $\rho = 0.9$
6.4 Comparison of the Relative Efficiencies of the Estimates under
Stratified Inverse Random Sampling with and without
Replacement
Stratified Inverse PPS Sampling with and without Replacement
Stratified Inverse Random Sampling and Stratified Inverse PPS
Sampling with Replacement
Stratified Inverse Random Sampling without Replacement and
Stratified Inverse PPS Sampling without Replacement
6.8 Estimates of Variances and Mean Squared Errors of the Estimators
under the Life Population Study
LIST OF FIGURES
3.1 The Study Population
B.1 Frequency Distribution of the Estimator sty under Stratified
Inverse Random Sampling with Replacement
B.2 Frequency Distribution of the Estimator stP under Stratified
B.3 Frequency Distribution of the Estimator Csty under Stratified
Inverse Random Sampling without Replacement
Inverse PPS Sampling with Replacement
Inverse PPS Sampling without Replacement
1.1 Background
In some situations, the purpose of a sample survey is to estimate some
parameters that involve just a portion of the whole population. The problem becomes
more interesting when the portion is small. Kalton and Anderson (1986: 65) have
mentioned that the small portion might be as small as one tenth, one hundredth, one
thousandth, or even less of the whole population. In this situation, the population
might be subdivided into two classes, one of which is the subpopulation containing
only a small number of units of interest and the other containing the remaining units.
A population so described is called a rare population. A rare unit is sometimes defined
as a unit that belongs to the small subpopulation. Usually a sampling frame for
subpopulations does not exist, whereas a list of units in the whole population might be
available.
The development of a proper sampling design for a rare population is one of
the most challenging tasks confronting the statistician. Examples of rare populations
are the incidence of the AIDS virus in the human population and the counts of some
special plant in a quadrant within a given forest. Searching for units with rare
characteristics from a sampling frame is called screening. The sample size for
screening depends on the desired number of rare units and the prevalence of rare units
within the population. If the prevalence is small, the sample size for screening will be
large. Suppose that it is desirable to obtain a sample containing 10 rare units from a
population of size 10,000 with 1,000 rare units (i.e., 10 percent of the units are rare,
$P = 0.1$). Since about $m/P$ units must be screened on average to find $m$ rare units
when the prevalence is $P$, the sample size for screening needs to be approximately
100 units. On the other hand, if the population contains only 100 rare units
($P = 0.01$), the sample size for screening needs to be approximately 1,000 units.
Therefore, if the prevalence decreases from 0.1 to 0.01, the sample size for screening
increases from 100 to 1,000 units. Such a large sample size increases both the time
and the cost of the survey.
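The screening arithmetic above can be checked with a short sketch. It is an illustration only and not part of the original study: the function name expected_screening_size is hypothetical, and the sketch relies on two standard results, namely that the number of draws needed to reach m rare units has mean mN/M under sampling with replacement (negative binomial) and mean m(N+1)/(M+1) under sampling without replacement (negative hypergeometric).

```python
from fractions import Fraction


def expected_screening_size(m, N, M, with_replacement=True):
    """Expected number of units screened until m rare units are found.

    m: desired number of rare units, N: population size,
    M: number of rare units in the population (assumed known here
    for illustration; in practice M is unknown).
    """
    if not (1 <= m <= M <= N):
        raise ValueError("need 1 <= m <= M <= N")
    if with_replacement:
        # Negative binomial: each draw is rare with probability P = M / N.
        return Fraction(m * N, M)
    # Negative hypergeometric: draws without replacement.
    return Fraction(m * (N + 1), M + 1)


if __name__ == "__main__":
    for M in (1000, 100):
        wr = expected_screening_size(10, 10_000, M, with_replacement=True)
        wor = expected_screening_size(10, 10_000, M, with_replacement=False)
        print(f"M={M}: with replacement {float(wr):.1f}, "
              f"without replacement {float(wor):.1f}")
```

Exact fractions are used so that the two cases in the text (100 and 1,000 units) come out without rounding error; the without-replacement values are slightly smaller.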
Usually, there are three main parameters associated with these populations:
1) The prevalence of a characteristic which is used to define units of interest is
denoted by
$$P = \frac{M}{N}, \qquad (1.1)$$
where $M$ is the number of units in the population possessing the characteristic of
interest and $N$ is the size of the population.
2) The mean of a study variable $y$ in a class of interest. This parameter is
denoted by
$$\bar{Y}_C = \frac{\sum_{i \in C} y_i}{M}, \qquad (1.2)$$
where $C$ is the set of all units of interest in the population.
3) The mean of a study variable $y$ in the whole population. This parameter is
represented by
$$\bar{Y} = \frac{\sum_{i=1}^{N} y_i}{N}. \qquad (1.3)$$
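As a small illustration of the three parameters, the sketch below computes the prevalence, the mean per rare unit and the population mean for a toy population. The function name, the variable names and the data are hypothetical and serve only to make the definitions concrete.

```python
def population_parameters(y, is_rare):
    """Return (P, Y_bar_C, Y_bar) for a fully observed toy population.

    y[i] is the study variable of unit i and is_rare[i] is True when
    unit i belongs to the class of interest C.
    """
    N = len(y)
    M = sum(is_rare)
    if N == 0 or M == 0:
        raise ValueError("population and class of interest must be non-empty")
    P = M / N                                                # prevalence
    Y_bar_C = sum(v for v, r in zip(y, is_rare) if r) / M    # mean in class C
    Y_bar = sum(y) / N                                       # population mean
    return P, Y_bar_C, Y_bar


if __name__ == "__main__":
    y = [0, 0, 4, 0, 6, 0, 0, 0, 0, 0]
    is_rare = [v > 0 for v in y]
    print(population_parameters(y, is_rare))
```

In this toy population the study variable is zero outside the class of interest, in which case the population mean equals the prevalence times the mean per rare unit.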
In a sample survey of such a population, it might be of interest to examine
several variables. A suitable sampling design to detect the appropriate number of units
of interest in a sample which will give a small variance of a corresponding estimator
is difficult to specify. It is a challenging area of research for statisticians to specify the
optimum number of units of interest in a sample.
Many methods of sampling for rare populations have been reviewed by
Sudman and Kalton (1986: 401-429). Some of these are two-phase, cluster, multi-
frame, snowball, stratified, inverse and network sampling, and the advantages and
disadvantages of these sampling methods are discussed in the paper. In addition,
examples of the application of these sampling methods are also given. Kalton (1993:
53-74) considered the applications of these methods in order to sample groups at risk
of contracting AIDS. The methods have been oriented mostly towards surveys
involving human populations (Sudman, 1985; Sudman et al. 1988; Koegle et al.
1996). In natural resource surveys in which the rare units are spatially clustered,
Christman (2000: 168-201) reviewed some sampling methods based on selecting
quadrants in order to estimate the abundance of rare units; one of these is
adaptive cluster sampling. Kalton (2001: 9-14) considered sampling methods for rare
and mobile populations, such as international travelers, car passengers, visitors to
museums and shoppers at a shopping mall.
1.2 Statement of the Problem
In two-phase sampling, the classification of sample units from the first-phase
into two strata yields high efficiencies of the estimates of parameters when one
stratum contains a large number of rare units. The cost ratio for identifying the sample
unit in the first-phase per rare unit should be small. In practice, two-stage cluster
sampling is convenient to use but the estimates of parameters of interest have high
variances.
A two-stage method suggested by Waksberg (1987: 40-46) may be applicable
in telephone surveys but usually creates problems in face-to-face interview surveys. In
multi-frame sampling, the cost of constructing the sampling frames is
sometimes too high and the estimate of a parameter of interest is usually not
satisfactorily precise. In network sampling, a suitable multiplicity rule is difficult to
define and non-sampling errors are usually affected by linked units. In snowball
sampling, the sample is a non-probability sample, so it might be useful only for
constructing a sampling frame of rare units. Christman (2000: 194) pointed out that
stratified random sampling is an appropriate design for the study of a rare event when
a disproportional allocation of sample size is adopted. However, the disproportional
allocation might increase the variances of estimators and some samples from strata
with a low prevalence of rare units can give samples without a rare unit present.
Inverse sampling has the advantage that each sample will contain the desired number
of rare units, but the sample size is not fixed. When the prevalence of rare units is
small, the inverse sampling scheme might give a large sample. In adaptive cluster
sampling, the probability of detecting a rare unit depends on both the definition of the
neighborhood and the condition for linking units to the neighborhood. Sometimes, an
appropriate definition of the neighborhood cannot be given for the linkage of units in
adaptive sampling.
Hence, it is of interest to develop an efficient sampling scheme so that the
number of sampled units in a class of interest is controlled and the parameters are
estimated efficiently. The objectives of the study are as follows:
1) To develop sampling designs identified as stratified inverse sampling.
2) To give unbiased estimators of the parameters $\bar{Y}$, $M$ and $P$, and a ratio
estimator of $\bar{Y}_C$, for these sampling designs.
3) To derive the variances or the mean squared errors (MSE) of the estimators
under consideration.
4) To find appropriate estimators of the variances or the MSE of the
estimators obtained in 3).
1.3 Scope of the Study
This study develops suitable sampling designs for a population which can be
divided into two classes; the number of sampled units in a class to be detected and
the efficiencies of the estimators are of interest. It is assumed that the population
$U = \{u_1, u_2, \ldots, u_N\}$ contains a finite number of $N$ distinct and identifiable units.
Furthermore, it is also assumed that the population consists of $M$ units of interest and
$N - M$ remaining units, where $N$ is known but $M$ is unknown. A sampling unit
cannot be specified as a unit of interest until the value of its characteristic has been
observed. The population is divided into $L$ strata, where $L \geq 2$. A stratum $h$ contains
$N_h$ sampling units of which $M_h$ units are of interest. In addition, it is assumed that
$N_h$ is known but $M_h$ is unknown. For stratified inverse sampling, a fixed number of
sampled units in the class of interest, $m_h$, from stratum $h$ is prespecified, and
$1 < m_h \leq M_h$, for $h = 1, 2, \ldots, L$. In this study, the following points are concentrated
upon:
1) A sample of units is drawn one by one from stratum $h$ until the sample
contains $m_h$ units of interest, where $h = 1, 2, \ldots, L$. Different sampling schemes from
the strata are considered.
2) Unbiased estimators of $\bar{Y}$, $M$ and $P$, and a ratio estimator of $\bar{Y}_C$, in each
sampling design are developed.
3) For the unbiased estimators, their variances are obtained, and for the ratio
estimator of $\bar{Y}_C$, the MSE or its approximation is given.
4) Unbiased estimators of the variances are given and an estimator of the
MSE is obtained.
5) Numerical examples for calculation of the estimates and their variance or
MSE estimates are given. The sampling designs are compared using a numerical
study. Stratified inverse sampling with replacement is compared to stratified inverse
sampling without replacement. Furthermore, the variances and MSEs of the estimates
in stratified inverse PPS sampling are compared to those in stratified inverse random sampling.
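The drawing rule in point 1) can be sketched as a small simulation. The code below is a minimal sketch of equal-probability inverse sampling within strata only; it does not implement the PPS variants or any of the estimators developed in the dissertation. The function names and the representation of a stratum as a list of booleans (True for a unit of interest) are assumptions made for illustration.

```python
import random


def inverse_sample(stratum, m, rng, with_replacement=False):
    """Draw units one by one until m units of interest are observed.

    stratum: list of booleans, True when the unit is a unit of interest.
    Returns the indices of the drawn units in the order drawn.
    """
    M_h = sum(stratum)
    if not (1 < m <= M_h):
        raise ValueError("need 1 < m <= number of units of interest")
    drawn, found = [], 0
    if with_replacement:
        while found < m:
            i = rng.randrange(len(stratum))   # any unit may be drawn again
            drawn.append(i)
            found += stratum[i]
    else:
        order = list(range(len(stratum)))
        rng.shuffle(order)                    # random order without repeats
        for i in order:
            drawn.append(i)
            found += stratum[i]
            if found == m:
                break
    return drawn


def stratified_inverse_sample(strata, m_by_stratum, rng, with_replacement=False):
    """Apply inverse sampling independently in every stratum."""
    return [inverse_sample(s, m, rng, with_replacement)
            for s, m in zip(strata, m_by_stratum)]


if __name__ == "__main__":
    rng = random.Random(2010)
    strata = [[True] * 3 + [False] * 7, [True] * 2 + [False] * 18]
    for drawn in stratified_inverse_sample(strata, [2, 2], rng):
        print(len(drawn), drawn)
```

The sample size in each stratum is random: the draw always stops on a unit of interest, and the stratum with the lower prevalence tends to require more draws.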
CHAPTER 2
LITERATURE REVIEW
2.1 Introduction
In order to obtain a sufficient number of sampled units in a class of interest,
a large number of units from the population must be screened. If the prevalence of a
rare characteristic in the population is small, the sample size for screening is large.
Consequently, screening is usually costly in both time and money.
Researchers usually want to minimize the time and money on a survey
together with the specification of the degree of precision in the results. When
conducting a household survey, telephone interviewing can be used to reduce the cost
of screening. Although rare units can be identified during telephone conversations,
telephone screening can increase the non-response rate and the inaccuracy of answers.
Researchers must pay particular attention to the screening questions so as to minimize
the risk of non-response or wrong answers from the units in a sample. Furthermore,
when telephone screening is applied, households without a telephone cannot be
included in a sample, so estimates based on telephone screening are biased. When face-to-face
interviews are used for screening, greater field costs are incurred because of travel
between the units. In this case, cluster sampling should be considered in order to
capture more rare units within large areas, such as complete city blocks or entire
villages.
Many researchers are interested in sampling methods which can quickly eliminate
clusters without rare units. Sudman (1972: 335-339) suggested an
optimum sampling design for use with very rare human populations using a Bayesian
optimum sampling procedure to discard some of the clusters without rare units.
However, some conventional sampling designs can be applied to the sampling of a
rare population, for instance Sudman (1978: 300-304) examined a sampling method
combining telephone screening and face-to-face interviewing.
Kish (1980: 209-222) described sampling designs and estimation methods for
parameters in domains, and considered stratified sampling, cluster sampling and
network sampling. The parameters of interest consisted of the population mean, the
total and the difference between the means from some of the domains, and the
estimators of the parameters, their variances and variance estimates were given.
Kalton and Anderson (1986) reviewed various sampling designs for rare
populations based on available sampling frames. One of these was stratified random
sampling in which a disproportional allocation of sample size to strata was proposed.
The advantages and disadvantages of each design were discussed, and
examples of the application of these sampling designs were given.
Sudman and Kalton (1986: 401-429) described alternative methods for the careful
sampling of rare populations. Kalton (1993: 53-74) considered sampling methods to
select members of the human population with HIV. Christman (2000: 168-201)
reviewed various sampling designs of rare geographic clusters of a population in order
to estimate the prevalence of rare units. In addition, the sampling designs were
compared using a simulation study. The results showed that stratified sampling was
more efficient than the other sampling designs: its estimate had the
smallest variance and its sampling distribution was most similar to a normal
distribution. Magnani et al. (2005: 67-72) reviewed alternative sampling designs for a
hidden population with HIV. The methods discussed included snowball, facility,
targeted, time-location, respondent-driven and cluster sampling. Each of the sampling
designs suggested by various statisticians for the sampling of rare populations is
subsequently described.
2.2 Review of Sampling Designs for Rare Populations
Many sampling schemes can be applied to rare populations, such as two-stage
cluster, time-location, multiple frame, two-phase, snowball, network, adaptive cluster,
stratified, inverse and probability proportional to size sampling. These sampling
schemes are reviewed in this subsection. However, if a sampling frame for only rare
units exists then obviously it should be used, which makes these sampling methods
unnecessary.
2.2.1 Two-Stage Cluster Sampling
When a survey is conducted by face-to-face screening, high costs for finding
rare units are incurred. To reduce this cost, units should be grouped into clusters and
two-stage sampling will do well in these situations. In samples from household
surveys, households are the sampling units and they may be grouped into cities,
districts, etc. and regarded as primary units. During the first stage, a sample of
primary units is selected followed by a subsample of secondary units within each of
the selected primary sampling units during the second stage. When households are
grouped into cities, the cities may be selected by stratified random sampling. In the
second stage, households may be subsampled from each city already drawn in the first
stage by simple random sampling. However, two-stage cluster sampling leads to loss
of precision in the estimate of a parameter when compared with an unclustered design
for the same sample size (Kish, 1965: 161-164).
Two-stage cluster sampling can reduce the cost of obtaining sampling units by
observing a large number of subsample units. When rare units are geographically
clustered, it is more efficient to sample the primary units where the rare units are more
concentrated. A method for improving the efficiency of two-stage inverse sampling,
as described by Waksberg (1987: 40-46), can be adopted in this situation. This
procedure involves first selecting primary units with a probability proportional to a
measure of size and then drawing a secondary unit from each sampled primary unit. If
the secondary unit is a rare unit, then the primary unit is accepted into the sample and
further secondary units are drawn until a fixed number ($m$) of rare units is attained. If the
secondary unit drawn in the second stage is not a rare unit, then the primary unit is
rejected. The process is repeated until the required number of primary units which
contain rare units is attained. However, some disadvantages of the scheme are that
primary units are often rejected and some clusters might contain fewer than the
desired $m$ rare units. A modified procedure addresses this issue for the case where
the secondary unit consists of more than one rare unit and the clusters are large.
Sudman (1985: 20-29) has given a formula for calculating a good choice of the
number of sampled rare units in a cluster.
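The accept/reject procedure described above can be sketched as follows. This is a simplified illustration under stated assumptions, not a faithful implementation of Waksberg's method: primary units are drawn with replacement with probability proportional to a supplied size measure, secondary units are screened in random order without replacement, and the function name waksberg_sample and the data layout are hypothetical.

```python
import random


def waksberg_sample(clusters, sizes, k, m, rng):
    """Accept primary units whose first screened secondary unit is rare.

    clusters: list of non-empty lists of booleans (True = rare unit).
    sizes: measure of size for each primary unit.
    k: number of primary units to accept; m: rare units wanted per unit.
    """
    if not any(any(c) for c in clusters):
        raise ValueError("no primary unit contains a rare unit")
    accepted = []
    while len(accepted) < k:
        # First stage: PPS draw of one primary unit (with replacement).
        c = rng.choices(range(len(clusters)), weights=sizes, k=1)[0]
        units = list(range(len(clusters[c])))
        rng.shuffle(units)
        if not clusters[c][units[0]]:
            continue                      # first secondary unit not rare: reject
        rare_units, screened = [], 0
        for u in units:                   # keep screening until m rare units
            screened += 1
            if clusters[c][u]:
                rare_units.append(u)
                if len(rare_units) == m:
                    break
        accepted.append({"cluster": c, "rare_units": rare_units,
                         "screened": screened})
    return accepted


if __name__ == "__main__":
    rng = random.Random(7)
    clusters = [[True, True, False, False], [False] * 5,
                [True, False, True, True, False, False]]
    for record in waksberg_sample(clusters, [4, 5, 6], k=3, m=2, rng=rng):
        print(record)
```

A cluster holding fewer than m rare units is exhausted before the target is met, which reproduces the disadvantage noted in the text.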
2.2.2 Time-Location Sampling
Time-location sampling is a specific method used to sample individuals who
visit certain locations, such as libraries, museums, shopping centers, bars, bookstores
and polling places. Sampling is usually conducted either as the visitors enter or leave
a location. These samples are generally obtained when convenient, with the recruiting
being conducted at a time when the numbers of visits to the locations are high.
Location sampling can readily produce a probability sample of visits with known
selection probabilities and hence visits are easily analyzed. Visits may be the
appropriate units of analysis for, say, a survey about satisfaction with visits to a
museum. However, for many surveys using location sampling, the visitor is the
appropriate unit of analysis.
In a clinical survey for AIDS testing, for example, the sampling units are those who
visit four clinics in a given 13-week period, and combinations of clinic and time are
treated as primary units. Kalton (2001: 9-14) reviewed various sampling methods for
rare and mobile populations, one of which is location sampling. The two parameters of
interest were the number of rare units (or the prevalence of a rare characteristic) and
the mean of a study variable per rare unit. Karon (2005: 3180-3186) proposed a
weighted analysis when time-location sampling was used to collect information.
Time-location sampling is an application of two-stage sampling and an important
feature of this procedure is that the sample units have an equal probability of being
selected.
2.2.3 Two-Phase Sampling
In some situations, accurate identification of rare units is difficult or
expensive, such as in a case where an expensive medical test is needed for a firm
diagnosis of an illness. Suppose that there is an inexpensive method for identifying
rare units but this method is imperfect, then it is cost-effective to apply two-phase
sampling in this situation.
In the first phase of sampling, a large number of units are drawn by simple
random sampling and the inexpensive method is used to divide the sample units into
two strata. The first stratum contains units with a high likelihood of being rare units
(the positive group) and the second stratum contains units with a smaller likelihood of
being rare units (the negative group). During the second phase, disproportional
stratified sampling is used with a high sampling fraction from the positive group. An
accurate method for identifying rare units, which is difficult and expensive, is used to
identify rare units in the subsample. The accurate measure of a characteristic and the
values of a study variable are collected in the second phase. Information from both
phases is used to make inferences about the parameters of interest. In the first phase,
the imperfect method is used to identify units and both positive and negative groups
may contain non-rare units.
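The two phases described above can be sketched in code. The estimator used here is the standard double-sampling-for-stratification form, in which each group's second-phase proportion of rare units is weighted by that group's share of the first-phase sample; it is an assumption made for illustration and is not claimed to be the estimator of any author cited in this review. The function name and arguments are hypothetical.

```python
import random


def two_phase_prevalence(screen_positive, is_rare, n1, f_pos, f_neg, rng):
    """Estimate prevalence from a two-phase sample.

    screen_positive[i]: result of the inexpensive, imperfect method.
    is_rare[i]: true status, used only for second-phase units.
    n1: first-phase simple random sample size.
    f_pos, f_neg: second-phase sampling fractions in the two groups.
    """
    first = rng.sample(range(len(is_rare)), n1)
    groups = {
        True: [i for i in first if screen_positive[i]],       # positive group
        False: [i for i in first if not screen_positive[i]],  # negative group
    }
    estimate = 0.0
    for positive, frac in ((True, f_pos), (False, f_neg)):
        group = groups[positive]
        if not group:
            continue
        # At least one unit is subsampled from every non-empty group.
        n2 = max(1, round(frac * len(group)))
        sub = rng.sample(group, n2)
        p_h = sum(is_rare[i] for i in sub) / n2   # accurate method, phase two
        estimate += (len(group) / n1) * p_h       # weight by first-phase share
    return estimate


if __name__ == "__main__":
    rng = random.Random(11)
    screen = [True] * 50 + [False] * 450
    rare = [i < 40 or i in (100, 200) for i in range(500)]
    print(two_phase_prevalence(screen, rare, n1=200, f_pos=0.8, f_neg=0.1, rng=rng))
```

Setting f_pos much larger than f_neg corresponds to the disproportional allocation described in the text; setting both fractions to 1 reduces the estimate to the first-phase sample proportion.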
Tenenbein (1970: 1350-1361) derived the maximum likelihood estimator of
the prevalence of a characteristic under a double sampling scheme. Optimum values
of the initial sample size and subsample sizes were found so that the variance of the
estimator was minimized under a fixed cost. However, two-phase sampling is useful
only when the following two conditions are both satisfied. The first condition is that
the inexpensive method classifies the units in the initial sample well enough that the
first stratum contains a larger prevalence of rare units than the second stratum. The
second condition is that the ratio of the cost of identifying a rare unit in the first
phase to that in the second phase is small. Deming (1977: 33) suggested that the cost
ratio should be lower than 1:6. If the cheap method in the first phase assigns no rare
units to the negative group, then only units in the positive group need to be drawn
in the second phase.
Mak and Li (1988: 105-111) considered the estimation of the mean of each
subgroup when the sample comes from a double sampling scheme. Shrout and
Newman (1989: 549-555) presented an optimal two-phase design for estimating the
prevalence of a rare characteristic. The relative efficiency of two-phase sampling
compared to simple random sampling was derived and a condition under which two-
phase sampling was more efficient than simple random sampling was found. Hughes-
Oliver and Rosenberger (2000: 315-327) developed a two-phase design to select units
with multiple rare characteristics. Udofia (2002: 82-89) applied two-phase sampling
to the selection of sampling units with a probability proportional to a measure of size.
McNamee (2003: 1072-1078) derived a mathematical formula for the optimal cost
ratio when choosing between two-phase or simple random sampling.
In some cases, only sampling units in the positive group are selected to
identify rare units and to observe study values. Biases can occur in these cases since
the second stratum may contain rare units. This bias can be adjusted using the
misclassification error of the population units. Alonzo and Pepe (2003: 313-326) suggested an
estimator which avoids bias. McNamee (2004: 783-792) proposed two new
allocations of sample size to the subsamples when two-phase sampling is applied.
Morvan et al. (2007: 261-269) considered methods for assessing the accuracy of an
inexpensive method to identify rare units in the first phase.
2.2.4 Snowball Sampling
According to the original idea of Goodman (1961: 572-579), snowball
sampling refers to methods applied when an initial sample is asked to identify a fixed
number of acquaintances, who in turn are asked to identify a fixed number of
acquaintances, and so on until either a fixed number of waves is reached or no
further cost can be incurred. For instance, when taking a snowball sample of homeless
persons, a small number of homeless persons is selected first. The homeless persons in the
sample are asked to identify other homeless persons, and then each of these persons is
asked to identify homeless persons and so on until the desired number of homeless
persons is reached. One advantage of snowball sampling is in constructing a frame of
rare units. When the sampling frame has been compiled, a probability sample can be
drawn from the frame. However, some rare units may be omitted from the frame, leading
to biased survey estimates. A disadvantage of using snowball sampling
is that the rare units must know other rare units and this condition may not hold for all
rare populations. Snowball samples are not probability samples, so the usual weighting
methods cannot be applied to get an unbiased estimator of a parameter of interest. In
order to increase the efficiency of the estimator, it may be assumed that the population
has a probability distribution.
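The wave-by-wave recruitment described above can be sketched as a traversal of an acquaintance graph. This is an illustration under simplifying assumptions: each respondent names at most k acquaintances chosen at random from those they know, and recruitment stops after a fixed number of waves or when nobody new is named. The function name snowball_sample and the dictionary representation of acquaintances are hypothetical.

```python
import random


def snowball_sample(acquaintances, seeds, k, waves, rng):
    """Return sampled persons in recruitment order.

    acquaintances: dict mapping each person to a list of known persons.
    seeds: initial sample; k: number of acquaintances each person names;
    waves: maximum number of recruitment waves.
    """
    sample = list(dict.fromkeys(seeds))   # drop duplicate seeds, keep order
    seen = set(sample)
    current = list(sample)
    for _ in range(waves):
        next_wave = []
        for person in current:
            known = acquaintances.get(person, [])
            for other in rng.sample(known, min(k, len(known))):
                if other not in seen:     # each person is recruited once
                    seen.add(other)
                    next_wave.append(other)
        if not next_wave:
            break                         # nobody new was named
        sample.extend(next_wave)
        current = next_wave
    return sample


if __name__ == "__main__":
    graph = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a"],
             "d": ["b", "e"], "e": ["d"], "f": []}
    print(snowball_sample(graph, ["a"], k=2, waves=3, rng=random.Random(5)))
```

In the example, person "f" knows no other rare unit and can never be reached from the seed, which illustrates why a frame built this way may omit some rare units.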
2.2.5 Multiple Frame Sampling
Although a complete sampling frame that contains all of the units of interest
may not be available, sometimes incomplete frames exist which cover every unit of
interest. In this situation, the frames may be combined in two ways. In the first
method, duplicate units are deleted in order to construct a complete frame. In the
second method, independent samples are selected from the incomplete frames and the
sample estimates combined to infer the parameters of interest. For the case of
independent samples, the original idea of multiple frames was proposed by Hartley
(1962: 203-206). He considered two frames, called a dual frame. Some rare units may
be listed in more than one frame. The units are then divided into the three subsets, as
shown in Figure 2.1.
Figure 2.1 Three Disjoint Subsets of Two Frames
Let $A$ and $B$ be two given frames. The partitioning of $A \cup B$ into 3
subsets gives $A - B$, $B - A$ and $A \cap B$, with $N_{A-B}$, $N_{B-A}$ and $N_{A \cap B}$ units,
respectively. Let $p$ be the probability that a sampling unit lies in frame $A$ and $q$ be
the probability that a sampling unit lies in frame $B$ under the condition $p + q = 1$. An
estimate of the population total is
$$\hat{Y} = N_{A-B}\,\bar{y}_{A-B} + N_{B-A}\,\bar{y}_{B-A} + N_{A \cap B}\,(p\,\bar{y}'_{A \cap B} + q\,\bar{y}''_{A \cap B}),$$
where $\bar{y}_{A-B}$ is the sample mean of a study variable from sample units in frame $A$
only, $\bar{y}_{B-A}$ is the sample mean of a study variable from units in frame $B$ only,
$\bar{y}'_{A \cap B}$ is the sample mean of a study variable from units sampled from frame $A$
that also lie in frame $B$, and $\bar{y}''_{A \cap B}$ is the sample mean from units sampled
from frame $B$ that also lie in frame $A$. If $N_{A-B}$, $N_{B-A}$ and
$N_{A \cap B}$ are unknown, a new weighted estimate can be applied.
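The dual frame estimate of the population total can be evaluated directly once the three domain sizes and the four sample means are available. The sketch below is an illustration only; the function name, argument names and numbers are hypothetical, and the domain sizes are assumed known.

```python
def dual_frame_total(n_a_only, n_b_only, n_ab,
                     ybar_a_only, ybar_b_only,
                     ybar_ab_from_a, ybar_ab_from_b, p):
    """Dual frame estimate of a population total.

    n_a_only, n_b_only, n_ab: sizes of the domains A - B, B - A and
    the overlap of A and B.
    ybar_ab_from_a, ybar_ab_from_b: overlap-domain sample means from the
    samples drawn from frame A and frame B respectively.
    p: weight on the frame A overlap mean; q = 1 - p goes to frame B.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    q = 1.0 - p
    return (n_a_only * ybar_a_only
            + n_b_only * ybar_b_only
            + n_ab * (p * ybar_ab_from_a + q * ybar_ab_from_b))


if __name__ == "__main__":
    print(dual_frame_total(100, 50, 20, 2.0, 4.0, 3.0, 5.0, p=0.5))
```

With p = 0.5 the two overlap means are simply averaged; other choices of p shift the weight toward the frame whose overlap sample is considered more reliable.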
Casady and Sirken (1980: 601-605) gave a dual frame based on household and
telephone frames. An allocation of sample size into frames was considered so that the
variance of an estimator was minimized under a fixed cost when units were drawn by
simple random sampling. Casady et al. (1981: 444-447) considered a dual frame based
on household and telephone frames when the units were selected by cluster sampling
from both frames. Lepkowski and Groves (1984: 265-270) examined an optimal
allocation of sample size into frames when stratified multistage sampling was used
to select samples from both household and telephone frames. A model was also
developed for survey errors to include both the response error and bias from dual
frames. Iachan and Dennies (1993: 747-764) described the use of multiple frames to
sample homeless persons in Washington DC. Four frames were considered, namely
homeless shelters, soup kitchens, encampments (such as vacant buildings and locations
under bridges) and streets.
Lohr (1999: 401) examined a dual frame to sample people with Alzheimer’s
disease where the first sampling frame consisted of the entire population that covered
every person with Alzheimer’s and the second sampling frame was adult-care centers
with a high prevalence of people with the disease. In this situation, units are divided
into two subsets of $A \cup B$, as shown in Figure 2.2. Let $A$ and $B$ be the two
given frames. The partition of $A \cup B$ into 2 disjoint subsets gives $A - B$ and $A \cap B$. In
this case frame $A$ represents the sampling frame from the entire population and frame
$B$ refers to the sampling frame of adult-care centers.
Figure 2.2 Frame B belonging to Frame A
This is a special case of the partitioning of two frames shown in Figure 2.1. An
estimate of the population total can be similarly derived.
Lohr and Rao (2000: 271-280) discussed various estimators of the population
total from a dual frame sample and a jackknife variance estimator of the estimators
was found. Asymptotic variances of these estimates were subsequently compared.
Srinath et al. (2004: 4424-4429) considered a dual frame where the first frame was a
complete household frame and the other only the households of interest. This dual
frame was adopted during the National Immunization Survey where the households of
interest consisted of households with children aged between 19 and 35 months. Palit
(2006: 3508-3513) proposed an alternative estimator of the population total when
multiple frames have been used to select a sample. Mecatti (2007: 151-157) proposed
a new estimator of the population total of a study variable when a sample was made
up of an aggregation of samples from multiple frames. The variance of the new
estimator was derived and an unbiased estimator of the variance was also given.
2.2.6 Network Sampling
In conventional sampling, a sampling frame provides a single listing for each
sampling unit. For sampling rare populations, the problem of duplicate listings within
frames may arise. In an example of a medical survey, the observation units are the
patients with certain diseases. A random sample of medical centers is drawn and,
from the records of each medical center in the sample, records of patients treated for
those diseases are obtained. However, some of the patients may have been treated at
more than one medical center. Network sampling has been developed to solve this
problem and refers to a sampling scheme where sampling units are drawn by
conventional sampling; not only are the sampled units selected but also other units
which are linked to them and the sampled and linked units are then combined. A
network is defined as a set of units which obey a given linkage rule. Network
sampling can improve the efficiency of the estimate of a parameter whenever a rare
unit in the population can report other rare units and this information may be used to
increase the number of rare units in a sample. It can be applied to increase the number
of rare units where the number of population units in the initial sample is small.
Birnbaum and Sirken (1965: 1-8) proposed three unbiased estimators of the
number of units with rare characteristics based on stratified network sampling. Each
estimator depends on the use of information from the observed units in the sample.
Sirken (1970: 257-566) considered network sampling for household surveys. In order
to compare the variances of the estimators, three statistical models were adopted. The
results showed that the estimator from the network sample gave a smaller sampling
variance than the estimators from a conventional design for the same sample size.
Sirken (1972: 224-227) compared the variances of two estimators from stratified
multiplicity sampling to an estimator from conventional sampling. One estimator
from the multiplicity sample permits a sampling unit to be linked to units from only
one stratum, whereas the second estimator permits a sampling unit to be linked to any
unit. The results showed that none of the three estimators always gave a smaller
sampling variance than the others. Levy (1977: 758-763) considered an optimum
allocation of sample size to strata under stratified network sampling. The cost
efficiency of network sampling was compared to that of conventional sampling. The
results showed that network sampling can be more cost efficient than conventional
sampling under certain conditions. Snowden (1983: 102-105) compared the biases
and mean squared errors of four multiplicity rule estimators for estimating the
prevalence of cancer. Czaja et al. (1987: 411-419) found that network sampling could
improve the efficiency of the estimator of a rare characteristic in a population. The
mean squared errors of two multiplicity rule estimators were compared to the mean
squared error of a conventional estimator. Czaja (1988: 38-43) applied network
sampling to a local crime victimization survey where six counting rules were
considered. Sirken (1998: 1-6) reviewed a short history in the application of network
sampling. Sirken (2006: 3664-3668) also described network sampling methodology
and some of its advantages and disadvantages.
2.2.7 Adaptive Cluster Sampling
Adaptive sampling means a sampling design in which the scheme for selecting
units depends on the values of a study variable in the sample during the survey
(Thompson: 1990: 1050-1059). The probability of getting a sample depends on the
values of a study variable in the sample. The rationale behind adaptive sampling is to
obtain a more precise estimator of the prevalence of a characteristic of interest or else
to increase the number of sample units with a certain characteristic. Adaptive cluster
sampling is a sampling design where an initial set of units is drawn using a
predetermined probability sampling scheme. Whenever the value of the study variable
for a sampled unit satisfies a given condition, the units
in the neighborhood of this unit are added to the sample, and so on, until no further units
which satisfy the condition are left to add to the sample. The neighborhood refers to a
set of units which is identified by a set of rules. If the i-th unit is in the neighborhood of
the j-th unit, then unit j is also in the neighborhood of unit i. This relationship is referred to
as symmetric. In social surveys, a neighborhood may be defined by the social
relationship of the units. The collection of all of the units observed under the design
as a result of selecting the i-th initial unit in the sample is called a cluster, which may also
consist of the union of various neighborhoods. A network refers to a collection of
units which satisfy the property that if a unit in a network is selected, then every unit
in the network will be sampled. A unit which does not satisfy the condition but is in
the neighborhood of one that does is called an edge unit. Thus, the units can be partitioned into
mutually exclusive networks. However, an inappropriate neighborhood
definition may lead to the selection of a large sample that contains no units of interest.
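The expansion rule described above can be illustrated with a short sketch. It is only an illustration under stated assumptions: the rectangular grid, the four-cell neighborhood, the condition y > 0 and the function name `adaptive_cluster_sample` are hypothetical choices made for the example and are not taken from the cited papers.

```python
from collections import deque

def adaptive_cluster_sample(grid, initial_units, condition):
    """Return the set of (row, col) units observed by adaptive expansion.

    Assumptions (hypothetical, for illustration only): units are cells of a
    rectangular grid and the neighborhood of a cell is its four adjacent
    cells, so the neighborhood relation is symmetric.
    """
    rows, cols = len(grid), len(grid[0])
    observed = set()
    queue = deque(initial_units)
    while queue:
        r, c = queue.popleft()
        if (r, c) in observed:
            continue
        observed.add((r, c))
        # Only units that satisfy the condition trigger further sampling.
        if condition(grid[r][c]):
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    queue.append((rr, cc))
    return observed

# Hypothetical study values on a 5 x 5 grid.
grid = [
    [0, 0, 0, 0, 0],
    [0, 3, 2, 0, 0],
    [0, 0, 4, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1],
]
observed = adaptive_cluster_sample(grid, [(1, 1)], lambda y: y > 0)
network = {u for u in observed if grid[u[0]][u[1]] > 0}  # units meeting the condition
edge_units = observed - network                          # observed, condition not met
print(len(observed), sorted(network), len(edge_units))
```

Starting from an initial unit that does not satisfy the condition, the function returns that unit alone, which corresponds to a network of size one.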
Salehi and Smith (2005: 84-103) proposed two-stage adaptive sampling. In the
first stage, a sample of primary sampling units was drawn by one conventional
sampling scheme followed by the selection of initial secondary units from the primary
sampled units. In the second stage, if the initial secondary units satisfy a certain
condition, the secondary units in the primary unit are drawn and added to the sample.
Thompson (2006: 1224-1223) proposed what is called adaptive web sampling.
In the first stage, an initial sample is selected using a specified sampling design, and
either it or its subset is used to construct an active set. The active set depends on the
observed values of a study variable in the sample. In the k-th stage, an additional
sample is selected which depends on the active set or the sampling frame with
probabilities $p$ and $1 - p$, respectively. The procedure for selection is performed until
a fixed number of stages k or a fixed sample size has been obtained. Four estimators
of the population mean were given, two of which are unbiased and the remainder
biased. The variances and mean squared errors of the estimators were derived, and
estimates of the variances were also given. However, these estimators were obtained
by taking the expected value of an estimator conditioned on a minimum sufficient
statistic. One disadvantage of this technique is that if the sample size is large, the
estimators are cumbersome to compute.
2.2.8 Inverse Sampling
A problem of a fixed sample size sampling design for rare populations is that a
sample may not contain any rare units. A method for solving this problem is the use
of an inverse sampling scheme. Haldane (1945: 222-225) considered inverse simple
random sampling when the parameter to be estimated is the prevalence of a rare
characteristic. The units are drawn one by one with replacement until the sample
contains m rare units. An unbiased estimator of the prevalence of a rare unit was
found but an unbiased estimator of its variance was not given. Finney (1949: 223-234)
gave an unbiased estimator of the variance of Haldane’s estimator. Sampford (1962:
27-40) considered inverse sampling with a probability proportional to size using
clusters. The primary units or clusters are selected one by one with replacement and
with a probability proportional to size until the sample contains $m + 1$ distinct
clusters. The total sample size is $n + 1$ clusters after the final cluster draw. In order to
find an unbiased estimator of the population mean, he suggested that the last cluster
be rejected.
Pathak (1964: 158-192) showed that the method of inverse sampling with a
probability proportional to size in cluster sampling of Sampford (1962) is equivalent
to sampling with a probability proportional to size without replacement. In addition,
he also pointed out that an unbiased estimator of the population mean exists which is
uniformly better than the estimator given by Sampford (1962). Mikulski and Smith
(1974: 216-217) found the variance bounds of Haldane’s unbiased estimator. Pathak
(1976: 1012-1017) considered inverse sampling for a fixed cost survey. The sampling
units are drawn sequentially until the fixed total cost of the survey is reached,
assuming that the cost for observing a population unit is unknown in advance until the
information from this unit is completely observed. This method is preferable because it
eliminates the randomness of the total cost of a sample survey. Sathe (1977: 425-426)
improved the upper bound of the variance of Haldane’s unbiased estimator, as
did Prasad and Sahai (1982: 286).
Lan (1999), and Christman and Lan (2001: 1096-1105), developed what is
called inverse adaptive cluster sampling. An unbiased estimator of the population total
was given but an unbiased estimator of its variance was
not. Salehi and Seber (2001: 281-286) showed that Murthy’s estimator can be applied
to inverse sampling. An unbiased estimator of the population total and an unbiased
estimator of its variance were given when inverse adaptive sampling was adopted.
Salehi and Seber (2002: 63-74) considered cases where networks were selected
without replacement until the sample contains m networks with rare units in adaptive
cluster samples. This sampling scheme is called restricted adaptive cluster sampling.
An unbiased estimator of the population total and an unbiased estimator of its
variance were derived. Salehi and Seber (2004: 483-493) found an unbiased estimator
of the population total and an unbiased estimator of its variance when samples come
from general inverse adaptive cluster sampling. Greco and Naddeo (2007: 1039-1048)
considered inverse sampling when the population units were drawn with a probability
proportional to size and with replacement. An unbiased estimator of the population
total and an unbiased estimator of its variance were found. Espejo et al. (2008: 133-
137) considered inverse sampling without replacement.
2.2.9 Stratified Random Sampling
Stratified random sampling may be used to increase the number of rare units
in a sample. In a simple case, assume that there are only two strata such that one
stratum has a higher prevalence of rare units than the other. Since the first stratum
consists of more rare units, a large sample size is selected from this stratum. In
stratified random sampling of a rare population, there are two problems: the
allocation of the sample size and the appropriate value of the ratio of the prevalences in the
two strata, $r = P_1 / P_2$.
Kalton and Anderson (1986: 73) suggested a disproportional allocation of
sample size to strata where the stratum with the largest number of rare units should be
oversampled. Srinath (1996: 226-231) proposed a new allocation of sample size in an
attempt to keep the sample size in a stratum close to the desirable sample size. This
allocation showed that the sample size should be close to the initial sample size and
the number of rare units in a sample should be close to a desirable number. However,
the efficiency of the new allocation depends on the initial allocation. Srinath (1999:
351-354) considered a varying proportional allocation which minimizes the increase
of the variance and tries to minimize the cost of the survey. Boyel and Kalsbeek
(2005: 2738-2793) considered an optimum value of $r = P_1 / P_2$ when the population
was divided into two subpopulations. LeFauve et al. (2006) compared the sample size
from a disproportional allocation to proportional allocations under the same variance
of an estimator using a real life population. Kalton and Anderson (1986: 73) pointed
out that disproportionate stratified sampling is preferable when two conditions are
satisfied: the first is that the prevalence of rare units in the first stratum is higher than
in the other stratum, and the second is that the proportion of all rare units in the
population that fall in the first stratum must be large ($P_1^{*} = M_1 / M$, where $M_1$
refers to the number of rare units in the first stratum).
2.2.10 Probability Proportional to Size Sampling
Unequal probabilities are commonly used in agricultural surveys when the
selection probabilities are evaluated at the moment when the units are drawn. For
instance, previous census data of agricultural production are important indicators
when studying production during the current year. In this case, the study variable is
the total quantity of production by farmers in the current year while the selection
probabilities may be proportional to the quantity of production in previous years. In
order to estimate production of a particular commodity, knowledge of the acreage
used for an agricultural commodity, obtained via analysis of satellite pictures, might
be useful for evaluating the probabilities of unit selection. A quick eye-estimate by an
expert might be useful in suggesting a suitable sampling methodology for estimating
the volume of timber in a forest. Moreover, if it is believed that the values of a study
variable are proportional to the value of an auxiliary variable, then the selection
probabilities are usually chosen proportional to the value of the auxiliary variable in
order to reduce the variance of the estimate.
Probability proportional to size (PPS) sampling is a sampling technique used
in surveys in which the probability of selecting a sampling unit is proportional to its
size at each draw. The use of unequal probabilities in sampling was first proposed by
Hansen and Hurwitz (1943: 333-362). Before that, there had been substantial
developments in sampling theory and practice, but all these had been based on the
assumption that the probabilities of the selection of units within each stratum should
be equal. Hansen and Hurwitz demonstrated that the use of unequal selection
probabilities within a stratum would give a more efficient estimator of the population
total. In many situations, it may be preferable to draw a sample with unequal
probabilities and without replacement and there is a vast collection of literature
examining this under fixed sample size sampling schemes.
Madow (1949: 333-345) considered the use of systematic sampling with
unequal probabilities so as to avoid the possibility of units being selected more than
once. This suggestion was followed by a large number of alternative selection
procedures. Horvitz and Thompson (1952: 663-685) produced a general theory of
sampling with unequal probabilities without replacement based on the use of an
unbiased estimator of the population total. Midzuno (1952: 99-107) suggested an
unequal probability sampling scheme under a fixed sample size. The first unit of the
sample is selected with an initial probability $z_i$ and the remaining units are drawn
with equal probabilities and without replacement.
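The selection rule of this scheme can be sketched as follows. The sketch is illustrative: the function names and the example probabilities are hypothetical, and the probability of an unordered sample is obtained simply by summing over which member of the sample was drawn first.

```python
import random
from itertools import combinations
from math import comb

def midzuno_sample(z, n, rng):
    """Draw n distinct unit indices: the first with probabilities z,
    the remaining n - 1 with equal probabilities without replacement."""
    first = rng.choices(range(len(z)), weights=z, k=1)[0]
    others = [i for i in range(len(z)) if i != first]
    return [first] + rng.sample(others, n - 1)

def midzuno_sample_probability(z, sample):
    """Probability of an unordered sample under the scheme above.

    Given the first unit, the other n - 1 units form a simple random
    sample from the remaining N - 1 units, so each member of the sample
    contributes z_i / C(N - 1, n - 1).
    """
    N, n = len(z), len(sample)
    return sum(z[i] for i in sample) / comb(N - 1, n - 1)

z = [0.1, 0.2, 0.3, 0.4]          # hypothetical initial probabilities
rng = random.Random(0)
s = midzuno_sample(z, 2, rng)
total_probability = sum(
    midzuno_sample_probability(z, c) for c in combinations(range(len(z)), 2)
)
print(s, total_probability)
```

The sample probabilities summing to one over all possible samples is a quick check that the rule defines a proper sampling design.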
2.3 Estimation of Parameters in Inverse Sampling
2.3.1 Estimation under Inverse Simple Random Sampling
Haldane (1945: 222) gave an unbiased estimator of the prevalence of rare
units, $P$, as
$$\hat{P} = \frac{m-1}{n-1}, \tag{2.1}$$
where $m$ is the number of rare units in the sample and $n$ refers to the sample size
drawn by inverse simple random sampling with replacement.
However, the expression of the variance of $\hat{P}$ is complicated and an unbiased
estimator of the variance was not proposed. Finney (1949: 223-234) showed that an
unbiased estimator of the variance of Haldane’s estimator is
$$\hat{V}(\hat{P}) = \frac{\hat{P}(1-\hat{P})}{n-2}. \tag{2.2}$$
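As an illustration of the estimator of the prevalence and its variance estimator, the following sketch simulates inverse simple random sampling with replacement from a hypothetical population. The population, the seed, the number of repetitions and the function names are assumptions made for the example; the restriction to m of at least 3 is imposed here only so that n - 2 is positive.

```python
import random

def draw_until_m_rare(is_rare, m, rng):
    """Draw units with replacement until m rare units are observed.
    Returns the total number of draws n."""
    n = 0
    rare_seen = 0
    while rare_seen < m:
        n += 1
        if is_rare[rng.randrange(len(is_rare))]:
            rare_seen += 1
    return n

def haldane_estimate(m, n):
    """P_hat = (m - 1)/(n - 1) and V_hat = P_hat (1 - P_hat)/(n - 2)."""
    if m < 3:
        raise ValueError("this sketch assumes m >= 3 so that n - 2 > 0")
    p_hat = (m - 1) / (n - 1)
    v_hat = p_hat * (1 - p_hat) / (n - 2)
    return p_hat, v_hat

rng = random.Random(2010)
population = [True] * 50 + [False] * 950   # hypothetical population, P = 0.05
m = 5
p_hats = [
    haldane_estimate(m, draw_until_m_rare(population, m, rng))[0]
    for _ in range(5000)
]
mean_p_hat = sum(p_hats) / len(p_hats)
print(mean_p_hat)  # should be close to 0.05 if the estimator is unbiased
```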
Mikulski and Smith (1976: 216-217) gave the upper and lower bounds of the variance
of Haldane’s unbiased estimator as
≤ ≤ − +
, (2.3)
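where $Q = 1 - P$. When a sample is drawn by inverse random sampling with
replacement, Lan (1999) proved that an unbiased estimator of the population mean of
a study variable $y$ in the whole population, $\bar{Y}$, is
$$\bar{y} = \hat{P}\,\bar{y}_C + (1-\hat{P})\,\bar{y}_{\bar{C}}, \tag{2.4}$$
with the variance as
$$V(\bar{y}) = (\bar{Y}_C - \bar{Y}_{\bar{C}})^2\, V(\hat{P}) + E(\hat{P}^2)\,\frac{\sigma_C^2}{m} + E[\hat{P}(1-\hat{P})]\,\frac{\sigma_{\bar{C}}^2}{m-1}, \tag{2.5}$$
where $\bar{y}_C = \frac{1}{m}\sum_{i \in s_C} y_i$ and $\bar{y}_{\bar{C}} = \frac{1}{n-m}\sum_{i \in s_{\bar{C}}} y_i$ are the sample means of the rare and
non-rare units, $\bar{Y}_C$ and $\bar{Y}_{\bar{C}}$ are the corresponding population means,
$\sigma_C^2 = \frac{1}{M}\sum_{i \in C}(y_i - \bar{Y}_C)^2$ is the population variance of rare units and
$\sigma_{\bar{C}}^2 = \frac{1}{N-M}\sum_{i \in \bar{C}}(y_i - \bar{Y}_{\bar{C}})^2$ is the population variance of non-rare units.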
In addition, $C$ refers to the set of sampling units with rare characteristics in the
population, $\bar{C}$ is the set of sampling units that do not possess rare characteristics in the
population, $s_C$ is the set of sampled units with rare characteristics and $s_{\bar{C}}$ is the set of
sampled units that do not possess rare characteristics. An unbiased
estimator of $V(\bar{y})$ was also given as
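$$\hat{V}(\bar{y}) = (\bar{y}_C - \bar{y}_{\bar{C}})^2\, \hat{V}(\hat{P}) + \hat{P}^{*2}\,\frac{s_C^2}{m} + \left[\hat{P} - \frac{m-1}{m-2}\,\hat{P}^{*2}\right]\frac{s_{\bar{C}}^2}{m-1}, \tag{2.6}$$
where $\hat{P}^{*2} = \frac{(m-1)(m-2)}{(n-1)(n-2)}$, $s_C^2 = \frac{1}{m-1}\sum_{i \in s_C}(y_i - \bar{y}_C)^2$, the sample variance of a rare
unit, and $s_{\bar{C}}^2 = \frac{1}{n-m-1}\sum_{i \in s_{\bar{C}}}(y_i - \bar{y}_{\bar{C}})^2$, the sample variance of a non-rare unit.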
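A small sketch may help to show how the estimator of the population mean and its variance estimator are computed from one inverse sample. It assumes that (2.6) takes the form $(\bar{y}_C - \bar{y}_{\bar{C}})^2 \hat{V}(\hat{P}) + \hat{P}^{*2} s_C^2/m + [\hat{P} - \frac{m-1}{m-2}\hat{P}^{*2}]\, s_{\bar{C}}^2/(m-1)$ with $\hat{P}^{*2} = (m-1)(m-2)/((n-1)(n-2))$; the function name and the data are hypothetical.

```python
def lan_estimate(y_rare, y_other):
    """Mean estimate and variance estimate from one inverse sample drawn
    with replacement.

    y_rare  : study values of the m sampled rare units
    y_other : study values of the n - m sampled non-rare units
    The variance formula is an assumed reading of (2.6).
    """
    m = len(y_rare)
    n = m + len(y_other)
    if m < 3 or n - m < 2:
        raise ValueError("this sketch assumes m >= 3 and n - m >= 2")
    p_hat = (m - 1) / (n - 1)
    p_star2 = (m - 1) * (m - 2) / ((n - 1) * (n - 2))   # unbiased for P^2
    v_p = p_hat * (1 - p_hat) / (n - 2)                 # Finney's estimator
    mean_c = sum(y_rare) / m
    mean_o = sum(y_other) / (n - m)
    s2_c = sum((y - mean_c) ** 2 for y in y_rare) / (m - 1)
    s2_o = sum((y - mean_o) ** 2 for y in y_other) / (n - m - 1)
    y_bar = p_hat * mean_c + (1 - p_hat) * mean_o
    v_hat = (
        (mean_c - mean_o) ** 2 * v_p
        + p_star2 * s2_c / m
        + (p_hat - (m - 1) / (m - 2) * p_star2) * s2_o / (m - 1)
    )
    return y_bar, v_hat

# Hypothetical sample: m = 3 rare units among n = 10 draws.
y_bar, v_hat = lan_estimate([10, 12, 14], [1, 2, 3, 2, 1, 3, 2])
print(y_bar, v_hat)
```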
For inverse random sampling without replacement, units are drawn one by one
with equal probabilities and without replacement. The sampling continues until the
sample contains m rare units, which is assumed to be on the n-th draw. In this case,
Salehi and Seber (2001: 284) gave an unbiased estimator of the prevalence of a
characteristic, $P$, as
$$\hat{P} = \frac{m-1}{n-1}, \tag{2.7}$$
which is the same as (2.1). However, the variance of the estimator and the estimate of the
variance are different. An unbiased estimator of the variance of $\hat{P}$ was also given as
$$\hat{V}(\hat{P}) = \left(1 - \frac{n-1}{N}\right)\frac{\hat{P}(1-\hat{P})}{n-2}. \tag{2.8}$$
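The without-replacement case can be sketched in the same way. The sketch below assumes the variance estimator has the form $(1 - \frac{n-1}{N})\,\hat{P}(1-\hat{P})/(n-2)$; the population, the seed and the function names are hypothetical.

```python
import random

def inverse_srswor(is_rare, m, rng):
    """Draw units one by one without replacement until m rare units are
    observed. Returns the number of draws n."""
    order = rng.sample(range(len(is_rare)), len(is_rare))  # random permutation
    rare_seen = 0
    for n, unit in enumerate(order, start=1):
        if is_rare[unit]:
            rare_seen += 1
            if rare_seen == m:
                return n
    raise ValueError("the population holds fewer than m rare units")

def prevalence_estimate_wor(m, n, N):
    """P_hat as in (2.7), with a finite population factor in the variance estimate."""
    if m < 3:
        raise ValueError("this sketch assumes m >= 3 so that n - 2 > 0")
    p_hat = (m - 1) / (n - 1)
    v_hat = (1 - (n - 1) / N) * p_hat * (1 - p_hat) / (n - 2)
    return p_hat, v_hat

rng = random.Random(7)
population = [True] * 20 + [False] * 180   # hypothetical population, P = 0.1
m = 4
p_hats = [
    prevalence_estimate_wor(m, inverse_srswor(population, m, rng), len(population))[0]
    for _ in range(3000)
]
mean_p_hat = sum(p_hats) / len(p_hats)
print(mean_p_hat)  # should be close to 0.1
```

For the same m and n, the variance estimate is smaller than the with-replacement value by the factor $1 - (n-1)/N$.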
In addition, Salehi and Seber (2001: 284) also gave an unbiased estimator of the
population mean of a study variable, $\bar{Y}$, as
$$\bar{y} = \hat{P}\,\bar{y}_C + (1-\hat{P})\,\bar{y}_{\bar{C}}, \tag{2.9}$$
which has the same form as expression (2.4). The variance of $\bar{y}$ was also given as
( ) ( ) ( ) ( ) 2
2 2C C C
= − + −
m 1 N M − + − − − −
. (2.10)
They also proved that an unbiased estimator of the variance of y is
( ) ( ) ( ) ( ) 22
2 * *C C C C
− = − + − + − − −
− + − − −
2.3.2 Estimation under Inverse PPS Sampling
In inverse PPS sampling with replacement, the sampling units are drawn with
unequal probabilities $z_i$ and with replacement until the sample contains $m$ rare
units. Recently, Greco and Naddeo (2007: 1041-1042) gave an unbiased estimator of
the total of a study variable in a whole population,
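$$\hat{Y} = \hat{P}\,\bar{y}'_C + (1-\hat{P})\,\bar{y}'_{\bar{C}}, \tag{2.12}$$
where $\hat{P} = \frac{m-1}{n-1}$, $\bar{y}'_C = \frac{1}{m}\sum_{i \in s_C} \frac{y_i}{z_i}$ and $\bar{y}'_{\bar{C}} = \frac{1}{n-m}\sum_{i \in s_{\bar{C}}} \frac{y_i}{z_i}$. The variance of the estimator
$\hat{Y}$ for the parameter $Y$ was also derived as
$$V(\hat{Y}) = (Y'_C - Y'_{\bar{C}})^2\, V(\hat{P}) + E(\hat{P}^2)\,\frac{\sigma'^2_C}{m} + E[\hat{P}(1-\hat{P})]\,\frac{\sigma'^2_{\bar{C}}}{m-1}, \tag{2.13}$$
where $Y'_C = \sum_{i \in C} y_i \big/ \sum_{i \in C} z_i$, $Y'_{\bar{C}} = \sum_{i \in \bar{C}} y_i \big/ \sum_{i \in \bar{C}} z_i$, and $\sigma'^2_C$ and $\sigma'^2_{\bar{C}}$ are the
corresponding variances of $y_i / z_i$ for a single draw from $C$ and $\bar{C}$. An unbiased estimator of the variance of
the estimator $\hat{Y}$ was given as
$$\hat{V}(\hat{Y}) = (\bar{y}'_C - \bar{y}'_{\bar{C}})^2\, \hat{V}(\hat{P}) + \hat{P}^{*2}\,\frac{s'^2_C}{m} + \left[\hat{P} - \frac{m-1}{m-2}\,\hat{P}^{*2}\right]\frac{s'^2_{\bar{C}}}{m-1}, \tag{2.14}$$
where $\hat{P}^{*2} = \frac{(m-1)(m-2)}{(n-1)(n-2)}$,
$s'^2_C = \frac{1}{m-1}\sum_{i \in s_C}\left(\frac{y_i}{z_i} - \bar{y}'_C\right)^2$,
$s'^2_{\bar{C}} = \frac{1}{n-m-1}\sum_{i \in s_{\bar{C}}}\left(\frac{y_i}{z_i} - \bar{y}'_{\bar{C}}\right)^2$ and
$\hat{V}(\hat{P}) = \frac{\hat{P}(1-\hat{P})}{n-2}$.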
In inverse PPS sampling without replacement, the sampling units are drawn
with unequal probabilities $z_i$ of selection among the remaining units and without
replacement until the sample contains $m$ units of interest. Usually, the Horvitz-
Thompson estimator is used to estimate the population total under PPS sampling
without replacement. The estimator depends on the first-order inclusion probabilities.
Unfortunately, under an inverse sampling scheme these probabilities depend on an unknown
parameter, $M$, the number of units of interest in the population, so the Horvitz-
Thompson estimator cannot easily be applied in this sampling scheme. The problem
of sampling with unequal probabilities without replacement has received considerable
attention and Murthy’s estimator has also been used to derive an unbiased estimator
of the population total. Raj (1956: 269-284) gave an ordered unbiased estimator of the
population total. Murthy (1957: 379-390) used an unordered process to obtain an
unbiased estimator of the population total under a fixed sample size sampling design.
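The probability of obtaining a sample $s$ is denoted by $P(s)$. The conditional
probability of getting sample $s$, given that the i-th unit is selected first, is given by
$P(s \mid i)$. The notation $P(s \mid i, j)$ refers to the probability of getting sample $s$ given that
the i-th and j-th units are selected in any order in the first two draws.
Fractions $P(s \mid i) / P(s)$ and $P(s \mid i, j) / P(s)$ are used to determine Murthy’s
estimator and an unbiased estimator of its variance. Murthy’s estimator of the
population total is
$$\hat{Y}_M = \frac{1}{P(s)}\sum_{i=1}^{n} P(s \mid i)\, y_i, \tag{2.15}$$
where $n$ is the sample size. This estimator does not depend on the order of selection.
Its variance is
$$V(\hat{Y}_M) = \sum_{i=1}^{N}\sum_{j<i}\left[1 - \sum_{s \ni i,j}\frac{P(s \mid i)\,P(s \mid j)}{P(s)}\right]\left(\frac{y_i}{z_i} - \frac{y_j}{z_j}\right)^2 z_i z_j, \tag{2.16}$$
and an unbiased estimator of the variance is
$$\hat{V}(\hat{Y}_M) = \sum_{i=1}^{n}\sum_{j<i}\left[\frac{P(s \mid i, j)}{P(s)} - \frac{P(s \mid i)\,P(s \mid j)}{P(s)^2}\right]\left(\frac{y_i}{z_i} - \frac{y_j}{z_j}\right)^2 z_i z_j. \tag{2.17}$$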
Salehi and Seber (2001) gave a direct proof that Murthy’s estimator can be
applied in any sequential sampling design including inverse sampling and some
adaptive sampling methods.
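To make the roles of $P(s)$ and $P(s \mid i)$ concrete, the following sketch evaluates Murthy's estimator of the total for a simple fixed-size design and checks its unbiasedness by enumerating all samples. The design used (two successive PPS draws without replacement), the data and the function names are assumptions chosen for illustration; this is not the inverse design studied in this dissertation.

```python
from fractions import Fraction
from itertools import combinations

def p_given_first(z, i, j):
    """P(s | i) for s = {i, j}: after unit i is drawn first, unit j is drawn
    from the remaining units with probability z_j / (1 - z_i)."""
    return z[j] / (1 - z[i])

def p_sample(z, i, j):
    """P(s) for the unordered sample {i, j}, summing over both draw orders."""
    return z[i] * p_given_first(z, i, j) + z[j] * p_given_first(z, j, i)

def murthy_total(y, z, i, j):
    """Murthy's estimator: sum over sampled units of y_i P(s | i) / P(s)."""
    weighted = y[i] * p_given_first(z, i, j) + y[j] * p_given_first(z, j, i)
    return weighted / p_sample(z, i, j)

z = [Fraction(k, 10) for k in (1, 2, 3, 4)]   # hypothetical initial probabilities
y = [3, 5, 10, 11]                            # hypothetical study values
pairs = list(combinations(range(len(y)), 2))
total_probability = sum(p_sample(z, i, j) for i, j in pairs)
expected_value = sum(p_sample(z, i, j) * murthy_total(y, z, i, j) for i, j in pairs)
print(total_probability, expected_value, sum(y))
```

Exact fractions are used so that the expected value can be compared with the population total without rounding error.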
METHODOLOGY
In this chapter, the notation used is first set out. After this, some of the
definitions of probability sampling and properties of estimators are given. Thirdly, the
development of sampling schemes and estimators are described. The sampling
designs considered are stratified inverse random sampling and stratified inverse PPS
sampling, both with and without replacement. Finally, a comparison method of the
properties of the estimators under the proposed sampling designs is presented.
3.1 Notation
Let $U = \{u_1, u_2, \ldots, u_N\}$ denote a finite population of $N$ distinct and
identifiable units. The elements $u_1, u_2, \ldots, u_N$ of $U$ are called sampling units. That the
units are identifiable means that there exists a one-to-one correspondence
between the units and the integers $1, 2, \ldots, N$. Let $y$ be a study variable. The values of
the study variable in the population are denoted by $y_1, y_2, \ldots, y_N$. The population is
partitioned into two subpopulations, $C$ and $\bar{C}$, with cardinality $M$ and $N - M$,
respectively. It is assumed that $C$ and $\bar{C}$ are unknown before sampling, for example,
$C = \{u_i : y_i \ge b\}$ and $\bar{C} = \{u_i : y_i < b\}$, where $b$ is a given constant. The parameters to
be estimated include the number of units in $C$, the prevalence of units of interest, the
mean of a study variable of units in class $C$ and the mean of a study variable of the
whole population. The population is stratified into $L$ strata. Stratum $h$ consists
of $N_h$ sampling units and an unknown number $M_h \ge 1$ of units of interest, for $h = 1, 2, \ldots, L$, so
that $N = \sum_{h=1}^{L} N_h$ and $M = \sum_{h=1}^{L} M_h$. Stratum $h$ can be partitioned into $C_h$ and $\bar{C}_h$
with cardinality $M_h$ and $N_h - M_h$, respectively, as shown in figure 3.1.
Figure 3.1 The Study Population
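The following notation is used for stratified inverse sampling. For stratum $h$:
$W_h = N_h / N$ : The stratum weight
$Y_{C_h} = \sum_{j \in C_h} y_{hj}$ : The total of a study variable of units in class $C_h$
$Y_{\bar{C}_h} = \sum_{j \in \bar{C}_h} y_{hj}$ : The total of a study variable of units in class $\bar{C}_h$
$\bar{Y}_{C_h} = \frac{1}{M_h}\sum_{j \in C_h} y_{hj}$ : The mean of a study variable in $C_h$
$\bar{Y}_{\bar{C}_h} = \frac{1}{N_h - M_h}\sum_{j \in \bar{C}_h} y_{hj}$ : The mean of a study variable in $\bar{C}_h$
$m_h$ : The number of units in a sample from class $C_h$
$n_h - m_h$ : The number of units in a sample from class $\bar{C}_h$
$s_h = (i_1, i_2, \ldots, i_{n_h})$ : The ordered sample, in order of selection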
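The prevalence of units of interest is written as $P = \sum_{h=1}^{L} M_h / N$. The mean of a study
variable in the whole population is $\bar{Y} = \frac{1}{N}\sum_{h=1}^{L}\sum_{j=1}^{N_h} y_{hj}$, and the mean of a study
variable of units in $C$ is denoted by $\bar{Y}_C = \sum_{h=1}^{L}\sum_{j \in C_h} y_{hj} \big/ \sum_{h=1}^{L} M_h$.
Note that the same notation is used in chapters 4 and 5, but with
different meanings.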
3.2 Definitions
Let $\mathcal{S}$ denote the collection of all possible samples from a given finite
population $U$. In order to make inferences about parameters in the population, the
sample must come from probability sampling, i.e. it must be obtained by a known
sampling design.
Definition 3.1 A sampling design¹, based on a population $U$, is a pair $(\mathcal{S}, P)$ where
$P$ is a probability distribution on $\mathcal{S}$ such that:
i) $P(s) > 0$ for all $s \in \mathcal{S}$
ii) For every unit in the population, there exists at least one sample $s \in \mathcal{S}$
containing the unit
¹ The definition is due to Hedayat and Sinha (1991: 3)
Definition 3.2 Choosing a subset of the population according to a probability
sampling design is called a probability sampling.
Definition 3.3 Let $P$ be a probability distribution defined on $\mathcal{S}$. An estimator $T$ is
unbiased for the parameter $\theta$ with respect to a sampling design if
$$E(T) = \sum_{s \in \mathcal{S}} T(s)\,P(s) = \theta.$$
Definition 3.4 The mean squared error of the estimator $T$ for the parameter $\theta$ with
respect to a sampling design is defined as
$$MSE(T) = E\big(T(s) - \theta\big)^2 = \sum_{s \in \mathcal{S}} \big(T(s) - \theta\big)^2 P(s).$$
If $T$ is an unbiased estimator of $\theta$ under the sampling design, then the mean squared
error of $T$ is equal to its variance.
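Definitions 3.3 and 3.4 can be checked numerically by enumerating a small sampling design. The sketch below uses a hypothetical population of four units and simple random sampling of two units without replacement, with the sample mean as the estimator; these choices are assumptions made for illustration only.

```python
from fractions import Fraction
from itertools import combinations

y = [1, 2, 3, 6]                       # hypothetical study values
theta = Fraction(sum(y), len(y))       # parameter: the population mean

# The design: every pair of distinct units is a sample, each equally likely.
samples = list(combinations(range(len(y)), 2))
prob = Fraction(1, len(samples))

def T(s):
    """Estimator: the sample mean."""
    return Fraction(sum(y[i] for i in s), len(s))

expectation = sum(T(s) * prob for s in samples)
mse = sum((T(s) - theta) ** 2 * prob for s in samples)
variance = sum((T(s) - expectation) ** 2 * prob for s in samples)
print(expectation, mse, variance)
```

Because the expectation equals the parameter, the mean squared error and the variance coincide, in line with the remark following Definition 3.4.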
Definition 3.5 Two samples, $s_1$ and $s_2$, are said to be equivalent if they both contain
the same sampling units. For example, $s_1 = \{u_1, u_1, u_2, u_3\}$ and $s_2 = \{u_1, u_2, u_3\}$ are
equivalent as they both lead to the inclusion of the first three sampling units.
Definition 3.6 A partition of $\mathcal{S}$ into subsets of equivalent samples is called a
sufficient partition. Thus, $\mathcal{S}_T = \{s_T\}$ is a sufficient partition if each $s_T$ of $\mathcal{S}_T$ contains
only equivalent samples. $s_T$ is called an element of $\mathcal{S}_T$. It is desirable to express a
sufficient partition together with its probability measure as $\mathcal{S}_T = \{s_T, P(s_T)\}$, where
$$P(s_T) = \sum_{s \in s_T} P(s).$$
Definition 3.7 A statistic $T(s)$ is said to be sufficient if the partition, $\mathcal{S}_T$, induced by
$T$ is sufficient.
Definition 3.8 Let $f(s)$ be a real-valued function defined on $\mathcal{S}$, then the conditional
expectation of $f(s)$, given a sufficient partition $\mathcal{S}_T = \{s_T, P(s_T)\}$, is given by
$$E[f(s) \mid T] = \sum\nolimits_1 f(s)\,P(s) \Big/ \sum\nolimits_1 P(s),$$
where the summation, $\sum_1$, is taken over the samples $s \in s_T$. Note that $E[f(s) \mid T]$
is defined on $\mathcal{S}_T$.
This dissertation is concerned with stratified inverse sampling using four
different schemes, as shown in figure 3.2. For each sampling scheme, the method of
derivation of the estimators of the parameters of interest is described.
Figure 3.2 Diagram of the four Study Sampling Schemes
[Diagram: Stratified Inverse Sampling, branching into the four study schemes]
In each sampling design, the parameters of interest are $P$, $\bar{Y}$ and $\bar{Y}_C$. When
units are drawn with probability proportional to size at each draw and without replacement,
a PPS sample without replacement is obtained. Since the selection probabilities
change from draw to draw, suitable estimators taking this aspect into account must be
devised. In order to estimate the total of the whole population, the Horvitz-Thompson
estimator can be used when the inclusion probabilities of units are available.
Unfortunately, inclusion probabilities from inverse sampling depend on an unknown
parameter $M$, the number of units in class $C$, which means it is not easy to use the
Horvitz-Thompson estimator. However, Salehi and Seber (2001: 281-286) proved that
Murthy’s estimator can be applied to inverse sampling.
There are many sampling schemes developed for selecting units with PPS
sampling. The case of a sample size equal to 2 is of particular interest because
$\pi_i = 2 z_i$. Not all sampling schemes are appropriate for development of inverse PPS
sampling without replacement. Of interest is the sampling scheme reported by
Midzuno (1952: 99-107) because it is easy to select units. With this sampling scheme,
Murthy’s estimators can be applied to get unbiased estimators of the parameters of
interest in each stratum.
3.3.1 Stratified Inverse Random Sampling
When a population is divided into $L$ non-overlapping strata, stratum $h$
contains $N_h$ units and $M_h$ units of interest, where $N_h$ is assumed known but $M_h$ is
unknown, for $h = 1, 2, \ldots, L$. The number of units in the sample from stratum $h$ falling
into $C_h$ is given by $m_h$, assumed to be fixed in advance, with $m = \sum_{h=1}^{L} m_h$.
Inverse random sampling with replacement is applied in each stratum
independently. Let $n_h$ be the number of units in stratum $h$ needed to obtain $m_h$ units
of interest. To find unbiased estimators of $M$ and $P$, the estimator of $P_h$
given by Haldane (1945: 222) in (2.1) is applied. The results are combined using
weights so that the estimators of these parameters are unbiased. In order to find an
unbiased estimator of $\bar{Y}$, the estimator of $\bar{Y}_h$ in (2.4) as given by Lan (1999) is used
in each stratum. The results are combined using a sampling weight to obtain an
unbiased estimator of the mean of a study variable of the whole population. To find
a ratio estimator of $\bar{Y}_C$, the parameter is written as the ratio of two unknown
parameters, $Y_C = \sum_{h=1}^{L}\sum_{j \in C_h} y_{hj}$ and $M$, and is simply estimated using the sample ratio of
unbiased estimators of $Y_C$ and $M$, respectively. Since sampling is carried out
independently in the strata and the variances of Haldane’s estimator and of Lan’s
estimator are known from inverse simple random sampling with replacement, the
properties of the variance of independent variables can be applied to derive the
variances of the estimators of $M$, $P$ and $\bar{Y}$, respectively. Unbiased estimators of the
variances are obtained by substituting the unbiased estimators of variance given by
expressions (2.2) and (2.6). The ratio is a biased estimator of the parameter $\bar{Y}_C$. The
bias and the variance of the ratio estimator are obtained by using a
linearization method.
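The way stratum results are combined can be sketched for the prevalence and the number of units of interest. The sketch assumes the natural weights implied by $M = \sum_h N_h P_h$, namely $\hat{M} = \sum_h N_h \hat{P}_h$ and $\hat{P} = \hat{M} / N$, with variance estimates added across independent strata; the function name and the stratum figures are hypothetical, and the estimators actually derived in this dissertation are given in the later chapters.

```python
def stratified_prevalence(strata):
    """Combine per-stratum inverse samples drawn with replacement.

    strata : list of (N_h, m_h, n_h) tuples, where n_h is the number of
             draws needed to observe m_h units of interest in stratum h.
    Assumed weights: N_h for M_hat and N_h / N for P_hat.
    """
    N = sum(N_h for N_h, _, _ in strata)
    M_hat = 0.0
    v_M = 0.0
    for N_h, m_h, n_h in strata:
        p_h = (m_h - 1) / (n_h - 1)            # Haldane's estimator, (2.1)
        v_h = p_h * (1 - p_h) / (n_h - 2)      # variance estimator, (2.2)
        M_hat += N_h * p_h
        v_M += N_h ** 2 * v_h                  # strata are sampled independently
    return {"M_hat": M_hat, "P_hat": M_hat / N, "V_M": v_M, "V_P": v_M / N ** 2}

# Hypothetical results for L = 2 strata.
result = stratified_prevalence([(100, 5, 41), (300, 4, 16)])
print(result)
```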
Inverse random sampling without replacement is also applied to draw the units
in a stratum, with the selection between strata being independent. Let $n_h$ be
the final sample size in stratum $h$, where $h = 1, 2, \ldots, L$. To obtain unbiased
estimators of $M$ and $P$, the estimator of $P_h$ in (2.7) as given by Salehi and Seber
(2001: 284) is applied in each stratum. The results are then combined using weights
so that the estimators are unbiased for the parameters $M$ and $P$. To find
an unbiased estimator of $\bar{Y}$, the estimator of $\bar{Y}_h$ as given in (2.9) is applied in every
stratum and sampling weights for the estimators are used. A ratio of two unbiased
estimators of the parameters $Y_C$ and $M$ is considered for estimating the mean of a
study variable of units in a class of interest. Since the samples from
distinct strata are independent and the variances of the estimators of $P_h$ and $\bar{Y}_h$ from inverse simple
random sampling without replacement are known, the properties of the variance of
independent variables are applied in order to derive the variances of the estimators
of $M$, $P$ and $\bar{Y}$. Unbiased estimators of the variances are obtained by substituting
the unbiased estimators of the variance in each stratum given by expressions (2.8) and
(2.11). Since the ratio is a biased estimator of the parameter $\bar{Y}_C$, both its
bias and variance are obtained by using a linearization method.
Note that the estimators under inverse random sampling with and without
replacement have similar expressions but the variances and estimates of the variances
are distinct.
3.3.2 Stratified Inverse PPS Sampling
In stratified inverse PPS sampling with replacement, the sampling units in stratum $h$
are drawn with unequal probabilities and with
replacement until the sample contains $m_h$ units of interest, where the selections
between strata are independent, for $h = 1, 2, \ldots, L$. To obtain an unbiased
estimator of $\bar{Y}$, the estimator of $Y_h$ in (2.12) as given by Greco and Naddeo (2007:
1041-1042) is used in each stratum. The results are combined using weights
in order to obtain an unbiased estimator of the mean of a study variable of
the whole population. It can be seen that the number of units in class $C$, $M$, is the
sum of a study variable that takes the values 0 and 1. To make inferences about $M$
and $P$, we use
$$y'_{hj} = \begin{cases} 1, & \text{if unit } j \text{ of stratum } h \text{ belongs to } C_h, \\ 0, & \text{otherwise.} \end{cases}$$
By replacing the $y_{hj}$ in (2.9), an estimate of $M_h$ is obtained. The estimates of $M_h$ are
combined by using weights to obtain unbiased estimators of $M$ and $P$. A ratio of two
unbiased estimators of the parameters $Y_C$ and $M$ is considered for estimating $\bar{Y}_C$.
Since sampling is independent between distinct strata and the variance of the estimator of $Y_h$
from inverse PPS sampling with replacement is known, the properties of the variance
of independent variables can be applied to derive the variance of the
estimator of $\bar{Y}$. Unbiased estimators of the variances are obtained by substituting the
unbiased estimators of the variance given by expression (2.14). In order to obtain
the corresponding unbiased estimates for the parameters $M$ and $P$, we define
$$y'_{hj} = \begin{cases} 1, & \text{if unit } j \text{ of stratum } h \text{ belongs to } C_h, \\ 0, & \text{otherwise,} \end{cases}$$
and substitute it for the $y_{hj}$. For the ratio estimator, both bias and variance are derived
by using a linearization method.
In stratified inverse PPS sampling without replacement, Midzuno's scheme is
extended to inverse PPS sampling. Under the Midzuno scheme, the first unit of the
sample is selected with an initial probability $z_{hj}$ and the remaining units are
drawn with equal probabilities and without replacement. This sampling scheme is
applied as stratified inverse PPS sampling without replacement. Under this method,
Murthy's estimator in (2.15) can be applied to find unbiased estimators of the
population total of a study variable in a stratum, and of the number and the total
study value of units in class $C_h$. These estimators are combined with weights to
obtain unbiased estimators of $M$, $P$ and $\bar{Y}$. A ratio of two unbiased
estimators of the parameters $Y_C$ and $M$ is given for estimating $\bar{Y}_C$. In
stratum $h$, unbiased variance estimators for the estimators of $M$, $P$ and
$\bar{Y}$ are derived using Murthy's variance estimator in (2.17). By the
properties of the variance of independent variables, unbiased estimators of the
variances are obtained by substituting the unbiased estimator of the variance in
each stratum. Both the bias and the variance of the ratio estimator are derived by
using a linearization method.
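The selection step described above can be sketched in a few lines of Python for a single stratum. This is only an illustration under stated assumptions: the function name `inverse_midzuno_sample`, the list `z` of initial probabilities and the Boolean list `in_class` are hypothetical, and the stopping rule (stop as soon as $m_h$ units of interest have been drawn) is assumed from the general definition of inverse sampling rather than taken from a formal statement of the scheme.

```python
import random


def inverse_midzuno_sample(z, in_class, m, rng):
    """Return the indices drawn in one stratum, in draw order.

    Assumed stopping rule: sampling ends when m units of interest
    (in_class[j] is True) are in the sample.
    """
    if m < 1 or sum(in_class) < m:
        raise ValueError("stratum must contain at least m units of interest")
    n_units = len(z)
    # First draw: unit j is selected with initial probability z[j].
    first = rng.choices(range(n_units), weights=z, k=1)[0]
    # Remaining draws: equal probabilities without replacement, which is
    # equivalent to reading off a random permutation of the other units.
    rest = [j for j in range(n_units) if j != first]
    rng.shuffle(rest)
    sample, hits = [], 0
    for j in [first] + rest:
        sample.append(j)
        hits += in_class[j]
        if hits == m:
            break
    return sample


rng = random.Random(2010)
z = [0.4, 0.1, 0.2, 0.1, 0.1, 0.1]
in_class = [True, False, True, False, True, False]
sample = inverse_midzuno_sample(z, in_class, m=2, rng=rng)
```

The returned list never repeats a unit, and its last element is always a unit of interest, because the draw that completes the quota of $m_h$ units ends the sampling.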
In order to compare the estimators from the sampling designs described above,
both real-life and simulated data were used as populations for the study. The
variance, mean squared error and squared bias of the estimators from stratified
inverse sampling were considered. The procedure for the simulation study is as
follows:
Step 1 The population of size $N$ is partitioned into $L$ strata and the number of
sampled units ($m_h$) from the class $C_h$ is predetermined.
Step 2 Stratified inverse sampling schemes are used to draw the units from the
population and the study values are observed. An estimate of a parameter of
interest $\theta$, denoted by $\hat{\theta}_j$, is calculated. In this case,
estimates of $\bar{Y}$, $M$, $P$ and $\bar{Y}_C$ are of interest.
Step 3 Step 2 is repeated 10,000 times.
Step 4 The estimates of the variance, $V(\hat{\theta})$, the mean squared error,
$MSE(\hat{\theta})$, and the squared bias of $\hat{\theta}$, $B^2(\hat{\theta})$,
are calculated, i.e.
$$V(\hat{\theta}) = \frac{1}{10{,}000}\sum_{j=1}^{10{,}000}\left(\hat{\theta}_j - \bar{\theta}\right)^2, \quad
MSE(\hat{\theta}) = \frac{1}{10{,}000}\sum_{j=1}^{10{,}000}\left(\hat{\theta}_j - \theta\right)^2, \quad
B^2(\hat{\theta}) = \left(\bar{\theta} - \theta\right)^2,$$
where $\bar{\theta} = \frac{1}{10{,}000}\sum_{j=1}^{10{,}000}\hat{\theta}_j$ is the
mean of the estimates of $\theta$ from 10,000 samples.
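Step 4 can be illustrated with a short Python sketch. It assumes that each summary measure is an average over the $r$ replicates (divisor $r$, not $r - 1$); the function name `summarize_estimates` and the toy numbers are hypothetical and are not results from the study.

```python
from statistics import fmean


def summarize_estimates(estimates, theta):
    """Return (variance, mse, squared_bias) of replicated estimates of theta."""
    r = len(estimates)
    theta_bar = fmean(estimates)  # mean of the r estimates
    variance = sum((t - theta_bar) ** 2 for t in estimates) / r
    mse = sum((t - theta) ** 2 for t in estimates) / r
    squared_bias = (theta_bar - theta) ** 2
    return variance, mse, squared_bias


# Toy replicates for a parameter whose true value is 10.
v, mse, b2 = summarize_estimates([9.0, 11.0, 10.0, 12.0], theta=10.0)
```

With this choice of divisor the three quantities satisfy $MSE = V + B^2$ exactly, which is a convenient check on a simulation run.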
Stratified inverse random sampling with replacement is compared to stratified inverse
random sampling without replacement. Following this, stratified inverse PPS
sampling with replacement is compared to stratified inverse PPS sampling without
replacement. Finally, a comparison of stratified inverse random sampling and
stratified inverse PPS sampling is made.
CHAPTER 4
4.1 Introduction
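In stratified inverse sampling, the population of $N$ units with $M$ units of
interest is first divided into subpopulations of $N_1, N_2, \ldots, N_L$ units
corresponding to $M_1, M_2, \ldots, M_L$ units of interest. These subpopulations
are non-overlapping and together they comprise the whole of the population; they
are referred to as strata. The following symbols are used when constructing the
stratified inverse random sampling theory. Here $C_h$ denotes the class of units of
interest in stratum $h$, $\bar{C}_h$ its complement, and $s_{C_h}$ and
$s_{\bar{C}_h}$ the sampled units from $C_h$ and $\bar{C}_h$.
For stratum $h$, $h = 1, 2, \ldots, L$, let:

$\sigma^2_{C_h} = \frac{1}{M_h}\sum_{j \in C_h}\left(y_{hj} - \bar{Y}_{C_h}\right)^2$ : The true variance of a unit in class $C_h$

$\sigma^2_{\bar{C}_h} = \frac{1}{N_h - M_h}\sum_{j \in \bar{C}_h}\left(y_{hj} - \bar{Y}_{\bar{C}_h}\right)^2$ : The true variance of a unit in class $\bar{C}_h$

$S^2_{C_h} = \frac{1}{M_h - 1}\sum_{j \in C_h}\left(y_{hj} - \bar{Y}_{C_h}\right)^2$ : The variance of a unit in class $C_h$

$S^2_{\bar{C}_h} = \frac{1}{N_h - M_h - 1}\sum_{j \in \bar{C}_h}\left(y_{hj} - \bar{Y}_{\bar{C}_h}\right)^2$ : The variance of a unit in class $\bar{C}_h$

$\sigma^2_{C_h}(x, y) = \frac{1}{M_h}\sum_{j \in C_h}\left(y_{hj} - \bar{Y}_{C_h}\right)\left(x_{hj} - \bar{X}_{C_h}\right)$ : The true covariance between $x$ and $y$ of a unit in class $C_h$

$\sigma^2_{\bar{C}_h}(x, y) = \frac{1}{N_h - M_h}\sum_{j \in \bar{C}_h}\left(y_{hj} - \bar{Y}_{\bar{C}_h}\right)\left(x_{hj} - \bar{X}_{\bar{C}_h}\right)$ : The true covariance between $x$ and $y$ of a unit in class $\bar{C}_h$

$S^2_{C_h}(x, y) = \frac{1}{M_h - 1}\sum_{j \in C_h}\left(y_{hj} - \bar{Y}_{C_h}\right)\left(x_{hj} - \bar{X}_{C_h}\right)$ : The covariance between $x$ and $y$ of a unit in class $C_h$

$S^2_{\bar{C}_h}(x, y) = \frac{1}{N_h - M_h - 1}\sum_{j \in \bar{C}_h}\left(y_{hj} - \bar{Y}_{\bar{C}_h}\right)\left(x_{hj} - \bar{X}_{\bar{C}_h}\right)$ : The covariance between $x$ and $y$ of a unit in class $\bar{C}_h$

$\bar{y}_{C_h} = \frac{1}{m_h}\sum_{j \in s_{C_h}} y_{hj}$ : The sample mean of units in $s_{C_h}$

$\bar{y}_{\bar{C}_h} = \frac{1}{n_h - m_h}\sum_{j \in s_{\bar{C}_h}} y_{hj}$ : The sample mean of units in $s_{\bar{C}_h}$

$s^2_{C_h} = \frac{1}{m_h - 1}\sum_{j \in s_{C_h}}\left(y_{hj} - \bar{y}_{C_h}\right)^2$ : The sample variance of units in $s_{C_h}$

$s^2_{\bar{C}_h} = \frac{1}{n_h - m_h - 1}\sum_{j \in s_{\bar{C}_h}}\left(y_{hj} - \bar{y}_{\bar{C}_h}\right)^2$ : The sample variance of units in $s_{\bar{C}_h}$

$s^2_{C_h}(x, y) = \frac{1}{m_h - 1}\sum_{j \in s_{C_h}}\left(y_{hj} - \bar{y}_{C_h}\right)\left(x_{hj} - \bar{x}_{C_h}\right)$ : The sample covariance of units in $s_{C_h}$

$s^2_{\bar{C}_h}(x, y) = \frac{1}{n_h - m_h - 1}\sum_{j \in s_{\bar{C}_h}}\left(y_{hj} - \bar{y}_{\bar{C}_h}\right)\left(x_{hj} - \bar{x}_{\bar{C}_h}\right)$ : The sample covariance of units in $s_{\bar{C}_h}$
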
Stratification is a common technique used with large populations. The
principal reasons for using it are as follows (Cochran, 1977: 89): (1) If precision
of the estimates in certain subpopulations is wanted, it is advisable to treat each
subpopulation as a population. (2) Administrative convenience may dictate the use
of stratification. (3) Sampling problems may differ markedly in distinct parts of a
population. (4) Stratification may provide a gain in precision in the estimates of
parameters of interest.
In each stratum, units are selected with equal probabilities and with
replacement until the sample contains $m_h$ units from $C_h$, and the selections in
distinct strata are independent. Let $n_h$ be the sample size in stratum $h$ and
note that $n_h$ is a random variable. With this sampling scheme, any unit may be
drawn more than once. A sample can be represented as $s = (s_1, s_2, \ldots, s_L)$.
The probability of obtaining each sample is
$p(s) = \prod_{h=1}^{L}\left(\frac{1}{N_h}\right)^{n_h}$, where $s_h$ contains the
$m_h$-th unit from $C_h$ at the $n_h$-th draw. Sample $s_h$ can be divided into
$s_{C_h}$ and $s_{\bar{C}_h}$, the units from classes $C_h$ and $\bar{C}_h$,
respectively. This sampling scheme is called stratified inverse random sampling
with replacement.
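A minimal Python sketch of this sampling scheme is given below. The function names and the representation of a stratum as a list of Booleans (True for a unit of interest) are assumptions made for illustration only.

```python
import random


def inverse_sample_wr(in_class, m, rng):
    """Draw unit indices with equal probability and with replacement
    until m units of interest have been drawn; return them in draw order."""
    if m < 1 or not any(in_class):
        raise ValueError("need m >= 1 and at least one unit of interest")
    sample, hits = [], 0
    while hits < m:
        j = rng.randrange(len(in_class))  # each unit has probability 1/N_h
        sample.append(j)
        hits += in_class[j]
    return sample


def stratified_inverse_sample_wr(strata, m_list, seed=0):
    """Sample every stratum independently with its own quota m_h."""
    rng = random.Random(seed)
    return [inverse_sample_wr(c, m, rng) for c, m in zip(strata, m_list)]


strata = [
    [True, False, False, False],
    [True, True, False, False, False, False],
]
samples = stratified_inverse_sample_wr(strata, m_list=[3, 2], seed=1)
```

The length of each returned list plays the role of the random sample size $n_h$, and the same index may appear more than once because the draws are made with replacement.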
Unbiased estimators of $Y$, $\bar{Y}$, $M$ and $P = \frac{M}{N}$, as well as a
biased estimator of $\bar{Y}_C$, under stratified inverse random sampling with
replacement are considered here.
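Theorem 4.1 Under stratified inverse random sampling with replacement, an
unbiased estimator of the population total of a study variable, $Y$, is
$$\hat{Y}_{st} = \sum_{h=1}^{L} N_h\left[\hat{P}_h\,\bar{y}_{C_h} + (1 - \hat{P}_h)\,\bar{y}_{\bar{C}_h}\right], \quad (4.1)$$
where $\hat{P}_h = \frac{m_h - 1}{n_h - 1}$,
$\bar{y}_{C_h} = \frac{1}{m_h}\sum_{j \in s_{C_h}} y_{hj}$ and
$\bar{y}_{\bar{C}_h} = \frac{1}{n_h - m_h}\sum_{j \in s_{\bar{C}_h}} y_{hj}$.
The variance of $\hat{Y}_{st}$ is
$$V(\hat{Y}_{st}) = \sum_{h=1}^{L} N_h^2\left[\left(\bar{Y}_{C_h} - \bar{Y}_{\bar{C}_h}\right)^2 V(\hat{P}_h) + E(\hat{P}_h^2)\,\frac{\sigma^2_{C_h}}{m_h} + E\!\left(\hat{P}_h(1 - \hat{P}_h)\right)\frac{\sigma^2_{\bar{C}_h}}{m_h - 1}\right]. \quad (4.2)$$
An unbiased estimator of the variance of $\hat{Y}_{st}$ is
$$\hat{V}(\hat{Y}_{st}) = \sum_{h=1}^{L} N_h^2\left[\left(\bar{y}_{C_h} - \bar{y}_{\bar{C}_h}\right)^2 \hat{V}(\hat{P}_h) + \hat{P}_h^{*}\,\frac{s^2_{C_h}}{m_h} + \left(\hat{P}_h - \frac{m_h - 1}{m_h - 2}\,\hat{P}_h^{*}\right)\frac{s^2_{\bar{C}_h}}{m_h - 1}\right], \quad (4.3)$$
where $\hat{V}(\hat{P}_h) = \frac{\hat{P}_h(1 - \hat{P}_h)}{n_h - 2}$ and
$\hat{P}_h^{*} = \frac{(m_h - 1)(m_h - 2)}{(n_h - 1)(n_h - 2)}$.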
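Proof: The notations $E_2$ and $V_2$ denote the conditional expectation and variance
given the sample size $n_h$, and $E_1$ and $V_1$ are the unconditional expectation
and variance over $n_h$, respectively. Since the mean of a study variable of units
in stratum $h$, $\bar{Y}_h = \frac{1}{N_h}\sum_{j=1}^{N_h} y_{hj}$, can be written
as $\bar{Y}_h = P_h\bar{Y}_{C_h} + (1 - P_h)\bar{Y}_{\bar{C}_h}$,
$$E(\hat{Y}_{st}) = \sum_{h=1}^{L} N_h E_1 E_2\left[\hat{P}_h\bar{y}_{C_h} + (1 - \hat{P}_h)\bar{y}_{\bar{C}_h} \mid n_h\right],$$
$$= \sum_{h=1}^{L} N_h E_1\left[\hat{P}_h\bar{Y}_{C_h} + (1 - \hat{P}_h)\bar{Y}_{\bar{C}_h}\right],$$
$$= \sum_{h=1}^{L} N_h\left[E_1(\hat{P}_h)\bar{Y}_{C_h} + \left(1 - E_1(\hat{P}_h)\right)\bar{Y}_{\bar{C}_h}\right],$$
$$= \sum_{h=1}^{L} N_h\left[P_h\bar{Y}_{C_h} + (1 - P_h)\bar{Y}_{\bar{C}_h}\right] = \sum_{h=1}^{L} N_h\bar{Y}_h = Y,$$
because $E_2(\bar{y}_{C_h} \mid n_h) = \bar{Y}_{C_h}$,
$E_2(\bar{y}_{\bar{C}_h} \mid n_h) = \bar{Y}_{\bar{C}_h}$ and $E_1(\hat{P}_h) = P_h$.
Let $\bar{y}_h = \hat{P}_h\bar{y}_{C_h} + (1 - \hat{P}_h)\bar{y}_{\bar{C}_h}$ and
note that $\hat{P}_h^{*} = \hat{P}_h^2 - \hat{V}(\hat{P}_h)$, so that
$E_1(\hat{P}_h^{*}) = P_h^2$. Since the strata are sampled independently,
$$V(\hat{Y}_{st}) = \sum_{h=1}^{L} N_h^2 V(\bar{y}_h) = \sum_{h=1}^{L} N_h^2\left[E_1 V_2(\bar{y}_h \mid n_h) + V_1 E_2(\bar{y}_h \mid n_h)\right],$$
$$= \sum_{h=1}^{L} N_h^2\left\{E_1\left[\hat{P}_h^2\frac{\sigma^2_{C_h}}{m_h} + (1 - \hat{P}_h)^2\frac{\sigma^2_{\bar{C}_h}}{n_h - m_h}\right] + V_1\left[\hat{P}_h\bar{Y}_{C_h} + (1 - \hat{P}_h)\bar{Y}_{\bar{C}_h}\right]\right\},$$
$$= \sum_{h=1}^{L} N_h^2\left[\left(\bar{Y}_{C_h} - \bar{Y}_{\bar{C}_h}\right)^2 V(\hat{P}_h) + E(\hat{P}_h^2)\frac{\sigma^2_{C_h}}{m_h} + E\!\left(\hat{P}_h(1 - \hat{P}_h)\right)\frac{\sigma^2_{\bar{C}_h}}{m_h - 1}\right],$$
where the last step uses
$\frac{(1 - \hat{P}_h)^2}{n_h - m_h} = \frac{\hat{P}_h(1 - \hat{P}_h)}{m_h - 1}$.
For the variance estimator,
$$E\left[\hat{V}(\hat{Y}_{st})\right] = \sum_{h=1}^{L} N_h^2 E_1 E_2\left\{\left(\bar{y}_{C_h} - \bar{y}_{\bar{C}_h}\right)^2\hat{V}(\hat{P}_h) + \hat{P}_h^{*}\frac{s^2_{C_h}}{m_h} + \left(\hat{P}_h - \frac{m_h - 1}{m_h - 2}\hat{P}_h^{*}\right)\frac{s^2_{\bar{C}_h}}{m_h - 1} \,\Big|\, n_h\right\}.$$
When the sample size $n_h$ in stratum $h$ is given, then
$E_2(\bar{y}_{C_h}\bar{y}_{\bar{C}_h} \mid n_h) = \bar{Y}_{C_h}\bar{Y}_{\bar{C}_h}$,
$E_2(\bar{y}^2_{\bar{C}_h} \mid n_h) = \bar{Y}^2_{\bar{C}_h} + \frac{\sigma^2_{\bar{C}_h}}{n_h - m_h}$,
$E_2(\bar{y}^2_{C_h} \mid n_h) = \bar{Y}^2_{C_h} + \frac{\sigma^2_{C_h}}{m_h}$,
$E_2(s^2_{C_h} \mid n_h) = \sigma^2_{C_h}$ and
$E_2(s^2_{\bar{C}_h} \mid n_h) = \sigma^2_{\bar{C}_h}$. It is known that
$E\left[\hat{V}(\hat{P}_h)\right] = V(\hat{P}_h)$, then
$$E\left[\hat{V}(\hat{Y}_{st})\right] = \sum_{h=1}^{L} N_h^2 E_1\left\{\left(\bar{Y}_{C_h} - \bar{Y}_{\bar{C}_h}\right)^2\hat{V}(\hat{P}_h) + \left[\hat{P}_h^{*} + \hat{V}(\hat{P}_h)\right]\frac{\sigma^2_{C_h}}{m_h} + \left[\left(\hat{P}_h - \frac{m_h - 1}{m_h - 2}\hat{P}_h^{*}\right)\frac{1}{m_h - 1} + \frac{\hat{V}(\hat{P}_h)}{n_h - m_h}\right]\sigma^2_{\bar{C}_h}\right\},$$
$$= \sum_{h=1}^{L} N_h^2\left\{\left(\bar{Y}_{C_h} - \bar{Y}_{\bar{C}_h}\right)^2 V(\hat{P}_h) + E\left[\hat{P}_h^{*} + \hat{V}(\hat{P}_h)\right]\frac{\sigma^2_{C_h}}{m_h} + E\left[\left(\hat{P}_h - \frac{m_h - 1}{m_h - 2}\hat{P}_h^{*}\right)\frac{1}{m_h - 1} + \frac{\hat{V}(\hat{P}_h)}{n_h - m_h}\right]\sigma^2_{\bar{C}_h}\right\},$$
$$= \sum_{h=1}^{L} N_h^2\left[\left(\bar{Y}_{C_h} - \bar{Y}_{\bar{C}_h}\right)^2 V(\hat{P}_h) + E(\hat{P}_h^2)\frac{\sigma^2_{C_h}}{m_h} + E\!\left(\hat{P}_h(1 - \hat{P}_h)\right)\frac{\sigma^2_{\bar{C}_h}}{m_h - 1}\right] = V(\hat{Y}_{st}),$$
since $\hat{P}_h^{*} + \hat{V}(\hat{P}_h) = \hat{P}_h^2$ and
$\left(\hat{P}_h - \frac{m_h - 1}{m_h - 2}\hat{P}_h^{*}\right)\frac{1}{m_h - 1} + \frac{\hat{V}(\hat{P}_h)}{n_h - m_h} = \frac{\hat{P}_h(1 - \hat{P}_h)}{m_h - 1}$.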
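The point estimate and its variance estimate can be computed from the sampled values as in the following Python sketch. It assumes the forms $\hat{P}_h = (m_h - 1)/(n_h - 1)$, $\hat{V}(\hat{P}_h) = \hat{P}_h(1 - \hat{P}_h)/(n_h - 2)$ and $\hat{P}_h^{*} = (m_h - 1)(m_h - 2)/((n_h - 1)(n_h - 2))$, requires $m_h \ge 3$, and adopts two conventions that are assumptions of this sketch only: the mean of an empty $s_{\bar{C}_h}$ is taken as 0, and the sample variance of fewer than two units is taken as 0. The function names are hypothetical.

```python
from statistics import fmean, variance


def stratum_total_and_variance(N_h, y_c, y_cbar):
    """Return (estimated total, estimated variance) for one stratum.

    y_c    : study values of the sampled units of interest (length m_h)
    y_cbar : study values of the other sampled units (length n_h - m_h)
    """
    m, n = len(y_c), len(y_c) + len(y_cbar)
    if m < 3:
        raise ValueError("this sketch requires m_h >= 3")
    p_hat = (m - 1) / (n - 1)
    v_p = p_hat * (1 - p_hat) / (n - 2)
    p_star = (m - 1) * (m - 2) / ((n - 1) * (n - 2))
    ybar_c = fmean(y_c)
    ybar_cbar = fmean(y_cbar) if y_cbar else 0.0           # assumed convention
    s2_c = variance(y_c)
    s2_cbar = variance(y_cbar) if len(y_cbar) > 1 else 0.0  # assumed convention
    total = N_h * (p_hat * ybar_c + (1 - p_hat) * ybar_cbar)
    var_hat = N_h ** 2 * (
        (ybar_c - ybar_cbar) ** 2 * v_p
        + p_star * s2_c / m
        + (p_hat - (m - 1) / (m - 2) * p_star) * s2_cbar / (m - 1)
    )
    return total, var_hat


def stratified_total(strata):
    """Sum the stratum contributions; strata is a list of (N_h, y_c, y_cbar)."""
    parts = [stratum_total_and_variance(*s) for s in strata]
    return sum(t for t, _ in parts), sum(v for _, v in parts)


# One stratum with N_h = 100, m_h = 3 and n_h = 5 (toy values).
total, var_hat = stratified_total([(100, [4, 6, 8], [1, 3])])
```

For the toy stratum, $\hat{P}_h = 0.5$, so the estimated total is $100(0.5 \cdot 6 + 0.5 \cdot 2) = 400$, and the three terms of the variance estimate are $16/12$, $4/18$ and $1/6$ before multiplication by $N_h^2$.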
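Corollary 4.1 Under stratified inverse random sampling with replacement, an
unbiased estimator of the mean of a study variable in a whole population is
$$\bar{y}_{st} = \sum_{h=1}^{L} W_h\left[\hat{P}_h\,\bar{y}_{C_h} + (1 - \hat{P}_h)\,\bar{y}_{\bar{C}_h}\right], \quad (4.4)$$
where $W_h = \frac{N_h}{N}$. The variance of $\bar{y}_{st}$ is
$$V(\bar{y}_{st}) = \sum_{h=1}^{L} W_h^2\left[\left(\bar{Y}_{C_h} - \bar{Y}_{\bar{C}_h}\right)^2 V(\hat{P}_h) + E(\hat{P}_h^2)\,\frac{\sigma^2_{C_h}}{m_h} + E\!\left(\hat{P}_h(1 - \hat{P}_h)\right)\frac{\sigma^2_{\bar{C}_h}}{m_h - 1}\right]. \quad (4.5)$$
An unbiased estimator of the variance of $\bar{y}_{st}$ is
$$\hat{V}(\bar{y}_{st}) = \sum_{h=1}^{L} W_h^2\left[\left(\bar{y}_{C_h} - \bar{y}_{\bar{C}_h}\right)^2 \hat{V}(\hat{P}_h) + \hat{P}_h^{*}\,\frac{s^2_{C_h}}{m_h} + \left(\hat{P}_h - \frac{m_h - 1}{m_h - 2}\,\hat{P}_h^{*}\right)\frac{s^2_{\bar{C}_h}}{m_h - 1}\right], \quad (4.6)$$
where $\hat{V}(\hat{P}_h)$ and $\hat{P}_h^{*}$ are as defined in theorem 4.1.
Proof: Since $\bar{y}_{st} = \frac{1}{N}\hat{Y}_{st}$, we have
$E(\bar{y}_{st}) = \frac{1}{N}E(\hat{Y}_{st}) = \bar{Y}$,
$V(\bar{y}_{st}) = \frac{1}{N^2}V(\hat{Y}_{st})$ and
$E\left[\hat{V}(\bar{y}_{st})\right] = \frac{1}{N^2}E\left[\hat{V}(\hat{Y}_{st})\right]$.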
Under this sampling scheme, consider the problem of estimating the ratio of the
totals of two study variables $y$ and $x$,
$R = \sum_{h=1}^{L}\sum_{j=1}^{N_h} y_{hj} \Big/ \sum_{h=1}^{L}\sum_{j=1}^{N_h} x_{hj}$.
A ratio of unbiased estimators of the numerator and the denominator is considered.
In a similar manner, we obtain
$\hat{X}_{st} = \sum_{h=1}^{L} N_h\left[\hat{P}_h\,\bar{x}_{C_h} + (1 - \hat{P}_h)\,\bar{x}_{\bar{C}_h}\right]$
as an unbiased estimator of $X = \sum_{h=1}^{L}\sum_{j=1}^{N_h} x_{hj}$.
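Theorem 4.2 Let $(x_{hj}, y_{hj})$ be the values of two variables $x$ and $y$
associated with unit $j$ in stratum $h$, for $h = 1, 2, \ldots, L$. If
$\hat{Y}_{st} = \sum_{h=1}^{L} N_h\left[\hat{P}_h\,\bar{y}_{C_h} + (1 - \hat{P}_h)\,\bar{y}_{\bar{C}_h}\right]$
and
$\hat{X}_{st} = \sum_{h=1}^{L} N_h\left[\hat{P}_h\,\bar{x}_{C_h} + (1 - \hat{P}_h)\,\bar{x}_{\bar{C}_h}\right]$,
then under stratified inverse random sampling with replacement,
$$Cov(\hat{X}_{st}, \hat{Y}_{st}) = \sum_{h=1}^{L} N_h^2\left[\left(\bar{Y}_{C_h} - \bar{Y}_{\bar{C}_h}\right)\left(\bar{X}_{C_h} - \bar{X}_{\bar{C}_h}\right)V(\hat{P}_h) + E(\hat{P}_h^2)\,\frac{\sigma^2_{C_h}(x, y)}{m_h} + E\!\left(\hat{P}_h(1 - \hat{P}_h)\right)\frac{\sigma^2_{\bar{C}_h}(x, y)}{m_h - 1}\right], \quad (4.7)$$
and an unbiased estimator of the covariance is
$$\widehat{Cov}(\hat{X}_{st}, \hat{Y}_{st}) = \sum_{h=1}^{L} N_h^2\left[\left(\bar{y}_{C_h} - \bar{y}_{\bar{C}_h}\right)\left(\bar{x}_{C_h} - \bar{x}_{\bar{C}_h}\right)\hat{V}(\hat{P}_h) + \hat{P}_h^{*}\,\frac{s^2_{C_h}(x, y)}{m_h} + \left(\hat{P}_h - \frac{m_h - 1}{m_h - 2}\,\hat{P}_h^{*}\right)\frac{s^2_{\bar{C}_h}(x, y)}{m_h - 1}\right],$$
where $\hat{P}_h^{*} = \frac{(m_h - 1)(m_h - 2)}{(n_h - 1)(n_h - 2)}$ and
$\hat{V}(\hat{P}_h) = \frac{\hat{P}_h(1 - \hat{P}_h)}{n_h - 2}$.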
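Proof: The notations $E_2$ and $Cov_2$ refer to the conditional expectation and
covariance given the sample size $n_h$, and $E_1$ and $Cov_1$ to the unconditional
expectation and covariance over the sample size $n_h$. Let
$\bar{y}_h = \hat{P}_h\bar{y}_{C_h} + (1 - \hat{P}_h)\bar{y}_{\bar{C}_h}$ and
$\bar{x}_h = \hat{P}_h\bar{x}_{C_h} + (1 - \hat{P}_h)\bar{x}_{\bar{C}_h}$. Since
the strata are sampled independently,
$$Cov(\hat{X}_{st}, \hat{Y}_{st}) = Cov\left(\sum_{h=1}^{L} N_h\bar{x}_h, \sum_{h=1}^{L} N_h\bar{y}_h\right) = \sum_{h=1}^{L} N_h^2\,Cov(\bar{x}_h, \bar{y}_h),$$
$$= \sum_{h=1}^{L} N_h^2\left[E_1 Cov_2(\bar{x}_h, \bar{y}_h \mid n_h) + Cov_1\left(E_2(\bar{x}_h \mid n_h), E_2(\bar{y}_h \mid n_h)\right)\right],$$
$$= \sum_{h=1}^{L} N_h^2\left\{E_1\left[\hat{P}_h^2\frac{\sigma^2_{C_h}(x, y)}{m_h} + (1 - \hat{P}_h)^2\frac{\sigma^2_{\bar{C}_h}(x, y)}{n_h - m_h}\right] + Cov_1\left(\hat{P}_h\bar{Y}_{C_h} + (1 - \hat{P}_h)\bar{Y}_{\bar{C}_h},\; \hat{P}_h\bar{X}_{C_h} + (1 - \hat{P}_h)\bar{X}_{\bar{C}_h}\right)\right\},$$
$$= \sum_{h=1}^{L} N_h^2\left[\left(\bar{Y}_{C_h} - \bar{Y}_{\bar{C}_h}\right)\left(\bar{X}_{C_h} - \bar{X}_{\bar{C}_h}\right)V(\hat{P}_h) + E(\hat{P}_h^2)\frac{\sigma^2_{C_h}(x, y)}{m_h} + E\!\left(\hat{P}_h(1 - \hat{P}_h)\right)\frac{\sigma^2_{\bar{C}_h}(x, y)}{m_h - 1}\right].$$
For the covariance estimator,
$$E\left[\widehat{Cov}(\hat{X}_{st}, \hat{Y}_{st})\right] = \sum_{h=1}^{L} N_h^2 E_1 E_2\left\{\left(\bar{y}_{C_h} - \bar{y}_{\bar{C}_h}\right)\left(\bar{x}_{C_h} - \bar{x}_{\bar{C}_h}\right)\hat{V}(\hat{P}_h) + \hat{P}_h^{*}\frac{s^2_{C_h}(x, y)}{m_h} + \left(\hat{P}_h - \frac{m_h - 1}{m_h - 2}\hat{P}_h^{*}\right)\frac{s^2_{\bar{C}_h}(x, y)}{m_h - 1} \,\Big|\, n_h\right\}.$$
When the sample size $n_h$ is given,
$E_2(\bar{y}_{C_h}\bar{y}_{\bar{C}_h} \mid n_h) = \bar{Y}_{C_h}\bar{Y}_{\bar{C}_h}$,
$E_2(\bar{x}_{C_h}\bar{x}_{\bar{C}_h} \mid n_h) = \bar{X}_{C_h}\bar{X}_{\bar{C}_h}$,
$E_2(\bar{x}_{C_h}\bar{y}_{\bar{C}_h} \mid n_h) = \bar{X}_{C_h}\bar{Y}_{\bar{C}_h}$,
$E_2(\bar{x}_{\bar{C}_h}\bar{y}_{C_h} \mid n_h) = \bar{X}_{\bar{C}_h}\bar{Y}_{C_h}$,
$E_2(\bar{x}_{C_h}\bar{y}_{C_h} \mid n_h) = \bar{X}_{C_h}\bar{Y}_{C_h} + \frac{\sigma^2_{C_h}(x, y)}{m_h}$,
$E_2(\bar{x}_{\bar{C}_h}\bar{y}_{\bar{C}_h} \mid n_h) = \bar{X}_{\bar{C}_h}\bar{Y}_{\bar{C}_h} + \frac{\sigma^2_{\bar{C}_h}(x, y)}{n_h - m_h}$,
$E_2\left(s^2_{C_h}(x, y) \mid n_h\right) = \sigma^2_{C_h}(x, y)$ and
$E_2\left(s^2_{\bar{C}_h}(x, y) \mid n_h\right) = \sigma^2_{\bar{C}_h}(x, y)$.
Consequently,
$$E\left[\widehat{Cov}(\hat{X}_{st}, \hat{Y}_{st})\right] = \sum_{h=1}^{L} N_h^2 E_1\left\{\left(\bar{Y}_{C_h} - \bar{Y}_{\bar{C}_h}\right)\left(\bar{X}_{C_h} - \bar{X}_{\bar{C}_h}\right)\hat{V}(\hat{P}_h) + \left[\hat{P}_h^{*} + \hat{V}(\hat{P}_h)\right]\frac{\sigma^2_{C_h}(x, y)}{m_h} + \left[\left(\hat{P}_h - \frac{m_h - 1}{m_h - 2}\hat{P}_h^{*}\right)\frac{1}{m_h - 1} + \frac{\hat{V}(\hat{P}_h)}{n_h - m_h}\right]\sigma^2_{\bar{C}_h}(x, y)\right\},$$
$$= \sum_{h=1}^{L} N_h^2\left\{\left(\bar{Y}_{C_h} - \bar{Y}_{\bar{C}_h}\right)\left(\bar{X}_{C_h} - \bar{X}_{\bar{C}_h}\right)V(\hat{P}_h) + E\left[\hat{P}_h^{*} + \hat{V}(\hat{P}_h)\right]\frac{\sigma^2_{C_h}(x, y)}{m_h} + E\left[\left(\hat{P}_h - \frac{m_h - 1}{m_h - 2}\hat{P}_h^{*}\right)\frac{1}{m_h - 1} + \frac{\hat{V}(\hat{P}_h)}{n_h - m_h}\right]\sigma^2_{\bar{C}_h}(x, y)\right\},$$
$$= \sum_{h=1}^{L} N_h^2\left[\left(\bar{Y}_{C_h} - \bar{Y}_{\bar{C}_h}\right)\left(\bar{X}_{C_h} - \bar{X}_{\bar{C}_h}\right)V(\hat{P}_h) + E(\hat{P}_h^2)\frac{\sigma^2_{C_h}(x, y)}{m_h} + E\!\left(\hat{P}_h(1 - \hat{P}_h)\right)\frac{\sigma^2_{\bar{C}_h}(x, y)}{m_h - 1}\right].$$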
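The covariance estimate for one stratum can be sketched in Python as follows, under the same assumed forms of $\hat{P}_h$, $\hat{V}(\hat{P}_h)$ and $\hat{P}_h^{*}$ that are used for the variance estimator of the total. The function names, the input format (lists of $(x, y)$ pairs) and the conventions for empty or single-unit subsamples are assumptions of this sketch.

```python
from statistics import fmean


def sample_cov(u, v):
    """Sample covariance with divisor len(u) - 1; 0 for fewer than two units."""
    if len(u) < 2:
        return 0.0  # assumed convention
    ubar, vbar = fmean(u), fmean(v)
    return sum((a - ubar) * (b - vbar) for a, b in zip(u, v)) / (len(u) - 1)


def stratum_cov_hat(N_h, xy_c, xy_cbar):
    """Estimated covariance contribution of one stratum.

    xy_c    : (x, y) pairs of the sampled units of interest
    xy_cbar : (x, y) pairs of the other sampled units
    """
    m, n = len(xy_c), len(xy_c) + len(xy_cbar)
    if m < 3:
        raise ValueError("this sketch requires m_h >= 3")
    x_c, y_c = [x for x, _ in xy_c], [y for _, y in xy_c]
    x_cbar, y_cbar = [x for x, _ in xy_cbar], [y for _, y in xy_cbar]
    p_hat = (m - 1) / (n - 1)
    v_p = p_hat * (1 - p_hat) / (n - 2)
    p_star = (m - 1) * (m - 2) / ((n - 1) * (n - 2))
    xbar_cbar = fmean(x_cbar) if x_cbar else 0.0  # assumed convention
    ybar_cbar = fmean(y_cbar) if y_cbar else 0.0  # assumed convention
    return N_h ** 2 * (
        (fmean(y_c) - ybar_cbar) * (fmean(x_c) - xbar_cbar) * v_p
        + p_star * sample_cov(x_c, y_c) / m
        + (p_hat - (m - 1) / (m - 2) * p_star) * sample_cov(x_cbar, y_cbar) / (m - 1)
    )


# With x equal to y the covariance estimate reduces to the variance estimate.
cov_same = stratum_cov_hat(100, [(4, 4), (6, 6), (8, 8)], [(1, 1), (3, 3)])
# With x = 2y the estimate doubles.
cov_double = stratum_cov_hat(100, [(8, 4), (12, 6), (16, 8)], [(2, 1), (6, 3)])
```

Summing `stratum_cov_hat` over the strata gives the estimate for the whole population, because the strata are sampled independently.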
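Theorem 4.3 The bias of a ratio estimator,
$\hat{R}_{st} = \frac{\hat{Y}_{st}}{\hat{X}_{st}}$, of
$R = \sum_{h=1}^{L}\sum_{j=1}^{N_h} y_{hj} \Big/ \sum_{h=1}^{L}\sum_{j=1}^{N_h} x_{hj}$
satisfies
$$B(\hat{R}_{st}) = -\frac{Cov(\hat{R}_{st}, \hat{X}_{st})}{X} \quad \text{and} \quad \frac{\left|B(\hat{R}_{st})\right|}{\sigma_{\hat{R}_{st}}} \le CV(\hat{X}_{st}),$$
where $\hat{Y}_{st}$ and $\hat{X}_{st}$ are defined as in theorem 4.2. An
approximate mean squared error of $\hat{R}_{st}$ is given by
$$MSE(\hat{R}_{st}) \approx \frac{1}{X^2}\left[R^2 V(\hat{X}_{st}) + V(\hat{Y}_{st}) - 2R\,Cov(\hat{X}_{st}, \hat{Y}_{st})\right]. \quad (4.9)$$
Proof:
$$Cov(\hat{R}_{st}, \hat{X}_{st}) = E(\hat{R}_{st}\hat{X}_{st}) - E(\hat{R}_{st})E(\hat{X}_{st}) = E(\hat{Y}_{st}) - X E(\hat{R}_{st}) = Y - X E(\hat{R}_{st}).$$
Hence, $E(\hat{R}_{st}) = R - \frac{Cov(\hat{R}_{st}, \hat{X}_{st})}{X}$, from
which we get
$$B(\hat{R}_{st}) = E(\hat{R}_{st}) - R = -\frac{Cov(\hat{R}_{st}, \hat{X}_{st})}{X} = -\frac{\rho_{\hat{R}_{st}, \hat{X}_{st}}\,\sigma_{\hat{R}_{st}}\,\sigma_{\hat{X}_{st}}}{X},$$
$$\frac{\left|B(\hat{R}_{st})\right|}{\sigma_{\hat{R}_{st}}} = \frac{\left|\rho_{\hat{R}_{st}, \hat{X}_{st}}\right|\sigma_{\hat{X}_{st}}}{X} \le \frac{\sigma_{\hat{X}_{st}}}{X} = CV(\hat{X}_{st}).$$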
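The parameter $R$ can be written as a function of two parameters. Let
$h(c, d) = \frac{d}{c}$, so $R = \frac{Y}{X} = h(X, Y)$ and
$\hat{R}_{st} = \frac{\hat{Y}_{st}}{\hat{X}_{st}} = h(\hat{X}_{st}, \hat{Y}_{st})$,
where $\hat{Y}_{st}$ and $\hat{X}_{st}$ are unbiased estimators of the totals of
variables $y$ and $x$ in a whole population, respectively. The partial derivatives
are $\frac{\partial h}{\partial c} = -\frac{d}{c^2}$ and
$\frac{\partial h}{\partial d} = \frac{1}{c}$. Evaluating them at $c = X$ and
$d = Y$, we then get $-\frac{Y}{X^2}$ and $\frac{1}{X}$. A first-order Taylor
expansion about $(X, Y)$ gives
$$h(\hat{X}_{st}, \hat{Y}_{st}) \approx h(X, Y) + (\hat{X}_{st} - X)\frac{\partial h}{\partial c}\Big|_{(X, Y)} + (\hat{Y}_{st} - Y)\frac{\partial h}{\partial d}\Big|_{(X, Y)}.$$
We have
$\hat{R}_{st} - R \approx -\frac{Y}{X^2}(\hat{X}_{st} - X) + \frac{1}{X}(\hat{Y}_{st} - Y)$.
An approximate mean squared error of $\hat{R}_{st}$ is
$$MSE(\hat{R}_{st}) = E(\hat{R}_{st} - R)^2 \approx E\left[-\frac{Y}{X^2}(\hat{X}_{st} - X) + \frac{1}{X}(\hat{Y}_{st} - Y)\right]^2,$$
$$= \frac{Y^2}{X^4}V(\hat{X}_{st}) + \frac{1}{X^2}V(\hat{Y}_{st}) - \frac{2Y}{X^3}Cov(\hat{X}_{st}, \hat{Y}_{st}),$$
$$= \frac{1}{X^2}\left[R^2 V(\hat{X}_{st}) + V(\hat{Y}_{st}) - 2R\,Cov(\hat{X}_{st}, \hat{Y}_{st})\right].$$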
In order to estimate the mean squared error of $\hat{R}_{st}$, we can substitute
the estimates $\hat{X}_{st}$, $\hat{Y}_{st}$, $\hat{V}(\hat{X}_{st})$,
$\hat{V}(\hat{Y}_{st})$ and $\widehat{Cov}(\hat{X}_{st}, \hat{Y}_{st})$ into the
expression for $MSE(\hat{R}_{st})$. An appropriate estimator of
$MSE(\hat{R}_{st})$ is obtained by
$$\widehat{MSE}(\hat{R}_{st}) = \frac{1}{\hat{X}_{st}^2}\left[\hat{R}_{st}^2\,\hat{V}(\hat{X}_{st}) + \hat{V}(\hat{Y}_{st}) - 2\hat{R}_{st}\,\widehat{Cov}(\hat{X}_{st}, \hat{Y}_{st})\right], \quad (4.10)$$
where $\hat{V}(\hat{X}_{st})$ and $\hat{V}(\hat{Y}_{st})$ are defined in theorem
4.1 and $\widehat{Cov}(\hat{X}_{st}, \hat{Y}_{st})$ in theorem 4.2. The bias of
$\hat{R}_{st}$ is negligible relative to its standard error provided that
$CV(\hat{X}_{st})$ is less than 0.1 (Cochran, 1977: 158).
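A small Python sketch of expression (4.10) is given below. The function name `ratio_and_mse` and the input values are hypothetical; the inputs stand for estimates that would already have been computed from a sample, and the coefficient of variation of $\hat{X}_{st}$ is estimated by plugging in the variance estimate.

```python
from math import sqrt


def ratio_and_mse(x_hat, y_hat, v_x, v_y, cov_xy):
    """Return (ratio estimate, estimated MSE, estimated CV of x_hat)."""
    r_hat = y_hat / x_hat
    mse_hat = (r_hat ** 2 * v_x + v_y - 2 * r_hat * cov_xy) / x_hat ** 2
    cv_x = sqrt(v_x) / x_hat  # plug-in estimate of CV of the denominator
    return r_hat, mse_hat, cv_x


# Toy inputs: estimated totals, their variance estimates and covariance estimate.
r_hat, mse_hat, cv_x = ratio_and_mse(200.0, 400.0, 100.0, 500.0, 150.0)
bias_negligible = cv_x < 0.1
```

For these toy inputs the ratio estimate is 2, the estimated mean squared error is $(4 \cdot 100 + 500 - 600)/200^2 = 0.0075$, and the estimated coefficient of variation is 0.05, which is below the 0.1 threshold.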
Next, estimators of the prevalence of units of interest, $P = \frac{M}{N}$, and
of the mean of a study variable of the units in $C$, $\bar{Y}_C$, are given.
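Corollary 4.2 An unbiased estimator of the total of a study variable of units in
$C$, $Y_C$, is
$\hat{Y}_{Cst} = \sum_{h=1}^{L} N_h\hat{P}_h\,\bar{y}_{C_h}$ with variance
$$V(\hat{Y}_{Cst}) = \sum_{h=1}^{L} N_h^2\left[\bar{Y}^2_{C_h}V(\hat{P}_h) + E(\hat{P}_h^2)\,\frac{\sigma^2_{C_h}}{m_h}\right].$$
An unbiased estimator of the variance of $\hat{Y}_{Cst}$ is
$$\hat{V}(\hat{Y}_{Cst}) = \sum_{h=1}^{L} N_h^2\left[\bar{y}^2_{C_h}\hat{V}(\hat{P}_h) + \hat{P}_h^{*}\,\frac{s^2_{C_h}}{m_h}\right],$$
where $\hat{V}(\hat{P}_h) = \frac{\hat{P}_h(1 - \hat{P}_h)}{n_h - 2}$ and
$\hat{P}_h^{*} = \frac{(m_h - 1)(m_h - 2)}{(n_h - 1)(n_h - 2)}$.
Proof: A simple proof is obtained by using a new variable
$$y'_{hj} = \begin{cases} y_{hj}, & j \in C_h \\ 0, & \text{otherwise.} \end{cases}$$
By substituting $y'_{hj}$ for $y_{hj}$ in theorem 4.1, the results are obtained.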
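Corollary 4.3 An unbiased estimator of the total number of units in class $C$, $M$,
is $\hat{M}_{st} = \sum_{h=1}^{L} N_h\hat{P}_h$ with variance
$V(\hat{M}_{st}) = \sum_{h=1}^{L} N_h^2 V(\hat{P}_h)$.
An unbiased estimator of the variance of $\hat{M}_{st}$ is
$\hat{V}(\hat{M}_{st}) = \sum_{h=1}^{L} N_h^2\,\frac{\hat{P}_h(1 - \hat{P}_h)}{n_h - 2}$.
Proof: Define $y'_{hj} = 1$ if $j \in C_h$ and $y'_{hj} = 0$ otherwise. By
substituting these new values for $y_{hj}$ in expressions (4.1) to (4.3) in
theorem 4.1, the results are obtained.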
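The estimate of $M$ needs only the stratum sizes, the quotas $m_h$ and the realized sample sizes $n_h$. The following Python sketch assumes $\hat{P}_h = (m_h - 1)/(n_h - 1)$ and the variance estimate $\hat{P}_h(1 - \hat{P}_h)/(n_h - 2)$; the function name and the requirement $m_h \ge 3$ are assumptions of the sketch.

```python
def class_size_estimate(strata):
    """Return (M_hat, V_hat) from an iterable of (N_h, m_h, n_h) triples."""
    m_hat = 0.0
    v_hat = 0.0
    for N_h, m_h, n_h in strata:
        if m_h < 3 or n_h < m_h:
            raise ValueError("this sketch requires 3 <= m_h <= n_h")
        p_hat = (m_h - 1) / (n_h - 1)
        m_hat += N_h * p_hat
        v_hat += N_h ** 2 * p_hat * (1 - p_hat) / (n_h - 2)
    return m_hat, v_hat


# Two strata: (N_h, m_h, n_h) = (100, 3, 5) and (200, 4, 10), toy values.
m_hat, v_hat = class_size_estimate([(100, 3, 5), (200, 4, 10)])
```

Here $\hat{P}_1 = 1/2$ and $\hat{P}_2 = 1/3$, so $\hat{M}_{st} = 50 + 200/3 = 350/3$ and the variance estimate is $2500/3 + 10000/9 = 17500/9$.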
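Corollary 4.4 An unbiased estimator of the prevalence, $P = \frac{M}{N}$, is given
by $\hat{P}_{st} = \sum_{h=1}^{L} W_h\hat{P}_h$ with variance
$$V(\hat{P}_{st}) = \sum_{h=1}^{L} W_h^2 P_h^2\left[\frac{Q_h}{m_h} + \frac{2Q_h^2}{m_h(m_h + 1)} + \frac{6Q_h^3}{m_h(m_h + 1)(m_h + 2)} + \cdots\right], \quad (4.12)$$
where $Q_h = 1 - P_h$. An unbiased estimator of the variance of $\hat{P}_{st}$ is
$\hat{V}(\hat{P}_{st}) = \sum_{h=1}^{L} W_h^2\,\frac{\hat{P}_h(1 - \hat{P}_h)}{n_h - 2}$.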
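The series for the variance can be evaluated numerically, as in the Python sketch below. It assumes that the variance of $\hat{P}_h = (m_h - 1)/(n_h - 1)$ has the series form $P_h^2\sum_{r \ge 1} Q_h^r \big/ \binom{m_h + r - 1}{r}$, whose first terms are $Q_h/m_h$, $2Q_h^2/(m_h(m_h + 1))$ and $6Q_h^3/(m_h(m_h + 1)(m_h + 2))$. The function names and the truncation at 200 terms are choices made for this illustration.

```python
from math import log


def haldane_variance(P, m, terms=200):
    """Truncated series P^2 * sum_{r>=1} Q^r / C(m + r - 1, r)."""
    Q = 1.0 - P
    term, total = 1.0, 0.0
    for r in range(1, terms + 1):
        # Ratio of consecutive terms: Q * r / (m + r - 1).
        term *= Q * r / (m + r - 1)
        total += term
    return P ** 2 * total


def prevalence_variance(strata):
    """Variance of the stratified estimator from (W_h, P_h, m_h) triples."""
    return sum(W ** 2 * haldane_variance(P, m) for W, P, m in strata)


# For m = 2 the series has the closed form P^2 * (-ln(P) / Q - 1).
exact_m2 = 0.25 * (2 * log(2) - 1)
series_m2 = haldane_variance(0.5, 2)
v_st = prevalence_variance([(0.4, 0.1, 5), (0.6, 0.2, 5)])
```

Updating each term from the previous one avoids computing large binomial coefficients, and the leading term $P_h^2 Q_h / m_h$ is always a lower bound for the truncated sum because every term is positive.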
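Proof: Since $\hat{P}_h$ is an unbiased estimator of $P_h$,
$E(\hat{P}_{st}) = \sum_{h=1}^{L} W_h P_h = P$. Because the strata are sampled
independently, the variance is
$V(\hat{P}_{st}) = \sum_{h=1}^{L} W_h^2 V(\hat{P}_h)$.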
