

VARIANCE ESTIMATION FOR SYNTHETIC ESTIMATORS IN THE CONTEXT OF AN ESTABLISHMENT SURVEY

Daniel Hurtubise, Yves Morin, Pierre Lavallée and Michel Hidiroglou, Statistics Canada
Daniel Hurtubise, Statistics Canada, R.H. Coats Building 11th floor, Ottawa, Ontario, K1A 0T6, Canada

[email protected]

ABSTRACT

The computation of the precision of estimates of interest provides an idea of data quality. It is sometimes difficult to exactly quantify this precision algebraically on account of the sample design complexity or the estimators produced. Our survey, the Survey of Employment, Payrolls and Hours (SEPH), provides such a challenge. The data for SEPH are collected and combined via two independent data sources: administrative records and a sample survey. Synthetic estimators are used to produce the estimates. These synthetic estimators are functions of the administrative variables as well as of survey variables. A general estimation formula that summarises the various required estimates of interest is given. The associated variance is provided. Examples of some of the estimators produced are also given.

Key Words: Independent sources of data, administrative data, survey data, employment and payroll, ratios of totals, jackknife.

1. INTRODUCTION

Estimating the precision of estimates has always been an important component for measuring the quality of data from a survey. Statements of precision are sometimes based on approximate estimates of the true variance due to the complexity of the estimators and of the sample design. An example of such a survey is the Survey of Employment, Payrolls and Hours (SEPH). SEPH uses two independent sources of data: a sample of administrative records and a sample survey called the Business Payroll Survey (BPS). The former is used to obtain estimates of employment and payroll for a number of domains, while the latter allows for the estimation of several important ratios for a number of pre-defined subsets of the population (which are called model groups). SEPH uses synthetic estimators that combine the administrative and survey estimates. Ratios of functions of administrative variables (for some domains) and of survey estimates (over model groups) are computed. The variances of these synthetic estimators are computed using a combination of the jackknife technique and post-stratified variance estimation.

The survey design associated with SEPH is given in section 2. The two sources of data are described in section 3. The general estimation formula is given in section 4. Its associated variance and some examples are provided in section 5.

2. THE SURVEY OF EMPLOYMENT, PAYROLLS AND HOURS (SEPH)

SEPH is a monthly program that publishes data on employment, payrolls, paid hours, overtime pay and hours, as well as summarised earnings for different categories of employment. The primary objectives of the survey are to provide: i) monthly estimates of the total number of paid employees, ii) payrolls and average weekly earnings, and iii) average weekly hours and other related variables for different domains of interest, which are the three digit Standard Industrial Classification (SIC3) levels (and all aggregations) for Canada and the provinces. The SIC2 levels (SIC aggregated to two digits) are the major fields of activity, e.g. construction, manufacturing, while SIC3 is a further detailed level. There are close to one million establishments in-scope for this program. The in-scope units include all employers in Canada, except in agriculture, fishing and trapping, private households, religious organisations and military services.

SEPH was designed at the beginning of the 1980s as a stratified sample of establishments. At that time, the sample was designed to obtain expected coefficients of variation (CV) of 3% for the estimated number of employees at the provincial-industrial (SIC3) levels. The estimation procedure was strictly based on the survey design weights and the data directly collected from the sampled establishments. That is, no auxiliary information was used in estimation.

This sampling design was recently changed because of the new availability of administrative data from the Canadian Customs and Revenue Agency (CCRA). The availability of this new administrative data source led to a redesign that was spread over three phases and spanned a period of four years. Since the last phase of the redesign, SEPH is based on two independent sources of data: the administrative data from CCRA and the BPS. These two sets of data are combined to get the estimates required each month. The use of administrative data allows for the reduction of the response burden while increasing the data quality, especially for the estimated number of employees.

For more information on SEPH, see Hidiroglou (1995), Lim and Daoust (1997), Rancourt and Hidiroglou (1998) and Grondin, et al. (1999).

3. SOURCES OF DATA

3.1. Administrative data

Enterprises are required to remit to CCRA all the amounts deducted from employees' wages and salaries. Enterprises send their remittances to CCRA for each list of employees. This results in a complete file of all Business Number (BN) accounts containing the following three variables: remittances, total number of employees (employment) and total salaries (payrolls). This file is available on a monthly basis from CCRA. A BN is a unique identifier assigned by CCRA to each legal entity participating in one or more of the following taxation programs: corporation taxation, Goods and Services Tax (GST), payroll deductions, importer/exporter tax. There are two types of remitters based on the number of remittances made each month. The first type (type O) represents once-monthly remitters whose average monthly remittances in the previous year were less than $15,000. The second type (type T) represents remitters whose average monthly remittances in the previous year were at least $15,000. Type T remitters remit one to four times a month depending on the type of payroll used (i.e., monthly, weekly, biweekly, etc.). Currently, all BN accounts (approximately 1,000,000) are available. However, only a sample of some 200,000 selected BNs, which have been edited and imputed, is used. Sampling the BNs was necessary because editing and imputing the whole BN universe is computationally intensive and requires time-consuming manual intervention.

Sampling of the administrative file is done in such a way that some records are selected with certainty (take-all) and others are selected using Poisson sampling (take-some). The take-all portion includes all the type T remitter accounts as well as BNs linked to multiple enterprises and large enterprises identified as take-all in the BPS sample. The take-some portion is a sample of type O remitters selected with Poisson sampling using the last digit of the BN, along with a pre-determined sampling fraction in each province (Ontario, Quebec and British Columbia: 10%, Yukon and Northwest Territories: 100%, all other provinces: 20%). Population counts are available for the administrative file by industry, geography and BN employment size. These population counts and the corresponding sample counts are then used to obtain post-stratified weights.
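The selection and weighting steps described above can be illustrated with a minimal sketch. The paper does not say which BN digits are retained or how the post-strata are coded, so the digit sets in `SELECTED_DIGITS`, the record layout and the function names below are all hypothetical.

```python
from collections import Counter

# Hypothetical rule: a type O account is kept when the last digit of its BN
# falls in a fixed set of digits (1 digit = 10%, 2 digits = 20%, all = 100%).
SELECTED_DIGITS = {
    "10%": {"7"},
    "20%": {"3", "7"},
    "100%": set("0123456789"),
}

def take_some_sample(accounts, rate):
    """Keep the accounts whose BN ends with one of the digits for `rate`."""
    digits = SELECTED_DIGITS[rate]
    return [a for a in accounts if a["bn"][-1] in digits]

def post_stratified_weights(population_counts, sample):
    """Weight N_h / n_h for every post-stratum h present in the sample."""
    sample_counts = Counter(a["post_stratum"] for a in sample)
    return {h: population_counts[h] / n_h for h, n_h in sample_counts.items()}

# Toy universe of 100 accounts in two post-strata (illustrative values only).
accounts = [
    {"bn": str(100000000 + i), "post_stratum": "A" if i < 60 else "B"}
    for i in range(100)
]
sample = take_some_sample(accounts, "10%")
weights = post_stratified_weights({"A": 60, "B": 40}, sample)
```

Selecting on a digit of the identifier behaves like independent selection of each unit only if that digit is unrelated to the survey variables, which is the usual justification for this kind of scheme.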

3.2. Survey data

Only two (employment and payrolls) out of the three variables obtained from the administrative data files are used by SEPH. However, SEPH requires a number of additional variables. These are number of paid hours, total earnings and overtime hours, further broken out into different categories of employees such as hourly rated employees and salaried employees. Some of the additional variables are regressed on employment and payrolls, while others are used to compute some ratios of totals. Predicted values for these variables are then obtained for the whole administrative sample by applying these fits or ratios ($\hat{Y}_k = \hat{\beta} \times X_{k,ADM}$). The additional variables are collected via a sample of establishments from the Business Register (BR), known as the Business Payroll Survey (BPS). An establishment is in-scope if it is an employer (has at least one employee), is active and belongs to an industry covered by SEPH.
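As an illustration of the ratio case, the sketch below estimates a ratio of weighted BPS totals within one model group and applies it to administrative records. The record layout (keys `w`, `y`, `x`) and the function names are assumptions made for this example, not the production SEPH implementation.

```python
def fit_ratio(bps_records):
    """Ratio of weighted totals, beta_hat = sum(w*y) / sum(w*x), in one model group."""
    numerator = sum(r["w"] * r["y"] for r in bps_records)
    denominator = sum(r["w"] * r["x"] for r in bps_records)
    return numerator / denominator

def predict(admin_x, beta_hat):
    """Predicted value Y_hat_k = beta_hat * X_k for each administrative record."""
    return [beta_hat * x for x in admin_x]

# Illustrative values: y could be paid hours and x employment.
bps = [{"w": 2.0, "y": 30.0, "x": 10.0}, {"w": 1.0, "y": 90.0, "x": 30.0}]
beta_hat = fit_ratio(bps)
predicted = predict([5.0, 8.0], beta_hat)
```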

The establishments on the BR are grouped into independent and homogeneous model groups based on industry (sometimes also based on geography and size). The model groups form an exhaustive and non-overlapping partition of the BR. They are chosen to provide the best fit for the variables to be predicted while retaining an analytical meaning. Each model group is split into a take-all and a take-some portion. The determination of the take-all portion of each model group uses a procedure known as the sigma gap method (Bernier and Nobrega (1998)). All establishments within a model group are ranked on employment size, say $x_{(k)}$. The take-all boundary $x_{(k)}$ is the smallest value that satisfies the following conditions: i) the $x_{(k)}$ value is greater than the median of the establishments' employment size, and ii) $x_{(k)} - x_{(k-1)} > a\sigma_x$, where $\sigma_x$ is the standard deviation of the x-values within the model group and a is a specified constant. All establishments whose employment size is higher than $x_{(k)}$ are included in the take-all portion. Each take-all portion is further split into two portions, based on employment size. The establishments belonging to the larger take-all portion are excluded from the regression fit, and represent themselves in their own model group. The establishments belonging to the smaller take-all portion are included in the regression fit of their associated model group.
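The boundary search of the sigma gap method can be sketched as follows. The paper does not state whether $\sigma_x$ is a population or a sample standard deviation, so the use of `statistics.pstdev` is an assumption, as are the function name and the toy data.

```python
import statistics

def sigma_gap_boundary(sizes, a):
    """Smallest ranked size x_(k) above the median with x_(k) - x_(k-1) > a * sigma_x.

    Returns None when no gap is large enough.
    """
    x = sorted(sizes)
    if len(x) < 2:
        return None
    median = statistics.median(x)
    sigma = statistics.pstdev(x)  # assumption: population standard deviation
    for k in range(1, len(x)):
        if x[k] > median and x[k] - x[k - 1] > a * sigma:
            return x[k]
    return None

# Toy model group: five small establishments and two much larger ones.
employment = [1, 2, 3, 4, 5, 50, 60]
boundary = sigma_gap_boundary(employment, a=1.0)
```

A larger constant a makes the rule more conservative: with the same toy data and a = 2.0 no gap qualifies and the function returns None.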

The take-some portion of each model group is further split into strata based on geography, industry and employment size. The take-some sample is selected using a stratified sampling design and "collocated sampling" within each stratum. Each selected establishment stays 12 months on average in the sample, and one twelfth of the sample is replaced each month. The BPS reaches approximately 10,000 establishments each month.

4. ESTIMATION

The required estimates are obtained by tabulating the variables associated with the file of administrative records. Some of these variables are obtained from the regression models and/or from ratios estimated from the BPS. These variables are called derived variables as opposed to the direct variables obtained from the administrative source (employment and payrolls).

4.1. Estimation of Paid Hours and Summarised earnings

These two derived variables, paid hours and summarised earnings, are predicted with regression models developed using the BPS sample with employment and payroll as independent variables. Mass imputation is applied to the administrative records using the regression coefficients computed for each model group. For simplification purposes, the paid hours and summarised earnings variables are considered as administrative variables in the variance calculation (see section 5.1 for the justification of this simplification).

4.2. Estimation using synthetic estimators

SEPH publishes 37 different estimates each month for several domains of interest d. The majority of these estimates are synthetic estimators based on a function of administrative totals for domains of interest and a function of ratios of BPS totals at the model group level. The lowest domains of interest are defined within each model group, while the largest domains of interest might cover more than one model group. The estimates of predicted values are approximately unbiased at the model group level. Estimates for domains smaller than model groups are synthetic and may be biased. However, since the model groups are composed of homogeneous establishments based on industry and geography characteristics, the bias of the synthetic estimator is assumed to be negligible.

For a given domain of interest d, the 37 estimators can be represented as

$$Z_{(d)} = \frac{\sum_g A_{g(d)} \times B_g}{\sum_g C_{g(d)} \times D_g},$$

where $A_{g(d)}$ and $C_{g(d)}$ are administrative totals, while $B_g$ and $D_g$ are functions of ratios of BPS totals. Variants of $Z_{(d)}$ are constructed by setting B, C or D to 1. $Z_{(d)}$ is estimated as

$$\hat{Z}_{(d)} = \frac{\sum_g \hat{A}_{g(d)} \hat{B}_g}{\sum_g \hat{C}_{g(d)} \hat{D}_g},$$

where $\hat{A}_{g(d)}$ and $\hat{C}_{g(d)}$ are estimates of $A_{g(d)}$ and $C_{g(d)}$. These estimates are computed for the domain of interest d for the corresponding model groups g; $\hat{B}_g$ and $\hat{D}_g$ are estimates of $B_g$ and $D_g$, computed for model group g. $\hat{Z}_{(d)}$ is in fact a ratio of dependent variables, and each variable is a total over all model groups linked to the domain of interest. Let $\hat{X}_{g(d)} = \hat{C}_{g(d)} \hat{D}_g$ and $\hat{Y}_{g(d)} = \hat{A}_{g(d)} \hat{B}_g$ be the synthetic estimators of the totals $X_{g(d)}$ and $Y_{g(d)}$ respectively.

Hence

$$\hat{Z}_{(d)} = \frac{\sum_g \hat{Y}_{g(d)}}{\sum_g \hat{X}_{g(d)}} = \frac{\hat{Y}_{(d)}}{\hat{X}_{(d)}}.$$

Example 1: In the estimation of the number of hourly rated employees H_EMPL, we have that

$$\text{H\_EMPL} = EMPL_{ADM} \times \frac{\text{H\_EMPL}_{BPS}}{EMPL_{BPS}} = EMPL_{ADM} \times R1_{BPS}.$$

H_EMPL is the product of the number of employees from the administrative source and the BPS ratio of the number of hourly rated employees to the total number of employees, calculated over model groups; this ratio is $R1_{BPS}$. In this case, $A = EMPL_{ADM}$ (employment from the administrative source), $B = R1_{BPS}$, C = D = 1.

Example 2: Let us consider the average weekly earnings for the salaried employees,

$$\text{S\_AWE} = \frac{Summ_{ADM} \times R3_{BPS}}{EMPL_{ADM} \times R2_{BPS}}.$$

$R2_{BPS}$ is the proportion of salaried employees to total employees and $R3_{BPS}$ is the proportion of total summarised earnings for salaried employees to total summarised earnings. Both ratios are from BPS and are calculated over model groups. In that case, $A = Summ_{ADM}$ (summarised earnings from the administrative source), $B = R3_{BPS}$, $C = EMPL_{ADM}$ (employment from the administrative source), $D = R2_{BPS}$.
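The two examples can be reproduced numerically with a small sketch of the general synthetic estimator. The function name, the argument layout and all figures are illustrative assumptions. For estimators with C = D = 1 the sketch sets the whole denominator to 1, which is one reading of the text and is consistent with the variance formula given for H_EMPL in section 5.5.

```python
def synthetic_estimate(numerator_terms, denominator_terms=None):
    """Z_(d) = sum_g A_g(d) * B_g / sum_g C_g(d) * D_g.

    numerator_terms: list of (A, B) pairs, one per model group g.
    denominator_terms: list of (C, D) pairs, or None when the estimator
    is a total (denominator taken as 1).
    """
    y_hat = sum(a * b for a, b in numerator_terms)
    if denominator_terms is None:
        return y_hat
    x_hat = sum(c * d for c, d in denominator_terms)
    return y_hat / x_hat

# Two model groups linked to the domain d (made-up values).
empl_adm = [1000.0, 500.0]        # administrative employment
summ_adm = [800000.0, 300000.0]   # administrative summarised earnings
r1 = [0.4, 0.6]                   # hourly rated / total employees (BPS)
r2 = [0.6, 0.4]                   # salaried / total employees (BPS)
r3 = [0.5, 0.4]                   # salaried share of summarised earnings (BPS)

h_empl = synthetic_estimate(list(zip(empl_adm, r1)))
s_awe = synthetic_estimate(list(zip(summ_adm, r3)), list(zip(empl_adm, r2)))
```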

5. VARIANCE FORMULA

5.1. Characteristics of the sample design

Samples are selected from two independent sources of data, resulting in two sources of variability. Since these sources are independent there is no covariance between their estimators.

Two administrative variables (paid hours and summarised earnings) are mass imputed using regression models derived from the BPS sample. The variability of these regression models should be part of the variability generated by the administrative portion. However, these models are associated with homogeneous model groups with good fits (the average $R^2$ is above 95%). We consider that the variability resulting from the regression models is negligible as compared to the sampling variability of the administrative records. Further, the sample size for the current sample is large enough to yield stable ratios. This assumption has been checked on data by computing the variability of the regression models for paid hours and comparing its importance with respect to the other components of variability.

Another characteristic that affects the variability coming from the administrative source is the imputation of missing values (called classical imputation). Every administrative record with missing values is imputed according to the reported variables and to the historic data available (previous month), each variable being multiplied by the respondent trend in each imputation class. The variability of this imputation should be considered in the total administrative variability. However, this imputation variability is considered to be negligible as compared to the sampling variability of administrative records. The average imputation rate of the administrative file is of the order of 20%. Several imputation procedures are used; trend imputation is the most frequently used among them. The resulting variability should not be too important; however, it will be incorporated into the variance computations at some later point in time.

In summary, the total variance includes the sampling variability of the two sources of data.

5.2. Estimating the variance

The estimated variance of $\hat{Z}_{(d)}$ is obtained by noting that it is the ratio of two dependent variables. Hence

$$\hat{V}(\hat{Z}_{(d)}) = \frac{1}{\hat{X}_{(d)}^2} \left[ \sum_g \hat{V}(\hat{Y}_{g(d)}) + \hat{Z}_{(d)}^2 \sum_g \hat{V}(\hat{X}_{g(d)}) - 2 \hat{Z}_{(d)} \sum_g \widehat{Cov}(\hat{Y}_{g(d)}, \hat{X}_{g(d)}) \right].$$

Since the two sources of data are independent, we have that $\widehat{Cov}(\hat{A}_{g(d)}, \hat{B}_g)$, $\widehat{Cov}(\hat{A}_{g(d)}, \hat{D}_g)$, $\widehat{Cov}(\hat{C}_{g(d)}, \hat{B}_g)$ and $\widehat{Cov}(\hat{C}_{g(d)}, \hat{D}_g)$ are equal to zero.

Since $\hat{X}_{g(d)} = \hat{C}_{g(d)} \hat{D}_g$ and $\hat{Y}_{g(d)} = \hat{A}_{g(d)} \hat{B}_g$, we have that

$$\widehat{Cov}(\hat{Y}_{g(d)}, \hat{X}_{g(d)}) = \hat{A}_{g(d)} \hat{C}_{g(d)} \widehat{Cov}(\hat{B}_g, \hat{D}_g) + \hat{B}_g \hat{D}_g \widehat{Cov}(\hat{A}_{g(d)}, \hat{C}_{g(d)}) - \widehat{Cov}(\hat{B}_g, \hat{D}_g) \widehat{Cov}(\hat{A}_{g(d)}, \hat{C}_{g(d)})$$

and

$$\hat{V}(\hat{Y}_{g(d)}) = \hat{V}(\hat{A}_{g(d)} \hat{B}_g) = \hat{A}_{g(d)}^2 \hat{V}(\hat{B}_g) + \hat{B}_g^2 \hat{V}(\hat{A}_{g(d)}) - \hat{V}(\hat{A}_{g(d)}) \hat{V}(\hat{B}_g)$$

(Goodman (1960)).

The following subsections explain how the sampling variance is estimated for each source.

5.3. Variability due to administrative data

There are four administrative variables for which the associated variance is computed: employment, payrolls, paid hours and summarised earnings. The last two variables are derived from regression models, while the first two are either directly obtained from the administrative source or imputed. Since the variability due to the use of regression models and the one due to imputation are considered negligible, it can be assumed that these four variables are reported by the administrative source. The variance and the covariance between these variables are computed for each domain of interest within each model group g. These variances are estimated as if the estimator had been post-stratified. That is,

$$\hat{V}(\hat{A}_{g(d)}) = \left( \frac{N_g}{\hat{N}_g} \right)^2 \sum_{h \in g} N_h^2 (1 - f_h) \frac{\hat{S}_{h(d)}^2}{n_h} \quad \text{and} \quad \hat{S}_{h(d)}^2 = \frac{1}{n_h - 1} \sum_{i \in s_h} \left( A_{hi(d)} - \bar{A}_{h(d)} \right)^2,$$

where $A_{hi(d)} = A_{hi}$ if $i \in s_{h(d)}$, and 0 otherwise. $N_g$ is the population size of model group g, $s_h$ is the portion of the hth stratum contained in model group g and $s_{h(d)}$ is the portion of the sample $s_h$ that belongs to domain d. $(N_g / \hat{N}_g)^2$ is set equal to 1 because $N_g$ is not readily available from the administrative file. Furthermore, since the sample size is relatively large, it is reasonable to assume that $(N_g / \hat{N}_g)^2$ is close to one.

5.4. Variability due to BPS

The variability considered for the BPS sample is the variability of the functions of ratios that are in the form of synthetic estimators. The variance of these functions of ratios is computed using the jackknife technique within each model group. Some synthetic estimators are strictly of a ratio type. In that case, the covariance between the function of ratios in the numerator and the one in the denominator is also computed using the jackknife technique. Let $H_g$ be the number of strata within each model group g. The estimated variance for a given ratio $\hat{B}_g$ computed within model group g is

$$\hat{V}(\hat{B}_g) = \sum_{h=1}^{H_g} \frac{L_{gh} - 1}{L_{gh}} \sum_{l=1}^{L_{gh}} \left( \hat{B}_{g(hl)} - \hat{B}_g \right)^2,$$

where $L_{gh}$ is the number of replicates within stratum h ($h = 1, \ldots, H_g$) for model group g ($L_{gh} = 2$ in our survey), $\hat{B}_{g(hl)}$ is the estimate of $B_g$ when replicate l of stratum h is removed from the data and $\hat{B}_g$ is the estimate of $B_g$ with the full BPS sample.
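The jackknife variance of a BPS ratio described in section 5.4 can be sketched as below. The paper does not spell out how the remaining replicates are reweighted when one is removed; the sketch assumes the usual delete-a-group adjustment, in which the other replicates of the same stratum are inflated by L/(L-1). The function names and the toy replicate totals are also assumptions.

```python
def ratio(totals):
    """Ratio of totals sum(y) / sum(x) for a list of (y, x) pairs."""
    y = sum(t[0] for t in totals)
    x = sum(t[1] for t in totals)
    return y / x

def jackknife_variance(strata):
    """strata: dict mapping stratum h to a list of weighted replicate totals (y, x).

    Returns (B_hat, V_hat) with
    V_hat = sum_h (L_h - 1) / L_h * sum_l (B_hat_(hl) - B_hat)^2.
    """
    full = [t for reps in strata.values() for t in reps]
    b_full = ratio(full)
    variance = 0.0
    for h, reps in strata.items():
        n_rep = len(reps)
        factor = n_rep / (n_rep - 1)  # assumed reweighting of kept replicates
        others = [t for k, r in strata.items() if k != h for t in r]
        for l in range(n_rep):
            kept = [(factor * y, factor * x)
                    for i, (y, x) in enumerate(reps) if i != l]
            b_hl = ratio(others + kept)
            variance += (n_rep - 1) / n_rep * (b_hl - b_full) ** 2
    return b_full, variance

# Two strata with two replicates each, as in the survey (L_gh = 2).
strata = {
    "h1": [(10.0, 20.0), (12.0, 20.0)],
    "h2": [(5.0, 10.0), (5.0, 10.0)],
}
b_hat, v_hat = jackknife_variance(strata)
```

In the toy data only stratum h1 contributes, because the two replicates of h2 are identical and removing either one leaves the ratio unchanged.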

5.5. Example

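The estimated variance of the estimator of H_EMPL given in section 4.2 is given by

$$\hat{V}(\hat{Z}_{(d)}) = \hat{V}(\hat{A} \times \hat{B}) = \sum_g \left[ \hat{A}_{g(d)}^2 \times \hat{V}(\hat{B}_g) + \hat{B}_g^2 \times \hat{V}(\hat{A}_{g(d)}) - \hat{V}(\hat{A}_{g(d)}) \times \hat{V}(\hat{B}_g) \right],$$

where $A = EMPL_{ADM}$, B = R1, C = D = 1.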
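For one model group, the two ingredients of this example can be sketched in a few lines: the post-stratified variance of the administrative total (section 5.3, with the factor for $N_g$ set to 1) and Goodman's estimator for the variance of a product of independent estimates. The paper does not define $N_h$, $n_h$ and $f_h$; the sketch assumes they are the stratum population size, the stratum sample size and the sampling fraction $n_h / N_h$. All names and figures are illustrative.

```python
import statistics

def poststratified_variance(strata):
    """V_hat(A_hat) = sum_h N_h^2 * (1 - f_h) * S_h^2 / n_h.

    strata: list of dicts with 'N' (population count) and 'values'
    (sampled values A_hi(d), set to 0 outside the domain d).
    Assumption: f_h = n_h / N_h.
    """
    total = 0.0
    for s in strata:
        n_h = len(s["values"])
        f_h = n_h / s["N"]
        s2_h = statistics.variance(s["values"])  # divisor n_h - 1
        total += s["N"] ** 2 * (1.0 - f_h) * s2_h / n_h
    return total

def goodman_product_variance(a_hat, b_hat, var_a, var_b):
    """Variance estimate of a_hat * b_hat for independent a_hat and b_hat."""
    return a_hat ** 2 * var_b + b_hat ** 2 * var_a - var_a * var_b

# One stratum, one model group (made-up values).
strata = [{"N": 10, "values": [2.0, 4.0, 6.0]}]
var_a = poststratified_variance(strata)
a_hat = 10 * statistics.mean([2.0, 4.0, 6.0])   # expanded total
var_h_empl = goodman_product_variance(a_hat, 0.5, var_a, 0.001)
```

For a domain covering several model groups, the per-group values returned by `goodman_product_variance` would simply be summed, as in the formula above.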

The estimated variance of S_AWE is given by the formula in section 5.2 with A, B, C and D defined in section 4.2.
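For ratio-type estimators such as S_AWE, the variance of section 5.2 combines per-group variances and covariances of the numerator and denominator. The following sketch implements that linearised form, (1 / X^2) [sum V(Y_g) + Z^2 sum V(X_g) - 2 Z sum Cov(Y_g, X_g)]. The function name and the inputs are hypothetical, and the per-group variances and covariances are assumed to have been computed beforehand.

```python
def ratio_variance(y_hat, x_hat, var_y, var_x, cov_yx):
    """Linearised variance of Z_hat = sum(y_hat) / sum(x_hat).

    Every argument is a list with one entry per model group g.
    Returns (Z_hat, V_hat).
    """
    x_total = sum(x_hat)
    z_hat = sum(y_hat) / x_total
    v_hat = (sum(var_y)
             + z_hat ** 2 * sum(var_x)
             - 2.0 * z_hat * sum(cov_yx)) / x_total ** 2
    return z_hat, v_hat

# Single model group with made-up inputs.
z_hat, v_hat = ratio_variance([50.0], [100.0], [4.0], [9.0], [1.0])
```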

6. CONCLUSION

The synthetic estimators presented in this paper combine results from two independent sources of data. The underlying assumption to use these estimators is that the estimates computed from BPS on some model groups are valid for each domain of interest for the administrative source.

SEPH has undergone a redesign in three phases since 1994 to incorporate administrative data to improve estimates and to reduce respondent burden. The variance formula presented here was developed to take into account the new sample design of SEPH. In this formula, the sampling variance of two independent sources of data is considered: administrative records and the BPS. This formula is simple to use as long as the estimators can be defined as a function of administrative totals and BPS totals.

In the near future, SEPH will be updated to the new industrial classification NAICS. At the same time, all administrative records (census) will be used. An automatic corrector will be put in place to ensure a quick verification and correction of the data. With the census in place, the sampling variance of the administrative data will be null. Therefore the variance due to classical imputation and the one due to regression models will no longer be negligible in relation to the total variance. Furthermore, since the census of administrative data has 5 times more data than the sample, the classical imputation might be more important. Variance due to imputation and variance due to regression models will then be incorporated in the variance computations in the near future.

7. ACKNOWLEDGMENTS

The authors would like to thank Chantal Grondin and Vincent Porte for their help in this project and Eric Rancourt for the revision of this paper.

8. REFERENCES

Bernier, J. and K. Nobrega (1998), "Outlier detection in asymmetric samples: A comparison of an inter-quartile range method and a variation of a sigma gap method", Proceedings of the Survey Methods Section, Statistical Society of Canada, pp. 137-141.

Goodman, L.A. (1960), "On the exact variance of products", Journal of the American Statistical Association, 55, pp. 376-382.

Grondin, C., D. Hurtubise and V. Porte (1999), "Notes méthodologiques de la variance", Internal document, Business Survey Methods Division, Statistics Canada.

Hidiroglou, M. A. (1995), "Sampling and estimation for stage one of the Canadian Survey of employment, payrolls and hours survey redesign", Proceedings of the Survey Methods Section, Statistical Society of Canada, pp. 123-128.

Lim, A. and P. Daoust (1997), "Allocation of the sample for the stage III redesign of the survey of employment, payrolls and hours (SEPH)", Internal document, Business Survey Methods Division, Statistics Canada.

Rancourt, E. and M. A. Hidiroglou (1998), "Use of administrative records in the Canadian Survey of Employment, Payrolls and Hours", Proceedings of the Survey Methods Section, Statistical Society of Canada, pp. 39-47.


DERIVING AND ESTIMATING AN APPROXIMATE VARIANCE FOR THE HORVITZ-THOMPSON ESTIMATOR USING ONLY FIRST ORDER INCLUSION PROBABILITIES

Ken Brewer, Australian National University
Australian Capital Territory 0200, Australia. <Ken.Brewer.anu.edu.au>

Using both purely design-based and model-assisted arguments, it is shown that under mild conditions the variance of the Horvitz-Thompson (HT) estimator depends almost entirely on first order inclusion probabilities. Approximate expressions and estimators are derived for this "natural variance" of the HT estimator. Tentative formulae for the HT variance estimator are also provided for the most important case where the HT variance does not take its natural value, namely when sampling is systematic from a deliberately ordered population. Corresponding estimators of the variance of the Generalized Regression Estimator follow straightforwardly.

Key Words: Finite population correction; Finite population sampling; Model assisted survey sampling; Regression estimation; Systematic sampling.

1. SOME APPROXIMATE FORMULAE FOR THE DESIGN-VARIANCE OF THE HT ESTIMATOR

Let $Y_{\bullet} = \sum_{i=1}^{N} Y_i$ be the total of the item for a finite population of N units. If a sample of n units is drawn without replacement from that population with first order inclusion probabilities $\pi_i$, the Horvitz-Thompson (HT) (1952) estimator of total is $\hat{Y}_{\bullet HT} = \sum_{i \in s} Y_i \pi_i^{-1}$. For the important special case where n is fixed, Sen (1953) and Yates and Grundy (1953) showed independently that $\hat{Y}_{\bullet HT}$ had the variance

$$V(\hat{Y}_{\bullet HT}) = \frac{1}{2} \sum_{i=1}^{N} \sum_{j(\neq i)=1}^{N} (\pi_i \pi_j - \pi_{ij}) (Y_i \pi_i^{-1} - Y_j \pi_j^{-1})^2, \qquad (1)$$

where $\pi_{ij}$ is the second order or joint inclusion probability of the i th and j th population units together in the same sample. They therefore suggested the variance estimator

$$\hat{V}_{SYG}(\hat{Y}_{\bullet HT}) = \frac{1}{2} \sum_{i \in s} \sum_{j(\neq i) \in s} \pi_{ij}^{-1} (\pi_i \pi_j - \pi_{ij}) (Y_i \pi_i^{-1} - Y_j \pi_j^{-1})^2. \qquad (2)$$
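A minimal sketch of the HT estimator and of the Sen-Yates-Grundy variance estimator (2) is given below. The function names are illustrative; the joint inclusion probabilities are passed as a full matrix indexed by sample position, and the example uses simple random sampling without replacement (srswor) because its joint probabilities are known in closed form.

```python
def ht_total(y, pi):
    """Horvitz-Thompson estimator: sum of y_i / pi_i over the sample."""
    return sum(yi / p for yi, p in zip(y, pi))

def syg_variance_estimate(y, pi, pij):
    """Sen-Yates-Grundy estimator (2) for a fixed-size sample.

    y, pi: sample values and first order inclusion probabilities.
    pij: matrix of joint inclusion probabilities for sampled pairs
         (the diagonal is never used).
    """
    n = len(y)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                diff = y[i] / pi[i] - y[j] / pi[j]
                total += (pi[i] * pi[j] - pij[i][j]) / pij[i][j] * diff ** 2
    return 0.5 * total

# srswor with N = 5, n = 3: pi_i = n/N and pi_ij = n(n-1) / (N(N-1)).
y = [1.0, 2.0, 3.0]
pi = [0.6, 0.6, 0.6]
pij = [[0.3] * 3 for _ in range(3)]
estimate = ht_total(y, pi)
variance = syg_variance_estimate(y, pi, pij)
```

Under srswor the result coincides with the textbook estimator $N^2 (1 - n/N) s^2 / n$, which is a convenient check on any implementation.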

This is known to perform better than the variance estimator suggested by Horvitz and Thompson (1952) (the latter, however, usually being unbiased for random n) but the total dependence of $\hat{V}_{SYG}(\hat{Y}_{\bullet HT})$ on $\pi_{ij}$ has proved problematical (Brewer 1999). An alternative formulation to (1) is, however,

$$V(\hat{Y}_{\bullet HT}) = \sum_{i=1}^{N} \pi_i (Y_i \pi_i^{-1} - Y_{\bullet} n^{-1})^2 - \sum_{i=1}^{N} \pi_i^2 (Y_i \pi_i^{-1} - Y_{\bullet} n^{-1})^2$$
$$\qquad + \sum_{i=1}^{N} \sum_{j(\neq i)=1}^{N} (\pi_{ij} - \pi_i \pi_j) (Y_i \pi_i^{-1} - Y_{\bullet} n^{-1}) (Y_j \pi_j^{-1} - Y_{\bullet} n^{-1}). \qquad (3)$$

The first term in (3) is the same as the variance of the corresponding Hansen-Hurwitz (1943) estimator for sampling at n draws with replacement, the probability of selection for unit i at each draw being $\pi_i n^{-1}$. The second term looks very much like a finite population correction, so these two terms together plausibly constitute a first approximation to the entire variance of the HT estimator. For a more formal consideration of (3) the following necessary conditions on the first and second order inclusion probabilities (which also follow from the assumption that the sample size is fixed) will be useful:

$\sum_{j(\neq i)=1}^{N} \pi_{ij} = (n-1) \pi_i$ (a), $\sum_{i=1}^{N} \sum_{j(\neq i)=1}^{N} \pi_{ij} = n(n-1)$ (b), $\sum_{j(\neq i)=1}^{N} \pi_i \pi_j = \pi_i (n - \pi_i)$ (c) and

$\sum_{i=1}^{N} \sum_{j(\neq i)=1}^{N} \pi_i \pi_j = n^2 - \sum_{i=1}^{N} \pi_i^2$ (d). (4a-d)

If, in addition, we approximate $\pi_{ij}$ by $\pi_{ij} \approx \pi_i \pi_j (c_i + c_j)/2$ we obtain

$$\sum_{i=1}^{N} \sum_{j(\neq i)=1}^{N} (\pi_{ij} - \pi_i \pi_j) (Y_i \pi_i^{-1} - Y_{\bullet} n^{-1}) (Y_j \pi_j^{-1} - Y_{\bullet} n^{-1})$$
$$\qquad \approx \sum_{i=1}^{N} \sum_{j(\neq i)=1}^{N} \pi_i \pi_j \{ (c_i + c_j - 2)/2 \} (Y_i \pi_i^{-1} - Y_{\bullet} n^{-1}) (Y_j \pi_j^{-1} - Y_{\bullet} n^{-1})$$
$$\qquad = \sum_{i=1}^{N} (1 - c_i) \pi_i^2 (Y_i \pi_i^{-1} - Y_{\bullet} n^{-1})^2, \qquad (5)$$

and $V(\hat{Y}_{\bullet HT})$ can then be approximated by

$$\tilde{V}(\hat{Y}_{\bullet HT}) = \sum_{i=1}^{N} \pi_i (1 - c_i \pi_i) (Y_i \pi_i^{-1} - Y_{\bullet} n^{-1})^2. \qquad (6)$$

Two convenient choices for $c_i$, suggested by the ratios of sums of $\pi_{ij}$ to the corresponding sums of $\pi_i\pi_j$, are

$$c_i = (n-1)/(n - \pi_i), \qquad (7)$$

prompted by a comparison of (4a) with (4c), and

$$c_i = c = (n-1)\Big/\Big(n - n^{-1}\sum_{k=1}^{N}\pi_k^2\Big), \qquad (8)$$

prompted by a comparison of (4b) with (4d). A third is

$$c_i = (n-1)\Big/\Big(n - 2\pi_i + n^{-1}\sum_{k=1}^{N}\pi_k^2\Big). \qquad (9)$$

Expression (9) is a modification of (7) and (8) suggested by the asymptotic expressions for $\pi_{ij}$ obtained by Hartley and Rao (1962) and by Asok and Sukhatme (1976) for random systematic selection (Goodman and Kish 1950) and for Sampford's (1967) procedure respectively. For all three choices of $c_i$, (6) is without error under srswor.
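The claim that (6) is exact under srswor for each choice of $c_i$ can be checked numerically. The following is a minimal sketch only: the function names and the toy population are illustrative assumptions, not part of the paper.

```python
def c_values(pi, n, choice):
    """Return the c_i implied by equation (7), (8) or (9) of the text."""
    sum_pi_sq = sum(p * p for p in pi)
    if choice == 7:
        return [(n - 1) / (n - p) for p in pi]
    if choice == 8:
        c = (n - 1) / (n - sum_pi_sq / n)
        return [c for _ in pi]
    if choice == 9:
        return [(n - 1) / (n - 2 * p + sum_pi_sq / n) for p in pi]
    raise ValueError("choice must be 7, 8 or 9")


def approx_ht_variance(y, pi, n, choice):
    """Approximate HT variance, equation (6), from population values."""
    total = sum(y)
    c = c_values(pi, n, choice)
    return sum(p * (1 - ci * p) * (yi / p - total / n) ** 2
               for yi, p, ci in zip(y, pi, c))


def srswor_variance(y, n):
    """Textbook variance of the expansion estimator under srswor."""
    N = len(y)
    mean = sum(y) / N
    s2 = sum((yi - mean) ** 2 for yi in y) / (N - 1)
    return N * N * (1 - n / N) * s2 / n


# Hypothetical toy population of N = 6 units, sample size n = 3.
y = [3.0, 7.0, 1.0, 9.0, 4.0, 6.0]
n = 3
pi = [n / len(y)] * len(y)   # equal inclusion probabilities (srswor)
```

With equal inclusion probabilities all three definitions give the same $c_i$, so `approx_ht_variance(y, pi, n, choice)` should agree with `srswor_variance(y, n)` for `choice` in 7, 8 and 9.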

2. A MODEL ASSISTED CHECK ON THE USEFULNESS OF THE APPROXIMATE VARIANCE FORMULAE

Consider the following ratio model as a possible description of the population.

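$$\xi:\quad Y_i = \beta\pi_i + \varepsilon_i; \quad E_\xi\varepsilon_i = 0; \quad E_\xi\varepsilon_i^2 = \sigma_i^2; \quad E_\xi\varepsilon_i\varepsilon_j = 0, \ \forall j \neq i. \qquad (10)$$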

The prediction or model expectation of the approximate variance expression (6) under ξ is

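$$E_\xi \sum_{i=1}^{N} \pi_i(1 - c_i\pi_i)(Y_i\pi_i^{-1} - Y_\bullet n^{-1})^2$$

$$= E_\xi \sum_{i=1}^{N} \pi_i(1 - c_i\pi_i)\Big(\varepsilon_i\pi_i^{-1} - n^{-1}\sum_{j=1}^{N}\varepsilon_j\Big)^2$$

$$= \sum_{i=1}^{N} \sigma_i^2\pi_i^{-1}(1 - n^{-1}\pi_i) - \sum_{i=1}^{N} c_i\sigma_i^2(1 - 2n^{-1}\pi_i) - n^{-2}\sum_{i=1}^{N} c_i\pi_i^2 \sum_{j=1}^{N}\sigma_j^2. \qquad (11)$$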

If this expression is to be equated to the corresponding expectation for $V(\hat{Y}_{\bullet HT})$ itself, namely $\sum_{i=1}^{N}\sigma_i^2(\pi_i^{-1} - 1)$ (Godambe 1955), a closed approximation is possible by imposing the condition that $c_i = c$ for all $i$, in which case the requirement on $c$ is

$$c = (1 - n^{-1})\sum_{i=1}^{N}\sigma_i^2 \Big(\sum_{i=1}^{N}\sigma_i^2 - 2n^{-1}\sum_{i=1}^{N}\sigma_i^2\pi_i + n^{-2}\sum_{i=1}^{N}\pi_i^2\sum_{j=1}^{N}\sigma_j^2\Big)^{-1}. \qquad (12)$$

Under srswor, (12) becomes $c = N(n-1)/\{n(N-1)\}$, which yields the exact expression for the srswor variance.

Even without srswor, $\sigma_i^2 = \sigma^2\pi_i$ in (12) returns expression (8) for $c$. It is remarkable that the purely design-based analysis and the model-assisted analysis, which make such disparate assumptions, converge so nearly on the same expression for the "natural" or "high-entropy" variance of the HT estimator.


3. ESTIMATING THE NATURAL DESIGN-VARIANCE OF THE HT ESTIMATOR

A plausible sample estimator of the approximate variance of the HT estimator shown in (6) would be

$$\tilde{V}(\hat{Y}_{\bullet HT}) = \sum_{i\in s} (c_i^{-1} - \pi_i)(Y_i\pi_i^{-1} - \hat{Y}_{\bullet HT}n^{-1})^2. \qquad (13)$$

This estimator is also exactly unbiased for the HT variance proper when sampling is srswor. To test its properties when sampling is not srswor, we consider its expectation under the ratio model $\xi$ of (10):

$$E_\xi\tilde{V}(\hat{Y}_{\bullet HT}) = \sum_{i\in s} (c_i^{-1} - \pi_i)\,E_\xi(Y_i\pi_i^{-1} - \hat{Y}_{\bullet HT}n^{-1})^2. \qquad (14)$$

The most desirable expression is $\sum_{i\in s}\sigma_i^2\pi_i^{-1}(\pi_i^{-1} - 1)$, which has design expectation $\sum_{i=1}^{N}\sigma_i^2(\pi_i^{-1} - 1)$. All three suggested definitions of $c_i$ yield this expression as the leading term for that expectation. The "unwanted" terms left over are in each case $O(Nn^{-1})$, as compared with $O(N^2n^{-1})$ for the HT variance, and tend to cancel. Under srswor they cancel out exactly, as already noted.

When $c_i$ is defined by (7), which is the simplest of the three, the unwanted terms are

$$(n-1)^{-1}n^{-1}\sum_{i\in s}\sigma_i^2\pi_i^{-1}\Big(n - \pi_i^{-1}\sum_{j\in s}\pi_j\Big).$$

When it is defined by (8) they are somewhat more complicated at

$$2n^{-1}\sum_{i\in s}\sigma_i^2\pi_i^{-1} - n^{-2}\sum_{i\in s}\sigma_i^2\pi_i^{-2}\Big(\sum_{j\in s}\pi_j + \sum_{k=1}^{N}\pi_k^2\Big).$$

Since, however, (8) defines $c_i$ independently of $i$, (14) with (8) might be considered as providing the most convenient estimator for general purposes. When $c_i$ is defined by (9) the unwanted terms are more complicated still but appear in sum to be smaller than for either (7) or (8). So each has an advantage; (7) is the simplest, (8) is the most convenient and (9) is almost certainly the most accurate.

The corresponding purely design-based analysis of (14) is less straightforward but leads to very similar conclusions. The "unwanted" terms are of the same order of magnitude as those above, and cancel out exactly under srswor.
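As an illustration of the sample estimator (13), the sketch below evaluates it from sample values only, using the $c_i$ of (7), which needs no population-level sums. The function names and the three-unit sample are hypothetical and serve only to make the formula concrete.

```python
def c_from_eq7(pi_s):
    """c_i = (n - 1) / (n - pi_i), equation (7), for the sampled units."""
    n = len(pi_s)
    return [(n - 1) / (n - p) for p in pi_s]


def ht_variance_estimate(y_s, pi_s, c_s):
    """Equation (13): sum over the sample of (1/c_i - pi_i)(Y_i/pi_i - Yhat_HT/n)^2."""
    n = len(y_s)
    y_ht = sum(yi / p for yi, p in zip(y_s, pi_s))   # HT estimate of the total
    return sum((1 / ci - p) * (yi / p - y_ht / n) ** 2
               for yi, p, ci in zip(y_s, pi_s, c_s))


# Hypothetical srswor sample of n = 3 from N = 6, so every pi_i = 0.5.
y_s = [3.0, 9.0, 6.0]
pi_s = [0.5, 0.5, 0.5]
estimate = ht_variance_estimate(y_s, pi_s, c_from_eq7(pi_s))
```

For this sample the usual srswor estimator $N^2(1 - n/N)s^2/n$ equals 54, and the sketch should reproduce that value, in line with the exact unbiasedness under srswor noted above. With (8) or (9) the same function applies, but the $c_i$ then require $\sum_{k=1}^{N}\pi_k^2$ over the whole population.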

4. ESTIMATING THE DESIGN-VARIANCE OF THE HT ESTIMATOR WHEN USING DELIBERATELY ORDERED SYSTEMATIC SAMPLING

Deliberately ordered systematic sampling (DS) is clearly not a procedure for which (14) with (8) could be expected to provide a convenient and adequate estimator of the HT variance. An estimator that used only the differences between consecutive observations would have a better chance of being useful.

Consider the situation where an even number of units, $n = 2m$, have been selected using ordered systematic sampling. We divide the population into $m$ notional strata. These strata are of equal "size," not in terms of numbers of units but in terms of the measures of size that are used to determine inclusion probabilities. Consequently these notional strata are not in general composed of whole units.

If a certain unit straddles two strata (say "Stratum A" and "Stratum B") and is selected because it contains a selection point for Stratum A, the whole of it is regarded as one of the two units representing notional Stratum A, regardless of the fact that part of the unit is actually in notional Stratum B. We will estimate the $m$ stratum totals and add them to obtain an estimate of the entire population. We will also form estimators of the $m$ stratum variances and add them to estimate the variance of the estimate of the population total. Then we will modify that variance estimator so as to make use also of the $m - 1$ differences between neighboring sample units in different but contiguous strata, and finally produce a suitable formula to cover the case where $n$ is odd.

First then, we consider the case where there is only a single stratum and the variance is estimated using (14) with the $c_i$ of (7) above. This yields

$$\tilde{V}(\hat{Y}_{\bullet HT}) = (n-1)^{-1}\sum_{i\in s} (n - \pi_i - n\pi_i + \pi_i)(Y_i\pi_i^{-1} - \hat{Y}_{\bullet HT}n^{-1})^2$$

$$= n(n-1)^{-1}\sum_{i\in s} (1 - \pi_i)(Y_i\pi_i^{-1} - \hat{Y}_{\bullet HT}n^{-1})^2. \qquad (15)$$

The expectation of this under ξ is

$$E_\xi\tilde{V}(\hat{Y}_{\bullet HT}) = \sum_{i\in s}\sigma_i^2\pi_i^{-1}(\pi_i^{-1} - 1) + (n-1)^{-1}n^{-1}\sum_{i\in s}\sigma_i^2\pi_i^{-1}\Big(n - \pi_i^{-1}\sum_{j\in s}\pi_j\Big). \qquad (16)$$

The first term in (16) has the desired design expectation, $\sum_{i=1}^{N}\sigma_i^2(\pi_i^{-1} - 1)$. As already noted, the remaining terms cancel out under srswor. They could, however, still be important for small $n$. We therefore proceed rather tentatively with the proposed estimator, (14) with (7), for the case $n = 2$. We then have

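$$\tilde{V}_{DS}(\hat{Y}_{\bullet HT}) = (1/2)\{(1 - \pi_1)(Y_1\pi_1^{-1} - Y_2\pi_2^{-1})^2 + (1 - \pi_2)(Y_2\pi_2^{-1} - Y_1\pi_1^{-1})^2\} \qquad (17)$$

$$= \{(2 - \pi_1 - \pi_2)/2\}(Y_1\pi_1^{-1} - Y_2\pi_2^{-1})^2, \qquad (18)$$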

where the subscripts 1 and 2 refer to the first and second units included in the sample.

The second factor in (18) is the design-unbiased estimator of variance for srswr, and the first is recognizable as the value that would be taken by the multiplicative finite population correction factor under srswor. It therefore carries some intuitive appeal and its empirical performance is worth investigating (Brewer and Gregoire 2000). It is also helpful, in the context we are considering, that it depends only on the $Y_i$ and $\pi_i$ values of the sample units and not on the $\pi_i$ values for the other units in the population, since in our context those units are in many instances purely notional.

Adopting (18) as our tentative variance estimator for a single stratum, the estimator of variance for the m strata is

$$\tilde{V}_{DS}(\hat{Y}_{\bullet HT}) = \sum_{h=1}^{m} \{(2 - \pi_{h1} - \pi_{h2})/2\}(Y_{h1}\pi_{h1}^{-1} - Y_{h2}\pi_{h2}^{-1})^2. \qquad (19)$$

This is the appropriate formula if there are $m$ pairs of units providing squared differences. If we now abolish the notional strata and consider all $n - 1$ squared differences from pairs of neighboring sampling units, our sum will have $(n-1)/m$ times as many terms, so to keep to the same expectation we will need to use

$$\tilde{V}_{DS}(\hat{Y}_{\bullet HT}) = \{m/(n-1)\}\sum_{k=1}^{n-1} \{(2 - \pi_k - \pi_{k+1})/2\}(Y_k\pi_k^{-1} - Y_{k+1}\pi_{k+1}^{-1})^2. \qquad (20)$$

Finally if there is an odd number of sample units selected (and we are assuming here that the selection is carried out in such a fashion that the number of sample units is predetermined) the value $m$ must be replaced by the value it takes when $n$ is even. This is $n/2$, so our tentative estimator in this context is

$$\tilde{V}_{DS}(\hat{Y}_{\bullet HT}) = \{n/2(n-1)\}\sum_{k=1}^{n-1} \{(2 - \pi_k - \pi_{k+1})/2\}(Y_k\pi_k^{-1} - Y_{k+1}\pi_{k+1}^{-1})^2, \qquad (21)$$

regardless of whether $n$ is odd or even.
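The successive-difference form (21) is straightforward to compute once the sample units are held in their order of selection. The sketch below is illustrative only; the function name and the example values are assumptions made for the purpose of showing the arithmetic.

```python
def ds_variance_estimate(y_s, pi_s):
    """Equation (21): variance estimate from differences between neighbouring units.

    y_s and pi_s must list the sampled units in the order in which the
    ordered systematic selection encountered them.
    """
    n = len(y_s)
    if n <= 1:
        raise ValueError("at least two sample units are needed")
    z = [yi / p for yi, p in zip(y_s, pi_s)]          # Y_k / pi_k
    total = 0.0
    for k in range(n - 1):
        fpc = (2 - pi_s[k] - pi_s[k + 1]) / 2         # first factor of (18)
        total += fpc * (z[k] - z[k + 1]) ** 2
    return n / (2 * (n - 1)) * total


# Hypothetical samples: n = 2 reduces to (18), n = 3 exercises the odd case.
two_units = ds_variance_estimate([10.0, 30.0], [0.2, 0.5])
three_units = ds_variance_estimate([10.0, 30.0, 12.0], [0.2, 0.5, 0.3])
```

For $n = 2$ the leading factor is 1 and the result is exactly (18). Because only adjacent pairs enter the sum, reordering the sample changes the estimate, which is the intended behaviour for a deliberately ordered design.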

5. A MODEL-BASED VARIANCE ESTIMATION PROCEDURE FOR USE WITH DELIBERATELY ORDERED SYSTEMATIC SAMPLING

A problem with equations (18)-(21) is that the expectation of (18) under ξ is

$$E_\xi\tilde{V}_{DS}(\hat{Y}_{\bullet HT}) = \{(2 - \pi_1 - \pi_2)/2\}(\sigma_1^2\pi_1^{-2} + \sigma_2^2\pi_2^{-2}), \qquad (22)$$

where ideally it should be

$$E_\xi\tilde{V}_{DS}(\hat{Y}_{\bullet HT}) = (1 - \pi_1)\sigma_1^2\pi_1^{-2} + (1 - \pi_2)\sigma_2^2\pi_2^{-2}. \qquad (23)$$

Design-based inference is powerless to help in this regard because the only relevant statistic available is $(Y_1\pi_1^{-1} - Y_2\pi_2^{-1})^2$. If, however, we are prepared to supplement the $\xi$ of (10) with $\sigma_i^2 = \sigma^2\pi_i^{2\gamma}$ where $\gamma$ is known, say 0.5, 0.75 or unity, then it is possible to estimate $\sigma^2$ by $\hat{\sigma}^2 = (Y_1\pi_1^{-1} - Y_2\pi_2^{-1})^2(\pi_1^{2\gamma-2} + \pi_2^{2\gamma-2})^{-1}$. Caution is of course necessary in extending this kind of estimation over a number of notional strata. The use of ordered systematic sampling implies that the $\xi$ of (10) is itself likely to break down if extended over much more than a single notional stratum, but a little experimentation may well result in the discovery of a suitable variance estimator.

6. ADAPTATION TO ESTIMATING THE VARIANCE OF A GREG ESTIMATOR

It is straightforward to adapt the approach used in this paper to the problem of approximating and estimating the variance of a Generalized Regression or GREG estimator (Cassel, Särndal and Wretman 1976). The GREG estimator is asymptotically equivalent to a difference estimator in which the principal term is free from sampling error and the only other term is similar in structure to the HT estimator of total. The difference $Y_i\pi_i^{-1} - Y_\bullet n^{-1}$ in the variance expression, however, is replaced by $(Y_i - \mathbf{X}_i^T\boldsymbol{\beta})\pi_i^{-1}$, $\mathbf{X}_i^T$ being the row $p$-vector of the supplementary variable values for unit $i$, and $\boldsymbol{\beta}$ being the column $p$-vector regression coefficient. The design-variance of the GREG estimator of total might therefore be approximated by a formula analogous to (3) and estimated by

$$\tilde{V}(\hat{Y}_{\bullet GREG}) = \{n/(n-p)\}\sum_{i\in s} (c_i^{-1} - \pi_i)(Y_i - \hat{Y}_i)^2\pi_i^{-2}, \qquad (24)$$

where $\hat{Y}_i = \mathbf{X}_i^T\hat{\boldsymbol{\beta}}$ and $n/(n-p)$ is the bias correction factor usually regarded as appropriate when $p$ regressors are used in estimating $\boldsymbol{\beta}$. Here as before, $c_i$ could be specified using (7), (8) or (9).

The estimator (24) could also be used for the anticipated variance (Isaki and Fuller 1982). For the prediction-variance, however, the factor $(c_i^{-1} - \pi_i)\pi_i^{-2}$ would need to be replaced by the near equivalent $w_i(w_i - 1)$, where $w_i$ is the sample weight or "case weight" (Brewer 1999).

When selection is systematic from a deliberately ordered population, the corresponding estimator of design-variance would be

$$\tilde{V}_{DS}(\hat{Y}_{\bullet GREG}) = \{n/(n-p)\}\{n/2(n-1)\}\sum_{k=1}^{n-1} \{(2 - \pi_k - \pi_{k+1})/2\}\{(Y_k - \hat{Y}_k)\pi_k^{-1} - (Y_{k+1} - \hat{Y}_{k+1})\pi_{k+1}^{-1}\}^2. \qquad (25)$$

An empirical study of the appropriateness and accuracy of the estimators derived in this paper can be found in Brewer and Gregoire (2000). It appears that the factor $n/(n-p)$ in (24) and (25) is in fact insufficient to compensate for the loss of "$p$ degrees of freedom" in this context, owing to the use of unequal inclusion probabilities. A more appropriate correction factor has been derived using a combination of randomization-based inference and prediction-based inference under a generalized version of the regression model $\xi$, but the analysis required was not altogether straightforward and the correction factor that resulted is somewhat complex in structure. It was not used for Brewer and Gregoire (2000).

7. ACKNOWLEDGEMENTS

Dr P. S. Kott made valuable suggestions while this paper was in preparation, including notably equation (8) for $c$. Dr T.G. Gregoire's empirical work in connection with Brewer and Gregoire (2000) led to the discovery that the use of the factor $n/(n-p)$ in equations (24) and (25) was causing the GREG variance to be substantially underestimated.


8. REFERENCES

Asok, C. and B.V. Sukhatme (1976), "On Sampford's Procedure of Unequal Probability Sampling Without Replacement," Journal of the American Statistical Association, 71, 912-918.

Brewer, K.R.W. (1999), "Cosmetic Calibration for Unequal Probability Samples," Survey Methodology, 25, 205-212.

Brewer, K.R.W. and T.G. Gregoire (2000), "Estimators for Use with Poisson Sampling and Related Selection Procedures," Invited paper, Second International Conference on Establishment Surveys (ICES II), Buffalo, N.Y., June 17-21.

Cassel, C-M., C-E. Särndal, and J.H. Wretman (1976), "Some Results on Generalized Difference Estimation and Generalized Regression Estimation for Finite Populations," Biometrika, 63, 615-620.

Godambe, V.P. (1955), "A Unified Theory of Sampling from Finite Populations," Journal of the Royal Statistical Society, Series B, 17, 269-278.

Goodman, R. and L. Kish (1950), "Controlled Selection - A Technique in Probability Sampling," Journal of the American Statistical Association, 45, 350-372.

Hansen, M.H. and W.N. Hurwitz (1943), "On the Theory of Sampling from Finite Populations," Annals of Mathematical Statistics, 14, 333-362.

Hartley, H.O. and J.N.K. Rao (1962), "Sampling with Unequal Probabilities and Without Replacement," Annals of Mathematical Statistics, 33, 350-374.

Horvitz, D.G. and D.J. Thompson (1952), "A Generalization of Sampling Without Replacement from a Finite Universe," Journal of the American Statistical Association, 47, 663-685.

Isaki, C.T. and W.A. Fuller (1982), "Survey Design under the Regression Superpopulation Model," Journal of the American Statistical Association, 77, 89-96.

Sampford, M.R. (1967), "On Sampling Without Replacement with Unequal Probabilities of Selection," Biometrika, 54, 499-513.

Särndal, C.-E., B. Swensson, and J.H. Wretman (1992), Model Assisted Survey Sampling, New York: Springer-Verlag.

Sen, A.R. (1953), "On the Estimate of the Variance In Sampling with Varying Probabilities," Journal of the Indian Society of Agricultural Statistics, 5, 119-127.

Yates, F. and P.M. Grundy (1953), "Selection Without Replacement From Within Strata With Probability Proportional To Size," Journal of the Royal Statistical Society, Series B, 15, 235-261.


A COMPARISON OF JACKKNIFE AND BOOTSTRAP METHODS FOR VARIANCE ESTIMATION IN THE PRESENCE OF IMPUTATION IN ONS BUSINESS SURVEYS

Susan Full, Office for National Statistics, UK
Room D140, Government Buildings, Cardiff Road, NEWPORT, NP10 8XG, UK. [email protected]

Deterministic single imputation is used for unit non-response in the ONS business surveys. Some strata are completely enumerated and other strata have non-negligible sampling fractions. The imputed values are treated as actual values in variance estimation, which can lead to an underestimation of the sampling variance and ignores any variance due to imputation. For these surveys the variance estimation method needs to be suitable for cases where there are several imputation methods and the sampling fractions are non-negligible. This study compares the performance of jackknife and bootstrap methods.

Key Words: non-negligible sampling fractions, several imputation methods.

1. INTRODUCTION

In ONS business surveys imputed values are generally treated as if they were actual responses in both point and variance estimation. The variance is estimated using standard formulae which do not take into account any variance introduced due to imputation. Also, for many imputation methods the sampling variance will be underestimated and the confidence intervals invalid. Kovar and Whitridge (1995) suggested that the extent of this underestimation can be of the order of 2 to 10 per cent in the case of 5 per cent non-response and as high as 10 to 50 per cent where there is 30 per cent non-response.

Business surveys in ONS generally use more than one imputation method and have stratified designs in which some strata have high sampling fractions and the strata covering the largest businesses are completely enumerated. For the completely enumerated strata, there will be no sampling variance, but unless there is full response there will be variance due to imputation. The main objective of this study has been to identify and evaluate variance estimation methods, suitable for use in ONS business surveys, that take into account the variance due to imputation. The methods have been evaluated via a simulation exercise and also by application to actual survey data to indicate how the methods might perform in practice. The Quarterly Inquiry of Distribution and Services (QIDSS) has been used as a typical ONS business survey. This survey collects both turnover and employment data. Only turnover data have been used in the study and total turnover is the survey variable of interest. A previous study (Full 1999) used jackknife variance estimation methods and this study extends the work to a bootstrap method.

2. IMPUTATION METHODS IN QIDSS

In the QIDSS survey imputation is used for all non-responders. Two main methods of imputation are used: one is an autoregressive method that uses previous returned values to impute for non-response, and the second uses ratio imputation with register information.

2.1 Autoregressive (period-on-period growth) imputation

The QIDSS survey is conducted using a sample rotation scheme: once selected, a unit will be in the sample for five consecutive periods. For surveys where units are in the sample for a number of consecutive periods, one of the best predictors for a non-responder is the value that was returned for the previous period. Therefore, the main imputation method used in this survey is a simple autoregressive imputation using a weighted respondent average of period-on-period growth to determine the autoregression parameter, called an imputation link. Some robustness is introduced by using two years' link estimates from the same point of the seasonal cycle, and weighting these together to produce the final imputation link.


The links are calculated in the following way. For each contributor that responded in the current and previous period the ratio $y_t/y_{t-1}$ is calculated. A trimmed mean of these ratios is then calculated as an estimate of the current period-on-period growth, $b_1$. (The default trimming is 20 per cent of the largest and 10 per cent of the smallest ratios.) The growth for the same period a year previous, $b_2$, is also estimated in the same way.

$$b_1 = \frac{1}{m_1}\sum \frac{y_t}{y_{t-1}} \qquad (1)$$

$$b_2 = \frac{1}{m_2}\sum \frac{y_{t-12}}{y_{(t-12)-1}} \qquad (2)$$

where $m_1$, $m_2$ are the numbers of matched pairs after trimming.

These two imputation links are then weighted together. (The default weighting value is 0.8 for the current link and 0.2 for the previous link.)

$$b = wb_1 + (1 - w)b_2 \qquad (3)$$

The imputed values, $\hat{y}_{k,t}$, are then calculated as

$$\hat{y}_{k,t} = b\,y_{k,t-1} \qquad (4)$$

where $y_{k,t-1}$ is either the observed or imputed value for the previous period.
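Equations (1) to (4) can be sketched in a few lines. This is an illustration under stated assumptions rather than the ONS implementation: the function names are hypothetical, and the rule for turning a trimming percentage into a whole number of ratios (rounding down) is an assumption, since the text does not specify it.

```python
def trimmed_mean(ratios, trim_low=0.10, trim_high=0.20):
    """Mean of the ratios after dropping the smallest and largest shares.

    Assumption: the number dropped at each end is the trimming share times
    the number of ratios, rounded down.
    """
    ordered = sorted(ratios)
    m = len(ordered)
    drop_low = int(trim_low * m + 1e-9)
    drop_high = int(trim_high * m + 1e-9)
    kept = ordered[drop_low:m - drop_high]
    if not kept:
        raise ValueError("trimming removed every ratio")
    return sum(kept) / len(kept)


def weighted_link(b1, b2, w=0.8):
    """Equation (3): combine the current link b1 and the year-ago link b2."""
    return w * b1 + (1 - w) * b2


def impute_from_previous(previous_value, link):
    """Equation (4): imputed value = link times the previous period's value."""
    return link * previous_value


# Hypothetical growth ratios y_t / y_(t-1) for ten matched responders.
ratios = [0.5, 0.9, 1.0, 1.0, 1.1, 1.1, 1.2, 1.2, 2.0, 5.0]
b1 = trimmed_mean(ratios)            # drops 0.5 and the two largest ratios
b = weighted_link(1.10, 1.00)        # illustrative links b1 = 1.10, b2 = 1.00
imputed = impute_from_previous(200.0, b)
```

The asymmetric default (more trimming at the top than at the bottom) means that a few very large growth ratios have less influence on the link than a few very small ones.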

2.2 Ratio imputation using register data

If the non-responder has been selected for the survey for the first time there will be no value for the previous period. For this case ratio imputation is used with the register turnover value as the auxiliary variable. The ratio $y_t/x_t$ is calculated for each responder in the current period. The trimmed mean of the ratios is calculated and in this case the default trimming is symmetric at 10 per cent.

$$b_3 = \frac{1}{m_3}\sum \frac{y_t}{x_t} \qquad (5)$$

where $m_3$ is the number of responders after trimming. The imputed values, $\hat{y}_{k,t}$, are then calculated as

$$\hat{y}_{k,t} = b_3 x_k \qquad (6)$$

2.3 Other imputation methods

In some instances an imputed value will be a user provided value. There is also a back imputation facility to impute values for previous periods. For the purpose of this study, user provided imputations and back imputation have been ignored.

3. VARIANCE ESTIMATION METHODS

A number of different approaches have been proposed to estimate variance due to imputation: two-phase sampling (Rao and Sitter (1995)), model assisted (Lee et al (1994)), jackknife estimation (Rao and Shao (1992)) and bootstrap resampling (Shao and Sitter (1996)). Multiple imputation is another alternative, but single imputation has traditionally been used in business surveys and is operationally convenient. For ONS surveys the method needs to be applicable in the case of more than one imputation method with non-negligible sampling fractions. From reviewing the published research, jackknife and bootstrap methods seem to be suitable. Shao and Steel (1999) have proposed other methods suitable for non-negligible sampling fractions and more than one imputation method.


3.1 Jackknife estimation method for finite populations

The standard jackknife estimator does not capture the increase in variance due to imputation. A number of studies have shown that re-imputing for non-responders in the jackknife samples, when the deleted unit is a responder, will capture the inflated variance. The adjusted jackknife estimator is then given by

$$\hat{V}_{JACK} = \frac{n-1}{n}\sum_{j\in s}\Big(\hat{Y}^{(a)}_{\bullet s(j)} - \hat{Y}^{(a)}_{\bullet s}\Big)^2 \qquad (7)$$

where $\hat{Y}^{(a)}_{\bullet s(j)}$ is the estimate with unit $j$ deleted and reimputed values and $\hat{Y}^{(a)}_{\bullet s}$ is the estimate with all units.

The other issue is how to incorporate the finite population correction (fpc). The usual method is to premultiply the jackknife estimator by the fpc. Lee et al (1995) argued that the fpc need only be applied to the sampling component of the variance. They proposed using a correction factor so that the jackknife estimator becomes

$$\hat{V}^{*}_{JACK} = \hat{V}_{JACK} - N\hat{S}^2_{yr} \qquad (8)$$

where $\hat{S}^2_{yr}$ is the estimate of the population variance based on the responders only.

A previous study by Full (1999) found that this fpc corrected jackknife led to negative variance estimates in some cases when applied to the QIDSS data.
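The re-imputation step in the adjusted jackknife (7) can be sketched as follows. This is a deliberately simplified illustration: it assumes a single stratum, an expansion estimator of the total, and respondent-mean imputation standing in for the survey's link-based methods. The function names are hypothetical and no fpc adjustment of the kind in (8) is applied.

```python
def mean_impute(respondent_values):
    """Stand-in imputation rule (an assumption): the respondent mean."""
    return sum(respondent_values) / len(respondent_values)


def completed_total(values, N, fill):
    """Expansion estimate N * mean after replacing None by the imputed value."""
    completed = [v if v is not None else fill for v in values]
    return N * sum(completed) / len(completed)


def adjusted_jackknife(values, N, impute=mean_impute):
    """Equation (7) with re-imputation whenever the deleted unit is a responder.

    `values` holds one entry per sampled unit, with None for non-responders.
    """
    n = len(values)
    fill_full = impute([v for v in values if v is not None])
    full = completed_total(values, N, fill_full)
    sum_sq = 0.0
    for j in range(n):
        rest = values[:j] + values[j + 1:]
        if values[j] is not None:
            # Responder deleted: recompute the imputed value without unit j.
            fill = impute([v for v in rest if v is not None])
        else:
            # Non-responder deleted: imputed values are left unchanged.
            fill = fill_full
        sum_sq += (completed_total(rest, N, fill) - full) ** 2
    return (n - 1) / n * sum_sq


# Hypothetical sample of n = 6 with two non-responders, from N = 60.
v_jack = adjusted_jackknife([4.0, None, 8.0, 6.0, None, 10.0], N=60)
```

With full response the function reduces to the ordinary delete-one jackknife, which for an expansion estimator equals $N^2s^2/n$. With non-response the deletion of each responder shifts every imputed value, which is how the extra variability enters the estimate.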

3.2 Bootstrap methods for finite populations

Non-parametric bootstrap methods for without-replacement sampling have been studied by a number of researchers (Sitter (1992), Shao and Sitter (1996), Cabeça (1997), Davison and Hinkley (1997)). An outline of the various methods is given below, where $n$ is the original sample size, $f$ is the sampling fraction and $N$ is the population size:

1. Modified sample size – take with-replacement resamples of size $n' = (n-1)/(1-f)$ or without-replacement resamples of size $n' = fn$. If $f \ll 1$, the resample is much smaller than $n$, and the resampled statistics may be much less stable than those based on samples of size $n$.

2. Mirror-match – this procedure attempts to match the original sample size. A without-replacement resample of size $m = nf$ is taken. The resampled units are replaced and the resampling procedure is repeated $k = n/m$ times, thus ignoring integer restrictions, creating an overall resample of size $n$.

3. Population – the sample is replicated $N/n$ times to create a pseudo-population of size $N$. A without-replacement resample of size $n$ is then taken from the pseudo-population.

4. Super-population – a pseudo-population is created by taking a with-replacement sample of size $N$ from the sample. A without-replacement resample of size $n$ is then taken from the pseudo-population.

5. Rescaling – a with-replacement resample is taken and then each resampled unit is rescaled to capture the without-replacement element.

6. Without replacement – the sample is replicated $k = \frac{N}{n}\left(1 - \frac{1-f}{n}\right)$ times and a without-replacement resample of size $n' = n - (1 - f)$ is taken.

Shao and Sitter (1996) investigated three of the bootstrap methods: the rescaling bootstrap, the mirror-match bootstrap and the without-replacement bootstrap. Their study found that the rescaling bootstrap was difficult to apply to imputed data but the other two methods performed equally well. Davison and Hinkley (1997) compared the modified sample size, mirror-match, population and super-population bootstrap methods. Their study did not include imputed data but their conclusion was that the population and super-population methods were better than the others. For this study the mirror-match method has been used.


3.2.1 Mirror-match bootstrap procedure

The procedure to create the bootstrap samples and estimate the variance using the mirror-match method is as given below:

1. Take a simple random sample without replacement of size $n'$ from the original sample, where $1 \le n' \le n$.

2. Repeat step 1 $k = n(1 - f^{*})[n'(1 - f)]^{-1}$ times independently, replacing the resamples of size $n'$ each time, where $f^{*} = n'/n$.

3. Estimate $Y$ from the bootstrap sample.

4. Repeat steps 1-3 a large number of times, $B$, to produce $\hat{Y}^{*}_1, \ldots, \hat{Y}^{*}_b, \ldots, \hat{Y}^{*}_B$.

5. Estimate the variance of $\hat{Y}$ using the Monte Carlo approximation

$$\hat{V}(\hat{Y}) \approx \frac{1}{B}\sum_{b=1}^{B}\big(\hat{Y}^{*}_b - \hat{Y}^{*}\big)^2 \quad \text{where} \quad \hat{Y}^{*} = \frac{1}{B}\sum_{b=1}^{B}\hat{Y}^{*}_b. \qquad (9)$$

Choosing $n' = fn$ implies that the resampling fraction is the same as the original sampling fraction $f$ (mirror). In some cases $n'$ and/or $k$ will not be integer values and the randomisation procedure proposed by Sitter (1992) has been used. Choosing $n' = 1$ is equivalent to a with-replacement bootstrap. To capture the imputation variance it is necessary to re-impute values for each bootstrap sample.
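The five steps above can be sketched for a single stratum as follows. The sketch makes several simplifying assumptions that are not in the paper: the function name is hypothetical, the estimator is the expansion estimator $N$ times the resample mean, there is no non-response (so no re-imputation inside the loop), and $n'$ and $k$ must both be whole numbers because the randomisation procedure of Sitter (1992) is not implemented.

```python
import random


def mirror_match_variance(sample, N, n_prime, B=2000, seed=0):
    """Monte Carlo variance estimate, equation (9), by mirror-match resampling."""
    n = len(sample)
    f = n / N                      # original sampling fraction
    f_star = n_prime / n           # resampling fraction
    k = n * (1 - f_star) / (n_prime * (1 - f))
    k_int = round(k)
    if k_int == 0 or abs(k - k_int) > 1e-9:
        raise ValueError("this sketch needs a whole number k; got %r" % k)
    rng = random.Random(seed)
    estimates = []
    for _ in range(B):
        resample = []
        for _ in range(k_int):
            # Step 1: srswor of size n' from the original sample;
            # step 2: repeat k times, putting the units back each time.
            resample.extend(rng.sample(sample, n_prime))
        # Step 3: estimate the total from the bootstrap sample.
        estimates.append(N * sum(resample) / len(resample))
    # Step 5: Monte Carlo approximation with divisor B.
    mean_est = sum(estimates) / B
    return sum((e - mean_est) ** 2 for e in estimates) / B


# Hypothetical stratum: n = 20 sampled from N = 100, so f = 0.2, n' = fn = 4, k = 5.
sample = [float((7 * i) % 23) for i in range(20)]
v_boot = mirror_match_variance(sample, N=100, n_prime=4, B=4000, seed=1)
```

With these choices the bootstrap variance of the resample mean is $(1 - f)s^2/n$ in expectation, so for large $B$ the result should lie close to the usual srswor variance estimate $N^2(1 - f)s^2/n$. To reflect imputation variance, the re-imputation would be carried out on each resample before the total is estimated.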

4. SIMULATION EXERCISE

The performance of the jackknife and bootstrap methods with QIDSS data was compared by a simulation exercise. A finite population was created using a subset of contributors that returned values to the QIDSS survey for two specific consecutive quarters. This population was stratified into four strata by employment size: three sampled strata and one completely enumerated stratum. The population and sample sizes are given in the following table.

Strata   Population size   Sample size   Sampling fraction
1        315               10            0.03
2        233               25            0.11
3        200               100           0.50
4        75                75            1.00
Total    823               210

Table 1: Population and sample sizes used in the simulation exercise

A stratified sample without replacement was taken in each of 1,000 simulations. For each simulated sample, non-response at a uniform rate of 16 per cent was generated. For the non-responders, 85 per cent of values were imputed using the autoregressive method and 15 per cent using the ratio method.

The estimated total turnover ($\hat{Y}$) was calculated for each simulation. The variance of the estimated total over the simulations, $V(\hat{Y})$, was then assumed to be the true variance.

The jackknife estimator used in the simulation was the fpc corrected one as proposed by Lee et al (1995).

For the bootstrap method, 100 bootstrap resamples were taken for each of the 1,000 simulations. (For stratum 4 only 600 simulations were run due to the high computer intensity of the method.) In strata 2 and 3 the resampling fraction mirrored the sampling fraction, i.e. $n' = fn$. In stratum 1, due to the small sampling fraction, $n' = 1$ was used, equivalent to a with-replacement bootstrap. For stratum 4, which was completely enumerated, there was no obvious strategy. In the first instance the with-replacement bootstrap with $n' = 1$ has been used for this stratum.


5. SIMULATION RESULTS

The percentage relative bias of the ordinary estimator (where the imputed values are treated as actual values), the jackknife estimator and the bootstrap estimator is given in Table 2. Table 3 gives the root mean square error (RMSE) of the three estimators and Table 4 shows the coverage of the 95 per cent confidence interval of the estimate, i.e. the percentage of simulations for which $\hat{Y} \pm 1.96\sqrt{\hat{V}(\hat{Y})}$ contains the true value $Y$. Using these three measures, for this dataset, the jackknife estimator appears to be the better method. For the completely enumerated stratum, no method proved to be satisfactory: for the jackknife estimator the application of the fpc correction led to negative variance estimates, and the bootstrap estimator failed to converge within the number of simulations/resamples.

STRATA   ORDINARY   JACKKNIFE   BOOTSTRAP
1        -54.7      27.5        21.1
2        -7.8       9.5         -22.8
3        -12.3      0.2         -18.4
4        -100.0     1.5         149.6

Table 2: Simulation results for ordinary, jackknife and bootstrap estimators: Percentage relative bias

STRATA   ORDINARY       JACKKNIFE      BOOTSTRAP
1        158 x 10^6     422 x 10^6     461 x 10^6
2        420 x 10^6     485 x 10^6     464 x 10^6
3        106 x 10^6     135 x 10^6     150 x 10^6
4        2,687 x 10^6   6,201 x 10^6   9,290 x 10^6

Table 3: Simulation results for ordinary, jackknife and bootstrap estimators: Root mean square error

STRATA   ORDINARY   JACKKNIFE   BOOTSTRAP
1        61.6       85.4        79.4
2        88.3       93.1        85.5
3        91.2       93.1        89.8
4        -          46.9        96.2

Table 4: Simulation results for ordinary, jackknife and bootstrap estimators: Coverage
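The three performance measures reported in Tables 2 to 4 can be computed from the simulation output along the following lines. This is a sketch under assumptions: the function names are hypothetical, the RMSE is taken to be that of the variance estimates around the simulation "true" variance, and a negative variance estimate is counted as a failure to cover, a convention the paper does not state.

```python
from math import sqrt


def relative_bias_pct(var_estimates, true_var):
    """Percentage relative bias of the variance estimates."""
    mean_est = sum(var_estimates) / len(var_estimates)
    return 100.0 * (mean_est - true_var) / true_var


def rmse(var_estimates, true_var):
    """Root mean square error of the variance estimates."""
    squared = [(v - true_var) ** 2 for v in var_estimates]
    return sqrt(sum(squared) / len(squared))


def coverage_pct(point_estimates, var_estimates, true_total, z=1.96):
    """Percentage of simulations whose interval contains the true total."""
    hits = 0
    for est, v in zip(point_estimates, var_estimates):
        if v >= 0:                       # negative estimates give no interval
            half_width = z * sqrt(v)
            if est - half_width <= true_total <= est + half_width:
                hits += 1
    return 100.0 * hits / len(point_estimates)


# Hypothetical output from four simulations.
bias = relative_bias_pct([90.0, 110.0, 100.0, 80.0], true_var=100.0)
error = rmse([90.0, 110.0, 100.0, 80.0], true_var=100.0)
cover = coverage_pct([1000.0, 1030.0, 980.0, 1005.0],
                     [100.0, 100.0, 100.0, 25.0], true_total=1010.0)
```

Under these conventions an estimator that often returns negative variances, as the fpc corrected jackknife did in stratum 4, is penalised in the coverage figure even when its average bias is small.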

6. APPLICATION TO QIDSS SURVEY DATA

As well as the simulation exercise, the two estimators have also been applied to actual survey data in order to gain some insight into the performance of the estimators in practice. The table below gives the variance estimated using the jackknife method for a number of industries covered by the QIDSS survey. V-RESP is the estimate of variance based on responders only, V-ORD is the estimate with imputed values treated as actual values and V-JK* is the jackknife estimator using the fpc correction. The jackknife estimate (V-JK*) illustrates the problem of some variance estimates being negative.

Industry   V-RESP    V-ORD    V-JK*
1          100,577   25,344   115,922
2          4,378     3,277    3,558
3          4,339     3,034    5,191
4          677       548      8,282
5          5,005     2,496    5,584
6          789       125      -1,179

Table 5: Estimation of variance using the jackknife estimator in QIDSS survey data (x 10^6)

Applying the bootstrap method to actual survey data was more difficult due to not having a suitable bootstrap method for the completely enumerated strata. Another problem that this exercise highlighted was the number of resamples needed for the variance estimate to converge. In the simulation exercise, in all but stratum 4, the overall variance estimate had converged using 100 resamples in the 1,000 simulations. However, when applying the method to the survey data it became apparent that many more resamples were needed. The two graphs below show the variance estimate against the number of bootstrap resamples for two strata. In the first case the variance estimate converges at about 600 resamples but for the second stratum the variance has still not converged after 1,000 resamples. Davison and Hinkley (1997) suggested that the number of resamples should be about 10 times the sample size.

[Two line graphs, one per stratum: variance estimate (vertical axis) against number of bootstrap resamples (horizontal axis, 10 to 910).]

Figure 1: Variance estimate against number of bootstrap samples

7. CONCLUSIONS

Neither the jackknife nor the bootstrap method has proved to be a totally successful way of estimating variance in the presence of imputation in ONS business surveys. For the jackknife this is due to difficulties in incorporating the fpc factor, leading to negative variances. For the bootstrap the main problem is the computational intensity and the large number of resamples needed for convergence of the estimate. Further work will investigate the performance of other bootstrap methods and also consider the methods proposed by Shao and Steel (1999).

8. REFERENCES

Cabeça, J. C. S. (1997), "A without replacement resampling procedure for survey data," Proceedings of Statistics Canada Symposium 97, pp. 97-100.

Davison, A. C. and Hinkley, D. V. (1997), Bootstrap Methods and their Application. Cambridge University Press.

Full, S. E. (1999), "Estimating variance due to imputation in ONS business surveys," paper presented at International Conference on Survey Non-Response, Portland, USA.

Kovar, J. G. and Whitridge, P. J. (1995), "Imputation of business survey data," in Business Survey Methods (eds B. G. Cox, D. A. Binder, B. N. Chinnappa, A. Christianson, M. J. Colledge, P. S. Kott), pp. 403-423. New York: Wiley.

Lee, H., Rancourt, E. and Särndal, C.-E. (1994), "Experiments with variance estimation from survey data with imputed values," Journal of Official Statistics, 10, pp. 231-243.

Lee, H., Rancourt, E. and Särndal, C.-E. (1995), "Variance estimation in the presence of imputed data for the Generalized Estimation System," Proceedings of the Section on Survey Research Methods, American Statistical Association, pp. 384-389.

Rancourt, E., Lee, H. and Särndal, C.-E. (1994), "Variance estimation under more than one imputation method," Proceedings of the International Conference on Establishment Surveys, pp. 374-379.

Rao, J. N. K. and Shao, J. (1992), "Jackknife variance estimation with survey data under hot deck imputation," Biometrika, 79, pp. 811-822.

Rao, J. N. K. and Sitter, R. R. (1995), "Variance estimation under two-phase sampling with application to imputation for missing data," Biometrika, 82, pp. 453-460.

Shao, J. and Steel, P. (1999), "Variance estimation for survey data with composite imputation and non-negligible sampling fractions," Journal of the American Statistical Association, 94, pp. 254-265.

Sitter, R. R. (1992), "A Resampling Procedure for Complex Survey Data," Journal of the American Statistical Association, 87, pp. 755-765.




DISCUSSION OF FOUR PAPERS ON VARIANCE ESTIMATION (SESSION 30)

Phillip S. Kott. Bailey, National Agricultural Statistics Service
Phillip S. Kott, NASS, Room 305, 3251 Old Lee Hwy, Fairfax, VA 22030, USA


These four very different papers all address variance estimation when the sample is large. In three of the papers, the population need not be much larger.

Key Words: Asymptotic; Confidence interval; Finite population correction, Jackknife.

1. INTRODUCTION

All four of these papers have interesting things to say, not all of which can be addressed here. I will tackle the papers in an order I find convenient. Three papers address the problem of finite population correction in some form. The first I will discuss does not.

2. ANDERSSON AND NERMAN

This paper has engaged my imagination like nothing since the second season finale of Buffy the Vampire Slayer. Although it is framed in a survey sampling context, the paper really addresses a broader issue. The pivotal statistic, $t = (m - \mu)/s$, where $m$ is an unbiased estimator for $\mu$ and $s^2$ is an unbiased estimator for $\sigma^2$, the variance of $m$, is asymptotically normal. When $m$ is the mean of independently and identically distributed skewed random variables, however, the sample size may need to be large before asymptotic normality can be invoked. The authors develop a clever method of speeding up the asymptotics when constructing two-sided confidence intervals.

Briefly, they suggest replacing $s$ in the pivotal by $s^* = \sqrt{s^2 - K(m - \mu)}$, where $K = E[s^2 (m - \mu)]/\sigma^2$. In practice $K$ must be estimated from the sample, and $(m - \mu)$ is unknown. The first problem is handled by estimating $K$ consistently, say by $k$. The second is finessed entirely.

The authors construct a confidence interval by solving for $\mu$ in the relation $(m - \mu)^2/s^{*2} \le z^2$, where $z$ is the normal score for the two-sided confidence interval of interest. Their result has the form:

$$\mu = m + z^2 k/2 \pm z\sqrt{s^2 + z^2 k^2/4}.$$

This causes the confidence interval to be asymmetric even though it is clearly two-sided (recall that we began with $(m - \mu)^2/s^{*2} \le z^2$). Surprisingly, the asymmetry is not only a function of $K$ (through $k$), but of $z$.
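A minimal sketch of the interval formula quoted above may help to make the asymmetry concrete. The function name `adjusted_interval` and the numerical inputs are illustrative assumptions; the code only evaluates the closed form as stated in this discussion and does not estimate $k$ from data.

```python
import math


def adjusted_interval(m, s2, k, z=1.96):
    """Endpoints of mu = m + z^2 k / 2 +/- z * sqrt(s^2 + z^2 k^2 / 4).

    m  : point estimate
    s2 : estimated variance of m (s squared)
    k  : estimate of K (hypothetical input; how to estimate it is not shown here)
    z  : normal score for the two-sided interval
    """
    centre = m + z * z * k / 2.0                     # shifted away from m when k != 0
    half_width = z * math.sqrt(s2 + z * z * k * k / 4.0)
    return centre - half_width, centre + half_width


lower, upper = adjusted_interval(m=10.0, s2=4.0, k=0.5)
print(10.0 - lower, upper - 10.0)   # the two distances from m differ when k != 0
```

With `k = 0` the function returns the usual symmetric interval `m ± z·s`; the shift of the centre grows with both `k` and `z`, which is the dependence on `z` remarked on above.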

Under mild conditions, $s$ is $O_P(1/\sqrt{n})$, where $n$ is the sample size on which $m$ is based. The asymmetric term, $z^2 k/2$, is $O_P(1/n)$, which is of a smaller asymptotic order than $s$, but still important when $n$ is not too large. The remaining term, $z^2 k^2/4$, is asymptotically unimportant considering that its impact on the confidence interval is the same (under mild conditions) as that of estimating $K$ by $k$ in the asymmetric term.

One thing the authors do not consider is the effect on the confidence interval of the variance of $s^{*2}$, a subject of my own research (Kott, 1994). They point out, however, that $s^{*2}$ has less variance than $s^2$ when $K \neq 0$.

There was no empirical work in the draft I saw. Nevertheless, I am encouraged by the similarity of the authors' confidence interval and the empirically validated score confidence interval for an estimated proportion based on a simple random sample (see, for example, Agresti and Coull, 1998). The idea there is to create a confidence interval for a proportion $P$ based on an estimate $p$, using $(p - P)^2/\{P(1 - P)/n\} \le z^2$. The score confidence interval is asymptotically identical (up to $O_P(1/n)$ terms) to the authors' interval in this context. That suggests to me that an adaptation of the authors' approach is needed to properly extend score confidence intervals to the analysis of data from complex samples. The method I used in Kott and Carr (1997) now seems embarrassingly ad hoc.
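For comparison, the score interval for a proportion mentioned above can be written in closed form by solving the quadratic inequality in $P$. The sketch below assumes simple random sampling with a negligible sampling fraction; the function name `score_interval` is an illustrative choice.

```python
import math


def score_interval(p, n, z=1.96):
    """Solve (p - P)^2 / {P (1 - P) / n} <= z^2 for P.

    p : estimated proportion from a simple random sample
    n : sample size
    z : normal score for the two-sided interval
    """
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2.0 * n)) / denom
    half_width = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return centre - half_width, centre + half_width


low, high = score_interval(p=0.10, n=50)
print(low, high)   # the interval is not centred on p unless p = 0.5
```

As with the authors' interval, the centre is pulled away from the point estimate by a term of order $1/n$ that depends on $z$.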


3. HURTUBISE ET AL.

I have relatively few things to say about this paper, which does not mean the topic is unimportant or that the analysis is incorrect.

The authors are essentially interested in estimating the variance (or mean squared error) of a statistic of the form:

$$t = \frac{a \times b}{c \times d},$$

where a, b, c, and d are unbiased estimators of A, B, C, and D, respectively; a and c come from one survey, while b and d come from another. When faced with a complex estimator like t, I would immediately suggest using a grouped jackknife, but finite population correction matters in the surveys used by the authors, so a jackknife cannot easily be applied.

When $E(ab) = AB$, as it does in this case, we know that $\mathrm{Var}(ab) = B^2\,\mathrm{Var}(a) + A^2\,\mathrm{Var}(b) + \mathrm{Var}(a)\,\mathrm{Var}(b)$. By contrast, the unbiased variance estimator has a subtly different form: $\mathrm{var}(ab) = b^2\,\mathrm{var}(a) + a^2\,\mathrm{var}(b) - \mathrm{var}(a)\,\mathrm{var}(b)$ (the authors got the last sign wrong in the draft I saw, but it hardly matters). This is because $E(a^2) = A^2 + \mathrm{Var}(a)$ and $E(b^2) = B^2 + \mathrm{Var}(b)$.
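The sign of the last term can be checked exactly on a toy example. The sketch below assumes that a and b are means of independent with-replacement samples from small discrete populations (an illustrative setting, not the SEPH design) and enumerates every possible pair of samples with exact rational arithmetic; all function names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def mean_and_var_of_mean(sample):
    """Sample mean and the usual unbiased estimator of its variance, s^2 / n."""
    n = len(sample)
    m = Fraction(sum(sample), n)
    s2 = sum((Fraction(x) - m) ** 2 for x in sample) / (n - 1)
    return m, s2 / n


def product_var_estimate(a, var_a, b, var_b):
    """Estimator of Var(ab) for independent a and b; note the minus sign."""
    return b ** 2 * var_a + a ** 2 * var_b - var_a * var_b


def exact_check(values_a, values_b, n=2):
    """Return (true Var(ab), expectation of the estimator) by full enumeration
    of all equally likely with-replacement samples of size n."""
    samples_a = list(product(values_a, repeat=n))
    samples_b = list(product(values_b, repeat=n))
    total = len(samples_a) * len(samples_b)
    e_ab = e_ab2 = e_est = Fraction(0)
    for sa in samples_a:
        a, va = mean_and_var_of_mean(sa)
        for sb in samples_b:
            b, vb = mean_and_var_of_mean(sb)
            e_ab += a * b
            e_ab2 += (a * b) ** 2
            e_est += product_var_estimate(a, va, b, vb)
    true_var = e_ab2 / total - (e_ab / total) ** 2
    return true_var, e_est / total


true_var, expected_estimate = exact_check([1, 2, 6], [0, 3])
print(true_var == expected_estimate)
```

Replacing the minus sign by a plus in `product_var_estimate` makes the expectation exceed the true variance by twice the product of the two variances, which is the asymptotically negligible discrepancy noted above.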

I prefer invoking more general relationships. When $(a - A)/A$ and $(b - B)/B$ are $O_P(1/\sqrt{n})$,

relVar(ab) = relVar(a) + relVar(b) + 2relCov(a, b), and

relVar(a/b) = relVar(a) + relVar(b) - 2relCov(a, b).

These equations are similar and easier to remember. The term Var(a)Var(b) is missing because it is asymptotically ignorable (which is why the authors' wrong sign hardly mattered).
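As a worked illustration, the two relationships can be chained to give a first-order approximation for the statistic $t = (a \times b)/(c \times d)$ discussed above. The sketch assumes that the two surveys are independent, so that every relative covariance between an estimator from one survey and an estimator from the other is zero; this independence assumption is taken from the description of the two data sources and is not a result stated in this discussion.

```latex
\begin{align*}
\mathrm{relVar}(t)
  &= \mathrm{relVar}\!\left(\frac{a}{c}\cdot\frac{b}{d}\right) \\
  &\approx \mathrm{relVar}\!\left(\frac{a}{c}\right)
         + \mathrm{relVar}\!\left(\frac{b}{d}\right)
         + 2\,\mathrm{relCov}\!\left(\frac{a}{c},\frac{b}{d}\right) \\
  &\approx \bigl[\mathrm{relVar}(a) + \mathrm{relVar}(c) - 2\,\mathrm{relCov}(a,c)\bigr]
         + \bigl[\mathrm{relVar}(b) + \mathrm{relVar}(d) - 2\,\mathrm{relCov}(b,d)\bigr],
\end{align*}
```

where the cross term $\mathrm{relCov}(a/c,\, b/d)$ drops out because $a, c$ and $b, d$ come from different surveys.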

4. FULL

I have my doubts about the bootstrap in this context (two-phase sampling problems, which nonresponse adjustment effectively is), so I will focus my remarks on the jackknife.

The usual theory for a jackknife assumes no finite population correction and this imputation:

$$y_k^* = x_k \left[ \sum_R y_i \Big/ \sum_R x_i \right],$$

where R is the imputation cell (reweighting group) containing k. Moreover, each such cell must be large.

The author abandons any hope of good quasi-randomization-based properties with her use of two different models and mean-of-ratios imputation ($y_k^* = x_k\, r^{-1} \sum_R (y_i/x_i)$, where $r$ is the number of respondents). Still, how well does the fpc-adjusted jackknife work?
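The two imputation rules contrasted above differ only in how the respondent ratio is formed. The following sketch shows both for a single imputation cell; the function names and the data are illustrative assumptions.

```python
def ratio_impute(x_k, respondents):
    """Ratio-of-totals imputation: y_k* = x_k * (sum of y_i) / (sum of x_i).

    respondents : list of (x_i, y_i) pairs for the respondents in the cell
    """
    sum_x = sum(x for x, _ in respondents)
    sum_y = sum(y for _, y in respondents)
    return x_k * sum_y / sum_x


def mean_of_ratios_impute(x_k, respondents):
    """Mean-of-ratios imputation: y_k* = x_k * (1/r) * sum of (y_i / x_i)."""
    r = len(respondents)
    return x_k * sum(y / x for x, y in respondents) / r


cell = [(10.0, 20.0), (20.0, 50.0), (40.0, 60.0)]   # hypothetical (x_i, y_i) pairs
print(ratio_impute(7.0, cell), mean_of_ratios_impute(7.0, cell))
```

The two rules coincide only when every respondent has the same ratio $y_i/x_i$; otherwise the mean-of-ratios rule gives more influence to respondents with small $x_i$.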

Effectively, there is simple random sampling in four distinct populations in the author's empirical work. I will briefly discuss the two of them with disappointing results.

In Stratum 1, the number of respondents is small, which causes the jackknife to have an appreciable bias even under ideal conditions. Also, the normal-based confidence intervals need to be adjusted for the effective degrees of freedom (see Kott, 1994).


In Stratum 4, variance estimates are derived from what turns out to be a very small number of sums of squares, hence the often-negative estimates. It should be noted that conventional imputation/reweighting would have allowed a single-phase jackknife and simple finite population correction.
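To illustrate what a "simple finite population correction" looks like, here is a sketch of a delete-one jackknife for a sample mean under simple random sampling with full response, multiplied by the factor $1 - n/N$. It is not the imputation-adjusted jackknife studied in the paper; the function name and the data are hypothetical.

```python
def jackknife_variance_of_mean(y, N):
    """Delete-one jackknife variance of the sample mean, with a simple fpc.

    y : observed values from a simple random sample (full response assumed)
    N : population size
    """
    n = len(y)
    total = sum(y)
    theta = total / n
    # Estimate recomputed with each unit deleted in turn.
    leave_one_out = [(total - yi) / (n - 1) for yi in y]
    v_jk = (n - 1) / n * sum((t - theta) ** 2 for t in leave_one_out)
    return (1 - n / N) * v_jk          # simple finite population correction


y = [2.0, 4.0, 4.0, 7.0, 9.0]
print(jackknife_variance_of_mean(y, N=20))
```

For the mean, this reproduces the textbook estimator $(1 - n/N)\, s^2/n$, which is non-negative by construction.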

5. BREWER

The author develops variance expressions and estimators for the general Horvitz-Thompson estimator, $t = \sum_S y_k/\pi_k$, without those pesky joint-selection-probability terms. Essentially, he offers variance estimators of the form:

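$$\mathrm{var}\Bigl(\sum_S y_k/\pi_k\Bigr) = \frac{n}{n-1} \sum_S Q_k \Bigl[\, y_k/\pi_k - n^{-1} \sum_S y_i/\pi_i \Bigr]^2,$$

where

$$Q_k = \begin{cases}
1 - \pi_k, & \text{or} \\
1 - \pi_k + \bigl(\pi_k - \sum_U \pi_i^2/n\bigr)/n, & \text{or} \\
1 - \pi_k - \bigl(\pi_k - \sum_U \pi_i^2/n\bigr)/n.
\end{cases}$$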

Observe that when $Q_k = 1$, this becomes the standard with-replacement variance formula. Effectively, $Q_k$ is a finite-population correction term, which varies across the sampling units.

How can we choose $Q_k$ from among the author's proposals? Let me offer another. Consider the model $y_k = \beta(\pi_k + \varepsilon_k)$, where the $\varepsilon_k$ are uncorrelated, have means of zero, and identical variances. This is the model under which the Horvitz-Thompson estimation strategy is ideal. Setting $Q_k = 1 - 2\pi_k + \sum_U \pi_i^2/n = 1 - \pi_k - (\pi_k - \sum_U \pi_i^2/n)$ renders the variance estimator above unbiased for the model variance of $t$. Like the author's three variants, the model-based choice for the $Q_k$ collapses to the standard form when all the selection probabilities are equal.
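The family of estimators above is easy to compute once the $Q_k$ are chosen. The sketch below evaluates the general form for the three variants and the model-based choice; it assumes a fixed-size design in which the population sum of squared selection probabilities is known, and all function names and the data are illustrative.

```python
def ht_variance_estimate(y, pi, q):
    """[n/(n-1)] * sum_S Q_k * [y_k/pi_k - (1/n) sum_S y_i/pi_i]^2."""
    n = len(y)
    expanded = [yk / pk for yk, pk in zip(y, pi)]
    mean_expanded = sum(expanded) / n
    return n / (n - 1) * sum(qk * (e - mean_expanded) ** 2
                             for qk, e in zip(q, expanded))


def q_choices(pi, sum_pi_sq_U):
    """Candidate Q_k values for the sampled units.

    pi          : selection probabilities of the sampled units
    sum_pi_sq_U : sum of squared selection probabilities over the population U
    """
    n = len(pi)
    c = sum_pi_sq_U / n
    return {
        "simple": [1 - p for p in pi],
        "plus":   [1 - p + (p - c) / n for p in pi],
        "minus":  [1 - p - (p - c) / n for p in pi],
        "model":  [1 - 2 * p + c for p in pi],   # model-based choice suggested above
    }


# Equal-probability example: N = 10, n = 4, so every pi_k = 0.4.
y = [3.0, 8.0, 5.0, 12.0]
pi = [0.4] * 4
for name, q in q_choices(pi, sum_pi_sq_U=10 * 0.4 ** 2).items():
    print(name, ht_variance_estimate(y, pi, q))
```

In the equal-probability example every choice gives $Q_k = 1 - n/N$, so all four estimates coincide with the usual simple random sampling variance estimator of the total, as the text states.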

In the Hartley-Rao (1962) variance estimator for systematic PPS sampling from a randomly ordered list, $Q_k$ can be shown to be effectively $1 - \pi_k - \sum_S \pi_i/n + \sum_U \pi_i^2/n$ when the population is large (but not too large), yet another choice. It is a choice I prefer because it is based on an actual sampling design and some theory.

When $n$ is large, all the author's variants, the model-based estimator, and the Hartley-Rao estimator are about the same. Moreover, when we extend the analysis to a regression estimator, there are bigger problems (as it happens, the author and I are working together on them). When $n$ is small, the standard variance formulae are not so problematic.

Finally, the author has some useful suggestions for systematic PPS sampling from an ordered list, but I will not discuss them further here.

REFERENCES

Agresti, A. and B.A. Coull (1998), "Approximate is Better than 'Exact' for Interval Estimation of Binomial Proportions," American Statistician, pp. 119-126.

Hartley, H.O. and J.N.K. Rao (1962), "Sampling With Unequal Probabilities and Without Replacement," Annals of Mathematical Statistics, pp. 350-374.

Kott, P.S. (1994), "Hypothesis Testing of Linear Regression Coefficients with Survey Data," Survey Methodology, pp. 159-164.

Kott, P.S. and D.A. Carr (1997), "Developing an Estimation Strategy for a Pesticide Data Program," Journal of Official Statistics, pp. 367-383.