
Metrika (1998) 48: 177–187

© Springer-Verlag 1999

Estimation of ratio of population means in survey sampling when some observations are missing

H. Toutenburg¹, V. K. Srivastava²

¹ Institut für Statistik, Universität München, D-80799 München, Germany (e-mail: [email protected])
² Department of Statistics, University of Lucknow, Lucknow 226007, India

Received: June 1998

Abstract. This paper considers the estimation of the ratio of population means when some observations are missing. Four estimators are presented and their bias and mean square error properties are studied.

Key words: Ratio estimators, missing values, relative mean squared error

1 Introduction

In most of the developments concerning the use of auxiliary information in the estimation of parameters in survey sampling, it is typically assumed that all the observations on selected units in the sample are available. This may not hold true in many practical situations encountered in sample surveys, and some observations may be missing for various reasons such as unwillingness of some selected units to supply the desired information, accidental loss of information caused by unknown factors, or failure on the part of the investigator to gather correct information. In fact, missingness of observations is not an uncommon feature in opinion polls, market research surveys, mail enquiries, socio-economic investigations, medical studies and other scientific experiments. In such circumstances, the traditional procedures for deducing inferences cannot be applied straightforwardly.

The ratio of two population means in survey sampling is conventionally estimated by the ratio of the corresponding sample means; see e.g., [1] and [2]. This estimation procedure does not work when some of the observations are missing. Assuming that some observations are missing on either the study characteristic or both the study and auxiliary characteristics, the estimation of the ratio of means has been considered in [3]. In this article, we consider a situation in which some observations are missing on one of the characteristics at a time, so that the missingness phenomenon occurs for both characteristics separately but not, of course, simultaneously. As an illustration, let the study characteristic (Y) be the consumption expenditure of a household in the current month and the auxiliary characteristic (X) be the consumption expenditure in the same month three years ago. Then there may be some households for which the values of X are lost or could not be recorded earlier, and the households are now not able to recall the correct value of X, while the corresponding values of Y are readily available. Similarly, there may be some households for which values of X are available but the corresponding values of Y cannot be obtained, for instance, when such families have gone on holiday or have permanently moved out of town.

The plan of this article is as follows. In Section 2, we describe the data framework and present four simple estimators for the ratio of population means. Large sample approximations for their relative biases and relative mean squared errors are given in Section 3, and using them a comparison of the estimators is made. Some remarks are then placed in Section 4. Finally, the derivation of results is outlined in the Appendix.

2 Estimation of ratio

Consider a population of N units from which a random sample of size n is drawn according to the simple random sampling without replacement procedure for the estimation of the ratio $R = \bar{Y}/\bar{X}$ of the population means $\bar{X}$ and $\bar{Y}$ of two characteristics X and Y, respectively.

It is assumed that a set of $(n-p-q)$ complete observations $(x_1, y_1), (x_2, y_2), \ldots, (x_{n-p-q}, y_{n-p-q})$ on selected units in the sample is available. In addition to these, observations $x_1^*, x_2^*, \ldots, x_p^*$ on p units in the sample are available but the corresponding observations on the Y characteristic are missing. Similarly, we have a set of q observations $y_1^{**}, y_2^{**}, \ldots, y_q^{**}$ on the Y characteristic in the sample but the associated values on the X characteristic are missing. Further, the quantities p and q denoting the numbers of incomplete observations are assumed to be random, following [3].

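If we write

$$
\bar{x} = \frac{1}{n-p-q}\sum x_i, \qquad
\bar{y} = \frac{1}{n-p-q}\sum y_i, \qquad
\bar{x}^{*} = \frac{1}{p}\sum x_i^{*}, \qquad
\bar{y}^{**} = \frac{1}{q}\sum y_i^{**}
\tag{2.1}
$$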

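the following estimators for the ratio $R = \bar{Y}/\bar{X}$ can be formulated:

$$
r_1 = \frac{\bar{y}}{\bar{x}} \tag{2.2}
$$

$$
r_2 = \frac{(n-q)\,\bar{y}}{(n-p-q)\,\bar{x} + p\,\bar{x}^{*}} \tag{2.3}
$$

$$
r_3 = \frac{(n-p-q)\,\bar{y} + q\,\bar{y}^{**}}{(n-p)\,\bar{x}} \tag{2.4}
$$

$$
r_4 = \left(\frac{n-q}{n-p}\right)\frac{(n-p-q)\,\bar{y} + q\,\bar{y}^{**}}{(n-p-q)\,\bar{x} + p\,\bar{x}^{*}} \tag{2.5}
$$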

The estimator $r_1$ is based on the $(n-p-q)$ complete observations and ignores all the incomplete pairs of observations. The estimators $r_2$ and $r_3$ make use of the incomplete observations only partly. For example, the estimator $r_2$ utilizes the p observations on the X characteristic only, while $r_3$ uses the q observations on the Y characteristic only. The estimator $r_4$, however, incorporates all the available observations.
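
As a rough illustration of (2.2) to (2.5), the following Python sketch computes the four estimators from the three groups of observations. The function name `ratio_estimators` and its argument names are hypothetical and introduced here only for illustration; the sample size n is taken to be the number of complete pairs plus p plus q.

```python
from statistics import fmean


def ratio_estimators(pairs, x_only, y_only):
    """Return (r1, r2, r3, r4) following (2.2)-(2.5).

    pairs  : list of (x, y) complete observations (n - p - q of them)
    x_only : list of the p values of X whose Y value is missing
    y_only : list of the q values of Y whose X value is missing
    """
    m, p, q = len(pairs), len(x_only), len(y_only)
    n = m + p + q                      # assumed total sample size
    x_bar = fmean([x for x, _ in pairs])
    y_bar = fmean([y for _, y in pairs])
    x_star = fmean(x_only) if p else 0.0
    y_2star = fmean(y_only) if q else 0.0

    x_total = m * x_bar + p * x_star   # sum of all n - q available X values
    y_total = m * y_bar + q * y_2star  # sum of all n - p available Y values

    r1 = y_bar / x_bar
    r2 = (n - q) * y_bar / x_total
    r3 = y_total / ((n - p) * x_bar)
    r4 = ((n - q) / (n - p)) * y_total / x_total
    return r1, r2, r3, r4
```

Note that $r_4$ is simply the mean of all available Y values divided by the mean of all available X values, which is why it reduces to $r_1$ when p = q = 0.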

3 Comparison of estimators

For the comparison of the performance properties of the four estimators of the ratio R, we introduce the following quantities:

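$$
\begin{aligned}
C_x^2 &= \frac{1}{(N-1)\bar{X}^2}\sum^{N}(X_i - \bar{X})^2 \\
C_y^2 &= \frac{1}{(N-1)\bar{Y}^2}\sum^{N}(Y_i - \bar{Y})^2 \\
\rho &= \frac{1}{(N-1)\bar{X}\bar{Y}C_xC_y}\sum^{N}(X_i - \bar{X})(Y_i - \bar{Y}) \\
f_s &= E\left(\frac{1}{n-s}\right) - \frac{1}{N}
\end{aligned}
\tag{3.1}
$$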

where the expectation operator in the last quantity refers to all possible values of the non-negative integer-valued random variable s.
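
To make the last quantity in (3.1) concrete, here is a minimal Python sketch that evaluates $f_s = E[1/(n-s)] - 1/N$ when the distribution of s is supplied as a probability mass function. The paper does not specify any distribution for the number of missing values, so the dictionary-based `pmf` argument and the function name `f_value` are assumptions made purely for illustration.

```python
def f_value(n, N, pmf):
    """Evaluate f_s = E[1/(n - s)] - 1/N.

    n   : sample size
    N   : population size
    pmf : dict mapping each possible value of s to its probability
          (a hypothetical input; the source leaves the law of s open)
    """
    if abs(sum(pmf.values()) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    if any(s < 0 or s >= n for s in pmf):
        raise ValueError("each s must satisfy 0 <= s < n")
    return sum(prob / (n - s) for s, prob in pmf.items()) - 1.0 / N
```

For a fixed (non-random) s the expression collapses to 1/(n - s) - 1/N, for instance `f_value(20, 200, {0: 1.0})` is 1/20 - 1/200. Larger numbers of missing values give larger f, which is the pattern behind the inequalities that follow.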

Thus we have

$$
f_p \le f_{p+q} \tag{3.2}
$$

$$
f_q \le f_{p+q} \tag{3.3}
$$

$$
f_p \gtrless f_q \quad \text{if} \quad p \gtrless q \tag{3.4}
$$

We now present the results, which are derived in the Appendix.

Theorem 3.1. When sample size is large, the relative biases of the four estimators of R can be approximated by

$$
RB(r_1) = E\left(\frac{r_1 - R}{R}\right) = (C_x - \rho C_y)\,C_x f_{p+q} \tag{3.5}
$$

$$
RB(r_2) = E\left(\frac{r_2 - R}{R}\right) = (C_x - \rho C_y)\,C_x f_q \tag{3.6}
$$

$$
RB(r_3) = E\left(\frac{r_3 - R}{R}\right) = (C_x f_{p+q} - \rho C_y f_p)\,C_x \tag{3.7}
$$

$$
RB(r_4) = E\left(\frac{r_4 - R}{R}\right) = (C_x f_q - \rho C_y f_p)\,C_x \tag{3.8}
$$

It is clear from the above expressions that all the estimators are generally biased, like the traditional ratio estimator.
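
The expressions (3.5) to (3.8) are easy to tabulate numerically. The sketch below does so in Python; the function name `relative_biases` and the argument names (`cx`, `cy` for the coefficients of variation, `rho` for the correlation coefficient, `fp`, `fq`, `fpq` for $f_p$, $f_q$, $f_{p+q}$) are hypothetical, and the numbers in the usage note are invented for illustration only.

```python
def relative_biases(cx, cy, rho, fp, fq, fpq):
    """Large-sample relative biases of r1..r4 as given in (3.5)-(3.8)."""
    rb1 = (cx - rho * cy) * cx * fpq
    rb2 = (cx - rho * cy) * cx * fq
    rb3 = (cx * fpq - rho * cy * fp) * cx
    rb4 = (cx * fq - rho * cy * fp) * cx
    return rb1, rb2, rb3, rb4
```

With the made-up inputs `relative_biases(0.5, 0.4, 0.6, 0.05, 0.06, 0.08)`, all four values are non-zero, and the first is larger in magnitude than the second because $f_q \le f_{p+q}$.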

For the comparison of the magnitudes of biases, we consider the squared expressions for the relative biases to the given order of approximation.

We first observe that the estimator $r_1$ always has a larger amount of bias in comparison to $r_2$. The same is true for $r_1$ when compared with $r_3$ provided that the correlation coefficient $\rho$ is negative. For positive values of $\rho$, this result continues to remain true provided that the following condition is satisfied:

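$$
\rho > 2\theta\left(1 + \frac{f_p}{f_{p+q}}\right)^{-1} \tag{3.9}
$$

where $\theta = C_x/C_y$.

The opposite is true, i.e., $r_1$ has smaller bias in magnitude in comparison to $r_3$, when

$$
0 < \rho < 2\theta\left(1 + \frac{f_p}{f_{p+q}}\right)^{-1}. \tag{3.10}
$$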

Similarly, if we compare $r_1$ with $r_4$, we observe from (3.5) and (3.8) that $r_1$ has larger bias in magnitude than $r_4$ when $\rho$ is negative. The same result remains true for positive values of $\rho$ satisfying the following constraint:

$$
\left[\left(\frac{f_{p+q}^2 - f_p^2}{f_{p+q}^2 - f_p f_q}\right)\rho^2 - 2\theta\rho + \left(\frac{f_{p+q}^2 - f_q^2}{f_{p+q}^2 - f_p f_q}\right)\theta^2\right] > 0. \tag{3.11}
$$

Next, comparing the estimators $r_2$ and $r_3$, which utilize the additional available observations only partially, we find that the magnitude of bias of $r_2$ is smaller (larger) than that of $r_3$ when the quantity G is positive (negative), where G is defined by

$$
G = (f_p^2 - f_q^2)\rho^2 - 2(f_{p+q} f_p - f_q^2)\theta\rho + (f_{p+q}^2 - f_q^2)\theta^2. \tag{3.12}
$$

Finally, let us compare $r_4$ with $r_2$ and $r_3$.

From (3.6) and (3.8), we observe that $r_4$ has smaller magnitude of bias in comparison to $r_2$ when any one of the following conditions is satisfied:

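$$
f_p > f_q \quad \text{and} \quad 0 < \rho < 2\theta\left(1 + \frac{f_p}{f_q}\right)^{-1} \tag{3.13}
$$

$$
f_p < f_q \quad \text{and} \quad \rho > 2\theta\left(1 + \frac{f_p}{f_q}\right)^{-1} \tag{3.14}
$$

$$
f_p < f_q \quad \text{and} \quad \rho < 0. \tag{3.15}
$$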

Similarly, it follows from (3.7) and (3.8) that $r_4$ has smaller magnitude of bias than $r_3$ when

$$
\rho < \left(\frac{f_{p+q} + f_q}{2 f_p}\right)\theta \tag{3.16}
$$

which is obviously satisfied at least as long as the correlation coefficient is negative.

Next, let us compare the estimators with respect to the criterion of mean squared error. For this purpose, we have the following results, derived in the Appendix.

Theorem 3.2. When sample size is large, the relative mean squared errors of the four estimators of R are approximated by

$$
RMSE(r_1) = E\left(\frac{r_1 - R}{R}\right)^2 = (C_x^2 + C_y^2 - 2\rho C_x C_y)\, f_{p+q} \tag{3.17}
$$

$$
RMSE(r_2) = E\left(\frac{r_2 - R}{R}\right)^2 = (C_x - 2\rho C_y)\,C_x f_q + C_y^2 f_{p+q} \tag{3.18}
$$

$$
RMSE(r_3) = E\left(\frac{r_3 - R}{R}\right)^2 = (C_y - 2\rho C_x)\,C_y f_p + C_x^2 f_{p+q} \tag{3.19}
$$

$$
RMSE(r_4) = E\left(\frac{r_4 - R}{R}\right)^2 = (C_y - 2\rho C_x)\,C_y f_p + C_x^2 f_q. \tag{3.20}
$$
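
The four approximations of Theorem 3.2 can be evaluated side by side with a few lines of Python. This is only a sketch: the function name `relative_mses` and its argument names (`cx`, `cy` for the coefficients of variation, `rho` for the correlation coefficient, `fp`, `fq`, `fpq` for $f_p$, $f_q$, $f_{p+q}$) are hypothetical, and the inputs in the usage note are invented.

```python
def relative_mses(cx, cy, rho, fp, fq, fpq):
    """Large-sample relative MSEs of r1..r4 as given in (3.17)-(3.20)."""
    rmse1 = (cx**2 + cy**2 - 2 * rho * cx * cy) * fpq
    rmse2 = (cx - 2 * rho * cy) * cx * fq + cy**2 * fpq
    rmse3 = (cy - 2 * rho * cx) * cy * fp + cx**2 * fpq
    rmse4 = (cy - 2 * rho * cx) * cy * fp + cx**2 * fq
    return rmse1, rmse2, rmse3, rmse4
```

For example, `relative_mses(0.5, 0.4, 0.6, 0.05, 0.06, 0.08)` gives roughly (0.0136, 0.0134, 0.0160, 0.0110). The fourth value can never exceed the third, since the two expressions differ only by $C_x^2 (f_{p+q} - f_q) \ge 0$.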

It is seen from (3.17), (3.18) and (3.19) that both the estimators $r_2$ and $r_3$ (which utilize the incomplete observations on either X or Y) are more efficient than $r_1$ (which ignores the incomplete observations altogether) at least as long as the correlation coefficient $\rho$ between X and Y is negative or zero. If $\rho$ is positive, the estimator $r_2$ continues to be more efficient than $r_1$ provided that

$$
\rho < \frac{\theta}{2} \tag{3.21}
$$

while $r_3$ remains more efficient than $r_1$ when

$$
\rho < \frac{1}{2\theta}. \tag{3.22}
$$
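
The two thresholds can be read off directly by subtracting (3.18) and (3.19) from (3.17). Writing $\rho$ for the correlation coefficient and $\theta = C_x/C_y$, a short worked step (our own algebra, using only the expressions of Theorem 3.2) is:

$$
\begin{aligned}
RMSE(r_1) - RMSE(r_2) &= (C_x^2 - 2\rho C_x C_y)(f_{p+q} - f_q) = C_x C_y\,(f_{p+q} - f_q)\,(\theta - 2\rho), \\
RMSE(r_1) - RMSE(r_3) &= (C_y^2 - 2\rho C_x C_y)(f_{p+q} - f_p) = C_x C_y\,(f_{p+q} - f_p)\left(\frac{1}{\theta} - 2\rho\right).
\end{aligned}
$$

Since $f_{p+q} \ge f_q$ and $f_{p+q} \ge f_p$ by (3.2) and (3.3), the sign of each difference is governed by the last factor, which is positive exactly when $\rho < \theta/2$ and $\rho < 1/(2\theta)$, respectively.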

We thus observe that utilizing the incomplete data partially (i.e., observations on either the X or the Y characteristic) is a better proposition than ignoring them completely so long as $\rho$ is negative or zero. This result carries over to positive values of $\rho$ too, at least as long as

$$
\rho < \frac{1}{2}\min\left(\theta, \frac{1}{\theta}\right). \tag{3.23}
$$

On the other hand, it may not be worthwhile using the incomplete data set so long as

$$
\rho > \frac{1}{2}\max\left(\theta, \frac{1}{\theta}\right) \tag{3.24}
$$

which follows from the observation that $r_1$ is better than $r_2$ in case $2\rho > \theta$ and better than $r_3$ in case $2\rho\theta > 1$.

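Comparing $r_2$ and $r_3$, we find that $r_2$ is more efficient than $r_3$ when

$$
\rho \gtrless F \quad \text{if} \quad f_q \gtrless f_p \tag{3.25}
$$

that is, when $\rho > F$ if $f_q > f_p$ and when $\rho < F$ if $f_q < f_p$, while the opposite is true, i.e., the estimator $r_2$ is less efficient than $r_3$, when

$$
\rho \lessgtr F \quad \text{if} \quad f_q \gtrless f_p \tag{3.26}
$$

where

$$
F = \frac{f_{p+q} - f_p}{2\theta\,(f_q - f_p)}\left[1 - \left(\frac{f_{p+q} - f_q}{f_{p+q} - f_p}\right)\theta^2\right]. \tag{3.27}
$$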

The above statement provides conditions under which the use of incomplete observations on the X characteristic is superior or inferior to the use of incomplete observations on the Y characteristic.
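
The comparison of $r_2$ with $r_3$ through the threshold F of (3.27) can be sketched in Python as follows. The rule coded here is the one obtained by subtracting (3.19) from (3.18): $r_2$ is preferred when the correlation exceeds F if $f_q > f_p$, and when it falls below F if $f_q < f_p$. The function name `prefer_r2_over_r3` and its argument names (`rho` for the correlation coefficient, `theta` for $C_x/C_y$, `fp`, `fq`, `fpq` for $f_p$, $f_q$, $f_{p+q}$) are hypothetical.

```python
def prefer_r2_over_r3(rho, theta, fp, fq, fpq):
    """Return True when r2 has the smaller approximate RMSE than r3.

    Uses the threshold F of (3.27); requires fq != fp and fpq > fp,
    since F is undefined otherwise.
    """
    if fq == fp or fpq <= fp:
        raise ValueError("F is undefined when fq == fp or fpq <= fp")
    F = (fpq - fp) / (2 * theta * (fq - fp)) * (
        1 - (fpq - fq) / (fpq - fp) * theta**2
    )
    return rho > F if fq > fp else rho < F
```

With the invented values theta = 1.25, fp = 0.05, fq = 0.06 and fpq = 0.08, the threshold F is about -0.05, so the function returns True for rho = 0.6 and False for rho = -0.5.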

Next, let us consider the estimator $r_4$, which utilizes all the available observations on both the X and Y characteristics. Comparing it with $r_1$, which is based on complete cases only, we observe from (3.17) and (3.20) that $r_4$ is more efficient than $r_1$ when the correlation coefficient is negative or zero. This result holds for positive values of $\rho$ also, provided that

$$
\rho < \frac{1}{2\theta}\left[1 + \left(\frac{f_{p+q} - f_q}{f_{p+q} - f_p}\right)\theta^2\right]. \tag{3.28}
$$

When the positive value of $\rho$ is such that it satisfies the inequality (3.28) with a reversed sign, $r_4$ is no longer superior to $r_1$, meaning that it is appropriate to discard the incomplete observations.

Now comparing $r_4$ with $r_2$, we find that $r_4$ is more efficient than $r_2$ when

$$
\rho\,(f_q - f_p) < \frac{f_{p+q} - f_p}{2\theta}. \tag{3.29}
$$

This inequality is satisfied when one of the following conditions holds true:

(a) $f_q = f_p$

(b) $\rho = 0$

(c) $f_q < f_p$ and $\rho > 0$

(d) $f_q < f_p$ and $\rho < 0$ but $|\rho| < C$

(e) $f_q > f_p$ and $\rho < 0$

(f) $f_q > f_p$ and $0 < \rho < C$

where

$$
C = \frac{f_{p+q} - f_p}{2\theta\,|f_q - f_p|}. \tag{3.30}
$$

Under any one of the above conditions, we thus observe that using all the available observations on X and Y is a better strategy than using all the available observations on X only and discarding those for which Y values are available but X values are missing.
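
The six cases (a) to (f) are all instances of the single inequality (3.29), so in practice it may be simpler to check that inequality directly. A minimal Python sketch follows; the names `r4_beats_r2` and `threshold_C` and the argument names (`rho` for the correlation coefficient, `theta` for $C_x/C_y$, `fp`, `fq`, `fpq` for $f_p$, $f_q$, $f_{p+q}$) are hypothetical.

```python
def r4_beats_r2(rho, theta, fp, fq, fpq):
    """True when inequality (3.29) holds, i.e. r4 has the smaller
    approximate relative MSE than r2."""
    return rho * (fq - fp) < (fpq - fp) / (2 * theta)


def threshold_C(theta, fp, fq, fpq):
    """The quantity C of (3.30); undefined when fq == fp (case (a))."""
    if fq == fp:
        raise ValueError("C is undefined when fq == fp")
    return (fpq - fp) / (2 * theta * abs(fq - fp))
```

For instance, with the invented values theta = 1.25, fp = 0.05, fq = 0.075 and fpq = 0.08, the threshold C is about 0.48, so a correlation of 0.6 fails (3.29) while a correlation of 0.3 satisfies it, in line with case (f).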

Just the opposite is true, i.e., using all those observations for which X values are available is more appropriate than using the entire set of observations, when the inequality (3.29) holds true with a reversed sign. This can happen when the magnitude of $\rho$ is greater than C, provided that $f_q < f_p$ for negative values of $\rho$ and $f_q > f_p$ for positive values of $\rho$.

Similarly, comparing the expressions (3.19) and (3.20), it is interesting to find that $r_4$ is invariably more efficient than $r_3$, implying that it is always beneficial to use all the available observations, complete or incomplete.

It may be observed that quantities like $\rho$ and $\theta$ are involved in the conditions for the superiority of one estimator over the other, rendering them hard to verify in any given application. However, some prior information about such parameters is quite frequently available. For instance, $\rho$ (the correlation coefficient) and $\theta$ (the ratio of coefficients of variation) may exhibit sufficient stability in repeated surveys and/or other similar experimental studies. Sometimes theoretical considerations may be able to specify their values fairly well. Often pilot surveys and other preliminary investigations may shed enough light not only on the sign of $\rho$ but also on the magnitudes of $\rho$ and $\theta$. Such prior information can then be fruitfully utilized to check the conditions for the superiority of one estimator over the other.

4 Some remarks

We have considered the problem of estimating the ratio of population means when observations on some selected units in a sample drawn according to simple random sampling without replacement are missing on either the X characteristic or the Y characteristic, but not on both of them simultaneously. Accordingly, we have formulated four simple estimators for the population ratio. The first estimator is essentially the ratio of sample means employing all the complete pairs of observations and ignoring the rest. The second estimator uses the incomplete pairs in which only X values are available, in addition to the complete cases, while the third estimator utilizes the incomplete pairs in which only Y values are available, besides the complete cases. The fourth estimator is based on all the complete as well as incomplete pairs of observations. Performance properties of the four estimators are analyzed with respect to the bias and mean squared error criteria using large sample theory, and conditions are obtained for the superiority of one estimator over the other. Such an exercise has shed light on the role of missingness of observations in the ratio method of estimation.

When the population mean $\bar{X}$ is known, one can straightforwardly develop estimators for the population mean $\bar{Y}$, and the role of the X characteristic in improving the estimation of $\bar{Y}$ can be examined by comparing the performance properties of these estimators with the estimators $\bar{y}$ and $[(n-p-q)\bar{y} + q\bar{y}^{**}]/(n-p)$. Similar investigations can be carried out when some other procedure, like the product method of estimation, is followed. Extending the results to more than two characteristics with varying patterns of missingness will be an interesting exercise.

Appendix

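Let us define

$$
u = \frac{\bar{x} - \bar{X}}{\bar{X}}, \qquad
v = \frac{\bar{y} - \bar{Y}}{\bar{Y}}, \qquad
u^{*} = \frac{\bar{x}^{*} - \bar{X}}{\bar{X}}, \qquad
v^{**} = \frac{\bar{y}^{**} - \bar{Y}}{\bar{Y}}
$$

so that

$$
\frac{r_1 - R}{R} = \frac{v - u}{1 + u} = (v - u)\left(1 - u + \frac{u^2}{1 + u}\right) \tag{A.1}
$$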

whence the relative bias and relative mean squared error to order $O(n^{-1})$ only are given by

$$
RB(r_1) = E(v) - E(u) + E(u^2) - E(uv) \tag{A.2}
$$

$$
RMSE(r_1) = E(v^2) + E(u^2) - 2E(uv). \tag{A.3}
$$

Now we observe that

$$
\begin{aligned}
E(v) &= E_1 E(v \mid p, q) = 0 \\
E(v^2) &= E_1 E(v^2 \mid p, q) \\
&= E_1\left[\left(\frac{1}{n-p-q} - \frac{1}{N}\right)C_y^2\right] \\
&= C_y^2 f_{p+q}.
\end{aligned}
$$

Similarly, we have

$$
\begin{aligned}
E(u) &= 0 \\
E(u^2) &= C_x^2 f_{p+q} \\
E(uv) &= \rho C_x C_y f_{p+q}.
\end{aligned}
$$

Substituting these results in (A.2) and (A.3), we obtain the results (3.5) and (3.17).

Proceeding in the same manner, we can express

$$
\frac{r_2 - R}{R} = \left[v - \frac{(n-p-q)u + pu^{*}}{n-q}\right]\left[1 - \frac{(n-p-q)u + pu^{*}}{n-q}\right] + O_p(n^{-3/2}) \tag{A.4}
$$

whence up to order $O(n^{-1})$, we have

$$
\begin{aligned}
RB(r_2) &= E(v) - E\left[\frac{(n-p-q)u + pu^{*}}{n-q}\right] + E\left[\frac{(n-p-q)u + pu^{*}}{n-q}\right]^2 \\
&\quad - E\left[\left(1 - \frac{p}{n-q}\right)uv\right] - E\left[\left(\frac{p}{n-q}\right)u^{*}v\right] \\
&= C_x^2 f_q - \rho C_x C_y E_1\left[\left(1 - \frac{p}{n-q}\right)\left(\frac{1}{n-p-q} - \frac{1}{N}\right)\right] \\
&= C_x^2 f_q - \rho C_x C_y f_q
\end{aligned}
\tag{A.5}
$$

to the order of our approximation. This leads to the result (3.6).

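Similarly, from (A.4), we have

$$
\begin{aligned}
RMSE(r_2) &= E\left[v - \left(1 - \frac{p}{n-q}\right)u - \left(\frac{p}{n-q}\right)u^{*}\right]^2 \\
&= E(v^2) + E\left[\left(1 - \frac{p}{n-q}\right)^2 u^2\right] + E\left[\left(\frac{p}{n-q}\right)^2 u^{*2}\right] \\
&\quad - 2E\left[\left(1 - \frac{p}{n-q}\right)uv\right] - 2E\left[\left(\frac{p}{n-q}\right)u^{*}v\right] \\
&\quad + 2E\left[\left(1 - \frac{p}{n-q}\right)\left(\frac{p}{n-q}\right)uu^{*}\right] \\
&= C_y^2 f_{p+q} + C_x^2 E_1\left[\left(1 - \frac{p}{n-q}\right)^2\left(\frac{1}{n-p-q} - \frac{1}{N}\right)\right] \\
&\quad + C_x^2 E_1\left[\left(\frac{p}{n-q}\right)^2\left(\frac{1}{p} - \frac{1}{N}\right)\right] \\
&\quad - 2\rho C_x C_y E_1\left[\left(1 - \frac{p}{n-q}\right)\left(\frac{1}{n-p-q} - \frac{1}{N}\right)\right] \\
&= C_y^2 f_{p+q} + C_x^2 f_q - 2\rho C_x C_y f_q
\end{aligned}
\tag{A.6}
$$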

dropping the terms of higher order of smallness than $n^{-1}$. This provides the result (3.18).

Next, observing that the relative estimation error of $r_3$ to order $O_p(n^{-1})$ is given by

$$
\frac{r_3 - R}{R} = \left[\frac{(n-p-q)v + qv^{**}}{n-p} - u\right](1 - u) \tag{A.7}
$$

the results (3.7) and (3.19) can be easily derived in the same way as indicated in the case of $r_2$.

Finally, let us consider the relative estimation error of $r_4$ to order $O_p(n^{-1})$, which can be expressed as

$$
\frac{r_4 - R}{R} = \left[\frac{(n-p-q)v + qv^{**}}{n-p} - \frac{(n-p-q)u + pu^{*}}{n-q}\right]\left[1 - \frac{(n-p-q)u + pu^{*}}{n-q}\right] \tag{A.8}
$$

from which the expressions (3.8) and (3.20) for the relative bias and the relative mean squared error to order $O(n^{-1})$ can be easily found.

Acknowledgements. The authors express thanks to the referee for some helpful comments.

References

[1] Cochran WG (1977) Sampling techniques. Wiley, New York

[2] Sukhatme PV, Sukhatme BV, Sukhatme S (1984) Sampling theory of surveys with applications. Iowa State University Press, Iowa

[3] Tracy DS, Osahan SS (1994) Random non-response on study variable versus on study as well as auxiliary variables. Statistica 54:163–168