Analysis of Grouped Data from Multinomial Populations

Analysis of Grouped Data from Multinomial PopulationsAuthor(s): Khursheed AlamSource: Journal of the American Statistical Association, Vol. 79, No. 385 (Mar., 1984), pp. 132-135Published by: American Statistical AssociationStable URL: http://www.jstor.org/stable/2288347 .

Accessed: 14/06/2014 15:46

Your use of the JSTOR archive indicates your acceptance of the Terms & Conditions of Use, available at .http://www.jstor.org/page/info/about/policies/terms.jsp

.JSTOR is a not-for-profit service that helps scholars, researchers, and students discover, use, and build upon a wide range ofcontent in a trusted digital archive. We use information technology and tools to increase productivity and facilitate new formsof scholarship. For more information about JSTOR, please contact [email protected].

.

American Statistical Association is collaborating with JSTOR to digitize, preserve and extend access to Journalof the American Statistical Association.

http://www.jstor.org

This content downloaded from 185.2.32.21 on Sat, 14 Jun 2014 15:46:54 PMAll use subject to JSTOR Terms and Conditions

http://www.jstor.org/action/showPublisher?publisherCode=astata

http://www.jstor.org/stable/2288347?origin=JSTOR-pdf

http://www.jstor.org/page/info/about/policies/terms.jsp


Analysis of Grouped Data From Multinomial Populations

KHURSHEED ALAM*

This article deals with a problem in which n independent choices are made of r out of m factors, resulting in a multinomial distribution with (m) cells. The interest centers on the number of times each factor is chosen. This leads to a grouping of the cell counts into m overlapping groups. As the groups overlap, the distribution of the group counts is not multinomial. A large sample theory is given for testing various hypotheses regarding the grouped-cell probabilities, similar to the hypotheses that arise for testing homogeneity of parallel samples in a contingency table.

KEY WORDS: Contingency table; Quadratic form in multinomial variables.

1. INTRODUCTION

The multinomial distribution arises in various statistical problems, such as the analysis of contingency tables and tests for goodness of fit. The usual tests of homogeneity in parallel samples or the independence of cross- classification in a contingency table are based on the asymptotic properties of the multinomial distribution and the resulting chi-squared statistics. In this article we consider a problem in which n independent choices are made of r out of m factors, resulting in a multinomial distribution with (m) cells. The interest centers on the number of times each factor is chosen. This leads to a grouping of the cell counts into m overlapping groups. As the groups overlap, the distribution of the group counts is not multinomial. However, a large sample theory leads to certain chi-squared statistics for testing various hypotheses regarding the grouped-cell probabilities, similar to the hypotheses that arise for testing homogeneity of parallel samples in a contingency table.

The following example is an illustration of the problem under consideration. Suppose that there are m candidates contesting for r seats in an election. Each voter votes independently for r different candidates. There are k = (m) difference choices for each voter to vote for. The vote counts for the different choices from a sample of n voters are jointly distributed according to a multinomial distribution with k cells. It may be of primary interest in this example to consider the total number of votes cast for the individual candidates. This would lead to a group-

* Khursheed Alam is Professor, Department of Mathematical Sci- ences, Clemson University, Clemson, SC 29631. This work was sup- ported by the Office of Naval Research under Contract N00014-75-C- 0451.

ing of the cell counts into m groups. Note that the groups overlap when r > 1, and that the value of k would often be very large; for example, if m = 10 and r = 3, then k = 120. Therefore, if n is not very large, some of the cell counts would be very small. In fact, some of the counts would be almost nil even for large n if a certain combination of the candidates seems undesirable, even though the candidates making that combination may be individually highly favored by the voters.

A slight variation of this example would be the situation in which the construction of certain facilities, such as a parking lot, a library, or a recreational facility may be under consideration by the residents of a city. Each res- ident may vote for a pair of facilities. The two facilities that are individually favored by the largest number of voters may be finally selected for the construction.

In Section 2 we derive the chi-squared statistics for testing various hypotheses from the grouped data. In Sec- tion 3 we apply the given results to the data collected from a sample survey. Some results are given in the Ap- pendix on the asymptotic distribution of a quadratic form in multinomial variables. These results are used in Sec- tion 2.

2. ANALYSIS OF GROUPED DATA Consider a multinomial population with k = (m) cells,

where each cell is associated with r out of m given factors. In the example given in the preceding section, the factors represent the candidates running for the election. Let the k cells be indexed by r suffixes (i,j,. . . , u), representing the associated factors. Let Pij...u denote the probability associated with the (i, j, . . . , u)th cell, and let xij...u denote the corresponding cell count in a sample of size n. Let

Pij. = Pijl...u and

Xxj. = xiI...u,

where the summation E extends over all values of 1,. u. Similarly, let

Pi. = Pij...u

and

Xi. = ...u

? Journal of the American Statistical Association March 1984, Volume 79, Number 385

Theory and Methods Section

132



Alam: Grouped Data From Multinomial Populations 133

where the summation E extends over all values of]j, .... u. Clearly, xi. is distributed according to the binomial distribution B(n, pi.).

Let x = (xl., . . . , x,.)'. The moment generating func- tion of x is given by

M(t) = E exp(t'x)

= ( Pij...u exp(ti + * + tu)n (2.1)

where t = (t1, . . , tY)'. Using (2.1) it is easy to show that x is asymptotically distributed for large n according to a multivariate normal distribution N(np, nfl) where p = (p 1. p m * , pm.)' and

Q =Q - pp' (2.2)

with Q = (qij) being given by qii = pi., qij = pij. (i 4

j). It is important to note that the large sample approximation of the distribution of x is valid if npi. is moderately large for each i = 1, . . . , m, even though some of the cell probabilities Pi... u may be very small.

Let e = (1, . . . , 1)'. It is seen that p'e = r and Qe = rp, so that fle = 0, the null vector. Hence, the covariance matrix fl is singular. It is shown in the Appendix that Q is nonsingular. We have

p' Q'p = (p'e)lr= 1. Therefore,

flQ'= Q - pp =fl.

Thus Q` is a generalized inverse of Q. This is a useful result. It follows that n(x - p)' Q` (x - p) is asymptotically distributed for large values of n as Xm - 12 (chi- squared with (m - 1) degrees of freedom), where x = x'n.

The nonsingularity of Q follows from the fact that it is possible to choose all combinations of r factors from the given m factors. It may be possible to apply the results when the choice is constrained (see Remark, Appendix).

When Q is unknown it is estimated by Q = (qij), where qii = xi.In and qij = xij.1n (i =# j). Since Q approaches Q as n -> oo, the statistic n(xc - p)' Q` (x - p) is asymptotically distributed as Xm- 12 as ni x. Therefore, the statistic T('" = n(x - po)' Q 1 (x7 - po) can be used to test the hypothesis H"') that p has a given value po. Note that for r = 1 when there is no grouping of the multinomial data, Q reduces to a diagonal matrix whose diagonal elements are the cell probabilities, in which case n(xc - p?)' Q1l (x - p?) represents the familiar Pearson's chi- squared statistic used for testing the hypothesis that p = p0. The statistic TV') is asymptotically distributed according to a noncentral chi-squared distribution when p # p0 and p - p0 = 7lQ for some vector C. Therefore, we reject H(') for large values of T()

Now consider the two-sample case for a test of homogeneity in parallel samples. Let mrr and X2 be two multinomial populations, each with k = (f') cells, where each cell is associated with r out of m given factors in both the populations. Let the corresponding cell probabilities be

denoted by Pij...U1() and p - - (2) for the two populations. We extend the notation used in the one-sample case to the two-sample case and the multisample case discussed next, in the obvious manner. Consider the hypothesis

H(2) p(I) (2)

Given a sample drawn from i, with n = n 1 and a sample from '2 with n = n2, let xij...ul) and xij...u(2) denote the corresponding cell counts. As (n2Q(l) + n1 Q(2))-I is a generalized inverse of (n2 l(') + n1 fl(2)), given H(2), it follows that the statistic

T(2) = 2(x(I) -x2) nQ1+nQ(2)) - I (xC-(1)-x()

is asymptotically distributed for large values of n 1 and n2 as Xm- 12, under H 2), where Q(1) and Q(2) are estimates of Q(l) and Q(2), respectively, derived similarly as Q for the one-sample case. We reject H(2) for large values of T(2). The statistic T(2) is asymptotically distributed according to a noncentral chi-squared distribution if p(1) -

p(2) = (n2fi(l) + nl1f2(2)); for some vector C. Next, consider the multisample case. Let a,, . . .,s

be s multinomial populations with k = (m) cells. First, consider the hypothesis

H(s) p(l = = p, Q() = ... = Q(S).

Suppose that a sample of size ni is drawn from 7Ti, i = 1, .. .,s. Let

is

n i=1

where n = n1 + + n,. From the result (A.3) of the Appendix it follows that the statistic

s

-s) = n niOc-") - y)'(n,Q i = 1

+ + ns (s) (x(i) - Y)

is asymptotically distributed under H(s) as X(m - 1)(s - 12 for large values of the sample numbers n 1, . . , nS. We reject H(s) for large values of T(s). Note that T(s) reduces to T(2) for n1 = n2, s = 2.

Next consider the hypothesis (s) 1 () =s..= )

This is the hypothesis H(s) without the condition Q(1) - *. = Q(s). Let

y = (n1 (Q(I))1 + * + ns (5s)) )

x (n1(Q(')Y 1 x(l) + ... + n5(Q(s) 1

and let s

T, (s) =Eni xi y)' (Q()-l(()_ iQ= 1

From the result (A.4) of the Appendix it follows that T is asymptotically distributed as Xs(m - 1)2 - v, where v is a random variable stochastically smaller than Xm - 12. We reject Hls for large values of Tls)



134 Journal of the American Statistical Association, March 1984

We have mentioned that the usual analysis of a contingency table is based on the large-sample approximation of the multinomial distribution. An exact test is known for the independence of cross-classification in a contingency table that can be used when the sample size is small. A similar test can be derived for small samples, based on grouped data. Consider an r x s contingency table consisting of r rows and s columns arising from a multinomial population where the individuals of the population can be described as belonging to one of r categories Al, . . . , Ar with respect to an attribute A, and to one of s categories B 1,. . . , Bs with respect to an attribute B. Let mTij and nij denote the probability and the observed frequency associated with the ijth cell in the table. Let

s r s r

7TFf = 'T ij, 7r.j = Tr.j, nfj.= Y nij, n.j = nil j=1 i=1 j=1 i=1

and r s

n.. = f l. = n.j. i=1 j=1

If the attributes A and B are independent then ITij = 7rr. rT.j, and so the conditional probability of the observed cell frequencies given the marginal totals n1., ..., nr.

and n.1, . n. , n.5 is given by

p(n i's, n .'s, nj's) = fn1 Inn.A[flfli]

(2.3)

As the conditional probability (2.3) does not depend on the hypothetical values of wri. and T.i it is possible to derive an exact test of the independence of the attributes (see Rao 1965, Sec. 6d.3). Now, suppose that the r rows and s columns of the table are grouped into r* and s* overlapping groups of rows and columns, such that the marginal totals n I.,. . . , nr. and n. I,. . . , n .s are uniquely determined, given the marginal totals n 1i.,. .n , fr*.* and n *, . . ., n.s** of the observed frequencies in the grouped table, as in the examples given. Then the conditional probability of the observed cell frequencies in the grouped table, given the marginal totals n .*, . . ., nr.** and n.l*, . . ., n.s** is obtained by summing the terms in the square bracket on the right side of (2.3). As this probability does not depend on the hypothetical values of the cell probabilities of the grouped table, exact tests can be derived for a test of independence, based on the observed cell frequencies in the grouped table.

3. SAMPLE DATA ON RECREATION NEEDS

Table 1 gives the results of a sample survey of the recreation needs of the residents of a small university town in South Carolina. A random sample of 343 residents of the town were questioned on the recreation needs of the town. The respondents were divided into two categories, voters and nonvoters. Those indicating that they had

Table 1. Comparison of Voters and Nonvoters: Recreation Needs

Voters Nonvoters (N = 131) (N = 212)

# Choosing # Choosing

1. Baseball Diamonds 29 55 2. Basketball Courts 27 45 3. Supervised Swimming Areas 52 93 4. Bicycle Pathways 51 82 5. Equipped Neighborhood Miniparks 44 71 6. Lake Walkways, Benches, and Picnic 35 59

Tables 7. Preserve Tree and Green Areas 46 68 8. Football and Soccer Fields 8 24 9. Tennis Courts 54 63

10. Indoor Recreation Buildings 36 58 11. Golf Course 11 18

TOTAL 393 636

voted in the last local election were treated as voters and those who had not voted were treated as nonvoters. There were 131 voters and 212 nonvoters in the sample. Each respondent was asked to name his or her top 3 choices from a list of 11 recreation needs. The table gives the number of respondents favoring the different recreation needs, among voters and nonvoters. The table is repro- duced from Falk (1979) with the permission of the author.

We have tested the hypothesis H(2) that p(l) = p-(2)

where p(l) and p(2) are the probability vectors associated with the voters and nonvoters for the choice of the recreation needs. The value of the test statistic T2) came out to be equal to 8.11, approximately. From the chi-squared table one can see that P{X1o2 > 8.2} = .609, approximately. Therefore, we accept H(2). There is no significant difference in the choice of the recreation needs between the voters and nonvoters. The formula for f2) given in Section 2 involves the computation of the covariance mat- rices Q(1) and Q(2), which are not shown here.

APPENDIX

Distribution of a Quadratic Form

Let Zl, . . . , Z, be m-component independent random vectors, where Z/' N(pij, 1j), i = 1, . . ., s. Consider a test of the hypothesis H: [LI = = ps,S the covariances Y.i being known. Given H, a maximum likelihood estimate of the common mean vector is given by

A I E + *@. +, XS- )-l (1 -

Iz, + ...

+ Xs I-1 )

A test statistic for the hypothesis is given by S

t = (Z 1 A )Y X- (Z,i _ ), i = 1

which is distributed as Xm(s-1)2, under the null hypothesis. The distribution of t is noncentral chi-squared when wi $ A for some values of i and j. So we reject H for large values of t.



Alam: Grouped Data From Multinomial Populations 135

If Zi is singular, we substitute for t the statistic s

T = E (Zi - Z)' Yi (Zi - Z) i= 1

s

= Zi' liiZi -Z ; + **+ :Ls)Z (A.1) i = I

where + + Ys)- I Zi + + Es Zs)

and l, denotes a generalized inverse of li, given by i Ywi Yi = Yoi. It is assumed that the matrix (Xl + + Y,) is nonsingular. We consider below the distribution of T under H.

Suppose that H is true. As T is invariant under trans- lation of each Zi by a common vector, we assume without loss of generality that L,i = 0, i = 1, . . . , k. Let 1i denote the rank of Zi, and let 1 = 11 + + ls. From Theorem 4.7.1 of Graybill (1976) it follows that Zi'iZi d xi2. Hence, L I Zi'iZi d X12. As

cov(I,Z, + * + xsZs)

= (1111 + + IsIsis) = A,

we have that

Z'(X1 + * + Ys)Z d Xr2 (A.2)

if (X1 + + X5)- 'A is an idempotent matrix, where r denotes the rank of A. From (A. 1) it follows (see Rao 1965, 3b.4. (iv)) that

TdXi-r2. (A.3)

Clearly, (Xl + * + is)-'A is idempotent if C,l, = = C,Y, for some positive numbers C1, .. , Cs.

For application to the multinomial distribution discussed in the previous section, let us consider the case when A(ij + + is)` is not idempotent. Suppose that

A - (E, + + s) 0 (positive semidefinite),

where A denotes a generalized inverse of A. Then

Z'(Y, + + Ys)Z= (Y,Zl + + 5Z5s)'

x (E, + ... + Es) (1,ZI + + Y*Z5s) ?(Y; Z1 + + sZ5)'A(Y1Z, + + sZs)dXr2

Therefore, I v (A.4)

where v is a nonnegative random variable, stochastically smaller than Xr

Rank of Q We show that the matrix Q given by (2.2) is positive

definite. Let y denote the multinomial vector with k =

(m) components, where m > r - 2, and let x be defined as in Section 2, preceding (2.1). Clearly

x = Ly (A.5)

where L is an m x k matrix whose elements are 0 or 1, such that each row has s = (7 ?j) elements equal to 1 and (k - s) elements equal to 0, and that any two rows have elements equal to 1 in the same column in t = (r-2 ) columns. Now

LL' = M= (Mij)

where Mi, = s and Mjj = t(i $j). As s > t, the matrix M is positive definite. Therefore, Rank(L) = m.

Let nq and nl denote the mean and covariance of y. We assume that no component of q is equal to 0. Let D denote a diagonal matrix whose ith diagonal element is equal to the ith component of q. From the relation

= D - qq'

we have

Q = -cov(x) = LDL' = (Lq)(Lq)' = LDL' - pp'. n

Hence

Q = fl + pp'

= LDL' > 0 (positive definite)

by virtue of the fact that Rank (L) = m. A fortiori, Q is nonsingular.

Remark. The nonsingularity of Q follows from the spe- cial structure of the matrix L in the representation of the vector x in A(5). In the given situation it is possible to choose all combinations of r factors from the m factors. Suppose that the choice is constrained as follows. Let the m factors be grouped into 1 exclusive groups a 1, .... ,rr, say. Let mi denote the number of factors in rri, where ml + + ml = m. Let ri factors be chosen from -ir, assuming that it is possible to choose all combinations of ri factors from the mi factors in wri, where r, + + r, = r. It is easy to see that the matrix Q for the new situation is also nonsingular.

It is interesting to observe that the asymptotic nor- mality of the distribution of x follows from the representation x = L(y) in (A.5).

[Received September 1982. Revised June 1983.]

REFERENCES FALK, E.L. (1979), Citizen Participation in Public Administration, un-

published Ph.D. thesis, University of Georgia, Dept. of Public Ad- ministration.

GRAYBILL, F.A. (1976), Theory and Application of the Linear Model, Belmont, Calif.: Duxbury Press.

RAO, C.R. (1965), Linear Statistical Inference and Its Applications, New York: John Wiley.



Documents

Analysis of Grouped Data from Multinomial Populations