
DISTRIBUTION OF SAMPLE CORRELATION COEFFICIENTS*

Khursheed Alam

Clemson University, Clemson, South Carolina

ABSTRACT

Let (Y, X_1, ..., X_K) be a random vector distributed according to a multivariate normal distribution, where X_1, ..., X_K are considered as predictor variables and Y is the predictand. Let r_i and R_i denote the population and sample correlation coefficients, respectively, between Y and X_i. The population correlation coefficient r_i is a measure of the predictive power of X_i. The author has derived the joint distribution of R_1, ..., R_K and its asymptotic property. The given result is useful in the problem of selecting the most important predictor variable, corresponding to the largest absolute value of r_i.

1. INTRODUCTION

The problem of selecting a variable or several variables from a set of predictor variables {X_i} occurs frequently in the design of experiments. The correlation between a predictor variable X_i and the predictand Y measures the "leverage" of X_i upon Y. If X_i and Y are jointly distributed according to the standard bivariate normal distribution with correlation coefficient r_i, then the conditional distribution of Y given X_i is normal N(r_i X_i, 1 − r_i²). The larger the absolute value of r_i, the smaller is the variance of the conditional distribution, and therefore the higher is the predictive power of X_i. Thus, the predictor variable corresponding to the largest value of r_i² may be considered as the most important (best) predictor variable.
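As a numerical illustration of this point, the following sketch (not taken from the paper; the correlation values are hypothetical) checks that the residual variance of the conditional mean r_i X_i of Y shrinks to 1 − r_i² as |r_i| grows.

```python
import numpy as np

# For a standard bivariate normal pair (X_i, Y) with correlation r, the
# conditional law of Y given X_i = x is N(r x, 1 - r^2); larger |r| means
# a smaller prediction variance and hence a more useful predictor.
rng = np.random.default_rng(0)
n = 100_000
for r in (0.2, 0.6, 0.9):  # hypothetical correlation values
    cov = np.array([[1.0, r], [r, 1.0]])
    x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
    resid = y - r * x      # prediction error of the conditional mean r*x
    print(f"r = {r:.1f}: empirical Var(Y - rX) = {resid.var():.3f}, "
          f"theoretical 1 - r^2 = {1 - r**2:.3f}")
```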

The problem of selecting one or more of the predictor variables which have larger correlations with the predictand than the rest of the variables arises in the test of accuracy of a weapon system. The accuracy may be described by the radial distance between the target and the point of impact of a projectile released by the weapon. There are a number of contributory factors in missing the target. Let X_1, X_2, ... denote the variables measuring the effects of the contributory factors. These variables are positively correlated with Y and among themselves. If the correlation between Y and X_1, say, is much larger than the correlations between Y and X_2, etc., then the factor associated with X_1 may be considered as a major contributory factor in missing the target, which should be looked into for better control of the projectile. Generally, it is expensive to measure the variables, since the measurements involve the destruction of the projectile. Therefore, it is desirable to estimate the correlations between the variables from a sample of observations. The results of this paper, which deals with the joint distribution of the sample correlation coefficients, would be useful in screening a set of predictor variables for the selection of one or more of them on the basis of their correlations with the predictand.

*The author's work was supported by the Office of Naval Research under Contract N00014-75-0451.


The problem of selecting the predictor variable associated with the largest correlation with the predictand has been considered recently by Ramberg [3]. Rizvi and Solomon [4] and Alam, Rizvi, and Solomon [1] have also considered the problem of selecting, from p ≥ 2 multivariate normal populations, the population with the largest multiple correlation between a single variate, classified as the predictand, and the remaining variates.

Let the random vector (Y, X_1, ..., X_K) be distributed according to a multivariate normal distribution. Suppose that a sample of n observations is taken from the given distribution. Let r_i and R_i denote the population and sample correlation coefficients between Y and X_i, respectively. Let r_i* = r_i(1 − r_i²)^{-1/2} and R_i* = R_i(1 − R_i²)^{-1/2}, and write r* = (r_1*, ..., r_K*)′ and R* = (R_1*, ..., R_K*)′. In the following section we derive the distribution of R*. The distribution of R* is expressed in terms of the joint distribution of three independent random variables. The asymptotic distribution of R* for large n is also given.
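For concreteness, here is a minimal sketch (my own helper, not part of the paper) of how R_i and the transformed coefficients R_i* could be computed from a sample; the function name and array layout are assumptions.

```python
import numpy as np

def sample_correlations(y, x):
    """Sample correlations R_i between y (shape (n,)) and each column of x
    (shape (n, K)), together with R_i* = R_i (1 - R_i^2)^(-1/2)."""
    yc = y - y.mean()
    xc = x - x.mean(axis=0)
    r = (xc * yc[:, None]).sum(axis=0) / np.sqrt((yc**2).sum() * (xc**2).sum(axis=0))
    return r, r / np.sqrt(1.0 - r**2)
```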

2. DISTRIBUTION OF R*

Let the symbol =_d mean "distributed as." Without loss of generality we can assume that the variables Y, X_1, ..., X_K are standardized, that is, they are distributed with mean 0 and variance 1. Let r_ij denote the correlation coefficient between X_i and X_j. Let (Y_t, X_{1t}, ..., X_{Kt}) denote the t-th observation in the sample, and let

Ȳ = (1/n) Σ_{t=1}^{n} Y_t,   X̄_i = (1/n) Σ_{t=1}^{n} X_{it},

S² = Σ_{t=1}^{n} (Y_t − Ȳ)²,   V_i = S^{-1} Σ_{t=1}^{n} (Y_t − Ȳ)(X_{it} − X̄_i),

W_i = Σ_{t=1}^{n} (X_{it} − X̄_i)² − V_i².

Then

(2.1)   R_i = V_i (V_i² + W_i)^{-1/2}

and

(2.2)   R_i* = V_i W_i^{-1/2}.
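A quick numerical check of (2.1)-(2.2), written as a sketch under the notation above (the variable names are mine): V_i and W_i computed from a simulated sample reproduce the ordinary sample correlation coefficients.

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 50, 3
y = rng.standard_normal(n)
x = rng.standard_normal((n, K))

yc = y - y.mean()
xc = x - x.mean(axis=0)
S = np.sqrt((yc**2).sum())                       # S^2 = sum (Y_t - Ybar)^2
V = (xc * yc[:, None]).sum(axis=0) / S           # V_i
W = (xc**2).sum(axis=0) - V**2                   # W_i

R = V / np.sqrt(V**2 + W)                        # eq. (2.1)
R_star = V / np.sqrt(W)                          # eq. (2.2)
R_direct = np.array([np.corrcoef(y, x[:, i])[0, 1] for i in range(K)])
print(np.allclose(R, R_direct))                  # True: (2.1) matches the usual R_i
```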

From the theory of linear regression analysis it is seen that W_i =_d (1 − r_i²) χ²_{n−2}, chi-square with n − 2 degrees of freedom, independent of V_i and Y = (Y_1, ..., Y_n)′; that V_i =_d N(r_i S, 1 − r_i²); and that cov(V_i, V_j) = r_ij − r_i r_j, conditionally given Y. Let λ_ij = r_ij − r_i r_j and Λ = (λ_ij). It is also seen that W_i can be represented as the sum of squares of (n − 2) orthogonal linear functions of X_{i1}, ..., X_{in}. That is,

(2.3)   W_i = Σ_{t=1}^{n−2} Z_{it}²,   i = 1, ..., K,

where Z_t = (Z_{1t}, ..., Z_{Kt})′ are identically and independently distributed as N(0, Λ), independent of V_1, ..., V_K and S.

Let T = (T_1, ..., T_K)′ be a random vector distributed as N(0, Λ), independent of S and W = (W_1, ..., W_K)′. Then

(2.4)   R_i* =_d (r_i S + T_i) W_i^{-1/2},   i = 1, ..., K.

Therefore, we have the following theorem.


THEOREM 2.1: The joint distribution of the sample correlation coefficients between the predictand and the predictor variables of a multivariate normal distribution is given by (2.4), where T =_d N(0, Λ), S² =_d χ²_{n−1}, and the distribution of W is given by (2.3). Moreover, S, T and W are jointly independent.
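The following Monte Carlo sketch (illustration only; the correlation values r_i and r_ij are hypothetical) draws S, T and W independently as in Theorem 2.1, forms R_i* through (2.4), and compares the result with R_i* computed from direct samples of (Y, X_1, ..., X_K).

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps, K = 30, 5_000, 2
r = np.array([0.6, 0.3])                      # hypothetical r_i
r_xx = np.array([[1.0, 0.4], [0.4, 1.0]])     # hypothetical r_ij
lam = r_xx - np.outer(r, r)                   # Lambda, with lambda_ii = 1 - r_i^2

# (a) via the representation (2.4): S^2 ~ chi2(n-1), T ~ N(0, Lambda),
#     W_i = sum of (n-2) squared N(0, Lambda) components, all independent.
S = np.sqrt(rng.chisquare(n - 1, size=reps))
T = rng.multivariate_normal(np.zeros(K), lam, size=reps)
Z = rng.multivariate_normal(np.zeros(K), lam, size=(reps, n - 2))
W = (Z**2).sum(axis=1)
R_star_repr = (r * S[:, None] + T) / np.sqrt(W)

# (b) directly from samples of the multivariate normal vector (Y, X_1, X_2)
sigma = np.block([[np.ones((1, 1)), r[None, :]], [r[:, None], r_xx]])
R_star_direct = np.empty((reps, K))
for b in range(reps):
    d = rng.multivariate_normal(np.zeros(K + 1), sigma, size=n)
    rho = np.corrcoef(d, rowvar=False)[0, 1:]
    R_star_direct[b] = rho / np.sqrt(1.0 - rho**2)

print(R_star_repr.mean(axis=0), R_star_direct.mean(axis=0))   # should agree closely
```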

For large n, W is asymptotically distributed as N[(n − 2)f, 2(n − 2)h], by Theorem 4.2.4 of Anderson [2], where f = (1 − r_1², ..., 1 − r_K²)′ and h = (λ_ij²). Therefore,

COROLLARY 2.1: The asymptotic distribution of R* is given by (2.4), where T =_d N(0, Λ), S² =_d χ²_{n−1}, W =_d N[(n − 2)f, 2(n − 2)h], and S, T, and W are jointly independent.
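A small simulation sketch (mine, with a hypothetical Λ) of the normal approximation for W used in Corollary 2.1: W_i = Σ_t Z_{it}² has mean (n − 2)λ_ii and cov(W_i, W_j) = 2(n − 2)λ_ij², matching N[(n − 2)f, 2(n − 2)h] for large n.

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 200, 20_000
lam = np.array([[0.75, 0.10], [0.10, 0.90]])     # a hypothetical Lambda
Z = rng.multivariate_normal(np.zeros(2), lam, size=(reps, n - 2))
W = (Z**2).sum(axis=1)
print(W.mean(axis=0), (n - 2) * np.diag(lam))          # means close to (n-2)f
print(np.cov(W, rowvar=False), 2 * (n - 2) * lam**2)   # covariances close to 2(n-2)h
```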

The following corollary gives the asymptotic distribution of √n (R* − r*), which is derived from (2.4) but follows also from the central limit theorem or Theorem 4.2.4 of Anderson [2].

Let γ_ij = (r_ij − r_i r_j)(1 − r_i²)^{-1/2} (1 − r_j²)^{-1/2} and Γ = (γ_ij²). From (2.4) we have, for large n,

(2.5)   R_i* = r_i (1 − r_i²)^{-1/2} + n^{-1/2} [(1 − r_i²)^{-1/2} T_i + (r_i*/√2)(A − B_i)] + O_p(n^{-1}),   i = 1, ..., K,

where A =_d N(0, 1) and B = (B_1, ..., B_K)′ =_d N(0, Γ). Moreover, T, A and B are jointly independent. Therefore, √n (R* − r*) is asymptotically distributed as N(0, C), where

(2.6)   C_ii = 1 + r_i*²,   C_ij = γ_ij + (1/2) r_i* r_j* (1 + γ_ij²),   i ≠ j.

Therefore, we have the following corollary.

COROLLARY 2.2: For large n, √n (R* − r*) is asymptotically distributed as N(0, C), where C is given by (2.6).
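A sketch of the covariance matrix C in (2.6) as a reusable helper (my own function name, written directly from the formulas above), which could be used with Corollary 2.2 to approximate selection probabilities.

```python
import numpy as np

def asymptotic_cov(r, r_xx):
    """C of Corollary 2.2: C_ii = 1 + r_i*^2 and
    C_ij = gamma_ij + (1/2) r_i* r_j* (1 + gamma_ij^2) for i != j.
    r: population correlations (r_1, ..., r_K); r_xx: matrix of r_ij."""
    r = np.asarray(r, dtype=float)
    s = np.sqrt(1.0 - r**2)
    r_star = r / s
    gamma = (np.asarray(r_xx, dtype=float) - np.outer(r, r)) / np.outer(s, s)
    C = gamma + 0.5 * np.outer(r_star, r_star) * (1.0 + gamma**2)
    np.fill_diagonal(C, 1.0 + r_star**2)
    return C
```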

It is interesting to consider the following special cases: (1) r_i = 0, r_ij = 0, i ≠ j, for all i and j; that is, the variables Y, X_1, ..., X_K are jointly independent. We have C = I and √n (R* − r*) =_d N(0, I), asymptotically. (2) r_i = 0, r_ij = ρ, i ≠ j, for all i and j; that is, the predictor variables X_1, ..., X_K are equi-correlated and independent of Y. We have C_ii = 1, C_ij = ρ, i ≠ j. (3) r_i = ρ, r_ij = 0, i ≠ j, for all i and j; that is, the predictor variables are jointly independent and equi-correlated with Y. We have

C_ii = (1 − ρ²)^{-1}

and

C_ij = ρ² (2ρ² − 1) / (2 (1 − ρ²)³),   i ≠ j.
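A quick consistency check (illustrative; ρ = 0.5 is an arbitrary choice) that the case (3) entries agree with the general expressions in (2.6):

```python
rho = 0.5                                           # hypothetical value
g = -rho**2 / (1 - rho**2)                          # gamma_ij in case (3)
rs2 = rho**2 / (1 - rho**2)                         # r_i* r_j* in case (3)
print(1 + rs2, 1 / (1 - rho**2))                    # C_ii both ways: 1.333...
print(g + 0.5 * rs2 * (1 + g**2),                   # C_ij from (2.6)
      rho**2 * (2 * rho**2 - 1) / (2 * (1 - rho**2)**3))   # closed form: -0.148...
```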

Consider the problem of selecting the best predictor variable. A standard procedure is to select the predictor variable corresponding to the largest of the squared sample correlation coefficients R_1², ..., R_K² or, equivalently, R_1*², ..., R_K*². By Corollary 2.2 the probability of a correct selection can be derived for large n from the multivariate normal distribution function. In special case (1) we have that n·max(R_1*², ..., R_K*²) is distributed as the largest order statistic in a sample of K observations from χ²_1, chi-square with 1 degree of freedom. This result can also be used to test the significance of the correlation between the selected predictor variable and the predictand. Similar results are obtained for cases (2) and (3).
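Following that remark, a minimal sketch (my own; K and α are arbitrary choices) of a large-n critical value for the selected predictor in special case (1): since n·max R_i*² behaves like the largest of K independent χ²_1 variates, a level-α cutoff c solves F_{χ²_1}(c)^K = 1 − α.

```python
from scipy import stats

K, alpha = 5, 0.05
# P(max of K iid chi2_1 <= c) = F(c)^K, so c = F^{-1}((1 - alpha)^(1/K))
c = stats.chi2.ppf((1 - alpha) ** (1.0 / K), df=1)
print(f"declare the selected correlation significant if n * max R_i*^2 > {c:.3f}")
```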

Let r° be a given value of r*. From Corollary 2.2 we have that n (R* − r°)′ C^{-1} (R* − r°) is asymptotically distributed as χ²_K(θ), noncentral chi-square with K degrees of freedom and noncentrality parameter θ = n (r* − r°)′ C^{-1} (r* − r°). Let Ĉ = C(R*, (R_ij)) denote the estimate of C, where R_ij are the sample correlation coefficients between X_i and X_j; Ĉ converges in probability to C as n → ∞. Therefore, the statistic d = n (R* − r°)′ Ĉ^{-1} (R* − r°) can be used to test the hypothesis that r* = r°.
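As a sketch of that test (the helper name and interface are mine; Ĉ would be obtained by evaluating C at the sample correlations), the statistic d is referred to the χ²_K distribution:

```python
import numpy as np
from scipy import stats

def test_r_star(R_star, r0_star, C_hat, n, alpha=0.05):
    """Return d = n (R* - r0)' C_hat^{-1} (R* - r0) and whether it exceeds the
    chi-square(K) critical value, i.e. whether the hypothesis r* = r0 is rejected."""
    diff = np.asarray(R_star, dtype=float) - np.asarray(r0_star, dtype=float)
    d = n * diff @ np.linalg.solve(C_hat, diff)
    return d, d > stats.chi2.ppf(1.0 - alpha, df=diff.size)
```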

REFERENCES

[1] Alam, K., M. H. Rizvi, and H. Solomon, "Selection of Largest Multiple Correlation Coefficients: Exact Sample Size Case," Annals of Statistics 4, 614-620 (1976).

[2] Anderson, T. W., An Introduction to Multivariate Statistical Analysis (Wiley, New York, 1958).

[3] Ramberg, J. S., "Selecting the Best Predictor Variate," Communications in Statistics A6, 1133-1147 (1977).

[4] Rizvi, M. H., and H. Solomon, "Selection of Largest Multiple Correlation Coefficients: Asymptotic Case," Journal of the American Statistical Association 68, 184-188 (1973).