
    19: MULTICOLLINEARITY

Multicollinearity is a problem which occurs if one of the columns of the $X$ matrix is exactly or nearly a linear combination of the other columns. Exact multicollinearity is rare, but could happen, for example, if we include a dummy (0-1) variable for "Male", another one for "Female", and a column of ones.
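As a quick illustration, here is a minimal Python sketch (numpy assumed, with a made-up design matrix) showing that the two dummies plus the column of ones produce an exactly rank-deficient $X$:

```python
import numpy as np

# Hypothetical 0-1 coding: "Male" dummy, "Female" dummy, and a column of ones.
# Since Male + Female = 1 in every row, the column of ones is an exact
# linear combination of the two dummies.
male = np.array([1, 0, 1, 1, 0, 0])
female = 1 - male
X = np.column_stack([np.ones(6), male, female])

print(np.linalg.matrix_rank(X))     # 2, not 3: one column is redundant
print(np.linalg.eigvalsh(X.T @ X))  # smallest eigenvalue of X'X is (numerically) zero
```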

More typically, multicollinearity will be approximate, arising from the fact that our explanatory variables are correlated with each other (i.e. they essentially measure the same thing). For example, if we try to describe consumption in households ($y$) in terms of income ($x_1$) and net worth ($x_2$), then it will be hard to identify the separate effects of $x_1$ and $x_2$ on $y$. The estimated regression coefficients $b_1$ and $b_2$ will be hard to interpret. The variances of $b_1$ and $b_2$ will be very large, so the corresponding $t$-statistics will tend to be insignificant, even though the $F$ for the model as a whole is significant and $R^2$ is high. Further, the coefficient of $x_1$, and the corresponding $t$-statistic, may change dramatically if the seemingly insignificant variable $x_2$ is deleted from the model.


For a numerical example, consider a data set on the monthly sales of backyard satellite antennas ($y$) in nine randomly selected districts, together with the number of households ($x_1$) in the district, and the number of owner-occupied households ($x_2$) in the district. (Both $x_1$ and $x_2$ are measured in units of 10,000 households.) The multiple regression of $y$ on $x_1$ and $x_2$ indicates that neither variable is linearly related to $y$. However, $R^2 = .9279$, and the overall $F$ test is highly significant, indicating that at least one of $x_1$ and $x_2$ is linearly related to $y$.


Satellite Antenna Sales

District    Sales ($y$)    # Households ($x_1$)    # Owner-Occupied Households ($x_2$)
   1            50                14                          11
   2            73                28                          18
   3            32                10                           5
   4           121                30                          20
   5           156                48                          30
   6            98                30                          21
   7            62                20                          15
   8            51                16                          11
   9            80                25                          17
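As a check on the quantities quoted above ($R^2 = .9279$, a highly significant overall $F$, insignificant individual $t$'s), here is a small Python sketch (numpy and statsmodels assumed) that fits the full model to the nine districts in the table:

```python
import numpy as np
import statsmodels.api as sm

# Satellite antenna data from the table above (x1, x2 in units of 10,000 households).
y  = np.array([50, 73, 32, 121, 156, 98, 62, 51, 80])
x1 = np.array([14, 28, 10,  30,  48, 30, 20, 16, 25])
x2 = np.array([11, 18,  5,  20,  30, 21, 15, 11, 17])

# Multiple regression of y on x1 and x2 (with an intercept).
X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.OLS(y, X).fit()

print(fit.rsquared)               # about .928
print(fit.fvalue, fit.f_pvalue)   # overall F test: highly significant
print(fit.tvalues[1:])            # t for x1 and x2: neither is significant
```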


The reason why the results of the two $t$-tests are so different from the result of the $F$-test is that collinearity has destroyed the $t$-tests by strongly reducing their power. The Pearson correlation coefficient between $x_1$ and $x_2$ is $r = .985$, so the two variables are highly collinear. A simple regression of $y$ on $x_1$ gives a $t$-statistic for $b_1$ of 9.35 (highly significant), while a simple regression of $y$ on $x_2$ gives a $t$-statistic for $b_2$ of 8.62 (also highly significant). Note also that the $R^2$ values for these two simple regressions are .9259 and .9139, respectively, both of which are almost as high as the multiple $R^2$ for the full model, .9279.
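Continuing the same sketch (again with numpy and statsmodels assumed), the correlation and the two simple regressions can be checked as follows:

```python
import numpy as np
import statsmodels.api as sm

y  = np.array([50, 73, 32, 121, 156, 98, 62, 51, 80])
x1 = np.array([14, 28, 10,  30,  48, 30, 20, 16, 25])
x2 = np.array([11, 18,  5,  20,  30, 21, 15, 11, 17])

# Pearson correlation between the two explanatory variables (about .985).
print(np.corrcoef(x1, x2)[0, 1])

# Simple regressions of y on x1 alone and on x2 alone: each slope is
# highly significant, and each R^2 is nearly as large as in the full model.
for x in (x1, x2):
    fit = sm.OLS(y, sm.add_constant(x)).fit()
    print(fit.tvalues[1], fit.rsquared)
```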


To get some mathematical insight into the general problem, we use the spectral decomposition (Jobson, p. 576) to write

$$(X'X)^{-1} = \sum_{i=0}^{p} \lambda_i^{-1} \, p_i \, p_i',$$

where the $\lambda_i$ are the eigenvalues of $X'X$ and $P = [p_0, \ldots, p_p]$ is an orthogonal matrix of eigenvectors of $X'X$.

If there is exact multicollinearity, then for some $(p+1) \times 1$ vector $\nu \neq 0$ we must have $X\nu = 0$, so that $\nu$ is an eigenvector of $X'X$ and the corresponding eigenvalue is zero. Therefore, one of the $\lambda_i$ must be zero. In this case, $X'X$ is not invertible, since $(X'X)^{-1}$ would have to satisfy


$$(X'X)^{-1}(X'X)\nu = (X'X)^{-1}\,0 = 0,$$

that is, $\nu = 0$ (since the left-hand side is just $\nu$), which is ruled out by the definition of $\nu$. Our computer will (hopefully) be unable to calculate the least squares estimator $b$, since $b$ is no longer uniquely defined and $(X'X)^{-1}$ does not exist. Due to roundoff and other numerical errors, however, some packages will be able to carry out their calculations without any obvious catastrophe (e.g. dividing by zero), and they will therefore produce output which is completely inappropriate and useless.
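A small numerical sketch (numpy assumed, with an artificial design matrix) of what exact multicollinearity does to $X'X$:

```python
import numpy as np

# Artificial design whose last column is exactly the sum of the two before it,
# so X has an exact linear dependence among its columns.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
X = np.column_stack([np.ones(5), x1, x2, x1 + x2])

XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))   # 3, not 4: X'X is singular
print(np.linalg.eigvalsh(XtX))      # one eigenvalue is zero (up to roundoff)

# Depending on rounding, inv() either raises LinAlgError or quietly returns
# astronomically large, meaningless entries -- the inappropriate and useless
# output described above.
try:
    print(np.linalg.inv(XtX))
except np.linalg.LinAlgError:
    print("X'X could not be inverted")
```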


If there is approximate multicollinearity, then one or more of the $\lambda_i$ will be very close to zero, so that the entries of $(X'X)^{-1} = \sum_{i=0}^{p} \lambda_i^{-1} \, p_i \, p_i'$ will be very large. Since $\mathrm{var}(b_j) = \sigma^2 \, [(X'X)^{-1}]_{jj}$, we see that approximate multicollinearity tends to inflate the estimated variance of $b_j$ for one or more (perhaps all) $j$. As a result, the $t$-statistics will tend to be insignificant. The overall $F$ is not adversely affected by multicollinearity, so it may be significant even if none of the individual $b_j$ is.
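For the satellite antenna data, the eigenvalue spread and the resulting inflated standard errors can be computed directly from the spectral decomposition (a Python/numpy sketch; the variable names are mine):

```python
import numpy as np

# Satellite antenna data from the table above.
y  = np.array([50, 73, 32, 121, 156, 98, 62, 51, 80], dtype=float)
x1 = np.array([14, 28, 10,  30,  48, 30, 20, 16, 25], dtype=float)
x2 = np.array([11, 18,  5,  20,  30, 21, 15, 11, 17], dtype=float)
X  = np.column_stack([np.ones(9), x1, x2])

# Spectral decomposition of X'X: one eigenvalue is tiny relative to the others,
# so lambda_i^{-1} is huge and the entries of (X'X)^{-1} blow up with it.
lam, P = np.linalg.eigh(X.T @ X)
print(lam, lam.max() / lam.min())

XtX_inv = P @ np.diag(1.0 / lam) @ P.T       # sum of lambda_i^{-1} p_i p_i'
b = XtX_inv @ X.T @ y                        # least squares estimates
s2 = np.sum((y - X @ b) ** 2) / (len(y) - X.shape[1])   # estimate of sigma^2
print(np.sqrt(s2 * np.diag(XtX_inv)))        # inflated standard errors of the b_j
```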

It can also be shown that the prediction variance (incurred in "predicting" either the response surface or a future value of $y$ at a particular value of the explanatory variables) will not be disastrously affected by multicollinearity, as long as the entries of the new vector of explanatory variables obey the same approximate multicollinearities as the columns of $X$.
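To illustrate (a sketch under the same numpy assumption; the two evaluation points are hypothetical), compare the quadratic form $x_0'(X'X)^{-1}x_0$, which is the variance of the fitted surface at $x_0$ divided by $\sigma^2$, at a point that respects the relationship between $x_1$ and $x_2$ in the data and at one that does not:

```python
import numpy as np

x1 = np.array([14, 28, 10, 30, 48, 30, 20, 16, 25], dtype=float)
x2 = np.array([11, 18,  5, 20, 30, 21, 15, 11, 17], dtype=float)
X  = np.column_stack([np.ones(9), x1, x2])
XtX_inv = np.linalg.inv(X.T @ X)

# The variance of the estimated response surface at a new point x0 is
# sigma^2 * x0' (X'X)^{-1} x0, so the quadratic form below is what matters.
typical  = np.array([1.0, 30.0, 20.0])  # x2/x1 close to the ratio seen in the data
atypical = np.array([1.0, 30.0,  5.0])  # violates the approximate collinearity

for x0 in (typical, atypical):
    print(x0 @ XtX_inv @ x0)   # small for the typical point, far larger for the other
```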

Keep in mind, though, that multicollinearity often arises because we are trying to use too many explanatory variables. This tends to inflate the prediction variance. (See the handout on model selection.) So, although the effect of multicollinearity on the predictions may not be disastrous, we will still typically be able to improve the quality of the predictions by using fewer variables.


In my opinion, the best remedy for multicollinearity is to use fewer variables. This can be achieved by a combination of thinking about the problem, transformation and combination of variables, and model selection. Two methods of diagnosing multicollinearity in a given data set are: (1) look at the Pearson correlation coefficients of all pairs of explanatory variables; (2) look at the ratio $\lambda_{\max}/\lambda_{\min}$ of the largest to the smallest eigenvalue of $X'X$. For those who insist on working with a multicollinear data set, there are biased estimation techniques (e.g. ridge regression) which may have a lower mean squared error than least squares.
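Both diagnostics, and a bare-bones version of ridge regression, can be sketched for the satellite antenna data as follows (numpy assumed; the shrinkage constant $k$ is an arbitrary illustrative choice, and a careful implementation would standardize the predictors and not penalize the intercept):

```python
import numpy as np

y  = np.array([50, 73, 32, 121, 156, 98, 62, 51, 80], dtype=float)
x1 = np.array([14, 28, 10, 30, 48, 30, 20, 16, 25], dtype=float)
x2 = np.array([11, 18,  5, 20, 30, 21, 15, 11, 17], dtype=float)
X  = np.column_stack([np.ones(9), x1, x2])

# Diagnostic (1): pairwise correlation of the explanatory variables.
print(np.corrcoef(x1, x2)[0, 1])

# Diagnostic (2): ratio of the largest to the smallest eigenvalue of X'X.
lam = np.linalg.eigvalsh(X.T @ X)
print(lam.max() / lam.min())

# Ridge regression in its crudest form: add k*I to X'X before solving.
# This biases the estimates but can reduce their variance (and hence MSE).
k = 1.0
b_ols   = np.linalg.solve(X.T @ X, X.T @ y)
b_ridge = np.linalg.solve(X.T @ X + k * np.eye(X.shape[1]), X.T @ y)
print(b_ols, b_ridge)
```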