19: MULTICOLLINEARITY
Multicollinearity is a problem which occurs if one of the columns of the X matrix is exactly or nearly a linear combination of the other columns. Exact multicollinearity is rare, but could happen, for example, if we include a dummy (0-1) variable for "Male", another one for "Female", and a column of ones.

More typically, multicollinearity will be approximate, arising from the fact that our explanatory variables are correlated with each other (i.e. they essentially measure the same thing). For example, if we try to describe consumption in households (y) in terms of income (x_1) and net worth (x_2), then it will be hard to identify the separate effects of x_1 and x_2 on y. The estimated regression coefficients b_1 and b_2 will be hard to interpret. The variance of b_1 and b_2 will be very large, so the corresponding t-statistics will tend to be insignificant, even though the F for the model as a whole is significant and R² is high. Further, the coefficient of x_1, and the corresponding t-statistic, may change dramatically if the seemingly insignificant variable x_2 is deleted from the model.
For a numerical example, consider a data set on the monthly sales of backyard satellite antennas (y) in nine randomly selected districts, together with the number of households (x_1) in the district, and the number of owner-occupied households (x_2) in the district. (Both x_1 and x_2 are measured in units of 10,000 households.) The multiple regression of y on x_1 and x_2 indicates that neither variable is linearly related to y. However, R² = .9279, and the overall F test is highly significant, indicating that at least one of x_1 and x_2 is linearly related to y.
Satellite Antenna Sales

District    Sales (y)    # Households (x_1)    # Owner-Occupied Households (x_2)
    1           50              14                        11
    2           73              28                        18
    3           32              10                         5
    4          121              30                        20
    5          156              48                        30
    6           98              30                        21
    7           62              20                        15
    8           51              16                        11
    9           80              25                        17
The reason why the results of the two t-tests are so different from the result of the F-test is that collinearity has destroyed the t-tests by strongly reducing their power. The Pearson correlation coefficient between x_1 and x_2 is r = .985, so the two variables are highly collinear. A simple regression of y on x_1 gives a t-statistic for b_1 of 9.35 (highly significant), while a simple regression of y on x_2 gives a t-statistic for b_2 of 8.62 (also highly significant). Note also that the R² values for these two simple regressions are .9259 and .9139, respectively, both of which are almost as high as the multiple R² for the full model, .9279.
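To make these numbers concrete, here is a minimal numpy sketch that refits the three regressions on the antenna data from the table above. The helper fit_ols is ad hoc, written only for this illustration; its output should reproduce the t-statistics and R² values quoted in the text.

```python
import numpy as np

# Antenna data from the table above (y, x1, x2 per district).
y  = np.array([50, 73, 32, 121, 156, 98, 62, 51, 80], dtype=float)
x1 = np.array([14, 28, 10,  30,  48, 30, 20, 16, 25], dtype=float)
x2 = np.array([11, 18,  5,  20,  30, 21, 15, 11, 17], dtype=float)

def fit_ols(y, *cols):
    """OLS with an intercept: t-statistics (excluding the intercept), R^2, overall F."""
    X = np.column_stack([np.ones_like(y)] + list(cols))
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    resid = y - X @ b
    s2 = resid @ resid / (n - k)            # estimate of sigma_u^2
    se = np.sqrt(s2 * np.diag(XtX_inv))     # sqrt of var(b_j) = sigma_u^2 (X'X)^{-1}_jj
    r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
    f = (r2 / (k - 1)) / ((1.0 - r2) / (n - k))
    return (b / se)[1:], r2, f

for label, cols in [("y on x1, x2", (x1, x2)), ("y on x1", (x1,)), ("y on x2", (x2,))]:
    t, r2, f = fit_ols(y, *cols)
    print(f"{label:12s} t = {np.round(t, 2)}  R^2 = {r2:.4f}  F = {f:.1f}")
```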
To get some mathematical insight into the general problem, we use the spectral decomposition (Jobson, p. 576) to write

    (X'X)⁻¹ = Σ_{i=0}^{p} λ_i⁻¹ p_i p_i′ ,

where the λ_i are the eigenvalues of X'X and P = [p_0, . . . , p_p] is an orthogonal matrix of eigenvectors of X'X.
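As a quick numerical check of this identity (a sketch, not part of the handout), we can eigendecompose X'X for the antenna data with numpy and reassemble (X'X)⁻¹ from the λ_i⁻¹ p_i p_i′ terms:

```python
import numpy as np

y  = np.array([50, 73, 32, 121, 156, 98, 62, 51, 80], dtype=float)
x1 = np.array([14, 28, 10,  30,  48, 30, 20, 16, 25], dtype=float)
x2 = np.array([11, 18,  5,  20,  30, 21, 15, 11, 17], dtype=float)
X = np.column_stack([np.ones_like(y), x1, x2])        # columns: intercept, x1, x2

XtX = X.T @ X
lam, P = np.linalg.eigh(XtX)                          # eigenvalues lam_i, eigenvector columns p_i

# Reassemble (X'X)^{-1} as the sum of lam_i^{-1} * p_i p_i' and compare with a direct inverse.
inv_from_spectral = sum((1.0 / l) * np.outer(p, p) for l, p in zip(lam, P.T))
print(np.allclose(inv_from_spectral, np.linalg.inv(XtX)))   # True
print(lam)    # the smallest eigenvalue is orders of magnitude below the largest
```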
If there is exact multicollinearity, then for some (p+1)×1 vector ν ≠ 0, we must have Xν = 0, so that ν is an eigenvector of X'X, and the corresponding eigenvalue is zero. Therefore, one of the λ_i must be zero. In this case, X'X is not invertible, since (X'X)⁻¹ would have to satisfy

    (X'X)⁻¹ (X'X) ν = 0 ,

that is, ν = 0, which is ruled out by the definition of ν. Our computer will (hopefully) be unable to calculate the least squares estimator b, since b is no longer uniquely defined, and (X'X)⁻¹ does not exist. Due to roundoff and other numerical errors, however, some packages will be able to carry out their calculations without any obvious catastrophe (e.g. dividing by zero), and therefore they will produce output which will be completely inappropriate and useless.
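A tiny sketch of the exact case described at the start of this chapter (with made-up responses, for illustration only): with a column of ones plus dummies for both "Male" and "Female", the columns of X are exactly linearly dependent, X'X has a zero eigenvalue, np.linalg.inv would raise an error, yet a least-squares routine still returns one of the infinitely many solutions.

```python
import numpy as np

male   = np.array([1, 1, 0, 0, 1, 0], dtype=float)
female = 1.0 - male                        # male + female equals the column of ones
X = np.column_stack([np.ones(6), male, female])
y = np.array([3.0, 2.5, 4.0, 3.8, 2.9, 4.1])   # made-up responses

XtX = X.T @ X
print(np.linalg.eigvalsh(XtX))             # one eigenvalue is (numerically) zero
# np.linalg.inv(XtX) raises LinAlgError ("Singular matrix"), but lstsq still
# returns *a* solution -- one of infinitely many, and easy to misinterpret.
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b)
```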
If there is approximate multicollinearity, then one or more of the λ_i will be very close to zero, so that the entries of

    (X'X)⁻¹ = Σ_{i=0}^{p} λ_i⁻¹ p_i p_i′

will be very large. Since var(b_j) = σ_u² [(X'X)⁻¹]_{jj}, we see that approximate multicollinearity tends to inflate the estimated variance of b_j for one or more (perhaps all) j. As a result, the t-statistics will tend to be insignificant. The overall F is not adversely affected by multicollinearity, so it may be significant even if none of the individual b_j is.
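To see the inflation numerically on the antenna data (a sketch using the table values above), compare the diagonal of (X'X)⁻¹ with and without the term belonging to the smallest eigenvalue; most of the magnitude typically comes from that one near-zero λ_i.

```python
import numpy as np

y  = np.array([50, 73, 32, 121, 156, 98, 62, 51, 80], dtype=float)
x1 = np.array([14, 28, 10,  30,  48, 30, 20, 16, 25], dtype=float)
x2 = np.array([11, 18,  5,  20,  30, 21, 15, 11, 17], dtype=float)
X = np.column_stack([np.ones_like(y), x1, x2])

lam, P = np.linalg.eigh(X.T @ X)           # lam is sorted ascending; lam[0] is the culprit
terms = [(1.0 / l) * np.outer(p, p) for l, p in zip(lam, P.T)]

full    = sum(terms)                       # (X'X)^{-1}; var(b_j) = sigma_u^2 * full[j, j]
trimmed = sum(terms[1:])                   # the same sum with the smallest-eigenvalue term dropped
print("diag with all terms:   ", np.round(np.diag(full), 4))
print("diag without smallest: ", np.round(np.diag(trimmed), 4))
```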
It can also be shown that the prediction variance (incurred in "predicting" either the response surface or a future value of y at a particular value of the explanatory variables) will not be disastrously affected by multicollinearity, as long as the entries of the given vector of explanatory values obey the same approximate multicollinearities as the columns of X.
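As a rough numerical illustration of this point (a sketch, not from the handout): up to the factor σ_u², the variance of the estimated response surface at a point x_0 is x_0'(X'X)⁻¹x_0, which stays modest when x_0 follows the same x_1-x_2 pattern as the observed districts and grows sharply when it does not. The two test points below are arbitrary choices made for this illustration.

```python
import numpy as np

y  = np.array([50, 73, 32, 121, 156, 98, 62, 51, 80], dtype=float)
x1 = np.array([14, 28, 10,  30,  48, 30, 20, 16, 25], dtype=float)
x2 = np.array([11, 18,  5,  20,  30, 21, 15, 11, 17], dtype=float)
X = np.column_stack([np.ones_like(y), x1, x2])
XtX_inv = np.linalg.inv(X.T @ X)

def rel_pred_var(x0):
    """Variance of the fitted response surface at x0, divided by sigma_u^2
    (add 1 to this quantity for predicting a future observation)."""
    x0 = np.asarray(x0, dtype=float)
    return x0 @ XtX_inv @ x0

print(rel_pred_var([1, 24, 16]))   # x2 about two-thirds of x1, like the observed districts: modest
print(rel_pred_var([1, 24, 2]))    # breaks the x1-x2 relationship: much larger
```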
Keep in mind, though, that multicollinearity often arises because we are trying to use too many explanatory variables. This tends to inflate the prediction variance. (See the handout on model selection.) So, although the effect of multicollinearity on the predictions may not be disastrous, we will still typically be able to improve the quality of the predictions by using fewer variables.
In my opinion, the best remedy to multicollinearity is to use fewer variables. This can be achieved by a combination of thinking about the problem, transformation and combination of variables, and model selection. Two methods of diagnosing multicollinearity in a given data set are: (1) Look at the Pearson correlation coefficients of all pairs of explanatory variables; (2) Look at the ratio λ_Max/λ_Min of the largest to the smallest eigenvalue of (X'X)⁻¹.
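Both diagnostics are one-liners in numpy (a sketch; note that the max/min eigenvalue ratio is the same number whether it is computed from X'X or from (X'X)⁻¹):

```python
import numpy as np

y  = np.array([50, 73, 32, 121, 156, 98, 62, 51, 80], dtype=float)
x1 = np.array([14, 28, 10,  30,  48, 30, 20, 16, 25], dtype=float)
x2 = np.array([11, 18,  5,  20,  30, 21, 15, 11, 17], dtype=float)

# Diagnostic (1): pairwise Pearson correlations of the explanatory variables.
print(np.corrcoef(x1, x2)[0, 1])                  # close to 1 => highly collinear

# Diagnostic (2): ratio of the largest to the smallest eigenvalue.
X = np.column_stack([np.ones_like(y), x1, x2])
lam = np.linalg.eigvalsh(X.T @ X)
print(lam.max() / lam.min())                      # a huge ratio signals multicollinearity
```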
For those who insist on working with a multicollinear data set, there are biased estimation techniques (e.g. ridge regression) which may have a lower mean squared error than least squares.
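For completeness, a bare-bones ridge sketch: the estimator replaces (X'X)⁻¹X'y with (X'X + kI)⁻¹X'y, which tames the inflated variances at the price of some bias. The penalty values below are arbitrary illustrations; a serious implementation would typically standardize the variables and leave the intercept unpenalized.

```python
import numpy as np

y  = np.array([50, 73, 32, 121, 156, 98, 62, 51, 80], dtype=float)
x1 = np.array([14, 28, 10,  30,  48, 30, 20, 16, 25], dtype=float)
x2 = np.array([11, 18,  5,  20,  30, 21, 15, 11, 17], dtype=float)
X = np.column_stack([np.ones_like(y), x1, x2])

def ridge(X, y, k):
    """Ridge estimator b(k) = (X'X + kI)^{-1} X'y; k = 0 gives least squares."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)

for k in (0.0, 1.0, 10.0):
    print(k, np.round(ridge(X, y, k), 3))
```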