
    19: MULTICOLLINEARITY

Multicollinearity is a problem which occurs if one of the columns of the $X$ matrix is exactly or nearly a linear combination of the other columns. Exact multicollinearity is rare, but could happen, for example, if we include a dummy (0-1) variable for "Male", another one for "Female", and a column of ones.
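As a quick illustration, here is a minimal Python sketch (numpy assumed, with a made-up design matrix) showing that the two dummies plus the column of ones produce an exactly rank-deficient $X$:

```python
import numpy as np

# Hypothetical 0-1 coding: "Male" dummy, "Female" dummy, and a column of ones.
# Since Male + Female = 1 in every row, the column of ones is an exact
# linear combination of the two dummies.
male = np.array([1, 0, 1, 1, 0, 0])
female = 1 - male
X = np.column_stack([np.ones(6), male, female])

print(np.linalg.matrix_rank(X))     # 2, not 3: one column is redundant
print(np.linalg.eigvalsh(X.T @ X))  # smallest eigenvalue of X'X is (numerically) zero
```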

More typically, multicollinearity will be approximate, arising from the fact that our explanatory variables are correlated with each other (i.e. they essentially measure the same thing). For example, if we try to describe consumption in households ($y$) in terms of income ($x_1$) and net worth ($x_2$), then it will be hard to identify the separate effects of $x_1$ and $x_2$ on $y$. The estimated regression coefficients $b_1$ and $b_2$ will be hard to interpret. The variances of $b_1$ and $b_2$ will be very large, so the corresponding $t$-statistics will tend to be insignificant, even though the $F$ for the model as a whole is significant and $R^2$ is high. Further, the coefficient of $x_1$, and the corresponding $t$-statistic, may change dramatically if the seemingly insignificant variable $x_2$ is deleted from the model.


For a numerical example, consider a data set on the monthly sales of backyard satellite antennas ($y$) in nine randomly selected districts, together with the number of households ($x_1$) in the district, and the number of owner-occupied households ($x_2$) in the district. (Both $x_1$ and $x_2$ are measured in units of 10,000 households.) The multiple regression of $y$ on $x_1$ and $x_2$ indicates that neither variable is linearly related to $y$. However, $R^2 = .9279$, and the overall $F$ test is highly significant, indicating that at least one of $x_1$ and $x_2$ is linearly related to $y$.


Satellite Antenna Sales

District    Sales ($y$)    # Households ($x_1$)    # Owner-Occupied Households ($x_2$)
   1            50                14                          11
   2            73                28                          18
   3            32                10                           5
   4           121                30                          20
   5           156                48                          30
   6            98                30                          21
   7            62                20                          15
   8            51                16                          11
   9            80                25                          17
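As a check on the quantities quoted above ($R^2 = .9279$, a highly significant overall $F$, insignificant individual $t$'s), here is a small Python sketch (numpy and statsmodels assumed) that fits the full model to the nine districts in the table:

```python
import numpy as np
import statsmodels.api as sm

# Satellite antenna data from the table above (x1, x2 in units of 10,000 households).
y  = np.array([50, 73, 32, 121, 156, 98, 62, 51, 80])
x1 = np.array([14, 28, 10,  30,  48, 30, 20, 16, 25])
x2 = np.array([11, 18,  5,  20,  30, 21, 15, 11, 17])

# Multiple regression of y on x1 and x2 (with an intercept).
X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.OLS(y, X).fit()

print(fit.rsquared)               # about .928
print(fit.fvalue, fit.f_pvalue)   # overall F test: highly significant
print(fit.tvalues[1:])            # t for x1 and x2: neither is significant
```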


The reason why the results of the two $t$-tests are so different from the result of the $F$-test is that collinearity has destroyed the $t$-tests by strongly reducing their power. The Pearson correlation coefficient between $x_1$ and $x_2$ is $r = .985$, so the two variables are highly collinear. A simple regression of $y$ on $x_1$ gives a $t$-statistic for $b_1$ of 9.35 (highly significant), while a simple regression of $y$ on $x_2$ gives a $t$-statistic for $b_2$ of 8.62 (also highly significant). Note also that the $R^2$ values for these two simple regressions are .9259 and .9139, respectively, both of which are almost as high as the multiple $R^2$ for the full model, .9279.
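Continuing the same sketch (again with numpy and statsmodels assumed), the correlation and the two simple regressions can be checked as follows:

```python
import numpy as np
import statsmodels.api as sm

y  = np.array([50, 73, 32, 121, 156, 98, 62, 51, 80])
x1 = np.array([14, 28, 10,  30,  48, 30, 20, 16, 25])
x2 = np.array([11, 18,  5,  20,  30, 21, 15, 11, 17])

# Pearson correlation between the two explanatory variables (about .985).
print(np.corrcoef(x1, x2)[0, 1])

# Simple regressions of y on x1 alone and on x2 alone: each slope is
# highly significant, and each R^2 is nearly as large as in the full model.
for x in (x1, x2):
    fit = sm.OLS(y, sm.add_constant(x)).fit()
    print(fit.tvalues[1], fit.rsquared)
```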


To get some mathematical insight into the general problem, we use the spectral decomposition (Jobson, p. 576) to write

$$(X'X)^{-1} = \sum_{i=0}^{p} \lambda_i^{-1} \, p_i \, p_i',$$

where the $\lambda_i$ are the eigenvalues of $X'X$ and $P = [p_0, \ldots, p_p]$ is an orthogonal matrix of eigenvectors of $X'X$.

If there is exact multicollinearity, then for some $(p+1) \times 1$ vector $\nu \neq 0$ we must have $X\nu = 0$, so that $\nu$ is an eigenvector of $X'X$ and the corresponding eigenvalue is zero. Therefore, one of the $\lambda_i$ must be zero. In this case, $X'X$ is not invertible, since $(X'X)^{-1}$ would have to satisfy


$$(X'X)^{-1}(X'X)\nu = (X'X)^{-1}\,0 = 0,$$

that is, $\nu = 0$ (since the left-hand side is just $\nu$), which is ruled out by the definition of $\nu$. Our computer will (hopefully) be unable to calculate the least squares estimator $b$, since $b$ is no longer uniquely defined and $(X'X)^{-1}$ does not exist. Due to roundoff and other numerical errors, however, some packages will be able to carry out their calculations without any obvious catastrophe (e.g. dividing by zero), and they will therefore produce output which is completely inappropriate and useless.
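A small numerical sketch (numpy assumed, with an artificial design matrix) of what exact multicollinearity does to $X'X$:

```python
import numpy as np

# Artificial design whose last column is exactly the sum of the two before it,
# so X has an exact linear dependence among its columns.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
X = np.column_stack([np.ones(5), x1, x2, x1 + x2])

XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))   # 3, not 4: X'X is singular
print(np.linalg.eigvalsh(XtX))      # one eigenvalue is zero (up to roundoff)

# Depending on rounding, inv() either raises LinAlgError or quietly returns
# astronomically large, meaningless entries -- the inappropriate and useless
# output described above.
try:
    print(np.linalg.inv(XtX))
except np.linalg.LinAlgError:
    print("X'X could not be inverted")
```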


If there is approximate multicollinearity, then one or more of the $\lambda_i$ will be very close to zero, so that the entries of $(X'X)^{-1} = \sum_{i=0}^{p} \lambda_i^{-1} \, p_i \, p_i'$ will be very large. Since $\mathrm{var}(b_j) = \sigma^2 \, [(X'X)^{-1}]_{jj}$, we see that approximate multicollinearity tends to inflate the estimated variance of $b_j$ for one or more (perhaps all) $j$. As a result, the $t$-statistics will tend to be insignificant. The overall $F$ is not adversely affected by multicollinearity, so it may be significant even if none of the individual $b_j$ is.
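For the satellite antenna data, the eigenvalue spread and the resulting inflated standard errors can be computed directly from the spectral decomposition (a Python/numpy sketch; the variable names are mine):

```python
import numpy as np

# Satellite antenna data from the table above.
y  = np.array([50, 73, 32, 121, 156, 98, 62, 51, 80], dtype=float)
x1 = np.array([14, 28, 10,  30,  48, 30, 20, 16, 25], dtype=float)
x2 = np.array([11, 18,  5,  20,  30, 21, 15, 11, 17], dtype=float)
X  = np.column_stack([np.ones(9), x1, x2])

# Spectral decomposition of X'X: one eigenvalue is tiny relative to the others,
# so lambda_i^{-1} is huge and the entries of (X'X)^{-1} blow up with it.
lam, P = np.linalg.eigh(X.T @ X)
print(lam, lam.max() / lam.min())

XtX_inv = P @ np.diag(1.0 / lam) @ P.T       # sum of lambda_i^{-1} p_i p_i'
b = XtX_inv @ X.T @ y                        # least squares estimates
s2 = np.sum((y - X @ b) ** 2) / (len(y) - X.shape[1])   # estimate of sigma^2
print(np.sqrt(s2 * np.diag(XtX_inv)))        # inflated standard errors of the b_j
```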

It can also be shown that the prediction variance (incurred in "predicting" either the response surface or a future value of $y$ at a particular value of the explanatory variables) will not be disastrously affected by multicollinearity, as long as the entries of the new vector of explanatory variables obey the same approximate multicollinearities as the columns of $X$.
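To illustrate (a sketch under the same numpy assumption; the two evaluation points are hypothetical), compare the quadratic form $x_0'(X'X)^{-1}x_0$, which is the variance of the fitted surface at $x_0$ divided by $\sigma^2$, at a point that respects the relationship between $x_1$ and $x_2$ in the data and at one that does not:

```python
import numpy as np

x1 = np.array([14, 28, 10, 30, 48, 30, 20, 16, 25], dtype=float)
x2 = np.array([11, 18,  5, 20, 30, 21, 15, 11, 17], dtype=float)
X  = np.column_stack([np.ones(9), x1, x2])
XtX_inv = np.linalg.inv(X.T @ X)

# The variance of the estimated response surface at a new point x0 is
# sigma^2 * x0' (X'X)^{-1} x0, so the quadratic form below is what matters.
typical  = np.array([1.0, 30.0, 20.0])  # x2/x1 close to the ratio seen in the data
atypical = np.array([1.0, 30.0,  5.0])  # violates the approximate collinearity

for x0 in (typical, atypical):
    print(x0 @ XtX_inv @ x0)   # small for the typical point, far larger for the other
```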

Keep in mind, though, that multicollinearity often arises because we are trying to use too many explanatory variables. This tends to inflate the prediction variance. (See the handout on model selection.) So, although the effect of multicollinearity on the predictions may not be disastrous, we will still typically be able to improve the quality of the predictions by using fewer variables.


In my opinion, the best remedy for multicollinearity is to use fewer variables. This can be achieved by a combination of thinking about the problem, transformation and combination of variables, and model selection. Two methods of diagnosing multicollinearity in a given data set are: (1) look at the Pearson correlation coefficients of all pairs of explanatory variables; (2) look at the ratio $\lambda_{\max}/\lambda_{\min}$ of the largest to the smallest eigenvalue of $X'X$. For those who insist on working with a multicollinear data set, there are biased estimation techniques (e.g. ridge regression) which may have a lower mean squared error than least squares.
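Both diagnostics, and a bare-bones version of ridge regression, can be sketched for the satellite antenna data as follows (numpy assumed; the shrinkage constant $k$ is an arbitrary illustrative choice, and a careful implementation would standardize the predictors and not penalize the intercept):

```python
import numpy as np

y  = np.array([50, 73, 32, 121, 156, 98, 62, 51, 80], dtype=float)
x1 = np.array([14, 28, 10, 30, 48, 30, 20, 16, 25], dtype=float)
x2 = np.array([11, 18,  5, 20, 30, 21, 15, 11, 17], dtype=float)
X  = np.column_stack([np.ones(9), x1, x2])

# Diagnostic (1): pairwise correlation of the explanatory variables.
print(np.corrcoef(x1, x2)[0, 1])

# Diagnostic (2): ratio of the largest to the smallest eigenvalue of X'X.
lam = np.linalg.eigvalsh(X.T @ X)
print(lam.max() / lam.min())

# Ridge regression in its crudest form: add k*I to X'X before solving.
# This biases the estimates but can reduce their variance (and hence MSE).
k = 1.0
b_ols   = np.linalg.solve(X.T @ X, X.T @ y)
b_ridge = np.linalg.solve(X.T @ X + k * np.eye(X.shape[1]), X.T @ y)
print(b_ols, b_ridge)
```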