Decomposing Variance


    Author(s): Kerby Shedden, Ph.D., 2010

    License: Unless otherwise noted, this material is made available under the

    terms of the Creative Commons Attribution Share Alike 3.0 License:

    http://creativecommons.org/licenses/by-sa/3.0/

    We have reviewed this material in accordance with U.S. Copyright Law and have tried to maximize your

    ability to use, share, and adapt it. The citation key on the following slide provides information about how you

    may share and adapt this material.

    Copyright holders of content included in this material should contact [email protected] with any

    questions, corrections, or clarification regarding the use of content.

    For more information about how to cite these materials visit http://open.umich.edu/privacy-and-terms-use.

    Any medical information in this material is intended to inform and educate and is not a tool for self-diagnosis

    or a replacement for medical evaluation, advice, diagnosis or treatment by a healthcare professional. Please

    speak to your physician if you have questions about your medical condition.

    Viewer discretion is advised: Some medical content is graphic and may not be suitable for all viewers.


    Decomposing Variance

    Kerby Shedden

    Department of Statistics, University of Michigan

    October 10, 2011


    Law of total variation

    For any regression model involving a response Y and a covariate vector X, we have

    $$\mathrm{var}(Y) = \mathrm{var}_X E(Y|X) + E_X \mathrm{var}(Y|X).$$

    Note that this only makes sense if we treat X as being random.

    We often wish to distinguish these two situations:

    The population is homoscedastic: $\mathrm{var}(Y|X)$ does not depend on X, so we can simply write $\mathrm{var}(Y|X) = \sigma^2$, and we get $\mathrm{var}(Y) = \mathrm{var}_X E(Y|X) + \sigma^2$.

    The population is heteroscedastic: $\mathrm{var}(Y|X)$ is a function $\sigma^2(X)$ with expected value $\bar{\sigma}^2 = E_X \sigma^2(X)$, and again we get $\mathrm{var}(Y) = \mathrm{var}_X E(Y|X) + \bar{\sigma}^2$.

    If we write $Y = f(X) + \epsilon$ with $E(\epsilon|X) = 0$, then $E(Y|X) = f(X)$, and $\mathrm{var}_X E(Y|X)$ summarizes the variation of $f(X)$ over the marginal distribution of X.
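    The identity above can be checked numerically. The following is a small numpy sketch (not from the original slides); the discrete distribution of X and the conditional means and standard deviations are arbitrary illustrative choices.

    ```python
    # Check var(Y) = var_X E(Y|X) + E_X var(Y|X) on simulated data.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000

    # X takes a few discrete values so the conditional moments are easy to estimate.
    x = rng.integers(0, 4, size=n)
    f = np.array([0.0, 1.0, 3.0, 4.0])        # E(Y|X=j), illustrative
    s = np.array([1.0, 0.5, 2.0, 1.5])        # sd(Y|X=j), heteroscedastic
    y = f[x] + s[x] * rng.standard_normal(n)

    cond_means = np.array([y[x == j].mean() for j in range(4)])
    cond_vars = np.array([y[x == j].var() for j in range(4)])
    p = np.bincount(x) / n                    # marginal distribution of X

    var_of_means = np.sum(p * (cond_means - np.sum(p * cond_means)) ** 2)
    mean_of_vars = np.sum(p * cond_vars)

    print(y.var(), var_of_means + mean_of_vars)   # agree up to floating point
    ```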


    Law of total variation

    [Figure: conditional distributions of Y plotted against X, with E(Y|X) marked on each.]

    Orange curves: conditional distributions of Y given X.
    Purple curve: marginal distribution of Y.
    Black dots: conditional means of Y given X.


    Pearson correlation

    The population Pearson correlation coefficient of two jointly distributed scalar-valued random variables X and Y is

    $$\rho_{XY} \equiv \frac{\mathrm{cov}(X,Y)}{\sigma_X \sigma_Y}.$$

    Given data $Y = (Y_1, \ldots, Y_n)'$ and $X = (X_1, \ldots, X_n)'$, the Pearson correlation coefficient is estimated by

    $$\hat{\rho}_{XY} = \frac{\widehat{\mathrm{cov}}(X,Y)}{\hat{\sigma}_X \hat{\sigma}_Y} = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_i (X_i - \bar{X})^2 \sum_i (Y_i - \bar{Y})^2}} = \frac{(X - \bar{X})'(Y - \bar{Y})}{\|X - \bar{X}\|\,\|Y - \bar{Y}\|}.$$

    When we write $Y - \bar{Y}$ here, this means $Y - \bar{Y}\cdot 1$, where $1$ is a vector of 1s, and $\bar{Y}$ is a scalar.
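    As a quick numerical check (not part of the original slides), the deviation form of $\hat{\rho}_{XY}$ can be compared against numpy's built-in correlation; the simulated data below are illustrative.

    ```python
    # The deviation formula above matches numpy's built-in correlation.
    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.standard_normal(200)
    y = 0.5 * x + rng.standard_normal(200)

    xc, yc = x - x.mean(), y - y.mean()
    rho_hat = (xc @ yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))

    print(rho_hat, np.corrcoef(x, y)[0, 1])   # identical up to rounding
    ```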


    Pearson correlation

    By the Cauchy-Schwarz inequality,

    $$-1 \le \rho_{XY} \le 1$$
    $$-1 \le \hat{\rho}_{XY} \le 1.$$

    The sample correlation coefficient is slightly biased, but the bias is so small that it is usually ignored.


    Pearson correlation and simple linear regression slopes

    For the simple linear regression model

    $$Y = \alpha + \beta X + \epsilon,$$

    if we view X as a random variable that is uncorrelated with $\epsilon$, then

    $$\mathrm{cov}(X,Y) = \beta \sigma_X^2$$

    and the correlation is

    $$\rho_{XY} \equiv \mathrm{cor}(X,Y) = \frac{\beta}{\sqrt{\beta^2 + \sigma^2/\sigma_X^2}}.$$

    The sample correlation coefficient is related to the least squares slope estimate:

    $$\hat{\beta} = \frac{\widehat{\mathrm{cov}}(X,Y)}{\hat{\sigma}_X^2} = \hat{\rho}_{XY}\,\frac{\hat{\sigma}_Y}{\hat{\sigma}_X}.$$
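    A minimal sketch (illustrative data, not from the slides) confirming that the least squares slope equals $\hat{\rho}_{XY}\hat{\sigma}_Y/\hat{\sigma}_X$:

    ```python
    # The least squares slope equals rho_hat * sd(y) / sd(x).
    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.standard_normal(500)
    y = 1.0 + 2.0 * x + rng.standard_normal(500)

    beta_hat = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    rho_hat = np.corrcoef(x, y)[0, 1]

    print(beta_hat, rho_hat * np.std(y, ddof=1) / np.std(x, ddof=1))  # equal
    ```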


    Orthogonality between fitted values and residuals

    Recall that the fitted values are

    $$\hat{Y} = X\hat{\beta} = PY$$

    and the residuals are

    $$R = Y - \hat{Y} = (I - P)Y.$$

    Since $P(I - P) = 0$ it follows that $\hat{Y}'R = 0$.

    Since $\bar{R} = 0$, it is equivalent to state that the sample correlation between R and $\hat{Y}$ is zero, i.e.

    $$\widehat{\mathrm{cor}}(R, \hat{Y}) = 0.$$
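    The orthogonality of fitted values and residuals is easy to verify numerically; the design matrix and coefficients below are illustrative choices, not from the slides.

    ```python
    # Fitted values and residuals are orthogonal in a regression with an intercept.
    import numpy as np

    rng = np.random.default_rng(3)
    n = 100
    X = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])
    y = X @ np.array([1.0, 2.0, -1.0]) + rng.standard_normal(n)

    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ beta_hat
    resid = y - y_hat

    print(y_hat @ resid)                      # ~0 up to floating point error
    print(np.corrcoef(resid, y_hat)[0, 1])    # ~0 as well
    ```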


    Coefficient of determination

    A descriptive summary of the explanatory power of X for Y is given by the coefficient of determination, also known as the proportion of explained variance, or multiple $R^2$. This is the quantity

    $$R^2 \equiv 1 - \frac{\|Y - \hat{Y}\|^2}{\|Y - \bar{Y}\|^2} = \frac{\|\hat{Y} - \bar{Y}\|^2}{\|Y - \bar{Y}\|^2} = \frac{\widehat{\mathrm{var}}(\hat{Y})}{\widehat{\mathrm{var}}(Y)}.$$

    The equivalence between the two expressions follows from the identity

    $$\begin{aligned}
    \|Y - \bar{Y}\|^2 &= \|Y - \hat{Y} + \hat{Y} - \bar{Y}\|^2 \\
    &= \|Y - \hat{Y}\|^2 + \|\hat{Y} - \bar{Y}\|^2 + 2(Y - \hat{Y})'(\hat{Y} - \bar{Y}) \\
    &= \|Y - \hat{Y}\|^2 + \|\hat{Y} - \bar{Y}\|^2.
    \end{aligned}$$

    It should be clear that $R^2 = 0$ iff $\hat{Y} = \bar{Y}$ and $R^2 = 1$ iff $\hat{Y} = Y$.


    Coefficient of determination

    The coefficient of determination is equal to $\widehat{\mathrm{cor}}(Y, \hat{Y})^2$. To see this, note that

    $$\begin{aligned}
    \widehat{\mathrm{cor}}(Y, \hat{Y}) &= \frac{(Y - \bar{Y})'(\hat{Y} - \bar{Y})}{\|Y - \bar{Y}\|\,\|\hat{Y} - \bar{Y}\|} \\
    &= \frac{(\hat{Y} - \bar{Y} + Y - \hat{Y})'(\hat{Y} - \bar{Y})}{\|Y - \bar{Y}\|\,\|\hat{Y} - \bar{Y}\|} \\
    &= \frac{(\hat{Y} - \bar{Y})'(\hat{Y} - \bar{Y}) + (Y - \hat{Y})'(\hat{Y} - \bar{Y})}{\|Y - \bar{Y}\|\,\|\hat{Y} - \bar{Y}\|} \\
    &= \frac{\|\hat{Y} - \bar{Y}\|}{\|Y - \bar{Y}\|}.
    \end{aligned}$$
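    The three equivalent expressions for $R^2$ (residual form, variance ratio, and squared correlation) can be checked with a short numpy sketch; the data below are illustrative.

    ```python
    # R^2 computed three equivalent ways.
    import numpy as np

    rng = np.random.default_rng(4)
    n = 200
    X = np.column_stack([np.ones(n), rng.standard_normal((n, 3))])
    y = X @ np.array([1.0, 0.5, -0.5, 2.0]) + rng.standard_normal(n)

    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ beta_hat

    r2_resid = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
    r2_var = np.var(y_hat) / np.var(y)
    r2_cor = np.corrcoef(y, y_hat)[0, 1] ** 2

    print(r2_resid, r2_var, r2_cor)           # all three agree
    ```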


    Coefficient of determination in simple linear regression

    In general,

    $$R^2 = \widehat{\mathrm{cor}}(Y, \hat{Y})^2 = \frac{\widehat{\mathrm{cov}}(Y, \hat{Y})^2}{\widehat{\mathrm{var}}(Y)\,\widehat{\mathrm{var}}(\hat{Y})}.$$

    In the case of simple linear regression,

    $$\widehat{\mathrm{cov}}(Y, \hat{Y}) = \widehat{\mathrm{cov}}(Y, \hat{\alpha} + \hat{\beta}X) = \hat{\beta}\,\widehat{\mathrm{cov}}(Y, X),$$

    and

    $$\widehat{\mathrm{var}}(\hat{Y}) = \widehat{\mathrm{var}}(\hat{\alpha} + \hat{\beta}X) = \hat{\beta}^2\,\widehat{\mathrm{var}}(X).$$

    Thus for simple linear regression, $R^2 = \widehat{\mathrm{cor}}(Y, X)^2 = \widehat{\mathrm{cor}}(Y, \hat{Y})^2$.


    Relationship to the F statistic

    The F-statistic for the null hypothesis

    $$\beta_1 = \cdots = \beta_p = 0$$

    is

    $$F = \frac{\|\hat{Y} - \bar{Y}\|^2}{\|Y - \hat{Y}\|^2}\cdot\frac{n - p - 1}{p} = \frac{R^2}{1 - R^2}\cdot\frac{n - p - 1}{p},$$

    which is an increasing function of $R^2$.
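    A small sketch (illustrative data) showing that the F statistic computed from the norms agrees with the expression in terms of $R^2$:

    ```python
    # The overall F statistic from sums of squares and from R^2.
    import numpy as np

    rng = np.random.default_rng(5)
    n, p = 200, 3
    X = np.column_stack([np.ones(n), rng.standard_normal((n, p))])
    y = X @ np.array([1.0, 0.5, -0.5, 2.0]) + rng.standard_normal(n)
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ beta_hat

    f_norms = (np.sum((y_hat - y.mean()) ** 2) / np.sum((y - y_hat) ** 2)) * (n - p - 1) / p
    r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
    f_r2 = (r2 / (1 - r2)) * (n - p - 1) / p

    print(f_norms, f_r2)                      # identical
    ```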


    Adjusted R2

    The sample $R^2$ is an estimate of the population $R^2$:

    $$1 - \frac{E_X \mathrm{var}(Y|X)}{\mathrm{var}(Y)}.$$

    Since it is a ratio, the plug-in estimate $R^2$ is biased, although the bias is not large unless the sample size is small or the number of covariates is large. The adjusted $R^2$ is an approximately unbiased estimate of the population $R^2$:

    $$1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}.$$

    The adjusted $R^2$ is always less than the unadjusted $R^2$. The adjusted $R^2$ is always less than or equal to one, but can be negative.
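    A minimal sketch of the adjustment formula; the numbers plugged in are arbitrary.

    ```python
    # Adjusted R^2 as defined above.
    def adjusted_r2(r2: float, n: int, p: int) -> float:
        """Approximately unbiased estimate of the population R^2."""
        return 1 - (1 - r2) * (n - 1) / (n - p - 1)

    print(adjusted_r2(0.40, n=50, p=5))   # 0.3318..., smaller than 0.40
    ```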


    The unique variation in one covariate

    How much information about Y is present in a covariate $X_k$? This question is not straightforward when the covariates are non-orthogonal, since several covariates may contain overlapping information about Y.

    Let $X_k^\perp$ be the residual of $X_k$ after regressing it against all other covariates (including the intercept). If $P_{-k}$ is the projection onto $\mathrm{span}(\{X_j, j \ne k\})$, then

    $$X_k^\perp = (I - P_{-k})X_k.$$

    We could use $\widehat{\mathrm{var}}(X_k^\perp)/\widehat{\mathrm{var}}(X_k)$ to assess how much of the variation in $X_k$ is unique, in that it is not also captured by other predictors. But this measure doesn't involve Y, so it can't tell us whether the unique variation in $X_k$ is useful in the regression analysis.


    The unique regression information in one covariate

    To learn how $X_k$ contributes uniquely to the regression, we can consider how introducing $X_k$ to a working regression model affects the $R^2$.

    Let $\hat{Y}_{-k} = P_{-k}Y$ be the fitted values in the model omitting covariate k.

    Let $R^2$ denote the multiple $R^2$ for the full model, and let $R^2_{-k}$ be the multiple $R^2$ for the regression omitting covariate $X_k$. The value of

    $$R^2 - R^2_{-k}$$

    is a way to quantify how much unique information about Y in $X_k$ is not captured by the other covariates. This is called the semi-partial $R^2$.


    Identity involving norms of fitted values and residuals

    Before we continue, we will need a simple identity that is often useful.

    In general, if A and B are orthogonal, then $\|A + B\|^2 = \|A\|^2 + \|B\|^2$.

    If A and B − A are orthogonal, then

    $$\|B\|^2 = \|B - A + A\|^2 = \|B - A\|^2 + \|A\|^2.$$

    Thus we have $\|B\|^2 - \|A\|^2 = \|B - A\|^2$.

    Applying this fact to regression, we know that the fitted values and residuals are orthogonal. Thus for the regression omitting variable k, $\hat{Y}_{-k}$ and $Y - \hat{Y}_{-k}$ are orthogonal, so

    $$\|Y - \hat{Y}_{-k}\|^2 = \|Y\|^2 - \|\hat{Y}_{-k}\|^2.$$

    By the same argument, $\|Y - \hat{Y}\|^2 = \|Y\|^2 - \|\hat{Y}\|^2$.


    Improvement in R2 due to one covariate

    Now we can obtain a simple, direct expression for the semi-partial $R^2$.

    Since $X_k^\perp$ is orthogonal to the other covariates,

    $$\hat{Y} = \hat{Y}_{-k} + \frac{\langle Y, X_k^\perp \rangle}{\langle X_k^\perp, X_k^\perp \rangle}\,X_k^\perp,$$

    and

    $$\|\hat{Y}\|^2 = \|\hat{Y}_{-k}\|^2 + \langle Y, X_k^\perp \rangle^2/\|X_k^\perp\|^2.$$


    Improvement in R2 due to one covariate

    Thus we have

    $$\begin{aligned}
    R^2 &= 1 - \frac{\|Y - \hat{Y}\|^2}{\|Y - \bar{Y}\|^2} \\
    &= 1 - \frac{\|Y\|^2 - \|\hat{Y}\|^2}{\|Y - \bar{Y}\|^2} \\
    &= 1 - \frac{\|Y\|^2 - \|\hat{Y}_{-k}\|^2 - \langle Y, X_k^\perp \rangle^2/\|X_k^\perp\|^2}{\|Y - \bar{Y}\|^2} \\
    &= 1 - \frac{\|Y - \hat{Y}_{-k}\|^2}{\|Y - \bar{Y}\|^2} + \frac{\langle Y, X_k^\perp \rangle^2/\|X_k^\perp\|^2}{\|Y - \bar{Y}\|^2} \\
    &= R^2_{-k} + \frac{\langle Y, X_k^\perp \rangle^2/\|X_k^\perp\|^2}{\|Y - \bar{Y}\|^2}.
    \end{aligned}$$


    Semi-partial R2

    Thus the semi-partial $R^2$ is

    $$R^2 - R^2_{-k} = \frac{\langle Y, X_k^\perp \rangle^2/\|X_k^\perp\|^2}{\|Y - \bar{Y}\|^2} = \frac{\langle Y, X_k^\perp/\|X_k^\perp\| \rangle^2}{\|Y - \bar{Y}\|^2},$$

    where $\tilde{Y}_k$ denotes the fitted values for regressing Y on $X_k^\perp$.

    Since $X_k^\perp/\|X_k^\perp\|$ is centered and has length 1, it follows that

    $$R^2 - R^2_{-k} = \widehat{\mathrm{cor}}(Y, X_k^\perp)^2 = \widehat{\mathrm{cor}}(Y, \tilde{Y}_k)^2.$$

    Thus the semi-partial $R^2$ for covariate k has two equivalent interpretations:

    It is the improvement in $R^2$ resulting from including covariate k in a working regression model that already contains the other covariates.

    It is the $R^2$ for a simple linear regression of Y on $X_k^\perp = (I - P_{-k})X_k$.
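    Both interpretations can be verified numerically. In the sketch below (illustrative data, not from the slides), the improvement in $R^2$ from adding one covariate is compared to the squared correlation between Y and the residualized covariate.

    ```python
    # Semi-partial R^2 computed two equivalent ways.
    import numpy as np

    def r2(X, y):
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        return 1 - np.sum((y - X @ b) ** 2) / np.sum((y - y.mean()) ** 2)

    rng = np.random.default_rng(6)
    n = 500
    x1 = rng.standard_normal(n)
    x2 = 0.6 * x1 + 0.8 * rng.standard_normal(n)      # correlated covariates
    y = 1 + x1 + 0.5 * x2 + rng.standard_normal(n)

    X_full = np.column_stack([np.ones(n), x1, x2])
    X_red = np.column_stack([np.ones(n), x2])          # omit covariate 1

    # (a) improvement in R^2 when x1 is added to the model containing x2
    semi_partial_a = r2(X_full, y) - r2(X_red, y)

    # (b) squared correlation of y with x1 residualized on {1, x2}
    b, *_ = np.linalg.lstsq(X_red, x1, rcond=None)
    x1_perp = x1 - X_red @ b
    semi_partial_b = np.corrcoef(y, x1_perp)[0, 1] ** 2

    print(semi_partial_a, semi_partial_b)              # agree
    ```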


    Partial R2

    The partial $R^2$ is

    $$\frac{R^2 - R^2_{-k}}{1 - R^2_{-k}} = \frac{\langle Y, X_k^\perp \rangle^2/\|X_k^\perp\|^2}{\|Y - \hat{Y}_{-k}\|^2}.$$

    The partial $R^2$ for covariate k is the fraction of the maximum possible improvement in $R^2$ that is contributed by covariate k.

    Let $\hat{Y}_{-k}$ be the fitted values for regressing Y on all covariates except $X_k$. Since $\hat{Y}_{-k}'X_k^\perp = 0$,

    $$\frac{\langle Y, X_k^\perp \rangle^2}{\|Y - \hat{Y}_{-k}\|^2\,\|X_k^\perp\|^2} = \frac{\langle Y - \hat{Y}_{-k}, X_k^\perp \rangle^2}{\|Y - \hat{Y}_{-k}\|^2\,\|X_k^\perp\|^2}.$$

    The expression on the right is the usual $R^2$ that would be obtained when regressing $Y - \hat{Y}_{-k}$ on $X_k^\perp$. Thus the partial $R^2$ is the same as the usual $R^2$ for $(I - P_{-k})Y$ regressed on $(I - P_{-k})X_k$.
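    A companion sketch (again with illustrative data) checking that the ratio form of the partial $R^2$ matches the $R^2$ of the residual-on-residual regression:

    ```python
    # Partial R^2 computed two equivalent ways.
    import numpy as np

    def r2(X, y):
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        return 1 - np.sum((y - X @ b) ** 2) / np.sum((y - y.mean()) ** 2)

    rng = np.random.default_rng(7)
    n = 500
    x1 = rng.standard_normal(n)
    x2 = 0.6 * x1 + 0.8 * rng.standard_normal(n)
    y = 1 + x1 + 0.5 * x2 + rng.standard_normal(n)

    X_full = np.column_stack([np.ones(n), x1, x2])
    X_red = np.column_stack([np.ones(n), x2])

    # (a) (R^2 - R^2_{-k}) / (1 - R^2_{-k})
    partial_a = (r2(X_full, y) - r2(X_red, y)) / (1 - r2(X_red, y))

    # (b) ordinary R^2 of residualized y regressed on residualized x1
    b_y, *_ = np.linalg.lstsq(X_red, y, rcond=None)
    b_x, *_ = np.linalg.lstsq(X_red, x1, rcond=None)
    partial_b = np.corrcoef(y - X_red @ b_y, x1 - X_red @ b_x)[0, 1] ** 2

    print(partial_a, partial_b)                        # agree
    ```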


    Decomposition of projection matrices

    Suppose $P \in \mathbb{R}^{n \times n}$ is a rank-d projection matrix, and U is a $n \times d$ orthogonal matrix whose columns span $\mathrm{col}(P)$. If we partition U by columns

    $$U = \begin{pmatrix} | & | & & | \\ U_1 & U_2 & \cdots & U_d \\ | & | & & | \end{pmatrix},$$

    then $P = UU'$, so we can write

    $$P = \sum_{j=1}^d U_j U_j'.$$

    Note that this representation is not unique, since there are different orthogonal bases for $\mathrm{col}(P)$.

    Each summand $U_j U_j' \in \mathbb{R}^{n \times n}$ is a rank-1 projection matrix onto $U_j$.
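    A small numpy sketch of this decomposition, using an arbitrary matrix to generate an orthonormal basis:

    ```python
    # P = U U' decomposes into a sum of rank-1 projections U_j U_j'.
    import numpy as np

    rng = np.random.default_rng(8)
    A = rng.standard_normal((10, 3))          # any full-rank n x d matrix
    U, _ = np.linalg.qr(A)                    # orthonormal basis for col(A)

    P = U @ U.T                               # rank-3 projection onto col(A)
    P_sum = sum(np.outer(U[:, j], U[:, j]) for j in range(U.shape[1]))

    print(np.allclose(P, P_sum))              # True
    print(np.allclose(P @ P, P))              # idempotent, as a projection should be
    ```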


    Decomposition of R2

    Question: In a multiple regression model, how much of the variance in Y is explained by a particular covariate?

    Orthogonal case: If the design matrix X is orthogonal ($X'X = I$), the projection P onto $\mathrm{col}(X)$ can be decomposed as

    $$P = \sum_{j=0}^p P_j = \frac{11'}{n} + \sum_{j=1}^p X_j X_j',$$

    where $X_j$ is the jth column of the design matrix (assuming here that the first column of X is an intercept).


    Decomposition of R2 (orthogonal case)

    The $n \times n$ rank-1 matrix

    $$P_j = X_j X_j'$$

    is the projection onto $\mathrm{span}(X_j)$ (and $P_0$ is the projection onto the span of the vector of 1s). Furthermore, by orthogonality, $P_j P_k = 0$ unless $j = k$. Since

    $$\hat{Y} - \bar{Y} = \sum_{j=1}^p P_j Y,$$

    by orthogonality

    $$\|\hat{Y} - \bar{Y}\|^2 = \sum_{j=1}^p \|P_j Y\|^2.$$

    Here we are using the fact that if $U_1, \ldots, U_m$ are orthogonal, then

    $$\|U_1 + \cdots + U_m\|^2 = \|U_1\|^2 + \cdots + \|U_m\|^2.$$


    Decomposition of R2 (orthogonal case)

    The $R^2$ for simple linear regression of Y on $X_j$ is

    $$R^2_j \equiv \|\hat{Y}_j - \bar{Y}\|^2/\|Y - \bar{Y}\|^2 = \|P_j Y\|^2/\|Y - \bar{Y}\|^2,$$

    so we see that for orthogonal design matrices,

    $$R^2 = \sum_{j=1}^p R^2_j.$$

    That is, the overall coefficient of determination is the sum of univariate coefficients of determination for all the explanatory variables.
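    This additivity can be checked numerically by constructing a design whose covariate columns are centered and mutually orthogonal (an illustrative construction, not from the slides):

    ```python
    # For an orthogonal design, R^2 equals the sum of the single-covariate R^2_j.
    import numpy as np

    def r2(X, y):
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        return 1 - np.sum((y - X @ b) ** 2) / np.sum((y - y.mean()) ** 2)

    rng = np.random.default_rng(9)
    n, p = 200, 4
    Z = rng.standard_normal((n, p))
    Q, _ = np.linalg.qr(Z - Z.mean(axis=0))   # centered, orthonormal columns
    y = 1 + Q @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.standard_normal(n)

    X_full = np.column_stack([np.ones(n), Q])
    r2_full = r2(X_full, y)
    r2_each = [r2(np.column_stack([np.ones(n), Q[:, j]]), y) for j in range(p)]

    print(r2_full, sum(r2_each))              # agree for an orthogonal design
    ```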


    Decomposition of R2

    Non-orthogonal case: If X is not orthogonal, the overall $R^2$ will not be the sum of single covariate $R^2$'s.

    If we let $R^2_j$ be as above (the $R^2$ values for regressing Y on each $X_j$), then there are two different situations:

    $$\sum_j R^2_j > R^2, \qquad \text{and} \qquad \sum_j R^2_j < R^2.$$


    Decomposition of R2

    Case 1: $\sum_j R^2_j > R^2$

    It's not surprising that $\sum_j R^2_j$ can be bigger than $R^2$. For example, suppose that

    $$Y = X_1 + \epsilon$$

    is the data generating model, and $X_2$ is highly correlated with $X_1$ (but is not part of the data generating model).

    For the regression of Y on both $X_1$ and $X_2$, the multiple $R^2$ will be approximately $1 - \sigma^2/\mathrm{var}(Y)$ (since $E(Y|X_1, X_2) = E(Y|X_1) = X_1$).

    The $R^2$ values for Y regressed on either $X_1$ or $X_2$ separately will also be approximately $1 - \sigma^2/\mathrm{var}(Y)$.

    Thus $R^2_1 + R^2_2 \approx 2R^2$.


    Decomposition of R2

    Case 2: $\sum_j R^2_j < R^2$

    This is more surprising, and is sometimes called enhancement. As an example, suppose the data generating model is

    $$Y = Z + \epsilon,$$

    but we don't observe Z (for simplicity assume $EZ = 0$). Instead, we observe a value $X_2$ with mean zero that is independent of Z and $\epsilon$, and a value $X_1$ that satisfies

    $$X_1 = Z + X_2.$$

    Since $X_2$ is independent of Z and $\epsilon$, it is also independent of Y, thus $R^2_2 \approx 0$ for large n.


    Decomposition of R2 (enhancement example)

    The multiple $R^2$ of Y on $X_1$ and $X_2$ is approximately $\sigma_Z^2/(\sigma_Z^2 + \sigma^2)$ for large n, since the fitted values will converge to $\hat{Y} = X_1 - X_2 = Z$.

    To calculate $R^2_1$, first note that for the regression of Y on $X_1$, the slope estimate converges to

    $$\frac{\mathrm{cov}(Y, X_1)}{\mathrm{var}(X_1)} = \frac{\sigma_Z^2}{\sigma_Z^2 + \sigma_{X_2}^2}$$

    and the intercept estimate converges to 0.


    Decomposition of R2 (enhancement example)

    Therefore for large n,

    $$\begin{aligned}
    n^{-1}\|Y - \hat{Y}\|^2 &\approx n^{-1}\|Z + \epsilon - \sigma_Z^2 X_1/(\sigma_Z^2 + \sigma_{X_2}^2)\|^2 \\
    &= n^{-1}\|\sigma_{X_2}^2 Z/(\sigma_Z^2 + \sigma_{X_2}^2) + \epsilon - \sigma_Z^2 X_2/(\sigma_Z^2 + \sigma_{X_2}^2)\|^2 \\
    &\approx \sigma_{X_2}^4\sigma_Z^2/(\sigma_Z^2 + \sigma_{X_2}^2)^2 + \sigma^2 + \sigma_Z^4\sigma_{X_2}^2/(\sigma_Z^2 + \sigma_{X_2}^2)^2 \\
    &= \sigma_{X_2}^2\sigma_Z^2/(\sigma_Z^2 + \sigma_{X_2}^2) + \sigma^2.
    \end{aligned}$$

    Therefore

    $$R^2_1 = 1 - \frac{n^{-1}\|Y - \hat{Y}\|^2}{n^{-1}\|Y - \bar{Y}\|^2} \approx 1 - \frac{\sigma_{X_2}^2\sigma_Z^2/(\sigma_Z^2 + \sigma_{X_2}^2) + \sigma^2}{\sigma_Z^2 + \sigma^2} = \frac{\sigma_Z^2}{(\sigma_Z^2 + \sigma^2)(1 + \sigma_{X_2}^2/\sigma_Z^2)}.$$


    Decomposition of R2 (enhancement example)

    Thus

    $$R^2_1/R^2 \approx 1/(1 + \sigma_{X_2}^2/\sigma_Z^2),$$

    which is strictly less than one if $\sigma_{X_2}^2 > 0$.

    Since $R^2_2 \approx 0$, it follows that $R^2 > R^2_1 + R^2_2$.

    The reason for this is that while $X_2$ contains no directly useful information about Y (hence $R^2_2 \approx 0$), it can remove the measurement error in $X_1$, making $X_1$ a better predictor of Z.
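    A simulation sketch of the enhancement example, using the illustrative choices $\sigma_Z^2 = \sigma^2 = \sigma_{X_2}^2 = 1$ (so the limiting values are $R^2_{1,2} \approx 1/2$, $R^2_1 \approx 1/4$, $R^2_2 \approx 0$):

    ```python
    # Enhancement: R^2 for the full model exceeds R^2_1 + R^2_2.
    import numpy as np

    def r2(X, y):
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        return 1 - np.sum((y - X @ b) ** 2) / np.sum((y - y.mean()) ** 2)

    rng = np.random.default_rng(10)
    n = 100_000
    z = rng.standard_normal(n)
    eps = rng.standard_normal(n)
    x2 = rng.standard_normal(n)
    x1 = z + x2
    y = z + eps

    one = np.ones(n)
    r2_12 = r2(np.column_stack([one, x1, x2]), y)   # ~ 1/2
    r2_1 = r2(np.column_stack([one, x1]), y)        # ~ 1/4
    r2_2 = r2(np.column_stack([one, x2]), y)        # ~ 0

    print(r2_12, r2_1, r2_2, r2_12 > r2_1 + r2_2)   # enhancement: True
    ```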


    Partial R2 example I

    Suppose the design matrix satisfies

    $$X'X/n = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & r \\ 0 & r & 1 \end{pmatrix}$$

    and the data generating model is

    $$Y = X_1 + X_2 + \epsilon$$

    with $\mathrm{var}\,\epsilon = \sigma^2$.


    Partial R2 example I

    We will calculate the partial $R^2$ for $X_1$, using the fact that the partial $R^2$ is the regular $R^2$ for regressing

    $$(I - P_{-1})Y$$

    on

    $$(I - P_{-1})X_1,$$

    where $P_{-1}$ is the projection onto $\mathrm{span}(\{1, X_2\})$.

    Since this is a simple linear regression, the partial $R^2$ can be expressed

    $$\widehat{\mathrm{cor}}((I - P_{-1})Y,\,(I - P_{-1})X_1)^2.$$


    Partial R2 example I

    The numerator of the partial $R^2$ is the square of

    $$\widehat{\mathrm{cov}}((I - P_{-1})Y,\,(I - P_{-1})X_1) = Y'(I - P_{-1})X_1/n = (X_1 + X_2 + \epsilon)'(X_1 - rX_2)/n \approx 1 - r^2.$$

    The denominator contains two factors. The first is

    $$\|(I - P_{-1})X_1\|^2/n = X_1'(I - P_{-1})X_1/n = X_1'(X_1 - rX_2)/n \approx 1 - r^2.$$


    Partial R2 example I

    The other factor in the denominator is $Y'(I - P_{-1})Y/n$:

    $$\begin{aligned}
    Y'(I - P_{-1})Y/n &= (X_1 + X_2)'(I - P_{-1})(X_1 + X_2)/n + \epsilon'(I - P_{-1})\epsilon/n + 2\epsilon'(I - P_{-1})(X_1 + X_2)/n \\
    &\approx (X_1 + X_2)'(X_1 - rX_2)/n + \sigma^2 \\
    &\approx 1 - r^2 + \sigma^2.
    \end{aligned}$$

    Thus we get that the partial $R^2$ is approximately equal to

    $$\frac{1 - r^2}{1 - r^2 + \sigma^2}.$$

    If r = 1 then the result is zero ($X_1$ has no unique explanatory power), and if r = 0, the result is $1/(1 + \sigma^2)$, indicating that after controlling for $X_2$, a fraction $1/(1 + \sigma^2)$ of the remaining variance is explained by $X_1$ (the rest is due to $\epsilon$).
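    A simulation sketch of this example with the illustrative choices $r = 0.7$ and $\sigma^2 = 1$; the simulated partial $R^2$ should be close to $(1 - r^2)/(1 - r^2 + \sigma^2) \approx 0.338$.

    ```python
    # Example I: partial R^2 for X1 versus the approximate formula.
    import numpy as np

    def r2(X, y):
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        return 1 - np.sum((y - X @ b) ** 2) / np.sum((y - y.mean()) ** 2)

    rng = np.random.default_rng(11)
    n, r, sigma = 100_000, 0.7, 1.0
    x1 = rng.standard_normal(n)
    x2 = r * x1 + np.sqrt(1 - r ** 2) * rng.standard_normal(n)   # cor(x1, x2) ~ r
    y = x1 + x2 + sigma * rng.standard_normal(n)

    one = np.ones(n)
    r2_full = r2(np.column_stack([one, x1, x2]), y)
    r2_red = r2(np.column_stack([one, x2]), y)

    print((r2_full - r2_red) / (1 - r2_red))         # simulated partial R^2
    print((1 - r ** 2) / (1 - r ** 2 + sigma ** 2))  # theoretical value ~ 0.338
    ```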


    Partial R2 example II

    Suppose the data generating model is

    $$Y = bX_1 + \epsilon_1, \qquad X_2 = X_1 + \epsilon_2,$$

    where $X_1$, $\epsilon_1$, and $\epsilon_2$ are independent with mean zero and variances 1, $\sigma_1^2$, and $\sigma_2^2$, and we consider regressing Y on $X_1$, on $X_2$, and on both.


    Partial R2 example II

    The four $R^2$'s for this model are related as follows, where $R^2_{\{\}}$ is the $R^2$ based only on the intercept.

    [Diagram: a lattice relating $R^2_{\{\}}$, $R^2_1$, $R^2_2$, and $R^2_{1,2}$, with edges connecting each model to the models obtained by adding one covariate.]


    Partial R2 example II

    We can calculate the limiting values for each $R^2$:

    $$R^2_{\{\}} = 0$$

    $$R^2_1 = R^2_{1,2} = \frac{b^2}{b^2 + \sigma_1^2}.$$


    Partial R2 example II

    For the regression on $X_2$, the limiting value of the slope is

    $$\frac{\mathrm{cov}(Y, X_2)}{\mathrm{var}(X_2)} = \frac{b\,\mathrm{cov}(X_1, X_2) + \mathrm{cov}(\epsilon_1, X_2)}{1 + \sigma_2^2} = \frac{b}{1 + \sigma_2^2}.$$

    Therefore the residual mean square is approximately

    $$\begin{aligned}
    n^{-1}\|Y - \hat{Y}_2\|^2 &= n^{-1}\|bX_1 + \epsilon_1 - b(X_1 + \epsilon_2)/(1 + \sigma_2^2)\|^2 \\
    &= n^{-1}\left\|\frac{b\sigma_2^2}{1 + \sigma_2^2}X_1 + \epsilon_1 - \frac{b}{1 + \sigma_2^2}\epsilon_2\right\|^2 \\
    &\approx \frac{b^2\sigma_2^2}{1 + \sigma_2^2} + \sigma_1^2.
    \end{aligned}$$


    Partial R2 example II

    So,

    $$\begin{aligned}
    R^2_2 &\approx 1 - \frac{b^2\sigma_2^2/(1 + \sigma_2^2) + \sigma_1^2}{b^2 + \sigma_1^2} \\
    &= \frac{b^2 - b^2\sigma_2^2/(1 + \sigma_2^2)}{b^2 + \sigma_1^2} \\
    &= \frac{b^2}{(1 + \sigma_2^2)(b^2 + \sigma_1^2)} \\
    &= \frac{1}{(1 + \sigma_2^2)(1 + \sigma_1^2/b^2)}.
    \end{aligned}$$

    If $\sigma_2^2 = 0$ then $X_1 = X_2$, and we recover the usual $R^2$ for simple linear regression of Y on $X_1$.


    Partial R2 example II

    With some algebra, we get an expression for the partial $R^2$ for adding $X_1$ to a model already containing $X_2$:

    $$\frac{R^2_{1,2} - R^2_2}{1 - R^2_2} = \frac{b^2\sigma_2^2}{b^2\sigma_2^2 + \sigma_1^2 + \sigma_1^2\sigma_2^2}.$$

    If $\sigma_2^2 = 0$, the partial $R^2$ is 0.

    If $b \ne 0$, $\sigma_2^2 > 0$ and $\sigma_1^2 = 0$, the partial $R^2$ is 1.
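    A simulation sketch of example II with the illustrative choices $b = 1$, $\sigma_1^2 = 1$, $\sigma_2^2 = 0.5$, for which the formula above gives a partial $R^2$ of 0.25:

    ```python
    # Example II: simulated partial R^2 versus the formula.
    import numpy as np

    def r2(X, y):
        b_, *_ = np.linalg.lstsq(X, y, rcond=None)
        return 1 - np.sum((y - X @ b_) ** 2) / np.sum((y - y.mean()) ** 2)

    rng = np.random.default_rng(12)
    n, b, s1, s2 = 100_000, 1.0, 1.0, 0.5           # s1, s2 are variances
    x1 = rng.standard_normal(n)
    x2 = x1 + np.sqrt(s2) * rng.standard_normal(n)
    y = b * x1 + np.sqrt(s1) * rng.standard_normal(n)

    one = np.ones(n)
    r2_12 = r2(np.column_stack([one, x1, x2]), y)
    r2_2 = r2(np.column_stack([one, x2]), y)

    print((r2_12 - r2_2) / (1 - r2_2))              # simulated partial R^2
    print(b**2 * s2 / (b**2 * s2 + s1 + s1 * s2))   # formula: 0.25
    ```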


    Summary

    Each of the three $R^2$ values can be expressed either in terms of variance ratios, or as a squared correlation coefficient:

    Multiple $R^2$:      variance ratio $\|\hat{Y} - \bar{Y}\|^2/\|Y - \bar{Y}\|^2$;  correlation $\widehat{\mathrm{cor}}(Y, \hat{Y})^2$
    Semi-partial $R^2$:  variance ratio $R^2 - R^2_{-k}$;  correlation $\widehat{\mathrm{cor}}(Y, X_k^\perp)^2$
    Partial $R^2$:       variance ratio $(R^2 - R^2_{-k})/(1 - R^2_{-k})$;  correlation $\widehat{\mathrm{cor}}((I - P_{-k})Y, X_k^\perp)^2$
