© Christopher Dougherty 1999–2006

VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

We will now investigate the consequences of misspecifying the regression model in terms of explanatory variables. To keep the analysis simple, we will assume that there are only two possibilities. Either Y depends only on X2, or it depends on both X2 and X3.

A. If Y depends only on X2, and we fit a simple regression model, we will not encounter any problems, assuming of course that the regression model assumptions are valid.

B. Likewise, we will not encounter any problems if Y depends on both X2 and X3 and we fit the multiple regression.

C. In this sequence we will examine the consequences of fitting a simple regression when the true model is multiple. The omission of a relevant explanatory variable causes the regression coefficients to be biased and the standard errors to be invalid.

D. In the next one we will do the opposite and examine the consequences of fitting a multiple regression when the true model is simple. In this case the coefficients in general remain unbiased, but they are inefficient. The standard errors remain valid, but are needlessly large.



Consequences of Variable Misspecification

Fitted model \ True model | Y = β1 + β2X2 + u                  | Y = β1 + β2X2 + β3X3 + u
--------------------------+------------------------------------+--------------------------------------
Ŷ = b1 + b2X2             | Correct specification, no problems | Coefficients are biased (in general).
                          |                                    | Standard errors are invalid.
Ŷ = b1 + b2X2 + b3X3      | (covered in the next sequence)     | Correct specification, no problems



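In the present case, the omission of X3 causes b2 to be biased by the second term on the right-hand side of the expression for E(b2) given here. We will explain this first intuitively and then demonstrate it mathematically.

True model:
$$Y = \beta_1 + \beta_2 X_2 + \beta_3 X_3 + u$$

Fitted model:
$$\hat{Y} = b_1 + b_2 X_2$$

$$E(b_2) = \beta_2 + \beta_3 \frac{\sum (X_{2i} - \bar{X}_2)(X_{3i} - \bar{X}_3)}{\sum (X_{2i} - \bar{X}_2)^2}$$

[Figure: path diagram linking X2, X3 and Y. The arrow from X2 to Y is labelled β2 (direct effect of X2, holding X3 constant). The arrow from X3 to Y is labelled β3 (effect of X3). A link from X2 to X3 shows the apparent effect of X2, acting as a mimic for X3.]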

The intuitive reason is that, in addition to its direct effect β2, X2 has an apparent indirect effect as a consequence of acting as a proxy for the missing X3.

The strength of the proxy effect depends on two factors: the strength of the effect of X3 on Y, which is given by β3, and the ability of X2 to mimic X3.

The ability of X2 to mimic X3 is determined by the slope coefficient obtained when X3 is regressed on X2, which is the ratio multiplying β3 in the bias term.




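We will now derive the expression for the bias mathematically. It is convenient to start by deriving an expression for the deviation of Yi about its sample mean. It can be expressed in terms of the deviations of X2, X3, and u about their sample means.

$$
\begin{aligned}
Y_i - \bar{Y} &= (\beta_1 + \beta_2 X_{2i} + \beta_3 X_{3i} + u_i) - (\beta_1 + \beta_2 \bar{X}_2 + \beta_3 \bar{X}_3 + \bar{u}) \\
&= \beta_2 (X_{2i} - \bar{X}_2) + \beta_3 (X_{3i} - \bar{X}_3) + (u_i - \bar{u})
\end{aligned}
$$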




Although Y really depends on X3 as well as X2, we make a mistake and regress Y on X2 only. The slope coefficient is therefore:

$$b_2 = \frac{\sum (X_{2i} - \bar{X}_2)(Y_i - \bar{Y})}{\sum (X_{2i} - \bar{X}_2)^2}$$




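We substitute for the Y deviations and simplify:

$$
\begin{aligned}
b_2 &= \frac{\sum (X_{2i} - \bar{X}_2)(Y_i - \bar{Y})}{\sum (X_{2i} - \bar{X}_2)^2} \\
&= \frac{\sum (X_{2i} - \bar{X}_2)\left[\beta_2 (X_{2i} - \bar{X}_2) + \beta_3 (X_{3i} - \bar{X}_3) + (u_i - \bar{u})\right]}{\sum (X_{2i} - \bar{X}_2)^2} \\
&= \beta_2 + \beta_3 \frac{\sum (X_{2i} - \bar{X}_2)(X_{3i} - \bar{X}_3)}{\sum (X_{2i} - \bar{X}_2)^2} + \frac{\sum (X_{2i} - \bar{X}_2)(u_i - \bar{u})}{\sum (X_{2i} - \bar{X}_2)^2}
\end{aligned}
$$

Hence we have demonstrated that b2 has three components.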




To investigate biasedness or unbiasedness, we take the expected value of b2:

$$E(b_2) = \beta_2 + \beta_3 \frac{\sum (X_{2i} - \bar{X}_2)(X_{3i} - \bar{X}_3)}{\sum (X_{2i} - \bar{X}_2)^2} + E\left[\frac{\sum (X_{2i} - \bar{X}_2)(u_i - \bar{u})}{\sum (X_{2i} - \bar{X}_2)^2}\right]$$

The first two terms are unaffected because they contain no random components. Thus we focus on the expectation of the error term.



X2 is nonstochastic, so the denominator of the error term is nonstochastic and may be taken outside the expression for the expectation.

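$$
\begin{aligned}
E\left[\frac{\sum (X_{2i} - \bar{X}_2)(u_i - \bar{u})}{\sum (X_{2i} - \bar{X}_2)^2}\right]
&= \frac{1}{\sum (X_{2i} - \bar{X}_2)^2} E\left[\sum (X_{2i} - \bar{X}_2)(u_i - \bar{u})\right] \\
&= \frac{1}{\sum (X_{2i} - \bar{X}_2)^2} \sum E\left[(X_{2i} - \bar{X}_2)(u_i - \bar{u})\right] \\
&= \frac{1}{\sum (X_{2i} - \bar{X}_2)^2} \sum (X_{2i} - \bar{X}_2) E(u_i - \bar{u}) \\
&= 0
\end{aligned}
$$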



In the numerator the expectation of a sum is equal to the sum of the expectations (first expected value rule).


In each product, the factor involving X2 may be taken out of the expectation because X2 is nonstochastic.


By Assumption A.3, the expected value of u is 0. It follows that the expected value of the sample mean of u is also 0. Hence the expected value of the deviation from mean for the disturbance term is also 0.


Thus we have shown that the expected value of b2 is equal to the true value plus a bias term:

$$E(b_2) = \beta_2 + \beta_3 \frac{\sum (X_{2i} - \bar{X}_2)(X_{3i} - \bar{X}_3)}{\sum (X_{2i} - \bar{X}_2)^2}$$

Note 1: The definition of a bias is the difference between the expected value of an estimator and the true value of the parameter being estimated.

Note 2: The bias will be zero if the sample correlation between X2 and X3 is zero, or if β3 = 0.

As a consequence of the misspecification, the standard errors, t tests and F test are invalid.
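The derivation can be checked numerically. The following is a minimal simulation sketch, not part of the original slides: the parameter values, sample size, seed, and the helper name `slope` are all assumptions chosen for illustration. It holds X2 and X3 fixed, redraws u many times, and compares the average of b2 from the misspecified regression with β2 plus the bias term. It also confirms the three-component decomposition of b2 for one sample.

```python
import random

def slope(x, y):
    """OLS slope from a simple regression of y on x (with an intercept)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx

rng = random.Random(12345)            # fixed seed so the sketch is reproducible
beta1, beta2, beta3 = 1.0, 2.0, 3.0   # hypothetical true parameters
n, replications = 50, 2000

# X2 and X3 are generated once and then held fixed (nonstochastic regressors).
x2 = [rng.gauss(0, 1) for _ in range(n)]
x3 = [0.5 * a + rng.gauss(0, 1) for a in x2]   # X3 is correlated with X2

# Slope from regressing X3 on X2: the factor multiplying beta3 in the bias term.
mimic = slope(x2, x3)
expected_b2 = beta2 + beta3 * mimic

estimates = []
for _ in range(replications):
    u = [rng.gauss(0, 1) for _ in range(n)]
    y = [beta1 + beta2 * a + beta3 * b + e for a, b, e in zip(x2, x3, u)]
    estimates.append(slope(x2, y))    # misspecified regression: X3 omitted

mean_b2 = sum(estimates) / replications

# Decomposition of b2 for the last sample: beta2 + bias term + error term.
error_term = slope(x2, u)
decomposed = beta2 + beta3 * mimic + error_term

print(f"beta2 = {beta2}, beta2 + bias = {expected_b2:.4f}, mean of b2 = {mean_b2:.4f}")
```

Under these assumptions the average of b2 should settle near β2 plus the bias term rather than near β2 itself, while the decomposition holds for every individual sample up to rounding error.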



We will illustrate the bias using an educational attainment model. To keep the analysis simple, we will assume that in the true model S depends only on ASVABC and SM:

$$S = \beta_1 + \beta_2 \mathit{ASVABC} + \beta_3 \mathit{SM} + u$$

. reg S ASVABC SM

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  2,   537) =  147.36
       Model |  1135.67473     2  567.837363           Prob > F      =  0.0000
    Residual |  2069.30861   537  3.85346109           R-squared     =  0.3543
-------------+------------------------------           Adj R-squared =  0.3519
       Total |  3204.98333   539  5.94616574           Root MSE      =   1.963

------------------------------------------------------------------------------
           S |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      ASVABC |   .1328069   .0097389    13.64   0.000     .1136758     .151938
          SM |   .1235071   .0330837     3.73   0.000     .0585178    .1884963
       _cons |   5.420733   .4930224    10.99   0.000     4.452244    6.389222
------------------------------------------------------------------------------

The output above shows the corresponding regression using EAEF Data Set 21.

We will run the regression a second time, omitting SM. Before we do this, we will try to predict the direction of the bias in the coefficient of ASVABC.

$$E(b_2) = \beta_2 + \beta_3 \frac{\sum (X_{2i} - \bar{X}_2)(X_{3i} - \bar{X}_3)}{\sum (X_{2i} - \bar{X}_2)^2}$$

It is reasonable to suppose, as a matter of common sense, that β3 is positive. This assumption is strongly supported by the fact that its estimate in the multiple regression is positive and highly significant.

The correlation between ASVABC and SM is positive, so the numerator of the bias term must be positive. The denominator is automatically positive since it is a sum of squares and there is some variation in ASVABC. Hence the bias should be positive.

. cor SM ASVABC
(obs=540)

        |       SM   ASVABC
--------+------------------
      SM|   1.0000
  ASVABC|   0.4202   1.0000

Here is the regression omitting SM.

. reg S ASVABC

------------------------------------------------------------------------------
           S |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      ASVABC |    .148084   .0089431    16.56   0.000     .1305165    .1656516
       _cons |   6.066225   .4672261    12.98   0.000     5.148413    6.984036
------------------------------------------------------------------------------





As you can see, the coefficient of ASVABC is indeed higher when SM is omitted. Part of the difference may be due to pure chance, but part is attributable to the bias.

Here is the regression omitting ASVABC instead of SM. We would expect b3 to be upwards biased. We anticipate that β2 is positive and we know that both the numerator and the denominator of the other factor in the bias expression are positive.

$$E(b_3) = \beta_3 + \beta_2 \frac{\sum (\mathit{ASVABC}_i - \overline{\mathit{ASVABC}})(\mathit{SM}_i - \overline{\mathit{SM}})}{\sum (\mathit{SM}_i - \overline{\mathit{SM}})^2}$$

. reg S SM

------------------------------------------------------------------------------
           S |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          SM |   .3130793   .0348012     9.00   0.000     .2447165    .3814422
       _cons |   10.04688   .4147121    24.23   0.000     9.232226    10.86153
------------------------------------------------------------------------------



In this case the bias is quite dramatic. The coefficient of SM has more than doubled.

The reason for the bigger effect is that the variation in SM is much smaller than that in ASVABC, while β2 and β3 are similar in size, judging by their estimates (see the bias expression for b3 above).
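The size of the two biases can be traced back to the "mimic" slopes using only the reported coefficients. The sketch below is an illustration, not part of the original slides. It relies on a standard property of least squares, stated here as an assumption of the sketch: within a given sample, the simple regression coefficient equals the multiple regression coefficient plus the coefficient of the omitted variable times the slope from regressing the omitted variable on the included one. The variable names are hypothetical.

```python
# Coefficients copied from the Stata output above (EAEF Data Set 21).
b_asvabc_multiple = 0.1328069   # reg S ASVABC SM
b_sm_multiple = 0.1235071       # reg S ASVABC SM
b_asvabc_simple = 0.148084      # reg S ASVABC
b_sm_simple = 0.3130793         # reg S SM
r_sm_asvabc = 0.4202            # cor SM ASVABC

# Implied slope of SM on ASVABC: how well ASVABC mimics the omitted SM.
slope_sm_on_asvabc = (b_asvabc_simple - b_asvabc_multiple) / b_sm_multiple

# Implied slope of ASVABC on SM: how well SM mimics the omitted ASVABC.
slope_asvabc_on_sm = (b_sm_simple - b_sm_multiple) / b_asvabc_multiple

# Consistency check: the product of the two slopes is the squared correlation.
product = slope_sm_on_asvabc * slope_asvabc_on_sm

print("slope of SM on ASVABC:", round(slope_sm_on_asvabc, 3))
print("slope of ASVABC on SM:", round(slope_asvabc_on_sm, 3))
print("product vs r squared: ", round(product, 4), round(r_sm_asvabc ** 2, 4))
```

The second slope is more than ten times the first, which is the smaller variation in SM showing up in the denominator of the bias term. With coefficients of similar size, this is why omitting ASVABC distorts the coefficient of SM far more than omitting SM distorts the coefficient of ASVABC.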



Finally, we will investigate how R² behaves when a variable is omitted.

. reg S ASVABC SM

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  2,   537) =  147.36
       Model |  1135.67473     2  567.837363           Prob > F      =  0.0000
    Residual |  2069.30861   537  3.85346109           R-squared     =  0.3543
-------------+------------------------------           Adj R-squared =  0.3519
       Total |  3204.98333   539  5.94616574           Root MSE      =   1.963

. reg S ASVABC

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  1,   538) =  274.19
       Model |  1081.97059     1  1081.97059           Prob > F      =  0.0000
    Residual |  2123.01275   538  3.94612035           R-squared     =  0.3376
-------------+------------------------------           Adj R-squared =  0.3364
       Total |  3204.98333   539  5.94616574           Root MSE      =  1.9865

. reg S SM

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  1,   538) =   80.93
       Model |  419.086251     1  419.086251           Prob > F      =  0.0000
    Residual |  2785.89708   538  5.17824736           R-squared     =  0.1308
-------------+------------------------------           Adj R-squared =  0.1291
       Total |  3204.98333   539  5.94616574           Root MSE      =  2.2756

In the simple regression of S on ASVABC, R² is 0.34, and in the simple regression of S on SM it is 0.13.

Does this imply that ASVABC explains 34% of the variance in S and SM 13%? No, because the multiple regression reveals that their joint explanatory power is 0.35, not 0.47.

In the second regression, ASVABC is partly acting as a proxy for SM, and this inflates its apparent explanatory power. Similarly, in the third regression, SM is partly acting as a proxy for ASVABC, again inflating its apparent explanatory power.



However, it is also possible for omitted variable bias to lead to a reduction in the apparent explanatory power of a variable. This will be demonstrated using a simple earnings function model, supposing the logarithm of hourly earnings to depend on S and EXP.

$$\mathit{LGEARN} = \beta_1 + \beta_2 S + \beta_3 \mathit{EXP} + u$$

. reg LGEARN S EXP

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  2,   537) =  100.86
       Model |  50.9842581     2   25.492129           Prob > F      =  0.0000
    Residual |  135.723385   537  .252743734           R-squared     =  0.2731
-------------+------------------------------           Adj R-squared =  0.2704
       Total |  186.707643   539   .34639637           Root MSE      =  .50274

------------------------------------------------------------------------------
      LGEARN |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |   .1235911   .0090989    13.58   0.000     .1057173     .141465
         EXP |   .0350826   .0050046     7.01   0.000     .0252515    .0449137
       _cons |   .5093196   .1663823     3.06   0.002     .1824796    .8361596
------------------------------------------------------------------------------

If we omit EXP from the regression, the coefficient of S should be subject to a downward bias. β3 is likely to be positive. The numerator of the other factor in the bias term is negative since S and EXP are negatively correlated. The denominator is positive.

. cor S EXP
(obs=540)

        |        S      EXP
--------+------------------
       S|   1.0000
     EXP|  -0.2179   1.0000

For the same reasons, the coefficient of EXP in a simple regression of LGEARN on EXP should be downwards biased.

. reg LGEARN S EXP

------------------------------------------------------------------------------
      LGEARN |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |   .1235911   .0090989    13.58   0.000     .1057173     .141465
         EXP |   .0350826   .0050046     7.01   0.000     .0252515    .0449137
       _cons |   .5093196   .1663823     3.06   0.002     .1824796    .8361596
------------------------------------------------------------------------------

. reg LGEARN S

------------------------------------------------------------------------------
      LGEARN |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |   .1096934   .0092691    11.83   0.000     .0914853    .1279014
       _cons |   1.292241   .1287252    10.04   0.000     1.039376    1.545107
------------------------------------------------------------------------------

. reg LGEARN EXP

------------------------------------------------------------------------------
      LGEARN |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         EXP |   .0202708   .0056564     3.58   0.000     .0091595     .031382
       _cons |    2.44941   .0988233    24.79   0.000     2.255284    2.643537
------------------------------------------------------------------------------

As can be seen, the coefficients of S and EXP are indeed lower in the simple regressions.



. reg LGEARN S EXP

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  2,   537) =  100.86
       Model |  50.9842581     2   25.492129           Prob > F      =  0.0000
    Residual |  135.723385   537  .252743734           R-squared     =  0.2731
-------------+------------------------------           Adj R-squared =  0.2704
       Total |  186.707643   539   .34639637           Root MSE      =  .50274

. reg LGEARN S

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  1,   538) =  140.05
       Model |  38.5643833     1  38.5643833           Prob > F      =  0.0000
    Residual |   148.14326   538  .275359219           R-squared     =  0.2065
-------------+------------------------------           Adj R-squared =  0.2051
       Total |  186.707643   539   .34639637           Root MSE      =  .52475

. reg LGEARN EXP

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  1,   538) =   12.84
       Model |  4.35309315     1  4.35309315           Prob > F      =  0.0004
    Residual |   182.35455   538  .338948978           R-squared     =  0.0233
-------------+------------------------------           Adj R-squared =  0.0215
       Total |  186.707643   539   .34639637           Root MSE      =  .58219

A comparison of R² for the three regressions shows that the sum of R² in the simple regressions is actually less than R² in the multiple regression. This is because the apparent explanatory power of S in the second regression has been undermined by the downwards bias in its coefficient. The same is true for the apparent explanatory power of EXP in the third equation.
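The two R² comparisons in this sequence can be set side by side with a few lines of arithmetic. This is only an illustrative sketch using the values reported in the Stata output; the dictionary layout and names are assumptions.

```python
# R-squared values copied from the Stata output in this sequence.
cases = {
    "S on ASVABC and SM": {"multiple": 0.3543, "simple": [0.3376, 0.1308]},
    "LGEARN on S and EXP": {"multiple": 0.2731, "simple": [0.2065, 0.0233]},
}

gaps = {}
for name, r2 in cases.items():
    total_simple = sum(r2["simple"])
    # Positive gap: the simple regressions overstate the joint explanatory power.
    # Negative gap: they understate it.
    gaps[name] = total_simple - r2["multiple"]
    print(f"{name}: sum of simple R2 = {total_simple:.4f}, "
          f"multiple R2 = {r2['multiple']:.4f}, gap = {gaps[name]:+.4f}")
```

In the first case the explanatory variables are positively correlated (0.42) and the simple R² values add up to more than the multiple R². In the second case they are negatively correlated (-0.22) and the sum falls short of it.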


VARIABLE MISSPECIFICATION II: INCLUSION OF AN IRRELEVANT VARIABLE

Now we will investigate the effects of including an irrelevant variable in a regression model. These are different from those of omitted variable misspecification. In this case the coefficients in general remain unbiased, but they are inefficient. The standard errors remain valid, but are needlessly large.

Consequences of Variable Misspecification

Fitted model \ True model | Y = β1 + β2X2 + u                       | Y = β1 + β2X2 + β3X3 + u
--------------------------+-----------------------------------------+--------------------------------------
Ŷ = b1 + b2X2             | Correct specification, no problems      | Coefficients are biased (in general).
                          |                                         | Standard errors are invalid.
Ŷ = b1 + b2X2 + b3X3      | Coefficients are unbiased (in general), | Correct specification, no problems
                          | but inefficient. Standard errors are    |
                          | valid (in general).                     |


True model:
$$Y = \beta_1 + \beta_2 X_2 + u$$

Fitted model:
$$\hat{Y} = b_1 + b_2 X_2 + b_3 X_3$$

True model rewritten:
$$Y = \beta_1 + \beta_2 X_2 + 0 X_3 + u$$

Rewrite the true model adding X3 as an explanatory variable, with a coefficient of 0. Now the true model and the fitted model coincide. Hence b2 will be an unbiased estimator of β2 and b3 will be an unbiased estimator of 0.



However, the variance of b2 will be larger than it would have been if the correct simple regression had been run, because it includes the factor 1 / (1 − r²), where r is the correlation between X2 and X3.

$$\sigma_{b_2}^2 = \frac{\sigma_u^2}{\sum (X_{2i} - \bar{X}_2)^2} \times \frac{1}{1 - r_{X_2,X_3}^2}$$



• The estimator b2 using the multiple regression model will therefore be less efficient than the alternative using the simple regression model.

• The intuitive reason for this is that the simple regression model exploits the information that X3 should not be in the regression, while with the multiple regression model you find this out from the regression results.

• The standard errors remain valid, because the model is formally correctly specified, but they will tend to be larger than those obtained in a simple regression, reflecting the loss of efficiency.

• These are the results in general. Note that if X2 and X3 happen to be uncorrelated, there will be no loss of efficiency after all.
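These properties can be illustrated with a small Monte Carlo sketch. It is not part of the original slides: the parameter values, sample size, seed, and helper names are assumptions. The true model contains only X2, and the sketch compares the estimator of β2 from the correct simple regression with the one from a multiple regression that also includes an irrelevant but correlated X3.

```python
import random

def deviations(v):
    """Deviations of a list of values about their mean."""
    m = sum(v) / len(v)
    return [a - m for a in v]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def mean(v):
    return sum(v) / len(v)

def variance(v):
    m = mean(v)
    return sum((a - m) ** 2 for a in v) / (len(v) - 1)

rng = random.Random(2024)        # fixed seed so the sketch is reproducible
beta1, beta2 = 1.0, 2.0          # hypothetical true parameters; X3 has coefficient 0
n, replications = 40, 2000

# Fixed regressors: X3 is irrelevant for Y but correlated with X2.
x2 = [rng.gauss(0, 1) for _ in range(n)]
x3 = [0.8 * a + 0.6 * rng.gauss(0, 1) for a in x2]
d2, d3 = deviations(x2), deviations(x3)
s22, s33, s23 = dot(d2, d2), dot(d3, d3), dot(d2, d3)
r_squared = s23 ** 2 / (s22 * s33)   # squared sample correlation of X2 and X3

simple, multiple = [], []
for _ in range(replications):
    y = [beta1 + beta2 * a + rng.gauss(0, 1) for a in x2]
    dy = deviations(y)
    s2y, s3y = dot(d2, dy), dot(d3, dy)
    simple.append(s2y / s22)                                           # Y on X2
    multiple.append((s2y * s33 - s3y * s23) / (s22 * s33 - s23 ** 2))  # Y on X2, X3

variance_ratio = variance(multiple) / variance(simple)
theoretical_ratio = 1 / (1 - r_squared)

print(f"mean b2: simple = {mean(simple):.4f}, multiple = {mean(multiple):.4f}")
print(f"variance ratio = {variance_ratio:.3f}, 1/(1 - r^2) = {theoretical_ratio:.3f}")
```

Both averages should lie close to the assumed β2, while the variance of the multiple regression estimator should exceed that of the simple regression estimator by roughly the factor 1 / (1 − r²).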



. reg LGFDHO LGEXP LGSIZE

  Source |       SS       df       MS                  Number of obs =     868
---------+------------------------------               F(  2,   865) =  460.92
   Model |  138.776549     2  69.3882747               Prob > F      =  0.0000
Residual |  130.219231   865  .150542464               R-squared     =  0.5159
---------+------------------------------               Adj R-squared =  0.5148
   Total |  268.995781   867  .310260416               Root MSE      =    .388

------------------------------------------------------------------------------
  LGFDHO |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
   LGEXP |   .2866813   .0226824     12.639   0.000       .2421622    .3312003
  LGSIZE |   .4854698   .0255476     19.003   0.000       .4353272    .5356124
   _cons |   4.720269   .2209996     21.359   0.000       4.286511    5.154027
------------------------------------------------------------------------------

The analysis will be illustrated using a regression of LGFDHO, the logarithm of annual household expenditure on food eaten at home, on LGEXP, the logarithm of total annual household expenditure, and LGSIZE, the logarithm of the number of persons in the household.

The source of the data was the 1995 US Consumer Expenditure Survey. The sample size was 868.



. reg LGFDHO LGEXP LGSIZE LGHOUS

------------------------------------------------------------------------------
  LGFDHO |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
   LGEXP |   .2673552   .0370782      7.211   0.000       .1945813     .340129
  LGSIZE |   .4868228   .0256383     18.988   0.000       .4365021    .5371434
  LGHOUS |   .0229611   .0348408      0.659   0.510      -.0454214    .0913436
   _cons |   4.708772   .2217592     21.234   0.000       4.273522    5.144022
------------------------------------------------------------------------------

Now add LGHOUS, the logarithm of annual expenditure on housing services. It is safe to assume that LGHOUS is an irrelevant variable and, not surprisingly, its coefficient is not significantly different from zero.






It is however highly correlated with LGEXP (correlation coefficient 0.81), and also, to a lesser extent, with LGSIZE (correlation coefficient 0.33).

. cor LGHOUS LGEXP LGSIZE
(obs=869)

        |   LGHOUS    LGEXP   LGSIZE
--------+---------------------------
  LGHOUS|   1.0000
   LGEXP|   0.8137   1.0000
  LGSIZE|   0.3256   0.4491   1.0000







Its inclusion does not cause the coefficients of those variables to be biased.







But it does increase their standard errors, particularly that of LGEXP, as you would expect, reflecting the loss of efficiency.
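As a rough check, the increase in the standard error of LGEXP can be compared with what the factor 1 / (1 − r²) would suggest. This sketch is only indicative and is not part of the original slides: the variance formula given earlier applies to the comparison of a simple regression with a two-regressor model, whereas here LGSIZE is also present and the estimate of the disturbance variance changes slightly between the two regressions. The variable names are assumptions.

```python
import math

# Values copied from the Stata output above.
se_lgexp_without = 0.0226824   # reg LGFDHO LGEXP LGSIZE
se_lgexp_with = 0.0370782      # reg LGFDHO LGEXP LGSIZE LGHOUS
r_lgexp_lghous = 0.8137        # cor LGHOUS LGEXP

# Observed inflation of the standard error of LGEXP.
observed_ratio = se_lgexp_with / se_lgexp_without

# Two-regressor approximation: the variance is inflated by 1 / (1 - r^2),
# so the standard error is inflated by the square root of that factor.
approx_ratio = math.sqrt(1 / (1 - r_lgexp_lghous ** 2))

print(f"observed ratio = {observed_ratio:.3f}, approximation = {approx_ratio:.3f}")
```

The two figures are of the same order, which is consistent with the high correlation between LGEXP and LGHOUS being the main source of the efficiency loss.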



PROXY VARIABLES

Suppose that a variable Y is hypothesized to depend on a set of explanatory variables X2, ..., Xk as shown below, and suppose that for some reason there are no data on X2.

$$Y = \beta_1 + \beta_2 X_2 + \beta_3 X_3 + \dots + \beta_k X_k + u$$

As we have seen, a regression of Y on X3, ..., Xk would yield biased estimates of the coefficients and invalid standard errors and tests.


Sometimes, however, these problems can be reduced or eliminated by using a proxy variable in the place of X2. A proxy variable is one that is hypothesized to be linearly related to the missing variable. In the present example, Z could act as a proxy for X2:

$$X_2 = \lambda + \mu Z$$

The validity of the proxy relationship must be justified on the basis of theory, common sense, or experience. It cannot be checked directly because there are no data on X2.


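If a suitable proxy has been identified, the regression model can be rewritten as shown.

$$
\begin{aligned}
Y &= \beta_1 + \beta_2 (\lambda + \mu Z) + \beta_3 X_3 + \dots + \beta_k X_k + u \\
&= (\beta_1 + \beta_2 \lambda) + \beta_2 \mu Z + \beta_3 X_3 + \dots + \beta_k X_k + u
\end{aligned}
$$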


We thus obtain a model with all variables observable. If the proxy relationship is an exact one, and we fit this relationship, most of the regression results will be rescued.

1. The estimates of the coefficients of X3, ..., Xk will be the same as those that would have been obtained if it had been possible to regress Y on X2, ..., Xk.

2. The standard errors and t statistics of the coefficients of X3, ..., Xk will be the same as those that would have been obtained if it had been possible to regress Y on X2, ..., Xk.

3. R² will be the same as it would have been if it had been possible to regress Y on X2, ..., Xk.

4. The coefficient of Z will be an estimate of β2μ, and so it will not be possible to obtain an estimate of β2, unless you are able to guess the value of μ.

5. However, the t statistic for Z will be the same as that which would have been obtained for X2 if it had been possible to regress Y on X2, ..., Xk, and so you are able to assess the significance of X2, even if you are not able to estimate its coefficient.


6. It will not be possible to obtain an estimate of β1 since the intercept in the revised model is (β1 + β2λ), but usually β1 is of relatively little interest anyway.

It is more realistic to hypothesize that the relation between X2 and Z is approximate, rather than exact. In that case the results listed above will hold approximately.

However, if Z is a poor proxy for X2, the results will effectively be subject to measurement error. Further, it is possible that some of the other X variables will try to act as proxies for X2, and there will still be a problem of omitted variable bias.
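The exact-proxy results for the coefficients can be verified on simulated data. The sketch below is an illustration only, not part of the original slides: the values of λ, μ and the β coefficients, the seed, and the helper names are assumptions, and the model is restricted to two explanatory variables so that the regression can be computed with simple formulas. It compares the infeasible regression of Y on X2 and X3 with the feasible regression of Y on Z and X3.

```python
import random

def deviations(v):
    m = sum(v) / len(v)
    return [a - m for a in v]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def two_regressor_fit(x_a, x_b, y):
    """Intercept and slopes from regressing y on x_a and x_b."""
    da, db, dy = deviations(x_a), deviations(x_b), deviations(y)
    saa, sbb, sab = dot(da, da), dot(db, db), dot(da, db)
    say, sby = dot(da, dy), dot(db, dy)
    det = saa * sbb - sab ** 2
    slope_a = (say * sbb - sby * sab) / det
    slope_b = (sby * saa - say * sab) / det
    n = len(y)
    intercept = sum(y) / n - slope_a * sum(x_a) / n - slope_b * sum(x_b) / n
    return intercept, slope_a, slope_b

rng = random.Random(7)           # fixed seed so the sketch is reproducible
n = 100
lam, mu = 4.0, 0.5               # hypothetical proxy relationship X2 = lam + mu * Z

z = [rng.gauss(10, 2) for _ in range(n)]
x2 = [lam + mu * zi for zi in z]                  # exact proxy; X2 unobserved in practice
x3 = [0.3 * zi + rng.gauss(0, 1) for zi in z]
y = [1.0 + 2.0 * a + 1.5 * b + rng.gauss(0, 1) for a, b in zip(x2, x3)]

b1, b2, b3 = two_regressor_fit(x2, x3, y)                         # infeasible: Y on X2, X3
intercept_proxy, b_z, b3_proxy = two_regressor_fit(z, x3, y)      # feasible: Y on Z, X3

print(f"coefficient of X3: {b3:.6f} (with X2) vs {b3_proxy:.6f} (with Z)")
print(f"coefficient of Z:  {b_z:.6f} vs mu * b2 = {mu * b2:.6f}")
print(f"intercept:         {intercept_proxy:.6f} vs b1 + b2 * lam = {b1 + b2 * lam:.6f}")
```

With an exact proxy the coefficient of X3 is unchanged, the coefficient of Z is μ times the coefficient that X2 would have received, and the intercept absorbs the term in λ, in line with points 1, 4 and 6. The sketch does not compute standard errors, so points 2, 3 and 5 are not checked here.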



The use of a proxy variable will be illustrated with an educational attainment model. We will suppose that educational attainment depends jointly on cognitive ability and family background.

As usual, ASVABC will be used as the measure of cognitive ability. However, there is no ‘family background’ variable in the data set. Indeed, it is difficult to conceive how such a variable might be defined.

$$S = \beta_1 + \beta_2 \mathit{ASVABC} + \beta_3 \mathit{INDEX} + u$$


Instead, we will try to find a proxy. One obvious variable is the mother's educational attainment, SM. However, father's educational attainment, SF, may also be relevant. So we will hypothesize that the family background index depends on both.

$$\mathit{INDEX} = \delta_1 \mathit{SM} + \delta_2 \mathit{SF}$$


Thus we obtain a relationship expressing S as a function of ASVABC, SM, and SF.

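$$
\begin{aligned}
S &= \beta_1 + \beta_2 \mathit{ASVABC} + \beta_3 (\delta_1 \mathit{SM} + \delta_2 \mathit{SF}) + u \\
&= \beta_1 + \beta_2 \mathit{ASVABC} + \beta_3 \delta_1 \mathit{SM} + \beta_3 \delta_2 \mathit{SF} + u
\end{aligned}
$$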


Here is the corresponding regression using EAEF Data Set 21.

. reg S ASVABC SM SF

------------------------------------------------------------------------------
           S |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      ASVABC |   .1257087   .0098533    12.76   0.000     .1063528    .1450646
          SM |   .0492424   .0390901     1.26   0.208     -.027546    .1260309
          SF |   .1076825   .0309522     3.48   0.001       .04688    .1684851
       _cons |   5.370631   .4882155    11.00   0.000      4.41158    6.329681
------------------------------------------------------------------------------


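Here is the regression of S on ASVABC alone.

. reg S ASVABC

------------------------------------------------------------------------------
           S |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      ASVABC |    .148084   .0089431    16.56   0.000     .1305165    .1656516
       _cons |   6.066225   .4672261    12.98   0.000     5.148413    6.984036
------------------------------------------------------------------------------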

A comparison of the regressions indicates that the coefficient of ASVABC is biased upwards if we make no attempt to control for family background.


This is what we should expect. Both SM and SF are likely to have positive effects on educational attainment, and they are both positively correlated with ASVABC.

. cor ASVABC SM SF
(obs=570)

        |   ASVABC       SM       SF
--------+---------------------------
  ASVABC|   1.0000
      SM|   0.4202   1.0000
      SF|   0.4090   0.6241   1.0000
