1 Outline.
1. Motivation.
2. Variance of ols.
3. Estimation and testing.
4. Asymptotic theory of ols with conditional heteroskedasticity.
2 Motivation.
• In this section, we want to drop the assumption that the error term is iid.
• This assumption may not make sense in many contexts.
• For example, suppose that we were estimating a production function by pooling data on outputs and inputs for $i = 1, ..., N$ firms in an industry:
$$q_i = \beta_0 + \beta_1 k_i + \beta_2 l_i + \omega_i$$
• $q_i$: output (often measured in value added), $k_i$: capital, $l_i$: labor, $\omega_i$: productivity shock.
• In many industries, the size distribution of firms is highly skewed.
• Ex. In construction, there are a large number of small firms with no payroll and a handful of extremely large firms (such as Bechtel) who engage in extremely large, complicated projects.
• It would be silly to assume that the variance of Bechtel's productivity shock is the same as that of a contractor who does small remodeling jobs.
• Remark: you might ask what framework for measurement justifies pooling these very distinct firms into the same regression!
• Thus, it is important to consider the case that $Var(y|X) = \Omega_0 \neq \sigma_0^2 I$.
• The following properties of our ols estimator remain unchanged: unbiasedness, consistency, normality, asymptotic normality.
• None of these properties relied on the assumption of a scalar variance covariance matrix.
• The following properties are changed: the variance matrix, estimation of the variance matrix, the distribution of pivotal statistics, and the relative efficiency of ols.
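• In this chapter, we shall assume that
$$Var(y|X) = \begin{pmatrix} \sigma_1^2 & 0 & 0 & 0 & 0 \\ 0 & \sigma_2^2 & 0 & 0 & 0 \\ 0 & 0 & \sigma_3^2 & 0 & 0 \\ 0 & 0 & 0 & \ddots & 0 \\ 0 & 0 & 0 & 0 & \sigma_N^2 \end{pmatrix}$$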
• This is called conditional heteroskedasticity.
• In Chapter 19, we consider serial correlation of elements.
• As a practical matter, it is common practice to adjust standard errors in regression results (e.g. calculate robust standard errors).
3 Variance of Least Squares.
• Consider the variance of our ols estimator conditional on $X$:
$$Var(\hat\beta|X) = Var\left((X'X)^{-1}X'y \,|\, X\right) = (X'X)^{-1}X'\,Var(y|X)\,X(X'X)^{-1} = (X'X)^{-1}X'\Omega_0 X(X'X)^{-1}$$
• Obviously, we should expect $s^2(X'X)^{-1}$ to be a biased estimator of the variance.
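The gap between the two formulas can be computed exactly for a given design. The following is a minimal sketch, assuming a hypothetical design (an intercept plus one regressor on a grid) and a hypothetical variance pattern $\sigma_n^2 = x_n^2 + 0.01$; neither comes from the notes. It compares the sandwich formula with $E[s^2|X](X'X)^{-1}$, using $E[(y-\hat\mu)'(y-\hat\mu)|X] = tr((I - P_X)\Omega_0)$.

```python
import numpy as np

def sandwich_variance(X, omega_diag):
    """Exact Var(beta_hat | X) = (X'X)^{-1} X' Omega X (X'X)^{-1}, Omega diagonal."""
    XtX_inv = np.linalg.inv(X.T @ X)
    meat = X.T @ (omega_diag[:, None] * X)  # X' Omega X without building the N x N matrix
    return XtX_inv @ meat @ XtX_inv

def expected_naive_variance(X, omega_diag):
    """E[s^2 | X] (X'X)^{-1}, using E[e'e | X] = trace((I - P_X) Omega)."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    leverage = np.einsum("ij,jk,ik->i", X, XtX_inv, X)  # diagonal of P_X
    expected_s2 = np.sum(omega_diag * (1.0 - leverage)) / (n - k)
    return expected_s2 * XtX_inv

# Hypothetical design (an assumption for illustration only).
x = np.linspace(-1.0, 1.0, 201)
X = np.column_stack([np.ones_like(x), x])
omega_diag = x ** 2 + 0.01  # variance grows with |x|

V_true = sandwich_variance(X, omega_diag)
V_naive = expected_naive_variance(X, omega_diag)
slope_ratio = V_true[1, 1] / V_naive[1, 1]
print(f"true / naive slope variance: {slope_ratio:.2f}")
```

In this design the naive formula understates the slope variance because the high variance observations are also the high leverage ones. With a constant variance the two functions return the same matrix.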
• Next note that:
$$Var[y - \hat\mu|X] = Var[(I - P_X)y|X] = (I - P_X)\,Var[y|X]\,(I - P_X) = (I - P_X)\Omega_0(I - P_X)$$
• The above formula suggests that our distribution theory for $s^2$ falls apart.
• It is no longer going to be possible to show that $\frac{(y-\hat\mu)'(y-\hat\mu)}{\sigma_0^2}$ follows a $\chi^2$ distribution.
• The test statistics previously derived will no longer hold because our variance matrix is no longer a scalar multiple of the identity.
• It is easy to show that ols is no longer going to
be an efficient estimator.
• For simplicity, consider the case that the only regressor is a constant, $x_n = 1$ for all $n$.
• Also, let the errors be normally distributed and heteroskedastic as follows:
$$var(y_n) = \begin{cases} \sigma_{01}^2 & \text{if } 1 \le n \le N_1 \\ \sigma_{02}^2 & \text{if } N_1 < n \le N \end{cases} \qquad \sigma_{01}^2 < \sigma_{02}^2$$
• Recall that the ols estimator is just the sample mean, so that:
$$Var(\hat\beta|X) = \frac{\sigma_{01}^2 N_1 + \sigma_{02}^2 (N - N_1)}{N^2}$$
• Consider an alternative estimator where we just use the first $N_1$ observations, so that
$$\tilde\beta = \frac{1}{N_1}\sum_{n=1}^{N_1} y_n, \qquad var(\tilde\beta|X) = \frac{\sigma_{01}^2}{N_1}$$
• Obviously, if $\sigma_{01}^2$ is small enough relative to $\sigma_{02}^2$, then $var(\tilde\beta|X) < Var(\hat\beta|X)$.
• This example demonstrates that with heteroskedasticity, ols is no longer efficient.
• Moreover, this example suggests that we may wish to overweight observations with a lower variance in order to construct more efficient estimators.
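Rearranging the two variance formulas above shows that the subsample mean beats ols exactly when $\sigma_{01}^2/\sigma_{02}^2 < N_1/(N+N_1)$. The sketch below checks this with hypothetical numbers ($N = 100$, $N_1 = 50$, $\sigma_{01}^2 = 1$, $\sigma_{02}^2 = 4$, all chosen for illustration). It also evaluates an inverse variance weighted mean, which is not derived in the notes at this point and is included only as an illustration of the "overweight low variance observations" idea.

```python
def var_full_sample_mean(s1, s2, n1, n):
    """Variance of the ols estimator (the sample mean over all n observations)."""
    return (s1 * n1 + s2 * (n - n1)) / n ** 2

def var_subsample_mean(s1, n1):
    """Variance of the mean of the first n1 (low variance) observations."""
    return s1 / n1

def var_inverse_variance_weighted(s1, s2, n1, n):
    """Variance of the mean that weights each observation by 1 / variance."""
    return 1.0 / (n1 / s1 + (n - n1) / s2)

# Hypothetical numbers, chosen only for illustration.
n, n1 = 100, 50
s1, s2 = 1.0, 4.0  # sigma_01^2 and sigma_02^2

v_ols = var_full_sample_mean(s1, s2, n1, n)
v_sub = var_subsample_mean(s1, n1)
v_wls = var_inverse_variance_weighted(s1, s2, n1, n)
print(v_ols, v_sub, v_wls)
```

Here $\sigma_{01}^2/\sigma_{02}^2 = 1/4 < 1/3 = N_1/(N+N_1)$, so the subsample mean (variance 0.020) beats ols (variance 0.025), and the weighted mean (variance 0.016) beats both.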
3.1 Testing for Conditional Heteroskedasticity.
• Following the previous example, you could partition the sample into two groups $1 \le n \le N_1$ and $N_1 < n \le N$.
• We want to test the null hypothesis that $\sigma_{01}^2 = \sigma_{02}^2$.
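• Define $s_1$ and $s_2$ as:
$$s_1 = \frac{(y_1 - \hat\mu_1)'(y_1 - \hat\mu_1)}{\sigma_{01}^2}, \qquad s_2 = \frac{(y_2 - \hat\mu_2)'(y_2 - \hat\mu_2)}{\sigma_{02}^2}$$
• Under the null hypothesis of homoskedasticity, and under normality, the common variance cancels in the ratio and
$$\frac{s_1/(N_1-1)}{s_2/(N-N_1-1)} \sim F_{N_1-1,\,N-N_1-1}$$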
• Note that many of the tests described in the chapter require the assumption of normality, which is not particularly attractive for applied work.
• This idea can be generalized into the Goldfeld-Quandt F-Test.
• Suppose we are running a multiple regression, instead of just using an intercept as above.
• Suppose that one wishes to test the null hypothesis of homoskedasticity against the alternative that $\sigma_n^2$ increases with $z_n$.
• Reorder the observations from the highest zn to
the lowest zn.
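• Choose a point $N_1$ and compute $s_1$ and $s_2$ as above, using the residuals from separate regressions on each subsample.
• We need to adjust the degrees of freedom, but it can be shown that under the null
$$\frac{s_1/(N_1-K)}{s_2/(N-N_1-K)} \sim F_{N_1-K,\,N-N_1-K}$$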
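The Goldfeld-Quandt steps above can be sketched as follows. This is a minimal illustration, not the text's implementation: the function names, the simulated data and the split point are all assumptions, and the statistic is the ratio of residual sums of squares, each divided by its degrees of freedom.

```python
import numpy as np

def rss(y, X):
    """Residual sum of squares from an ols fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def goldfeld_quandt(y, X, z, n1):
    """F statistic comparing the n1 highest-z observations with the rest."""
    order = np.argsort(-z)  # reorder from the highest z to the lowest z
    y, X = y[order], X[order]
    n, k = X.shape
    df1, df2 = n1 - k, n - n1 - k
    f_stat = (rss(y[:n1], X[:n1]) / df1) / (rss(y[n1:], X[n1:]) / df2)
    return f_stat, df1, df2

# Simulated data (an assumption): the error standard deviation is proportional to z.
rng = np.random.default_rng(0)
n = 200
z = np.linspace(0.5, 3.0, n)
X = np.column_stack([np.ones(n), z])
y = 1.0 + 2.0 * z + z * rng.standard_normal(n)

F, df1, df2 = goldfeld_quandt(y, X, z, n1=100)
print(F, df1, df2)
```

A value of `F` well above the upper critical value of the $F_{df_1, df_2}$ distribution is evidence against homoskedasticity. As noted above, the exact F distribution relies on normality.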
• The text also discusses the Breusch-Pagan Score Test.
• Assume
$$y_n|(x_n, z_n) \sim N(x_n'\beta_0,\; \gamma_{01} + z_{2n}'\gamma_{02})$$
• Test the null hypothesis that $\gamma_{02} = 0$.
• The formal test statistic is stated in the text, but let's talk about an "intuitive" derivation.
• Let $w_n(\hat\beta) = (y_n - \hat\mu_{OLS,n})^2$.
• Loosely speaking, the squared fitted residuals are estimators of $\sigma_{0n}^2$, $n = 1, ..., N$.
• Suppose that we were at the "limit" and that $\beta_0$ were effectively observable.
• Then $(y_n - x_n'\beta_0)^2$ would be distributed as $\sigma_{0n}^2\chi_1^2$, which has a mean of $\sigma_{0n}^2$ and a variance of $2\sigma_{0n}^4$.
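• Then, under our modeling assumptions (writing $[\sigma_{0n}^2]$ for the $N$-vector of variances):
$$E(w(\beta_0)|X,Z) = [\sigma_{0n}^2] = Z\gamma_0$$
$$Var(w(\beta_0)|X,Z) = 2\,diag\left[(\sigma_{0n}^2)^2\right] = 2\,diag\left[(z_n'\gamma_0)^2\right]$$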
• Recall that an ols regression gives us the conditional mean.
• Thus, we could do an ols regression on the above equation and test the null hypothesis that $\gamma_{02} = 0$.
• Let our estimates of the regression parameters be:
$$\hat\gamma = (Z'Z)^{-1}Z'w(\beta_0)$$
$$\hat w^2 = ((I - P_Z)w(\beta_0))'((I - P_Z)w(\beta_0)) = w(\beta_0)'(I - P_Z)w(\beta_0)$$
• Remark: recall that projection matrices are idempotent and symmetric.
• Under the null hypothesis, the Wald Statistic would be:
$$W = \frac{\hat\gamma_2' Z_{2\perp 1}' Z_{2\perp 1}\hat\gamma_2}{\hat w^2}$$
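• The above uses our partitioned regression formulas:
$$\hat\gamma_2 = \left(Z_{2\perp 1}' Z_{2\perp 1}\right)^{-1} Z_{2\perp 1}' w(\beta_0)$$
• This generates the variance formula:
$$Var(\hat\gamma_2) = Var[w_{0n}|Z]\left(Z_{2\perp 1}' Z_{2\perp 1}\right)^{-1}$$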
• This gives us the weighting matrix in the numerator of $W$.
• Replacing $\beta_0$ with $\hat\beta$, this is almost the Breusch-Pagan Score Test.
• Remark: the usefulness is also limited by the normality assumption.
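The intuitive derivation above can be sketched in code. The version below is one common textbook form of the Breusch-Pagan statistic, and it may differ in detail from the statistic stated in the text: regress the scaled squared ols residuals $w_n(\hat\beta)/\hat\sigma^2$ on $z_n$ and take half the explained sum of squares. The factor one half comes from $Var[w_{0n}] = 2\sigma^4$ under the null and normality. The function name and the simulated data are assumptions.

```python
import numpy as np

def breusch_pagan_lm(y, X, Z):
    """LM statistic; Z must contain a constant in its first column."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    w = (y - X @ beta) ** 2       # squared ols residuals, w_n(beta_hat)
    g = w / w.mean()              # scale by the estimated common variance
    gamma, *_ = np.linalg.lstsq(Z, g, rcond=None)
    fitted = Z @ gamma
    ess = float(np.sum((fitted - g.mean()) ** 2))  # explained sum of squares
    dof = Z.shape[1] - 1          # number of restrictions in gamma_02 = 0
    return ess / 2.0, dof

# Simulated data (an assumption): the variance is linear in z, as in the model above.
rng = np.random.default_rng(0)
n = 500
z = np.linspace(0.0, 2.0, n)
X = np.column_stack([np.ones(n), z])
Z = np.column_stack([np.ones(n), z])
y = 1.0 + 2.0 * z + np.sqrt(0.5 + 2.0 * z) * rng.standard_normal(n)

lm, dof = breusch_pagan_lm(y, X, Z)
print(lm, dof)
```

Under the null and normality the statistic is compared with a $\chi^2$ distribution with `dof` degrees of freedom. For one restriction the 5% critical value is about 3.84.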
4 Adjustments to ols
• The assumption of homoskedasticity is strong, and it is common to adjust standard errors to allow for conditional heteroskedasticity in applied studies.
• The above example encouraged us to think of $w_n(\beta_0)$ as an "estimate" of $\sigma_n^2$.
• Of course, since we do not have repeated samples per $n$, we will not be able to learn about each individual $\sigma_n^2$.
• However, many of the objects that we wish to learn about, such as $Var(\hat\beta|X)$, are of fixed dimension $K \times K$, which does not change with the sample size.
• One can sometimes show that the errors in thinking about $w_n(\beta_0)$ as an "estimate" of $\sigma_n^2$ will average out.
• White demonstrated that:
$$\frac{1}{N}X'\,diag\left[(y_n - \hat\mu_n)^2\right]X - \frac{1}{N}X'\Omega_0 X \to_p 0$$
• Hence it is possible to estimate $Var(\hat\beta|X)$ as:
$$Var(\hat\beta|X) = (X'X)^{-1}X'\Omega_0 X(X'X)^{-1}$$
$$\widehat{Var}(\hat\beta|X) = (X'X)^{-1}X'\,diag\left[(y_n - \hat\mu_n)^2\right]X(X'X)^{-1}$$
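The estimator above translates almost line by line into NumPy. This is a minimal sketch with simulated data (the design and variance pattern are assumptions). It applies no finite sample correction, so it corresponds to the plain formula above rather than to any particular software default.

```python
import numpy as np

def ols_with_white_cov(y, X):
    """ols coefficients and the heteroskedasticity-robust covariance estimate."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    # X' diag(resid^2) X, computed without forming the N x N diagonal matrix
    meat = X.T @ (resid[:, None] ** 2 * X)
    cov = XtX_inv @ meat @ XtX_inv
    return beta, cov

# Simulated data (an assumption): the error standard deviation grows with x.
rng = np.random.default_rng(0)
n = 300
x = np.linspace(0.0, 2.0, n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + (0.2 + x) * rng.standard_normal(n)

beta_hat, cov_hat = ols_with_white_cov(y, X)
robust_se = np.sqrt(np.diag(cov_hat))
print(beta_hat, robust_se)
```

The square roots of the diagonal of `cov_hat` are the "robust" standard errors.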
4.1 WLS/GLS
• Next, we describe how to generate more efficient estimates of our linear model.
• The previous example demonstrated that ols can be inefficient.
• The reason that our proof for the efficiency of ols breaks down is that the error distribution is no longer spherical.
• The previous example also demonstrated that you may want to overweight low variance observations and underweight high variance observations.
• For example, suppose that the variance matrix is diagonal.
• Let's reweight our observations using the inverse of the standard deviations, i.e.
$$y_n^* = \frac{y_n}{\sigma_{0n}}, \qquad x_n^* = \frac{x_n}{\sigma_{0n}}$$
• Note that $var[y_n^*|x_n^*] = 1$ so that we are back to a spherical distribution.
• Consider the following regression:
$$y_n^* = x_n^{*\prime}\beta + \varepsilon_n^*$$
• Since we are back to a spherical distribution, we now satisfy the assumptions of the Gauss-Markov theorem.
• Suppose that $\Omega_0$ was known (not required to be diagonal for this next theorem!)
Theorem Let $X$ be full column rank, $y$ be a random vector such that $E[y|X] = X\beta_0$ and $Var[y|X] = \Omega_0$, a pd matrix. The GLS estimator $\hat\beta_{GLS} = \left(X'\Omega_0^{-1}X\right)^{-1}X'\Omega_0^{-1}y$ is efficient among linear unbiased estimators.
• The idea behind the proof is to overweight the observations with low variance, like our simple example.
• This will return us to a spherical distribution and enhance efficiency.
• Let $\Omega_0 = C_0 C_0'$ be the Cholesky factorization of $\Omega_0$.
• Next note that:
$$E\left[C_0^{-1}y|X\right] = C_0^{-1}X\beta_0$$
$$Var\left[C_0^{-1}y|X\right] = C_0^{-1}\Omega_0\left(C_0^{-1}\right)' = I_N$$
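• Next we apply the Gauss-Markov Theorem to the transformed regression to estimate $\beta_0$:
$$\hat\beta = \left[\left(C_0^{-1}X\right)'\left(C_0^{-1}X\right)\right]^{-1}\left(C_0^{-1}X\right)'C_0^{-1}y = \left(X'\Omega_0^{-1}X\right)^{-1}X'\Omega_0^{-1}y = \hat\beta_{GLS}$$
• This proves the theorem.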
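The algebra in the proof can be checked numerically. The sketch below, with a hypothetical positive definite $\Omega_0$ and simulated data (both assumptions), computes the GLS estimator twice: directly from the formula in the theorem, and by running ols on the data transformed by $C_0^{-1}$. `np.linalg.cholesky` returns the lower triangular factor $C$ with $\Omega = CC'$.

```python
import numpy as np

def gls_direct(y, X, omega):
    """(X' Omega^{-1} X)^{-1} X' Omega^{-1} y, straight from the theorem."""
    omega_inv = np.linalg.inv(omega)
    return np.linalg.solve(X.T @ omega_inv @ X, X.T @ omega_inv @ y)

def gls_via_cholesky(y, X, omega):
    """ols on the transformed data C^{-1} y and C^{-1} X, where omega = C C'."""
    C = np.linalg.cholesky(omega)
    y_star = np.linalg.solve(C, y)  # C^{-1} y
    X_star = np.linalg.solve(C, X)  # C^{-1} X
    beta, *_ = np.linalg.lstsq(X_star, y_star, rcond=None)
    return beta

# Hypothetical pd variance matrix and simulated data (assumptions).
rng = np.random.default_rng(0)
n, k = 30, 3
X = rng.standard_normal((n, k))
A = rng.standard_normal((n, n))
omega = A @ A.T + n * np.eye(n)
C = np.linalg.cholesky(omega)
y = X @ np.array([1.0, -2.0, 0.5]) + C @ rng.standard_normal(n)

b_direct = gls_direct(y, X, omega)
b_chol = gls_via_cholesky(y, X, omega)
print(np.allclose(b_direct, b_chol))
```

The transformed route avoids forming $\Omega_0^{-1}$ explicitly, which is usually the better numerical choice when $N$ is large.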
• In general the GLS estimator is not feasible since $\Omega_0$ is not known.
• However, we can do a Feasible GLS by replacing $\Omega_0$ with an estimate in some circumstances.
• In the case of models with "linear heteroskedasticity" (i.e. the variance is a linear function of the z's), we perform a two step estimator:
1. Fit a linear regression of $w_n(\hat\beta_{OLS})$ on $z_n$ and denote the fitted coefficients as $\hat\gamma$.
2. Plug in $\hat\gamma$ for $\gamma_0$ and compute the FGLS estimator
$$\hat\beta_{FGLS} = \left(X'\hat\Omega^{-1}X\right)^{-1}X'\hat\Omega^{-1}y$$
• In practice, most papers do not use the feasible GLS estimator discussed in the text (although it is sometimes seen if heteroskedasticity is particularly bad and efficiency is a concern).
• Most commonly, papers use ols with "robust" standard errors to allow for heteroskedasticity of an unknown form.
• Making parametric assumptions about the heteroskedasticity, as the estimators in the text do, is not particularly attractive.
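The two step estimator described above can be sketched as follows, assuming the variance is linear in $z_n$. The clipping of fitted variances at a small positive floor is an assumption added here so that the weights stay well defined; the function name and the simulated data are also hypothetical.

```python
import numpy as np

def fgls_linear_variance(y, X, Z, floor=1e-6):
    """Two step FGLS when Var(y_n | x_n, z_n) is modeled as z_n' gamma."""
    # Step 1: ols, then regress the squared residuals on Z
    b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    w = (y - X @ b_ols) ** 2
    gamma, *_ = np.linalg.lstsq(Z, w, rcond=None)
    # Assumption: clip fitted variances, since a linear fit can go negative
    sigma2_hat = np.maximum(Z @ gamma, floor)
    # Step 2: weighted least squares with the estimated diagonal Omega
    scale = 1.0 / np.sqrt(sigma2_hat)
    b_fgls, *_ = np.linalg.lstsq(X * scale[:, None], y * scale, rcond=None)
    return b_fgls, gamma

# Simulated data (an assumption): true variance is 0.5 + 2 z.
rng = np.random.default_rng(0)
n = 2000
z = np.linspace(0.0, 2.0, n)
X = np.column_stack([np.ones(n), z])
Z = np.column_stack([np.ones(n), z])
y = 1.0 + 2.0 * z + np.sqrt(0.5 + 2.0 * z) * rng.standard_normal(n)

b_fgls, gamma_hat = fgls_linear_variance(y, X, Z)
print(b_fgls, gamma_hat)
```

Dividing each row of `X` and `y` by the estimated standard deviation and running ols is the same as plugging a diagonal $\hat\Omega$ into the FGLS formula.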
5 Asymptotic Theory for Heteroskedasticity.
• Next, we would like to work out the distribution theory for ols under heteroskedasticity.
• We begin by stating a LLN and a CLT for inid (independently, not identically distributed) sequences of random variables.
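Chebychev's LLN Let $U_n$ be a sequence of independent rv's such that $E[U_n] = \mu_n$ and $Var[U_n] = \sigma_n^2$ exist for all $n$. Denote
$$E_N[\mu] = \frac{1}{N}\sum_n \mu_n, \qquad E_N[\sigma^2] = \frac{1}{N}\sum_n \sigma_n^2$$
If $\lim_{N\to\infty}\frac{1}{N}E_N[\sigma^2] = 0$ then $E_N[U] - E_N[\mu] \to_p 0$.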
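Liapounov CLT Let $U_n$ be a sequence of inid rv's such that $E[U_n] = \mu_n$, $Var[U_n] = \sigma_n^2 > \varepsilon > 0$ and $E[|U_n - \mu_n|^3] = \gamma_n$ exist for all $n$. If
$$\lim_{N\to\infty}\frac{\left(\sum_n \gamma_n\right)^{1/3}}{\left(\sum_n \sigma_n^2\right)^{1/2}} = 0$$
then
$$\frac{N^{1/2}E_N[U - \mu]}{E_N[\sigma^2]^{1/2}} \to_d N(0, 1)$$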
• The assumption on the CLT essentially rules out distributions that have tails that are "too fat".
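As a worked check of the CLT condition, take $\gamma_n$ to be the third absolute central moment and suppose (an assumption made only for this illustration) that it is uniformly bounded, $\gamma_n \le \bar\gamma < \infty$, while $\sigma_n^2 > \varepsilon > 0$ as in the statement of the CLT. Then

$$\frac{\left(\sum_n \gamma_n\right)^{1/3}}{\left(\sum_n \sigma_n^2\right)^{1/2}} \le \frac{(N\bar\gamma)^{1/3}}{(N\varepsilon)^{1/2}} = \frac{\bar\gamma^{1/3}}{\varepsilon^{1/2}}\,N^{-1/6} \to 0$$

so the condition holds. The condition can fail when the third moments grow quickly with $n$, which is the sense in which the tails must not be "too fat".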
• Next, we sketch the proof for the asymptotic distribution theory for ols under heteroskedasticity.
• The first thing we wish to do is demonstrate that $\hat\beta_{OLS}$ is consistent.
• Recall that:
$$\hat\beta_{OLS} - \beta_0 = \left(\frac{1}{N}X'X\right)^{-1}\frac{1}{N}X'(y - X\beta_0)$$
• Let $x_n(y_n - x_n'\beta_0)$ play the role of $U_n$ in the LLN.
• Assume that the $x_n$ are not random variables, but that $\lim_{N\to\infty}\frac{1}{N}\sum_n h(z_n'\gamma_0)x_n x_n' = D(\gamma)$ and $\left(\frac{1}{N}X'X\right)^{-1} \to D_1$, both finite and pd.
• We have satisfied the LLN so that $E_N\left[x_n(y_n - x_n'\beta_0)\right] = \frac{1}{N}X'(y - X\beta_0) \to_p 0$. Then by continuity of plims:
$$\left(\frac{1}{N}X'X\right)^{-1}\frac{1}{N}X'(y - X\beta_0) \to_p 0$$
• Assuming that the error term is normal, it is slightly tedious, but we can verify the sufficient condition for the CLT by applying the Cramér-Wold device.
• Hence, it follows analogously to previous proofs that:
$$\left(X'\Omega_0 X\right)^{-1/2}X'(y - X\beta_0) \to_d N(0, I)$$
• Suppose that we have an estimator of the variance matrix such that:
$$\frac{1}{N}X'\hat\Omega X \to_p D_2(\gamma_0)$$
where $D_2(\gamma_0)$ is nonsingular.
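• Then by continuity of limits it will follow that:
$$\left(X'\hat\Omega X\right)^{-1/2}X'X(\hat\beta_{OLS} - \beta_0) - \left(X'\Omega_0 X\right)^{-1/2}X'(y - X\beta_0) \to_p 0$$
• Thus we will treat $\hat\beta_{OLS}$ as approximately $N\left(\beta_0,\; (X'X)^{-1}\left(X'\hat\Omega X\right)(X'X)^{-1}\right)$.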