
Empirical Methods in Applied Microeconomics

Jörn-Steffen Pischke
LSE

October 2007

1 Nonstandard Standard Error Issues

The discussion so far has concentrated on identification of the effect of interest. Obviously, this should always be the main concern: there is little consolation in having an accurate standard error on a meaningless estimate! Hopefully, the previous chapters will help you to design research projects and empirical strategies which lead to valid estimates. But there are a few important inference issues which arise with the type of cross-sectional and panel data we typically use in applied econometrics. It is therefore time to try to tackle those. This chapter uses somewhat more matrix algebra than the previous ones but will hopefully be equally accessible.

1.1 The Bias of Robust Standard Errors

The natural way to compute asymptotic standard errors and t-statistics for regression is using the robust covariance matrix
$$\Big(\sum_i X_iX_i'/N\Big)^{-1}\Big(\sum_i X_iX_i'\hat\varepsilon_i^2/N\Big)\Big(\sum_i X_iX_i'/N\Big)^{-1}.$$
Of course, asymptotic covariance matrices, as the name suggests, are only valid in large samples. We have seen already that large samples are always a relative concept, and things may go awry in the samples we use in our actual research. The robust covariance matrix is no exception.

Suppose the actual covariance matrix of the population regression residuals is given by $E[\varepsilon\varepsilon'|X] = \Sigma = \mathrm{diag}(\sigma_i)$. For the moment the covariance matrix is diagonal, meaning that residuals are independent across observations. We will take up the case of dependent residuals in the following section. The covariance matrix of the OLS estimator is then
$$V = (X'X)^{-1}\,(X'\Sigma X)\,(X'X)^{-1}. \qquad (1)$$


With fixed Xs this is the actual covariance matrix applicable to our small sample estimator, not just the asymptotic covariance matrix. The problem is that it involves the unknown $\sigma_i$s, which we replace by the sample counterparts $\hat\varepsilon_i^2$ in our covariance estimator. Notice that $\hat\varepsilon = y - X\hat\beta = y - X(X'X)^{-1}X'y = (I - X(X'X)^{-1}X')(X\beta + \varepsilon) = M\varepsilon$, where $M = I - X(X'X)^{-1}X'$ is the residual maker matrix and $\varepsilon$ is the residual of the population regression. Denoting the i-th column of the matrix $M$ by $m_i$, we have $\hat\varepsilon_i = m_i'\varepsilon$. It follows that
$$E[\hat\varepsilon_i^2] = E[m_i'\varepsilon\varepsilon'm_i] = m_i'\Sigma m_i.$$

Notice that $m_i$ is the i-th column of the identity matrix (call it $e_i$) minus the i-th column of the projection matrix $H = X(X'X)^{-1}X'$ (which is also called the hat-matrix, since it makes predicted values). Denote the i-th column of the hat-matrix by $h_i$, so that $h_i' = x_i'(X'X)^{-1}X'$. Hence $m_i = e_i - h_i$, and therefore
$$E[\hat\varepsilon_i^2] = (e_i - h_i)'\Sigma(e_i - h_i) = \sigma_i - 2\sigma_i h_{ii} + h_i'\Sigma h_i \qquad (2)$$
where $h_{ii}$ is the i-th diagonal element of the hat-matrix. Because this matrix is symmetric and idempotent (meaning $HH' = H$), it follows that $h_{ii} = h_i'h_i$, so that we obtain (see Chesher and Jewitt, 1987)
$$E[\hat V - V] = (X'X)^{-1}\,X'\,\mathrm{diag}\big(h_i'(\Sigma - 2\sigma_i I)h_i\big)\,X\,(X'X)^{-1}. \qquad (3)$$

While $\hat V$ is biased, it is easy to see that it is a consistent estimator of $V$. Consider the case of fixed Xs again, and focus on the middle bit of the matrix $X'\hat\Sigma X$. Notice that $\hat\Sigma$ is not consistent for $\Sigma$, since there are more and more elements to estimate as the sample gets large. Nevertheless, $\hat\varepsilon_i$ is consistent for $\varepsilon_i$ since $\hat\varepsilon_i = y_i - x_i'\hat\beta$ and $\hat\beta$ is consistent for $\beta$ (another way to think of this: if we have the entire population instead of a sample, we get the population residual from the population regression). But
$$X'\hat\Sigma X = \frac{1}{N}\sum_i \hat\varepsilon_i^2\, x_i x_i'$$
and since $\mathrm{plim}\,\hat\varepsilon_i^2 = \sigma_i$ we get $\mathrm{plim}\,X'\hat\Sigma X = X'\Sigma X$.


So why is $\hat V$ biased? The reason is that $E[\hat\varepsilon_i^2]$ is a biased estimate, as we have seen in (2). Consider the case where the residual is actually homoskedastic, so that $\sigma_i = \sigma$. In this case (2) gives $E[\hat\varepsilon_i^2] = \sigma - 2\sigma h_{ii} + \sigma h_{ii} = \sigma(1 - h_{ii})$. The variance of the residual in small samples is too small, and this is related to the quantity $h_{ii}$. So we need to start by considering some properties of the diagonal elements of the hat-matrix, $h_{ii}$. They are called leverage because they measure how much pull a particular value of $x_i$ exerts on the regression line. Note that
$$\hat y_i = h_i'y = h_{ii}y_i + \sum_{j\neq i} h_{ij}y_j$$
so if $h_{ii}$ is particularly large, the i-th observation will have a large influence (or "leverage") on the predicted value. In a bivariate regression
$$h_{ii} = \frac{1}{N} + \frac{(x_i - \bar x)^2}{\sum_j (x_j - \bar x)^2}$$
so the leverage is related to how far the x-value of a data point is from the center of the data, compared to the general dispersion in the sample. High leverage points are outliers in the x-dimension. Figure 1 illustrates how high leverage points lead to small residuals: such points can pull the regression line a lot without changing the residuals on the other (lower leverage) data points much.

[Figure 1: High leverage points lead to small residuals.]

How much leverage is a lot? Notice that $\sum_i h_{ii} = \mathrm{trace}(H) = \mathrm{rank}(H) = k$, the number of regressors, since $H$ is an idempotent matrix. Hence
$$\frac{1}{N}\sum_i h_{ii} = \frac{k}{N}.$$
Moreover, $h_{ii} < 1$, and as $h_{ii} \to 1$ the variance of the i-th residual would shrink to zero, i.e. the regression line would pass exactly through that point.
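To make these leverage properties concrete, here is a minimal numpy sketch (the data and variable names are made up for illustration) that builds the hat-matrix and residual maker from above, and verifies that the diagonal of $H$ sums to $k$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Bivariate regression design with one x-outlier at x = 10.
N = 30
x = np.concatenate([rng.normal(0, 1, N - 1), [10.0]])
X = np.column_stack([np.ones(N), x])          # constant plus regressor, k = 2

# Hat-matrix H = X(X'X)^{-1}X' and residual maker M = I - H.
H = X @ np.linalg.solve(X.T @ X, X.T)
M = np.eye(N) - H
h = np.diag(H)                                # leverage h_ii

print(h.sum())        # = k = 2, since trace(H) = rank(H) = k
print(h[-1])          # the x-outlier's leverage is close to 1
print(h[:-1].mean())  # the remaining points share what is left

# With homoskedastic errors, E[eps_hat_i^2] = sigma^2 (1 - h_ii):
# the estimated residual is smallest at high leverage points.
eps = rng.normal(0, 1, N)
eps_hat = M @ eps                             # eps_hat = M eps
```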

Armed with what we now know about $h_{ii}$, we can return to the bias formula (3) for the robust covariance estimator. This formula highlights two things. First, the bias depends on the form of $\Sigma$, i.e. the actual variances of the population residuals, which is in general unknown. If we knew $\Sigma$, we could compute the correct standard errors using (1) directly, and there would be no need to resort to the robust covariance matrix. If we do not know $\Sigma$, there is no way of knowing the exact extent of the bias of the robust covariance matrix. The second ingredient in the bias is the vector $h_i$ from the projection or hat-matrix. So the second thing we learn from (3) is that the bias will be worse if there are large x-outliers in our data, and in particular when the leverage of an observation is related to the variance of the residual.

What can be done to improve the performance of the robust covariance matrix? There are a number of suggestions in the literature. Denoting the robust covariance matrix estimator by
$$\Big(\sum_i X_iX_i'/N\Big)^{-1}\Big(\sum_i X_iX_i'\hat\sigma_i/N\Big)\Big(\sum_i X_iX_i'/N\Big)^{-1},$$
the alternative forms use alternative values for $\hat\sigma_i$:

$$\mathrm{HC0}: \quad \hat\sigma_i = \hat\varepsilon_i^2$$
$$\mathrm{HC1}: \quad \hat\sigma_i = \frac{N}{N-k}\,\hat\varepsilon_i^2$$
$$\mathrm{HC2}: \quad \hat\sigma_i = \frac{1}{1-h_{ii}}\,\hat\varepsilon_i^2$$
$$\mathrm{HC3}: \quad \hat\sigma_i = \frac{1}{(1-h_{ii})^2}\,\hat\varepsilon_i^2.$$

HC0 yields the covariance estimator suggested by White (1980). HC1 is a

4

Page 5: Empirical Methods in Applied Microeconomicsecon.lse.ac.uk/staff/spischke/ec524/evaluation6_07.pdf · De–ne S2 j = P d i=j y i y j 2 for j = 0;1. The diagonal elements of the hat-matrix

simple degrees of freedom correction, which helps in small samples. HC2uses the leverage to give an unbiased estimate of the variance estimate ofthe i-th residual. HC3 is an approximation to a jacknife estimator sug-gested by MacKinnon and White (1985). In many cases the size of thecalculated standard errors from HCj will be larger the larger j, but there isno guarantee that the ordering is of that particular form with actual data.These alternative estimators are often implemented in modern regressionpackages.1 Even when they are not, they are easy to compute using a tricksuggested by Messer and White (1984). This amounts to dividing yi and Xi

byqb�i and then running an IV regression with these transformed variables,

instrumenting Xi=qb�i by Xiqb�i for the appropriate choice of b�i.2
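As a rough illustration, the four variants are easy to compute directly from the definitions above; the sketch below does so with numpy and made-up data. The same numbers are also available in standard packages, e.g. in Python's statsmodels via OLS(y, X).fit(cov_type="HC2") and its HC0-HC3 siblings.

```python
import numpy as np

rng = np.random.default_rng(1)
N, k = 30, 2
x = rng.normal(size=N)
X = np.column_stack([np.ones(N), x])
y = 1.0 + 0.5 * x + rng.normal(size=N) * (1 + 0.5 * np.abs(x))  # heteroskedastic

XtX_inv = np.linalg.inv(X.T @ X)
H = X @ XtX_inv @ X.T
h = np.diag(H)                                 # leverage
e = y - X @ (XtX_inv @ X.T @ y)                # OLS residuals

def sandwich(psi):
    """(X'X)^{-1} X' diag(psi) X (X'X)^{-1}."""
    middle = (X * psi[:, None]).T @ X
    return XtX_inv @ middle @ XtX_inv

variants = {
    "HC0": e**2,
    "HC1": N / (N - k) * e**2,
    "HC2": e**2 / (1 - h),
    "HC3": e**2 / (1 - h) ** 2,
}
for name, psi in variants.items():
    print(name, np.sqrt(np.diag(sandwich(psi))))   # standard errors
```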

In order to gain some insight into these various versions of the robust covariance estimator, consider a very simple regression design:
$$y_i = \alpha + \beta d_i + \varepsilon_i \qquad (4)$$
where $d_i$ is a dummy variable. $\beta$ in this regression estimates the difference in the means in the two subsamples defined by the dummy variable. Denoting these subsamples by the subscripts 0 and 1, we have
$$\hat\beta = \bar y_1 - \bar y_0.$$
Furthermore, let $p = E(d_i)$. We will treat the dummy as a fixed covariate, so that $p = N_1/N$ and $1 - p = N_0/N$. We discuss this example because it is an important one in statistics, and we know a lot about the small sample properties of the difference in means. When $y_i$ is distributed normally with equal but unknown variance in the two subsamples, the t-statistic for the difference in means has a t-distribution: this is the classic two-sample t-test. However, we are concerned with the possibility that there is heteroskedasticity, meaning that the variances in the two subsamples are different. If nothing is known about these two variances, the testing problem in small samples becomes intractable: the exact small sample distribution for this problem is not known. This is known as the Behrens-Fisher problem (see e.g. DeGroot, 1986, p. 510-511). The different robust covariance estimators HC0-HC3 are different responses to figuring out the standard error for this testing problem.

¹ For example, the Stata package computes HC1, HC2, and HC3.

² Another suggestion to improve the small sample performance of the covariance estimator is bootstrapping. Horowitz (1997) advocates a form of the bootstrap called the wild bootstrap in this context.


Define $S_j^2 = \sum_{d_i = j}(y_i - \bar y_j)^2$ for $j = 0, 1$. The diagonal elements of the hat-matrix in this particular case are
$$h_{ii} = \begin{cases} 1/N_0 & \text{if } d_i = 0 \\ 1/N_1 & \text{if } d_i = 1 \end{cases}$$

and it is straightforward to show that the five covariance estimators are
$$\mathrm{OLS}: \quad \frac{N}{N_0 N_1}\left(\frac{S_0^2 + S_1^2}{N - 2}\right) = \frac{1}{Np(1-p)}\left(\frac{S_0^2 + S_1^2}{N - 2}\right)$$
$$\mathrm{HC0}: \quad \frac{S_0^2}{N_0^2} + \frac{S_1^2}{N_1^2}$$
$$\mathrm{HC1}: \quad \frac{N}{N - 2}\left(\frac{S_0^2}{N_0^2} + \frac{S_1^2}{N_1^2}\right)$$
$$\mathrm{HC2}: \quad \frac{S_0^2}{N_0(N_0 - 1)} + \frac{S_1^2}{N_1(N_1 - 1)}$$
$$\mathrm{HC3}: \quad \frac{S_0^2}{(N_0 - 1)^2} + \frac{S_1^2}{(N_1 - 1)^2}.$$

The standard OLS estimator pools the observations from both subsamples to derive the variance estimate: this is the efficient thing to do when the two variances are actually the same. The White (1980) estimator HC0 adds the estimates of the two sampling variances of the means, using the consistent (but biased) maximum likelihood estimate of the variance. The HC2 estimator is the unbiased estimator for the sampling variance in this case, since it makes the correct degrees of freedom correction. HC1 makes the degrees of freedom correction outside the sum, which will help but generally not be quite correct. Since we know HC2 to be the unbiased estimate of the sampling variance, we also see immediately that HC3 will be too big. Even though we know the exact unbiased estimator for the sampling variance in this case, we still don't know the small sample distribution of the test statistic
$$\frac{\bar y_1 - \bar y_0}{\sqrt{\dfrac{S_0^2}{N_0(N_0 - 1)} + \dfrac{S_1^2}{N_1(N_1 - 1)}}};$$
this is the Behrens-Fisher problem. Note that $p = 0.5$ implies that the regression design is perfectly balanced. In this case, the OLS estimator will be equal to HC1, and all five estimators will generally differ little.

To provide some further insights, we present some results from a small Monte Carlo experiment for the model (4). We choose $N = 30$, since this will highlight the small sample issues, and $p = 0.9$, which implies $h_{ii} = 10/N = 1/3$ for the three observations with $d_i = 0$, in order to have a relatively unbalanced design. We draw
$$\varepsilon_i \sim \begin{cases} N(0, \sigma^2) & \text{if } d_i = 0 \\ N(0, 1) & \text{if } d_i = 1 \end{cases}$$
and we show results for two cases. The first has relatively little heteroskedasticity, and we set $\sigma = 0.85$, while the second has lots of heteroskedasticity with $\sigma = 0.5$.
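A stripped-down version of this sampling experiment is sketched below (plain numpy; seeds and implementation details are not from the original, so the numbers will only approximate those in Table 8.1.1). It uses the HC2 formula for the dummy design derived above.

```python
import numpy as np

rng = np.random.default_rng(42)
N, p, sigma0, reps = 30, 0.9, 0.85, 25_000    # design from the text
d = np.zeros(N); d[: int(N * p)] = 1           # fixed, unbalanced dummy
N1, N0 = int(d.sum()), N - int(d.sum())

betas, se_hc2 = [], []
for _ in range(reps):
    eps = np.where(d == 1, rng.normal(0, 1, N), rng.normal(0, sigma0, N))
    y = 1.0 + 0.0 * d + eps                    # true beta = 0
    y1, y0 = y[d == 1], y[d == 0]
    betas.append(y1.mean() - y0.mean())        # beta_hat = ybar_1 - ybar_0
    S1 = ((y1 - y1.mean()) ** 2).sum()
    S0 = ((y0 - y0.mean()) ** 2).sum()
    se_hc2.append(np.sqrt(S0 / (N0 * (N0 - 1)) + S1 / (N1 * (N1 - 1))))

betas, se_hc2 = np.array(betas), np.array(se_hc2)
print("sd of beta_hat:", betas.std())          # sampling variability to be measured
print("mean HC2 se:   ", se_hc2.mean())        # unbiased for the variance, not the se
print("rejection rate (normal, 5%):", (np.abs(betas / se_hc2) > 1.96).mean())
```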

Table 8.1.1 displays the results. The columns "mean" and "standard deviation" display means and standard deviations of the various estimators across 25,000 replications of the sampling experiment. The standard deviation of $\hat\beta$ is the sampling variability we are trying to measure. Even with little heteroskedasticity, the OLS standard errors are too small by about 15%. However, HC0 and HC1 are even smaller because of the small sample bias. HC2 is slightly bigger than the OLS standard errors on average. Notice that this estimator of the sampling variance is unbiased, while the mean of the HC2 standard errors across sampling experiments (0.54) is still below the standard deviation of $\hat\beta$ (0.60). This comes from the fact that the standard error is the square root of the sampling variance, the sampling variance is itself estimated and hence has sampling variability, and the square root is a concave function. The HC3 standard error is slightly too big, as we expected.

The last two columns in the table show empirical rejection rates for the hypothesis $\hat\beta = \beta$, using a nominal size of 5% for the test. Since we don't know the exact small sample distribution, we compare the test statistics to the normal distribution (which is the asymptotic distribution) and to a t-distribution (which is not the correct small sample distribution in this case for any of the estimators, as we have seen). Rejection rates are far too high for all tests. Interestingly, with little heteroskedasticity, OLS standard errors have lower rejection rates than the robust standard errors, even though the standard errors themselves are smaller than HC2 and HC3 on average. But the standard errors themselves are estimated and have sampling variability. The OLS standard errors are much more precisely estimated than the robust standard errors, as can be seen from column (2).³ This means the robust standard errors will sometimes be too small "by accident," and this happens often enough in this case to make the OLS standard errors preferred. The lesson we can take away from this is that robust standard errors are no panacea. They can be smaller than OLS standard errors for two reasons: the small sample bias we have discussed, and the higher sampling variance of these standard errors. Hence, if we observe robust standard errors being smaller than OLS standard errors, we should take this as a warning flag. If heteroskedasticity were present and our standard error estimate were "about right," this wouldn't happen.

³ The large sampling variance of the robust estimators has also been noted by Chesher and Austin (1991). Kauermann and Carroll (2001) propose an adjustment to the confidence interval to correct for this.

With lots of heteroskedasticity, as in the lower panel of the table, things are different: the OLS standard errors are now dominated by the robust standard errors throughout, although the empirical rejection rates are still way too high to give us much confidence in our confidence intervals for any of the estimators. Using the t-distribution rather than the normal helps only marginally. There doesn't seem to be any clear way out of this conundrum: standard error estimates might be biased in finite samples, OLS standard errors because of heteroskedasticity, and robust standard errors because of the influence of high leverage points. Hence, if the regression design is possibly unbalanced, the only prescription for the applied researcher is to check the data for high leverage points. If the regression design is relatively balanced, then robust standard errors should produce fairly accurate confidence intervals even in small samples.

One hopeful observation on robust standard errors is that we have rarely seen them differ from OLS standard errors in empirical practice by more than something like 25%. In any applied project there are always myriad specification choices to be made, from the selection of the sample, to the exact treatment of the variables, the regression design, etc. These certainly produce "non-sampling variation" in our estimates of a similar magnitude (in the sense that our estimates would differ if we repeated the project with slightly different choices). Hence, although we strive to get our standard errors as right as possible, if they end up being biased by something on the order of 25% this would probably not keep us up at night. But this is only true in the case of independent observations. Things can be much worse when observations are dependent.

1.2 Clustering and Serial Correlation in Panels

1.2.1 Clustering and the Moulton Factor

The more serious problems have to do with correlation of the residuals across the units of observation. Start by considering the simple model
$$y_{ig} = \alpha + \beta x_g + \varepsilon_{ig} \qquad (5)$$


where the outcome is observed at the individual level but the regressor of interest, $x_g$, varies only at a higher level of aggregation, a group $g$, and there are $G$ groups. For example, $y_{ig}$ could be the test score of a student, and $x_g$ is class size, where $i$ denotes the student and $g$ denotes the class room. If $x_g$ is randomly assigned, as in the Tennessee STAR experiment (Krueger, 1999), then the OLS estimator is unbiased and consistent for the population regression coefficient. Recall that the 12,000 students in Kindergarten to grade 3 were randomly assigned to small or regular classes in the STAR experiment. What we are worried about in an analysis of the STAR data is that the error term has a group structure:
$$\varepsilon_{ig} = v_g + \eta_{ig}. \qquad (6)$$
The class room level component could result from the fact that a class may have had a particularly good teacher, or a class took the test when there were a lot of external disruptions, so that all students performed more poorly than alternative classes.

This problem of correlation in the errors is, of course, well known in econometrics. Kloek (1981) and in particular Moulton (1986), however, pointed out how important it can be for applied research in the grouped regressor case. Following their derivations, it is straightforward to analyze this case. The algebra needs some extra notation, and is therefore exiled to an appendix. Let
$$\rho = \frac{\sigma_v^2}{\sigma_v^2 + \sigma_\eta^2}.$$

Given the structure (6), $\rho$ is called the intra-class correlation (even in cases where the groups are not class-rooms!). When the groups are of equal size $n$, we have
$$\frac{\mathrm{var}(\hat\beta)}{\mathrm{var}_c(\hat\beta)} = 1 + (n - 1)\rho \qquad (7)$$
where $\mathrm{var}(\hat\beta)$ is the true variance of the OLS estimator and $\mathrm{var}_c(\hat\beta)$ is the conventional OLS variance. Notice that the OLS standard error formula will be worse if $n$ is large and if $\rho$ is large. To see the intuition, consider the case where $\rho \to 1$. In this case, all the errors within a group are the same. This is just like taking a data set and making $n$ identical copies. The covariance matrix of the replicated data set is going to be $1/n$ times the original covariance matrix, although no information has been added.

In order to see how this problem is related to the group structure in the regressor $x$, consider the generalization of (7) where the regressor is $x_{ig}$, which varies at the individual level but is correlated within groups, and the group sizes $n_g$ vary by group. In this case
$$\frac{\mathrm{var}(\hat\beta)}{\mathrm{var}_c(\hat\beta)} = 1 + \left[\frac{\mathrm{var}(n_g)}{\bar n} + \bar n - 1\right]\rho_x\,\rho \qquad (8)$$
$$\rho_x = \frac{\sum_g\sum_{i\neq k}(x_{ig} - \bar x)(x_{kg} - \bar x)}{\mathrm{var}(x_{ig})\sum_g n_g(n_g - 1)}.$$
$\rho_x$ is the intra-class correlation of $x_{ig}$; it is actually unrestricted and does not impose a structure like (6). What the formula says is that the bias in the OLS formula is much worse when $\rho_x$ is large, but vanishes when $\rho_x = 0$: if the $x_{ig}$s are uncorrelated within groups, the error structure does not matter for the estimation of the standard errors.

In order to see that this problem can be quite important, return to the example of estimating the effect of class size on student achievement with the Tennessee STAR data. For illustration, we will just run (5) by OLS, although a fair bit of the variation in class size comes from non-random factors. A simple regression of the percentile score for Kindergarteners on their class size yields an estimate of -0.618 with a robust standard error of 0.090.⁴ Now consider the formula (8). Even though $\rho_x = 1$, classes are of unequal size. Plugging all the relevant values into the formula we get
$$\frac{\mathrm{var}(\hat\beta)}{\mathrm{var}_c(\hat\beta)} = 1 + \left[\frac{17.13}{19.42} + 19.42 - 1\right]\times 0.311 = 7.01.$$
This implies that our conventional standard error estimate is too small by a factor of $2.65 = \sqrt{7.01}$. The corrected standard error is 0.238.
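The adjustment in (8) is a one-line calculation; the sketch below reproduces the numbers quoted above, treating 19.42 as $\bar n$, 17.13 as $\mathrm{var}(n_g)$, and 0.311 as the intra-class correlation (these values are taken from the text, not recomputed from the STAR data):

```python
import math

def moulton_factor(var_n, n_bar, rho_x, rho):
    """Ratio var(beta_hat)/var_c(beta_hat) from equation (8)."""
    return 1 + (var_n / n_bar + n_bar - 1) * rho_x * rho

f = moulton_factor(var_n=17.13, n_bar=19.42, rho_x=1.0, rho=0.311)
print(f)                       # about 7.01
print(0.090 * math.sqrt(f))    # corrected standard error, about 0.238
```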

The same problem arises in IV estimation. Consider the regression equation
$$y_{ig} = \alpha + \beta x_{ig} + \varepsilon_{ig}$$
where the regressor can now vary at the individual level. Let $Z$ be a matrix of instruments which only vary at the group level. It is easy to show that the Moulton formula for the IV case is the same as (8) for the grouping of the instrument (Shore-Sheppard, 1996). Hence it is equally important to address this problem at the level of an instrumental variable as it is for a regressor in the OLS case. Another setting where this problem might pop up is the regression discontinuity design, if the confounder, $x_i$, is measured at a group level and not the individual level (see Card and Lee, 2007).

⁴ The IV coefficient estimate, where class size is instrumented with two dummies for the assignment to regular and regular-with-aide groups, is almost identical.


There are various solutions to this problem:

1. Parametric correction: Obtain an estimate of $\rho$ and calculate the standard errors using the correct formula given by (8). The intra-class correlations $\rho$ and $\rho_x$ can typically be estimated easily in statistical software.⁵

2. Clustered standard errors: A non-parametric correction for the standard errors is given by the following extension of the robust covariance matrix (Liang and Zeger, 1986); a minimal implementation is sketched in the code after this list:
$$\mathrm{var}(\hat\beta) = (X'X)^{-1}\left(\sum_g X_g'\hat\Sigma_g X_g\right)(X'X)^{-1} \qquad (9)$$
$$\hat\Sigma_g = q\,\hat\varepsilon_g\hat\varepsilon_g' = q\begin{bmatrix}
\hat\varepsilon_{1g}^2 & \hat\varepsilon_{1g}\hat\varepsilon_{2g} & \cdots & \hat\varepsilon_{1g}\hat\varepsilon_{n_g g} \\
\hat\varepsilon_{1g}\hat\varepsilon_{2g} & \hat\varepsilon_{2g}^2 & & \vdots \\
\vdots & & \ddots & \hat\varepsilon_{(n_g-1)g}\hat\varepsilon_{n_g g} \\
\hat\varepsilon_{1g}\hat\varepsilon_{n_g g} & \cdots & \hat\varepsilon_{(n_g-1)g}\hat\varepsilon_{n_g g} & \hat\varepsilon_{n_g g}^2
\end{bmatrix}$$
where $X_g$ is the matrix of regressors for group $g$, and $q$ is a degrees of freedom adjustment factor like $G/(G-1)$, similar to the one in HC1 for the simple heteroskedasticity robust covariance matrix above. This calculation of the covariance matrix allows for arbitrary correlation of the errors within the clusters $g$, not just the structure in (6). Clustered standard errors will be consistent as the number of groups gets large.

3. Aggregation to the group level: Calculate $\bar y_g$ first and then run a weighted least squares regression
$$\bar y_g = \alpha + \beta x_g + \bar\varepsilon_g$$
with the number of observations in the group as weights (or the inverse of the sampling variance of $\bar y_g$). For the correct choice of weights this is equivalent to doing OLS on the micro data. The error term at this aggregated level is $\bar\varepsilon_g = v_g + \bar\eta_g$, and the error component $v_g$ is therefore reflected in the usual second step standard errors, so that inference can be based directly on the second step covariance matrix.⁶ If there are other micro level regressors in the model, as in
$$y_{ig} = \alpha_g + \beta x_g + W_{ig}\delta + \varepsilon_{ig},$$
we can do the aggregation by running the regression
$$y_{ig} = \alpha_g' + W_{ig}\delta + \varepsilon_{ig}'$$
which includes a full set of group dummies. The $\hat\alpha_g'$ coefficients on the group dummies are our group means, purged of the effect of the individual level variables $W_{ig}$. Obviously, aggregation does not work when $x_{ig}$ varies within groups. Averaging the $x_{ig}$s to group means is IV, and hence involves changing the estimator.

⁵ For example, using the loneway command in Stata.

⁶ See Wooldridge (2003) and Donald and Lang (2007). While the aggregate regression is simply the between regression in the context of a random effects model, long known to econometricians, the first discussion of the analogy of the micro and group level regressions and the relationship to inference is probably in Kloek (1981).

4. Block bootstrap: Bootstrapping means drawing random samples from the empirical distribution of the data. Since the best representation of the empirical distribution of the data is the data itself, in practice this means, for a sample of size $N$, drawing another sample of size $N$ with replacement from the original data set. This can be done many times, and an estimate is computed for each of the bootstrap samples. The standard error of the estimate is the standard deviation of the estimates across all the bootstrap samples. In block bootstrapping, the bootstrap draws are whole blocks of data as defined by the groups $g$ (see the sketch after this list). Hence, any correlation of the errors within a block is kept intact by the block bootstrap sampling, and should therefore be reflected in the standard error estimate. There are many different ways to do bootstrap inference; for more on this see Cameron, Gelbach, and Miller (2006).

5. Estimate a random effects GLS or ML model of equation (5). This relies on the linearity of the CEF, and we prefer the simple OLS approximation to the conditional expectation function, so we do not recommend this approach.
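As referenced in items 2 and 4, here is a minimal sketch of the clustered covariance matrix (9) and of the block bootstrap for a grouped design. The simulated data and variable names are made up for illustration; in statsmodels, clustered standard errors are also available directly via OLS(y, X).fit(cov_type="cluster", cov_kwds={"groups": g}).

```python
import numpy as np

rng = np.random.default_rng(7)
G, n = 50, 20                                  # 50 groups of 20
g = np.repeat(np.arange(G), n)
x_g = rng.normal(size=G)                       # regressor varies at group level only
v_g = rng.normal(size=G)                       # group error component
y = 1.0 + 0.5 * x_g[g] + v_g[g] + rng.normal(size=G * n)
X = np.column_stack([np.ones(G * n), x_g[g]])

XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
e = y - X @ beta

# Clustered (Liang-Zeger) covariance, equation (9), with q = G/(G-1).
meat = np.zeros((2, 2))
for j in range(G):
    s = X[g == j].T @ e[g == j]                # X_g' eps_g
    meat += np.outer(s, s)                     # X_g' eps_g eps_g' X_g
V_cluster = G / (G - 1) * XtX_inv @ meat @ XtX_inv
print("clustered se:", np.sqrt(V_cluster[1, 1]))

# Block bootstrap: resample whole groups with replacement.
boot = []
for _ in range(500):
    draw = rng.integers(0, G, G)               # indices of resampled groups
    idx = np.concatenate([np.where(g == j)[0] for j in draw])
    boot.append(np.linalg.lstsq(X[idx], y[idx], rcond=None)[0][1])
print("block bootstrap se:", np.std(boot))
```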

Table 8.2.1 returns to the class size example from the STAR experiment, which we have discussed in this section. The table presents six different estimates of the standard errors: conventional robust standard errors (using HC1); two versions of the parametrically corrected standard errors using the Moulton formula (8), the first using the formula for the intra-class correlation given by Moulton and the second using an alternative ANOVA estimator of the intra-class correlation;⁷ clustered standard errors; block bootstrapped standard errors; and estimates aggregated to the group level. Columns (1) and (2) present the results on the class size regressor, while columns (3) and (4) show the estimates on an individual level covariate we have included in the regression: sex. Class rooms are almost balanced by sex, and hence there is (almost) no intra-class correlation in this regressor. As a result, the standard error estimates for this regressor are not affected by any of the corrections. As we have seen before, the adjustments to the standard error on the class size regressor are large, but the different adjustments all deliver standard errors that are almost identical. There are 318 class rooms in the data set, which is a large number, and all these methods should deliver similar results with a large number of clusters. Hence we tend to use clustered standard errors in practice, because they are conveniently available in many regression packages, and hence easy to compute.⁸ The aggregation approach also has much to commend itself, if only because it often allows one to plot the data easily in the second stage. With a small number of groups there is a new set of concerns to worry about, and we turn to this in section 1.2.3 below.

⁷ This is computed using the loneway command in Stata.

⁸ In fact, the name "clustered standard errors," which applied researchers have adopted, derives from the name of the Stata option.

1.2.2 Serial Correlation in Panels and Difference-in-Difference Models

Now suppose that there are only two groups, i.e. the regressor of interest is a dummy variable:
$$y_{ig} = \alpha + \beta d_g + v_g + \eta_{ig}. \qquad (10)$$

The Moulton problem does not arise in this case, because OLS fits the regression line perfectly through the two points defined by the dummy variable. To see this, notice that
$$E(y_{ig}|d_g) = \alpha + \beta d_g + E(v_g|d_g)$$
so that
$$\hat\beta = E(y_{ig}|d_g = 1) - E(y_{ig}|d_g = 0) = \beta + E(v_g|d_g = 1) - E(v_g|d_g = 0).$$

Since $E(v_g|d_g) = 0$, this means that the estimate of $\beta$ will be unbiased, but it will not be consistent, as pointed out by Donald and Lang (2007). In every new sample there will be a new draw of $v_g$. So the regression line will be somewhat off, and the estimate will not exactly equal the population $\beta$. However, on average there will be no bias: sometimes $\beta$ will be overestimated, sometimes underestimated. Now suppose we let $N \to \infty$, while $G$, the number of groups, remains constant at 2. The bias that exists in any particular sample will not go to zero, because $v_g$ is just as important in the big sample as in the small sample. Only the sampling variation due to $\eta_{ig}$ will vanish, not the sampling variation due to $v_g$. In a sense, the Moulton problem discussed above arises precisely from the fact that the regression line will not neatly fit through all the points defined by $v_g$ when there are three or more groups.

This problem also arises in the standard 2x2 difference-in-difference model. Recall from section ?? that we modeled the outcome as an additive function of a state effect, a time effect, and the treatment effect on the interaction: $E[y_i|s,t] = \gamma_s + \lambda_t + \beta d_{st}$. Now consider the case where there is a state-time specific component to the error term:
$$y_{ist} = \gamma_s + \lambda_t + \beta d_{st} + v_{st} + \eta_{ist}. \qquad (11)$$

Because the model is saturated, this is no different from the model (10) for the purpose of inference. As before, the error component $v_{st}$ does not vanish even when $N \to \infty$, i.e. when the group sizes are large. Moreover, there is really no way to get consistent standard errors which acknowledge this problem, because $\beta d_{st}$ and $v_{st}$ are completely collinear. So no separate estimate of $\beta$ and $v_{st}$ is possible. This means that 2x2 difference-in-differences are not really very informative if $v_{st}$ shocks are important.

An example of this problem is the basic analysis of the employment effects of the New Jersey minimum wage in the original Card and Krueger (1994) New Jersey-Pennsylvania comparison. Card and Krueger compared employment at New Jersey and Pennsylvania fast food restaurants before and after New Jersey introduced a state minimum wage. With two states and two periods this is the standard 2x2 DD design. The solution to this problem is to have either multiple time periods on two states, as in the Card and Krueger (2000) reanalysis of the New Jersey-Pennsylvania experiment with a longer time series of payroll data, or multiple contrasts for two time periods, as in Card (1992) using 51 states. It is straightforward to get correct standard errors if $v_{st}$ is iid by using one of the methods discussed in the previous section.

In many applications of the difference-in-difference model there will be both multiple treatment groups (s) and multiple time periods (t). Bertrand, Duflo, and Mullainathan (2004) and Kézdi (2004) point out a further problem in this case. Many economic variables of interest tend to be correlated over time. This means that $v_{st}$ is most likely serially correlated. Consider the Card and Krueger example again, and imagine using the data from Card and Krueger (2000), which span the period from October 1991 to September 1997. This yields 72 monthly observations for each state. But these 72 observations are not independent: employment variations tend to be highly correlated over time. For example, we saw in the DD notes before that employment in Pennsylvania was consistently lower than in New Jersey for most of 1994 and 1995. Hence, the solutions which treat $v_{st}$ as iid are not sufficient. Bertrand et al. (2004) investigate a variety of remedies, like clustering at the state level, block bootstrap methods at the state level, ignoring the time series information by aggregating the data into two periods, or parametric modeling of the serial correlation.

An interesting and important result is that clustering standard errors at the state level solves the serial correlation problem. In the previous section we would have treated the state-month cell as the cluster, because the variation in the key regressor is at the state-month level. Instead, treat the entire state as the cluster. This might seem odd at first glance, since we have already controlled for state effects. The state dummy $\gamma_s$ in (11) already removes the time mean of $v_{st}$, which is $\bar v_s$. Nevertheless, this method solves the serial correlation problem, because $v_{st} - \bar v_s$ will still be correlated for adjacent periods. Clustering at the state-month level does not address this, because residuals across clusters are treated as independent. But clustering at the state level allows for this, since this covariance estimator allows a completely non-parametric residual correlation within clusters. Clustered standard errors serve a very different role here than in the standard Moulton case (5), but they work.
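A small simulation illustrates the point. The sketch below (made-up AR(1) state shocks, one observation per state-month cell, so that clustering by cell amounts to plain robust standard errors) compares the two clustering levels:

```python
import numpy as np

rng = np.random.default_rng(3)
S, T = 20, 72                                   # states and months
state = np.repeat(np.arange(S), T)
month = np.tile(np.arange(T), S)
d = ((state < S // 2) & (month >= T // 2)).astype(float)  # treated states, later months

# Serially correlated state-time component v_st, AR(1) with coefficient 0.8.
v = np.zeros((S, T))
for t in range(1, T):
    v[:, t] = 0.8 * v[:, t - 1] + rng.normal(size=S)
y = 0.0 * d + v[state, month] + rng.normal(size=S * T)    # true beta = 0

# Two-way fixed effects design: treatment dummy, state dummies, month dummies.
D_state = (state[:, None] == np.arange(S)).astype(float)
D_month = (month[:, None] == np.arange(1, T)).astype(float)  # drop one month
X = np.column_stack([d, D_state, D_month])

XtX_inv = np.linalg.inv(X.T @ X)
e = y - X @ (XtX_inv @ X.T @ y)

def cluster_se(groups):
    """Liang-Zeger standard error for the coefficient on d."""
    k = X.shape[1]
    meat = np.zeros((k, k))
    for gval in np.unique(groups):
        s = X[groups == gval].T @ e[groups == gval]
        meat += np.outer(s, s)
    return np.sqrt((XtX_inv @ meat @ XtX_inv)[0, 0])

print("cluster by state-month cell:", cluster_se(state * T + month))  # ignores serial corr.
print("cluster by state:           ", cluster_se(state))              # picks it up
```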

The conclusion is that correlated errors are likely to be a problem in many panel-type applications, and adjusting the standard errors for this correlation is important. Donald and Lang (2007), Bertrand et al. (2004), and Kézdi (2004) highlight the issue that we may want to treat standard errors as clustered at a high level of aggregation. As a result we may end up with relatively few clusters.

1.2.3 Few Clusters

The problem of few clusters is the analogue to the small sample problem of robust standard errors discussed in section 1.1. Here as there, small sample distributions for the different estimators are not available, but we know (from Monte Carlo evidence) that all the adjustments can be biased substantially when there are only few clusters. Donald and Lang (2007) and Cameron, Gelbach, and Miller (2006) discuss inference when the resulting number of groups is small; see also Hansen (2007b). This area is very much research in progress, and firm recommendations are therefore difficult.

The main approaches are

1. Bias corrections of clustered standard errors. Clustered standard errors are biased in small samples because $E[\hat\varepsilon_g\hat\varepsilon_g'] \neq E[\varepsilon_g\varepsilon_g'] = \Sigma_g$, just as in Section 1.1. One solution to the bias problem is to use an adjustment, just as in Section 1.1, to correct for the small sample bias. As before, the bias depends on the form of $\Sigma_g$. Bell and McCaffrey (2002) suggest adjusting the residuals by
$$\hat\Sigma_g = q\,\tilde\varepsilon_g\tilde\varepsilon_g' \qquad \tilde\varepsilon_g = A_g\hat\varepsilon_g$$
where $A_g$ solves
$$A_g'A_g = (I - H_g)^{-1}$$
and
$$H_g = X_g(X'X)^{-1}X_g'$$
is the hat-matrix at the group level. This is the analog to HC2 for the clustered case. However, the matrix $A_g$ is not unique; there are many such decompositions. Bell and McCaffrey (2002) suggest using the symmetric square root of $(I - H_g)^{-1}$, or
$$A_g = P\Lambda^{1/2}P'$$
where $P$ is the matrix of eigenvectors of $(I - H_g)^{-1}$, $\Lambda$ is the diagonal matrix of the corresponding eigenvalues, and $\Lambda^{1/2}$ is the diagonal matrix of the square roots of the eigenvalues. One problem with the Bell and McCaffrey adjustment is that $(I - H_g)$ may not be of full rank, and hence the inverse may not exist for all designs. This happens, for example, when one of the regressors is a dummy variable which takes on only the value zero or only the value one within a group. In addition, the dimension of $H_g$ is the number of observations per group. Since this matrix needs to be inverted, this only tends to work if the group sizes are reasonably small.


2. Various authors, including Bell and McCaffrey (2002) and Donald and Lang (2007), suggest basing inference on a t-distribution with $G - k$ degrees of freedom, where $k$ is the number of regressors, rather than on the standard normal distribution. We have seen in Section 1.1 that this is not generally the correct small sample distribution, even if the errors $v_g$ are normally distributed. Nevertheless, for small $G$ this makes a substantial difference. Cameron, Gelbach, and Miller (2006) find that this works well in conjunction with the Bell and McCaffrey (2002) bias correction described in 1 for the Moulton problem.

3. Donald and Lang (2007) suggest that aggregating to the group level works well even with a small number of groups, in conjunction with using a t-distribution with $G - k$ degrees of freedom. Straight aggregation does not work to solve the serial correlation problem in panels discussed in Bertrand et al. (2004) and Kézdi (2004).

4. Cameron, Gelbach, and Miller (2006) report that various forms of the bootstrap work well with small numbers of groups, and typically outperform clustered standard errors without the bias correction. They point out, however, that the bootstrap does not always lead to improved small sample statistics. In order to get such an improvement they suggest bootstrapping Wald statistics directly, rather than obtaining the test statistic based on bootstrapped standard errors. They also recommend a method called the wild bootstrap. Rather than resampling entire groups $(y_g, X_g)$ of data, this involves computing a new $y_g^*$ based on the residual $\hat\varepsilon_g = y_g - X_g\hat\beta$, where $y_g^* = X_g\hat\beta + \hat\varepsilon_g^*$. This implies that the $X_g$s are being kept fixed and only a new residual $\hat\varepsilon_g^*$ is chosen in each bootstrap replication. In the wild bootstrap, $\hat\varepsilon_g^* = \hat\varepsilon_g$ with probability 0.5 and $\hat\varepsilon_g^* = -\hat\varepsilon_g$ with probability 0.5 (a sketch follows after this list).

5. Hansen (2007a) proposes parametric methods to solve the serial correlation problem discussed above, i.e. model the error process as an AR process, estimate the AR parameters, and fix the covariance matrix. Hansen points out that the AR parameters are biased in panels of short duration, and demonstrates the importance of using a bias adjusted estimator for these coefficients. His methods seem to yield much improved inference compared to Bertrand et al.'s (2004) investigation of parametric models without bias adjustment. Although Hansen (2007a) does not explicitly demonstrate the performance of his estimator with a small number of groups, one would generally expect more parametric methods to be less sensitive to sample size than non-parametric ones, like clustering, as long as the parametric assumptions are roughly right.
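As referenced in item 4 of the list above, here is a minimal sketch of the wild bootstrap with sign flips drawn at the cluster level (made-up data; a full application would bootstrap the Wald statistic directly, as Cameron, Gelbach, and Miller suggest):

```python
import numpy as np

rng = np.random.default_rng(11)
G, n = 10, 25                                 # deliberately few clusters
g = np.repeat(np.arange(G), n)
x = rng.normal(size=G)[g]                     # group-level regressor
y = 1.0 + 0.5 * x + rng.normal(size=G)[g] + rng.normal(size=G * n)
X = np.column_stack([np.ones(G * n), x])

beta = np.linalg.lstsq(X, y, rcond=None)[0]
fitted, resid = X @ beta, y - X @ beta

boot = []
for _ in range(999):
    # One +/-1 weight per cluster: flip the whole group's residual vector,
    # keeping the X_g fixed, as described in item 4.
    w = rng.choice([-1.0, 1.0], size=G)       # each sign with probability 0.5
    y_star = fitted + w[g] * resid
    boot.append(np.linalg.lstsq(X, y_star, rcond=None)[0][1])

print("wild bootstrap se:", np.std(boot))
```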

Various authors have demonstrated that ignoring the problem of a small number of clusters can lead to very misleading inference. Nevertheless, there seems to be no single fix which solves this problem satisfactorily. We have seen above, in the case of the simple robust covariance matrix in section 1.1, that fixing the bias problem tends to introduce variance into the covariance estimator. Hence, trying to fix the bias may sometimes lead to smaller standard errors than ignoring the problem. Whether the few clusters problem leads to a lot of bias also seems to depend on the situation. For example, Monte Carlo results in Hansen (2007b) suggest that the bias is a lot worse for the standard Moulton problem than for the serial correlation problem. This suggests that it may be feasible to stick with regular clustered standard errors to solve serial correlation in panels, even when the number of clusters is as small as 10. For solving the Moulton problem of section 1.2.1, it seems more important to worry about clustered standard errors with a small number of clusters. However, the Donald and Lang (2007) aggregation approach seems to work well in this case, as long as the regressor of interest is fixed within groups. This would also be our preferred strategy when both problems occur in combination, i.e. when the estimation is based on micro data but treatment only varies at the state (or some other aggregate) level over time. In this case, aggregate the observations first to the state-year level, and then cluster standard errors in the aggregate panel at the state level.

The upshot from all this is that it may be important to pay attention to small sample issues in applied microeconometric work. Working with large micro data sets, we used to sneer at the macroeconomists with their small time series samples. But he who laughs last laughs best: it turns out that it is the macroeconomists who had it right all along, and we microeconomists are now often confined to the same small sample sizes as they are. The key is to think about "where your variation lives." Unfortunately, all too often it lives at a fairly aggregate level. Which methods work best in particular applications when the original data come from large micro data samples is still an open question, and this remains an area of active research.

2 References

Bell, Robert M. and Daniel F. McCaffrey (2002) Bias Reduction in Standard Errors for Linear Regression with Multi-Stage Samples, Survey Methodology.

Bertrand, Marianne, Esther Duflo, and Sendhil Mullainathan (2004) How Much Should We Trust Differences-in-Differences Estimates? Quarterly Journal of Economics 119, 249-275.

Cameron, Colin, Jonah Gelbach, and Douglas L. Miller (2006) Bootstrap-Based Improvements for Inference with Clustered Errors, mimeographed.

Card, David and David Lee (2007) Regression Discontinuity Inference with Specification Error, Journal of Econometrics.

Chesher, Andrew and Gerald Austin (1991) The Finite-Sample Distributions of Heteroskedasticity Robust Wald Statistics, Journal of Econometrics 47, 153-173.

Chesher, Andrew and Ian Jewitt (1987) The Bias of the Heteroskedasticity Consistent Covariance Estimator, Econometrica 55, 1217-1222.

Davidson, Russell and James G. MacKinnon (1993) Estimation and Inference in Econometrics, New York and Oxford: Oxford University Press.

DeGroot, Morris (1986) Probability and Statistics, 2nd edition, Reading: Addison Wesley.

Donald, Stephen G. and Kevin Lang (2007) Inference with Difference-in-Differences and Other Panel Data, Review of Economics and Statistics 89, 221-233.

Hansen, Christian (2007a) Generalized Least Squares Inference in Multilevel Models with Serial Correlation and Fixed Effects, Journal of Econometrics.

Hansen, Christian (2007b) Asymptotic Properties of a Robust Variance Matrix Estimator for Panel Data when T is Large, Journal of Econometrics.

Horowitz, Joel L. (1997) Bootstrap Methods in Econometrics: Theory and Numerical Performance, in: Kreps and Wallis (eds.) Advances in Economics and Econometrics: Theory and Applications, Seventh World Congress, vol. III, Cambridge: Cambridge University Press, 188-222.

Kauermann, Göran and Raymond J. Carroll (2001) A Note on the Efficiency of Sandwich Covariance Estimation, Journal of the American Statistical Association 96, 1387-1396.

Kézdi, Gábor (2004) Robust Standard Error Estimation in Fixed-Effects Panel Models, Hungarian Statistical Review, Special English Volume 9, 95-116.

Kloek, T. (1981) OLS Estimation in a Model Where a Microvariable is Explained by Aggregates and Contemporaneous Disturbances are Equicorrelated, Econometrica 49, 205-207.

Liang, Kung-Yee and Scott L. Zeger (1986) Longitudinal Data Analysis Using Generalized Linear Models, Biometrika 73, 13-22.

MacKinnon, James G. and Halbert White (1985) Some Heteroskedasticity Consistent Covariance Matrix Estimators with Improved Finite Sample Properties, Journal of Econometrics 29, 305-325.

Messer, Karen and Halbert White (1984) A Note on Computing the Heteroskedasticity Consistent Covariance Matrix Using Instrumental Variables Techniques, Oxford Bulletin of Economics and Statistics 46, 181-184.

Moulton, Brent (1986) Random Group Effects and the Precision of Regression Estimates, Journal of Econometrics 32, 385-397.

Shore-Sheppard, Lara (1996) The Precision of Instrumental Variables Estimates with Grouped Data, Industrial Relations Section Working Paper 374, Princeton University.

Wooldridge, Jeffrey M. (2003) Cluster-Sample Methods in Applied Econometrics, American Economic Review 93(2), 133.

3 Appendix

In order to derive (7), write
$$y_g = \begin{bmatrix} y_{1g} \\ y_{2g} \\ \vdots \\ y_{n_g g} \end{bmatrix} \qquad \varepsilon_g = \begin{bmatrix} \varepsilon_{1g} \\ \varepsilon_{2g} \\ \vdots \\ \varepsilon_{n_g g} \end{bmatrix}$$
and
$$y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_G \end{bmatrix} \qquad x = \begin{bmatrix} \iota_1 x_1 \\ \iota_2 x_2 \\ \vdots \\ \iota_G x_G \end{bmatrix} \qquad \varepsilon = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_G \end{bmatrix}$$
where $\iota_g$ is a column vector of $n_g$ ones and $G$ is the number of groups. Notice that

$$E(\varepsilon\varepsilon') = \Sigma = \begin{bmatrix} \Sigma_1 & 0 & \cdots & 0 \\ 0 & \Sigma_2 & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & \Sigma_G \end{bmatrix}$$
$$\Sigma_g = \sigma_\varepsilon^2 \begin{bmatrix} 1 & \rho & \cdots & \rho \\ \rho & 1 & & \vdots \\ \vdots & & \ddots & \rho \\ \rho & \cdots & \rho & 1 \end{bmatrix} = \sigma_\varepsilon^2\left[(1-\rho)I + \rho\,\iota_g\iota_g'\right]$$
$$\rho = \frac{\sigma_v^2}{\sigma_v^2 + \sigma_\eta^2}.$$

Now
$$X'X = \sum_g n_g\, x_g x_g' \qquad\quad X'\Sigma X = \sum_g x_g\,\iota_g'\Sigma_g\iota_g\,x_g'.$$

But
$$x_g\,\iota_g'\Sigma_g\iota_g\,x_g' = \sigma_\varepsilon^2\, x_g\,\iota_g'\begin{bmatrix} 1 + (n_g - 1)\rho \\ 1 + (n_g - 1)\rho \\ \vdots \\ 1 + (n_g - 1)\rho \end{bmatrix} x_g' = \sigma_\varepsilon^2\, n_g\left[1 + (n_g - 1)\rho\right] x_g x_g'.$$

Denote $\tau_g = 1 + (n_g - 1)\rho$, so we get
$$x_g\,\iota_g'\Sigma_g\iota_g\,x_g' = \sigma_\varepsilon^2\, n_g\tau_g\, x_g x_g' \qquad\quad X'\Sigma X = \sigma_\varepsilon^2 \sum_g n_g\tau_g\, x_g x_g'.$$

With this at hand, we can compute the covariance matrix of the OLS estimator, which is
$$\mathrm{var}(\hat\beta_{OLS}) = (X'X)^{-1}\,X'\Sigma X\,(X'X)^{-1} = \sigma_\varepsilon^2\left(\sum_g n_g x_g x_g'\right)^{-1}\left(\sum_g n_g\tau_g\, x_g x_g'\right)\left(\sum_g n_g x_g x_g'\right)^{-1}.$$

We want to compare this with the conventional OLS covariance estimator
$$\mathrm{var}_c(\hat\beta_{OLS}) = \sigma_\varepsilon^2\left(\sum_g n_g x_g x_g'\right)^{-1}.$$

If the group sizes are equal, $n_g = n$ and $\tau_g = \tau = 1 + (n - 1)\rho$, so that
$$\mathrm{var}(\hat\beta_{OLS}) = \sigma_\varepsilon^2\,\tau\left(\sum_g n x_g x_g'\right)^{-1}\left(\sum_g n x_g x_g'\right)\left(\sum_g n x_g x_g'\right)^{-1} = \sigma_\varepsilon^2\,\tau\left(\sum_g n x_g x_g'\right)^{-1} = \tau\,\mathrm{var}_c(\hat\beta_{OLS}),$$
which implies (7).
