
Problems with and Solutions for Two-Dimensional Models of Continuous Dependent Variables

Ben Goodrich1

November 9, 2004

1Harvard University, Department of Government, Littauer Center (North Yard), 1875 Cambridge St., Cambridge, MA, 02163; email: [email protected]


Abstract

This paper addresses hierarchical models with continuous dependent variables, such as time-

series cross-section models. Building on the argument in Zorn (2001), the main point of this

paper is that the pooled OLS estimator is deeply flawed – especially for time-series cross-

section data – but for reasons that have not explicitly been raised in previous papers. The

pooled OLS estimator, the within estimator, the between estimator, and the random effects

estimator can be seen as special cases of the fractionally pooled estimator presented in Bartels

(1996), which allows all of these estimators to be evaluated in a common framework. On

both bias and efficiency grounds, using both the within estimator and the between estimator

is probably the best estimation strategy for almost all applications in political science.

1 Introduction

How should we specify linear regression models when the data vary across two dimensions,

such as space and time? It is impossible to say what should be done in all situations, but it

is easier to determine what should not be done. This paper identifies current practices that

should be avoided, and offers some alternatives that can be used.

The paper is purely methodological but is directed toward substantive researchers, refer-

ees, and editors. Although there is a lot of algebra and subtle notation, I tend to cite rather

than reproduce proofs and tend to use the “intuitive” technique of showing that familiar esti-

mators are in fact special cases of estimators that seem unreasonable. In essence, this paper

is just a synthesis of the ideas in Zorn (2001) and Bartels (1996) that formalizes some of the

methodological reservations many researchers “feel” when analyzing two-dimensional data.

I replicate some results from Green, Kim and Yoon (2001) in order to bolster its conclusion

given the criticisms from the other papers in a recent International Organization symposium

(see Oneal and Russett, 2001; Beck and Katz, 2001; King, 2001), but the implications reach

far beyond this particular example and this subfield of political science.

The stakes in this debate are great. As data have become easier to collect, models for

two-dimensional data have become more prevalent. The paradigmatic examples come from

comparative and international political economy, where each country (or country-pair) is ob-

served over time (usually years). But scholars of American politics utilize two-dimensional

models when counties are nested within states, justices within circuit courts, etc. Two-

dimensional models are also used in other disciplines – economic models where firms are

nested within industries, policy models where students are nested within schools, and soci-

ology models where individuals are nested within families to name a few.

I make a modification to the estimator described in Zorn (2001) that improves the stan-

dard errors without affecting the coefficient estimates. From there, placing weights on the


data and restrictions on the coefficients yields the fractionally pooled estimator developed in

Bartels (1996). Depending on how the data are weighted, the between estimator, the within

estimator, the random effects estimator, and the pooled OLS estimator can all be seen as

special cases of the fractionally pooled estimator. Thus, all of these estimators, plus a few

more, can be analyzed in a common framework.

The fractionally pooled estimator always requires a correction to the standard errors that

computers produce. Since all of the above estimators are special cases, they also require

corrections to the standard errors. These corrections are already familiar for the within

estimator and the between estimator, but the corrections to the standard errors of the random

effects estimator and pooled OLS estimator are novel, non-trivial, and strictly increasing.

The conclusion of this paper is that researchers should use within estimators and between

estimators when the data are two-dimensional. The restrictions that other estimators place

on the coefficients can cause bias, as demonstrated in Zorn (2001) and Green, Kim and Yoon

(2001). Given that the efficiency advantages of the random effects estimator and the pooled

OLS estimator are overstated because the uncorrected standard errors are understated, it is

unlikely that the true efficiency gains are worth the cost of bias. This finding is particularly

true when the data vary over time: Papers following the recommendations in Beck and Katz

(1996) are critically flawed unless a fixed effects specification is used.

2 Problems in the General Case

This section discusses problems with two-dimensional models of continuous dependent vari-

ables. The same problems, but not necessarily the same solutions, apply when the dependent

variable is discrete. Since the dependent variable is continuous, I focus on least squares es-

timators, but the same points apply if maximum likelihood or Markov Chain Monte Carlo

is used to estimate the model. After establishing some notation, this section demonstrates


how the common pooled OLS estimator can be constructed from component models. Seen

in this light, it is clear that alternative estimators are superior to pooled OLS.

Let i = 1, 2, . . . N index one dimension of the data and let j = 1, 2, . . . J index the other

dimension. For example, i could index states while j indexes counties or i could index

countries while j indexes time, but i and j can be anything that uniquely identifies a data

point. I will refer to the N “units of observation” in which the J “observations” are nested.

For convenience, I assume that the data are balanced, which implies that each of the N

units of observation has a complete set of J observations on all the independent variables

and the dependent variable. In most cases, the math could be modified slightly to account

for the problem of missing data, which is the norm in social science. Any unbalanced dataset

can be balanced using multiple imputation (see King et al., 2001; Little and Rubin, 2002).

A sample mean ($\bar{x}$) and a unit mean ($\bar{x}_i$),

$$\bar{x} = \frac{1}{NJ} \sum_{i=1}^{N} \sum_{j=1}^{J} x_{ij}, \tag{1}$$

$$\bar{x}_i = \frac{1}{J} \sum_{j=1}^{J} x_{ij}, \tag{2}$$

are different quantities. A unit mean can be calculated for each of the N units in the dataset for any or all of the K covariates. In this paper, a column vector is indicated by lowercase boldface lettering without subscripts. Thus, when we collect all the unit means on a covariate into a vector, $\overset{i}{\bar{\mathbf{x}}}$, the length of this vector is often NJ rather than N because each unit mean is copied J times. The context should indicate whether the length of $\overset{i}{\bar{\mathbf{x}}}$ is NJ or N.

For example, I define a “demeaned” variable to be the deviations from the unit means. Thus, $\mathbf{x} - \overset{i}{\bar{\mathbf{x}}}$ is a demeaned variable but in order to make the “meaned” variable, $\overset{i}{\bar{\mathbf{x}}}$, conformable for subtraction with the two-dimensional variable, $\mathbf{x}$, each unit mean must be copied J times so that the length of $\overset{i}{\bar{\mathbf{x}}}$ is NJ. I often use the tilde notation to indicate demeaned vectors such that $\overset{i}{\tilde{\mathbf{y}}} = \mathbf{y} - \overset{i}{\bar{\mathbf{y}}}$ and similarly for matrices, as in $\overset{i}{\tilde{\mathbf{X}}} = \mathbf{X} - \overset{i}{\bar{\mathbf{X}}}$.
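The distinction between equations 1 and 2, and the copying of unit means needed for conformability, can be illustrated with a minimal sketch. The data, the seed, and all variable names below are hypothetical and chosen only for illustration; the sketch assumes a balanced panel stored with one row per unit.

```python
import numpy as np

# Hypothetical balanced panel: N units, J observations per unit.
N, J = 4, 3
rng = np.random.default_rng(0)
x = rng.normal(size=(N, J))              # x[i, j] is observation j on unit i

sample_mean = x.mean()                   # equation 1: a single scalar
unit_means = x.mean(axis=1)              # equation 2: one mean per unit (length N)

# Flatten x to an NJ-vector (unit 0's J observations first, then unit 1's, ...).
x_long = x.reshape(N * J)

# Copy each unit mean J times so the meaned vector is conformable with x_long.
meaned_long = np.repeat(unit_means, J)   # length NJ
demeaned_long = x_long - meaned_long     # deviations from unit means

# The meaned and demeaned vectors have a zero cross-product.
cross_product = float(meaned_long @ demeaned_long)
```

The meaned vector has length N before copying and NJ after, which is the ambiguity the text asks the reader to resolve from context.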

The least squares dummy variable estimator (LSDVE),

$$y_{ij} = \alpha_i + \mathbf{x}_{ij}\boldsymbol{\beta}_w + \varepsilon_{ij}, \tag{3}$$

simply includes a separate intercept (αi) for each of the N units of observation. The row

vector of observations on the K covariates is denoted by xij (lowercase boldface lettering with

subscripts). One important characteristic of a LSDVE is that the error term does not have

a unit-specific component, because the omitted variables that would otherwise constitute a

unit-specific error component are perfectly captured by the unit-specific intercept. There

are NJ data points in the LSDVE since there are J observations on each of N units. For

the rest of this paper, I assume the objective is to estimate causal effects. However, if the

goal is merely to predict the dependent variable, the LSDVE is optimal for this purpose.

Aside from a degrees of freedom correction discussed in section 3, the LSDVE is equivalent

to the other type of “fixed effects estimator”, the within estimator (W-E),

$$(y_{ij} - \bar{y}_i) = 0 + (\mathbf{x}_{ij} - \bar{\mathbf{x}}_i)\boldsymbol{\beta}_w + \varepsilon_{ij}, \tag{4}$$

which eliminates the unit-specific error component by demeaning all the variables. A variety

of assumptions are possible regarding the distribution of the error term in fixed effects models.

However, nothing important in this paper turns on what those assumptions are. βw from

the LSDVE is identical to βw from the W-E, and the w subscript denotes a “within” effect.1
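The equality of the slope estimates from equations 3 and 4 can be checked numerically. The following is only a sketch on simulated data; the data-generating process, the seed, and the helper name `demean` are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
N, J, K = 5, 6, 2
unit = np.repeat(np.arange(N), J)                 # unit index for each of the NJ rows
alpha = rng.normal(size=N)                        # hypothetical unit-specific intercepts
X = rng.normal(size=(N * J, K)) + alpha[unit, None]
y = alpha[unit] + X @ np.array([1.0, -0.5]) + rng.normal(scale=0.1, size=N * J)

# LSDVE (equation 3): one dummy per unit plus the K covariates, no common intercept.
D = (unit[:, None] == np.arange(N)[None, :]).astype(float)
coef_lsdv = np.linalg.lstsq(np.hstack([D, X]), y, rcond=None)[0]
beta_lsdv = coef_lsdv[N:]                         # the last K entries are the slopes

def demean(v):
    """Subtract each unit's mean from that unit's rows (works for 1-D or 2-D input)."""
    means = np.array([v[unit == i].mean(axis=0) for i in range(N)])
    return v - means[unit]

# W-E (equation 4): regress demeaned y on demeaned X with no intercept.
beta_within = np.linalg.lstsq(demean(X), demean(y), rcond=None)[0]
```

The two slope vectors agree to numerical precision; only the degrees of freedom used for the standard errors differ between the two formulations.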

The counterpart to the W-E is the between estimator (B-E),

$$\bar{y}_i = \beta^{[0]} + \bar{\mathbf{x}}_i\boldsymbol{\beta}_b + \bar{\varepsilon}_i, \tag{5}$$

1For simplicity, I assume that all the estimated parameters are single points. Nothing in this paper (except space) precludes specifying a more complicated model with variable coefficients.


which posits that a meaned dependent variable is a linear function of an intercept ($\beta^{[0]}$) and

the meaned independent variables. The B-E has a total of N observations since there are

only N unit means. The b subscript on βb denotes a “between” effect.

It is more convenient to express the W-E and B-E in matrix form:

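$$\underset{NJ \times 1}{\overset{i}{\tilde{\mathbf{y}}}} = \underset{NJ \times 1}{\mathbf{0}} + \Big[\underset{NJ \times K}{\overset{i}{\tilde{\mathbf{X}}}}\Big] \underset{K \times 1}{\boldsymbol{\beta}_w} + \underset{NJ \times 1}{\boldsymbol{\varepsilon}}, \tag{6}$$

$$\underset{N \times 1}{\overset{i}{\bar{\mathbf{y}}}} = \underset{N \times 1}{\boldsymbol{\beta}^{[0]}} + \Big[\underset{N \times K}{\overset{i}{\bar{\mathbf{X}}}}\Big] \underset{K \times 1}{\boldsymbol{\beta}_b} + \underset{N \times 1}{\overset{i}{\bar{\boldsymbol{\varepsilon}}}}. \tag{7}$$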

Since the number of rows for the W-E in equation 6 is NJ, and the number of rows for the B-E

in equation 7 is N , the two equations are not naturally conformable for addition. Most of

the criticisms in this paper can be traced back to this step where the meaned data are copied

J times in order to add equation 6 to equation 7:

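$$\underset{NJ \times 1}{\overset{i}{\bar{\mathbf{y}}}} + \underset{NJ \times 1}{\overset{i}{\tilde{\mathbf{y}}}} = \underset{NJ \times 1}{\boldsymbol{\beta}^{[0]}} + \underset{NJ \times 1}{\mathbf{0}} + \Big[\underset{NJ \times K}{\overset{i}{\bar{\mathbf{X}}}}\Big] \underset{K \times 1}{\boldsymbol{\beta}_b} + \Big[\underset{NJ \times K}{\overset{i}{\tilde{\mathbf{X}}}}\Big] \underset{K \times 1}{\boldsymbol{\beta}_w} + \underset{NJ \times 1}{\overset{i}{\bar{\boldsymbol{\varepsilon}}}} + \underset{NJ \times 1}{\boldsymbol{\varepsilon}}, \tag{8}$$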

which results in an equation that can be expressed at the observation level as:

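$$\bar{y}_i + (y_{ij} - \bar{y}_i) = \beta^{[0]} + 0 + \bar{\mathbf{x}}_i\boldsymbol{\beta}_b + (\mathbf{x}_{ij} - \bar{\mathbf{x}}_i)\boldsymbol{\beta}_w + (\bar{\varepsilon}_i + \varepsilon_{ij}), \tag{9a}$$

$$y_{ij} = \beta^{[0]} + \bar{\mathbf{x}}_i\boldsymbol{\beta}_b + (\mathbf{x}_{ij} - \bar{\mathbf{x}}_i)\boldsymbol{\beta}_w + (\bar{\varepsilon}_i + \varepsilon_{ij}). \tag{9b}$$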

Equation 9b is a special case of a general model that was introduced to political science by

Zorn (2001), although it has been derived in other disciplines (see Neuhaus and Kalbfleisch,

1998; Gould, 2001). I call equation 9b a simultaneous parsed model (SPM) – parsed be-

cause the effects of the covariates are split into their between and within components and

simultaneous because the between and within estimates are obtained at the same time.


As Zorn (2001) notes, xi is uncorrelated with (xij − xi) in the SPM.2 Intuitively, meaned

variables only have between variance and demeaned variables only have within variance, so

the two vectors cannot covary. More formally, Rice (1995, p.180) proves that meaned and

demeaned variables are independent and thus orthogonal. Although meaned (demeaned)

variables covary with other meaned (demeaned) variables, the “cross-dimensional” cells of

the SPM’s variance-covariance matrix are zero, as can be seen from section 5’s example.

If two covariates are orthogonal, excluding one from the model does not affect the point

estimate of the other. Both the W-E and the B-E exclude variables from the SPM, but

the variables that the W-E and the B-E exclude are orthogonal to the variables that each

includes. Thus, β[0], βb, and βw from the SPM are identical to the point estimates that would

be obtained from applying the B-E and W-E separately, but the standard errors differ.3
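The claim that the SPM reproduces the B-E and W-E point estimates can be verified with a short simulation. This is a sketch under stated assumptions: the data are simulated and balanced, and the helper `ols` and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
N, J, K = 8, 5, 2
unit = np.repeat(np.arange(N), J)
X = rng.normal(size=(N * J, K)) + rng.normal(size=(N, K))[unit]
y = X @ np.array([0.7, -0.3]) + rng.normal(size=N * J)

Xbar = X.reshape(N, J, K).mean(axis=1)        # N x K matrix of unit means
ybar = y.reshape(N, J).mean(axis=1)           # N-vector of unit means
Xbar_long = Xbar[unit]                        # each unit mean copied J times (NJ x K)
X_dem = X - Xbar_long                         # demeaned covariates
y_dem = y - ybar[unit]                        # demeaned dependent variable

def ols(M, v):
    """Least squares coefficients of v on the columns of M."""
    return np.linalg.lstsq(M, v, rcond=None)[0]

be = ols(np.column_stack([np.ones(N), Xbar]), ybar)                  # equation 5
we = ols(X_dem, y_dem)                                               # equation 4
spm = ols(np.column_stack([np.ones(N * J), Xbar_long, X_dem]), y)    # equation 9b

# Cross-dimensional cross-products are zero, which is what drives the result.
cross = Xbar_long.T @ X_dem
```

The first 1 + K entries of `spm` match `be` and the last K match `we`. The sketch compares point estimates only and says nothing about the standard errors, which is where the paper locates the problem.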

Least squares is reasonable for the W-E and the B-E. Although the error term in the

W-E and/or the B-E may not be spherical, there are many well-known post hoc corrections

to the standard errors that are compatible with least squares point estimation. In particular,

panel-corrected standard errors (PCSEs, see Beck and Katz, 1995) and clustered standard

errors (see Arellano, 1987; Kristensen and Wawro, 2003) are popular fixes to the standard

errors of the W-E while White standard errors are popular in cross-sectional models, such

as the B-E, when the form of the heteroskedasticity is unknown.

Unfortunately, OLS standard errors are unreasonable for the SPM, and there is no at-

tractive fix. All J errors for each of the N units of observation have a common component,

εi, violating the independence assumption. Thus, there is “within-unit” correlation in the

errors, although this correlation is conceptually distinct from “autocorrelation”, since autocorrelation implies that $E[\varepsilon_{ij} \times \varepsilon_{ij'}] \neq 0$ when $j \neq j'$ where j indexes time. But one result

from the time-series literature is applicable here: The damage autocorrelation does to the

2It is also true that εi is orthogonal to (εij − εi). Although the error term is unobserved, both εi and (εij − εi) are linear functions of variables that have no cross-dimensional covariance.

3The exactness of this claim depends on the data being balanced.


estimated standard errors depends on the degree of persistence in the independent variables.

Recall that in order to construct the SPM, the meaned observations needed to be copied J

times each. Since the meaned variables in a SPM do not vary within units, the “persistence”

of meaned variables is perfect, and the common component to the error term has a pernicious

impact on the estimated standard errors of between estimates. Although the point estimates

are the same, the standard errors from the B-E exceed the standard errors of the between

estimates in the SPM by a factor that is no less than $\sqrt{J}$. Thus, if J = 36, a t statistic of

about 12 would be necessary to make a between estimate in a SPM statistically significant.
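The arithmetic behind the figure of about 12 can be sketched as follows. The critical value of 1.96 is an assumption (the conventional two-sided 5% cutoff under a normal approximation); the text does not state which threshold it has in mind.

```python
import math

J = 36
critical_t = 1.96                 # assumed conventional two-sided 5% critical value
inflation = math.sqrt(J)          # lower bound on the factor by which SPM understates the SE
required_t = critical_t * inflation   # reported t needed to remain significant after correction
```

With these inputs `required_t` is 11.76, which rounds to the "about 12" quoted above.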

PCSEs do not fix this problem with the SPM. PCSEs do fix unit heteroskedasticity and

correlation across units of observation, but only if there is no within-unit correlation in the

errors – a condition that does not hold if εi exists (see Kristensen and Wawro, 2003). What

Stata calls “robust clustered Huber/White/sandwich” standard errors “fix” this problem in

a very conservative way by creating one “super error” for each of the N units of observation.4

Thus, this correction sacrifices all but N degrees of freedom and, unlike PCSEs, does not

address the problem of correlation in the error term across units of observation.

Regardless, if one places the restrictions that βb = βw on the SPM, the result,

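$$y_{ij} = \beta^{[0]} + \bar{\mathbf{x}}_i\boldsymbol{\beta}_b + (\mathbf{x}_{ij} - \bar{\mathbf{x}}_i)\boldsymbol{\beta}_w + (\bar{\varepsilon}_i + \varepsilon_{ij}) = \beta^{[0]} + \mathbf{x}_{ij}\boldsymbol{\beta}_{POLSE} + (\bar{\varepsilon}_i + \varepsilon_{ij}), \tag{10}$$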

is the familiar pooled OLS estimator (POLSE), which is the most common estimator for

two-dimensional data in political science. However, the POLSE is nested within the SPM,

and the primary point in Zorn (2001) is that anyone who intends to use a pooled model

should first verify whether the restrictions that βb = βw are valid using a SPM. What is not

mentioned in Zorn (2001) is that the OLS estimate for $\sigma^2$ is biased in the SPM due to the

4Robust clustered Huber/White/sandwich standard errors are distinct from what Kristensen and Wawro (2003) calls clustered standard errors following Arellano (1987). The cluster() option following the reg command in Stata produces the former while the cluster() option following areg produces the latter.


presence of εi. Thus, the F test and other ways to check whether βb = βw in the SPM are

biased against the restrictions. This problem with the SPM is overcome in section 3.

The existence of εi in a POLSE is what Green, Kim and Yoon (2001) calls “unit hetero-

geneity”, which is also discussed in Wilson and Butler (2003). One way of addressing the

problem of unit heterogeneity that is discussed briefly in those two papers is the random

effects estimator (REE). The REE,

$$y^{*}_{ij} = \beta^{[0]*} + \mathbf{x}^{*}_{ij}\boldsymbol{\beta}_{REE} + (\bar{\varepsilon}_i + \varepsilon_{ij})^{*}, \tag{11}$$

is just a POLSE with variables that have undergone a GLS transformation. In equation 11,

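$y^{*}_{ij} = y_{ij} - \theta\bar{y}_i$, $\mathbf{x}^{*}_{ij} = \mathbf{x}_{ij} - \theta\bar{\mathbf{x}}_i$, $\beta^{[0]*} = (1 - \theta)\beta^{[0]}$, and $(\bar{\varepsilon}_i + \varepsilon_{ij})^{*} = (\bar{\varepsilon}_i + \varepsilon_{ij}) - \theta\bar{\varepsilon}_i$, where

$$\theta = 1 - \sqrt{\frac{\hat{\sigma}^2_{W\text{-}E}}{J\hat{\sigma}^2_{B\text{-}E}}}. \tag{12}$$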

The quasi-differencing parameter θ is a function of the estimated error variance in the

W-E and B-E. Clearly, if θ = 1, the REE reduces to a W-E, and if θ = 0, the REE reduces

to a POLSE. Greene (2000, p.569) notes that “[t]o the extent that [θ > 0], we see that

the inefficiency of least squares will follow from an inefficient weighting of the [between and

within] estimators. Compared with generalized least squares, ordinary least squares places

too much weight on the between-units variation. It includes it all in the variation in X,

rather than apportioning some of it to random variation across groups.”
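Equation 12 and the quasi-differencing transformation in equation 11 are simple enough to sketch directly. The function names and the numerical inputs below are hypothetical; the variance arguments stand in for the estimated error variances from a W-E and a B-E fitted beforehand.

```python
import math

def theta(sigma2_within, sigma2_between, J):
    """Quasi-differencing parameter from equation 12."""
    return 1.0 - math.sqrt(sigma2_within / (J * sigma2_between))

def quasi_demean(value, unit_mean, th):
    """Transformation applied to y and to each covariate in equation 11."""
    return value - th * unit_mean

# theta = 1 recovers full demeaning (W-E); theta = 0 leaves the data untouched (POLSE).
th_within_limit = theta(0.0, 1.0, J=4)    # no within error variance
th_pooled_limit = theta(4.0, 1.0, J=4)    # within variance equals J times between variance
th_middle = theta(1.0, 1.0, J=4)          # an intermediate case
```

For example, `quasi_demean(3.0, 2.0, th_middle)` subtracts half of the unit mean, placing the transformed value between the raw and the fully demeaned observation.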

Greene’s claim will be given more force in section 3, but it should be kept in mind that the

REE, like the POLSE, imposes the restrictions that βb = βw. If these restrictions are invalid,

the REE and the POLSE yield biased estimates because the parameter constancy assumption

is violated. The obvious question is, “Why would the between effect of explanatory variable

k not be equal to the within effect of k?”. However, a better question is “Why not estimate


a more general model and check?” Zorn (2001) provides several examples where one might

expect between and within effects to differ, either in sign or magnitude. To me, the most

intuitive example comes from Gould (2001): Suppose the dependent variable is a sample of

Americans’ wages over time, and the independent variables are regional dummies with the

northeast excluded. Wages in southern states are lower than in the northeast, on average,

and the between-unit effect of the SOUTH dummy variable is expected to be negative.

However, if people move to the south from the northeast, they are likely taking better-

paying jobs. Thus, the within-unit effect of the SOUTH dummy variable is expected to be

positive. In other cases, the expected sign of the between and within effects is the same but

the magnitudes may be significantly different.

In many cases, theory will suggest that βb should equal βw, but theory is not an excuse

for failing to verify their equality. Although it is neither possible nor necessary to gather

data on all relevant variables, we should always admit the possibility that a specification

error could drive a wedge between βb and βw for at least one of the explanatory variables,

which may be the key causal variable(s) or a control variable that is correlated with the key

causal variable(s). Although the SPM does not permit a fair test of whether βb = βw, I

modify the SPM in section 3 to avoid the problem of duplicative error components and to

facilitate the evaluation of these restrictions.

3 The Fractionally Pooled Estimator

To reemphasize, although the standard errors from the SPM are wrong unless εi = 0 ∀i,

the POLSE is just a SPM with the restrictions that βb = βw. Hence, the POLSE generally

has the wrong standard errors and adds the additional risk that βb may not equal βw.

The root of the problem with the standard errors is that the meaned data – for both the

dependent variable and the independent variables – are copied in the SPM, which creates


the common components in the error term. Thus, instead of adding the demeaned data

to the meaned data as in equation 8, we could stack the NJ demeaned observations and

the N meaned observations using properties of partitioned matrices. This move is a two-

dimensional extension of Bartels (1996), which addresses the “pooling” of data generally. In

Bartels’ language, there are two “regimes” or data-generating processes, one “within-unit

regime” (with NJ data points) and one “between-unit regime” (with N data points).

The main point in Bartels (1996) is that the regimes can be preferentially weighted before

imposing pooling restrictions on the coefficients. Let w and b be known scalars between zero

and one inclusive that serve as weights for the within and between regimes respectively. Let

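$$\mathbf{Y} = \begin{bmatrix} w\,\underset{NJ \times 1}{\overset{i}{\tilde{\mathbf{y}}}} \\ b\,\underset{N \times 1}{\overset{i}{\bar{\mathbf{y}}}} \end{bmatrix}, \quad \mathbf{A} = \begin{bmatrix} \underset{NJ \times 1}{\mathbf{0}} \\ b\,\underset{N \times 1}{\mathbf{1}} \end{bmatrix}, \quad \mathbf{B} = \begin{bmatrix} \underset{NJ \times K}{\mathbf{0}} \\ b\,\underset{N \times K}{\overset{i}{\bar{\mathbf{X}}}} \end{bmatrix}, \quad \mathbf{W} = \begin{bmatrix} w\,\underset{NJ \times K}{\overset{i}{\tilde{\mathbf{X}}}} \\ \underset{N \times K}{\mathbf{0}} \end{bmatrix}, \quad \mathbf{E} = \begin{bmatrix} w\,\underset{NJ \times 1}{\boldsymbol{\varepsilon}} \\ b\,\underset{N \times 1}{\overset{i}{\bar{\boldsymbol{\varepsilon}}}} \end{bmatrix},$$

and we can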

consider the properties of the following two regression models:

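$$\mathbf{Y} = \beta^{[0]}\mathbf{A} + \mathbf{B}\boldsymbol{\beta}_b + \mathbf{W}\boldsymbol{\beta}_w + \mathbf{E}, \tag{13}$$

$$\mathbf{Y} = \beta^{[0]}\mathbf{A} + (\mathbf{B} + \mathbf{W})\boldsymbol{\beta}_{FPE} + \mathbf{E}. \tag{14}$$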

Equation 13 is also a simultaneous parsed model, which I call the SPM2 to distinguish it

from the SPM given in equation 9b. The SPM2 produces point estimates that are identical

to those of the SPM but standard errors that are more appropriate. In the SPM2, each

of the meaned data points appears only once, so there are no common components in the

error term and no duplicative data. Thus, the SPM as formulated in Zorn (2001) should not

be used when the dependent variable is continuous, but all the conceptual points in Zorn

(2001) still apply to the SPM2. Equation 14 imposes the pooling restrictions that βb = βw

on equation 13 and is an example of the “fractionally pooled estimator” (FPE) developed in

Bartels (1996) since w and b are allowed to take any values between zero and one inclusive.

In this formulation, the number of demeaned variables (K) equals the number of meaned


variables, but this need not be the case. Often, there will be some variables that vary across

units but not within units. Such variables can be included in the “between” matrix (B) but

not in the “within” matrix (W). Less often (and only when the data are balanced), there

may be some covariates that vary within units but do not vary across units, which can be

included in W but not in B. Any problems with matrix conformability in the FPE can be

avoided by adding columns of zeroes to W or B as appropriate.
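The stacking that defines Y, A, B, and W can be sketched in a few lines. This is an illustration on simulated balanced data, not the author's code; the function `stack`, the helper `ols`, and the data-generating process are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
N, J, K = 8, 5, 2
unit = np.repeat(np.arange(N), J)
X = rng.normal(size=(N * J, K)) + rng.normal(size=(N, K))[unit]
y = X @ np.array([0.7, -0.3]) + rng.normal(size=N * J)

Xbar = X.reshape(N, J, K).mean(axis=1)
ybar = y.reshape(N, J).mean(axis=1)
X_dem, y_dem = X - Xbar[unit], y - ybar[unit]

def stack(w, b):
    """Stack NJ demeaned rows on top of N meaned rows, weighted by w and b."""
    Y = np.concatenate([w * y_dem, b * ybar])
    A = np.concatenate([np.zeros(N * J), b * np.ones(N)])
    B = np.vstack([np.zeros((N * J, K)), b * Xbar])
    W = np.vstack([w * X_dem, np.zeros((N, K))])
    return Y, A, B, W

def ols(M, v):
    return np.linalg.lstsq(M, v, rcond=None)[0]

Y, A, B, W = stack(w=1.0, b=1.0)
spm2 = ols(np.column_stack([A, B, W]), Y)     # equation 13: separate between and within slopes
fpe = ols(np.column_stack([A, B + W]), Y)     # equation 14: pooling restriction imposed

# Reference estimates from the B-E and W-E fitted separately.
be = ols(np.column_stack([np.ones(N), Xbar]), ybar)
we = ols(X_dem, y_dem)
```

Because each meaned observation appears only once, the stacked system has NJ + N rows rather than the NJ rows with duplicated means used by the SPM. A covariate that varies only across units would be handled by leaving its column in W as zeros.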

There are two important points regarding the standard errors of the SPM and the FPE.

First, as Bartels (1996, note 4) and others have pointed out, even if the point estimates

are allowed to differ across regimes, the SPM2 still makes the assumption that the error

variance is the same in the within regime as in the between regime, which is dubious. This

homoskedasticity assumption can be relaxed in the SPM2 by weighting the regimes appro-

priately or by using GLS. It is not immediately obvious how something like PCSEs would

work for a SPM2 because different fixes to the error term are needed for the two regimes.

Second, Bartels (1996) warns that if the FPE is used, the standard errors the computer

reports need to be scaled by a correction factor. Intuitively, if an observation is down-

weighted, it cannot count as a whole degree of freedom. For all least squares estimators, we

make degrees of freedom corrections to the standard errors by multiplying the standard errors by $\sqrt{\frac{\text{Wrong \# DF}}{\text{Right \# DF}}}$. The correction factor in the two-dimensional case is $\sqrt{\frac{NJ+N-K}{wNJ+bN-wN-K}}$,

which differs slightly from the correction factor in Bartels (1996).5 Depending on the values

of w and b, the magnitude of this degrees of freedom correction will change but will always

increase the standard errors. One main result of this paper is that, depending on the values

of w and b, all the estimators discussed in this paper relate to the FPE, as shown in table 1.

5The correction factor given in Bartels (1996) is essentially $\sqrt{\frac{NJ+N-K}{\lambda NJ+(1)N-K}}$, where λ corresponds to the weight between zero and one inclusive. I generalize this correction factor to allow either the within or between regime to be downweighted, which would merely restate Bartels’ correction factor as $\sqrt{\frac{NJ+N-K}{wNJ+bN-K}}$. However, in the econometrics literature, N degrees of freedom are subtracted when using a W-E, but if the within regime is downweighted this correction should be mitigated. Thus, the correction factor used in this paper is $\sqrt{\frac{NJ+N-K}{wNJ+bN-wN-K}}$, which behaves properly in the extreme cases that w = 0 or b = 0.


Table 1: Special cases and generalizations of the fractionally pooled estimator (FPE)

| Weight w | Weight b | Special Case of the FPE | Correction Factor for $\sigma_{FPE}$ | Unrestricted Analogue to the FPE |
|---|---|---|---|---|
| 1 | 0 | Within Estimator | $\sqrt{\frac{NJ+N-K}{NJ-N-K}}$ | Unrestricted because b = 0 |
| 0 | 1 | Between Estimator | $\sqrt{\frac{NJ+N-K}{N-K}}$ | Unrestricted because w = 0 |
| ? | ? | Random Effects Estimator | $\sqrt{\frac{NJ+N-K}{wNJ+bN-wN-K}}$ | Consecutive Parsed Estimator |
| $\frac{1}{\sqrt{J}}$ | 1 | Pooled OLS Estimator | $\sqrt{\frac{NJ+N-K}{NJ^{1/2}+N\left(1-J^{-1/2}\right)-K}}$ | Simultaneous Parsed Model |
| 1 | 1 | Pooled OLS Estimator 2 | $\sqrt{\frac{NJ+N-K}{NJ-K}}$ | Simultaneous Parsed Model 2 |

Notes: The general form of the correction factor for $\sigma_{FPE}$ (and thus for the standard errors) is $\sqrt{\frac{NJ+N-K}{wNJ+bN-wN-K}}$. The weights for the pooled OLS estimator assume the data are balanced. The consecutive parsed estimator (CPE) and the pooled OLS estimator 2 (POLSE2) are discussed below.

If all the weight is placed on the within regime by specifying that w = 1 and b = 0, the FPE reduces to a W-E, and $\sqrt{\frac{NJ+N-K}{wNJ+bN-wN-K}}$ reduces to $\sqrt{\frac{NJ+N-K}{NJ-N-K}}$, where the denominator reflects the textbook degrees of freedom correction for a W-E (see, for example, Greene, 2000, p.562). If w = 0 and b = 1, all the weight is placed on the between regime, and the FPE reduces to a B-E while $\sqrt{\frac{NJ+N-K}{wNJ+bN-wN-K}}$ reduces to $\sqrt{\frac{NJ+N-K}{N-K}}$, which again produces the correct number of degrees of freedom for a B-E because there are only N unit means.
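The general correction factor and its special cases can be evaluated with a small helper. The function name and the example dimensions (N = 20, J = 10, K = 3) are hypothetical and chosen only to make the magnitudes concrete.

```python
import math

def correction_factor(N, J, K, w, b):
    """sqrt(wrong df / right df) for a fractionally pooled estimator with weights w and b."""
    wrong_df = N * J + N - K                       # what software assumes for the stacked data
    right_df = w * N * J + b * N - w * N - K       # degrees of freedom after weighting
    return math.sqrt(wrong_df / right_df)

N, J, K = 20, 10, 3
cf_within = correction_factor(N, J, K, w=1.0, b=0.0)          # denominator NJ - N - K
cf_between = correction_factor(N, J, K, w=0.0, b=1.0)         # denominator N - K
cf_polse = correction_factor(N, J, K, w=J ** -0.5, b=1.0)     # implicit pooled OLS weights
cf_polse2 = correction_factor(N, J, K, w=1.0, b=1.0)          # denominator NJ - K
```

In this example every factor exceeds one, so the corrected standard errors are larger than the uncorrected ones, and the factor for the implicit pooled OLS weights is much larger than the factor for equal weights.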

The FPE also encompasses the REE but apparently not in a closed form fashion. Recall that in the REE, $\theta = 1 - \sqrt{\frac{\hat{\sigma}^2_{W\text{-}E}}{J\hat{\sigma}^2_{B\text{-}E}}}$, which allows the REE to compromise between a W-E and a POLSE. In the extreme case that θ = 1, the REE reduces to a W-E, so the corresponding weights for the FPE are w = 1 and b = 0. In the other extreme case that θ = 0, the REE reduces to a POLSE, which corresponds to a FPE with weights of $w = J^{-1/2}$ and b = 1. For intermediate values of θ, there should be unique values of w and b that produce the same point estimates as a REE, but to find them, a computer would need to search numerically over the intervals $w \in \left[\frac{1}{\sqrt{J}}, 1\right]$ and $b \in [0, 1]$ to minimize $\sum_{k=0}^{K} \mathrm{abs}\left(\beta^{[k]}_{REE} - \beta^{[k]}_{FPE} \mid w, b\right)$.

Perhaps the main significance of this finding is that, while the GLS standard errors


produced by the REE are only valid asymptotically, the FPE emulation would lend itself to

a finite sample correction if the critical values of w and b can be found. This finding has

intuition behind it: The REE is a compromise between the W-E and the POLSE; the W-E

has fewer degrees of freedom than a POLSE; so how is the usual practice of using NJ −K

degrees of freedom for both a REE and a POLSE justified if θ > 0?

Moreover, the POLSE is a compromise between the W-E and the B-E; both the W-E

and the B-E have fewer than NJ −K degrees of freedom; so why should the POLSE have

NJ − K degrees of freedom? The FPE formalizes this intuition because it reduces to a

POLSE if $w = J^{-1/2}$ and b = 1. The exactness of this conclusion depends on the data being balanced, but when the data are unbalanced, it is possible to approximate a POLSE by minimizing $\sum_{k=0}^{K} \mathrm{abs}\left(\beta^{[k]}_{POLSE} - \beta^{[k]}_{FPE} \mid w, b\right)$. The correction factor for the standard errors of the FPE is $\sqrt{\frac{NJ+N-K}{NJ^{1/2}+N\left(1-J^{-1/2}\right)-K}}$. Hence, the correct standard errors in this FPE can be much larger than the standard errors that the computer produces for the POLSE. We should be suspicious of any published result that utilizes a POLSE if the t statistic fails to exceed conventional levels when divided by $\sqrt{\frac{NJ-K}{NJ^{1/2}+N\left(1-J^{-1/2}\right)-K}}$ (assuming balanced data).

Also, the FPE exposes the fact that the POLSE implicitly weights the data in a non-substantive

fashion, which could have been anticipated given the derivation in equation 10. The POLSE

is a SPM with the restrictions that βb = βw, and the SPM copies the meaned data J times.

Thus, the POLSE weights the between variance relative to the within variance using a $1 : J^{-1}$ scheme. In the equivalent FPE, we use $w = J^{-1/2}$ rather than $w = J^{-1}$ because the w term is squared when least squares is utilized (see Bartels, 1996, equation 20), which implies that the weighting of the between variance relative to the within variance in this FPE ultimately follows a $1 : J^{-1}$ scheme as in the POLSE. Thus, we should also be suspicious that published results from a POLSE are driven by the arbitrary implicit weights. If $\boldsymbol{\beta}_b \neq \boldsymbol{\beta}_w$, then the implicit weights have a major effect on $\boldsymbol{\beta}_{POLSE}$, which is a weighted average of $\boldsymbol{\beta}_b$ and $\boldsymbol{\beta}_w$.
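The equivalence between pooled OLS and the FPE with $w = J^{-1/2}$ and b = 1 can be checked numerically. The sketch below uses simulated balanced data with a unit-specific error component; the data-generating process, the seed, and the helper `ols` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
N, J, K = 10, 4, 2
unit = np.repeat(np.arange(N), J)
X = rng.normal(size=(N * J, K)) + rng.normal(size=(N, K))[unit]
# Hypothetical outcome with a unit-level error component plus idiosyncratic noise.
y = 0.5 + X @ np.array([1.0, -1.0]) + rng.normal(size=N)[unit] + rng.normal(size=N * J)

Xbar = X.reshape(N, J, K).mean(axis=1)
ybar = y.reshape(N, J).mean(axis=1)
X_dem, y_dem = X - Xbar[unit], y - ybar[unit]

def ols(M, v):
    return np.linalg.lstsq(M, v, rcond=None)[0]

# Pooled OLS on the raw NJ observations with an intercept.
pols = ols(np.column_stack([np.ones(N * J), X]), y)

# FPE with the implicit pooled OLS weights: w = J^(-1/2), b = 1.
w, b = J ** -0.5, 1.0
Y = np.concatenate([w * y_dem, b * ybar])
A = np.concatenate([np.zeros(N * J), b * np.ones(N)])
X_stacked = np.vstack([w * X_dem, b * Xbar])      # this is B + W from equation 14
fpe = ols(np.column_stack([A, X_stacked]), Y)
```

The vectors `pols` and `fpe` coincide because least squares squares the weights, so the within sum of squares is scaled by 1/J relative to the between sum of squares, which is the same relative weighting that copying the means J times produces.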

The SPM is the unrestricted version of the POLSE, but these estimators are strictly


dominated by the SPM2 and the POLSE2 respectively. The POLSE2 is a special case of

the FPE where w = b = 1, which is what the POLSE masquerades as. The correction

factor for the POLSE2 reduces to $\sqrt{\frac{NJ+N-K}{NJ-K}}$, whose denominator reflects the usual degrees

of freedom for a POLSE. Hence, the usual degrees of freedom for a POLSE assume non-

preferential weighting even though the POLSE implicitly downweights the within variance.

The best way to evaluate the restrictions that βb = βw is to find the smaller of the two

Bayesian Information Criteria (BIC) between the SPM2 and the POLSE2.6 There are many

different formulations of the BIC. In Raftery (1995, equation 26),

$$\mathrm{BIC}' = (NJ + N)\ln\left(1 - R^2\right) + (K - 1)\ln(NJ + N), \tag{15}$$

and $R^2$ is the proportion of explained variance. This BIC can be used to approximate the odds in favor of the SPM2 over the POLSE2 using the formula $\exp\left(\frac{\mathrm{BIC}'_{POLSE2} - \mathrm{BIC}'_{SPM2}}{2}\right)$.
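Equation 15 and the odds formula translate directly into code. The sketch below follows the equation as printed here, including the (K − 1) term; the function names and the example inputs are hypothetical, and `n_obs` stands for NJ + N in the stacked model.

```python
import math

def bic_prime(r_squared, n_obs, k):
    """BIC' as printed in equation 15, with n_obs = NJ + N and k = K."""
    return n_obs * math.log(1.0 - r_squared) + (k - 1) * math.log(n_obs)

def odds_spm2_over_polse2(bic_polse2, bic_spm2):
    """Approximate odds in favor of the SPM2 over the POLSE2."""
    return math.exp((bic_polse2 - bic_spm2) / 2.0)

# Hypothetical fits: the SPM2 estimates more slopes and explains slightly more variance.
n_obs = 20 * 10 + 20
bic_spm2 = bic_prime(r_squared=0.45, n_obs=n_obs, k=6)
bic_polse2 = bic_prime(r_squared=0.40, n_obs=n_obs, k=3)
odds = odds_spm2_over_polse2(bic_polse2, bic_spm2)
```

Odds above one favor the SPM2 (and thus cast doubt on the pooling restrictions), while odds below one favor the POLSE2. Equal BIC values give odds of exactly one.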

Another important conclusion is that using the BIC to choose between a POLSE and a

LSDVE is flawed. First, the POLSE has arbitrary implicit weights that cannot be defended

on substantive grounds. Second, equation 15 assumes the errors are normally distributed,

which does not hold if there is unit-heterogeneity in the POLSE. Third, Bartels (1996, note 9)

claims that the R2 of the FPE is too large unless the preferential weighting is accounted for,

which implies that the BIC for the POLSE is biased. The POLSE2 does not preferentially

weight the data and can properly be compared with the SPM2 using the BIC.

The only argument for the POLSE over the LSDVE is that the POLSE exploits all the

variation in the data, while the LSDVE really only uses the within variation. This point can

be nullified by separately utilizing a W-E and a B-E, which I collectively call the consecutive

parsed estimator (CPE). Thus, the CPE makes use of all the within variance in the data

6Bartels (1996) has valid criticisms of testing restrictions, which can be partially mitigated by using the BIC instead of a hypothesis test. Furthermore, Gould (2001) notes that the Hausman test to discriminate between fixed and random effects is asymptotically equivalent to a F test of restrictions that βb = βw in a SPM. But the F test is biased in the SPM due to the duplicative error components.


and all of the between variance in the data, but does not attempt to synthesize the results

statistically (but does not preclude the researcher from synthesizing the results analytically).7

The point estimates from the SPM2 are identical to those produced by the CPE, and the

CPE produces better standard errors than the SPM2. Thus, if the SPM2 casts doubt on the

restrictions that βb = βw, the CPE should be used for the following reasons.

The CPE automatically relaxes the homoskedasticity assumption in the SPM2 that the

error variance for the within-unit regime is equal to the error variance for the between-

unit regime. The W-E and the B-E may suffer from other violations of the spherical error

assumption, but it is easy to calculate PCSEs or clustered standard errors following the W-E

and equally easy to calculate White standard errors following the B-E. As yet, we do not

know how to make such corrections for a SPM2.

Misspecification along the between (within) dimension in a SPM2 increases the standard

errors of all estimates. If the B-E is estimated separately from the W-E, omitting a relevant

variable from the B-E will affect the standard errors of the between-unit estimates only (and

vice versa). Finally, the W-E and the B-E individually estimate fewer parameters than does

the SPM2. For these reasons, the CPE has a small efficiency advantage over the SPM2,

although this efficiency advantage might not be apparent from the output if the standard

errors in the SPM2 reflect the dubious regime-homoskedasticity assumption.
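To make the mechanics of the CPE concrete, the following is a minimal sketch in Python with NumPy, run on simulated data rather than on any dataset used in this paper. The function name `cpe`, the sample sizes, and the simulated coefficients (a between effect of 2.0 and a within effect of 0.5) are illustrative assumptions. The W-E regresses demeaned y on demeaned x, and the B-E regresses the unit means of y on a constant and the unit means of x.

```python
import numpy as np

def cpe(x, y, unit):
    """Return (within slope, between slope) for a single covariate.

    x, y : 1-D arrays of observations; unit : integer unit ids 0..N-1.
    """
    n_units = int(unit.max()) + 1
    counts = np.bincount(unit, minlength=n_units)
    x_bar = np.bincount(unit, weights=x, minlength=n_units) / counts
    y_bar = np.bincount(unit, weights=y, minlength=n_units) / counts
    # W-E: demeaned y on demeaned x (the intercept is swept out).
    x_dm = x - x_bar[unit]
    y_dm = y - y_bar[unit]
    beta_w = np.linalg.lstsq(x_dm[:, None], y_dm, rcond=None)[0][0]
    # B-E: unit means of y on a constant and unit means of x.
    X_b = np.column_stack([np.ones(n_units), x_bar])
    beta_b = np.linalg.lstsq(X_b, y_bar, rcond=None)[0][1]
    return beta_w, beta_b

rng = np.random.default_rng(0)
N, T = 200, 20
unit = np.repeat(np.arange(N), T)
x = rng.normal(size=N)[unit] + rng.normal(size=N * T)   # varies on both dimensions
x_bar = np.bincount(unit, weights=x) / T
# Assumed data-generating process: between effect 2.0, within effect 0.5.
y = 2.0 * x_bar[unit] + 0.5 * (x - x_bar[unit]) + rng.normal(scale=0.1, size=N * T)

beta_w, beta_b = cpe(x, y, unit)
print(f"within: {beta_w:.3f}  between: {beta_b:.3f}")
```

Because the two regressions are run separately, a correction to the standard errors can be applied to each one on its own terms, which is the flexibility the text attributes to the CPE.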

The conclusion that researchers should use the CPE when the data are two-dimensional

may seem too orthodox. Many believe that only a POLSE or a REE can estimate the effect

of variables that do not vary within units of observation. However, the B-E half of the

CPE is optimal for this purpose. A B-E can produce good standard errors, and its point

estimates are not biased by the potentially invalid equality restrictions the POLSE and the

REE impose on the coefficients of two-dimensional variables. Variables that do not change

7 In table 1, I claimed that the CPE was an unrestricted version of the REE. This claim is somewhat tortuous. The REE does use a CPE to calculate θ, and after transforming the data, imposes the restrictions that βb = βw. The GLS transformation muddles the connection between the CPE and the REE.


within units of observation can only explain between-unit variance in the dependent variable,

so nothing is “lost” when the B-E is used to estimate the effects of such variables. What

appears to be lost is some precision in the standard errors, but the apparent precision of the

POLSE and the REE is merely a reflection of incorrect standard errors.

But granted that the CPE approach is nevertheless somewhat orthodox, there are three

possibilities for compromise, although I am not especially well-disposed toward any of them.

First, Bartels (1996) gives a Bayesian justification for fractional pooling where the unbiased

regime (the within regime in this case) receives a weight of unity and the other regime is

downweighted. The catch is that an explicit and substantive justification of the weighting

scheme must be given, but setting w = 1 and b ∈ (0, 1) is a potentially plausible course of

action when the amount of within variation is small. Ironically, many researchers – claiming

that there is “too little within variation in the data” – resort to a POLSE, which implicitly

downweights the scarce within variation. Moreover, the POLSE produces small standard

errors in this situation because its standard errors are not corrected for unit

heterogeneity and implicit weighting.

A second compromise is to impose the restrictions that the between effect of a covariate

is equal to the corresponding within effect for some, but not all, of the covariates. Zorn

(2001) contemplates this “partial pooling” compromise, although the SPM2 should be used

to evaluate the restrictions rather than the SPM. It would be difficult to justify restricting

the between and within effects of a control variable to be equal, because doing so would risk

bias to the key causal variable while only saving one degree of freedom for each restriction

imposed. However, restricting the between and within effects of the key causal variable to

be equal would increase the precision for the key estimate and, depending on the evidence

from the SPM2, may not cause too much bias. Fortunately, the matter can (and should)

always be resolved in a data-driven way rather than assumed.

At present, the partial pooling route unfortunately does not lend itself to PCSEs or


other post hoc corrections to the standard errors. The third possibility for compromise is

to average the within effect of the key causal variable (k) with the corresponding between

effect using the CPE and the textbook formulas for the sum of two random variables:

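$$\beta^{[k]} = \frac{1}{2} \times \left( \beta^{[k]}_{w} + \beta^{[k]}_{b} \right), \tag{16}$$

$$SE\left(\beta^{[k]}\right) = \frac{1}{2} \times \sqrt{ \left[ SE\left(\beta^{[k]}_{w}\right) \right]^{2} + \left[ SE\left(\beta^{[k]}_{b}\right) \right]^{2} + 2 \times Cov\left(\beta^{[k]}_{w}, \beta^{[k]}_{b}\right) }. \tag{17}$$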

Since a demeaned variable has no covariance with a meaned variable, the last term under the

radical in equation 17 drops out and information from non-nested models can be averaged.

One virtue of this approach is that the standard errors of the within and between estimates

can be fixed with any of the well-known post hoc corrections before equation 17 is used.
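Equations 16 and 17 are simple enough to compute by hand, but a short sketch may help fix ideas. The function name `average_effect` and the input values below are hypothetical and chosen only for illustration; the covariance term defaults to zero, as the text argues it should for demeaned and meaned variables.

```python
import math

def average_effect(beta_w, se_w, beta_b, se_b, cov_wb=0.0):
    """Average a within and a between estimate (equations 16 and 17)."""
    beta = 0.5 * (beta_w + beta_b)
    se = 0.5 * math.sqrt(se_w ** 2 + se_b ** 2 + 2.0 * cov_wb)
    return beta, se

# Hypothetical within and between estimates with their standard errors.
beta, se = average_effect(beta_w=0.4, se_w=0.03, beta_b=1.2, se_b=0.04)
print(beta, se)
```

The standard errors passed in can be whichever corrected versions the researcher prefers for each regime, since the correction is applied before the averaging step.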

However, how should we interpret an averaged estimate? All estimators that are used

for two-dimensional data either yield between estimates, within estimates, or some matrix-

weighted average of between and within estimates. Between and within estimates have

clear, albeit different, interpretations. A between effect reflects the expected difference in

the dependent variable when two units of observation only differ on one independent variable.

A within effect reflects the expected change in a unit’s dependent variable when one of its

independent variables changes. But there is no substantive interpretation for a matrix-

weighted average of between and within estimates except in the limiting case that βb = βw,

making both interpretations valid. This point applies to equation 16 just as much as it

applies to the POLSE, REE, and POLSE2.

This section has provided a framework to answer three questions that should always be

asked when analyzing two-dimensional data. First, are the restrictions on the coefficients,

if any, valid? Second, given the restrictions, are the implicit or explicit weights on the data

appropriate? Third, given the weights, are the degrees of freedom correctly calculated? The

answer to the first question is often negative, implying that a CPE should be used.


4 The Time-series-Cross-section Case

The previous sections discussed issues that arise with all two-dimensional datasets. This sec-

tion focuses on additional problems that occur in the special case where the second dimension

is time, which is the most common type of two-dimensional dataset in political science. Let

i = 1, 2, . . . N continue to index the units of observation, which are usually countries or pairs

of countries in the political economy literature. However, now the second dimension is time,

which is usually years in the political economy literature. Thus, j = t = 1, 2, . . . T indexes

the temporal dimension of the data. For concreteness, I assume that T is relatively large

so that the data can be considered “time-series-cross-section” (TSCS) data, but all of my

claims also apply to “panel” data where T is relatively small.

Beck and Katz (1996) recommends a two-dimensional version of the “auto-regressive

distributed lag” (ARDL) model, as a starting point from which to “test down”. For example,

$$y_{it} = \alpha + \phi y_{it-1} + x_{it}\beta + x_{it-1}\gamma + \left(\nu_{i} + \nu_{it}\right); \quad |\phi| < 1. \tag{18}$$

The first number in ARDL(1,1) notation indicates that the right-hand side includes one

lag of the dependent variable (yit−1). The second number indicates that the right-hand

side includes one lag of the exogenous variables (xit−1). This particular ARDL model also

includes the contemporaneous exogenous variables (xit) and a single intercept (α).

Beck and Katz (2001, p.493) elaborates that fixed effects are “never ideal” but should

be included when they are necessary – provided that no time invariant variables are of

substantive interest – and that the BIC, rather than an F test, should be used to judge

necessity. The problems with the comparison between a POLSE and a LSDVE were discussed

in section 3, and the proper BIC comparison is between a POLSE2 and a SPM2. But if one

were to follow the recommendation in Beck and Katz (2001), it is likely that a specification

with a single intercept would be adopted, so I focus on the problems that arise in that case.


The ARDL model has a long history in the econometrics literature for single time-series,

but is not immune from criticism. I make no claim that the ARDL model is appropriate,

even for the example given in section 5. For consistency, the ARDL model requires the error

term to be uncorrelated with current values of xit, past values of xit, and future values of xit.

This is a very strong assumption, but the consequences of it have not been explored much

in the political science literature. Also, Wilson and Butler (2003) urges us to think more

carefully about lag structures. Nevertheless, I focus on the ARDL(1,1) model because it has

been specifically recommended for political scientists and use it as an example of what can

go wrong with TSCS data. If the ARDL model were eschewed in favor of a different model,

all the points in the previous sections would continue to hold and some of the points in this

section would probably apply as well.

Wilson and Butler (2003, table 1) claims that, as of May 31, 2003, 135 papers published in

political science have used linear TSCS models and have cited Beck and Katz (1995) or Beck

and Katz (1996). All of these papers use an ARDL model, but likely do so by placing restrictions

on equation 18. Thus, it is important to determine if the two-dimensional ARDL model

is sound. Beck and Katz are always careful to allow for the possibility of fixed effects,

but the fact that only 47 of those papers report fixed effects estimates indicates that most

ignore this possibility. My impression is that pooled specifications are usually given more

emphasis even when fixed effects estimates are reported. Many reviewers insist that fixed

effects estimates be reported in a footnote as a “robustness check” on the POLSE, but this

thinking is backwards because there is no reason to believe that the POLSE is sound. The

B-E should be used as a robustness check on the W-E when the data vary over time.

In order to turn the ARDL(1,1) model into a SPM, it is first necessary to discuss the

difference between the unit mean of a lagged variable and the unit mean of a contemporaneous

variable. When lagged variables are used, data are lost. Thus, the unit mean of a lagged

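covariate (x̄i[t−1]) is calculated over a slightly different sample than the unit mean of a

contemporaneous covariate (x̄i):

$$\bar{x}_{i[t-1]} = \frac{1}{T} \sum_{t=0}^{T-1} x_{it} \neq \frac{1}{T} \sum_{t=1}^{T} x_{it} = \bar{x}_{i}, \tag{19}$$

$$\bar{y}_{i[t-1]} = \frac{1}{T} \sum_{t=0}^{T-1} y_{it} \neq \frac{1}{T} \sum_{t=1}^{T} y_{it} = \bar{y}_{i}. \tag{20}$$

When demeaning the lagged covariate, we should use x̄i[t−1] rather than x̄i. Thus, x̃it−1 =

xit−1 − x̄i[t−1] while x̃it = xit − x̄i. One can then see that the ARDL(1,1) model is a SPM,

$$\begin{aligned} y_{it} = \bar{y}_{i} + \tilde{y}_{it} = {} & \alpha + \phi_{b}\bar{y}_{i[t-1]} + \bar{x}_{i}\beta_{b} + \bar{x}_{i[t-1]}\gamma_{b} + \nu_{i} \\ & + \phi_{w}\tilde{y}_{it-1} + \tilde{x}_{it}\beta_{w} + \tilde{x}_{it-1}\gamma_{w} + \nu_{it}, \end{aligned} \tag{21}$$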

with the restrictions that φb = φw, βb = βw, and γb = γw.
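The distinction in equations 19 and 20 between the two unit means is easy to overlook in practice. The sketch below, which uses a simulated series for a single unit (an assumption made purely for illustration), shows that the mean of the lagged covariate differs from the mean of the contemporaneous covariate by (x at t = 0 minus x at t = T) divided by T, and that each series must be demeaned with its own mean to sum to zero.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 10
x = rng.normal(size=T + 1).cumsum()   # one unit observed at t = 0, 1, ..., T

x_now = x[1:]     # x_it for t = 1..T
x_lag = x[:-1]    # x_i,t-1 for t = 1..T, i.e. observations 0..T-1

x_bar = x_now.mean()        # unit mean of the contemporaneous covariate
x_bar_lag = x_lag.mean()    # unit mean of the lagged covariate

# Demean each series with its own unit mean, as the text recommends.
x_now_dm = x_now - x_bar
x_lag_dm = x_lag - x_bar_lag

# The two means differ by (x_i0 - x_iT) / T.
gap = x_bar_lag - x_bar
print(gap, (x[0] - x[-1]) / T)
```

Demeaning the lagged series with the contemporaneous mean instead would leave a nonzero unit-level component in the supposedly "within" variable.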

Of course, it is possible to avoid the problems with the standard errors of between esti-

mates inherent in a SPM with a SPM2 or CPE, but I want to elaborate why the restrictions

that φb = φw, βb = βw, and γb = γw are especially inappropriate. In my opinion, these

criticisms invalidate the ARDL(1,1) model with a single intercept regardless of the substance

of the research question or what the results of a (biased) BIC comparison imply.

Equation 21, like any SPM, is the sum of a B-E and a W-E, but no textbook includes

meaned lagged variables in the B-E for good reason. The meaned lagged variables should

be excluded from a B-E and should be included in a SPM2 only to determine whether the

equality restrictions are valid, which I will now demonstrate is virtually impossible.

First, when yit−1 is included in the SPM, βw represents the short-term effects of the

exogenous variables. There is no such thing as a “short-term cross-sectional effect”, so there

can never be a theoretical reason to impose the restrictions that βb = βw in an ARDL

specification. I could imagine a scenario where it could be sensible to constrain the long-term

temporal effects – which can be calculated using the formula (βw + γw)/(1 − φw) – to equal the

cross-sectional effects, but that is not what the ARDL model does.
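The long-term formula follows from a standard steady-state argument, sketched here as an aid rather than taken from the source. Suppose x rises permanently by one unit and the within process eventually settles at a new level; the symbol Δy* for the eventual change in y is introduced only for this illustration.

```latex
% In the new steady state y_{it} = y_{it-1} and x_{it} = x_{it-1}, so the
% eventual change \Delta y^{*} must reproduce itself through the lag:
\Delta y^{*} = \phi_w \, \Delta y^{*} + \beta_w + \gamma_w
\quad\Longrightarrow\quad
\Delta y^{*} = \frac{\beta_w + \gamma_w}{1 - \phi_w},
\qquad |\phi_w| < 1 .
```

The short-term effect βw is the first-period response only, which is why it has no natural cross-sectional counterpart.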

Second, νi is highly correlated with ȳi[t−1] unless νi = 0 ∀i. It is impossible for the

variance of νi to be eliminated unless a fixed effects model is estimated. But in theory, if

νi = 0 ∀i without using fixed effects, then φb ∈ {0, 1}, and neither estimate is promising if

one intends to constrain φb to equal φw since φw generally falls within the [0, 1] interval.

What about the restriction that φb = φw when νi ≠ 0 ∀i? Recall that the point estimate

for φb in the SPM is the same as in the B-E. Using this fact, we could agree that if ȳi were

regressed on ȳi[t−1] and a constant, φb ≈ 1 because ȳi and ȳi[t−1] are calculated using almost

the same sample as shown in equation 20. But adding x̄i and x̄i[t−1] to the right-hand side

of the B-E would not affect φb very much at all because ȳi[t−1] is a consequence of x̄i and

x̄i[t−1]. Thus, x̄i and x̄i[t−1] have virtually no net effect on ȳi, conditional on the effect they

have on ȳi[t−1]. Since x̄i and x̄i[t−1] have virtually no net explanatory power, ȳi[t−1] is left to

do all the explaining of the cross-sectional variance in the dependent variable, and φb ≈ 1 in

both the B-E and the SPM.
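The claim that φb ≈ 1 can be checked with a small simulation. The sketch below is not from the source: the data-generating process (a within autoregressive parameter of 0.5, a single exogenous covariate, unit effects, and a burn-in period) is assumed for illustration, and only one meaned covariate is included in the B-E to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(2)
N, T, burn = 500, 30, 50
phi_w, beta = 0.5, 1.0            # assumed "true" within dynamics

n_periods = burn + T + 1
nu = rng.normal(size=N)           # unit effects
x_all = rng.normal(size=(N, 1)) + rng.normal(size=(N, n_periods))
y_all = np.zeros((N, n_periods))
for t in range(1, n_periods):
    shock = rng.normal(size=N)
    y_all[:, t] = phi_w * y_all[:, t - 1] + beta * x_all[:, t] + nu + shock

# Keep t = 0..T after the burn-in period.
y, x = y_all[:, burn:], x_all[:, burn:]
y_now, y_lag, x_now = y[:, 1:], y[:, :-1], x[:, 1:]

# B-E that includes the unit mean of the lagged dependent variable.
Xb = np.column_stack([np.ones(N), y_lag.mean(axis=1), x_now.mean(axis=1)])
phi_b = np.linalg.lstsq(Xb, y_now.mean(axis=1), rcond=None)[0][1]

# W-E: demean every variable by its own unit mean.
def demean(a):
    return a - a.mean(axis=1, keepdims=True)

Xw = np.column_stack([demean(y_lag).ravel(), demean(x_now).ravel()])
phi_w_hat = np.linalg.lstsq(Xw, demean(y_now).ravel(), rcond=None)[0][0]

print(f"between phi: {phi_b:.3f}  within phi: {phi_w_hat:.3f}")
```

In this setup the between estimate sits very close to one while the within estimate stays near the assumed value of 0.5, so a pooled estimate that averages the two is pulled upward.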

Given that φb = 1, the restriction the ARDL model imposes on the SPM that φb = φw is

valid only if φw = 1. Thus, the ARDL model makes the assumption that φb = φw = 1 and

the stationarity assumption that |φ| < 1, which are mutually exclusive. If the restriction

that φb = φw = 1 were valid, the model would be explosive, the long-run effects of the

exogenous variables would be infinite, and the distribution of the test statistics would be wrong.

In short, finding that φb = φw = 1 is pretty much the worst thing that could ever happen

in a regression, but the ARDL model imposes this restriction. In general, the restriction

is invalid because φw < 1, which implies that φ is biased in the ARDL model.8 The point

estimates for the effects of the exogenous variables are biased as well due to their correlation

8 Both Green, Kim and Yoon (2001, p.453) and Kristensen and Wawro (2003, note 18), among others, recognize in passing that the pooled estimate of a lagged dependent variable is biased upward and blame heterogeneity in the units. However, the same phenomenon could theoretically occur with homogeneous units that all experience transitory shocks in the error term.


with the lagged dependent variable.

Third, it does not make sense to think about the effects of x̄i conditional on x̄i[t−1]

(and vice versa) because the two vectors are conceptually the same and are almost perfectly

collinear. However, it does make sense to think about xit conditional on xit−1 if lagged effects

are possible. Thus, imposing the restrictions that γw = γb and that βw = βb when γb and

βb are non-sensible undermines the estimates for γw and βw. Many papers simply exclude

xit−1, making the ARDL(1,1) model into an ARDL(1,0) model, which is called the “partial

adjustment model” or the “lagged dependent variable model” by Beck and Katz. Excluding

xit−1 avoids the collinearity problem, but does not change the fact that the restrictions that

φb = φw and βb = βw are deeply problematic.

Fourth, the reason Beck and Katz (1996) recommends the ARDL(1,1) is to test hypothe-

ses about γ in order to capture possible efficiency advantages. If γ = 0, one should estimate

the partial adjustment model, which is more parsimonious than the ARDL(1,1) model. If

γ = −φβ, imposing these restrictions yields an efficient GLS model. However, neither of

these tests is fruitful in a two-dimensional context because all the estimates are biased to

the extent that the restrictions that φb = φw, βb = βw, and γb = γw do not hold. The SPM2,

or better yet the W-E, permits alternative tests that γw = 0 or whether γw = −φwβw, and

these restrictions can possibly be imposed on the W-E to increase precision.
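For readers unfamiliar with the restriction γ = −φβ, the following standard derivation (not reproduced from the source) shows where it comes from: a static model with a first-order autoregressive error is an ARDL(1,1) model with exactly that restriction. The symbols u and e for the autocorrelated error and its innovation are introduced only for this sketch, and the intercept is omitted for brevity.

```latex
\begin{align*}
y_{it} &= x_{it}\beta + u_{it}, \qquad u_{it} = \phi u_{it-1} + e_{it} \\
% Subtract \phi times the lagged equation from the current one:
y_{it} - \phi y_{it-1} &= \left(x_{it} - \phi x_{it-1}\right)\beta + e_{it} \\
y_{it} &= \phi y_{it-1} + x_{it}\beta + x_{it-1}\left(-\phi\beta\right) + e_{it}
\end{align*}
```

Comparing the last line with equation 18 gives γ = −φβ, which is why imposing the restriction amounts to a GLS correction for serial correlation.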

Fifth, the ARDL(1,1) incorrectly assumes that the “unit effects” (νi) do not exist. The

existence of the unit heterogeneity implies that each residual is highly correlated with every

other residual for that unit, which undermines the consistency of PCSEs. Beck and Katz

(1996) recommends – but does not derive – a Lagrange Multiplier (LM) test to verify that

there is no autocorrelation in the residuals (νit). This LM test takes the form of an auxiliary

regression of the residuals on their lags and every independent variable in the original model:

$$\nu_{it} = \alpha + \phi\nu_{it-1} + \kappa y_{it-1} + x_{it}\beta + x_{it-1}\gamma + \eta_{it}. \tag{22}$$


Beck and Katz (1996) recommends looking primarily at the magnitude of φ in equation

22 to determine if there is any autocorrelation left in the residuals, but a pooled auxiliary

regression has all the same problems that plague the original pooled model. In particular, φ

is biased unless φb = φw = 1 (which would be very bad). When a lagged dependent variable

is included in the original model, there is so little cross-sectional variation in νit that φ is

essentially a within estimate. However, there is some cross-sectional variation in νit and

the fact that φb ≈ 1 would be easy to verify if an auxiliary SPM, SPM2, or B-E were used

instead of an auxiliary POLSE. We do not exactly know how PCSEs fare when the small

cross-sectional component of the residuals is almost perfectly predictable, but the Monte

Carlo evidence in Kristensen and Wawro (2003) is not particularly encouraging for PCSEs.

Of course, PCSEs following a W-E should work well because νi = 0 ∀i by construction.

Estimating a W-E prior to a B-E affords an opportunity to model the cross-unit corre-

lation in the errors of the B-E. In order to make corrections to the W-E’s standard errors,

PCSEs calculate Σ, which is an N × N variance-covariance matrix of the residuals where the

off-diagonal elements are estimates of the covariance in the residuals for two units (see Beck

and Katz, 1995). Thus, one can create a covariate, $\overset{i}{c}$, that is a weighted sum of the unit

means of the dependent variable where the weights are given by the elements of Σ:

$$\underset{N \times 1}{\overset{i}{c}} = \left[ \underset{1 \times N}{\overset{i}{y}{}'} \; \underset{N \times N}{\Sigma} \right]'. \tag{23}$$

Including $\overset{i}{c}$ in the B-E is consistent with the advice in Franzese and Hays (2004), but

future Monte Carlo experiments are needed. First, the diagonal elements of Σ should probably

be replaced by zeroes to preclude $\overset{i}{c}$ from being influenced by a unit’s own dependent

variable. Second, it is possible that Σ should be rescaled to a correlation matrix before using

equation 23 to construct the new covariate. Third, the W-E may include a lagged dependent

variable and unique intercepts for each time period, both of which will reduce the contemporaneous

correlation in the residuals of the W-E. It is possible that an underspecified W-E

should be estimated (after the properly specified W-E) to create a Σ that is not affected by

any variables that are included in the W-E but not the B-E. However, the CPE is still the

best estimation strategy discussed in this paper even if $\overset{i}{c}$ is omitted from the B-E.
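The construction in equation 23 can be sketched as follows. This is an illustration under stated assumptions rather than a tested recipe: the function name `cross_unit_covariate` and its two options are hypothetical, the panel is assumed to be balanced, and Σ is estimated as the average cross-product of the W-E residuals over time, which is the form used for PCSEs. The options correspond to the first two caveats in the text (zeroing the diagonal and rescaling to a correlation matrix).

```python
import numpy as np

def cross_unit_covariate(resid, y_bar, zero_diagonal=True, as_correlation=False):
    """Weighted sum of unit means, with weights from the residual covariance.

    resid : (N, T) array of W-E residuals from a balanced panel.
    y_bar : (N,) array of unit means of the dependent variable.
    """
    T = resid.shape[1]
    sigma = resid @ resid.T / T              # N x N, entry (i, j) = sum_t e_it e_jt / T
    if as_correlation:
        d = np.sqrt(np.diag(sigma))
        sigma = sigma / np.outer(d, d)
    if zero_diagonal:
        np.fill_diagonal(sigma, 0.0)         # exclude a unit's own dependent variable
    return sigma.T @ y_bar                   # equals (y_bar' Sigma)'

# Illustrative inputs only: random "residuals" for 4 units and 50 periods.
rng = np.random.default_rng(0)
resid = rng.normal(size=(4, 50))
y_bar = np.array([1.0, 2.0, 3.0, 4.0])
c = cross_unit_covariate(resid, y_bar)
print(c)
```

The resulting vector has one entry per unit and can be added as a column to the B-E design matrix.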

Finally, some comments on the two-dimensional error correction model (ECM),

$$\Delta y_{it} = \alpha + \Delta x_{it}\beta + \phi\left(y_{it-1} + x_{it-1}\gamma\right) + \zeta_{it}, \tag{24}$$

which is also recommended in Beck and Katz (1996) in some circumstances. An ECM and an

ARDL model are closely related, so it should come as no surprise that the problems with the

ARDL model carry over to the ECM. In particular, unless α is replaced with αi in equation

24, the point estimates will reflect excessive weight on the cross-sectional dimension and the

error term (ζit) will have common components. Differencing reduces but does not eliminate

the cross-sectional variation in the data, necessitating the use of fixed effects to obtain pure

within estimates of the coefficients and draw upon ECM theory from econometrics. My

impression is that most political scientists who use ECMs include fixed effects. However, I

have not seen a paper where the B-E is used to check the results of an ECM.
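The close relationship between the two models can be made explicit with a standard reparameterization, sketched here for convenience rather than taken from the source. Starting from the ARDL(1,1) model in equation 18 with a generic error ε, subtract the lagged dependent variable from both sides and split the contemporaneous covariate into its change and its lag. The starred symbols are introduced only to distinguish the ECM parameters of equation 24 from the ARDL parameters.

```latex
\begin{align*}
y_{it} &= \alpha + \phi y_{it-1} + x_{it}\beta + x_{it-1}\gamma + \varepsilon_{it} \\
\Delta y_{it} &= \alpha + (\phi - 1)\, y_{it-1} + \Delta x_{it}\beta
                 + x_{it-1}(\beta + \gamma) + \varepsilon_{it} \\
              &= \alpha + \Delta x_{it}\beta
                 + \phi^{*}\left(y_{it-1} + x_{it-1}\gamma^{*}\right) + \varepsilon_{it},
\qquad \phi^{*} = \phi - 1, \quad \gamma^{*} = \frac{\beta + \gamma}{\phi - 1}.
\end{align*}
```

Since the ECM is an exact rewriting of the ARDL model, any pooling restriction imposed on one is imposed on the other.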

5 Empirical Example

This section replicates part of Green, Kim and Yoon (2001) in order to illustrate the problems

and solutions identified in previous sections. The model in Green, Kim and Yoon (2001) is

a dyadic TSCS model of bilateral trade from 1952 to 1992 called a “gravity model”. Let

a and b indicate the two states in the dyad. The logarithm of distance between a and b

is included on the right-hand side in addition to the logarithm of the product of the two

countries’ gross domestic products and the logarithm of the product of the two countries’


populations. The two covariates of interest are a dummy variable indicating the presence of

a military alliance between a and b and the minimum level of democracy between a and b.

The data are not balanced (N = 3079; T = 28.89), but I ignore this and a number of other

important, but tangential, methodological issues in order to present an exact replication.

Column 1 of table 2 is an exact replication of the POLSE reported in column 3 of table

2 of Green, Kim and Yoon (2001). All estimates are statistically significant given the OLS

assumptions. The FPE in column 2 approximates the POLSE by weighting the meaned

data by b = 0.9713 and weighting the demeaned data by w = 0.2059, weights which were

calculated to minimize the discrepancy between the POLSE and the FPE that arises when

the data are unbalanced. If all 3079 units had the maximum of 41 years of complete data, w

would equal $41^{-1/2}$ or 0.1562. The standard errors for this FPE reflect a correction factor of

2.11, which is not trivial but does not affect any inferences. However, none of the standard

errors in table 2 have much credibility because they assume a spherical error term.
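The balanced-data benchmark of w equal to the inverse square root of the number of periods can be verified numerically. The sketch below rests on one reading of the FPE, namely that it is OLS on a stacked dataset of N meaned rows (weighted by b) and N × T demeaned rows (weighted by w); that construction, the simulated data, and all variable names are assumptions made for illustration. With b = 1 and w set to T raised to the power −1/2, the stacked regression reproduces the pooled OLS point estimates exactly.

```python
import numpy as np

rng = np.random.default_rng(3)
N, T = 50, 41
unit = np.repeat(np.arange(N), T)
x = rng.normal(size=N)[unit] + rng.normal(size=N * T)
y = 1.0 + 0.7 * x + rng.normal(size=N)[unit] + rng.normal(size=N * T)

# POLSE: pooled OLS on the raw data.
X = np.column_stack([np.ones(N * T), x])
polse = np.linalg.lstsq(X, y, rcond=None)[0]

# FPE: stack N meaned rows (weight b) on N*T demeaned rows (weight w).
x_bar = x.reshape(N, T).mean(axis=1)
y_bar = y.reshape(N, T).mean(axis=1)
b, w = 1.0, T ** -0.5
X_between = b * np.column_stack([np.ones(N), x_bar])
X_within = w * np.column_stack([np.zeros(N * T), x - x_bar[unit]])  # demeaned constant is 0
X_fpe = np.vstack([X_between, X_within])
y_fpe = np.concatenate([b * y_bar, w * (y - y_bar[unit])])
fpe = np.linalg.lstsq(X_fpe, y_fpe, rcond=None)[0]

print(polse, fpe, round(w, 4))
```

The equivalence holds because, in balanced data, the pooled cross-product matrices equal T times the meaned cross-products plus the demeaned cross-products, so only the ratio of b squared to w squared matters.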

Column 3 of table 2 is a FPE with w = b = 1 or a POLSE2. Since there are 88,946 demeaned

observations and only 3079 meaned observations, the absence of preferential weighting

implies that the results are essentially within estimates. All the coefficients retain the

same signs, but their magnitudes change somewhat. In particular, the estimated effect of

the lagged dependent variable decreases by almost 0.2, which is a large change for an au-

toregressive parameter. Also, the effects of alliances and democracy are insignificant.

The SPM2 presents pure within and between estimates in columns 4 and 5, respectively.

The within effects for the population and alliance variables change signs and become

statistically significant. This SPM2 is a bad specification, at least along the cross-sectional

dimension, because the unit means of the lagged dependent variable are included in the

model. The estimated “effect” of the meaned lagged variable is 0.981 and insignificantly dif-

ferent from unity. As a result, the effects of all the other meaned variables are insignificant

because the meaned lagged variable is a consequence of them.


Table 2: Comparison of various least squares estimators for a two-dimensional gravity model of bilateral trade

Estimator                  POLSE         FPE      POLSE2    Bad SPM2    Bad SPM2   Good SPM2   Good SPM2         CPE         CPE
Column number                (1)         (2)         (3)         (4)         (5)         (6)         (7)         (8)         (9)
Variable / Info.     Replication    w=0.2059         w=1      Within     Between      Within     Between         W-E         B-E
                                    b=0.9713         b=1

Intercept                 −3.046      −3.044      −5.071                  −0.197                 −24.375                 −24.375
                         (0.177)     (0.406)     (0.689)                 (1.185)                 (1.149)                 (1.306)

ln(Distance[ab])i         −0.328      −0.384      −0.524                   0.002                  −1.507                  −1.507
                         (0.012)     (0.028)     (0.055)                 (0.063)                 (0.060)                 (0.068)

ln(GDP[a]                  0.250       0.254       0.363       0.342       0.028       0.342       1.534       0.342       1.534
×GDP[b])it               (0.006)     (0.012)     (0.007)     (0.013)     (0.049)     (0.013)     (0.044)     (0.013)     (0.050)

ln(Population[a]          −0.059      −0.046      −0.034       0.143      −0.020       0.143      −0.648       0.143      −0.648
×Population[b])it        (0.006)     (0.014)     (0.023)     (0.067)     (0.045)     (0.068)     (0.044)     (0.068)     (0.051)

Alliance                  −0.247      −0.341      −0.017       0.419      −0.037       0.419      −0.787       0.419      −0.787
Dummy[ab]it              (0.027)     (0.065)     (0.092)     (0.119)     (0.148)     (0.122)     (0.150)     (0.121)     (0.171)

min(Democracy[a],          0.022       0.022      −0.003      −0.009      −0.005      −0.009       0.079      −0.009       0.079
Democracy[b])it          (0.001)     (0.003)     (0.002)     (0.002)     (0.009)     (0.002)     (0.009)     (0.002)     (0.010)

ln(Trade[ab])it−1          0.736       0.722       0.549       0.533       0.981       0.533                   0.533
                         (0.002)     (0.004)     (0.003)     (0.003)     (0.015)     (0.003)                 (0.003)

Notes: The dependent variable is ln(Trade[ab])it or some transformation thereof. The two states in the dyad are denoted by a and b. For details on the dataset, see Green, Kim, and Yoon (2001). Standard errors are in parentheses and are “OLS standard errors” with the following modifications. The standard errors in the FPE are multiplied by 2.11, which reflects the correction in table 1. The standard errors in the POLSE2, the Bad SPM2, the Good SPM2, and the within estimator component of the CPE reflect a deduction of N degrees of freedom only.


The only purpose of estimating an overspecified SPM2 is to evaluate the restrictions that

the POLSE2 places on the SPM2, which are clearly invalid in this case, especially for the

lagged dependent variable. The odds of this SPM2 relative to the POLSE2 are approximately

$e^{500}$ to one based on the BIC. Also, note that Beck and Katz (2001) found evidence in favor

of the POLSE relative to the W-E using the BIC, but this is not a meaningful comparison

because the POLSE is flawed.
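To see what odds of this size mean in terms of the BIC itself, recall the usual approximation (see Raftery, 1995) that a difference in BIC is roughly twice the log of the Bayes factor. The sketch below assumes the convention in which a smaller BIC is better and equal prior odds for the two models; the underlying BIC values are not reported in this section, so only the implied difference is shown.

```latex
\frac{\Pr(\text{SPM2} \mid \text{data})}{\Pr(\text{POLSE2} \mid \text{data})}
\approx \exp\left( \frac{\mathrm{BIC}_{\text{POLSE2}} - \mathrm{BIC}_{\text{SPM2}}}{2} \right),
\qquad\text{so odds of } e^{500} \text{ correspond to }
\mathrm{BIC}_{\text{POLSE2}} - \mathrm{BIC}_{\text{SPM2}} \approx 1000 .
```

A BIC difference of about 10 is conventionally regarded as very strong evidence, so a difference on the order of 1000 leaves little room for doubt about the pooling restrictions.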

For reference, I also include the results of a well-specified SPM2 in columns 6 and 7

that excludes the meaned lagged dependent variable. The within estimates do not change

because the demeaned variables are orthogonal to the excluded variable. However, the

between estimates of the exogenous variables become significant by a wide margin, as would

be expected because the included variable bias is eliminated.

Columns 8 and 9 present a W-E and a B-E respectively. The results of the W-E are

identical to column 4 in table 2 of Green, Kim and Yoon (2001), and the point estimates are

identical to those in column 6 in this paper due to orthogonality. Although table 2 presents

standard errors that incorrectly assume a spherical error term, it would be easy to calculate

PCSEs or clustered standard errors for the W-E.9 The point estimates from the B-E are also

identical to the between estimates in the well-specified SPM2, again due to orthogonality.

And it would be easy to use White standard errors to correct the B-E for heteroskedasticity.

We should not overlook the fact that the estimates in the CPE are all significant, but

the within and the between estimates have opposite signs in every case except the GDP

variable (where the magnitudes are very different, even in the long-run). In the absence of a

theoretical explanation for the differing signs, we should probably conclude that important

variables are missing. If the missing variables only vary across dyads, the results from the

W-E are unbiased. However, if any of the missing variables are two-dimensional, all bets are off

9 Although there is evidence that a second lag of the dependent variable would be needed in the W-E to eliminate the autocorrelation.

concerning the unbiasedness of the CPE. Of course, the true answer is unknowable.
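The remark that the GDP magnitudes differ "even in the long-run" can be checked against table 2. The calculation below assumes that the W-E in column 8 contains no lagged exogenous variables (no such rows appear in the table), so that γw = 0 in the long-term formula.

```latex
\frac{\beta_w}{1 - \phi_w} = \frac{0.342}{1 - 0.533} = \frac{0.342}{0.467} \approx 0.732
\quad\text{(long-run within effect of GDP)}
\qquad\text{versus}\qquad
\beta_b = 1.534 \quad\text{(between effect, column 9)}.
```

Under that assumption the long-run within effect is still less than half of the between effect.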

It is also useful to note the results for the distance variable, which is the only time

invariant covariate. The estimated effect goes up in magnitude from the POLSE to the

POLSE2 but the estimate is an artifact of the invalid pooling constraints on the other

covariates. When these restrictions are relaxed in the overspecified SPM2, the effect of

distance is nil conditional on the meaned lagged variable. When this variable is excluded, the

effect of distance rebounds to a level that is consistent with the literature and is significant.

The main point to take away is that the B-E, rather than the POLSE, POLSE2 or REE

(not shown), is the best way to estimate the effects of time-invariant variables.

6 Conclusions

The message of this paper is fairly simple: Think about the restrictions put on the coef-

ficients, think about the weights put on the data, and think about how the weights affect

the degrees of freedom. Using a CPE is safe; reviewers and editors should enforce tough

standards if an author attempts to justify another estimation technique when the data are

two-dimensional. Conversely, researchers have plenty of opportunity to revisit previous stud-

ies. To summarize:

1. The SPM in Zorn (2001) produces unbiased point estimates but the wrong standard

errors for the between estimates, and the well-known post hoc corrections to the stan-

dard errors do not solve this problem adequately. The SPM2 produces the same point

estimates as the SPM but more reasonable standard errors. Thus, the SPM2 rather

than the SPM should be used to evaluate pooling restrictions on the coefficients.

2. The POLSE2, which restricts all the coefficients, is a special case of the FPE where

the within and between regimes are weighted equally. By specifying different weights


for the two regimes, the FPE can produce the same point estimates as the W-E, B-E,

REE, and POLSE. In the case of the POLSE, it is virtually impossible to substantively

justify a weight of $J^{-1/2}$ for the within regime.

3. The FPE requires that the standard errors be multiplied by the square-root correction

factor given in table 1, which implies that the reported standard errors of the REE and

POLSE are too small.

4. If the pooling restrictions appear invalid when the BIC of the SPM2 is compared to

the BIC of the POLSE2, it is best to employ a W-E and then a B-E. Doing so allows

more flexibility to relax the assumption that the error term is spherical.

5. If time is one of the dimensions and a lagged dependent variable is included on the

right-hand side, it is virtually impossible to justify the pooling restrictions. The rec-

ommendation in Beck and Katz (1996) is sound only if fixed effects are used.

6. The B-E can be used as a robustness check for the results of a W-E or ECM and is

appropriate for estimating the effects of variables that do not change over time.

References

Arellano, Manuel. 1987. “Computing Robust Standard Errors for Within-Groups Estimators.” Oxford Bulletin of Economics and Statistics 49(4):431–34.

Baltagi, Badi H. 2001. Econometric Analysis of Panel Data. Second ed. New York: John Wiley & Sons, LTD.

Bartels, Larry M. 1996. “Pooling Disparate Observations.” American Journal of Political Science 40(3):905–942.

Beck, Nathaniel and Jonathan N. Katz. 1995. “What to Do (and Not to Do) with Time-Series–Cross-Section Data.” American Political Science Review 89(3):634–647.

Beck, Nathaniel and Jonathan N. Katz. 1996. “Nuisance vs. Substance: Specifying and Estimating Time-Series–Cross-Section Models.” Political Analysis 8(3):1–36.

Beck, Nathaniel and Jonathan N. Katz. 2001. “Throwing the Baby Out with the Bathwater: A Comment on Green, Kim, and Yoon.” International Organization 55(2):487–495.

Franzese, Robert J. and Jude C. Hays. 2004. “Empirical Modeling Strategies for Spatial Interdependence: Omitted-Variable vs. Simultaneity Biases.” Paper presented at the 2004 Political Methodology Conference. Available from http://sitemaker.umich.edu/jchays/files/franzesehays_1_.polmeth.2004.pdf.

Gould, William. 2001. “What is the Between Estimator?” STATA FAQ: http://www.stata.com/support/faqs/stat/xt.html.

Green, Donald P., Soo Yeon H. Kim and David Yoon. 2001. “Dirty Pool.” International Organization 55(2):441–468.

Greene, William H. 2000. Econometric Analysis. Fourth ed. Upper Saddle River, NJ: Prentice Hall.

King, Gary. 2001. “Proper Nouns and Methodological Propriety: Pooling Dyads in International Relations Data.” International Organization 55(2):497–507.

King, Gary, James Honaker, Anne Joseph and Kenneth Scheve. 2001. “Analyzing Incomplete Political Science Data.” American Political Science Review 95(1):49–69.

Kristensen, Ida Pagter and Gregory Wawro. 2003. “Lagging the Dog? The Robustness of Panel Corrected Standard Errors in the Presence of Serial Correlation and Observation Specific Effects.” Paper presented at the 2003 Political Methodology Conference. Preliminary version available from: http://polmeth.wustl.edu/papers/03/krist03.pdf.

Little, Roderick J.A. and Donald B. Rubin. 2002. Statistical Analysis with Missing Data. Second ed. Hoboken, New Jersey: John Wiley & Sons, Inc.

Neuhaus, J. M. and J. D. Kalbfleisch. 1998. “Between- and Within-Cluster Covariate Effects in the Analysis of Clustered Data.” Biometrics 54:638–645.

Oneal, John R. and Bruce Russett. 2001. “Clear and Clean: The Fixed Effects of the Liberal Peace.” International Organization 55(2):469–485.

Raftery, Adrian E. 1995. “Bayesian Model Selection in Social Research.” Sociological Methodology 25:111–163.

Rice, John A. 1995. Mathematical Statistics and Data Analysis. Second ed. International Thomson Publishing.

Wilson, Sven E. and Daniel M. Butler. 2003. “Too Good to Be True? The Promise and Peril of Panel Data in Political Science.” Working Paper. Preliminary version available from http://fhss.byu.edu/POLSCI/Wilson/papers/.

Zorn, Christopher. 2001. “Estimating Between- and Within-Cluster Covariate Effects, with an Application to Models of International Disputes.” International Interactions 27(4):433–45.
