Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
A Generalized Focused Information Criterion for GMM∗
Minsu Chang Francis J. DiTraglia†
University of Pennsylvania
This Version: September 16, 2017 First Version: February 15, 2013
Abstract
This paper proposes a criterion for simultaneous GMM model and moment selec-tion: the generalized focused information criterion (GFIC). Rather than attemptingto identify the “true” specification, the GFIC chooses from a set of potentially mis-specified moment conditions and parameter restrictions to minimize the mean-squarederror (MSE) of a user-specified target parameter. The intent of the GFIC is to formal-ize a situation common in applied practice. An applied researcher begins with a set offairly weak “baseline” assumptions, assumed to be correct, and must decide whetherto impose any of a number of stronger, more controversial “suspect” assumptions thatyield parameter restrictions, additional moment conditions, or both. Provided that thebaseline assumptions identify the model, we show how to construct an asymptoticallyunbiased estimator of the asymptotic MSE to select over these suspect assumptions:the GFIC. We go on to provide results for post-selection inference and model averagingthat can be applied both to the GFIC and various alternative selection criteria. Toillustrate how our criterion can be used in practice, we specialize the GFIC to theproblem of selecting over exogeneity assumptions and lag lengths in a dynamic panelmodel, and show that it performs well in simulations. We conclude by applying theGFIC to a dynamic panel data model for the price elasticity of cigarette demand.
Keywords: Model Selection, Moment selection, Model averaging, Panel Data, GMMEstimation, Focused Information Criterion, Post-selection estimators
JEL Codes: C23, C52
∗We thank Manuel Arellano, Otilia Boldea, Camilo Garcıa-Jimeno, Bruce Hansen, Frank Kleibergen, andseminar participants at the 2013 Latin American Workshop in Econometrics, the 2014 Midwest EconometricsGroup meetings, Tilburg, the Tinbergen Institute, and the University of Wisconsin for helpful comments.
†Corresponding Author: [email protected], 3718 Locust Walk, Philadelphia, PA 19104
1
1 Introduction
An econometric model is a tool for answering a particular research question: different ques-
tions may suggest different models for the same data. And the fact that a model is wrong,
as the old saying goes, does not prevent it from being useful. This paper proposes a novel
selection criterion for GMM estimation that takes both of these points to heart: the gener-
alized focused information criterion (GFIC). Rather than attempting to identify the correct
specification, the GFIC chooses from a set of potentially mis-specified moment conditions
and parameter restrictions to yield the smallest mean squared error (MSE) estimator of a
user-specified scalar target parameter. We derive the GFIC under local mis-specification,
using asymptotic mean squared error (AMSE) to approximate finite-sample MSE. In this
framework mis-specification, while present for any fixed sample size, disappears in the limit
so that asymptotic variance and squared bias remain comparable. GMM estimators remain
consistent under local mis-specification but their limit distributions show an asymptotic
bias. Adding an additional moment condition or imposing a parameter restriction generally
reduces asymptotic variance but, if incorrectly specified, introduces a source of bias. The
GFIC trades off these two effects in the first-order asymptotic expansion of an estimator to
approximate its finite sample behavior.
The GFIC takes its motivation from a situation that is common in empirical practice. A
researcher who hopes to estimate a parameter of interest µ must decide which assumptions
to use. On the one hand is a set of relatively uncontroversial “baseline” assumptions. We
suppose that the baseline assumptions are correct and identify µ. But the very fact that
the baseline assumptions do not raise eyebrows suggests that they may not be especially
informative about µ. On the other hand are one or more stronger controversial “suspect”
assumptions. These stronger assumptions are expected to be much more informative about
µ. If we were certain that they were correct, we would definitely choose to impose them
in estimation. Indeed, by continuity, even if they were nearly correct, imposing the sus-
pect assumptions could yield a favorable bias-variance tradeoff. This is the essential idea
behind the GFIC. When the baseline assumptions identify the model, the GFIC provides an
asymptotically unbiased estimator of AMSE.
The GFIC is an extension of the focused moment selection criterion (FMSC) of DiTraglia
(2016). While the FMSC considers the problem of selecting moment conditions while holding
the model specification fixed, the GFIC allows us to select over both aspects of our speci-
fication simultaneously. This extension is particularly valuable in panel data applications,
where we may, for example, wish to carry out selection over the lag specification as well
as the exogeneity assumptions used to estimate a dynamic panel model. We specialize the
2
GFIC to such a dynamic panel example below, and provide simulation evidence of its per-
formance. Online Appendix A provides two additional examples: selecting between random
and fixed effects estimators, and choosing between pooled OLS and mean-group estimators
of an average effect in the presence of heterogeneity. In addition to extending the FMSC to
a broader class of problems, we also extend the results of DiTraglia (2016) on post-selection
and moment-averaging estimators to the more general setting of the GFIC. We conclude
with an empirical example modelling the price elasticity of cigarette demand.
As its name suggests, the GFIC is related to the focused information criterion (FIC) of
Claeskens and Hjort (2003), a model selection procedure for maximum likelihood estimators
that uses local mis-specification to approximate the MSE of a target parameter. Like the
FIC and related proposals, e.g. Schorfheide (2005), the GFIC uses local mis-specification
to derive a risk-based selection criterion. Unlike them, however, the GFIC provides both
moment and model selection for general GMM estimators. If the moment conditions used in
estimation are the score of a maximum likelihood model and we consider model selection only,
then the GFIC reduces to the FIC. Thus, the GFIC extends both the FIC and the FMSC
of DiTraglia (2016). Comparatively few papers propose criteria for simultaneous GMM
model and moment selection under mis-specification.1 Andrews and Lu (2001) propose
a family of selection criteria by adding appropriate penalty and “bonus” terms to the J-
test statistic, yielding analogues of AIC, BIC, and the Hannan-Quinn information criterion.
Hong et al. (2003) extend this idea to generalized empirical likelihood (GEL). The principal
goal of both papers is consistent selection: they state conditions under which the correct
model and all correct moment conditions are chosen in the limit. As a refinement to this
approach, Lai et al. (2008) suggest a two-step procedure: first consistently eliminate incorrect
models using an empirical log-likelihood ratio criterion, and then select from the remaining
models using a bootstrap covariance matrix estimator. The point of the second step is to
address a shortcoming in the standard limit theory. While first-order asymptotic efficiency
requires that we use all available correctly specified moment conditions, this can lead to a
deterioration in finite sample performance if some conditions are only weakly informative.
Hall and Peixe (2003) make a similar point about the dangers of including “redundant”
moment conditions while Caner (2009) proposes a lasso-type GMM estimator to consistently
remove redundant parameters.
In contrast to these suggestions, the GFIC does not aim to identify the correct model and
moment conditions: its goal is a low MSE estimate of a quantity of interest, even if this entails
using a specification that is not exactly correct. As such, the GFIC is an “efficient” rather
1See Smith (1992) for an approach to GMM model selection based on non-nested hypothesis testing. Fora detailed discussion of the literature on moment selection, see DiTraglia (2016).
3
than a consistent selection criterion. There is an unavoidable trade-off between consistent
selection and estimators with desirable risk properties (Yang, 2005). Indeed, the worst-case
risk of any consistent selection procedure is unbounded as sample size tends to infinity (Leeb
and Potscher, 2008). In this sense, the fact that the GFIC is not consistent is a benefit
rather than a liability. As we show in simulations below, its worst-case performance is much
better than that of competing selection procedures.
Although not strictly a selection procedure, the combined moments (CM) estimator of
Judge and Mittelhammer (2007) takes a similar perspective to that of the GFIC, emphasizing
that incoporating the information from an incorrect specification could lead to a favorable
bias-variance tradeoff under the right circumstances. Unlike the GFIC, however, the CM
estimator is not targeted to a particular research goal. A key point of the GFIC is that two
researchers using the same dataset but interested in different target parameters may find it
optimal, in a minimum MSE sense, to choose different model specifications. We explore this
idea further in our dynamic panel example below.
The remainder of this paper is organized as follows. Section 2 derives the asymptotic dis-
tribution of GMM estimators under locally mis-specified moment conditions and parameter
restrictions. Section 3 uses this information to calculate the AMSE of a user-specified target
parameter and provides asymptotically unbiased estimators of the required bias parameters,
yielding the GFIC. Section 4 extends the results on averaging estimators and post-selection
inference from DiTraglia (2016) to the more general setting of this paper. Section 5 spe-
cializes the GFIC to a dynamic panel example, and Section 6 presents simulation results.
Finally, Section 7 presents our empirical example and Section 8 concludes. Proofs and sup-
plementary simulation results appear in the Appendix. Further examples and simulation
results appear in Online Appendices A and B.
2 Asymptotic Framework
Let f(·, ·) be a (p + q)-vector of moment functions of a random vector Z and an (r + s)-
dimensional parameter vector β. To represent moment selection, we partition the moment
functions according to f(·, ·) = (g(·, ·)′, h(·, ·)′)′ where g(·, ·) and h(·, ·) are p- and q-vectors.The moment condition associated with g(·, ·) is assumed to be correct, while that associated
with h(·, ·) is locally mis-specified. The moment selection problem is to choose which, if any,
of the elements of h to use in estimation. To represent model selection, we partition the
full parameter vector according to β = (θ′, γ′)′, where θ is an s-vector and γ an r-vector of
parameters. The model selection problem is to decide which if any of the elements of γ to
estimate, and which to set equal to the corresponding elements of γ0, an r-vector of known
4
constants. The parameters contained in θ are those that we always estimate, the “protected”
parameters. Any specification that does not estimate the full parameter vector β is locally
mis-specified. The precise form of the local mis-specification, over parameter restrictions
and moment conditions, is as follows.
Assumption 2.1 (Local Mis-specification). Let Zni : 1 ≤ i ≤ n, n = 1, 2, . . . be an iid
triangular array of random vectors defined on a probability space (Υ,F , P ) satisfying
(a) E[g(Zni, θ0, γn)] = 0, with γn = γ0 + n−1/2δ
(b) E[h (Zni, θ0, γn, )] = τn, with τn = n−1/2τ
(c) f(Zni, θ0, γn) : 1 ≤ i ≤ n, n = 1, 2, . . . is uniformly integrable, and
(d) Znid→ Zi.
where γ0 is a (known) r-vector of parameter restrictions, δ an unknown r-vector of constants,
and τ an unknown q-vector of constants.
Assumption 2.1 specifies a triangular array data generating process in which the the true
parameter vector βn = (θ′0, γ′n)
′, changes with sample size but converges to β0 = (θ′0, γ′0)
′ as
n → ∞.2 Unless some elements of δ are zero, any estimator that restricts γ is mis-specified
for fixed n. In the limit, however, the restriction γ = γ0 holds. Similarly, for any fixed
sample size n, the expectation of h evaluated at the true parameter value βn depends on
the unknown constant vector τ , but this source of mis-specification disappears in the limit.
Thus, under Assumption 2.1, only estimators that use moment conditions from g to estimate
the full parameter vector β are correctly specified. In the limit, however, every estimator
is correctly specified, regardless of which elements of γ it restricts and which elements of
h it includes. The purpose of local mis-specification is to ensure that squared asymptotic
bias is of the same order as asymptotic variance: Assumption 2.1 is a device rather than
literal description of real-world data. Note that, by Assumption 2.1, the limiting random
variable Zi satisfies the population moment condition E[f (Zi, θ0, γ0)] = 0. Since the Zi are
assumed to have a common marginal law, we will use the shorthand Z for Zi throughout.
Accordingly, define:
F =
[∇θ′g (Z, θ0, γ0) ∇γ′g (Z, γ0, θ0)
∇θ′h (Z, γ0, θ0) ∇γ′h (Z, γ0, γ0)
]=[Fθ Fγ
]=
[Gθ Gγ
Hθ Hγ
]=
[G
H
](1)
2For simplicity, and because it is the case for all examples we consider below, we assume that the triangulararray from Assumption 2.1 is iid within each row. Note however, that the Zni are not iid across rows: γnand τn change with n. As such, the triangular array machinery is still required to describe our results.
5
along with
Ω = V ar
[g(Z, θ0, γ0)
h(Z, θ0, γ0)
]=
[Ωgg Ωgh
Ωhg Ωhh
]. (2)
Each of these expressions involves the limiting random variable Z rather than Zni. Thus, the
corresponding expectations are taken with respect to a distribution for which all moment
conditions have expectation zero evaluated at (θ0, γ0).
Before defining the estimators under consideration, we require some further notation.
Let b be a model selection vector, an r-vector of ones and zeros indicating which elements of
γ we have chosen to estimate. When b = 1r, where 1m represents an m-vector of ones, we
estimate both θ and the full vector γ. When b = 0r, where 0m denotes an m-vector of zeros,
we estimate only θ, setting γ = γ0. More generally, we estimate |b| components of γ and set
the others equal to the corresponding elements of γ0. Let γ(b) be the |b|-dimensional subvector
of γ corresponding to those elements selected for estimation. Similarly, let γ(−b)0 denote the
(r− |b|)-dimensional subvector containing the values to which we set those components of γ
that are not estimated. Analogously, let c =(c′g, c
′h
)′be a moment selection vector, a (p+q)-
vector of ones and zeros indicating which of the moment conditions we have chosen to use in
estimation. We denote by |c| the total number of moment conditions used in estimation. Let
BC denote the collection of all model and moment selection pairs (b, c) under consideration.
To express moment and model selection in matrix form, we define the selection matrices
Ξb and Ξc. Multiplying β by the (|b| + s) × (r + s) model selection matrix Ξb extracts the
elements corresponding to θ and the subset of γ indicated by the model selection vector b.
Thus Ξbβ =(θ′, γ(b)
′)′. Similarly, multiplying a vector by the |c| × (p+ q) moment selection
matrix Ξc extracts the components corresponding to the moment conditions indicated by the
moment selection vector c. To simplify the notation, we adopt the shorthand F (b, c) = ΞcFΞ′b
and Ωc = ΞcΩΞ′c throughout.
To express the estimators themselves, define the sample analogue of the expectations in
Assumption 2.1 as follows,
fn(β) =1
n
n∑i=1
f(Zni, θ, γ) =
[gn(β)
hn(β)
]=
[n−1
∑ni=1 g(Zni, θ, γ)
n−1∑n
i=1 h(Zni, θ, γ)
](3)
and let W be a (p + q) × (p + q) positive semi-definite weighting matrix with blocks
Wgg, Wgh, Whg and Whh, partitioned conformably to the partition of f(Z, β) by g(Z, β) and
h(Z, β). Each model and moment selection pair (b, c) ∈ BC defines a (|b| + s)-dimensional
6
estimator β(b, c) = (θ(b, c)′, γ(b)(b, c)′)′ of β(b) =(θ′, γ(b)
′)′according to
β(b, c) = arg minβ(b)∈B(b)
[Ξcfn
(β(b), γ
(−b)0
)]′ [ΞcWΞ′
c
] [Ξcfn
(β(b), γ
(−b)0
)]. (4)
We now state a number of standard high-level regularity conditions that will be assumed
throughout our derivations below.
Assumption 2.2 (High-level Regularity Conditions).
(a) β0 lies in the interior of Θ, a compact set
(b) W →p W , a positive definite matrix
(c) WcΞcE[f(Z, β)] = 0 if and only if β = β0, where Wc = ΞcWΞ′c
(d) E[f(Z, β)] is continuous on Θ
(e) supβ∈Θ||fn(β)− E[f(Z, β)]|| →p 0
(f) f is Z-almost surely differentiable in an open neighborhood B of β0
(g) supβ∈Θ||∇βfn(β)− F (β)|| →p 0
(h)√nfn(β0)−
√nE[f(Z, β0)] →d N where N ∼ Np+q(0,Ω)
(i) F (b, c)′WcF (b, c) is invertible, where Wc = ΞcWΞ′c
A particularly important special case is the estimator using only the moment conditions
in g to estimate the full parameter vector β = (θ′, γ′)′. We call this the valid estimator and
denote it by βv. Because it is assumed to be correctly specified both for finite n and in the
limit, the valid estimator contains the information we use to identify τ and δ, and thus carry
out moment and model selection. We assume that the valid estimator is identified.
Assumption 2.3 (GFIC Identification Condition). Let βv denote the GMM estimator for
β = (θ′, γ′)′ based solely on the moment conditions contained in g:
βv =(θ′v, γ
′v
)′= arg min
β∈Bgn(β)
′Wgg gn(β). (5)
We call this the “valid estimator” and assume that it satisfies all the conditions of Assumption
2.2. In particular assume that βv is identified, which implies p ≥ r + s.
7
Because Assumption 2.1 ensures that they are correctly specified in the limit, all can-
didate specifications (b, c) ∈ BC provide consistent estimators of θ0 under standard, high
level regularity conditions (see Assumption 2.2). Essential differences arise, however, when
we consider their respective asymptotic distributions. Under Assumption 2.1, both δ and τ
induce a bias term in the limiting distribution of√n(β(b, c)− β
(b)0
).
Theorem 2.1 (Asymptotic Distribution). Under Assumptions 2.1–2.2
√n(β(b, c)− β
(b)0
)d→ −K(b, c)Ξc
(N +
[0
τ
]− Fγδ
)(6)
where β(b)′
0 = (θ′0, γ(b)′
0 )′, K(b, c) = [F (b, c)′WcF (b, c)]−1 F (b, c)′Wc and N ∼ N(0,Ω) with
N = (N ′g ,N
′h )
′.
Because it employs the correct specification, the valid estimator of θ shows no asymptotic
bias. Moreover, the valid estimator of γ has an asymptotic distribution that is centered
around δ, suggesting an estimator of this bias parameter.
Corollary 2.1 (Asymptotic Distribution of Valid Estimator). Under Assumptions 2.1–2.2
√n(βv − β0
)=
√n
(θv − θ0
γv − γ0
)d→
[0
δ
]−KvNg
where Kv = [G′WggG]−1G′Wgg and Wgg = plim Wgg.
3 The GFIC
The GFIC chooses among potentially incorrect moment conditions and parameter restric-
tions to minimize estimator AMSE for a scalar target parameter. Denote this target pa-
rameter by µ = ϕ(θ, γ), where ϕ is a real-valued, almost surely continuous function of the
underlying model parameters θ and γ. Let µn = ϕ(θ0, γn) and define µ0 and µ(b, c) analo-
gously. By Theorem 2.1 and the delta method, we have the following result.
Corollary 3.1. Under the hypotheses of Theorem 2.1,
√n (µ(b, c)− µ0)
d→ −∇βϕ′0Ξ
′bK(b, c)Ξc
(N +
[0
τ
]− Fγδ
)
where ϕ0 = ϕ(θ0, γ0).
8
The true value of µ, however, is µn rather than µ0 under Assumption 2.1. Accordingly,
to calculate AMSE we recenter the limit distribution as follows.
Corollary 3.2. Under the hypotheses of Theorem 2.1,
√n (µ(b, c)− µn)
d→ −∇βϕ′0Ξ
′bK(b, c)Ξc
(N +
[0
τ
]− Fγδ
)−∇γϕ
′0δ
where ϕ0 = ϕ(θ0, γ0).
We see that the limiting distribution of µ(b, c) is not, in general, centered around zero:
both τ and δ induce an asymptotic bias. Note that, while τ enters the limit distribution only
once, δ has two distinct effects. First, like τ , it shifts the limit distribution of√nfn(θ0, γ0)
away from zero, thereby influencing the asymptotic behavior of√n (µ(b, c)− µ0). Second,
unless the derivative of ϕ with respect to γ is zero at (θ0, γ0), δ induces a second source
of bias when µ(b, c) is recentered around µn. Crucially, this second source of bias exactly
cancels the asymptotic bias present in the limit distribution of γv. Thus, the valid estimator
of µ is asymptotically unbiased and its AMSE equals its asymptotic variance.
Corollary 3.3. Under the hypotheses of Theorem 2.1,
√n (µv − µn)
d→ −∇βϕ(θ0, γ0)′KvNg
where µv = ϕ(θv, γv). Thus, the valid estimator µv shows no asymptotic bias and has asymp-
totic variance ∇βϕ(θ0, γ0)′KvΩggK
′v∇βϕ(θ0, γ0).
Using Corollary 3.2, the AMSE of µ(b, c) is as follows,
AMSE (µ(b, c)) = AVAR (µ (b, c)) + BIAS (µ (b, c))2 (7)
AVAR (µ (b, c)) = ∇βϕ′0Ξ
′bK(b, c)ΩcK(b, c)′Ξb∇βϕ0 (8)
BIAS (µ (b, c)) = −∇βϕ′0M(b, c)
[δ
τ
](9)
where
M(b, c) = Ξ′bK(b, c)Ξc
[−Gγ 0
−Hγ I
]+
[0p×r 0s×q
Ir 0r×q
](10)
The idea behind the GFIC is to construct an estimate AMSE (µ(b, c)) and choose the
specification (b∗, c∗) ∈ BC that makes this quantity as small as possible. As a side-effect
of the consistency of the estimators β(b, c), the usual sample analogues provide consistent
9
estimators of K(b, c) and F ′γ = (G′
γ, H′γ) under Assumption 2.1, and ϕ(θv, γ0) is consistent
for ϕ0. Consistent estimators of Ω are also readily available under local mis-specification
although the best choice may depend on the situation.3 Since γ0 is known, as are Ξb and
Ξc, only δ and τ remain to be estimated. Unfortunately, neither of these quantities is
consistently estimable under local mis-specification. Intuitively, the data become less and
less informative about τ and δ as the sample size increases since each term is divided by√n. Multiplying through by
√n counteracts this effect, but also stabilizes the variance of
our estimators. Hence, the best we can do is to construct asymptotically unbiased estimators
of τ and δ. Corollary 3.4 provides the required estimator for δ, namely δ =√n (γv − γ0),
while Lemma 3.1 provides an asymptotically unbiased estimator of τ by plugging βv into the
sample analogue of the h-block of moment conditions.
Corollary 3.4. Under the hypotheses of Theorem 2.1, δ =√n (γv − γ0)
d→ δ−Kγv Ng where
Kv = [G′WggG]−1G′Wgg =
(Kθ′
v , Kγ′v
)′, so δ is an asymptotically unbiased estimator of δ.
Lemma 3.1. Under the hypotheses of Theorem 2.1,
τ =√nhn
(βv
)d→ τ −HKvNg + Nh
where Kv = [G′WggG]−1G′Wgg. Hence, τ is an asymptotically unbiased estimator of τ .
Combining Corollary 3.4 and Lemma 3.1, gives the joint distribution of δ and τ .
Theorem 3.1. Under the hypotheses of Theorem 2.1,[δ
τ
]=
√n
[(γv − γ0)
hn(βv)
]d→
[δ
τ
]+ΨN , Ψ =
[−Kγ
v 0
−HKv I
]
where Kv = [G′WggG]−1G′Wgg is partitioned according to K ′
v = (Kθ′v , K
γ′v ).
Now, we see immediately from Equation 9 that
BIAS (µ (b, c))2 = ∇βϕ′0M(b, c)
[δδ′ δτ ′
τδ′ ττ ′
]M(b, c)′∇βϕ0
Thus, the bias parameters τ and δ enter the AMSE expression in Equation 7 as outer
products: ττ ′, δδ′ and τδ′. Although τ and δ are asymptotically unbiased estimators of τ
and δ, it does not follow that τ τ ′, δδ′ and τ δ′ are asymptotically unbiased estimators of
ττ ′, δδ′, and τδ′. The following result shows how to adjust these quantities to provide the
required asymptotically unbiased estimates.
3We discuss this in more detail for our dynamic panel example in Section 5 below.
10
Corollary 3.5. Suppose that Ψ and Ω are consistent estimators of Ψ and Ω. Then, B is an
asymptotically unbiased estimator of B, where
B =
[δδ′ δτ ′
τ δ′ τ τ ′
]− ΨΩΨ′, B =
[δδ′ δτ ′
τδ′ ττ ′
].
Combining Corollary 3.5 with consistent estimates of the remaining quantities yields
the GFIC, an asymptotically unbiased estimator of the AMSE of our estimator of a target
parameter µ under each specification (b, c) ∈ BC
GFIC(b, c) = ∇βϕ′0
[Ξ′bK(b, c)ΩcK(b, c)′Ξb + M(b, c) B M(b, c)′
]∇βϕ0. (11)
We choose the specification (b∗, c∗) that minimizes the GFIC over the candidate set BC.
4 Averaging and Post-Selection Inference
While we are primarily concerned in this paper with the mean-squared error performance
of our proposed selection techniques, it is important to have tools for carrying out inference
post-selection. In this section we briefly present results that can be used to carry out valid
inference for a range of model averaging and post-selection estimators, including the GFIC.4
The GFIC is an efficient rather than consistent selection criterion: it aims to estimate
a particular target parameter with minimum AMSE rather than selecting the correct spec-
ification with probability approaching one in the limit. As pointed out by Yang (2005),
among others, there is an unavoidable trade-off between consistent selection and desirable
risk properties. Faced with this dilemma, the GFIC sacrifices consistency in the interest of
low AMSE. Because it is not a consistent criterion, the GFIC remains random even in the
limit. We can see this from Equation 11 in Section 3 and Corollary 3.5. While the quantities
∇βϕ0, K(b, c), and Ωc are consistent estimators of their population counterparts, B is only
an asymptotically unbiased estimator of B and thus has a limiting distribution. In particular
B →d B(N , δ, τ) where
B(N , δ, τ) =
([δ
τ
]+ΨN
)([δ
τ
]+ΨN
)′
−ΨΩΨ (12)
Accordingly, to carry out inference post-GFIC, we need a limiting theory that is rich enough
to accommodate randomly-weighted averages of the candidate estimators µ(b, c). To this
4We direct the reader to DiTraglia (2016) and the references contained therein for a more detailed dis-cussion of inference post-selection.
11
end, consider an estimator of the form µ =∑
(b,c)∈BC ω(b, c)µ(b, c) where µ(b, c) denotes
the target parameter under the moment conditions and parameter restrictions indexed by
(b, c), BC denotes the full set of candidate specifications, and ω(b, c) denotes a collection
of data-dependent weights. These could be zero-one weights correponding to a moment or
model selction criterion, e.g. select the estimator that minimizes GFIC, or model averaging
weights.5 We impose the following mild restrictions on the weights ω.
Assumption 4.1 (Conditions on the Weights).
(a)∑
(b,c)∈BC ω(b, c) = 1, almost surely
(b) For each (b, c) ∈ BC, ω(b, c) d→ ψ(N , δ, τ |b, c), a function of N , δ, τ , and consistently
estimable constants with at most countably many discontinuities.
Under the preceding conditions, we can derive the limit distribution of µ shown in the
following Corollary.
Corollary 4.1 (Limit Distribution of Averaging Estimators). Under Assumption 4.1 and
the hypotheses of Theorem 2.1,√n (µ− µn)
d→ Λ(δ, τ) where
Λ(δ, τ) = −∇βϕ′0
∑(b,c)∈BC
ψ(N , δ, τ |b, c)
Ξ′bK(b, c)ΞcN +M(b, c)
[δ
τ
](13)
Note that the limit distribution from the preceding corollary is highly non-normal: it is
a randomly weighted average of a normal random vector, N . To tabulate this distribution
for the purposes of inference, we will in general need to resort to simulation. If τ and δ
were known, the story would end here. In this case we could simply substitute consistent
estimators of K and M and then repeatedly draw N ∼ N(0, Ω), where Ω is a consistent
estimator of Ω, to tabulate the distribution of Λ to arbitrary precision as follows.
Algorithm 4.1 (Approximating Quantiles of Λ(δ, τ)).
1. Generate J independent draws Nj ∼ N(0, Ω)
2. Set Λj(δ, τ) = −∇βϕ′0
∑(b,c)∈BC ψ(Nj, δ, τ |b, c)
Ξ′bK(b, c)ΞcNj + M(b, c)
[δ
τ
]
3. Using the Λj(δ, τ), find a(δ, τ), b(δ, τ) so that Pa(δ, τ) ≤ Λ(δ, τ) ≤ b(δ, τ)
= 1− α.
5For an example that averages over fixed and random effects estimators, see Section A.1.
12
Unfortunately, no consistent estimators of τ or δ exist: all we have at our disposal are
asymptotically unbiased estimators. The following “1-step” confidence interval is constructed
by substituting these into Algorithm 4.1.
Algorithm 4.2 (1-Step Confidence Interval). Carry out of Algorithm 4.1 with τ and δ set
equal to the estimators τ and δ from Theorem 3.1 to calculate a(δ, τ) and b(δ, τ). Then set
CI1(α) =[µ− b(δ, τ)/
√n, µ− a(δ, τ)/
√n].
The 1-Step interval defined in Algorithm 4.2 is conceptually simple, easy to compute,
and can perform well in practice.6 But as it fails to account for sampling uncertainty in
τ , CI1 does not necessarily yield asymptotically valid inference for µ. Fully valid inference
requires the addition of a second step to the algorithm and comes at a cost: conservative
rather than exact inference. In particular, the two-step procedure described in the following
algorithm is guaranteed to yield an interval with asymptotic coverage probability of no less
than (1− α1 − α2)× 100%.
Algorithm 4.3 (2-Step Confidence Interval for µ).
1. Construct R, a (1− α1)× 100% joint confidence region for (δ, τ) using Theorem 3.1.
2. For each (δ∗, τ ∗) ∈ R carry out Algorithm 4.1 with α = α2 yielding a (1− α2)× 100%
confidence interval[a(δ∗, τ ∗), b(δ∗, τ ∗)
]for Λ(δ∗, τ ∗).
3. Set amin = min(δ∗,τ∗)∈R
a(δ∗, τ ∗) and bmax = max(δ∗,τ∗)∈R
b(δ∗, τ ∗).
4. Construct the interval CI2(α1, α2) =[µ− bmax/
√n, µ− amin/
√n].
Theorem 4.1 (2-Step Confidence Interval for µ). Let ∇βϕ0, ψ(·|b, c), K(b, c) and M(b, c) be
consistent estimators of ∇βϕ0, ψ(·|b, c), K(b, c) and M(b, c) and let R be a (1− α1)× 100%
confidence region for (δ, τ) constructed from Theorem 3.1. Then CI2(α1, α2), defined in
Algorithm 4.3 has asymptotic coverage probability no less than 1− (α1 + α2) as J, n→ ∞.
5 Dynamic Panel Example
We now specialize the GFIC to a dynamic panel model of the form
yit = θxit + γ1yit−1 + · · ·+ γkyit−k + ηi + vit (14)
6For more discussion on this point, see DiTraglia (2016).
13
where i = 1, . . . , n indexes individuals and t = 1, . . . , T indexes time periods. For simplicity,
and without loss of generality, we suppose that there are no exogenous time-varying regressors
and that all random variables are mean zero.7 The unobserved error ηi is a correlated
individual effect: σxη ≡ E [xitηi] may not equal zero. The endogenous regressor xit is assumed
to be predetermined but not necessarily strictly exogenous: E[xitvis] = 0 for all s ≥ t but
may be nonzero for s < t. We assume throughout that yit is stationary, which requires both
xit and uit to be stationary and |γ| < 1 where γ = (γ1, . . . , γk)′. Our goal is to estimate one
of the following two target parameters with minimum MSE:
µSR ≡ θ, µLR ≡ θ/ [1− (γ1 + · · ·+ γk)] (15)
where µSR denotes the short-run effect and µLR the long-run effect of x on y.
The question is which assumptions to use in estimation. Naturally, the answer may
depend on whether our target is µSR or µLR. Our first decision is what assumption to
impose on the relationship between xit and vit. This is the moment selection decision.
We assumed above that x is predetermined. Imposing the stronger assumption of strict
exogeneity gives us more and stronger moment conditions, but using these in estimation
introduces a bias if x is not in fact strictly exogenous. Our second decision is how many
lags of y to use in estimation. This is the model selection decision. The true model contains
k lags of y. If we estimate only r < k lags we not only have more degrees of freedom
but more observations: every additional lag of y requires us to drop one time period from
estimation. In the short panel datasets common in microeconomic applications, losing even
one additional time period can represent a substantial loss of information. At the same time,
unless γr+1 = · · · = γk = 0, failing to include all k lags in the model introduces a bias.
To eliminate the individual effects ηi we work in first differences. Defining ∆ in the usual
way, so that ∆yit = yit − yit−1 and so on, we can write Equation 14 as
∆yit = θ∆xit + γ1∆yit−1 + · · ·+ γk∆yit−k +∆vit. (16)
For simplicity and to avoid many instruments problems – see e.g. Roodman (2009) – we
focus here on estimation using the instrument sets
z′it(`,P) ≡[yit−2 · · · yit−(`+1) xit−1
]z′it(`, S) ≡
[z′it(`,P) xit
](17)
similar to Anderson and Hsiao (1982). Modulo a change in notation, one could just as
7Alternatively, we can simply de-mean and project out any time-varying exogenous covariates after takingfirst-differences.
14
easily proceed using the instrument sets suggested by Arellano and Bond (1991). We use
` as a placeholder for the lag length used in estimation. If ` = 0, z′it(0,P) = xit−1 and
z′it(0, S) = (xit−1, xit). Given these instrument sets, we have (` + 1) × (T − ` − 1) moment
conditions if x is assumed to be predetermined versus (`+2)×(T−`−1) if it is assumed to be
strictly exogenous, corresponding to the instrument matrices Zi(`,P) = diag z′it(`,P)Tt=`+2
and Zi(`, S) = diag z′it(`, S)Tt=`+2. To abstract for a moment from the model selection
decision, suppose that we estimate a model with the true lag length: ` = k. The only
difference between the P and S sets of moment conditions is that the latter adds over-
identifying information in the form of E[xit∆vit]. If x is strictly exogenous, this expectation
equals zero, but if x is only predetermined, then E[xit∆vit] = −E[xitvit−1] 6= 0 so the
over-identifying moment condition is invalid. Given our instrument sets, this is the only
violation of strict exogeneity that is relevant for our moment selection decision so we take
E[xitvit−1] = −τ/√n.
In the examples and simulations described below we consider two-stage least squares
(TSLS) estimation of µSR and µLR using the instruments defined in Equation 17. Without
loss of generality, we select between two lag length specifications: the first is correct, ` = k,
and the second includes m lags too few: ` = r where r = k −m. Accordingly, we make the
coefficients associated with the (r+1)th, . . . , kth lags local to zero. Let γ ′ = (γ1, · · · , γk−1, γk)
denote the full vector of lag coefficients and γ ′r = (γ1, · · · , γr) denote the first r = k−m lag
coefficients. Then, the true parameter vector is βn = (θ,γ ′r, δ
′/√n)′ which becomes, in the
limit, β = (θ,γ ′r,0
′)′. Both δ and 0 are of length m. To indicate the subvector of β that
excludes the (r + 1)th, . . . , kth lag coefficients, let βr = (θ,γ ′r)
′.
Because the two lag specifications we consider use different time periods in estimation,
we require some additional notation to make this clear. First let ∆yi = [∆yi,k+2, · · · ,∆yiT ]′
and ∆y+i = [∆yi,k+2−m,∆yi,k+2−(m−1), · · · ,∆yiT ]′ where the superscript “+” indicates the
inclusion of m additional time periods: t = k+2−m, . . . , k+1. Define ∆xi, ∆x+i , ∆vi, and
∆v+i analogously. Next, define Lr+1∆y+
i = [∆yi1,∆yi2, · · · ,∆yiT−(r+1)]′ where Lr+1 denotes
the element-wise application of the (r + 1)th order lag operator. Note that the first element
of Lr+1∆y+i is unobserved since ∆yi1 = yi1 − yi0 but t = 1 is the first time period. Now we
define the matrices of regressors for the two specifications:
W+′
i (r) =[∆x+
i L∆y+i L2∆y+
i · · · Lr∆y+i
]W ′
i (k) =[∆xi L∆yi L2∆yi · · · Lk−1∆yi Lk∆yi
].
Note that W+i (r) contains m more rows than Wi(k) but Wi(k) contains m more columns
than W+i (r): removing the (r + 1)th, . . . , kth lags from the model by setting ` = r = k −m
15
allows us to usem additional time periods in estimation and reduces the number of regressors
by m. Stacking over individuals, let ∆y = [∆y′1 · · ·∆y′
n]′, W` = [W1(`) · · ·Wn(`)]
′ and define
∆y+ and W+` analogously, where ` denotes the lag length used in estimation. Finally, let
Z ′(`, ·) = [Z ′1(`, ·) · · ·Z ′
n(`, ·)] where (·) is P or S depending on the instrument set in use.
Using this notation, under local mis-specification the true model is
∆y = W (k)βn +∆v ∆y+ = W (k)+βn +∆v+ (18)
Using the shorthand Q ≡ n[W ′Z(Z ′Z)−1Z ′W ]−1W ′Z(Z ′Z)−1 our candidate estimators are
β(k, ·) = Q(k, ·)[Z ′(k, ·)∆y
n
]β(r, ·) = Q(r, ·)
[Z ′(r, ·)∆y+
n
](19)
where (·) is either P or S depending on which instrument set is used and r = k − m, m
lags fewer than the true lag length k. The following result describes the limit distribution of
β(k,P), β(k, S), β(r,P), and β(r, S) which we will use to construct the GFIC.
Theorem 5.1 (Limit Distributions for Dynamic Panel Estimators). Let (ynit, xnit, vnit) be
a triangular array of random variables that is iid over i, stationary over t, and satisfies
Equation 16 with (γk−m+1, . . . , γk)′ = δ/
√n. Suppose further that xit is predetermined with
respect to vit but not strictly exogenous: E[xit∆vit] = τ/√n. Then, under standard regularity
conditions,
√n[β(k,P)− β
]→d
[0 0′
r δ′]′+Q (k,P)N (0,V(k,P))
√n[β(k, S)− β
]→d
[0 0′
r δ′]′+Q (k, S)
ιT−(k+1) ⊗
[0k+1
τ
]+ N (0,V(k, S))
√n[β(r,P)− βr
]→d Q(r,P)
[ιT−(r+1) ⊗ψP δ + N (0,V(r,P))
]√n[β(r, S)− βr
]→d Q(r, S)
[ιT−(r+1) ⊗
([ψP
ψS
]δ +
[0r+1
τ
])+ N (0,V (r, S))
]
where r = k − m, β′ = (θ, γ1, . . . , γr,0′m), β
′r = (θ, γ1, . . . , γr), V(k, ·) = Var [Zi(k, ·)∆vi],
V(r, ·) = Var[Zi(r, ·)∆v+
i
], Q(`, ·) →p Q(`, ·), ψP = E[zit(r,P)(∆yit−(r+1), . . . ,∆yit−k)],
ψS = E[xit(∆yit−(r+1), . . . ,∆yit−k)], Zi(`, ·) = diagz′it(`, ·)Tt=`+2, zit(`, ·) is as in Equation
17, and ιd denotes a d-vector of ones.
To operationalize the GFIC, we need to provide appropriate estimators of all quantities
that appear in Theorem 5.1. To estimate Q(k,P), Q(k, S), Q(r,P), and Q(r, S) we em-
ploy the usual sample analogues Q(·, ·) given above, which remain consistent under local
16
mis-specification. There are many consistent estimators for the variance matrices V(k,P),V(k, S), V(r,P), V(r, S) under local mis-specification. In our simulations below, we employ
the usual heteroskedasticity-consistent, panel-robust variance matrix estimator. Because
E[zit(`, S)∆vit] 6= 0, we center our estimators of V(`, S) by subtracting the sample analogue
of this expectation when calculating the sample variance. We estimate ψP and ψS as follows
ψ′P =
1
nTk
∑T
t=k+2
∑ni=1 zit(r,P)∆yit−(r+1)
...∑Tt=k+2
∑ni=1 zit(r,P)∆yit−k
, ψ′S =
1
nTk
∑T
t=k+2
∑ni=1 xit∆yit−(r+1)
...∑Tt=k+2
∑ni=1 xit∆yit−k
where Tk = T − k− 1. These estimates use our assumption of stationarity from above. The
only remaining quantities we need to construct the GFIC involve the bias parameters δ and
τ . We can read off an asymptotically unbiased estimator of δ directly from Theorem 5.1,
namely δ =√n (γr+1(k,P), . . . , γk(k,P))
′ based on the instrument set that assumes only that
x is pre-determined rather than strictly exogenous. To construct an asymptotically unbiased
estimator of τ , we use the residuals from the specification that uses both the correct moment
conditions and the correct lag specification, specifically
τ =
(ι′T−k−1
T − k − 1
)n−1/2X ′
[∆y −W (k)β(k,P)
](20)
where X ′ = [X1 · · ·Xn] and Xi = diag xitTt=k+2. The following result gives the joint limiting
behavior of δ and τ , which we will use to construct the GFIC.
Theorem 5.2 (Joint Limit Distribution of δ and τ). Under the conditions of Theorem 5.1,[δ − δτ − τ
]d→ ΨN (0,ΠV (k, S) Π′)
where δ =√n[e′k β(k,P)], ek = (0,0′
k−m, ι′m)
′, τ is as defined in Equation 20,
Ψ =
e′kQ(k,P) 0′T−k−1(
ι′T−k−1
T−k−1
)ξ′Q(k,P)⊗ ι′T−k−1
(ιT−k−1
T − k − 1
) ,ξ′ = E
xit
[∆xit L∆yit · · · Lk∆yit
], the variance matrix V(k, S) is as defined in
Theorem 5.1, the permutation matrix Π =[Π′
1 Π′2
]′with Π1 = IT−k−1 ⊗
[Ik+1 0k+1
]and Π2 = IT−k−1 ⊗
[0′k+1 1
], ιd is a d-vector of ones and Id the (d× d) identity matrix.
17
To provide asymptotically unbiased estimators of the quantities τ 2, δδ′ and δτ that ap-
pear in the AMSE expressions for our estimators, we apply a bias correction to the asymp-
totically unbiased estimators of δ and τ from Theorem 5.2.
Corollary 5.1. Let Ψ be a consistent estimator of Ψ, defined in Theorem 5.2, and V(k, S)be a consistent estimator of V(k, S), defined in Theorem 5.1. Then, the elements of[
δδ′δτ
τ δ′τ 2
]− ΨΠ V (k, S)Π′ Ψ′
provide asymptotically unbiased estimators of of δδ′, τ 2 and δτ , where Π is the permutation
matrix defined in Theorem 5.2.
We have already discussed consistent estimation of V(k, S). Since Π is a known permuta-
tion matrix, it remains only to propose a consistent estimator of Ψ. The matrix Ψ, in turn,
depends only on Q(k,P), and ξ′. The sample analogue Q(k,P) is a consistent estimator for
Q(k,P), as mentioned above, and
ξ′=
1
n(T − k − 1)
T∑t=k+2
n∑i=1
xit
[∆xit L∆yit · · · Lk∆yit
](21)
is consistent for ξ′. We now have all the quantities needed to construct the GFIC for µSR,
the short-run effect of x on y. Since µSR = θ, we can read off the AMSE expression for
this parameter directly from Theorem 5.1. For the long-run effect µLR, however, we need to
formally apply the Delta-method and account for the fact that the true value of (γr+1, . . . , γk)′
is δ/√n. Expressed as a function ϕ of the underlying model parameters,
µLR = ϕ(θ,γr,γ−r) = θ/[1− ι′rγr − ι′mγ−r
]where we define γ−r ≡ (γr+1, . . . , γk)
′. The derivatives of ϕ are
∇ϕ ≡[∂ϕ
∂θ
∂ϕ
∂γ ′r
∂ϕ
∂γ ′−r
]=
(1
1− ι′rγr − ι′mγ−r
)2 [ (1− ι′rγr − ι′mγ−r
)θ ι′r θ ι′m
].
Using this notation, the limiting value of µLR is µ0LR = ϕ(θ,γr,0
′m) while the true value is
µnLR = ϕ(θ,γr, δ/
√n). Similarly, the limiting value of ∇ϕ is ∇ϕ0 = ∇ϕ(θ,γr,0
′m)
′, obtained
by putting zero in place of γ−r. We estimate this quantity consistently by plugging in the
estimates from β(k,P).
18
6 Simulation Study
We now consider two simulation experiments based on section 5, applying the GFIC to a
dynamic panel model. For both experiments our data generating process is similar to that
of Andrews and Lu (2001), specifically xi
ηi
vi
∼ iid N
0T
0
0T
, IT σxηιT σxvΓT
σxηι′T 1 0′
T
σxvΓ′T 0T IT
, ΓT =
[0′T−1 0
IT−1 0T−1
]. (22)
where 0m denotes an m-vector of zeros, Im the (m × m) identity matrix, and ιm an m-
vector of ones. Under this covariance matrix structure ηi and vi are uncorrelated with each
other, but both are correlated with xi: E[xitηi] = σxη and xit is predetermined but not
strictly exogenous with respect to vit. Specifically, E[xitvit−1] = σxv, while E[xitvis] = 0 for
s 6= t−1. We initialize the pre-sample observations of y to zero, the mean of their stationary
distribution, and generate the remaining time periods according to Equation 14 with θ = 0.5
and σxη = 0.2. The true lag length differs in our two examples as does the target parameter,
so we explain these features of the simulation designs below. Unlike Andrews and Lu (2001)
we do not generate extra observations to keep the time dimension fixed across estimators
with different lag specifications. This is for two reasons. First, in real-world applications
such additional observations would not be available. Second, we are explicitly interested in
trading off the efficiency gain from including additional time periods in estimation against
the bias that arises from estimating an incorrect lag specification.
6.1 Long-run versus Short-run Effects
Consider two different researchers who happen to be working with the same panel dataset.
One wishes to estimate the short-run effect of x on y while the other wishes to estimate the
long-run effect. Should they use the same model specification? We now present an example
showing that the answer, in general, is no. Suppose that the true model is
yit = θxit + γ1yit−1 + γ2yit−2 + ηi + vit
where i = 1, . . . , n = 250 and t = 1, . . . , T = 5 and the regressor, individual effect and error
term are generated according to Equation 22, as described in the preceding section. Our
model selection decision in this example is whether to set γ2 = 0 and estimate a specification
with one lag only. We denote this one-lag specification by L1 and the true specification,
including both lags, by L2. To focus on the model selection decision, we fix the instrument
19
set in this experiment to zit(`,P), defined in Equation 17. Because this instrument set is
valid when x is pre-determined, it does not introduce bias into our estimation. Thus, bias
only emerges if we estimate L1 when γ2 6= 0. Our simulation design takes θ = 0.5, γ1 =
0.4, σxη = 0.2, and σxv = 0.1 and varies γ2 over the range 0.10, 0.11, . . . , 0.19, 0.20.Table 1 presents the results of the simulation, based on 1000 replications at each grid
point. Because they are based on ratios of estimators of θ and γ1, γ2, estimators of the long-
run effect may not have finite moments, making finite-sample MSE undefined. The usual
solution to this problem in simulation settings is to work with so-called “trimmed” MSE
by discarding observations that fall outside, say, a range [−M,M ] before calculating MSE.8
Because there is no clear way to set the trimming constant M , it can be difficult to interpret
results based on trimmed MSE unless one considers a variety of values of M . To avoid this
issue, Table 1 reports simulation results for median absolute deviation (MAD). Results for
trimmed MSE with different choices of M are similar and are available upon request.
Short-run Effect Long-run Effectγ2 L2 L1 GFIC L2 L1 GFIC
0.10 0.231 0.141 0.173 0.801 0.582 0.6880.11 0.237 0.156 0.181 0.834 0.633 0.7160.12 0.240 0.174 0.193 0.850 0.685 0.7520.13 0.238 0.187 0.201 0.870 0.729 0.7870.14 0.220 0.198 0.203 0.870 0.764 0.8080.15 0.201 0.219 0.211 0.844 0.822 0.8390.16 0.205 0.223 0.210 0.883 0.856 0.8620.17 0.181 0.242 0.204 0.860 0.911 0.8970.18 0.162 0.258 0.189 0.835 0.959 0.8910.19 0.161 0.265 0.181 0.866 0.997 0.9170.20 0.143 0.288 0.162 0.858 1.054 0.910
Table 1: Comparisons of mean absolute deviation (MAD) for estimators of the Short-run andLong-run effects of x on y in the simulation experiment described in Section 6.1. The columnslabeled L1 and L2 give the MAD of estimators that fix the lag length to one and two, while thecolumns labeled GFIC give the MAD of an estimator that selects lag length via the GFIC. Resultsare based on 1000 simulation replications from the DGP described in Section 6 with γ1 = 0.4, usingthe estimators described in Section 5 and the instrument set zit(`,P) from Equation 17.
The columns of Table 1 labeled L1 and L2 give the MAD of estimators that fix the lag
length to one and two, while those labeled GFIC give the MAD of an estimator that selects
lag length via the GFIC. Notice that throughout the table γ2 6= 0 so that L1 is mis-specified.
8Note that asymptotic MSE remains well-defined even for estimators that do not possess finite-samplemoments so that GFIC comparisons remain meaningful. By taking the trimming constant M to infinity,one can formalize the notion that asymptotic MSE comparisons can be used to “stand in” for finite-sampleMSE even when the latter does not exist. For more details, See Hansen (2016) and online appendix C ofDiTraglia (2016).
20
Nonetheless, L1 yields lower MAD estimators of both the short-run and long-run effects when
γ2 is sufficiently small and the difference can be substantial. When γ2 = 0.2, for example,
MAD for is 0.582 for the long-run effect estimator based on L1 versus 0.801 for that based on
L2. Note moreover that the point at which γ2 becomes large enough for L2 to be preferred
depends on which effect we seek to estimate. When γ2 equals 0.15 or 0.16, L1 gives a lower
MAD for the short-run effect while L2 gives a lower MAD for the long-run effect. Because
it is subject to random model selection errors, the GFIC can never outperform the oracle
estimator that uses L1 when it is optimal in terms of MAD and L2 otherwise. Instead, the
GFIC represents a compromise between two extremes: its MAD is never as large as that of
the worst specification and never as small as that of the best specification. When there are
large MAD differences between L1 and L2, however, GFIC is generally close to the optimum.
6.2 Model and Moment Selection for the Short-run Effect
We now consider a more complicated simulation experiment that simultaneously selects over
lag specification and endogeneity assumptions. In this simulation our target parameter is
the short-run effect of x on y, as in our empirical example below and the true model contains
one lag. Specifically,
yit = θxit + γyit−1 + ηi + vit
where i = 1, . . . , n and t = 1, . . . , T and the regressor, individual effect and error term are
generated according to Equation 22. Our model selection decision example is whether to set
γ = 0 and estimate a specification wihout the lagged dependent variable, while our moment
selection decision is whether to use only the instrument set zit(`,P), which assumes that x
is predetermined, or the instrument set zit(`, S) which assumes that it is strictly exogenous.
Both instrument sets are defined in Equation 17. We consider four specifications, each
estimated by TSLS using the expressions from section 5. The correct specification, LP,
estimates both γ and θ using only the “predetermined” instrument set. In contrast, LS
estimates both parameters using the “strict exogeneity” instrument set. The specifications
P and S set γ = 0 and estimate only θ, using the predetermined and strictly exogenous
instrument sets, respectively. Our simulation design sets θ = 0.5, σxη = 0.2 and varies
γ, σxv, T and n over a grid. Specifically, we take γ, σxv ∈ 0, 0.005, 0.01, . . . , 0.195, 0.2,n ∈ 250, 500, T ∈ 4, 5.9 All values are computed based on 2000 simulation replications.
Table 2, presents RMSE values multiplied by 1000 for ease of reading for each of the fixed
specifications – LP, LS, P, and S – and for the various selection procedures.10 We see that
9Setting T no smaller than 4 ensures that MSE exists for all four estimators: the finite sample momentsof the TSLS estimator only exist up to the order of over-identification.
10In the interest of space, Table 2 uses a coarser simulation grid than Figure B.3. The supplementary
21
there are potentially large gains to be had by intentionally using a mis-specified estimator.
Indeed, the correct specification, LP, is only optimal when both ρxv and γ are fairly large
relative to sample size. When T = 4 and n = 250, for example, γ and ρxv must both
exceed 0.10 before LP has the lowest RMSE. Moreover, the advantage of the mis-specified
estimators can be substantial.
In practice, of course, we do not know the values of γ, ρ, θ, or the other parameters of
the DGP so this comparison of finite-sample RMSE values is infeasible. Instead, we consider
using the GFIC to select between the four specifications. Clearly there are gains to be had
from estimating a mis-specified model in certain situations. The questions remains, can the
GFIC identify them? Because it is an efficient rather than consistent selection criterion, the
GFIC remains random, even in the limit. This means that the GFIC can never outperform
the “oracle” estimator that uses whichever specification gives the lowest finite-sample RMSE.
Moreover, because our target parameter is a scalar, Stein-type results do not apply: the
post-GFIC estimator cannot provide a uniform improvement over the true specification LP.
Nevertheless, the post-GFIC estimator can provide a substantial improvement over LP when
ρxv and γ are relatively small relative to sample size, as shown in the two leftmost panes of
the top panel in Table 2. This is precisely the situation for which the GFIC is intended: a
setting in which we have reason to suspect that mis-specification is fairly mild. Moreover,
in situations where LP has a substantially lower RMSE than the other estimators, the post
GFIC-estimator’s performance is comparable.
To provide a basis for comparison, we now consider results for a number of alternative
selection procedures. The first is a “Downward J-test,” which is intended to approximate
what applied researchers may do in practice when faced with a model and moment selection
decision such as this one. The Downward J-test selects the most restrictive specification
that is not rejected by a J-test test with significance level α ∈ 0.05, 0.1. We test the
specifications S,P,LS,LP in order and report the first that is not rejected. This means
that we only report LP if all the other specifications have been rejected. This procedure is,
of course, somewhat ad hoc because the significance threshold α is chosen arbitrarily rather
than with a view towards some kind of selection optimality. We also consider the GMM
model and moment selection criteria of Andrews and Lu (2001). These take the form
MMSCn(b, c) = Jn(b, c)− (|c| − |b|)κn
where |b| is the number of parameters estimated, |c| the number of moment conditions used,
and κn is a function of n. Setting κn = log n gives the GMM-BIC, while κn = 2 gives the
figures in Online Appendix B present results over the full simulation grid.
22
GFIC
LP
LS
PS
ρxv
00.05
0.1
0.15
00.05
0.1
0.15
00.05
0.1
0.15
00.05
0.1
0.15
00.05
0.1
0.15
TN
γ4
250
057
6062
6772
7069
6851
5874
9551
5352
5242
4965
860.05
6062
6769
7471
7170
5259
7795
5857
5856
4354
7291
0.1
6667
7475
7772
7471
5257
7697
7673
7170
4760
7797
0.15
7175
7882
8177
7475
5362
7699
9996
9289
5569
84103
500
040
4348
4851
4949
4837
4767
9037
3736
3630
4059
830.05
4347
5149
5250
5149
3647
6790
4545
4443
3147
6587
0.1
4553
5654
5251
5050
3647
6990
6764
6259
3753
7392
0.15
5056
5956
5653
5250
3648
6992
9290
8583
4563
80100
5250
045
4854
5656
5655
5442
5170
9144
4544
4536
4462
830.05
4852
5655
5856
5554
4352
7092
5251
5148
3850
6889
0.1
5157
6158
5957
5555
4453
7294
6866
6562
4257
7595
0.15
5561
6564
6060
5857
4452
7494
9489
8581
5164
83100
500
033
3640
3841
4038
3831
4263
8632
3132
3227
3656
790.05
3540
4038
4139
3838
3142
6387
4240
3838
2943
6285
0.1
3744
4441
4240
3940
3143
6688
6360
5855
3552
7291
0.15
3845
4442
4242
3940
3144
6790
8885
8076
4462
8098
J-test5%
J-test10%
GMM-B
ICGMM-A
ICGMM-H
Qρxv
00.05
0.1
0.15
00.05
0.1
0.15
00.05
0.1
0.15
00.05
0.1
0.15
00.05
0.1
0.15
TN
γ4
250
044
5266
8046
5368
7744
5168
8952
5872
8146
5570
880.05
4755
7288
4956
7386
4655
7393
5360
7588
4856
7493
0.1
5561
7895
5863
7894
5060
7898
5964
7992
5360
7897
0.15
6874
86102
7375
86101
5769
83103
6673
84100
5969
84102
500
032
4155
6134
4355
5731
4261
8537
4758
6134
4463
790.05
3447
6474
3648
6369
3247
6688
3949
6467
3447
6682
0.1
4655
7186
4956
7182
3953
7392
4454
7078
4053
7389
0.15
6867
8197
7369
8195
4362
80100
5261
7892
4360
7998
5250
037
4562
7438
4661
7140
4968
8943
5066
7540
5069
860.05
4151
6782
4252
6678
4252
7092
4454
6879
4252
7089
0.1
4758
7491
5059
7488
4456
7495
4858
7387
4556
7494
0.15
6367
8297
6768
8296
4658
8099
5461
7993
4758
8098
500
028
3751
5129
3849
4630
4162
8431
4153
5030
4160
730.05
3144
5966
3244
5761
3144
6486
3344
5756
3144
6377
0.1
4152
7083
4453
6978
3248
7190
3749
6670
3348
7086
0.15
6565
7994
7066
7890
3352
7696
4155
7384
3353
7694
Table
2:RMSE
values
multiplied
by10
00forthesimulation
experim
entfrom
Section
6.2.
23
GMM-AIC and κn = 2.01(log log n) gives the GMM-HQ. Under certain assumptions, it can
be shown that both the GMM-BIC and GMM-HQ are consistent: they select the maximal
correctly specified estimator with probability approaching one in the limit. To implement
these criteria, we calculate the J-test based on the optimal, two-step GMM estimator with
a panel robust, heteroscedasticity-consistent, centered covariance matrix estimator for each
specification. To compare selection procedures we use the same simulation grid as above,
namely γ, σxv ∈ 0, 0.005, 0.01, . . . , 0.195, 0.20. Again, each point on the simulation grid is
calculated from 2000 simulation replications. The bottom panel of Table 2 presents results
for these alternative selection procedures. There is no clear winner in point-wise RMSE
comparisons between the GFIC and its competitors. A substantial difference between the
GFIC and its competitors emerges, however, when we examine worst-case RMSE. Here the
GFIC clearly dominates, providing the lowest worst-case RMSE across all configurations
of T and n. The differences are particularly stark for larger sample sizes. For example,
when T = 5 and n = 500 the worst-case RMSE of GFIC is approximately half that of its
nearest competitor: GMM-AIC. The consistent criteria, GMM-BIC and GMM-HQ, perform
particularly poorly in terms of worst-case RMSE. This is unsurprising given that the worst-
case risk of any consistent selection criteria diverges as sample size increases.11
7 Empirical Example: The Demand for Cigarettes
We now consider an empirical example illustrating the GFIC in the dynamic panel setting
introduced in Section 5. Our exercise is based on Baltagi et al. (2000) who study the demand
for cigarettes using panel data for 46 U.S. states between 1963 and 1992. Their model is
lnCit = γ lnCi,t−1 + θ lnPit + β1 lnYit + β2 lnPnit + ηi + λt + vit
where Cit is the number of packs of cigarettes sold per person aged 16 and above, Pit is the
real average retail price of a pack of cigarettes, Yit is per capita disposable income, Pnit is
the minimum average price of a pack of cigarettes in any state that neighbors state i, ηi is a
state fixed effect, and λt is a time fixed effect. The lagged dependent variable in this model
is meant to capture habit-persistence in cigarette consumption but it is the price elasticity
not the habit-persistence per se that is of primary interest.
Baltagi et al. (2000) consider an exhaustive list of possible estimators for two target
parameters, the short-run price elasticity θ and the long-run price elasticity θ/(1 − γ), and
explore how the resulting estimates vary. Here we consider selecting between four alternative
11See, e.g., Leeb and Potscher (2008).
24
estimators of the short-run price elasticity θ, as in the second simulation experiment from
Section 6. Each specification is estimated by TSLS in first differences, using the expressions
from section 5.12 For simplicity, we assume that the controls lnYit and lnPnit are exogenous
with respect to vit. We focus on two questions. First: what exogeneity assumption should
we impose on lnPit? Second: should we allow for habit persistence by estimating γ?
Our baseline specification, LP, estimates both γ and θ and assumes only that lnPit is
predetermined with respect to vit. This estimator uses the instrument set zit(`,P) with ` = 1
from Equation 17. For the purposes of this exercise we assume that LP is correctly specified.
The specification LS also estimates γ, but uses the expanded instrument set zit(`, S) with
` = 1 from equation 17. The additional instruments used in LS are only valid if lnPit is
strictly exogenous with respect to vit. Like LP and LS, the specifications P and S differ in
whether or not they impose that lnPit is predetermined or strictly exogenous. In contrast,
however, they set γ = 0 and estimate a model with no habit persistence (` = 0). This
increases the number of time periods available for estimation.
(a) 1975–1980 (T = 6)
LP LS P S
θ -0.68 -0.32 -0.28 -0.37Var. 0.16 0.02 0.07 0.01Bias2 — -4.20 0.01 -3.56GFIC 0.16 -4.18 0.08 -3.54GFIC+ 0.16 0.02 0.08 0.01
(b) 1975–1985 (T = 11)
LP LS P S
θ -0.30 -0.26 -0.38 -0.28Var. 0.06 0.01 0.05 0.01Bias2 — 2.21 0.03 1.29GFIC 0.06 2.22 0.08 1.30GFIC+ 0.06 2.22 0.08 1.30
Table 3: Estimates and GFIC values for the price elasticity of demand for cigarettes examplefrom Section 7 under four alternative specifications. Panel (a) presents results using data from1975–1980 while Panel (b) presents results using data from 1975–1985. GFIC+ gives an alternativeversion of the GFIC in which a negative squared bias estimate is set equal to zero.
Estimates and GFIC results for all specifications appear in Table 3. In Panel 3a we use
data from 1975–1980 only (T = 6). After first-differencing, this leaves 4 time periods for
estimation in specifications that include a lag (LP and LS) versus 5 for those that do not (P
and S). In panel 3b we use data from 1975–1985 (T = 11). After first-differencing this leaves
9 time periods for estimation without a lag versus 10 for estimation with a lag. We choose
to artificially shorten the time dimension of the panel from Baltagi et al. (2000) to illustrate
a key feature of the GFIC, namely that it takes into account the number of available time
12Our baseline specification is similar to the estimator that Baltagi et al. (2000) refer to as FD2SLS. Thereare two differences however. First, whereas they use lags of the exogenous controls lnYit and lnPnit, ourinstrument set follows Anderson and Hsiao (1982). Second, whereas Baltagi et al. (2000) appear to haveinadvertantly omitted the time dummies from their FD2SLS specification, we include them.
25
periods when selecting over parameter restrictions and moment conditions.13 Each column
in Table 3 refers to a particular specification: LP, LS, P or S. The first row of each panel
gives the associated estimate of the target parameter θ while the second gives the estimated
asymptotic variance of√n(θ− θ0), one of the two ingredients of the GFIC.14 Unsurprisingly
the asymptotic variance decreases with the number of time periods available for estimation:
for a given sample period LP and LS have a higher asymptotic variance than P and S, and
all the estimates based on the 1975–1985 sample are more precise than their counterparts
for the 1975-1980 sample. Moreover, for a given sample period, the estimators with a large
instrument set show a lower asymptotic variance: LS is more precisely estimated than LP
and S is more precisely estimated than P.
The third row of each panel gives our estimate of the squared asymptotic bias of the
various estimators of θ. The “—” entry for the LP estimator indicates that the GFIC is
constructed under the assumption that this specification has no asymptotic bias. Note that
the squared bias estimator is negative for LS and S in the 1975–1980 sample. This occurs
when the second term in the estimate of the bias matrix B from Corollary 3.5, namely ΨΩΨ′,
is larger than the first. Accordingly, we consider two alternative ways of constructing the
GFIC from the asymptotic variance and squared bias estimates. The first, labelled “GFIC,”
simply adds squared bias and variance. The second, labelled “GFIC+,” first truncates a
negative squared bias estimate to zero, and then adds the result to the variance estimate.
For the 1975–1980 sample period we find no evidence of appreciable bias in any of the three
“suspect” specifications: LS, P, and LP. In contrast, each of these has a substantially small
asymptotic variance than LP, so we would select either LS or S depending on whether we
prefer to use GFIC or GFIC+.15 In the 1975–1985 sample, the situation changes drastically.
Over this longer time period, the relative advantage of LS, P and S over LP in asymptotic
variance decreases substantially and we find evidence of substantial bias in both the LS and
S specifications. It appears that over these additional time periods, the assumption that
lnPit is strictly exogenous fails. Interestingly the difference in GFIC values between LP
and P in this longer sample is negligible. As P has a lower GFIC value than LP in the
1975–1980 sample, it appears that accounting for habit persistence is relatively unimportant
in estimating the short-run price elasticity of cigarette demand.
13Appendix C presents additional results, including the full-sample estimates, and some further discussion.14We estimate the asymptotic variance matrix as in Baltagi et al. (2000).15When GFIC and GFIC+ disagree, we prefer GFIC+ for the same reason that the positive-part Stein
estimator is preferred to the “plain-vanilla” Stein estimator.
26
8 Conclusion
This paper has introduced the GFIC, a proposal to choose moment conditions and parameter
restrictions based on the quality of the estimates they provide. The GFIC performs well in
simulations for our dynamic panel example. While we focus here on applications to panel
data, the GFIC can be applied to any GMM problem in which a minimal set of correctly
specified moment conditions identifies an unrestricted model. A possible extension of this
work would be to consider risk functions other than MSE, by analogy to Claeskens et al.
(2006) and Claeskens and Hjort (2008). Another possibility would be to derive a version of
the GFIC for GEL estimators. Although first-order equivalent to GMM, GEL estimators
often exhibit superior finite-sample properties and may thus improve the quality of the
selection criterion (Newey and Smith, 2004).
References
Anderson, T., Hsiao, C., 1982. Formulation and estimation of dynamic models using panel data.Journal of Econometrics 18, 47–82.
Andrews, D. W. K., Lu, B., 2001. Consistent model and moment selection procedures for GMMestimation with application to dynamic panel data models. Journal of Econometrics 101, 123–164.
Arellano, M., Bond, S., 1991. Some tests of specification for panel data: Monte carlo evidence andan application to employment equations. The Review of Economic Studies 58 (2), 277–297.
Baltagi, B. H., Griffin, J. M., Xiong, W., 2000. To pool or not to pool: Homogeneous versusheterogeneous estimators applied to cigarette demand. The Review of Economics and Statistics82, 117–126.
Caner, M., 2009. Lasso-type GMM estimator. Econometric Theory 25, 270–290.
Claeskens, G., Croux, C., Kerckhoven, J. V., 2006. Variable selection for logistic regression using aprediction-focused information criterion. Biometrics 62, 972–979.
Claeskens, G., Hjort, N. L., 2003. The focused information criterion. Journal of the AmericanStatistical Association 98 (464), 900–945.
Claeskens, G., Hjort, N. L., 2008. Minimizing average risk in regression models. Econometric Theory24, 493–527.
DiTraglia, F. J., 2016. Using invalid instruments on purpose: Focused moment selection and aver-aging for GMM. Journal of Econometrics 195 (2), 187–208.
Guggenberger, P., 2010. The impact of a Hausman pretest on the size of a hypothesis test: Thepanel data case. Journal of Econometrics 156, 337–343.
Hall, A. R., Peixe, F. P., 2003. A consistent method for the selection of relevant instruments.Econometric Reviews 22, 269–287.
27
Hansen, B. E., 2016. Efficient shrinkage in parametric models. Journal of Econometrics 190,115–132.
Hjort, N. L., Claeskens, G., 2003. Frequentist model average estimators. Journal of the AmericanStatistical Association 98 (464), 879–899.
Hong, H., Preston, B., Shum, M., 2003. Generalized empirical likelihood-based model selectioncriteria for moment condition models. Econometric Theory 19, 923–943.
Judge, G. G., Mittelhammer, R. C., 2007. Estimation and inference in the case of competing setsof estimating equations. Journal of Econometrics 138, 513–531.
Lai, T. L., Small, D. S., Liu, J., 2008. Statistical inference in dynamic panel data models. Journalof Statistical Panning and Inference 138, 2763–2776.
Leeb, H., Potscher, B. M., 2008. Sparse estimators and the oracle property, or the return of Hodges’estimator. Journal of Econometrics 142, 201–211.
Newey, W. K., Smith, R. J., 2004. Higher order properties of GMM and generalized empiricallikelihood estimators. Econometrica 72 (1), 219–255.
Pesaran, M. H., Shin, Y., Smith, R. P., 1999. Pooled mean group estimation of dynamic heteroge-neous panels. Journal of the American Statistical Association 94, 621–634.
Pesaran, M. H., Smith, R. P., 1995. Estimating long-run relationships from dynamic heterogeneouspanels. Journal of Econometrics 68, 79–113.
Roodman, D., 2009. A note on the theme of too many instruments. Oxford Bulletin of Economicsand Statistics 71 (1), 135–158.
Schorfheide, F., 2005. VAR forecasting under misspecification. Journal of Econometrics 128, 99–136.
Smith, R. J., 1992. Non-nested tests for competing models estimated by generalized method ofmoments. Econometrica 60 (4), 973–980.
Swamy, P. A. V. B., 1970. Efficient inference in a random coefficient regression model. Econometrica38, 311–323.
Yang, Y., 2005. Can the strengths of AIC and BIC be shared? a conflict between model identifica-tion and regression estimation. Biometrika 92 (4), 937–950.
Proofs
Proof of Theorem 2.1. By a mean-value expansion around (θ0, γ0),
√n(β(b, c)− β
(b)0
)= −K(b, c)Ξc
√nfn(θ0, γ0) + op(1)
28
and by Assumption 2.2,√nfn(θ0, γ0) −
√nE [f(Zni, θ0, γ0)]
d→ (N ′g ,N
′h )
′. Now, by a mean-valueexpansion around γn,
√nE [f(Zni, θ0, γ0)] =
√nE [f(Zni, θ0, γn)] +
√n∇γ′E [f(Zni, θ0, γ)] (γ0 − γn)
=
([0τ
]−∇γ′E [f(Zni, θ0, γ)] δ
)→([
0τ
]− Fγδ
)
so√nfn(θ0, γ0)
d→ N +(0′, τ ′)′−Fγδ. The result follows by the continuous mapping theorem.
Proof of Corollary 2.1. Since Ξc picks out only the components corresponding to g,
Ξc
([Ng
Nh
]+
[0τ
]− Fγδ
)= Ng −Gγδ.
Thus,√n(βv − β0
)d→ −Kv (Ng −Gγδ). Finally, by the definition of a matrix inverse,
KvGγδ =
([G′
θ
G′γ
]Wgg
[Gθ Gγ
])−1 [G′
θ
G′γ
]WggGγδ
=
[G′
θWggGθ G′θWggGγ
G′γWggGθ G′
γWggGγ
]−1 [G′
θWggGγ
G′γWggGγ
]δ =
[0δ
].
Proof of Corollary 3.2. For some γ between γ0 and γn = γ0 + δ/√n.
µn = ϕ(θ0, γ0) +∇γϕ(θ0, γ)′(γn − γ0) = µ0 +∇γϕ(θ0, γ)
′δ/√n
by a mean-value expansion. Hence,√n(µn − µ0) = ∇γϕ(θ0, γ)
′δ → ∇γϕ(θ0, γ0)′δ. The result
follows since√n (µ(b, c)− µn) =
√n (µ(b, c)− µ0)−
√n (µn − µ0).
Proof of Corollary 3.3. The result follows from Corollaries 2.1 and 3.2 since,
√n (µv − µn)
d→ ∇βϕ′0
[0δ
]−KvNg
−∇γϕ
′0δ
= −∇βϕ′0KvNg +
[∇θϕ
′0 ∇γϕ
′0
] [ 0δ
]−∇γϕ
′0δ = −∇β(γ0, θ0)
′KvNg.
Proof of Corollary 3.4. The result follows immediately from Corollary 2.1.
Proof of Lemma 3.1. By a mean-value expansion around β0 = (θ′0, γ′0)
′,
√nhn
(βv
)=
√nhn(β0) +H
√n(βv − β0
)+ op(1).
Now, since√nfn(θ0, γ0)
d→[
Ng
Nh
]+
[0τ
]−[
Gγ
Hγ
]δ
29
we have√nhn(θ0, γ0)
d→ Nh + τ −Hγδ. Substituting,
√nhn(βv)
d→ Nh + τ −Hγδ +H
(−KvNg +
[0δ
])= τ −HKvNg + Nh.
Proof of Corollary 3.5. Define (U ′, V ′)′ = (δ′, τ ′)′ + Ψ(N ′g ,N
′h )
′. By the Continuous MappingTheorem and Theorem 3.1, [
δτ
] [δ′ τ ′
]d→[UV
] [U ′ V ′ ]
The result follows since ΨΩΨ = E[(U ′, V ′)′(U ′, V ′)]− (δ′, τ ′)′(δ′, τ ′).
Proof of Theorem 4.1. To simplify the argument let ξ = (δ′, τ ′)′ and M = ξ + ΨN . Now, leta(ξ∗) and b(ξ∗) be quantiles of the distribution of Λ calculated under the assumption that the truebias parameter ξ is equal to ξ∗ such that Pa(ξ∗) ≤ Λ(ξ∗) ≤ b(ξ∗) = 1− α2. Define
amin(M ) = min a(ξ∗) : ξ∗ ∈ R(M |α1) , bmax(M ) = max b(ξ∗) : ξ∗ ∈ R(M |α1)
andR(M |α1) =
ξ∗ :
[(M − ξ∗)′
(ΨΩΨ′)−1
(M − ξ∗)]≤ χ2
p+q(α1)
where χ2p+q(α1) is the 1− α1 quantile of a χ2
p+q distribution. By Theorem 3.1 and Corollary 4.1,
Pµ0 ∈ CIsim → Pamin ≤ Λ(δ, τ) ≤ bmax.
Let A = ξ ∈ R(M |α1). By construction P (A) = 1−α1. Decomposing P a(ξ∗) ≤ Λ(ξ∗) ≤ b(ξ∗)into the sum of mutually exclusive events, we have
P [a(ξ∗) ≤ Λ(ξ∗) ≤ b(ξ∗) ∩A] + P [a(ξ∗) ≤ Λ(ξ∗) ≤ b(ξ∗) ∩Ac] = 1− α2
for every ξ ∈ R(M|α1). But since P [a(ξ∗) ≤ Λ(ξ∗) ≤ b(ξ∗) ∩Ac] ≤ P (Ac) = α1, we see thatP [a(ξ∗) ≤ Λ(ξ∗) ≤ b(ξ∗) ∩A] ≥ 1−α1 −α2, for every ξ∗ ∈ R(M |α1). By definition, if A occursthen ξ ∈ R(M |α1) and hence P [a(ξ) ≤ Λ(ξ) ≤ b(ξ) ∩A] ≥ 1−α1−α2. But when ξ∗ ∈ R(M |α1)we have amin ≤ a(ξ) and b(ξ) ≤ bmax. It follows that
a(ξ) ≤ Λ(ξ) ≤ b(ξ) ∩A ⊆ amin ≤ Λ(ξ) ≤ bmax
and therefore 1− α1 − α2 ≤ P [a(ξ) ≤ Λ(ξ) ≤ b(ξ) ∩A] ≤ P [amin ≤ Λ(ξ) ≤ bmax].
Proof of Theorem 5.1. This proof is standard, so we provide only a sketch. Expanding,
√n[β(k, ·)− β
]= (0,0′k−m, δ′)′ + Q(k, ·)
[n−1/2Z ′(k, ·)∆v
]√n[β(r, ·)− βr
]= Q(r, ·)
[n−1Z ′(r, ·)(Lk−m+1∆y+, . . . , Lk∆y+)δ
]+[n−1/2Z ′(r, ·)∆v+
]The result follows by applying the Lindeberg-Feller CLT to n−1/2Z ′(k, ·)∆v and n−1/2Z ′(r, ·)∆v+
and an appropriate law of large numbers to each element of n−1Z ′(r, ·)(Lk−m+1∆y+, . . . , Lk∆y+),where (·) is either P or S depending on the instrument set used.
30
Proof of Theorem 5.2. Expanding β(k,P) from Equation 19
n−1/2X ′[∆y −W (k)β(k,P)
]= n−1/2X ′
[∆v −W (k) Q(k,P)n−1Z ′(k,P)∆v
]=[−n−1X ′W (k) Q(k,P) IT−k−1
] [ n−1/2Z ′(k,P)∆v
n−1/2X ′∆v
]using ∆y = W (k)βn +∆v. Similarly, expanding β(k,P) from Equation 19,
√n[β(k,P)
]=[0 0′k−m δ′
]′+ Q(k,P)n−1/2Z ′(k,P)∆v
and since δ is√n times the last m elements of β(k,P) and the last m elements of β is zero, we
haveδ = δ +
[0 0′k−m ι′m
]′Q(k,P)n−1/2Z ′(k,P)∆v.
By a law of large numbers Q(k,P) →p Q(k,P) and n−1X ′W (k) →p ξ′ ⊗ ιT−k−1, hence[
δ − δτ
]= Ψ
[n−1/2Z ′(k,P)∆v
n−1/2X ′∆v
]+ op(1).
The result follows by applying the Lindeberg-Feller CLT jointly to n−1/2Z ′(k,P)∆v and n−1/2X ′∆v
and noting that Π maps [n−1/2Z ′(k, S)∆v] to[
n−1/2Z ′(k, P )∆v′
n−1/2X ′∆v′ ]′
.
31
Online Appendix
A Generalized Focused Information Criterion for GMM
Minsu Chang, Francis J. DiTragliaUniversity of Pennsylvania
A Additional Examples
A.1 Random Effects versus Fixed Effects Example
In this section we consider a simple example in which the GFIC is used to choose betweenand average over alternative assumptions about individual heterogeneity: Random Effectsversus Fixed Effects. For simplicity we consider the homoskedastic case and assume thatany strictly exogenous regressors, including a constant term, have been “projected out” sowe may treat all random variables as mean zero. To avoid triple subscripts in the notation,we further suppress the dependence of random variables on the cross-section dimension nexcept within statements of theorems. Suppose that
yit = βxit + vit (A.1)
vit = αi + εit (A.2)
for i = 1, . . . , n, t = 1, . . . , T where εit is iid across i, t with V ar(εit) = σ2ε and αi is iid across
i with V ar (αi) = σ2α. Stacking observations for a given individual over time in the usual
way, let yi = (yi1, . . . , yiT )′ and define xi,vi and εi analogously. Our goal in this example
is to estimate β, the effect of x on y. Although xit is uncorrelated with the time-varyingportion of the error term, Cov(xit, εit) = 0, we are unsure whether or not it is correlatedwith the individual effect αi. If we knew for certain that Cov(xit, αi) = 0, we would preferto report the “random effects” generalized least squares (GLS) estimator given by
βGLS =
(n∑
i=1
x′iΩ
−1xi
)−1( n∑i=1
x′iΩ
−1yi
)(A.3)
where Ω−1 is a preliminary consistent estimator of
Ω−1 = [V ar(vi)]−1 =
1
σ2ε
[IT − σ2
α
(Tσ2α + σ2
ε )ιT ι
′T
](A.4)
and IT denotes the T × 1 identity matrix and ιT a T -vector of ones. This estimator makesefficient use of the variation between and within individuals, resulting in an estimator witha lower variance. When Cov(xit, αi) 6= 0, however, the random effects estimator is biased.Although its variance is higher than that of the GLS estimator, the “fixed effects” estimator
Online Appendix - 1
given by
βFE =
(n∑
i=1
x′iQxi
)−1( n∑i=1
x′iQyi
), (A.5)
where Q = IT − ιT ι′T/T , remains unbiased even when xit is correlated with αi.The conventional wisdom holds that one should use the fixed effects estimator whenever
Cov(xit, αi) 6= 0. If the correlation between the regressor of interest and the individual effectis sufficiently small, however, the lower variance of the random effects estimator could morethan compensate for its bias in a mean-squared error sense. This is precisely the possibilitythat we consider here using the GFIC. In this example, the local mis-specification assumptiontakes the form
T∑t=1
E [xitαi] =τ√n
(A.6)
where τ is fixed, unknown constant. In the limit the random effects assumption thatCov(xit, αi) = 0 holds, since τ/
√n → 0. Unless τ = 0, however, this assumption fails to
hold for any finite sample size. An asymptotically unbiased estimator of τ for this exampleis given by
τ = (T σ2α + σ2
ε )
[1√n
n∑i=1
x′iΩ
−1(yi − xiβFE)
](A.7)
leading to the following result, from which we will construct the GFIC for this example.
Theorem A.1 (Fixed versus Random Effects Limit Distributions). Let (xni, αni, εni) be aniid triangular array of random variables such that V ar(εi|xni, αni) → σ2
εIT , E[x′niQεni] = 0,
and E [αiι′Txni] = τ/
√n for all n. Then, under standard regularity conditions, √
n(βRE − β)√n(βFE − β)
τ
d→
cτ0τ
, η2 η2 0η2 c2σ2 + η2 −cσ2
0 −cσ2 σ2
where η2 = E[x′
iΩ−1xi], c = E[x′
iQxi]/(Tσ2α + σ2
ε), and
σ2 =(Tσ2
α + σ2ε)
2
E [x′iΩ
−1xi]
(σ2ε
E [xiΩ−1xi]E [xiQxi]− 1
).
Proof of Theorem A.1. This proof is standard so we provide only a sketch. First, letAn = (n−1
∑ni=1 x
′iΩ
−1xi), Bn = (n−1∑n
i=1 x′iQxi), and Cn = T σ2
α + σ2ε . Now, expanding
βFE, βRE, and τ and re-arranging √n(βRE − β)√n(βFE − β)
τ
=
A−1n 00 B−1
n
Cn −CnAnB−1n
[ n−1/2∑n
i=1 x′iΩ
−1vi
n−1/2∑n
i=1 x′iQvi
].
The result follows by applying a law of large numbers to An, Bn, Cn, and Ω and the Lindeberg-Feller CLT jointly to n−1/2
∑ni=1 x
′iQvi and n
−1/2∑n
i=1 x′iΩ
−1vi.
Online Appendix - 2
We see from Theorem A.1 that AMSE(βRE) = c2τ 2 + η2, AMSE(βFE) = c2σ2 + η2,and τ 2−σ2 provides an asymptotically unbiased estimator of τ 2. Thus, substituting τ 2−σ2
for τ and rearranging the preceding AMSE expressions, the GFIC tells us that we shouldselect the random effects estimator whenever |τ | ≤
√2σ. To implement this rule in practice,
we construct a consistent estimator of σ2, for which we require estimators of σ2α, σ
2ε and
σ2v = V ar(αi + εit). We estimate these from the residuals
εit = (yit − yi)− (xit − xi)βFE; vit = yit − xitβOLS
where βOLS denotes the pooled OLS estimator of β, leading to the variance estimators
σ2α = σ2
v − σ2ε ; σ2
ε =1
n(T − 1)− 1
n∑i=1
T∑t=1
ε2it; σ2v =
1
nT − 1
n∑i=1
T∑t=1
v2it
Selection, of course, is a somewhat crude procedure: it is essentially an average that usesall-or-nothing weights. As a consequence, relatively small changes to the data could producediscontinuous changes in the weights, leading to a procedure with a high variance. Ratherthan selecting between the random effects and fixed effects estimators based on estimatedAMSE, an alternative idea is to consider a more general weighted average of the form
β(ω) = ωβFE + (1− ω)βRE
and for ω ∈ [0, 1] optimize the choice of ω to minimize AMSE. From Theorem A.1 we see thatthe AMSE-minimizing value of ω is ω∗ = (1 + τ 2/σ2)−1. Substituting our asymptoticallyunbiased estimator of τ 2 and our consistent estimator σ2 of σ2, we propose the followingplug-in estimator of ω∗
ω∗ =
[1 +
max τ 2 − σ2, 0σ2
]−1
where we take the maximum over τ 2 − σ2 and zero so that ω∗ is between zero and one.This proposal is related to the Frequentist Model Average estimators of Hjort and Claeskens(2003) as well as Hansen (2016), and DiTraglia (2016).
A.2 Slope Heterogeneity Example
Suppose we wish to estimate the average effect β of a regressor x in a panel setting wherethis effect may vary by individual: say βi ∼ iid over i. One idea is to simply ignore theheterogeneity and fit a pooled model. A pooled estimator will generally be quite precise,but depending on the nature and extent of heterogeneity could show a serious bias. Anotheridea is to apply the mean group estimator by running separate time-series regressions foreach individual and averaging the result over the cross-section (Pesaran et al., 1999; Pesaranand Smith, 1995; Swamy, 1970). This approach is robust to heterogeneity but may yieldan imprecise estimator, particularly in panels with a short time dimension. To see how the
Online Appendix - 3
GFIC navigates this tradeoff, consider the following DGP:
yit = βixit + εit (A.8)
βi = β + ηi, ηi ∼ iid(0, σ2η) (A.9)
where xit is uncorrelated with εit but is not assumed to be independent of ηi. As in thepreceding examples i = 1, . . . , n indexes individuals, t = 1, . . . , T indexes time periods,and we assume without loss of generality that all random variables are mean zero and anyexogenous controls have been projected out. For the purposes of this example, assume furtherthat εit is iid over both i and t with variance σ2
ε and that both error terms are homoskedastic:E[ε2it|xit] = σ2
ε and E[η2i |xit] = σ2η, and E[ηiεit|xit] = 0. Neither homoskedasticity nor time-
independent errors are required to apply the GFIC to this example, but these assumptionssimplify the exposition. We place no assumptions on the joint distribution of xit and ηi.
Stacking observations, let yi = (yi1, . . . , yiT )′ and define xi analogously. We consider two
estimators: the pooled OLS estimator βOLS and the mean-group estimator βMG
βOLS =
(n∑
i=1
x′ixi
)−1( n∑i=1
x′iyi
)(A.10)
βMG =1
n
n∑i=1
βi =1
n
n∑i=1
(x′ixi)
−1(x′
iyi) (A.11)
where βi denotes the OLS estimator calculated using observations for individual i only. Ifwe knew with certainty that there was no slope heterogeneity, we could clearly prefer βOLS
as it is both unbiased and has the lower variance. In the presence of heterogeneity, however,the situation is more complicated. If E[x′
ixiηi] 6= 0 then βOLS will show a bias whereas
βMG will not. To encode this idea within the local mis-specification framework, we takeE[x′
ixiηi] = τ/√n so that, for any fixed n the OLS estimator is biased unless τ = 0 but this
bias disappears in the limit. Turning our attention from bias to variance, we might expectthat βOLS would remain the more precise estimator in the presence of heterogeneity. In fact,however, this need not be the case: as we show below, βMG will have a lower variance thanβOLS if σ2
η is sufficiently large. To construct the GFIC for this example, we estimate the biasparameter τ by substituting the mean group estimator into the OLS moment condition:
τ =1√n
n∑i=1
x′i(yi − xiβMG). (A.12)
The key result needed to apply the GFIC in this this example gives the joint limiting distri-bution of τ , the mean-group estimator, and the OLS estimator.
Theorem A.2 (Limit Distribution of OLS and Mean-Group Estimators). Let (xni, ηni, εni)be an iid triangular array of random variables such that E[x′
niεni] = 0, V ar(εni|xni) → σ2εIT ,
V ar(ηni|xni) → σ2η, E[ηniεni|xni] → 0, and E[x′
nixniηni] = τ/√n. Then, under standard
Online Appendix - 4
regularity conditions,
√n(βOLS − β)√n(βMG − β)
τ
d→ N
τ/κ
0τ
,(λ2 + κ2
κ2
)σ2η +
σ2ε
κσ2η +
σ2ε
κ
(λ2
κ
)σ2η
σ2η + ζσ2
ε σ2ε(1− κζ)
λ2σ2η + κ(κζ − 1)σ2
ε
where κ = E[x′ixi], λ
2 = V ar (x′ixi), and ζ = E
[(x′
ixi)−1].
Proof of Theorem A.2. Expanding the definitions of the OLS and mean-group estima-tors,
√n(βOLS − β) =
[(n−1
∑ni=1 x
′ixi)
−1(n−1
∑ni=1 x
′ixi)
−1 ] [ n−1/2∑n
i=1 x′ixiηi
n−1/2∑n
i=1 x′iεi
]√n(βMG − β) = n−1/2
n∑i=1
[ηi + (x′
ixi)−1x′
iεi]
and proceeding similarly for τ ,
τ =[1 1 −n−1
∑ni=1 x
′ixi
] n−1/2∑n
i=1 x′ixiηi
n−1/2∑n
i=1 x′iεi
n−1/2∑n
i=1 ηi + (x′ixi)
−1x′iεi
.The result follows, after some algebra, by a LLN and the Lindeberg-Feller CLT.
As mentioned above, the OLS estimator need not have a lower variance than the mean-group estimator if σ2
η is sufficiently large. This fact follows as a corollary of Theorem A.2.
Corollary A.1. Under the conditions of Theorem A.2, the asymptotic variance of the OLSestimator is lower than that of the mean-group estimator if and only if λ2σ2
η < σ2ε(κ
2ζ − κ),
where κ = E[x′ixi], λ
2 = V ar (x′ixi), and ζ = E
[(x′
ixi)−1].
Note, as a special case of the preceding, that the OLS estimator is guaranteed to have thelower asymptotic variance when σ2
η = 0 since E[x′ix]
−1 < E[(x′ixi)
−1] by Jensen’s inequality.When σ2
η 6= 0, the situation is in general much more complicated. A simple normal example,however, provides some helpful intuition. Suppose that for a given individual the observationsxit are iid standard normal over t. Then x′
ixi ∼ χ2T , so that κ = T , λ2 = 2T and ζ = 1/(T−2),
provided of course that T > 2. Substituting these into Corollary A.1, the OLS estimatorwill have the lower asymptotic variance whenever (T − 2)σ2
η < σ2ε . All else equal, the shorter
the panel, the more likely that OLS will have the lower variance. But if σ2η is large enough,
the length of the panel becomes irrelevant: with enough slope heterogeneity, the mean-groupestimator has the advantage both in bias and variance.
To apply the GFIC in practice, we first need to determine whether the OLS estimatorhas the smaller asymptotic variance. This requires us to estimate the quantities λ2, κ, and ζfrom Theorem A.2 along with σ2
η and σ2ε . The following estimators are consistent under the
Online Appendix - 5
assumptions of Theorem A.2:
κ =1
n
n∑i=1
x′ixi ζ =
1
n
n∑i=1
(x′ixi)
−1
λ2 =1
n− 1
n∑i=1
(x′ixi − κ)2 σ2
ε =1
nT − 1
n∑i=1
T∑t=1
(yit − xitβOLS)2
σ2η =
Sb
n− 1− 1
n
n∑i=1
σ2ε (x
′ixi)
−1 Sb =n∑
i=1
β2i − n
(1
n
n∑i=1
βi
)2
If the estimated asymptotic variance of the mean-group estimator is lower than that of theOLS estimator, then there is no need to estimate AMSE: we should simply use the mean-group estimator. If this is not the case, then we construct the GFIC using the asymptoticallyunbiased estimator τ 2 − σ2
τ of τ 2, where σ2τ = λ2σ2
η + κ(κζ − 1)σ2ε is a consistent estimator of
the asymptotic variance of τ .
B Supplementary Simulation Results
B.1 Fixed vs. Random Effects Example
We employ a simulation design similar to that used by Guggenberger (2010), namely
yit = 0.5xit + αi + εit
where xi1xi2...xiTαi
iid∼ N
00...00
,1 ρ . . . ρ γρ 1 . . . ρ γ...
.... . .
......
ρ ρ . . . 1 γγ γ . . . γ 1
independently of (εi1, . . . , εiT )
′ ∼ iid N(0, σ2εIT ). In this design, γ controls the correlation
between xit and the individual effects αi, while ρ controls the persistence of xit over time.Larger values of γ correspond to larger violations of the assumption underlying the randomeffects estimator, increasing its bias. Larger values of ρ, on the other hand, decrease theamount of variation within individuals, thus increasing the variance of the fixed effectsestimator. Figures B.1 presents RMSE values for the random effects GLS estimator andfixed effects estimator along with those for the post-GFIC and averaging estimators describedabove in Section A.1 over a grid of values for γ, ρ and n with T = 2. Results for T = 5 appearin Figure B.2 of Appendix B. All calculations are based on 10,000 simulation replications.In the interest of space, we present only results for σ2
ε = 2.5 and a coarse parameter grid forρ here. Additional results are available upon request.
We see from Figure B.1 that, regardless of the configuration of the other parametervalues, there is always a range of values for γ for which the random effects estimator has a
Online Appendix - 6
0.0 0.2 0.4 0.6
0.2
0.3
0.4
0.5
0.6
0.7
N = 100, ρ = 0.3
γ
RMSE
GFICFEREAVG
0.0 0.2 0.4 0.6
0.2
0.3
0.4
0.5
0.6
N = 200, ρ = 0.3
γ
RMSE
GFICFEREAVG
0.0 0.2 0.4 0.6
0.1
0.2
0.3
0.4
0.5
0.6
N = 500, ρ = 0.3
γ
RMSE
GFICFEREAVG
0.0 0.2 0.4 0.6
0.2
0.3
0.4
0.5
0.6
0.7
N = 100, ρ = 0.6
γ
RMSE
GFICFEREAVG
0.0 0.2 0.4 0.6
0.2
0.3
0.4
0.5
0.6
0.7
N = 200, ρ = 0.6
γ
RMSE
GFICFEREAVG
0.0 0.2 0.4 0.6
0.1
0.2
0.3
0.4
0.5
0.6
0.7
N = 500, ρ = 0.6
γ
RMSE
GFICFEREAVG
0.0 0.2 0.4 0.6
0.2
0.3
0.4
0.5
0.6
0.7
0.8
N = 100, ρ = 0.9
γ
RMSE
GFICFEREAVG
0.0 0.2 0.4 0.6
0.2
0.3
0.4
0.5
0.6
0.7
N = 200, ρ = 0.9
γ
RMSE
GFICFEREAVG
0.0 0.2 0.4 0.6
0.1
0.2
0.3
0.4
0.5
0.6
0.7
N = 500, ρ = 0.9
γ
RMSE
GFICFEREAVG
Figure B.1: RMSE values for the Random vs. Fixed effects simulation example from Section B.1with T = 2 and σ2
ε = 2.5. Results are based on 10,000 simulation replications.
Online Appendix - 7
smaller RMSE than the fixed effects estimator. The width of this range increases as eitherthe number of individuals N or the number of time periods T decrease. It also increasesas the persistence ρ of xit increases. Indeed, when N and T are relatively small and ρ isrelatively large, the individual effects αi can be strongly correlated with xit and still resultin a random effects estimator with a lower RMSE than the fixed effects estimator. Thepost-GFIC estimator essentially “splits the difference” between the random and fixed effectsestimators. While it cannot provide a uniform improvement over the fixed effects estimator,the post-GFIC estimator performs well. When γ is not too large it can yield a substantiallylower RMSE than the fixed effects estimator. The gains are particularly substantial whenxit is relatively persistent and T relatively small, as is common in micro-panel datasets. Theaveraging estimator performs even better, providing a nearly uniform improvement over thepost-GFIC estimator. Only at very large values of γ does it yield a higher RMSE, and theseare points in the parameter space where the fixed effects, post-GFIC and averaging estimatorsare for all intents and purposes identical in RMSE. Results for T = 5 are qualitatively similar.See Figure B.2 of Appendix B for details. Note that in when T = 5, setting ρ = 0.3 violatespositive definiteness so we take ρ = 0.4 as our smallest value in this case.
The results we have discussed here focus on the comparison of the GFIC to the fixedeffects, random effects, and averaging estimators. One might also wonder how the GFICcompares to a Durbin-Hausman-Wu (DHW) pre-test estimator that reports the random
effects estimator unless the difference between βFE and βRE is sufficiently large. By anargument similar to that of DiTraglia (2016) Section 3.2, the GFIC in this particular exampleis essentially equivalent to a DHW pre-test estimator based on a particular significancelevel dictated by our desire to minimize an asymptotically unbiased estimator of AMSE. Assuch, a comparison of the GFIC to a DHW pre-test estimator based on the more standardsignificance levels 0.1 and 0.05 will be qualitatively similar to DiTraglia (2016) Figure 2.In particular, there is no choice of significance level for which one DHW-based pre-testestimator, including in this case the GFIC, uniformly dominates any other.
Online Appendix - 8
0.0 0.2 0.4 0.6
0.2
0.3
0.4
0.5
0.6
N = 100, ρ = 0.4
γ
RMSE
GFICFEREAVG
0.0 0.2 0.4 0.6
0.1
0.2
0.3
0.4
0.5
0.6
N = 200, ρ = 0.4
γRMSE
GFICFEREAVG
0.0 0.2 0.4 0.6
0.1
0.2
0.3
0.4
0.5
0.6
N = 500, ρ = 0.4
γ
RMSE
GFICFEREAVG
0.0 0.2 0.4 0.6
0.2
0.3
0.4
0.5
0.6
N = 100, ρ = 0.6
γ
RMSE
GFICFEREAVG
0.0 0.2 0.4 0.6
0.1
0.2
0.3
0.4
0.5
0.6
N = 200, ρ = 0.6
γ
RMSE
GFICFEREAVG
0.0 0.2 0.4 0.6
0.1
0.2
0.3
0.4
0.5
0.6
N = 500, ρ = 0.6
γ
RMSE
GFICFEREAVG
0.0 0.2 0.4 0.6
0.2
0.3
0.4
0.5
0.6
0.7
N = 100, ρ = 0.9
γ
RMSE
GFICFEREAVG
0.0 0.2 0.4 0.6
0.1
0.2
0.3
0.4
0.5
0.6
0.7
N = 200, ρ = 0.9
γ
RMSE
GFICFEREAVG
0.0 0.2 0.4 0.6
0.1
0.2
0.3
0.4
0.5
0.6
0.7
N = 500, ρ = 0.9
γ
RMSE
GFICFEREAVG
Figure B.2: RMSE for Random vs. Fixed effects example from Section B.1: T = 5, σ2ε = 2.5
B.2 Figures for Dynamic Panel Simulation
In this section we present figures to complement Table 2 from Section 6.2. Figure B.3colors each region of the parameter space according to which of the estimators of θ – LP,LS, P or S – yields the lowest finite-sample RMSE. The saturation of a color indicates therelative difference in RMSE of the lowest RMSE estimator at that point measured againstthe second lowest RMSE estimator. Darker indicates a larger advantage for the first-best
Online Appendix - 9
estimator while lighter values indicate a smaller advantage. Figure B.4 depicts the relativedifferent between the RMSE of the GFIC and that of the true specification, LP, expressed inpercentage points. Red indicates that the GFIC has the lower RMSE, blue that LP has thelower RMSE, and white that the RMSE values are the same. Darker colors indicate a largerdifference. Figure B.5 compares the RMSE of the GFIC to that of the oracle procedure thatuses whichever fixed specification – LP, LS, P, or S – yields the lowest finite sample MSEat a give point in the parameter space. As in Figure B.4, the comparison is one of relativeRMSE in percentage points. But, as the GFIC can by definition can never have a lowerfinite-sample MSE than the oracle estimator, the color scale used in this figure is different.The remaining figures in this section compare the RMSE of the GFIC to that of the otherselection procedures: GMM-AIC, GMM-BIC, etc.
T = 4 T = 5
N =
250N
= 500
0.00 0.05 0.10 0.15 0.20 0.00 0.05 0.10 0.15 0.20
0.00
0.05
0.10
0.15
0.20
0.00
0.05
0.10
0.15
0.20
ρxv
γ
LP
LS
P
S
Figure B.3: Minimum RMSE specification at each combination of parameter values for the simu-lation experiment from Section 6.2. Color saturation at a given grid point indicates RMSE relativeto second best specification.
Online Appendix - 10
T = 4 T = 5
N =
250N
= 500
0.00 0.05 0.10 0.15 0.20 0.00 0.05 0.10 0.15 0.20
0.00
0.05
0.10
0.15
0.20
0.00
0.05
0.10
0.15
0.20
ρxv
γ
−20
−10
0
10
Figure B.4: RMSE of the post-GFIC estimator relative to that of the true specification (LP) inthe dynamic panel simulation experiment from Section 6.2.
T = 4 T = 5
N =
250N
= 500
0.00 0.05 0.10 0.15 0.20 0.00 0.05 0.10 0.15 0.20
0.00
0.05
0.10
0.15
0.20
0.00
0.05
0.10
0.15
0.20
ρxv
γ
0
10
20
30
40
Figure B.5: RMSE of GFIC relative to Oracle Estimator in the Simulation from Section 6.2
Online Appendix - 11
T = 4 T = 5
N =
250N
= 500
0.00 0.05 0.10 0.15 0.20 0.00 0.05 0.10 0.15 0.20
0.00
0.05
0.10
0.15
0.20
0.00
0.05
0.10
0.15
0.20
ρxv
γ
−40
−20
0
Figure B.6: RMSE of GFIC relative to GMM-AIC in the Simulation from Section 6.2
T = 4 T = 5
N =
250N
= 500
0.00 0.05 0.10 0.15 0.20 0.00 0.05 0.10 0.15 0.20
0.00
0.05
0.10
0.15
0.20
0.00
0.05
0.10
0.15
0.20
ρxv
γ
−60
−40
−20
0
20
Figure B.7: RMSE of GFIC relative to GMM-BIC in the Simulation from Section 6.2
Online Appendix - 12
T = 4 T = 5
N =
250N
= 500
0.00 0.05 0.10 0.15 0.20 0.00 0.05 0.10 0.15 0.20
0.00
0.05
0.10
0.15
0.20
0.00
0.05
0.10
0.15
0.20
ρxv
γ
−60
−40
−20
0
20
Figure B.8: RMSE of GFIC relative to 5% Downward J-test in the Simulation from Section 6.2
T = 4 T = 5
N =
250N
= 500
0.00 0.05 0.10 0.15 0.20 0.00 0.05 0.10 0.15 0.20
0.00
0.05
0.10
0.15
0.20
0.00
0.05
0.10
0.15
0.20
ρxv
γ
−60
−40
−20
0
20
Figure B.9: RMSE of GFIC relative to 10% Downward J-test in the Simulation from Section 6.2
Online Appendix - 13
T = 4 T = 5
N =
250N
= 500
0.00 0.05 0.10 0.15 0.20 0.00 0.05 0.10 0.15 0.20
0.00
0.05
0.10
0.15
0.20
0.00
0.05
0.10
0.15
0.20
ρxv
γ
−50
−25
0
25
Figure B.10: RMSE of GFIC relative to GMM-HQ in the Simulation from Section 6.2
C Supplementary Results for the Empirical Example
Table C.1 presents additional empirical results to supplement those discussed in Section 7:C.1c is based on data for the 1963-1974 sub-sample (T = 12) while C.1d is based on datafor the full 1963–1992 sample (T = 30). The results in C.1a and C.1b are the same as thosein Table 3 and are included here for ease of comparison. Although there is some variationin the magnitudes of coefficient estimates across sub-samples, the basic pattern of resultsis similar. In each of the longer samples (T = 11, 12 or 30), the GFIC ranks specificationLP as the best and specification P as the second best. With enough time periods availablefor estimation, the reduction in variance from using specifications LS, P, and S becomesnegligible and is hence outweighed by any bias that they may induce.
Online Appendix - 14
(a) 1975–1980 (T = 6)
LP LS P S
θ -0.68 -0.32 -0.28 -0.37Var. 0.16 0.02 0.07 0.01Bias2 — -4.20 0.01 -3.56GFIC 0.16 -4.18 0.08 -3.54GFIC+ 0.16 0.02 0.08 0.01
(b) 1975–1985 (T = 11)
LP LS P S
θ -0.30 -0.26 -0.38 -0.28Var. 0.06 0.01 0.05 0.01Bias2 — 2.21 0.03 1.29GFIC 0.06 2.22 0.08 1.30GFIC+ 0.06 2.22 0.08 1.30
(c) 1963–1974 (T = 12)
LP LS P S
θ -0.31 -0.52 -0.16 -0.51Var. 0.03 0.00 0.04 0.00Bias2 — 1.15 0.61 1.28GFIC 0.03 1.15 0.66 1.28GFIC+ 0.03 1.15 0.66 1.28
(d) 1963–1992 (T = 30)
LP LS P S
θ -0.15 -0.38 -0.07 -0.38Var. 0.01 0.00 0.01 0.00Bias2 — 2.36 0.35 2.18GFIC 0.01 2.36 0.36 2.18GFIC+ 0.01 2.36 0.36 2.18
Table C.1: Estimates and GFIC values for the price elasticity of demand for cigarettes examplefrom Section 7 under four alternative specifications. GFIC+ gives an alternative version of theGFIC in which a negative squared bias estimate is set equal to zero.
Online Appendix - 15