An Averaging GMM Estimator Robust to Misspecification*
Xu Cheng†  Zhipeng Liao‡  Ruoyao Shi§

First Draft: August 2013; This Version: April 2016
Abstract
This paper studies the averaging GMM estimator that combines a conservative GMM estimator based on valid moment conditions and an aggressive GMM estimator based on both valid and possibly misspecified moment conditions, where the weight is the sample analog of an infeasible optimal weight. We establish asymptotic theory on uniform approximation of the upper and lower bounds of the finite-sample risk difference between the two estimators, which is used to show that the averaging estimator uniformly dominates the conservative estimator by reducing the risk under any degree of misspecification. Extending seminal results on the James-Stein estimator, the uniform dominance is established in non-Gaussian semiparametric nonlinear models. The simulation results support our theoretical findings. The proposed averaging estimator is applied to estimate the human capital production function in a life-cycle labor supply model.

Keywords: Asymptotic Risk, Finite-Sample Risk, Generalized Shrinkage Estimator, GMM, Misspecification, Model Averaging, Non-Standard Estimator, Uniform Approximation
*We thank Donald Andrews, Denis Chetverikov, Patrik Guggenberger, Jinyong Hahn, Jia Li, Rosa Matzkin, Hyungsik Roger Moon, Ulrich Mueller, Joris Pinkse, Frank Schorfheide, Shuyang Sheng, and seminar participants at Brown University, Duke University, University of Pennsylvania, Pennsylvania State University, University of California Los Angeles, Yale University, and the 2014 New York Area Econometrics Colloquium for helpful comments.
†Department of Economics, University of Pennsylvania, 3718 Locust Walk, Philadelphia, PA 19104, USA. Email: [email protected]
‡Department of Economics, UCLA, 8379 Bunche Hall, Mail Stop: 147703, Los Angeles, CA 90095. Email: [email protected]
§Department of Economics, UCLA, Mail Stop: 147703, Los Angeles, CA 90095. Email: [email protected]
1 Introduction
The parameter of interest $\theta_0 \in \mathbb{R}^{d_\theta}$ satisfies the moment conditions
$$E_F[g_1(W,\theta_0)] = 0 \quad (1.1)$$
for some known function $g_1(\cdot,\cdot): \mathcal{W}\times\Theta \to \mathbb{R}^{r_1}$, where $\Theta$ is a compact subset of $\mathbb{R}^{d_\theta}$, $W$ is a random vector with support $\mathcal{W}$ and joint distribution $F$, and $E_F[\cdot]$ denotes the expectation operator under $F$. Suppose we have i.i.d. data $\{W_i\}_{i=1}^n$, where $W_i$ has distribution $F$ for any $i=1,\ldots,n$.¹ Let $\overline{g}_1(\theta) = n^{-1}\sum_{i=1}^n g_1(W_i,\theta)$. The efficient GMM estimator of $\theta_0$ is
$$\widehat{\theta}_1 \equiv \arg\min_{\theta\in\Theta}\ \overline{g}_1(\theta)'(\widehat{\Omega}_1)^{-1}\overline{g}_1(\theta), \quad (1.2)$$
where $\widehat{\Omega}_1 = n^{-1}\sum_{i=1}^n g_1(W_i,\widetilde{\theta}_1)g_1(W_i,\widetilde{\theta}_1)' - \overline{g}_1(\widetilde{\theta}_1)\overline{g}_1(\widetilde{\theta}_1)'$ is the efficient weighting matrix based on some preliminary consistent estimator $\widetilde{\theta}_1$.² In a linear instrumental variable (IV) example, $Y_i = X_i'\theta_0 + U_i$, where the IV $Z_{1,i}\in\mathbb{R}^{r_1}$ satisfies $E_F[Z_{1,i}U_i]=0$. The moments in (1.1) hold with $g_1(W_i,\theta_0) = Z_{1,i}(Y_i - X_i'\theta_0)$, and $\theta_0$ is identified if $E_F[Z_{1,i}X_i']$ has full column rank. Under certain regularity conditions, it is well known that $\widehat{\theta}_1$ is consistent and achieves the lowest asymptotic variance among GMM estimators based on the moments in (1.1); see Hansen (1982).
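To fix ideas, the two-step construction in (1.2) can be sketched in the linear IV example. The following is a minimal numpy illustration under our own simulation design, not code from the paper; the function name `gmm_linear_iv` and all tuning choices are ours.

```python
import numpy as np

def gmm_linear_iv(Y, X, Z):
    """Two-step efficient GMM for Y = X'theta + U with instruments Z.

    Step 1: identity-weighted GMM gives a preliminary consistent estimate.
    Step 2: re-estimates with the efficient weighting matrix built from
    the centered second moment of the step-1 moment contributions.
    """
    n = len(Y)
    ZX, ZY = Z.T @ X / n, Z.T @ Y / n
    # Step 1: preliminary estimator with identity weighting matrix
    theta_tilde = np.linalg.solve(ZX.T @ ZX, ZX.T @ ZY)
    # Efficient weighting matrix: Omega_hat as in the display below (1.2)
    G = Z * (Y - X @ theta_tilde)[:, None]      # g1(W_i, theta_tilde), row by row
    Omega = G.T @ G / n - np.outer(G.mean(0), G.mean(0))
    # Step 2: efficient GMM (closed form in the linear model)
    W = np.linalg.inv(Omega)
    return np.linalg.solve(ZX.T @ W @ ZX, ZX.T @ W @ ZY)
```

In the just-identified or homoskedastic case the second step reproduces 2SLS; the gain from the efficient weighting matrix appears under heteroskedasticity.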
If one has additional moments $E_F[g^*(W_i,\theta_0)]=0$ for some known function $g^*(\cdot,\cdot): \mathcal{W}\times\Theta\to\mathbb{R}^{r^*}$, imposing them together with (1.1) can further reduce the asymptotic variance of the GMM estimator. However, if these additional moments are misspecified in the sense that $E_F[g^*(W_i,\theta_0)]\neq 0$, imposing $E_F[g^*(W_i,\theta_0)]=0$ may result in inconsistent estimation. The choice of moment conditions is routinely faced by empirical researchers. Take the linear IV model as an example. One typically starts with a large number of candidate IVs but is confident that only a small number of them, denoted by $Z_i$, are valid. The rest, denoted by $Z_i^*$, are valid only under certain economic hypotheses that are yet to be tested. In this example, $g^*(W_i,\theta_0) = Z_i^*(Y_i - X_i'\theta_0)$. In contrast to the conservative estimator $\widehat{\theta}_1$, an aggressive estimator $\widehat{\theta}_2$ always imposes $E_F[g^*(W_i,\theta_0)]=0$ regardless of its validity. Let $g_2(W_i,\theta)=(g_1(W_i,\theta)', g^*(W_i,\theta)')'$ and $\overline{g}_2(\theta)=n^{-1}\sum_{i=1}^n g_2(W_i,\theta)$. The aggressive estimator $\widehat{\theta}_2$ takes the form
$$\widehat{\theta}_2 \equiv \arg\min_{\theta\in\Theta}\ \overline{g}_2(\theta)'(\widehat{\Omega}_2)^{-1}\overline{g}_2(\theta), \quad (1.3)$$
¹The main theory of the paper can be easily extended to time series models with dependent data, as long as the preliminary results in Lemma A.1 hold.
²For example, $\widetilde{\theta}_1$ could be the GMM estimator similar to $\widehat{\theta}_1$ but with an identity weighting matrix; see (A.5) in the Appendix.
where $\widehat{\Omega}_2$ takes the same form as $\widehat{\Omega}_1$ except that $g_1(W_i,\theta)$ is replaced by $g_2(W_i,\theta)$.³
Because imposing $E_F[g^*(W_i,\theta_0)]=0$ is a double-edged sword, a data-dependent decision is usually made to choose between $\widehat{\theta}_1$ and $\widehat{\theta}_2$. To study such a decision and the subsequent estimator, let
$$\delta_0 \equiv E_F[g^*(W_i,\theta_0)] \in \mathbb{R}^{r^*}. \quad (1.4)$$
The pre-testing approach tests the null hypothesis $H_0: \delta_0 = 0$ and constructs an estimator
$$\widehat{\theta}_{\mathrm{pre}} \equiv 1\{T_n > c_\alpha\}\widehat{\theta}_1 + 1\{T_n \le c_\alpha\}\widehat{\theta}_2 \quad (1.5)$$
for some test statistic $T_n$ with critical value $c_\alpha$ at significance level $\alpha$. One popular test is the J-test (see Hansen, 1982), where $c_\alpha$ is the $1-\alpha$ quantile of the chi-squared distribution with $r_1 + r^* - d_\theta$ degrees of freedom. Because the power of this test against most fixed alternatives is 1, $\widehat{\theta}_{\mathrm{pre}}$ equals $\widehat{\theta}_1$ with probability approaching 1 ($n\to\infty$) for these fixed misspecified models where $\delta_0 \neq 0$. Thus, it seems that $\widehat{\theta}_{\mathrm{pre}}$ is immune to moment misspecification. However, in practice we care about the finite-sample mean squared error (MSE) of $\widehat{\theta}_{\mathrm{pre}}$, and this standard pointwise asymptotic analysis ($\delta_0$ fixed and $n\to\infty$) provides a poor approximation to it.⁴ To compare $\widehat{\theta}_{\mathrm{pre}}$ and $\widehat{\theta}_1$, the dashed line in Figure 1 plots the finite-sample ($n=250$) MSE of $\widehat{\theta}_{\mathrm{pre}}$, with the MSE of $\widehat{\theta}_1$ normalized to 1. For some values of $\delta_0$, the MSE of $\widehat{\theta}_{\mathrm{pre}}$ is larger than that of $\widehat{\theta}_1$, sometimes by 40%.
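The pre-test rule in (1.5) is mechanically simple once the two estimators and the J statistic are in hand. A schematic sketch follows; the helper names are ours, and the critical value is passed in directly (it is the $1-\alpha$ chi-squared quantile with $r_1 + r^* - d_\theta$ degrees of freedom).

```python
import numpy as np

def j_statistic(gbar, Omega, n):
    """Hansen's J statistic: n * gbar' Omega^{-1} gbar, where gbar is the
    sample mean of the stacked moments evaluated at the aggressive estimate."""
    return float(n * gbar @ np.linalg.solve(Omega, gbar))

def pretest_estimator(theta1, theta2, J, c_alpha):
    """Pre-test rule (1.5): keep the aggressive estimator theta2 only if
    the J-test fails to reject the validity of the extra moments."""
    return theta2 if J <= c_alpha else theta1
```

The discontinuous switch between the two estimators is exactly what produces the risk bump of the dashed line in Figure 1.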
The goal of this paper is twofold. First, we propose a data-dependent averaging of $\widehat{\theta}_1$ and $\widehat{\theta}_2$ that takes the form
$$\widehat{\theta}_{\mathrm{eo}} = (1-\widetilde{\omega}_{\mathrm{eo}})\widehat{\theta}_1 + \widetilde{\omega}_{\mathrm{eo}}\widehat{\theta}_2, \quad (1.6)$$
where $\widetilde{\omega}_{\mathrm{eo}} \in [0,1]$ is a data-dependent weight specified in (4.7) below. The subscript in $\widetilde{\omega}_{\mathrm{eo}}$ is short for "empirical optimal" because this weight is an empirical analog of an infeasible optimal weight $\omega^*$ defined in (4.3) below. We plot the finite-sample MSE of this averaging estimator as the solid line in Figure 1. This averaging estimator is robust to misspecification in the sense that the solid line is below 1 for all values of $\delta_0$, in contrast to the bump in the dashed line that represents the pre-test estimator. Second, we develop a uniform asymptotic theory to justify the finite-sample robustness of this averaging estimator. We show that this averaging estimator dominates the conservative estimator uniformly over a large class of models with different degrees
³See the first line of equations (A.10) for the definition of $\widehat{\Omega}_2$.
⁴The poor approximation of the pointwise asymptotics to the finite-sample properties of the pre-test estimator has been noted in Shibata (1986), Pötscher (1991), Kabaila (1995, 2005), and Leeb and Pötscher (2005, 2008), among others.
Figure 1. Finite-Sample ($n=250$) MSEs of the Pre-test and the Averaging GMM Estimators
Note: The pre-test estimator is based on the J-test with a nominal size of 0.01. The location parameter on the horizontal axis is $\delta_0$ multiplied by $n^{1/2}$. Details of the simulation design for Figure 1 are provided in Subsection 5.1.
of misspecification.⁵ The standard asymptotic theory is pointwise and fails to reveal the fragile nature of the pre-test estimator. A stronger uniform notion of robustness is crucial for this model. Furthermore, we quantify the upper and lower bounds of the asymptotic risk differences between the averaging estimator and the conservative estimator.⁶
The rest of the paper is organized as follows. Section 2 discusses the literature related to our paper. Section 3 defines the parameter space over which the uniform result is established and defines uniform dominance. Section 4 introduces the averaging weight. Section 5 provides an analytical representation of the bounds of the asymptotic risk differences and applies it to show that the averaging GMM estimator uniformly dominates the conservative estimator. Section 6 investigates the finite-sample performance of our averaging estimator, and Section 7 applies it to estimate the human capital production function in a life-cycle labor supply model. Section 8 concludes. Proofs and technical arguments are given in the Appendix.

Notation. Let $C > 1$ be a generic positive constant whose value does not change with the sample size. For any real matrix $A$, we use $\|A\|$ to denote the Frobenius norm of $A$. We use the sup norm on the space of real functions. If $A$ is a real square matrix, we use $\mathrm{tr}(A)$ to denote the trace of $A$, and $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$ to denote the smallest and largest eigenvalues of $A$, respectively. For any positive integers $d_1$ and $d_2$, $I_{d_1}$ and $0_{d_1\times d_2}$ stand for the $d_1\times d_1$ identity matrix and the $d_1\times d_2$ zero matrix, respectively. Let $\mathrm{vec}(\cdot)$ denote the vectorization of a matrix and $\mathrm{vech}(\cdot)$ the half
⁵The uniform dominance is established under the truncated weighted loss function, which is defined in (3.7) below.
⁶The lower and upper bounds of the asymptotic risk difference are defined in (3.9) below.
vectorization of a symmetric matrix. For any (possibly random) positive sequences $\{a_n\}_{n=1}^\infty$ and $\{b_n\}_{n=1}^\infty$, $a_n = O_p(b_n)$ means that $\lim_{c\to\infty}\limsup_n \Pr(a_n/b_n > c) = 0$; $a_n = o_p(b_n)$ means that for all $\varepsilon > 0$, $\lim_{n\to\infty}\Pr(a_n/b_n > \varepsilon) = 0$. Let "$\to_p$" and "$\to_d$" stand for convergence in probability and convergence in distribution, respectively.
2 Related Literature
In this section, we discuss some related literature. Our uniform dominance result is related to Stein's phenomenon (Stein, 1956) in parametric models. The James-Stein (JS) estimator (James and Stein, 1961) is shown to dominate the maximum likelihood estimator in exact normal sampling. Hansen (2016) considers local asymptotic analysis of the JS-type averaging estimator in general parametric models and substantially extends its application in econometrics. The present paper focuses on the uniformity issue and studies Stein's phenomenon in non-Gaussian semiparametric nonlinear models. The proposed averaging estimator does not simply mimic the JS-type estimator for moment-based models, and we find that its finite-sample risk compares favorably to that of the latter. The asymptotic results are developed along drifting sequences of data generating processes (DGPs) with different degrees of misspecification. This class of DGPs includes the crucial $n^{-1/2}$ local sequences considered by Hjort and Claeskens (2003), Liu (2015), Hansen (2014, 2015, 2016), and DiTraglia (2014) for various averaging estimators, as well as some more distant sequences. The theoretical results glue all sequences together and show that they are sufficient to provide a uniform approximation of the finite-sample risk differences. The proof uses the techniques developed in Andrews and Guggenberger (2010) and Andrews, Cheng, and Guggenberger (2011) for uniform size control for inference in non-standard problems.
Measured by the MSE, the post-model-selection estimator based on a consistent model selection procedure usually does better than the unrestricted estimator in part of the parameter space and worse in other parts. One standard example is the Hodges estimator, whose scaled maximal MSE diverges to infinity as the sample size grows (see, e.g., Lehmann and Casella, 1998). Similar unbounded risk results for other post-model-selection estimators are established in Yang (2005) and Leeb and Pötscher (2008). The post-model-selection estimator has unbounded (scaled) maximal MSE because it is based on a non-smooth transition rule between the restricted and unrestricted estimators, and a consistent model selection procedure is employed in the transition rule. In contrast, the averaging estimator proposed in this paper is based on a smooth combination of the restricted and unrestricted estimators, and no model selection procedure is used in the smooth combination. Hence our averaging estimator is essentially different
5
from the post-model-selection estimator and the uniform dominance result established in this paper
does not contradict the unbounded risk property of the post-model-selection estimator found in
Yang (2005) and Leeb and Pötscher (2008).
The estimator proposed in this paper is a frequentist model averaging (FMA) estimator. FMA estimators have received much attention in recent years. Important works include Buckland, Burnham, and Augustin (1997), Hjort and Claeskens (2003, 2006), Leung and Barron (2006), Claeskens and Carroll (2007), Hansen (2007, 2008), Hansen and Racine (2012), Cheng and Hansen (2014), and Lu and Su (2015), to name only a few. Uniform asymptotic properties are important for frequentist estimators when standard pointwise asymptotic properties fail to capture their finite-sample behaviors. Our paper provides a uniform asymptotic framework to compare different FMA estimators in moment-based models.
Recently, DiTraglia (2014) and Hansen (2015) both consider averaging estimators that combine the ordinary least squares (OLS) estimator and the two-stage least squares (2SLS) estimator in linear IV models. In linear IV models with homoskedastic errors, our conservative estimator becomes the 2SLS estimator, and our aggressive estimator, which uses both the IVs and the endogenous variables, becomes the OLS estimator.⁷ However, our averaging weight is different from those in DiTraglia (2014) and Hansen (2015), and we study it in a different asymptotic framework.
There is a large literature studying the validity of GMM moment conditions. Many methods can be applied to detect validity, including over-identification tests (see, e.g., Sargan, 1958; Hansen, 1982; and Eichenbaum, Hansen, and Singleton, 1988), information criteria (see, e.g., Andrews, 1999; Andrews and Lu, 2003; Hong, Preston, and Shum, 2003), and penalized estimation methods (see, e.g., Liao, 2013, and Cheng and Liao, 2014). Recently, misspecified moments and their consequences have been considered by Ashley (2009), Berkowitz, Caner, and Fang (2012), Conley, Hansen, and Rossi (2012), Doko Tchatoka and Dufour (2008, 2014), Guggenberger (2012), Nevo and Rosen (2012), Kolesar, Chetty, Friedman, Glaeser, and Imbens (2014), Small (2007), and Small, Cai, Zhang, and Hyunseung (2015), among others. Moon and Schorfheide (2009) explore over-identifying moment inequalities to reduce the MSE. This paper contributes to this literature by providing new uniform results for potentially misspecified semiparametric models.
⁷Consider the linear IV model $Y_i = X_i'\theta_0 + u_i$ with instruments $Z_i$. The aggressive estimator is equivalent to the OLS estimator because $(X'P_{[X,Z]}X)^{-1}X'P_{[X,Z]}Y = (X'X)^{-1}X'Y$, where $Y=(Y_1,\ldots,Y_n)'$, $X=(X_1,\ldots,X_n)'$, $Z=(Z_1,\ldots,Z_n)'$, and $P_{[X,Z]} = (X,Z)[(X,Z)'(X,Z)]^{-1}(X,Z)'$ denotes the projection matrix.
3 Parameter Space and Uniform Dominance
We assume $g_2(w,\theta)$ is twice continuously differentiable with respect to (wrt) $\theta$ for any $w\in\mathcal{W}$. The first- and second-order derivatives are denoted by $g_{2\theta}(w,\theta)$ and $g_{2\theta\theta}(w,\theta)$, respectively. Let $\mathcal{F}$ be a set of distribution functions of $W$. For $k=1$ and $2$, define the Jacobian matrix and the variance-covariance matrix
$$G_{k,F}(\theta) \equiv E_F[g_{k,\theta}(W,\theta)] \in \mathbb{R}^{r_k\times d_\theta} \quad\text{and}\quad \Omega_{k,F}(\theta) \equiv \mathrm{Var}_F[g_k(W,\theta)] \in \mathbb{R}^{r_k\times r_k}, \quad (3.1)$$
for any $\theta\in\Theta$ and any $F\in\mathcal{F}$. These moments exist by Assumption 3.2 below. We consider the risk difference between two estimators uniformly over $F\in\mathcal{F}$ satisfying Assumptions 3.1-3.3 below. Let $G_k \equiv G_{k,F}(\theta_0)$, $\Omega_k \equiv \Omega_{k,F}(\theta_0)$, and $Q_F(\theta) \equiv E_F[g_2(W,\theta)]'\Omega_2^{-1}E_F[g_2(W,\theta)]$ for any $\theta\in\Theta$. For any $\theta\in\Theta$, define $B_\varepsilon^c(\theta) = \{\theta'\in\Theta : \|\theta'-\theta\| \ge \varepsilon\}$.
Assumption 3.1 The following conditions hold for any $F\in\mathcal{F}$.
(i) $E_F[g_1(W,\theta_0)] = 0$ for some $\theta_0 \in \mathrm{int}(\Theta)$.
(ii) For any $\varepsilon > 0$ there is $\delta_{1,\varepsilon} > 0$ such that $\inf_{\theta\in B_\varepsilon^c(\theta_0)} \|E_F[g_1(W,\theta)]\| \ge \delta_{1,\varepsilon}$.
(iii) For any $\varepsilon > 0$ there is $\delta_{2,\varepsilon} > 0$ such that $\inf_{\theta\in B_\varepsilon^c(\theta^*)} Q_F(\theta) - Q_F(\theta^*) \ge \delta_{2,\varepsilon}$ for some $\theta^* \in \mathrm{int}(\Theta)$.
(iv) $\|G_2'\Omega_2^{-1}\delta_{2,0}\| \ge C^{-1}\|\delta_{2,0}\|$, where $\delta_{2,0} = (0_{1\times r_1}, \delta_0')'$.

Assumptions 3.1.(i)-(ii) require that the true unknown parameter $\theta_0$ is identified by the moment conditions $E_F[g_1(W,\theta_0)] = 0$. Assumption 3.1.(iii) implies that a pseudo-true value $\theta^*$ is identified as the minimizer of the GMM criterion $Q_F(\theta)$ under misspecification. Assumption 3.1.(iv) requires that $\delta_{2,0}$ does not lie in the space spanned by the right eigenvectors associated with the zero eigenvalues of the matrix $G_2'\Omega_2^{-1}$, which rules out the special case that $\theta_0$ may be consistently estimable even with severely misspecified moment conditions.
Assumption 3.2 For some $\gamma > 0$, the following conditions hold for any $F\in\mathcal{F}$:
(i) $E_F[\sup_{\theta\in\Theta}(\|g_2(W,\theta)\|^{2+\gamma} + \|\mathrm{vec}(g_{2\theta}(W,\theta))\|^{2+\gamma} + \|\mathrm{vec}(g_{2\theta\theta}(W,\theta))\|^{2+\gamma})] \le C$.
(ii) For $k=1,2$, $C^{-1} \le \lambda_{\min}(\Omega_k) \le \lambda_{\max}(\Omega_k) \le C$.
(iii) For $k=1,2$, $C^{-1} \le \lambda_{\min}(G_k'G_k) \le \lambda_{\max}(G_k'G_k) \le C$.

Assumption 3.2 imposes moment conditions and non-singularity conditions on the moments, the Jacobian matrix, and the covariance matrix in the GMM estimation.
Uniform dominance is of interest only if we allow for different degrees of misspecification in the parameter space. If we only allow for correctly specified models or severely misspecified models, the desired dominance results hold trivially following a pointwise analysis. The next assumption
states that the parameter space contains a continuous perturbation from correctly specified models to misspecified models and that it forms a closed set. Write
$$v_0(\cdot) \equiv \big(\mathrm{vec}[G_{2,F}(\cdot)]',\ \mathrm{vech}[\Omega_{2,F}(\cdot)]',\ [M_{2,F}(\cdot) - M_{2,F}(\theta_0)]'\big)', \quad (3.2)$$
where $M_{2,F}(\theta) \equiv E_F[g_2(W,\theta)]$. Note that $G_{2,F}(\cdot)$, $\Omega_{2,F}(\cdot)$, and $M_{2,F}(\cdot)$ are all functions indexed by $\theta$ for any $F$. Let
$$\Lambda \equiv \{(\delta_0, v_0(\cdot)) : \theta\in\Theta \text{ and } F\in\mathcal{F}\}, \quad (3.3)$$
where $\delta_0$ and $v_0(\cdot)$ depend on $F$; $\delta_0$ characterizes the degree of misspecification, and $v_0(\cdot)$ collects the Jacobian matrix, the variance-covariance matrix, and the moments evaluated at each $\theta$.⁸
Assumption 3.3 (i) For some $\varepsilon > 0$, $(\delta, v(\cdot)) \in \Lambda$ with $0 < \|\delta\| < \varepsilon$ exists, and it implies that $(\widetilde{\delta}, v(\cdot)) \in \Lambda$ for all $\widetilde{\delta}\in\mathbb{R}^{r^*}$ with $0 \le \|\widetilde{\delta}\| < \varepsilon$; (ii) $\Lambda$ is a closed set.

Example 3.1 (Linear IV Model) We study a simple linear IV model and provide a set of simple
conditions that imply Assumptions 3.1, 3.2, and 3.3. The parameter of interest $\theta_0$ is the coefficient of the endogenous regressor $X_i$ in
$$Y_i = X_i'\theta_0 + U_i, \quad\text{where } E_{F^*}[Z_{1,i}U_i] = 0, \quad (3.4)$$
for some valid instruments $Z_{1,i}\in\mathbb{R}^{r_1}$. In addition, we have some potentially misspecified IVs $Z_i^*\in\mathbb{R}^{r^*}$. In a reduced-form equation obtained by projecting $Z_i^*$ on $U_i$, we can write
$$Z_i^* = U_i\delta_0 + V_i, \quad\text{where } E_{F^*}[U_iV_i] = 0, \quad (3.5)$$
and $\delta_0$ characterizes the degree of misspecification. These additional IVs are valid only if $\delta_0 = 0$. In this example, $E_F[g_1(W_i,\theta)] = E_F[Z_{1,i}(Y_i - X_i'\theta)]$ and $E_F[g^*(W_i,\theta)] = E_F[Z_i^*(Y_i - X_i'\theta)]$.

Let $F^*$ denote the joint distribution of $(U_i, V_i', Z_{1,i}', X_i')$ and $\mathcal{F}^*$ denote a class of distributions containing $F^*$. Let $Z_{2,i} = (Z_{1,i}', Z_i^{*\prime})'$. The Jacobian and the variance-covariance matrices are
$$G_k = -E_{F^*}[Z_{k,i}X_i'] \quad\text{and}\quad \Omega_k = E_{F^*}[U_i^2 Z_{k,i}Z_{k,i}'] - E_{F^*}[U_iZ_{k,i}]E_{F^*}[U_iZ_{k,i}'], \quad (3.6)$$
respectively. Note that $E_{F^*}[U_iZ_{k,i}] \neq 0$ if $\delta_0 \neq 0$. Without loss of generality, we assume $E_{F^*}[U_i^2] = 1$ so that $E_F[g^*(W_i,\theta_0)] = E_{F^*}[Z_i^*U_i] = \delta_0$, as in (1.4). Given the linear structures in (3.4) and
⁸For ease of notation, the dependence of $\delta_0$ and $v_0$ on $F$ is omitted.
(3.5), the joint distribution of $W_i = (Y_i, Z_{1,i}', Z_i^{*\prime}, X_i')$, denoted by $F$, is determined by $\theta_0$, $\delta_0$, and $F^*$.
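A sample from the design (3.4)-(3.5) can be drawn directly. The sketch below uses our own choices for $r_1$, the first-stage coefficients, and the error distributions, which the example leaves unspecified; the point is that the correlation between $Z_i^*$ and the structural error $U_i$ is exactly $\delta_0$ once $E[U_i^2]$ is normalized to 1.

```python
import numpy as np

def simulate_linear_iv(n, theta0, delta0, rng):
    """Draw one sample from (3.4)-(3.5): valid IVs Z1, and potentially
    invalid IVs Z* = U delta0' + V whose covariance with U equals delta0."""
    r1, rs = 3, len(delta0)                       # r1 is our choice, not the paper's
    Z1 = rng.normal(size=(n, r1))
    U = rng.normal(size=n)                        # E[U^2] = 1 normalization
    V = rng.normal(size=(n, rs))
    Zstar = U[:, None] * np.asarray(delta0) + V   # reduced form (3.5)
    X = Z1 @ np.full(r1, 0.5) + 0.5 * U + rng.normal(size=n)  # endogenous regressor
    Y = X * theta0 + U                            # structural equation (3.4)
    return Y, X, Z1, Zstar
```

Setting `delta0` to a vector of zeros recovers the correctly specified case in which the aggressive estimator is consistent.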
Assumption 3.4 For some $\gamma > 0$, the following conditions hold for any $F^*\in\mathcal{F}^*$:
(i) $E_{F^*}[Z_{1,i}U_i] = 0$ and $E_{F^*}[V_iU_i] = 0$.
(ii) For $k=1,2$, $\lambda_{\min}(G_k'G_k) \ge C^{-1}$ and $\lambda_{\min}(\Omega_k) \ge C^{-1}$.
(iii) $E_{F^*}[\|X_i\|^{4+\gamma} + \|Z_{1,i}\|^{4+\gamma} + \|Z_i^*\|^{4+\gamma} + U_i^{4+\gamma}] \le C$.
(iv) $\|G_2'\Omega_2^{-1}\delta_{2,0}\| \ge C^{-1}\|\delta_{2,0}\|$, where $\delta_{2,0} = (0_{1\times r_1}, \delta_0')'$.
(v) $\{G_2 : F^*\in\mathcal{F}^*\}$ and $\{\Omega_2 : F^*\in\mathcal{F}^*\}$ are closed sets.
Let $\theta^* = (G_2'\Omega_2^{-1}G_2)^{-1}G_2'\Omega_2^{-1}E_{F^*}[Z_{2,i}Y_i]$, which is well defined under Assumptions 3.4.(ii)-(iii).

Lemma 3.1 Let $\mathcal{F}$ be the collection of $F$ such that $W_i$ is generated by the linear model (3.4) and (3.5) with: (i) $\theta_0, \theta^* \in \mathrm{int}(\Theta)$ and $\Theta$ is compact; (ii) $\delta_0 \in [0,C]^{r^*} = [0,C]\times\cdots\times[0,C]$ for some $C > 0$; and (iii) $\mathcal{F}^*$ satisfies Assumption 3.4. Then $\mathcal{F}$ satisfies Assumptions 3.1, 3.2, and 3.3.

For the linear IV model, Lemma 3.1 provides simple conditions on $\theta_0$, $\delta_0$, and $\mathcal{F}^*$ under which the uniformity results are subsequently established.
Now we return to the general setup. For a generic estimator $\widehat{\theta}$, consider a weighted quadratic loss function
$$\ell(\widehat{\theta}) \equiv n(\widehat{\theta} - \theta_0)'\Upsilon(\widehat{\theta} - \theta_0), \quad (3.7)$$
where $\Upsilon$ is a $d_\theta\times d_\theta$ pre-determined positive semi-definite matrix. For example, if $\Upsilon = I_{d_\theta}$, $E_F[\ell(\widehat{\theta})]$ is the MSE of $\widehat{\theta}$. If $\Upsilon = (\Sigma_1-\Sigma_2)^{-1}$, the weighting matrix $\Upsilon$ first rescales $\widehat{\theta}$ by the scale of the variance reduction. If $\Upsilon = E_F[X_iX_i']$ for regressors $X_i$, $E_F[\ell(\widehat{\theta})]$ is the MSE of $X_i'\widehat{\theta}$, an estimator of $X_i'\theta_0$.
Below we compare the averaging estimator $\widehat{\theta}_{\mathrm{eo}}$ and the conservative estimator $\widehat{\theta}_1$. We are interested in the bounds of the finite-sample risk difference
$$\underline{\mathrm{RD}}_n(\widehat{\theta}_{\mathrm{eo}},\widehat{\theta}_1) \equiv \inf_{F\in\mathcal{F}} E_F[\ell(\widehat{\theta}_{\mathrm{eo}}) - \ell(\widehat{\theta}_1)], \qquad \overline{\mathrm{RD}}_n(\widehat{\theta}_{\mathrm{eo}},\widehat{\theta}_1) \equiv \sup_{F\in\mathcal{F}} E_F[\ell(\widehat{\theta}_{\mathrm{eo}}) - \ell(\widehat{\theta}_1)]. \quad (3.8)$$
We investigate these finite-sample objects by asymptotic analysis, with $n^{1/2}(\widehat{\theta} - \theta_0)$ in $\ell(\widehat{\theta})$ replaced by its asymptotic distribution. To apply the bounded convergence theorem, we approximate $\ell(\widehat{\theta})$ with the trimmed loss $\ell_\zeta(\widehat{\theta}) \equiv \min\{\ell(\widehat{\theta}), \zeta\}$ and consider arbitrarily large trimming ($\zeta\to\infty$). As
such, the finite-sample bounds in (3.8) are approximated by
$$\underline{\mathrm{AsyRD}}(\widehat{\theta}_{\mathrm{eo}},\widehat{\theta}_1) \equiv \lim_{\zeta\to\infty}\liminf_{n\to\infty}\inf_{F\in\mathcal{F}} E_F[\ell_\zeta(\widehat{\theta}_{\mathrm{eo}}) - \ell_\zeta(\widehat{\theta}_1)] \quad\text{and}$$
$$\overline{\mathrm{AsyRD}}(\widehat{\theta}_{\mathrm{eo}},\widehat{\theta}_1) \equiv \lim_{\zeta\to\infty}\limsup_{n\to\infty}\sup_{F\in\mathcal{F}} E_F[\ell_\zeta(\widehat{\theta}_{\mathrm{eo}}) - \ell_\zeta(\widehat{\theta}_1)], \quad (3.9)$$
which are called the lower and upper bounds of the asymptotic risk difference, respectively, in this paper. The averaging estimator $\widehat{\theta}_{\mathrm{eo}}$ asymptotically uniformly dominates the conservative estimator $\widehat{\theta}_1$ if
$$\underline{\mathrm{AsyRD}}(\widehat{\theta}_{\mathrm{eo}},\widehat{\theta}_1) < 0 \quad\text{and}\quad \overline{\mathrm{AsyRD}}(\widehat{\theta}_{\mathrm{eo}},\widehat{\theta}_1) \le 0. \quad (3.10)$$
The bounds of the asymptotic risk difference build uniformity over $F\in\mathcal{F}$ into the definition by taking $\inf_{F\in\mathcal{F}}$ and $\sup_{F\in\mathcal{F}}$ before $\liminf_{n\to\infty}$ and $\limsup_{n\to\infty}$. Uniformity is crucial for the asymptotic results to give a good approximation to their finite-sample counterparts. These uniform bounds are different from pointwise results, which are obtained either under a fixed DGP or under a particular sequence of drifting DGPs. The sequence of DGPs $\{F_n\}$ along which the supremum or the infimum is approached often varies with the sample size.⁹ Therefore, to determine the bounds of the asymptotic risk difference, one has to derive the asymptotic distributions of these estimators under various sequences $\{F_n\}$. For each sample size $n$, the true values $F$, $\theta_0$, $\delta_0$, $v_0$ are denoted by $F_n$, $\theta_n$, $\delta_n$, $v_n$, respectively. These true values satisfy the model specified in (1.1), (1.4), and (3.2) with the subscript 0 replaced by $n$. Under $\{F_n\}$, the observations $\{W_{n,i}\}_{i=1}^n$ form a triangular array. For notational simplicity, $W_{n,i}$ is abbreviated to $W_i$.
To study the bounds of the asymptotic risk difference, we consider sequences of DGPs $\{F_n\}$ such that $\delta_n$ satisfies
$$\text{(i) } n^{1/2}\delta_n \to d \in \mathbb{R}^{r^*} \quad\text{or}\quad \text{(ii) } \|n^{1/2}\delta_n\| \to \infty. \quad (3.11)$$
Case (i) models mild misspecification, where $\delta_n$ is a local deviation from 0. Case (ii) includes severe misspecification, where $\delta_n$ is bounded away from 0, as well as the intermediate case in which $\delta_n \to 0$ and $\|n^{1/2}\delta_n\| \to \infty$. To obtain a uniform approximation, all of these sequences are necessary. Once we study the bounds of the asymptotic risk difference along each of these sequences, we show that we can glue them together to obtain the bounds of the asymptotic risk difference.
⁹In the rest of the paper, we use $\{F_n\}$ to denote $\{F_n \in \mathcal{F} : n = 1, 2, \ldots\}$.
4 Averaging Weight
We start by deriving the joint asymptotic distribution of $\widehat{\theta}_1$ and $\widehat{\theta}_2$ under different degrees of misspecification. We consider sequences of DGPs $\{F_n\}$ such that (i) $n^{1/2}\delta_n \to d \in \mathbb{R}^{r^*}$ or $\|n^{1/2}\delta_n\| \to \infty$, and (ii) $G_{2,F_n}(\cdot)$, $\Omega_{2,F_n}(\cdot)$, and $M_{2,F_n}(\cdot)$ are convergent (under the uniform norm). The following lemma characterizes the limits of $G_{2,F_n}(\cdot)$, $\Omega_{2,F_n}(\cdot)$, and $M_{2,F_n}(\cdot)$ under Assumption 3.3.
Lemma 4.1 Suppose that $G_{2,F_n}(\cdot)\to G_2(\cdot)$, $\Omega_{2,F_n}(\cdot)\to\Omega_2(\cdot)$, and $M_{2,F_n}(\cdot)\to M_2(\cdot)$ for some functions $G_2(\cdot)$, $\Omega_2(\cdot)$, and $M_2(\cdot)$. Then under Assumption 3.3, there exists some $F\in\mathcal{F}$ such that
$$G_2(\cdot) = G_{2,F}(\cdot), \quad \Omega_2(\cdot) = \Omega_{2,F}(\cdot), \quad\text{and}\quad M_2(\cdot) = M_{2,F}(\cdot).$$

Let $M_{1,F}(\cdot)$ denote the first $r_1$ components of $M_{2,F}(\cdot)$. By Assumption 3.1.(i), there is a unique $\theta_0 \in \mathrm{int}(\Theta)$ which satisfies $M_{1,F}(\theta_0) = 0$. Under a fixed DGP, we use $F$ to denote the true DGP and $\theta_0$ to denote the true value under $F$. Under the drifting DGP approximation, $F$ is defined in Lemma 4.1 by the limit, and $\theta_0$ denotes the true value in the limit, i.e., the unique value identified by the limit of the valid moment conditions.
For $k=1,2$, define
$$G_k \equiv G_{k,F}(\theta_0), \quad \Omega_k \equiv \Omega_{k,F}(\theta_0), \quad\text{and}\quad \Gamma_k \equiv -\big(G_k'\Omega_k^{-1}G_k\big)^{-1}G_k'\Omega_k^{-1}. \quad (4.1)$$
Let $\mathcal{Z}_2$ denote a zero-mean normal random vector with variance-covariance matrix $\Omega_2$, and let $\mathcal{Z}_1$ denote its first $r_1$ components.
Lemma 4.2 Suppose Assumptions 3.1-3.3 hold. Consider $\{F_n\}$ such that $G_{2,F_n}(\cdot)\to G_2(\cdot)$, $\Omega_{2,F_n}(\cdot)\to\Omega_2(\cdot)$, and $M_{2,F_n}(\cdot)\to M_2(\cdot)$ for some functions $G_2(\cdot)$, $\Omega_2(\cdot)$, $M_2(\cdot)$.
(a) If $n^{1/2}\delta_n \to d$ for some $d\in\mathbb{R}^{r^*}$, then
$$\begin{pmatrix} n^{1/2}(\widehat{\theta}_1 - \theta_n) \\ n^{1/2}(\widehat{\theta}_2 - \theta_n) \end{pmatrix} \to_d \begin{pmatrix} \xi_1 \\ \xi_2 \end{pmatrix} \equiv \begin{pmatrix} \Gamma_1\mathcal{Z}_1 \\ \Gamma_2(\mathcal{Z}_2 + d_0) \end{pmatrix}, \quad\text{where } d_0 = (0_{1\times r_1}, d')'.$$
(b) If $\|n^{1/2}\delta_n\| \to \infty$, then $n^{1/2}(\widehat{\theta}_1 - \theta_n) \to_d \xi_1$ and $\|n^{1/2}(\widehat{\theta}_2 - \theta_n)\| \to_p \infty$.
Given the joint asymptotic distribution of $\widehat{\theta}_1$ and $\widehat{\theta}_2$, it is straightforward to study $\widehat{\theta}(\omega) = (1-\omega)\widehat{\theta}_1 + \omega\widehat{\theta}_2$ if $\omega$ is deterministic. Following Lemma 4.2(a),
$$n^{1/2}(\widehat{\theta}(\omega) - \theta_n) \to_d \overline{\xi}(\omega) \equiv (1-\omega)\xi_1 + \omega\xi_2 \quad (4.2)$$
for $n^{1/2}\delta_n \to d$, where $d\in\mathbb{R}^{r^*}$. In the Appendix, a simple calculation shows that $E[\ell(\overline{\xi}(\omega))]$ is minimized at the infeasible optimal weight
$$\omega^* \equiv \frac{\mathrm{tr}[\Upsilon(\Sigma_1 - \Sigma_2)]}{d_0'(\Gamma_2 - \overline{\Gamma}_1)'\Upsilon(\Gamma_2 - \overline{\Gamma}_1)d_0 + \mathrm{tr}[\Upsilon(\Sigma_1 - \Sigma_2)]}, \quad (4.3)$$
where
$$\Sigma_k \equiv \big(G_k'\Omega_k^{-1}G_k\big)^{-1} \text{ for } k=1,2 \quad\text{and}\quad \overline{\Gamma}_1 \equiv [\Gamma_1, 0_{d_\theta\times r^*}]. \quad (4.4)$$
To gain some intuition, consider the case where $\Upsilon = I_{d_\theta}$, so that the MSE of $\widehat{\theta}(\omega)$ is minimized at $\omega^*$. In this case, the infeasible optimal weight $\omega^*$ yields the ideal bias-variance trade-off. However, the bias depends on $d$, which cannot be consistently estimated. Hence, $\omega^*$ cannot be consistently estimated. Our solution to this problem is to provide an inconsistent estimator of $\omega^*$ such that the resulting averaging estimator reduces the MSE for any value of $d$.
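The calculation behind (4.3) can be reconstructed from Lemma 4.2(a) as follows. This is our sketch, not the Appendix derivation verbatim; it uses the standard efficient-GMM property that the covariance between the two limits equals the more efficient variance, $\mathrm{Cov}(\xi_1,\xi_2)=\Sigma_2$.

```latex
% Write \bar\xi(\omega) = \xi_1 + \omega(\xi_2 - \xi_1), so that
E[\ell(\bar\xi(\omega))]
  = E[\xi_1'\Upsilon\xi_1]
  + 2\omega\, E[\xi_1'\Upsilon(\xi_2-\xi_1)]
  + \omega^2\, E[(\xi_2-\xi_1)'\Upsilon(\xi_2-\xi_1)].
% Under Lemma 4.2(a): Var(\xi_1)=\Sigma_1, Var(\xi_2)=\Sigma_2,
% Cov(\xi_1,\xi_2)=\Sigma_2, E[\xi_1]=0, E[\xi_2-\xi_1]=(\Gamma_2-\bar\Gamma_1)d_0.
% Hence Var(\xi_2-\xi_1)=\Sigma_1-\Sigma_2 and
E[\xi_1'\Upsilon(\xi_2-\xi_1)] = -\operatorname{tr}[\Upsilon(\Sigma_1-\Sigma_2)],
\quad
E[(\xi_2-\xi_1)'\Upsilon(\xi_2-\xi_1)]
  = d_0'(\Gamma_2-\bar\Gamma_1)'\Upsilon(\Gamma_2-\bar\Gamma_1)d_0
  + \operatorname{tr}[\Upsilon(\Sigma_1-\Sigma_2)].
% The first-order condition in \omega (minimize a quadratic) then yields (4.3):
\omega^* = \frac{\operatorname{tr}[\Upsilon(\Sigma_1-\Sigma_2)]}
 {d_0'(\Gamma_2-\bar\Gamma_1)'\Upsilon(\Gamma_2-\bar\Gamma_1)d_0
  + \operatorname{tr}[\Upsilon(\Sigma_1-\Sigma_2)]}.
```

The numerator is the variance gain from the extra moments, and the extra term in the denominator is the squared (weighted) asymptotic bias, which makes the bias-variance trade-off explicit.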
The empirical analog of $\omega^*$ is constructed as follows. First, for $k=1$ and $2$, replace $\Sigma_k$ by its consistent estimator¹⁰ $\widehat{\Sigma}_k \equiv (\widehat{G}_k'\widehat{\Omega}_k^{-1}\widehat{G}_k)^{-1}$, where
$$\widehat{G}_k \equiv n^{-1}\sum_{i=1}^n g_{k,\theta}(W_i,\widehat{\theta}_1) \quad\text{and}\quad \widehat{\Omega}_k \equiv n^{-1}\sum_{i=1}^n g_k(W_i,\widehat{\theta}_1)g_k(W_i,\widehat{\theta}_1)' - \overline{g}_k(\widehat{\theta}_1)\overline{g}_k(\widehat{\theta}_1)'. \quad (4.5)$$
Second, replace $(\Gamma_2 - \overline{\Gamma}_1)d_0$ by its asymptotically unbiased estimator $n^{1/2}(\widehat{\theta}_2 - \widehat{\theta}_1)$, because
$$n^{1/2}(\widehat{\theta}_2 - \widehat{\theta}_1) \to_d (\Gamma_2 - \overline{\Gamma}_1)(\mathcal{Z}_2 + d_0) \quad (4.6)$$
for $d_0 = (0_{1\times r_1}, d')'$ and $d\in\mathbb{R}^{r^*}$, following Lemma 4.2(a). The empirical optimal weight takes the form
$$\widetilde{\omega}_{\mathrm{eo}} \equiv \frac{\mathrm{tr}\big[\Upsilon(\widehat{\Sigma}_1 - \widehat{\Sigma}_2)\big]}{n(\widehat{\theta}_2 - \widehat{\theta}_1)'\Upsilon(\widehat{\theta}_2 - \widehat{\theta}_1) + \mathrm{tr}\big[\Upsilon(\widehat{\Sigma}_1 - \widehat{\Sigma}_2)\big]}, \quad (4.7)$$
where $\Upsilon$ is the matrix specified in the loss function. The averaging GMM estimator takes the form
$$\widehat{\theta}_{\mathrm{eo}} = (1-\widetilde{\omega}_{\mathrm{eo}})\widehat{\theta}_1 + \widetilde{\omega}_{\mathrm{eo}}\widehat{\theta}_2. \quad (4.8)$$
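Given the two estimates and the variance estimates from (4.5), the weight (4.7) and the estimator (4.8) amount to a few matrix operations. A minimal sketch (the function name is ours):

```python
import numpy as np

def averaging_gmm(theta1, theta2, Sigma1_hat, Sigma2_hat, Upsilon, n):
    """Empirical optimal weight (4.7) and averaging estimator (4.8).

    theta1, theta2 : conservative and aggressive GMM estimates
    Sigma*_hat     : estimated asymptotic variances (G' Omega^{-1} G)^{-1}
    Upsilon        : weighting matrix of the loss (3.7)
    """
    diff = theta2 - theta1
    gain = np.trace(Upsilon @ (Sigma1_hat - Sigma2_hat))  # variance reduction
    bias = n * diff @ Upsilon @ diff                      # squared-bias proxy
    w = gain / (bias + gain)
    return (1.0 - w) * theta1 + w * theta2, w
```

When $\widehat{\theta}_2$ is far from $\widehat{\theta}_1$, evidence of misspecification inflates the denominator and drives the weight toward 0, so $\widehat{\theta}_{\mathrm{eo}}$ smoothly reverts to the conservative estimator instead of switching discontinuously as the pre-test rule does.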
Next we consider the asymptotic distribution of $\widehat{\theta}_{\mathrm{eo}}$ under different degrees of misspecification.

Lemma 4.3 Suppose that Assumptions 3.1-3.3 hold. Consider $\{F_n\}$ such that $G_{2,F_n}(\cdot)\to G_2(\cdot)$, $\Omega_{2,F_n}(\cdot)\to\Omega_2(\cdot)$, and $M_{2,F_n}(\cdot)\to M_2(\cdot)$ for some functions $G_2(\cdot)$, $\Omega_2(\cdot)$, $M_2(\cdot)$.

¹⁰The consistency is proved in Lemma 4.3.
(a) If $n^{1/2}\delta_n \to d$ for some $d\in\mathbb{R}^{r^*}$, then
$$\widetilde{\omega}_{\mathrm{eo}} \to_d \overline{\omega} \equiv \frac{\mathrm{tr}(\Upsilon(\Sigma_1-\Sigma_2))}{(\mathcal{Z}_2+d_0)'(\Gamma_2-\overline{\Gamma}_1)'\Upsilon(\Gamma_2-\overline{\Gamma}_1)(\mathcal{Z}_2+d_0) + \mathrm{tr}(\Upsilon(\Sigma_1-\Sigma_2))}$$
and
$$n^{1/2}(\widehat{\theta}_{\mathrm{eo}} - \theta_n) \to_d \overline{\xi} \equiv (1-\overline{\omega})\xi_1 + \overline{\omega}\xi_2.$$
(b) If $\|n^{1/2}\delta_n\| \to \infty$, then $\widetilde{\omega}_{\mathrm{eo}} \to_p 0$ and $n^{1/2}(\widehat{\theta}_{\mathrm{eo}} - \theta_n) \to_d \xi_1$.
To study the bounds of the asymptotic risk difference between $\widehat{\theta}_{\mathrm{eo}}$ and $\widehat{\theta}_1$, it is important to take into account the data-dependent nature of $\widetilde{\omega}_{\mathrm{eo}}$. Unlike $\widehat{\theta}_1$ and $\widehat{\theta}_2$, the randomness in $\widetilde{\omega}_{\mathrm{eo}}$ is non-negligible in the mild misspecification case (a). In consequence, $\widehat{\theta}_{\mathrm{eo}}$ does not achieve the same bounds of asymptotic risk difference as the ideal averaging estimator $(1-\omega^*)\widehat{\theta}_1 + \omega^*\widehat{\theta}_2$ does. Nevertheless, below we show that $\widehat{\theta}_{\mathrm{eo}}$ is insured against potentially misspecified moments because it uniformly dominates $\widehat{\theta}_1$.

5 Bounds of Asymptotic Risk Difference under Misspecification
In this section, we study the bounds of the asymptotic risk difference defined in (3.9). Note that the asymptotic distributions of $\widehat{\theta}_1$ and $\widehat{\theta}_{\mathrm{eo}}$ in Lemmas 4.2 and 4.3 depend only on $d$, $G_2$, and $\Omega_2$. For notational convenience, define
$$h \equiv (d', \mathrm{vec}(G_2)', \mathrm{vech}(\Omega_2)')'. \quad (5.1)$$
For the mild misspecification case, define the parameter space of $h$ as
$$H = \{h : d\in\mathbb{R}^{r^*},\ G_2 = G_{2,F}(\theta_0) \text{ and } \Omega_2 = \Omega_{2,F}(\theta_0) \text{ for some } F\in\mathcal{F} \text{ with } \delta_0 = 0\}, \quad (5.2)$$
where $\theta_0$ and $\delta_0$ are defined by (1.1) and (1.4) for a given $F$.
If $n^{1/2}\delta_n \to d\in\mathbb{R}^{r^*}$, then by the asymptotic distributions of $\widehat{\theta}_{\mathrm{eo}}$ and $\widehat{\theta}_1$ and the bounded convergence theorem,
$$E_F[\ell_\zeta(\widehat{\theta}_{\mathrm{eo}}) - \ell_\zeta(\widehat{\theta}_1)] \to g_\zeta(h) \equiv E\big[\min\{\overline{\xi}'\Upsilon\overline{\xi}, \zeta\} - \min\{\xi_1'\Upsilon\xi_1, \zeta\}\big], \quad (5.3)$$
where $\zeta$ is a constant which does not depend on $n$ and $E[\cdot]$ denotes expectation with respect to (wrt) the Lebesgue measure. As $\zeta\to\infty$, the limit of $g_\zeta(h)$ is
$$g(h) \equiv E[\overline{\xi}'\Upsilon\overline{\xi}] - E[\xi_1'\Upsilon\xi_1]. \quad (5.4)$$
Figure 2. The Finite-Sample Risk and the (Simulated) Asymptotic Risk
Note: The finite-sample and (simulated) asymptotic risks of the averaging GMM estimator are denoted "GMMA-FRisk" and "GMMA-SRisk", respectively. The finite-sample and (simulated) asymptotic risks of the pre-test GMM estimator are denoted "GMMP-FRisk" and "GMMP-SRisk", respectively.
Theorem 5.1 Suppose that Assumptions 3.1-3.3 hold. The bounds of the asymptotic risk difference satisfy
$$\underline{\mathrm{AsyRD}}(\widehat{\theta}_{\mathrm{eo}},\widehat{\theta}_1) = \lim_{\zeta\to\infty}\min\Big\{\inf_{h\in H}[g_\zeta(h)],\ 0\Big\} = \min\Big\{\inf_{h\in H}[g(h)],\ 0\Big\},$$
$$\overline{\mathrm{AsyRD}}(\widehat{\theta}_{\mathrm{eo}},\widehat{\theta}_1) = \lim_{\zeta\to\infty}\max\Big\{\sup_{h\in H}[g_\zeta(h)],\ 0\Big\} = \max\Big\{\sup_{h\in H}[g(h)],\ 0\Big\}.$$
By Theorem 5.1, to show that $\widehat{\theta}_{\mathrm{eo}}$ uniformly dominates $\widehat{\theta}_1$, it is sufficient to show that $\inf_{h\in H}[g(h)] < 0$ and $\sup_{h\in H}[g(h)] \le 0$. We can investigate $\inf_{h\in H}g(h)$ and $\sup_{h\in H}g(h)$ by simulating $g(h)$ in (5.4). In practice, we replace $G_2$ and $\Omega_2$ by their consistent estimators and plot $g(h)$ as a function of $d$. Even if the uniform dominance condition does not hold, $\min\{\inf_{h\in H}[g(h)], 0\}$ and $\max\{\sup_{h\in H}[g(h)], 0\}$ quantify the most- and least-favorable scenarios for the averaging estimator.

Following (5.3) and (5.4), we call $E[\overline{\xi}'\Upsilon\overline{\xi}]$ and $E[\xi_1'\Upsilon\xi_1]$ the asymptotic risks of $\widehat{\theta}_{\mathrm{eo}}$ and $\widehat{\theta}_1$, so that $g(h)$ represents the asymptotic risk difference. For any given $h$, Figure 2 uses simulation to demonstrate that these asymptotic risks provide good approximations to the finite-sample risks $E_F[\ell(\widehat{\theta}_{\mathrm{eo}})]$ and $E_F[\ell(\widehat{\theta}_1)]$. Figure 2 normalizes the finite-sample risk of $\widehat{\theta}_1$ to be 1 in all cases
and plots the (simulated) asymptotic risks and finite-sample risks of two estimators: one is the averaging estimator $\widehat{\theta}_{\mathrm{eo}}$ and the other is the pre-test GMM estimator based on the over-identification J-test with significance level 0.01.¹¹ The asymptotic risk for this pre-test estimator is given by (A.89) in the Appendix. It is clear that the finite-sample risk and the (simulated) asymptotic risk are fairly close, and the averaging GMM estimator uniformly dominates the conservative estimator while the pre-test estimator does not.
Theorem 5.2 Let $A \equiv \Upsilon(\Sigma_1 - \Sigma_2)$. Suppose that Assumptions 3.1-3.3 hold. If $\mathrm{tr}(A) > 0$ and $\mathrm{tr}(A) \ge 4\lambda_{\max}(A)$ for any $F\in\mathcal{F}$, we have
$$\underline{\mathrm{AsyRD}}(\widehat{\theta}_{\mathrm{eo}},\widehat{\theta}_1) < 0 \quad\text{and}\quad \overline{\mathrm{AsyRD}}(\widehat{\theta}_{\mathrm{eo}},\widehat{\theta}_1) = 0.$$
Thus, $\widehat{\theta}_{\mathrm{eo}}$ uniformly dominates $\widehat{\theta}_1$.
To shed light on the sufficient conditions in Theorem 5.2, consider a scenario similar to the James-Stein estimator: $\Sigma_1 = \sigma_1^2 I_{d_\theta}$, $\Sigma_2 = \sigma_2^2 I_{d_\theta}$, and $\Upsilon = I_{d_\theta}$. In this case, the sufficient conditions become $\sigma_1 > \sigma_2$ and $d_\theta \ge 4$. The first condition, $\mathrm{tr}(A) > 0$, which reduces to $\sigma_1 > \sigma_2$, requires that the additional moments $E_F[g^*(W_i,\theta_0)] = 0$ are non-redundant in the sense that they lead to a more efficient estimator of $\theta$. The second condition, $\mathrm{tr}(A) \ge 4\lambda_{\max}(A)$, which reduces to $d_\theta \ge 4$, requires that we are interested in the total risk of several parameters rather than that of a single one. In the more general case where $\Sigma_1$ and $\Sigma_2$ are not proportional to the identity matrix, the sufficient conditions reduce to $\Sigma_1 > \Sigma_2$ and $d_\theta \ge 4$ under the choice $\Upsilon = (\Sigma_1 - \Sigma_2)^{-1}$, which rescales $\widehat{\theta}$ by the variance reduction $\Sigma_1 - \Sigma_2$. In a simple linear IV model (Example 3.1) where $Z_i^*$ is independent of $Z_{1,i}$ and the regression error $U_i$ is homoskedastic conditional on the IVs, $\Sigma_1 > \Sigma_2$ requires that $E_{F^*}[Z_i^*X_i']$ and $E_{F^*}[Z_i^*Z_i^{*\prime}]$ both have full rank.
Note that these conditions are sufficient but not necessary. If they do not hold, we can still simulate the upper bound in Theorem 5.1 to check the uniform dominance condition. In fact, the simulation studies in the next section show that in many cases $\hat\theta_{eo}$ has a smaller finite-sample risk than $\hat\theta_1$ even when these sufficient conditions are violated. Nevertheless, the analytical sufficient conditions can be checked easily before the simulation-based methods are adopted.
¹¹For the finite-sample results, $\delta_n = c_0 n^{-1/2} 1_{r^*\times 1}$, where $c_0$ is a scalar ranging from 0 to 20. The finite-sample risks are calculated using 100,000 simulated samples, and the asymptotic risks are simulated by drawing 10,000 normal random vectors with mean zero and variance-covariance matrix $\hat\Omega_2$ in each simulated sample. No truncation is applied to the finite-sample risk.
6 Simulation Studies
In this section, we investigate the finite-sample performance of our averaging GMM estimator in linear IV models. In addition to the empirical optimal weight $\tilde\omega_{eo}$, we consider two other averaging estimators based on James-Stein (JS) type weights. The first is based on the positive part of the JS weight:¹²
$$\omega_{P,JS} = 1 - \left(1 - \frac{\mathrm{tr}(\hat A) - 2\lambda_{\max}(\hat A)}{n(\hat\theta_2 - \hat\theta_1)' H (\hat\theta_2 - \hat\theta_1)}\right)_+, \qquad (6.1)$$
where $(x)_+ = \max\{0, x\}$ and $\hat A$ is the estimator of $A$ using $\hat\Sigma_k$ for $k = 1, 2$. The second uses the restricted JS weight
$$\omega_{R,JS} = (\omega_{P,JS})_+. \qquad (6.2)$$
By construction, $\omega_{P,JS} \leq 1$ and $0 \leq \omega_{R,JS} \leq 1$. We compare the finite-sample MSEs of these three averaging estimators, the conservative GMM estimator $\hat\theta_1$, and the pre-test GMM estimator based on the J-test. The finite-sample MSE of the conservative GMM estimator is normalized to be 1.
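The two JS weights can be computed directly from the two estimates. The sketch below is our own reading of the formulas in (6.1)-(6.2), with `H` denoting the weight matrix in the quadratic form there:

```python
import numpy as np

def js_weights(theta1, theta2, A_hat, H, n):
    """Positive-part and restricted JS weights, following (6.1)-(6.2):
    omega_PJS = 1 - (1 - tau / (n q))_+ with tau = tr(A_hat) - 2 lambda_max(A_hat)
    and q = (theta2 - theta1)' H (theta2 - theta1); omega_RJS = (omega_PJS)_+."""
    diff = theta2 - theta1
    tau = np.trace(A_hat) - 2 * np.max(np.linalg.eigvals(A_hat).real)
    q = float(diff @ H @ diff)
    w_pjs = 1 - max(0.0, 1 - tau / (n * q))  # weight on the aggressive estimator
    w_rjs = max(0.0, w_pjs)                  # clipped to [0, 1]
    return w_pjs, w_rjs
```

When the two estimators are far apart (large $q$), both weights shrink toward zero, so the averaging estimator loads on the conservative $\hat\theta_1$.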
In Theorem 5.2, the key sufficient condition for uniform dominance is $\mathrm{tr}(A) \geq 4\lambda_{\max}(A)$. When this condition is not satisfied, however, it is still possible that our averaging GMM estimator has a smaller risk than the conservative GMM estimator. We therefore consider two models in the simulation studies. In the first model, $\mathrm{tr}(A) \geq 4\lambda_{\max}(A)$, so the sufficient condition in Theorem 5.2 is satisfied. In the second model, $2\lambda_{\max}(A) < \mathrm{tr}(A) < 4\lambda_{\max}(A)$, so the sufficient condition in Theorem 5.2 does not hold. In each model, we consider four sample sizes, $n = 50, 100, 250, 500$, and use 100,000 simulation repetitions.
6.1 Simulation in Model 1
Our first simulation model is
$$Y_i = \sum_{j=1}^{6} \beta_j X_{j,i} + \varepsilon_i, \qquad (6.3)$$
where the $X_{j,i}$ are generated by
$$X_{j,i} = \pi_j(Z_{j,i} + Z_{j+6,i}) + Z_{j+12,i} + u_{j,i} \quad \text{for } j = 1, \ldots, 6. \qquad (6.4)$$

¹²This formula is a GMM analog of the generalized JS type shrinkage estimator in Hansen (2016) for parametric models. The shrinkage scalar $\tau$ is set to $\mathrm{tr}(\hat A) - 2\lambda_{\max}(\hat A)$ in a fashion similar to the original JS estimator.
We draw i.i.d. random vectors $(Z_{1,i}, \ldots, Z_{18,i}, u_{1,i}, \ldots, u_{6,i}, \varepsilon_i)'$ from the normal distribution with mean zero and variance-covariance matrix $\mathrm{diag}(I_{18\times 18}, \Sigma_{7\times 7})$, where
$$\Sigma_{7\times 7} = \begin{pmatrix} I_{6\times 6} & 0.25 \times 1_{6\times 1} \\ 0.25 \times 1_{1\times 6} & 1 \end{pmatrix}. \qquad (6.5)$$
We set $(\pi_1, \ldots, \pi_6) = 2.5 \times 1_{1\times 6}$ and $(\beta_1, \ldots, \beta_6) = 0.5 \times 1_{1\times 6}$. The observed data are $W_i = (Y_i, X_{1,i}, \ldots, X_{6,i}, Z_{1,i}, \ldots, Z_{12,i}, \tilde Z_{13,i}, \ldots, \tilde Z_{18,i})'$, where
$$\tilde Z_{j,i} = Z_{j,i} + n^{-1/2} d_j \varepsilon_i \quad \text{for } j = 13, \ldots, 18. \qquad (6.6)$$
In the main regression equation (6.3), all regressors are endogenous because $E(X_{j,i}\varepsilon_i) = 0.25$ for $j = 1, \ldots, 6$. The instruments $(Z_{1,i}, \ldots, Z_{12,i})'$ are valid and $(\tilde Z_{13,i}, \ldots, \tilde Z_{18,i})'$ are misspecified because $E(\tilde Z_{j,i}\varepsilon_i) = n^{-1/2} d_j$ for $j = 13, \ldots, 18$. In the simulation studies, we consider $d = (d_{13}, \ldots, d_{18}) = c_0 \times 1_{1\times 6}$, where $c_0$ is a scalar that takes values on the grid points between 0 and 20 with grid length 0.1.
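The design in (6.3)-(6.6) is straightforward to simulate. The sketch below is our own illustration (the function name and interface are ours), drawing one sample with a common location parameter $c_0$:

```python
import numpy as np

def draw_model1(n, c0, rng):
    """Draw one sample from the Model 1 design (6.3)-(6.6), with d_j = c0 for all j."""
    Z = rng.standard_normal((n, 18))  # 18 i.i.d. N(0,1) instruments
    # (u_1,...,u_6, eps)' ~ N(0, Sigma_{7x7}) with Cov(u_j, eps) = 0.25, as in (6.5)
    Sigma = np.eye(7)
    Sigma[:6, 6] = Sigma[6, :6] = 0.25
    U = rng.multivariate_normal(np.zeros(7), Sigma, size=n)
    u, eps = U[:, :6], U[:, 6]
    X = 2.5 * (Z[:, :6] + Z[:, 6:12]) + Z[:, 12:18] + u      # regressors, eq. (6.4)
    Y = X @ (0.5 * np.ones(6)) + eps                         # outcome, eq. (6.3)
    Z_tilde = Z[:, 12:18] + c0 * eps[:, None] / np.sqrt(n)   # misspecified IVs, eq. (6.6)
    return Y, X, np.hstack([Z[:, :12], Z_tilde])
```

The correlation 0.25 between each $u_{j,i}$ and $\varepsilon_i$ makes all six regressors endogenous, while the $n^{-1/2}c_0\varepsilon_i$ term injects the drifting violation into the last six instruments.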
Figure 3 presents the MSEs of various estimators of the parameters in (6.3). Our findings in Model 1 are summarized as follows. First, the averaging GMM estimator $\hat\theta_{eo}$ has smaller MSE than $\hat\theta_1$ uniformly over $d$ for all sample sizes considered, as predicted by our theory, because the key sufficient condition is satisfied in this model. Second, the pre-test GMM estimator does not dominate the conservative GMM estimator $\hat\theta_1$. When the location parameter $c_0$ is close to zero, the pre-test GMM estimator has relative MSE as low as 0.4; however, its relative MSE is above 1 when $c_0$ is around 5. Third, among the three averaging estimators, the one based on $\tilde\omega_{eo}$ has the smallest MSE. Fourth, the positive JS averaging estimator has relative MSE above 1 when the sample size is small (e.g., $n = 50$ and $n = 100$), and relative MSE below 1 when the sample size becomes slightly larger. Fifth, it is interesting to see that as the sample size grows, the finite-sample MSEs of the positive and restricted JS averaging estimators converge to that of the averaging estimator based on $\tilde\omega_{eo}$.

6.2 Simulation in Model 2
The second model is
$$Y_i = \sum_{j=1}^{6} \beta_j X_{j,i} + \varepsilon_i, \qquad (6.7)$$
Figure 3. Finite-Sample MSEs of the Averaging Estimators in Model 1

Note: "Pre-test(0.01)" refers to the pre-test GMM estimator based on the J-test with nominal size 0.01; "Plug-opt" refers to the averaging GMM estimator based on the empirical optimal weight $\tilde\omega_{eo}$; "Posi-JS" and "ReSt-JS" refer to the averaging estimators based on the positive part of the JS weight and the restricted JS weight, respectively.
where $X_{1,i}$, $X_{2,i}$ and $X_{3,i}$ are exogenous variables generated by
$$X_{1,i} = 3^{-1/2}(Z_{1,i} + Z_{2,i} + Z_{4,i}), \quad X_{2,i} = 3^{-1/2}(Z_{2,i} + Z_{3,i} + Z_{6,i}), \quad X_{3,i} = 3^{-1/2}(Z_{3,i} + Z_{1,i} + Z_{8,i}), \qquad (6.8)$$
and $X_{j,i}$ ($j = 4, 5, 6$) are generated by
$$X_{j,i} = \pi_j(Z_{j,i} + Z_{j+3,i}) + Z_{j+6,i} + u_{j,i} \quad \text{for } j = 4, 5, 6. \qquad (6.9)$$
We draw i.i.d. random vectors $(Z_{1,i}, \ldots, Z_{12,i}, u_{4,i}, \ldots, u_{6,i}, \varepsilon_i)'$ from the normal distribution with mean zero and variance-covariance matrix $\mathrm{diag}(I_{12\times 12}, \Sigma_{4\times 4})$, where
$$\Sigma_{4\times 4} = \begin{pmatrix} I_{3\times 3} & 0.25 \times 1_{3\times 1} \\ 0.25 \times 1_{1\times 3} & 1 \end{pmatrix}. \qquad (6.10)$$
The observed data are $W_i = (Y_i, X_{1,i}, \ldots, X_{6,i}, Z_{4,i}, \ldots, Z_{9,i}, \tilde Z_{4,i}, \ldots, \tilde Z_{6,i})'$, where
$$\tilde Z_{j,i} = Z_{j+6,i} + n^{-1/2} d_j \varepsilon_i \quad \text{for } j = 4, 5, 6. \qquad (6.11)$$
We set $(\beta_1, \ldots, \beta_6) = 2.5 \times 1_{1\times 6}$ and $(\pi_4, \ldots, \pi_6) = 0.5 \times 1_{1\times 3}$. In this model, $X_{j,i}$ ($j = 4, 5, 6$) are endogenous regressors, $(Z_{4,i}, \ldots, Z_{9,i})'$ are valid IVs, and $(\tilde Z_{4,i}, \ldots, \tilde Z_{6,i})'$ are misspecified IVs. In the simulation, we consider $d = (d_4, \ldots, d_6) = c_0 \times 1_{1\times 3}$, where $c_0$ is a scalar that takes values on the grid points between 0 and 20 with grid length 0.1.
Figure 4. Finite-Sample Risks of the Averaging Estimators in Model 2

Our findings in Figure 4 are summarized as follows. First, even though the sufficient condition in Theorem 5.1(c) is not satisfied, the averaging estimator based on $\tilde\omega_{eo}$ has a smaller MSE than $\hat\theta_1$ uniformly over $d$ for all sample sizes considered. Moreover, its MSE is much smaller than that of the other two averaging estimators. Second, the properties of the pre-test estimator are similar to those in Model 1; that is, it does not dominate the conservative estimator. Third, the averaging estimator using $\omega_{P,JS}$ has very large and unstable MSE when the sample size is small (e.g., $n = 50$ and $100$). When the sample size is 50, its MSE is above 1.5 and hence does not show up in the first panel of Figure 4. When the sample size becomes slightly larger (e.g., $n = 250$ and $500$), the positive JS averaging estimator has larger MSE than $\hat\theta_1$ when the location parameter $c_0$ is close to zero. Fourth, the averaging estimator using $\omega_{R,JS}$ has almost identical MSE to $\hat\theta_1$ when the sample size is small (e.g., $n = 50$ and $100$), and its relative MSE becomes slightly lower than 1 when $c_0$ is close to zero and the sample size is large (e.g., $n = 250$ and $500$).
7 An Empirical Application
One important issue in the empirical analysis of life-cycle labor supply is to estimate the individual human capital production function. Knowledge of the human capital function allows researchers to estimate the household's utility function, and hence to evaluate how changes in policies, such as tax reductions, affect consumption, labor market outcomes, and welfare (see, e.g., Heckman, 1976; Shaw, 1989; and Imai and Keane, 2004). This section applies the averaging GMM estimator to estimate the human capital production function.
We follow the literature (see, e.g., Shaw, 1989) and specify the human capital production function as a quadratic function of $k_{i,t}$, the log of the human capital stock $K_{i,t}$, and $h_{i,t}$, the log of hours of work $H_{i,t}$:
$$f(k_{i,t}, h_{i,t}; \theta) = \gamma_1 h_{i,t} + \gamma_2 h_{i,t}^2 + \gamma_3 h_{i,t} k_{i,t} + \gamma_4 k_{i,t} + \gamma_5 k_{i,t}^2, \qquad (7.1)$$
where $\theta = (\gamma_1, \ldots, \gamma_5)$ are unknown parameters. Denote the regressors by
$$X_{i,t} = \left(h_{i,t},\ h_{i,t}^2,\ h_{i,t} k_{i,t},\ k_{i,t},\ k_{i,t}^2\right)'. \qquad (7.2)$$
The log human capital stock $k_{i,t}$ is accumulated through the equation
$$k_{i,t+1} = f(k_{i,t}, h_{i,t}; \theta) + \epsilon_{i,t}, \qquad (7.3)$$
where $\epsilon_{i,t} = \alpha_i + u_{i,t}$ is the unobservable residual term that contains an individual heterogeneity component $\alpha_i$ and a random shock $u_{i,t}$. To avoid unnecessary complications, we follow Shaw (1989) in specifying the real wage as $w_{i,t} = R_{i,t} K_{i,t}$ and follow Hokayem and Ziliak (2014) in assuming $R_{i,t} = 1$ for all $i$ and all $t$.¹³
We use the same data set as Hokayem and Ziliak (2014), drawn from the Panel Study of Income Dynamics (PSID). The sample includes biennial observations on 1654 men from 1999 to 2009. We further narrow the sample to individuals with at least three consecutive periods of observations, which gives us a data set with 5774 individual-year observations.

To eliminate the individual effect, we take the first difference of equation (7.3):
$$\Delta k_{i,t+1} = \Delta f(k_{i,t}, h_{i,t}; \theta) + \Delta u_{i,t}, \qquad (7.4)$$

¹³Another way to think of this specification is to use the real wage rate $w_t$ as a proxy for the human capital stock $K_t$.
Table 1. Estimates of the Human Capital Production Function

                    $\gamma_1$    $\gamma_2$    $\gamma_3$    $\gamma_4$    $\gamma_5$    J-test
$\hat\theta_1$       0.0236       -0.0070        0.0310        0.0656       -0.0381       0.8427
                    (0.0571)     (0.0444)      (0.0626)      (0.0621)     (0.0447)
$\hat\theta_2$       0.0009        0.0265       -0.0113       -0.2232       -0.0925       0
                    (0.0328)     (0.0240)      (0.0496)      (0.0529)     (0.0247)

Note: (i) Numbers in parentheses are standard errors; (ii) numbers in the last column are the p-values of the J-tests; (iii) the GMM estimators are based on the PSID sample for the years 2003, 2005, 2007 and 2009; (iv) four year dummy variables are included in the moment functions and are used as their own IVs in the GMM estimation.
where "$\Delta$" denotes the first-difference operator. The unknown parameter $\theta$ can be estimated by the GMM estimator $\hat\theta_1$ with the moment functions
$$g_1(\Delta k_{i,t+1}, \Delta X_{i,t}, Z_{1,t}; \theta) = \left[\Delta k_{i,t+1} - \Delta f(k_{i,t}, h_{i,t}; \theta)\right] Z_{1,t}, \qquad (7.5)$$
where $Z_{1,t} = (X_{i,t-1}', Z_{*,t}')'$ is a set of IVs including $X_{i,t-1}$ and
$$Z_{*,t} = \left(c_{i,t-1},\ c_{i,t-1}^2,\ c_{i,t-1} l_{i,t-1},\ l_{i,t-1},\ l_{i,t-1}^2\right)', \qquad (7.6)$$
where $c_{i,t-1} = \log C_{i,t-1}$, $l_{i,t-1} = \log L_{i,t-1}$, and $C_{i,t-1}$ and $L_{i,t-1}$ are, respectively, the consumption and leisure of individual $i$ at period $t-1$. The lagged consumption and leisure variables are included to provide extra identification restrictions for the human capital function.

In equation (7.4), the regressors $\Delta X_{i,t}$ may be endogenous because: (i) $k_{i,t}$ is correlated with $u_{i,t-1}$, and hence with $\Delta u_{i,t}$, in view of equation (7.3); and (ii) $h_{i,t}$ is partly determined by $k_{i,t}$ through the individual's labor supply decision. As a result, the LS estimator based on the moment function
$$g^*(\Delta k_{i,t+1}, \Delta X_{i,t}; \theta) = \left[\Delta k_{i,t+1} - \Delta f(k_{i,t}, h_{i,t}; \theta)\right] \Delta X_{i,t} \qquad (7.7)$$
may be inconsistent. The aggressive GMM estimator $\hat\theta_2$ is constructed using the moment conditions in both (7.5) and (7.7).
Table 1 reports the estimation results for the conservative and aggressive estimators. The conservative and aggressive GMM estimates of $\theta$ differ substantially. The J-test strongly rejects the validity of the moment conditions in (7.7), while it supports the validity of the moment conditions in (7.5). On the other hand, the aggressive GMM estimator $\hat\theta_2$ has much smaller standard errors than the conservative estimator $\hat\theta_1$.

Next, we consider the averaging GMM estimator under the quadratic loss function with $\Lambda = I_{d_\theta}$. The empirical weight $\tilde\omega_{eo}$ on the aggressive GMM estimator is 0.0770. It is interesting that the averaging estimator assigns nontrivial weight to $\hat\theta_2$, even though the J-test indicates misspecification of the moment conditions in (7.7).

Figure 5. Simulated Asymptotic Risk of the Averaging Estimator of the Human Capital Function
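For reference, the empirical optimal weight has a closed form (the sample analog of the optimal weight; see (A.43) in the Appendix). A minimal sketch, with our own function name:

```python
import numpy as np

def empirical_optimal_weight(theta1, theta2, Sigma1_hat, Sigma2_hat, Lam, n):
    """Sample analog of the optimal weight on the aggressive estimator:
    tr(Lam(S1 - S2)) / ( n (t2 - t1)' Lam (t2 - t1) + tr(Lam(S1 - S2)) )."""
    diff = theta2 - theta1
    gain = np.trace(Lam @ (Sigma1_hat - Sigma2_hat))  # estimated efficiency gain
    bias = n * float(diff @ Lam @ diff)               # scaled squared-difference term
    return gain / (bias + gain)
```

With $\Lambda = I_{d_\theta}$, a large discrepancy $\hat\theta_2 - \hat\theta_1$ drives the weight toward zero, while a large estimated efficiency gain pushes it up; the reported 0.0770 reflects the trade-off between these two forces.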
To evaluate the performance of the averaging GMM estimator, we simulate its asymptotic risk following the formula in (5.4). This exercise is the same as that for Figure 2, which shows that the simulated asymptotic risk is a good approximation to the finite-sample risk. As there are 5 moment conditions in (7.7), the risk of the averaging GMM estimator is a function of a 5-dimensional vector of location parameters $d = (d_1, d_2, d_3, d_4, d_5)' \in R^5$. We parameterize it as
$$d_1 = \sqrt{r}\cos\phi_1, \quad d_2 = \sqrt{r}\sin\phi_1\sin\phi_2\sin\phi_3, \quad d_3 = \sqrt{r}\sin\phi_1\sin\phi_2\cos\phi_3,$$
$$d_4 = \sqrt{r}\sin\phi_1\cos\phi_2\sin\phi_4, \quad d_5 = \sqrt{r}\sin\phi_1\cos\phi_2\cos\phi_4 \qquad (7.8)$$
for some $r \in [0, +\infty)$ and $\phi_1, \phi_2, \phi_3, \phi_4 \in [0, 2\pi]$, such that $\sum_{k=1}^{5} d_k^2 = r$. To simulate the risk, we consider 1001 equally spaced grid points for $r$ between 0 and 100, and for each grid point of $r$ we consider 30 equally spaced grid points (starting at 0) for each of $\phi_1$, $\phi_2$, $\phi_3$ and $\phi_4$ between 0 and $2\pi$. For each grid point of $r$, this gives $30^4$ values of the simulated asymptotic risk, and we record the minimum and maximum values. As in the Monte Carlo simulation studies, the asymptotic risk of the conservative GMM estimator is normalized to be 1 for each DGP.
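The parameterization in (7.8) maps $(r, \phi_1, \ldots, \phi_4)$ to a vector $d$ with squared length exactly $r$, which is easy to verify numerically (our own sketch):

```python
import numpy as np

def location_vector(r, phi1, phi2, phi3, phi4):
    """Spherical parameterization (7.8) of d in R^5, with sum(d_k^2) = r."""
    s = np.sqrt(r)
    return np.array([
        s * np.cos(phi1),
        s * np.sin(phi1) * np.sin(phi2) * np.sin(phi3),
        s * np.sin(phi1) * np.sin(phi2) * np.cos(phi3),
        s * np.sin(phi1) * np.cos(phi2) * np.sin(phi4),
        s * np.sin(phi1) * np.cos(phi2) * np.cos(phi4),
    ])

d = location_vector(9.0, 0.3, 1.1, 2.0, 0.7)
print(np.sum(d ** 2))  # equals r = 9.0 up to rounding
```

Looping `r` over the 1001-point grid and each angle over its 30-point grid reproduces the $30^4$ evaluations per value of $r$ described above.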
8 Conclusion
This paper studies an averaging GMM estimator that combines the conservative estimator and the aggressive estimator with a data-dependent weight. The averaging weight is the sample analog of an optimal non-random weight. We provide a sufficient class of drifting DGPs under which the pointwise asymptotic results combine to yield uniform approximations to the finite-sample risk difference between the two estimators. Using this asymptotic approximation, we show that the proposed averaging GMM estimator uniformly dominates the conservative GMM estimator.

Inference based on the averaging estimator is an interesting and challenging problem. In addition to uniform validity, a desirable confidence set should have smaller volume than the one obtained from the conservative moments alone. We leave the inference problem to future research.
References
[1] Andrews, D. W. K. (1999): "Consistent Moment Selection Procedures for Generalized Method of Moments Estimation," Econometrica, 67(3), 543-563.

[2] Andrews, D. W. K., X. Cheng and P. Guggenberger (2011): "Generic Results for Establishing the Asymptotic Size of Confidence Sets and Tests," Cowles Foundation Discussion Paper, No. 1813.

[3] Andrews, D. W. K. and P. Guggenberger (2006): "On the Asymptotic Maximal Risk of Estimators and the Hausman Pre-Test Estimator," unpublished manuscript.

[4] Andrews, D. W. K. and B. Lu (2001): "Consistent Model and Moment Selection Procedures for GMM Estimation with Application to Dynamic Panel Data Models," Journal of Econometrics, 101(1), 123-164.

[5] Ashley, R. (2009): "Assessing the Credibility of Instrumental Variables Inference with Imperfect Instruments via Sensitivity Analysis," Journal of Applied Econometrics, 24, 325-337.

[6] Berkowitz, D., M. Caner and Y. Fang (2012): "The Validity of Instruments Revisited," Journal of Econometrics, 166(2), 255-266.

[7] Buckland, S. T., K. P. Burnham and N. H. Augustin (1997): "Model Selection: An Integral Part of Inference," Biometrics, 53, 603-618.

[8] Cheng, X. and B. Hansen (2015): "Forecasting with Factor-Augmented Regression: A Frequentist Model Averaging Approach," Journal of Econometrics, 186(2), 280-293.

[9] Cheng, X. and Z. Liao (2015): "Select the Valid and Relevant Moments: An Information-Based LASSO for GMM with Many Moments," Journal of Econometrics, 186(2), 443-464.

[10] Claeskens, G. and R. J. Carroll (2007): "An Asymptotic Theory for Model Selection Inference in General Semiparametric Problems," Biometrika, 94, 249-265.

[11] Conley, T. G., C. B. Hansen and P. E. Rossi (2012): "Plausibly Exogenous," Review of Economics and Statistics, 94(1), 260-272.

[12] Doko Tchatoka, F. and J.-M. Dufour (2008): "Instrument Endogeneity and Identification-Robust Tests: Some Analytical Results," Journal of Statistical Planning and Inference, 138(9), 2649-2661.

[13] Doko Tchatoka, F. and J.-M. Dufour (2014): "Identification-Robust Inference for Endogeneity Parameters in Linear Structural Models," Econometrics Journal, 17, 165-187.

[14] DiTraglia, F. (2014): "Using Invalid Instruments on Purpose: Focused Moment Selection and Averaging for GMM," University of Pennsylvania, working paper.

[15] Eichenbaum, M., L. Hansen and K. Singleton (1988): "A Time Series Analysis of Representative Agent Models of Consumption and Leisure Choice under Uncertainty," Quarterly Journal of Economics, 103(1), 51-78.

[16] Guggenberger, P. (2012): "On the Asymptotic Size Distortion of Tests when Instruments Locally Violate the Exogeneity Assumption," Econometric Theory, 28(2), 387-421.

[17] Hansen, B. E. (2007): "Least Squares Model Averaging," Econometrica, 75, 1175-1189.

[18] Hansen, B. E. (2015): "A Stein-Like 2SLS Estimator," Econometric Reviews, forthcoming.

[19] Hansen, B. E. (2016): "Efficient Shrinkage in Parametric Models," Journal of Econometrics, 190(1), 115-132.

[20] Hansen, B. E. and J. Racine (2012): "Jackknife Model Averaging," Journal of Econometrics, 167, 38-46.

[21] Hansen, L. P. (1982): "Large Sample Properties of Generalized Method of Moments Estimators," Econometrica, 50, 1029-1054.

[22] Heckman, J. J. (1976): "Estimation of a Human Capital Production Function Embedded in a Life-Cycle Model of Labor Supply," in Household Production and Consumption, NBER, 225-264.

[23] Hong, H., B. Preston and M. Shum (2003): "Generalized Empirical Likelihood-Based Model Selection Criteria for Moment Condition Models," Econometric Theory, 19(6), 923-943.

[24] Hjort, N. L. and G. Claeskens (2003): "Frequentist Model Average Estimators," Journal of the American Statistical Association, 98, 879-899.

[25] Hjort, N. L. and G. Claeskens (2006): "Focused Information Criteria and Model Averaging for the Cox Hazard Regression Model," Journal of the American Statistical Association, 101, 1449-1464.

[26] Hokayem, C. and J. P. Ziliak (2014): "Health, Human Capital, and Life Cycle Labor Supply," American Economic Review: Papers & Proceedings, 104(5), 127-131.

[27] Imai, S. and M. P. Keane (2004): "Intertemporal Labor Supply and Human Capital Accumulation," International Economic Review, 45, 601-641.

[28] James, W. and C. Stein (1961): "Estimation with Quadratic Loss," Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, 361-380.

[29] Leeb, H. and B. M. Pötscher (2008): "Sparse Estimators and the Oracle Property, or the Return of the Hodges Estimator," Journal of Econometrics, 142(1), 201-211.

[30] Leung, G. and A. Barron (2006): "Information Theory and Mixing Least-Squares Regressions," IEEE Transactions on Information Theory, 52, 3396-3410.

[31] Kolesar, M., R. Chetty, J. Friedman, E. Glaeser and G. Imbens (2014): "Identification and Inference with Many Invalid Instruments," Journal of Business & Economic Statistics, forthcoming.

[32] Liao, Z. (2013): "Adaptive GMM Shrinkage Estimation with Consistent Moment Selection," Econometric Theory, 29, 1-48.

[33] Liu, C.-A. (2015): "Distribution Theory of the Least Squares Averaging Estimator," Journal of Econometrics, 186(1), 142-159.

[34] Lu, X. and L. Su (2015): "Jackknife Model Averaging for Quantile Regressions," Journal of Econometrics, forthcoming.

[35] Moon, H. R. and F. Schorfheide (2009): "Estimation with Overidentifying Inequality Moment Conditions," Journal of Econometrics, 153(2), 136-154.

[36] Nevo, A. and A. Rosen (2012): "Identification with Imperfect Instruments," Review of Economics and Statistics, 93(3), 659-671.

[37] Rudin, W. (1976): Principles of Mathematical Analysis, International Series in Pure and Applied Mathematics, McGraw-Hill, New York.

[38] Sargan, J. (1958): "The Estimation of Economic Relationships Using Instrumental Variables," Econometrica, 26(3), 393-415.

[39] Shaw, K. (1989): "Life-Cycle Labor Supply with Human Capital Accumulation," International Economic Review, 30, 431-456.

[40] Small, D. S. (2007): "Sensitivity Analysis for Instrumental Variables Regression with Overidentifying Restrictions," Journal of the American Statistical Association, 102(479), 1049-1058.

[41] Small, D. S., T. T. Cai, A. Zhang and H. Kang (2015): "Instrumental Variables Estimation with Some Invalid Instruments and Its Application to Mendelian Randomization," Journal of the American Statistical Association, forthcoming.

[42] Stein, C. M. (1956): "Inadmissibility of the Usual Estimator for the Mean of a Multivariate Normal Distribution," Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, 197-206.

[43] Yang, Y. (2005): "Can the Strengths of AIC and BIC Be Shared? A Conflict Between Model Identification and Regression Estimation," Biometrika, 92, 937-950.
A Appendix
Proof of Lemma 3.1. By Assumption 3.4.(i), $E_F[g_1(W_i, \theta_0)] = E_{F^*}[Z_{1,i} U_i] = 0$ for the true value $\theta_0$ in $Y_i = X_i'\theta_0 + U_i$. Also, $\theta_0 \in \mathrm{int}(\Theta)$ is assumed in Condition (i) of the Lemma. This verifies Assumption 3.1.(i). For any $\theta \in \Theta$ with $\|\theta - \theta_0\| \geq \varepsilon$,
$$\|E_F[g_1(W_i, \theta)]\| = \|G_1(\theta_0 - \theta)\| \geq (\lambda_{\min}(G_1'G_1))^{1/2}\|\theta_0 - \theta\| \geq C^{1/2}\varepsilon > 0 \qquad (A.1)$$
by Assumption 3.4.(ii), which verifies Assumption 3.1.(ii). To show Assumption 3.1.(iii), note that
$$Q_F(\theta) = E_F[Z_{2,i}(Y_i - X_i'\theta)]'\,\Omega_2^{-1}\, E_F[Z_{2,i}(Y_i - X_i'\theta)]$$
$$= \theta' G_2'\Omega_2^{-1} G_2 \theta - 2\theta' G_2' \Omega_2^{-1} E_{F^*}[Z_{2,i} Y_i] + E_{F^*}[Y_i Z_{2,i}']\,\Omega_2^{-1}\, E_{F^*}[Y_i Z_{2,i}], \qquad (A.2)$$
which uniquely identifies $\theta^*$ by Assumption 3.4.(ii). Moreover, $\theta^* \in \mathrm{int}(\Theta)$ by Condition (i) of the Lemma. Assumption 3.1.(iv) is imposed in Assumption 3.4.(iv).
To verify Assumption 3.2, note that $g_2(W_i, \theta) = Z_{2,i}[X_i'(\theta_0 - \theta) + U_i]$, $g_{2,\theta}(W_i, \theta) = -Z_{2,i} X_i'$ and $g_{2,\theta\theta}(W_i, \theta) = 0$. Assumption 3.2.(i) follows from Assumption 3.4.(iii) and the boundedness of $\Theta$ and $[0, C]^{r^*}$. Assumptions 3.2.(ii)-(iii) follow from Assumptions 3.4.(ii)-(iii).

Note that $M_{2,F}(\theta) - M_{2,F}(\theta_0) = E_F[g_2(W_i, \theta)] - E_F[g_2(W_i, \theta_0)] = G_2(\theta - \theta_0)$, which implies that
$$v_0(\theta) = \left(\mathrm{vec}[G_2]',\ \mathrm{vech}[\Omega_2]',\ (\theta - \theta_0)'G_2'\right)'. \qquad (A.3)$$
In this example, we have $\Phi = [0, C]^{r^*} \times V$, where $V = \{v_0(\theta) : F^* \in \mathcal{F}^*\ \text{and}\ \theta \in \Theta\}$. Assumption 3.3.(i) holds because $\Phi$ is a product space. Assumption 3.3.(ii) holds by Assumption 3.4.(v) and Condition (i) of the Lemma.
A.1 Proofs for the asymptotic distributions of the estimators in Section 4

Let $\nu_n(g_2(W, \theta)) = n^{-1/2}\sum_{i=1}^n \left(g_2(W_i, \theta) - E_{F_n}[g_2(W_i, \theta)]\right)$.

Lemma A.1 Suppose that Assumption 3.2.(i) holds and $\Theta$ is compact. Then we have:
(i) $\sup_{\theta\in\Theta}\left\|\bar g_2(\theta) - E_{F_n}[g_2(W_i,\theta)]\right\| = o_p(1)$;
(ii) $\sup_{\theta\in\Theta}\left\|n^{-1}\sum_{i=1}^n g_2(W_i,\theta)g_2(W_i,\theta)' - E_{F_n}[g_2(W_i,\theta)g_2(W_i,\theta)']\right\| = o_p(1)$;
(iii) $\sup_{\theta\in\Theta}\left\|n^{-1}\sum_{i=1}^n g_{2,\theta}(W_i,\theta) - E_{F_n}[g_{2,\theta}(W_i,\theta)]\right\| = o_p(1)$;
(iv) $\nu_n(g_2(W,\theta))$ is stochastically equicontinuous over $\theta\in\Theta$;
(v) $\Omega_{2,F_n}^{-1/2}(\theta_n)\,\nu_n(g_2(W,\theta_n)) \to_d N(0_{r_2\times 1}, I_{r_2})$.
Proof of Lemma A.1. See Lemmas 11.3-11.5 of Andrews and Cheng (2013).
Proof of Lemma 4.1. Define $v_0(\theta) \equiv (\mathrm{vec}[G_{2,F}(\theta)]', \mathrm{vech}[\Omega_{2,F}(\theta)]', M_{2,F}(\theta)')'$. By the Arzelà-Ascoli Theorem, $\{v_0(\theta) : \theta \in \Theta\ \text{and}\ F \in \mathcal{F}\}$ is compact because it is equicontinuous and bounded by Assumption 3.2 and the compactness of $\Theta$, and it is closed by Assumption 3.3. Hence, for the convergent sequences $G_{2,F_n}(\theta) \to G_2(\theta)$, $\Omega_{2,F_n}(\theta) \to \Omega_2(\theta)$ and $M_{2,F_n}(\theta) \to M_2(\theta)$, there exists $F \in \mathcal{F}$ such that
$$G_2(\theta) = G_{2,F}(\theta),\quad \Omega_2(\theta) = \Omega_{2,F}(\theta)\quad \text{and}\quad M_2(\theta) = M_{2,F}(\theta), \qquad (A.4)$$
which proves the Lemma.
Proof of Lemma 4.2. First, we provide formulae for $\tilde\theta_1$ and $\bar\Omega_2$, which are used in the construction of $\hat\theta_1$ and $\hat\theta_2$, and show the consistency of $\tilde\theta_1$ and $\bar\Omega_2$ under $\{F_n\}$. In the proof below, we use $G_{2,F_n}(\theta) \to G_{2,F}(\theta)$, $\Omega_{2,F_n}(\theta) \to \Omega_{2,F}(\theta)$, $M_{2,F_n}(\theta) \to M_{2,F}(\theta)$ following Lemma 4.1, and the fact that $G_{2,F}(\theta)$, $\Omega_{2,F}(\theta)$ and $M_{2,F}(\theta)$ are continuous functions of $\theta$ for any $F \in \mathcal{F}$. The continuity follows from Assumption 3.2.(i) and the dominated convergence theorem (DCT).

The preliminary estimator $\tilde\theta_1$ is defined as
$$\tilde\theta_1 \equiv \arg\min_{\theta\in\Theta}\ \bar g_1(\theta)'\bar g_1(\theta), \qquad (A.5)$$
which is based on the first $r_1$ valid moments and an identity weighting matrix. For ease of notation, we write
$$G_{k,F_n} \equiv G_{k,F_n}(\theta_n)\quad \text{and}\quad \Omega_{k,F_n} \equiv \Omega_{k,F_n}(\theta_n)\quad \text{for } k = 1, 2. \qquad (A.6)$$
By Lemma A.1.(i),
$$\bar g_2(\theta) = M_{2,F_n}(\theta) + \left[n^{-1}\sum_{i=1}^n g_2(W_i,\theta) - M_{2,F_n}(\theta)\right] = M_{2,F_n}(\theta) + o_p(1) \qquad (A.7)$$
uniformly over $\theta\in\Theta$. As $g_1(W,\theta)$ is a subvector of $g_2(W,\theta)$, by (A.7) and Assumption 3.2.(i),
$$\bar g_1(\theta)'\bar g_1(\theta) = M_{1,F_n}(\theta)'M_{1,F_n}(\theta) + o_p(1) \qquad (A.8)$$
uniformly over $\theta\in\Theta$.

By Assumption 3.1.(i) and $F_n\in\mathcal{F}$, $M_{1,F_n}(\theta)'M_{1,F_n}(\theta)$ is uniquely minimized at $\theta_n$, which together with the uniform convergence in (A.8) implies that $\tilde\theta_1 - \theta_n \to_p 0$. By $M_{1,F_n}(\theta)\to M_{1,F}(\theta)$ and Assumption 3.2.(i),
$$M_{1,F_n}(\theta)'M_{1,F_n}(\theta) = M_{1,F}(\theta)'M_{1,F}(\theta) + o(1). \qquad (A.9)$$
Because $\theta_n$ and $\theta_0$ are uniquely identified by $M_{1,F_n}(\theta) = 0$ and $M_{1,F}(\theta) = 0$, respectively, we also have $\theta_n = \theta_0 + o(1)$.
To show the consistency of $\bar\Omega_2$, note that
$$\bar\Omega_2 = n^{-1}\sum_{i=1}^n g_2(W_i,\tilde\theta_1)g_2(W_i,\tilde\theta_1)' - \bar g_2(\tilde\theta_1)\bar g_2(\tilde\theta_1)'$$
$$= E_{F_n}[g_2(W,\tilde\theta_1)g_2(W,\tilde\theta_1)'] - M_{2,F_n}(\tilde\theta_1)M_{2,F_n}(\tilde\theta_1)' + o_p(1)$$
$$= \Omega_{2,F_n}(\tilde\theta_1) + o_p(1) = \Omega_{2,F_n} + o_p(1), \qquad (A.10)$$
where the first equality is the definition, the second equality holds by (A.7), Lemma A.1.(ii) and Assumption 3.2.(i), the third equality follows from the definition of $\Omega_{2,F}(\theta)$, and the last equality holds because
$$\Omega_{2,F_n}(\tilde\theta_1) - \Omega_{2,F_n}(\theta_n) = \left[\Omega_{2,F_n}(\tilde\theta_1) - \Omega_{2,F}(\tilde\theta_1)\right] + \left[\Omega_{2,F}(\tilde\theta_1) - \Omega_{2,F}(\theta_n)\right] + \left[\Omega_{2,F}(\theta_n) - \Omega_{2,F_n}(\theta_n)\right] = o_p(1) \qquad (A.11)$$
by $\Omega_{2,F_n}(\theta) \to \Omega_{2,F}(\theta)$, the continuity of $\Omega_{2,F}(\theta)$, and the consistency of $\tilde\theta_1$. This shows the consistency of $\bar\Omega_2$.
We next show that under $\{F_n\}$,
$$\hat\theta_1 - \theta_n \to_p 0 \quad\text{and}\quad \hat\theta_2 - \theta^*_n \to_p 0, \qquad (A.12)$$
where $\theta^*_n$ denotes the pseudo-true value that uniquely minimizes the criterion function $Q_{F_n}(\theta) = M_{2,F_n}(\theta)'\Omega_{2,F_n}^{-1}M_{2,F_n}(\theta)$. By (A.7), (A.10) and Assumptions 3.2.(i)-(ii), we have
$$\bar g_2(\theta)'(\bar\Omega_2)^{-1}\bar g_2(\theta) = M_{2,F_n}(\theta)'\Omega_{2,F_n}^{-1}M_{2,F_n}(\theta) + o_p(1) \qquad (A.13)$$
uniformly over $\theta\in\Theta$. In addition, $M_{2,F_n}(\theta)'\Omega_{2,F_n}^{-1}M_{2,F_n}(\theta)$ uniquely identifies $\theta^*_n$. The consistency result $\hat\theta_2 - \theta^*_n \to_p 0$ follows from standard arguments for the consistency of an extremum estimator. As $\bar g_1(\theta)$ is a subvector of $\bar g_2(\theta)$, and $\bar\Omega_1$ is a submatrix of $\bar\Omega_2$, using (A.7), (A.10) and Assumptions 3.2.(i)-(ii), we have
$$\bar g_1(\theta)'(\bar\Omega_1)^{-1}\bar g_1(\theta) = M_{1,F_n}(\theta)'\Omega_{1,F_n}^{-1}M_{1,F_n}(\theta) + o_p(1) \qquad (A.14)$$
uniformly over $\theta\in\Theta$. As $M_{1,F_n}(\theta)'\Omega_{1,F_n}^{-1}M_{1,F_n}(\theta)$ is uniquely minimized at $\theta_n$, $\hat\theta_1 - \theta_n \to_p 0$ follows from the unique identification of $\theta_n$ and the uniform convergence in (A.14).

Next, we derive the asymptotic distributions under $\{F_n\}$. We start with the case $\delta_n = o(1)$ and show that $\hat\theta_2 - \theta_n = o_p(1)$ in this case. First, recall that the pseudo-true value $\theta^*_n$ for $n$ and the pseudo-true value $\theta^*$ in the limit are the unique minimizers of $Q_{F_n}(\theta)$ and $Q_F(\theta)$, respectively, both continuous functions of $\theta$. We have $\theta^*_n \to \theta^*$ by $Q_{F_n}(\theta) \to Q_F(\theta)$, which in turn follows from $M_{2,F_n}(\theta) \to M_{2,F}(\theta)$,
$$\Omega_{2,F_n}(\theta_n) \to \Omega_{2,F}(\theta_0), \qquad (A.15)$$
$\theta_n \to \theta_0$, and the continuity of $\Omega_{2,F}(\theta)$. The convergence in (A.15) follows from $\Omega_{2,F_n}(\theta) \to \Omega_{2,F}(\theta)$, the continuity of $\Omega_{2,F}(\theta)$, and $\theta_n \to \theta_0$. Second, we show that if $\delta_n = M_{2,F_n}(\theta_n) \to 0$, then $\theta^* = \theta_0$ because (i) $Q_F(\theta) = M_{2,F}(\theta)'\Omega_2^{-1}M_{2,F}(\theta)$ has a unique minimizer, namely $\theta^*$, and (ii) $M_{2,F}(\theta_0) = 0$. To see (ii), note that $M_{2,F_n}(\theta_n) \to 0_{r_2\times 1}$ and
$$M_{2,F_n}(\theta_n) \to M_{2,F}(\theta_0) = 0_{r_2\times 1}, \qquad (A.16)$$
where the convergence follows from the same arguments as for (A.15). Putting together (i) $\hat\theta_2 - \theta^*_n \to_p 0$, (ii) $\theta^*_n \to \theta^*$, (iii) $\theta^* = \theta_0$, and (iv) $\theta_n \to \theta_0$, we have $\hat\theta_2 - \theta_n \to_p 0$.
Using $\hat\theta_2 - \theta_n = o_p(1)$, Lemma A.1.(iv) and Assumption 3.2.(i),
$$\bar g_2(\hat\theta_2) = \bar g_2(\theta_n) + \left[M_{2,F_n}(\hat\theta_2) - M_{2,F_n}(\theta_n)\right] + o_p(n^{-1/2})$$
$$= \bar g_2(\theta_n) + \left[G_{2,F_n}(\theta_n) + o_p(1)\right](\hat\theta_2 - \theta_n) + o_p(n^{-1/2}). \qquad (A.17)$$
We have
$$n^{-1}\sum_{i=1}^n g_{2,\theta}(W_i,\hat\theta_2) = G_{2,F_n}(\hat\theta_2) + o_p(1) = G_{2,F_n}(\theta_n) + o_p(1), \qquad (A.18)$$
where the first equality follows from Lemma A.1.(iii) and the second follows from the same argument used to show (A.11). From the first-order condition for the GMM estimator $\hat\theta_2$, we deduce that
$$0 = \left[n^{-1}\sum_{i=1}^n g_{2,\theta}(W_i,\hat\theta_2)\right]'(\bar\Omega_2)^{-1}\bar g_2(\hat\theta_2)$$
$$= \left(G_{2,F_n}'\Omega_{2,F_n}^{-1} + o_p(1)\right)\left[\bar g_2(\theta_n) + (G_{2,F_n} + o_p(1))(\hat\theta_2 - \theta_n) + o_p(n^{-1/2})\right], \qquad (A.19)$$
where the second equality follows from (A.10), (A.17) and (A.18). By (A.19) and Assumption 3.2,
$$n^{1/2}(\hat\theta_2 - \theta_n) = (\Gamma_{2,n} + o_p(1))\left\{\nu_n(g_2(W,\theta_n)) + n^{1/2}E_{F_n}[g_2(W,\theta_n)]\right\} + o_p(1), \qquad (A.20)$$
where $\Gamma_{2,n} = -\left(G_{2,F_n}'\Omega_{2,F_n}^{-1}G_{2,F_n}\right)^{-1}G_{2,F_n}'\Omega_{2,F_n}^{-1}$. Similarly, we can show that
$$n^{1/2}(\hat\theta_1 - \theta_n) = \Gamma_{1,n}\,\nu_n(g_1(W,\theta_n)) + o_p(1), \qquad (A.21)$$
where $\Gamma_{1,n} = -\left(G_{1,F_n}'\Omega_{1,F_n}^{-1}G_{1,F_n}\right)^{-1}G_{1,F_n}'\Omega_{1,F_n}^{-1}$. Combining (A.20) and (A.21) yields
$$\begin{pmatrix} n^{1/2}(\hat\theta_1 - \theta_n) \\ n^{1/2}(\hat\theta_2 - \theta_n) \end{pmatrix} = \begin{pmatrix} \Gamma_{1,n}\,\nu_n(g_1(W,\theta_n)) \\ (\Gamma_{2,n} + o_p(1))\left[\nu_n(g_2(W,\theta_n)) + n^{1/2}\delta_{0,n}\right] \end{pmatrix} + o_p(1), \qquad (A.22)$$
where $\delta_{0,n} = (0_{r_1\times 1}', \delta_n')'$.

Suppose that $n^{1/2}\delta_n \to d$ for some $d\in R^{r^*}$. We have $\Gamma_{1,n} = \Gamma_1 + o(1)$ and $\Gamma_{2,n} = \Gamma_2 + o(1)$ by the arguments used to show (A.15), which together with $n^{1/2}\delta_n = d + o(1)$, (A.15), (A.22), Lemma A.1.(v) and the continuous mapping theorem (CMT) implies that
$$\begin{pmatrix} n^{1/2}(\hat\theta_1 - \theta_n) \\ n^{1/2}(\hat\theta_2 - \theta_n) \end{pmatrix} \to_d \begin{pmatrix} \bar\Gamma_1 \\ \Gamma_2 \end{pmatrix}(Z_2 + d_0), \quad\text{where } Z_2 \sim N(0_{r_2\times 1}, \Omega_2) \qquad (A.23)$$
and $\bar\Gamma_1 \equiv [\Gamma_1, 0_{d_\theta\times r^*}]$. Part (i) of the Lemma follows from (A.23) and the definitions of $\bar\Gamma_1$ and $Z_1$.
Now we consider the case where $\delta_n = o(1)$ and $\|n^{1/2}\delta_n\| \to \infty$. Note that (A.15) and (A.21) do not depend on $\delta_n$. By $\Gamma_{1,n} = \Gamma_1 + o(1)$, (A.15), (A.21), Lemma A.1.(v) and the CMT,
$$n^{1/2}(\hat\theta_1 - \theta_n) \to_d \Gamma_1 Z_1. \qquad (A.24)$$
Assumption 3.2, (A.20) and Lemma A.1.(v) imply that
$$n^{1/2}(\hat\theta_2 - \theta_n) = (\Gamma_{2,n} + o_p(1))\,n^{1/2}\delta_{0,n} + O_p(1). \qquad (A.25)$$
By Assumption 3.1.(iv) and $\|n^{1/2}\delta_n\| \to \infty$,
$$n\,\delta_{0,n}'\Gamma_{2,n}'\Gamma_{2,n}\delta_{0,n} \geq C^{-2}\, n\,\delta_n'\delta_n \to \infty, \qquad (A.26)$$
which together with (A.25) implies that $\|n^{1/2}(\hat\theta_2 - \theta_n)\| \to_p \infty$.

Finally, we consider the case where $\|\delta_n\| > C^{-1}$. By Assumption 3.1.(iv), $\|G_{2,F_n}'\Omega_{2,F_n}^{-1}\delta_{0,n}\| > C^{-1}\|\delta_n\| > C^{-2}$, which implies that
$$\left\|G_{2,F}'\Omega_2^{-1}M_{2,F}\right\| > 2^{-1}C^{-2} \qquad (A.27)$$
by $G_{2,F_n} \to G_{2,F}$ and $\Omega_{2,F_n} \to \Omega_{2,F}$ as in (A.15). Because $\theta^*$ is the unique minimizer of $M_{2,F}(\theta)'\Omega_2^{-1}M_{2,F}(\theta)$, we have $G_{2,F}(\theta^*)'\Omega_2^{-1}M_{2,F}(\theta^*) = 0_{d_\theta}$, which combined with (A.27) and the continuity of $G_{2,F}(\theta)$ and $M_{2,F}(\theta)$ implies that
$$\|\theta^* - \theta_0\| > \varepsilon \qquad (A.28)$$
for some $\varepsilon > 0$. Using $\hat\theta_2 = \theta^*_n + o_p(1)$, $\theta^*_n = \theta^* + o(1)$, $\theta_n = \theta_0 + o(1)$, (A.28) and the triangle inequality, we obtain
$$n^{1/2}\left\|\hat\theta_2 - \theta_n\right\| \geq n^{1/2}\left[\|\theta^* - \theta_0\| - \|\hat\theta_2 - \theta^*_n\| - \|\theta^*_n - \theta^*\| - \|\theta_n - \theta_0\|\right] = n^{1/2}\|\theta^* - \theta_0\|(1 + o_p(1)) \to_p \infty, \qquad (A.29)$$
which finishes the proof.
Proof of the claim in equation (4.3). Consider the case $n^{1/2}\delta_n \to d \in R^{r^*}$. By Lemma 4.2,
$$n^{1/2}\left[\hat\theta(\omega) - \theta_n\right] = n^{1/2}(\hat\theta_1 - \theta_n) + \omega\left[n^{1/2}(\hat\theta_2 - \theta_n) - n^{1/2}(\hat\theta_1 - \theta_n)\right] \to_d \bar\Gamma_1 Z_{d,2} + \omega(\Gamma_2 - \bar\Gamma_1)Z_{d,2}. \qquad (A.30)$$
This implies that
$$\ell_\Lambda(\hat\theta(\omega)) = n\left[\hat\theta(\omega) - \theta_n\right]'\Lambda\left[\hat\theta(\omega) - \theta_n\right] \to_d \xi(\omega), \qquad (A.31)$$
where $\xi(\omega) = Z_{d,2}'\bar\Gamma_1'\Lambda\bar\Gamma_1 Z_{d,2} + 2\omega Z_{d,2}'(\Gamma_2 - \bar\Gamma_1)'\Lambda\bar\Gamma_1 Z_{d,2} + \omega^2 Z_{d,2}'(\Gamma_2 - \bar\Gamma_1)'\Lambda(\Gamma_2 - \bar\Gamma_1)Z_{d,2}$.
Now we consider $E[\xi(\omega)]$ using the equalities in Lemma A.2 below. First,
$$E[Z_{d,2}'\bar\Gamma_1'\Lambda\bar\Gamma_1 Z_{d,2}] = \mathrm{tr}(\Lambda\Sigma_1) \qquad (A.32)$$
because $\bar\Gamma_1 Z_{d,2} = \Gamma_1 Z_1$ and $\Gamma_1 E(Z_1 Z_1')\Gamma_1' = \Sigma_1$ by definition. Second,
$$E\left[Z_{d,2}'(\Gamma_2 - \bar\Gamma_1)'\Lambda\bar\Gamma_1 Z_{d,2}\right] = \mathrm{tr}\left(\Lambda\bar\Gamma_1 E\left[Z_{d,2}Z_{d,2}'\right](\Gamma_2 - \bar\Gamma_1)'\right)$$
$$= \mathrm{tr}\left(\Lambda\bar\Gamma_1\left(d_0 d_0' + \Omega_2\right)(\Gamma_2 - \bar\Gamma_1)'\right) = \mathrm{tr}(\Lambda(\Sigma_2 - \Sigma_1)), \qquad (A.33)$$
where the last equality holds by Lemma A.2. Third,
$$E\left[Z_{d,2}'(\Gamma_2 - \bar\Gamma_1)'\Lambda(\Gamma_2 - \bar\Gamma_1)Z_{d,2}\right] = \mathrm{tr}\left(\Lambda(\Gamma_2 - \bar\Gamma_1)\left(d_0 d_0' + \Omega_2\right)(\Gamma_2 - \bar\Gamma_1)'\right)$$
$$= d_0'\Gamma_2'\Lambda\Gamma_2 d_0 + \mathrm{tr}(\Lambda(\Sigma_1 - \Sigma_2)) \qquad (A.34)$$
by Lemma A.2. Combining the results in (A.32)-(A.34), we obtain
$$E[\xi(\omega)] = \mathrm{tr}(\Lambda\Sigma_1) - 2\omega\,\mathrm{tr}\left(\Lambda(\Sigma_1 - \Sigma_2)\right) + \omega^2\left[d_0'\Gamma_2'\Lambda\Gamma_2 d_0 + \mathrm{tr}\left(\Lambda(\Sigma_1 - \Sigma_2)\right)\right]. \qquad (A.35)$$
Note that $d_0'\Gamma_2'\Lambda\Gamma_2 d_0 = d_0'(\Gamma_2 - \bar\Gamma_1)'\Lambda(\Gamma_2 - \bar\Gamma_1)d_0$ because $\bar\Gamma_1 d_0 = 0_{d_\theta}$. The optimal weight $\omega^*$ minimizes the quadratic function of $\omega$ in (A.35).
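For completeness, minimizing the quadratic in (A.35) gives the optimal weight in closed form (a short derivation we add for the reader; write $a = \mathrm{tr}(\Lambda(\Sigma_1-\Sigma_2))$ and $b = d_0'\Gamma_2'\Lambda\Gamma_2 d_0$, and assume $a + b > 0$):
$$\frac{d}{d\omega}\left[\mathrm{tr}(\Lambda\Sigma_1) - 2\omega a + \omega^2(a + b)\right] = -2a + 2\omega(a+b) = 0
\;\Longrightarrow\;
\omega^* = \frac{a}{a+b} = \frac{\mathrm{tr}\left(\Lambda(\Sigma_1-\Sigma_2)\right)}{d_0'\Gamma_2'\Lambda\Gamma_2 d_0 + \mathrm{tr}\left(\Lambda(\Sigma_1-\Sigma_2)\right)}.$$
Replacing $a$ and $b$ by their sample analogs yields the empirical weight $\tilde\omega_{eo}$.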
Lemma A.2 (a) $\bar\Gamma_1 d_0 = 0_{d_\theta}$; (b) $\bar\Gamma_1\Omega_2\bar\Gamma_1' = \Sigma_1$; (c) $\bar\Gamma_1\Omega_2\Gamma_2' = \Sigma_2$; (d) $\Gamma_2\Omega_2\Gamma_2' = \Sigma_2$.
Proof of Lemma A.2. By construction, $\bar\Gamma_1 d_0 = 0_{d_\theta}$. For ease of notation, we write $\Omega_2$ and $G_2$ as
$$\Omega_2 = \begin{pmatrix} \Omega_1 & \Omega_{1r} \\ \Omega_{r1} & \Omega_r \end{pmatrix} \quad\text{and}\quad G_2 = \begin{pmatrix} G_1 \\ G_r \end{pmatrix}. \qquad (A.36)$$
To prove part (b), we have
$$\bar\Gamma_1\Omega_2\bar\Gamma_1' = [\Gamma_1, 0_{d_\theta\times r^*}]\begin{pmatrix} \Omega_1 & \Omega_{1r} \\ \Omega_{r1} & \Omega_r \end{pmatrix}[\Gamma_1, 0_{d_\theta\times r^*}]' = \Gamma_1\Omega_1\Gamma_1' = \left(G_1'\Omega_1^{-1}G_1\right)^{-1} = \Sigma_1. \qquad (A.37)$$
To show part (c), note that
$$\bar\Gamma_1\Omega_2\Gamma_2' = -[\Gamma_1, 0_{d_\theta\times r^*}]\,\Omega_2\Omega_2^{-1}G_2\left(G_2'\Omega_2^{-1}G_2\right)^{-1} = -\Gamma_1 G_1\left(G_2'\Omega_2^{-1}G_2\right)^{-1} = \left(G_2'\Omega_2^{-1}G_2\right)^{-1} = \Sigma_2 \qquad (A.38)$$
because $-\Gamma_1 G_1 = I_{d_\theta\times d_\theta}$. Part (d) follows from the definition of $\Sigma_2$.
Proof of Lemma 4.3. We first prove the consistency of $\hat\Omega_k$, $\hat G_k$ and $\hat\Sigma_k$ for $k = 1, 2$. By Lemma 4.2, $\hat\theta_1$ satisfies $\hat\theta_1 = \theta_n + o_p(1)$ and $\hat\theta_1 = \theta_0 + o_p(1)$. Using the same arguments as in showing (A.10) and (A.15), we can show that
$$\hat\Omega_2 = \Omega_{2,F_n} + o_p(1) = \Omega_2 + o_p(1). \qquad (A.39)$$
As $\hat\Omega_1$ is a submatrix of $\hat\Omega_2$, by (A.39) we have
$$\hat\Omega_1 = \Omega_{1,F_n} + o_p(1) = \Omega_1 + o_p(1). \qquad (A.40)$$
By the consistency of $\hat\theta_1$ and the same arguments used to show (A.18), we have
$$n^{-1}\sum_{i=1}^n g_{2,\theta}(W_i,\hat\theta_1) = G_{2,F_n}(\theta_n) + o_p(1) = G_{2,F} + o_p(1). \qquad (A.41)$$
As $n^{-1}\sum_{i=1}^n g_{1,\theta}(W_i,\hat\theta_1)$ is a submatrix of $n^{-1}\sum_{i=1}^n g_{2,\theta}(W_i,\hat\theta_1)$, by (A.41) we have
$$n^{-1}\sum_{i=1}^n g_{1,\theta}(W_i,\hat\theta_1) = G_{1,F_n}(\theta_n) + o_p(1) = G_{1,F} + o_p(1). \qquad (A.42)$$
From (A.39), (A.40), (A.41) and (A.42), we see that $\hat\Omega_k$ and $\hat G_k$ are consistent estimators of $\Omega_k$ and $G_k$, respectively, for $k = 1, 2$. By the Slutsky theorem, $\hat\Sigma_k$ is a consistent estimator of $\Sigma_k$ for $k = 1, 2$.

In the case $n^{1/2}\delta_n \to d \in R^{r^*}$, the desired result follows from Lemma 4.2, the consistency of $\hat\Sigma_1$ and $\hat\Sigma_2$, and the CMT. In the case $\|n^{1/2}\delta_n\| \to \infty$, $\tilde\omega_{eo} \to_p 0$ because $n^{1/2}\|\hat\theta_2 - \hat\theta_1\| \to_p \infty$; hence
$$n^{1/2}(\hat\theta_{eo} - \theta_n) = n^{1/2}(\hat\theta_1 - \theta_n) + \tilde\omega_{eo}\,n^{1/2}(\hat\theta_2 - \hat\theta_1) = n^{1/2}(\hat\theta_1 - \theta_n) + \frac{n^{1/2}(\hat\theta_2 - \hat\theta_1)\,\mathrm{tr}\left[\Lambda(\hat\Sigma_1 - \hat\Sigma_2)\right]}{n(\hat\theta_2 - \hat\theta_1)'\Lambda(\hat\theta_2 - \hat\theta_1) + \mathrm{tr}\left[\Lambda(\hat\Sigma_1 - \hat\Sigma_2)\right]} \to_d \Gamma_1 Z_1 \qquad (A.43)$$
by Lemma 4.2.
A.2 Proofs for the results in Section 5

We first present some generic results on the bounds of the asymptotic risk difference between two estimators under high-level conditions. We then apply these generic results to the two specific estimators considered in this paper: $\hat{\theta}_{eo}$ and $\hat{\theta}_1$.
Let
\[
v_n(\theta) = \big(\mathrm{vec}(G_{2,F_n}(\theta))', \mathrm{vech}(\Omega_{2,F_n}(\theta))', [M_{2,F_n}(\theta) - M_{2,F_n}(\theta_n)]'\big). \tag{A.44}
\]
Define $\delta_{2,0} = [0_{r_1\times 1}', \delta_0']'$. Note that (i) $\theta_n \to \theta_0$ and $v_n(\theta) \to (\mathrm{vec}(G_2(\theta))', \mathrm{vech}(\Omega_2(\theta))', [M_2(\theta) - \delta_{2,0}]')$ and (ii) $G_{2,F_n}(\theta) \to G_2(\theta)$, $\Omega_{2,F_n}(\theta) \to \Omega_2(\theta)$, $M_{2,F_n}(\theta) \to M_2(\theta)$ are equivalent statements, and we interchange them below for convenience.
Under such convergent sequences, define $h = (d', \mathrm{vec}(G_2)', \mathrm{vech}(\Omega_2)')$, where $G_2$ and $\Omega_2$ are evaluated at the true value in the limit, as defined in (4.1). Define
\[
H^* = \{h : n^{1/2}\delta_n \to d \in R^{r_\delta},\ \delta_n \to 0,\ v_n(\theta) \to (\mathrm{vec}(G_2(\theta))', \mathrm{vech}(\Omega_2(\theta))', M_2(\theta)') \text{ for some } \{F_n\}\}. \tag{A.45}
\]
Similarly, define
\[
H^*_\infty = \{h : \|n^{1/2}\delta_n\| \to \infty,\ \delta_n \to \delta_0,\ v_n(\theta) \to (\mathrm{vec}(G_2(\theta))', \mathrm{vech}(\Omega_2(\theta))', [M_2(\theta) - \delta_{2,0}]') \text{ for some } \{F_n\}\}, \tag{A.46}
\]
where $\delta_{2,0} = [0_{r_1\times 1}', \delta_0']'$ and we define $d = \infty$ if $\|n^{1/2}\delta_n\| \to \infty$.
Lemma A.3 Suppose the following conditions hold.
(i) If $n^{1/2}\delta_n \to d$ for $d \in R^{r_\delta} \cup \{\infty\}$, $\delta_n \to \delta_0$, and $v_n(\theta) \to v_0(\theta) = (\mathrm{vec}(G_2(\theta))', \mathrm{vech}(\Omega_2(\theta))', [M_2(\theta) - \delta_{2,0}]')$, then
\[
\lim_{n\to\infty} E_{F_n}[\ell_\zeta(\hat{\theta})] = R_\zeta(h) \in (0, \infty) \quad\text{and}\quad \lim_{n\to\infty} E_{F_n}[\ell_\zeta(\tilde{\theta})] = \tilde{R}_\zeta(h) \in (0, \infty);
\]
(ii) $\Lambda = \{(\delta_0, v_0(\cdot)) : \theta \in \Theta \text{ and } F \in \mathcal{F}\}$ is a compact set;
(iii) for some $\varepsilon > 0$, $(\delta, v(\cdot)) \in \Lambda$ with $\|\delta\| < \varepsilon$ implies that $(\tilde{\delta}, v(\cdot)) \in \Lambda$ for all $\tilde{\delta} \in R^{r_\delta}$ with $\|\tilde{\delta}\| < \varepsilon$.
Then the lower and upper bounds of the asymptotic risk difference between $\hat{\theta}$ and $\tilde{\theta}$ have the following representations:
\begin{align*}
\underline{\mathrm{AsyRD}}(\hat{\theta}, \tilde{\theta}) &= \lim_{\zeta\to\infty}\bigg(\min\bigg\{\inf_{h\in H^*}\big[R_\zeta(h) - \tilde{R}_\zeta(h)\big],\ \inf_{h\in H^*_\infty}\big[R_\zeta(h) - \tilde{R}_\zeta(h)\big]\bigg\}\bigg), \\
\overline{\mathrm{AsyRD}}(\hat{\theta}, \tilde{\theta}) &= \lim_{\zeta\to\infty}\bigg(\max\bigg\{\sup_{h\in H^*}\big[R_\zeta(h) - \tilde{R}_\zeta(h)\big],\ \sup_{h\in H^*_\infty}\big[R_\zeta(h) - \tilde{R}_\zeta(h)\big]\bigg\}\bigg).
\end{align*}
Proof of Lemma A.3. The proof uses the subsequence techniques employed to establish the asymptotic size of a test in Andrews, Cheng, and Guggenberger (2011); we adapt the proof and notation to the current setup and extend the results from tests to estimators.
We first show that
\[
\overline{\mathrm{AsyR}}_\zeta(\hat{\theta}) \equiv \limsup_{n\to\infty} \sup_{F\in\mathcal{F}} E_F[\ell_\zeta(\hat{\theta})] = \max\bigg\{\sup_{h\in H^*} R_\zeta(h),\ \sup_{h\in H^*_\infty} R_\zeta(h)\bigg\}. \tag{A.47}
\]
For this purpose, we first show that
\[
\overline{\mathrm{AsyR}}_\zeta(\hat{\theta}) \le \max\bigg\{\sup_{h\in H^*} R_\zeta(h),\ \sup_{h\in H^*_\infty} R_\zeta(h)\bigg\}. \tag{A.48}
\]
Let $\{F_n\}$ be a sequence such that
\[
\limsup_{n\to\infty} E_{F_n}[\ell_\zeta(\hat{\theta})] = \limsup_{n\to\infty}\Big(\sup_{F\in\mathcal{F}} E_F[\ell_\zeta(\hat{\theta})]\Big). \tag{A.49}
\]
Such a sequence always exists by the definition of the supremum. The sequence $\{E_{F_n}[\ell_\zeta(\hat{\theta})] : n \ge 1\}$ may not converge. Now let $\{w_n : n \ge 1\}$ be a subsequence of $\{n\}$ such that $\{E_{F_{w_n}}[\ell_\zeta(\hat{\theta})] : n \ge 1\}$ converges and its limit equals $\overline{\mathrm{AsyR}}_\zeta(\hat{\theta})$. Such a subsequence always exists by the definition of the limsup. Below we show that there exists a subsequence $\{p_n\}$ of $\{w_n\}$ such that
\[
E_{F_{p_n}}[\ell_\zeta(\hat{\theta})] \to R_\zeta(h) \text{ for some } h \in H^* \tag{A.50}
\]
or
\[
E_{F_{p_n}}[\ell_\zeta(\hat{\theta})] \to R_\zeta(h) \text{ for some } h \in H^*_\infty. \tag{A.51}
\]
Provided that (A.50) or (A.51) holds, we obtain the desired result in (A.48). To show that there exists a subsequence $\{p_n\}$ of $\{w_n\}$ such that either (A.50) or (A.51) holds, it suffices to show claims (1) and (2): (1) for any sequence $\{F_n\}$ and any subsequence $\{w_n\}$ of $\{n\}$, there exists a subsequence $\{p_n\}$ of $\{w_n\}$ for which
\[
p_n^{1/2}\delta_{p_n} \to d \in R^{r_\delta},\quad \delta_{p_n} \to 0,\quad v_{p_n}(\theta) \to v_0(\theta) \text{ such that } h \in H^*, \tag{A.52}
\]
or
\[
\|p_n^{1/2}\delta_{p_n}\| \to \infty,\quad \delta_{p_n} \to \delta_0,\quad \text{and } v_{p_n}(\theta) \to v_0(\theta) \text{ such that } h \in H^*_\infty; \tag{A.53}
\]
and (2) for any subsequence $\{p_n\}$ of $\{n\}$ and any sequence $\{F_{p_n} : n \ge 1\}$, (A.52) together with Condition (i) implies (A.50), and (A.53) combined with Condition (i) implies (A.51). Below we write $v(\theta)$ as $v$ for notational simplicity.
To show (1), let $\delta_{w_n,j}$ denote the $j$-th component of $\delta_{w_n}$ and set $p_{1,n} = w_n$ for all $n \ge 1$. For $j = 1$, either (i) $\limsup_{n\to\infty} |p_{j,n}^{1/2}\delta_{p_{j,n},j}| < \infty$ or (ii) $\limsup_{n\to\infty} |p_{j,n}^{1/2}\delta_{p_{j,n},j}| = \infty$. If (i) holds, then for some subsequence $\{p_{j+1,n}\}$ of $\{p_{j,n}\}$, $p_{j+1,n}^{1/2}\delta_{p_{j+1,n},j} \to d_j$ for some $d_j \in R$. If (ii) holds, then for some subsequence $\{p_{j+1,n}\}$ of $\{p_{j,n}\}$, $p_{j+1,n}^{1/2}\delta_{p_{j+1,n},j} \to \infty$ or $-\infty$. As $r_\delta$ is a fixed positive integer, we can apply the same argument successively for $j = 1, \ldots, r_\delta$ to obtain a subsequence $\{p^*_n\}$ of $\{w_n\}$ such that $(p^*_n)^{1/2}\delta_{p^*_n} \to d^* \in R^{r_\delta}$ or $\|(p^*_n)^{1/2}\delta_{p^*_n}\| \to \infty$. Finally, there exists a subsequence $\{p_n\}$ of $\{p^*_n\}$ such that $\delta_{p_n} \to \delta^*$ and $v_{p_n} \to v^*$ because $\Lambda$ is a compact set.
We have thus constructed a subsequence $\{p_n\}$ of $\{n\}$ such that either (i) $p_n^{1/2}\delta_{p_n} \to d^* \in R^{r_\delta}$, $\delta_{p_n} \to \delta^* = 0$, and $v_{p_n} \to v^*$, or (ii) $\|p_n^{1/2}\delta_{p_n}\| \to \infty$, $\delta_{p_n} \to \delta^*$, and $v_{p_n} \to v^*$. To conclude that (A.52) holds in case (i), it remains to show $h^* = (d^{*\prime}, \mathrm{vec}(G_2^*)', \mathrm{vech}(\Omega_2^*)') \in H^*$, where $G_2^*$ and $\Omega_2^*$ are defined with the limits $\delta^*$ and $v^*$. Similarly, to show that (A.53) holds in case (ii), it remains to show $h^* = (\infty, \mathrm{vec}(G_2^*)', \mathrm{vech}(\Omega_2^*)') \in H^*_\infty$. This step is necessary because $d^*$, $\delta^*$, and $v^*$ are limits along a subsequence, whereas $H^*$ and $H^*_\infty$ are defined using limits of the full sequence. To close this gap, we show that for the subsequence $\{p_n\}$ constructed above there exists a full sequence with the same limit. For case (i), such a full sequence of DGPs $\{F^*_k \in \mathcal{F} : k \ge 1\}$ can be constructed as follows.
First, consider the case where $d^* \in R^{r_\delta}$. (i) For every $k = p_n$, define $F^*_k = F_{p_n}$; and (ii) for every $k \in (p_n, p_{n+1})$, define $F^*_k$ to be a true distribution such that
\[
\delta_{F^*_k} = (p_n/k)^{1/2}\delta_{p_n} \quad\text{and}\quad v_{F^*_k} = v_{p_n}, \tag{A.54}
\]
where $\delta_{F^*_k}$ and $v_{F^*_k}$ denote the values of $\delta_0$ and $v_0$ under $F^*_k$. There exists $F^*_k \in \mathcal{F}$ for which (A.54) holds for large $n$ by Condition (iii). To see this, first note that $(\delta_{p_n}, v_{p_n}) \in \Lambda$ because $F_{p_n} \in \mathcal{F}$. Moreover, we have $p_n/k < 1$ and $\|\delta_{p_n}\| < \varepsilon$ for large $n$ because $\delta_{p_n} \to 0_{r_\delta}$. Condition (iii) then ensures the existence of $F^*_k$ for any $k \in (p_n, p_{n+1})$. Along this constructed sequence $\{F^*_k \in \mathcal{F} : k \ge 1\}$, we have $k^{1/2}\delta_{F^*_k} \to d^*$, $\delta_{F^*_k} \to \delta^* = 0$, and $v_{F^*_k} \to v^*$, as desired. This shows that $h^* \in H^*$ in case (i). For case (ii), define $F^*_k = F_{p_n}$ for $k \in [p_n, p_{n+1})$. Then $k^{1/2}\|\delta_{F^*_k}\| \ge (p_n)^{1/2}\|\delta_{p_n}\|$ for all $k \in [p_n, p_{n+1})$. In consequence, $(p_n)^{1/2}\|\delta_{p_n}\| \to \infty$ as $n \to \infty$ implies that $k^{1/2}\|\delta_{F^*_k}\| \to \infty$ as $k \to \infty$. In addition, $\delta_{F^*_k} \to \delta^*$ and $v_{F^*_k} \to v^*$ as $k \to \infty$. Hence, in case (ii), $h^* \in H^*_\infty$. Combining the results for cases (i) and (ii) completes the proof of claim (1).
To show (2), note that we have proved that for any subsequence $\{p_n\}$ of $\{n\}$ and any sequence $\{F_{p_n} : n \ge 1\}$ such that (A.52) holds, there exists a full sequence $\{F^*_k \in \mathcal{F} : k \ge 1\}$ such that $k^{1/2}\delta_{F^*_k} \to d^* \in R^{r_\delta}$, $\delta_{F^*_k} \to \delta^*$, $v_{F^*_k} \to v^*$, and $F^*_{p_n} = F_{p_n}$ for all $n \ge 1$. Similarly, if (A.53) holds, there exists a full sequence $\{F^*_k \in \mathcal{F} : k \ge 1\}$ such that $\|k^{1/2}\delta_{F^*_k}\| \to \infty$, $\delta_{F^*_k} \to \delta^*$, $v_{F^*_k} \to v^*$, and $F^*_{p_n} = F_{p_n}$ for all $n \ge 1$. This together with Condition (i) of the Lemma implies (2). This proves that either (A.50) or (A.51) holds, which in turn implies (A.48).
Next, we show that
\[
\overline{\mathrm{AsyR}}_\zeta(\hat{\theta}) \ge \max\bigg\{\sup_{h\in H^*} R_\zeta(h),\ \sup_{h\in H^*_\infty} R_\zeta(h)\bigg\}. \tag{A.55}
\]
For any $h = (d', \mathrm{vec}(G_2)', \mathrm{vech}(\Omega_2)') \in H^*$, there exists a sequence $\{F_n \in \mathcal{F} : n \ge 1\}$ such that $n^{1/2}\delta_n \to d$, $\delta_n \to 0$, and $v_n \to v = (\mathrm{vec}(G_2)', \mathrm{vech}(\Omega_2)', M_2')$. Moreover,
\[
\overline{\mathrm{AsyR}}_\zeta(\hat{\theta}) = \limsup_{n\to\infty} \sup_{F\in\mathcal{F}} E_F[\ell_\zeta(\hat{\theta})] \ge \limsup_{n\to\infty} E_{F_n}[\ell_\zeta(\hat{\theta})] = R_\zeta(h), \tag{A.56}
\]
where the last equality holds by Condition (i). Similarly, for any $h \in H^*_\infty$, there exists a sequence $\{F_n \in \mathcal{F} : n \ge 1\}$ such that $n^{1/2}\|\delta_n\| \to \infty$, $\delta_n \to \delta_0$, and $v_{F_n} \to v = (\mathrm{vec}(G_2)', \mathrm{vech}(\Omega_2)', M_2')$, which together with Condition (i) implies that
\[
\overline{\mathrm{AsyR}}_\zeta(\hat{\theta}) = \limsup_{n\to\infty} \sup_{F\in\mathcal{F}} E_F[\ell_\zeta(\hat{\theta})] \ge \limsup_{n\to\infty} E_{F_n}[\ell_\zeta(\hat{\theta})] = R_\zeta(h). \tag{A.57}
\]
(A.56) combined with (A.57) immediately yields (A.55). Finally, (A.47) is implied by (A.48) and (A.55). The desired result follows from (A.47) with $E_F[\ell_\zeta(\hat{\theta})]$ replaced by $E_F[\ell_\zeta(\hat{\theta}) - \ell_\zeta(\tilde{\theta})]$.

Proof of Theorem 5.1. We obtain the two equalities by applying Lemma A.3 with $\hat{\theta} = \hat{\theta}_{eo}$, $\tilde{\theta} = \hat{\theta}_1$, and $H = H^*$. We now verify the conditions in Lemma A.3 under Assumptions 3.1-3.3. Condition (i) holds with $R_\zeta(h) = \min\{E[\bar{\xi}'\Upsilon\bar{\xi}], \zeta\}$ and $\tilde{R}_\zeta(h) = \min\{E[\xi_1'\Upsilon\xi_1], \zeta\}$ for $d \in R^{r_\delta}$, and with $R_\zeta(h) = \min\{E[\xi_1'\Upsilon\xi_1], \zeta\}$ and $\tilde{R}_\zeta(h) = \min\{E[\xi_1'\Upsilon\xi_1], \zeta\}$ for $d = \infty$, following Lemma 4.3 and the CMT. The set $\Lambda$ is compact because (i) $V_F = \{v_0(\cdot) : F \in \mathcal{F}\}$ is compact, as shown in the proof of Lemma 4.1, (ii) $M_{2,F}(\theta)$ is continuous, and (iii) $\delta_n \to \delta_0$. This verifies Condition (ii). Condition (iii) is assumed in Assumption 3.3.(i).
In addition, we show $H = H^*$, where $H$ and $H^*$ are defined in (5.2) and (A.45), respectively. Let $\{F_n\}$ be a sequence as in (A.45). By Conditions (ii)-(iii) of Lemma A.3, Lemma 4.1 implies
\[
G_{2,F_n}(\theta) \to G_{2,F}(\theta),\quad \Omega_{2,F_n}(\theta) \to \Omega_{2,F}(\theta),\quad M_{2,F_n}(\theta) \to M_{2,F}(\theta) \tag{A.58}
\]
for some $F \in \mathcal{F}$. Therefore, $G_2 = G_{2,F}(\theta_0)$ and $\Omega_2 = \Omega_{2,F}(\theta_0)$ for some $F \in \mathcal{F}$ by definition, which implies that $H^* \subseteq H$.

To show $H \subseteq H^*$, we need to show that for all $d \in R^{r_\delta}$ and all $F \in \mathcal{F}$ with $\delta_0 = 0$, there exists a sequence $\{F_n \in \mathcal{F} : n \ge 1\}$ such that
\[
n^{1/2}\delta_n \to d,\quad G_{2,F_n}(\theta) \to G_{2,F}(\theta),\quad \Omega_{2,F_n}(\theta) \to \Omega_{2,F}(\theta),\quad \text{and}\quad M_{2,F_n}(\theta) \to M_{2,F}(\theta). \tag{A.59}
\]
Take $\delta_n = d/n^{1/2}$ for $n$ large enough that $\|\delta_n\| < \varepsilon$. Condition (iii) in Lemma A.3 then implies that $(\delta_n, v_0(\cdot)) \in \Lambda$ for all large $n$, where $v_0(\cdot)$ is defined in (3.2). By the definition of $\Lambda$, there exists a sequence $\{F_n\}$ such that $\delta_n = d/n^{1/2}$ and $v_n(\theta) = v_0(\theta)$, which shows (A.59) and hence $H \subseteq H^*$.
Next, we show that the inequalities hold for $\zeta$ large enough. As $\min\{\bar{\xi}'\Upsilon\bar{\xi}, \zeta\} \le \bar{\xi}'\Upsilon\bar{\xi}$,
\[
E\big[\min\{\bar{\xi}'\Upsilon\bar{\xi}, \zeta\}\big] \le E\big[\bar{\xi}'\Upsilon\bar{\xi}\big]. \tag{A.60}
\]
The expectation $E[\bar{\xi}'\Upsilon\bar{\xi}]$ is finite because
\begin{align*}
E\big[\bar{\xi}'\Upsilon\bar{\xi}\big]
&\le 2E\big[Z_{d,2}'\bar{\Gamma}_1'\Upsilon\bar{\Gamma}_1 Z_{d,2} + \bar{\omega}^2 Z_{d,2}'(\Gamma_2 - \bar{\Gamma}_1)'\Upsilon(\Gamma_2 - \bar{\Gamma}_1)Z_{d,2}\big] \\
&= 2E\bigg[Z_1'\Gamma_1'\Upsilon\Gamma_1 Z_1 + \frac{\mathrm{tr}(A)^2\, Z_{d,2}'BZ_{d,2}}{\big(Z_{d,2}'BZ_{d,2} + \mathrm{tr}(A)\big)^2}\bigg]
\le 2E\big[Z_1'\Gamma_1'\Upsilon\Gamma_1 Z_1\big] + 2\,\mathrm{tr}(A) \le C, \tag{A.61}
\end{align*}
where the first inequality is by the Cauchy-Schwarz inequality, the second inequality is by
\[
\frac{\mathrm{tr}(A)}{Z_{d,2}'BZ_{d,2} + \mathrm{tr}(A)} \le 1 \quad\text{and}\quad \frac{Z_{d,2}'BZ_{d,2}}{Z_{d,2}'BZ_{d,2} + \mathrm{tr}(A)} \le 1 \quad\text{with probability 1}, \tag{A.62}
\]
and the last inequality is by the regularity conditions in Assumption 3.2. Similarly, we also have
\[
E\big[\xi_1'\Upsilon\xi_1\big] \le C\,E\big[\|\Gamma_1 Z_1\|^2\big] \le C^2. \tag{A.63}
\]
We can write
\[
g(h) \equiv E\big[\bar{\xi}'\Upsilon\bar{\xi}\big] - E\big[\xi_1'\Upsilon\xi_1\big] = 2\,\mathrm{tr}(A)J_1 + \mathrm{tr}(A)^2 J_2,
\]
where
\[
J_1 = E\bigg[\frac{Z_{d,2}'DZ_{d,2}}{Z_{d,2}'BZ_{d,2} + \mathrm{tr}(A)}\bigg] \quad\text{and}\quad J_2 = E\bigg[\frac{Z_{d,2}'BZ_{d,2}}{\big(Z_{d,2}'BZ_{d,2} + \mathrm{tr}(A)\big)^2}\bigg]. \tag{A.64}
\]
From the definitions of $g_\zeta(h)$ and $g(h)$, we use (A.60) to deduce that
\begin{align*}
g_\zeta(h) &= E\big[\min\{\bar{\xi}'\Upsilon\bar{\xi}, \zeta\}\big] - E\big[\min\{\xi_1'\Upsilon\xi_1, \zeta\}\big]
\le E\big[\bar{\xi}'\Upsilon\bar{\xi}\big] - E\big[\min\{\xi_1'\Upsilon\xi_1, \zeta\}\big] \\
&= E\big[\xi_1'\Upsilon\xi_1 - \min\{\xi_1'\Upsilon\xi_1, \zeta\}\big] + g(h) \quad\text{for any } h \in H. \tag{A.65}
\end{align*}
Next we show that
\[
\lim_{\zeta\to\infty} \sup_{h\in H} \big\{E\big[\xi_1'\Upsilon\xi_1 - \min\{\xi_1'\Upsilon\xi_1, \zeta\}\big]\big\} = 0. \tag{A.66}
\]
Recall that $\xi_1$ is a function of $G_1$ and $\Omega_1$. Define
\[
q(Z; G_1, \Omega_1) \equiv Z'\Omega_1^{1/2}\Gamma_1'\Upsilon\Gamma_1\Omega_1^{1/2}Z, \quad\text{where } Z \sim N(0_{r_1}, I_{r_1\times r_1}). \tag{A.67}
\]
Then we can write
\[
f_\zeta(G_1, \Omega_1) \equiv E\big[\xi_1'\Upsilon\xi_1 - \min\{\xi_1'\Upsilon\xi_1, \zeta\}\big] = E\big[q(Z; G_1, \Omega_1) - \min\{q(Z; G_1, \Omega_1), \zeta\}\big]. \tag{A.68}
\]
Let
\[
\Lambda_1 = \{(G_1, \Omega_1) : G_{1,F_n} \to G_1 \text{ and } \Omega_{1,F_n} \to \Omega_1 \text{ for some } \{F_n\}\}. \tag{A.69}
\]
We now have
\[
\lim_{\zeta\to\infty} \sup_{h\in H} f_\zeta(G_1, \Omega_1) \le \lim_{\zeta\to\infty} \sup_{(G_1, \Omega_1)\in\Lambda_1} f_\zeta(G_1, \Omega_1) \tag{A.70}
\]
because $h \in H$ requires the convergence listed in (A.69) as well as the convergence of some other functions.
We next show that $\lim_{\zeta\to\infty} \sup_{(G_1, \Omega_1)\in\Lambda_1} f_\zeta(G_1, \Omega_1) = 0$. First, $\lim_{\zeta\to\infty} f_\zeta(G_1, \Omega_1) = 0$ for any $(G_1, \Omega_1) \in \Lambda_1$ by the DCT, because
\[
0 \le q(Z; G_1, \Omega_1) - \min\{q(Z; G_1, \Omega_1), \zeta\} \le q(Z; G_1, \Omega_1) \tag{A.71}
\]
and $E[q(Z; G_1, \Omega_1)] = \mathrm{tr}(\Upsilon\Sigma_1) \le C$. Second, this convergence is uniform over $(G_1, \Omega_1) \in \Lambda_1$ by Dini's theorem (see Rudin (1976)) because (i) $f_\zeta(G_1, \Omega_1)$ is monotonically decreasing in $\zeta$; (ii) $\Lambda_1$ is compact; and (iii) $f_\zeta(G_1, \Omega_1)$ is continuous in $(G_1, \Omega_1)$. The continuity of $f_\zeta(G_1, \Omega_1)$ in $(G_1, \Omega_1)$ follows from the DCT because (a) $q(Z; G_1, \Omega_1)$ is continuous in $(G_1, \Omega_1)$ and (b) $E[\sup_{(G_1, \Omega_1)\in\Lambda_1} q(Z; G_1, \Omega_1)] < \infty$. To see (b), note that
\[
\sup_{(G_1, \Omega_1)\in\Lambda_1} q(Z; G_1, \Omega_1) \le \Big[\sup_{(G_1, \Omega_1)\in\Lambda_1} \lambda_{\max}\big(\Omega_1^{1/2}\Gamma_1'\Upsilon\Gamma_1\Omega_1^{1/2}\big)\Big] Z'Z \le C\,Z'Z \tag{A.72}
\]
by Assumption 3.2. This completes the verification of (A.66). It follows from (A.66) that
\[
\lim_{\zeta\to\infty} \sup_{h\in H} g_\zeta(h) \le \sup_{h\in H} g(h) \quad\text{and}\quad \lim_{\zeta\to\infty} \inf_{h\in H} g_\zeta(h) \le \inf_{h\in H} g(h), \tag{A.73}
\]
which proves the inequalities in Theorem 5.1.
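The truncation step behind (A.60)-(A.66) — that the truncated risk approaches the untruncated risk as $\zeta \to \infty$ — can be illustrated by simulation. The quadratic-form weight matrix below is an arbitrary positive-definite choice, not one from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# q(Z) is a positive quadratic form with E[q(Z)] = tr(W) < infinity; the gap
# E[q - min(q, zeta)] = E[(q - zeta)^+] is pathwise decreasing in zeta and
# vanishes as zeta grows, mirroring (A.66) and the domination bound (A.71).
W = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.2],
              [0.0, 0.2, 1.5]])     # arbitrary positive-definite weight
Z = rng.standard_normal((500_000, 3))
q = np.einsum('ij,jk,ik->i', Z, W, Z)

gaps = [np.mean(q - np.minimum(q, zeta)) for zeta in (1.0, 10.0, 100.0)]
print(gaps)  # decreasing toward zero as zeta grows
```

The monotonicity in $\zeta$ holds draw by draw, which is exactly the property that Dini's theorem exploits above.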
Proof of Theorem 5.2. We provide an upper bound for $J_1$ defined in (A.64). Let
\[
\eta(x) = \frac{x}{x'Bx + \mathrm{tr}(A)}, \quad\text{where } x = Z_{d,2} \text{ and } B = (\Gamma_2 - \bar{\Gamma}_1)'\Upsilon(\Gamma_2 - \bar{\Gamma}_1). \tag{A.74}
\]
Its derivative is
\[
\frac{\partial \eta(x)'}{\partial x} = \frac{1}{x'Bx + \mathrm{tr}(A)}\, I_{r_2} - \frac{2}{\big(x'Bx + \mathrm{tr}(A)\big)^2}\, Bxx'. \tag{A.75}
\]
Define
\[
D = (\Gamma_2 - \bar{\Gamma}_1)'\Upsilon\bar{\Gamma}_1, \tag{A.76}
\]
which satisfies $DZ_{d,2} = DZ_2$ by construction because $Dd_0 = (\Gamma_2 - \bar{\Gamma}_1)'\Upsilon\bar{\Gamma}_1 d_0 = 0$ by Lemma A.2(a). Applying Lemma A.2 yields
\[
\mathrm{tr}(D\Omega_2) = \mathrm{tr}\big((\Gamma_2 - \bar{\Gamma}_1)'\Upsilon\bar{\Gamma}_1\Omega_2\big) = \mathrm{tr}\big(\Upsilon(\bar{\Gamma}_1\Omega_2\Gamma_2' - \bar{\Gamma}_1\Omega_2\bar{\Gamma}_1')\big) = \mathrm{tr}\big(\Upsilon(\Sigma_2 - \Sigma_1)\big) = -\mathrm{tr}(A). \tag{A.77}
\]
By Lemma 1 of Hansen (2016), which is a matrix version of Stein's Lemma (Stein, 1981),
\[
J_1 = E\big[\eta(Z_{d,2})'DZ_{d,2}\big] = E\big[\eta(Z_{d,2})'DZ_2\big] = E\bigg[\mathrm{tr}\bigg(\frac{\partial \eta(Z_{d,2})'}{\partial x}\, D\Omega_2\bigg)\bigg]. \tag{A.78}
\]
Plugging (A.74)-(A.76) into (A.78), we have
\begin{align*}
J_1 &= E\bigg[\frac{\mathrm{tr}(D\Omega_2)}{Z_{d,2}'BZ_{d,2} + \mathrm{tr}(A)}\bigg] - 2E\bigg[\frac{\mathrm{tr}\big(BZ_{d,2}Z_{d,2}'D\Omega_2\big)}{\big(Z_{d,2}'BZ_{d,2} + \mathrm{tr}(A)\big)^2}\bigg] \\
&= E\bigg[\frac{-\mathrm{tr}(A)}{Z_{d,2}'BZ_{d,2} + \mathrm{tr}(A)}\bigg] + 2E\bigg[\frac{-Z_{d,2}'D\Omega_2 BZ_{d,2}}{\big(Z_{d,2}'BZ_{d,2} + \mathrm{tr}(A)\big)^2}\bigg], \tag{A.79}
\end{align*}
where the second equality is by (A.77). By definition,
\begin{align*}
-Z_{d,2}'D\Omega_2 BZ_{d,2} &= -Z_{d,2}'(\Gamma_2 - \bar{\Gamma}_1)'\Upsilon\bar{\Gamma}_1\Omega_2(\Gamma_2 - \bar{\Gamma}_1)'\Upsilon(\Gamma_2 - \bar{\Gamma}_1)Z_{d,2} \\
&= Z_{d,2}'(\Gamma_2 - \bar{\Gamma}_1)'\Upsilon(\Sigma_1 - \Sigma_2)\Upsilon(\Gamma_2 - \bar{\Gamma}_1)Z_{d,2} \\
&\le \lambda_{\max}\big(\Upsilon^{1/2}(\Sigma_1 - \Sigma_2)\Upsilon^{1/2}\big)\big(Z_{d,2}'(\Gamma_2 - \bar{\Gamma}_1)'\Upsilon(\Gamma_2 - \bar{\Gamma}_1)Z_{d,2}\big) \\
&= \lambda_{\max}(A)\, Z_{d,2}'BZ_{d,2}, \tag{A.80}
\end{align*}
where the last equality is by $\lambda_{\max}\big(\Upsilon^{1/2}(\Sigma_1 - \Sigma_2)\Upsilon^{1/2}\big) = \lambda_{\max}\big(\Upsilon(\Sigma_1 - \Sigma_2)\big) = \lambda_{\max}(A)$. Combining the results in (A.79) and (A.80), we get
\begin{align*}
J_1 &\le E\bigg[\frac{-\mathrm{tr}(A)}{Z_{d,2}'BZ_{d,2} + \mathrm{tr}(A)}\bigg] + 2E\bigg[\frac{\lambda_{\max}(A)\, Z_{d,2}'BZ_{d,2}}{\big(Z_{d,2}'BZ_{d,2} + \mathrm{tr}(A)\big)^2}\bigg] \\
&= E\bigg[\frac{-\mathrm{tr}(A)}{Z_{d,2}'BZ_{d,2} + \mathrm{tr}(A)}\bigg] + 2E\bigg[\frac{\big(Z_{d,2}'BZ_{d,2} + \mathrm{tr}(A)\big)\lambda_{\max}(A) - \mathrm{tr}(A)\lambda_{\max}(A)}{\big(Z_{d,2}'BZ_{d,2} + \mathrm{tr}(A)\big)^2}\bigg] \\
&= E\bigg[\frac{2\lambda_{\max}(A) - \mathrm{tr}(A)}{Z_{d,2}'BZ_{d,2} + \mathrm{tr}(A)}\bigg] - E\bigg[\frac{2\lambda_{\max}(A)\,\mathrm{tr}(A)}{\big(Z_{d,2}'BZ_{d,2} + \mathrm{tr}(A)\big)^2}\bigg]. \tag{A.81}
\end{align*}
Next, note that
\begin{align*}
J_2 &= E\bigg[\frac{Z_{d,2}'BZ_{d,2}}{\big(Z_{d,2}'BZ_{d,2} + \mathrm{tr}(A)\big)^2}\bigg] = E\bigg[\frac{Z_{d,2}'BZ_{d,2} + \mathrm{tr}(A) - \mathrm{tr}(A)}{\big(Z_{d,2}'BZ_{d,2} + \mathrm{tr}(A)\big)^2}\bigg] \\
&= E\bigg[\frac{1}{Z_{d,2}'BZ_{d,2} + \mathrm{tr}(A)}\bigg] - E\bigg[\frac{\mathrm{tr}(A)}{\big(Z_{d,2}'BZ_{d,2} + \mathrm{tr}(A)\big)^2}\bigg]. \tag{A.82}
\end{align*}
Combining (A.81) and (A.82), we obtain
\begin{align*}
g(h) &= 2\,\mathrm{tr}(A)J_1 + \mathrm{tr}(A)^2 J_2 \\
&\le 2\,\mathrm{tr}(A)\bigg(E\bigg[\frac{2\lambda_{\max}(A) - \mathrm{tr}(A)}{Z_{d,2}'BZ_{d,2} + \mathrm{tr}(A)}\bigg] - E\bigg[\frac{2\,\mathrm{tr}(A)\lambda_{\max}(A)}{\big(Z_{d,2}'BZ_{d,2} + \mathrm{tr}(A)\big)^2}\bigg]\bigg) \\
&\quad + \mathrm{tr}(A)^2\bigg(E\bigg[\frac{1}{Z_{d,2}'BZ_{d,2} + \mathrm{tr}(A)}\bigg] - E\bigg[\frac{\mathrm{tr}(A)}{\big(Z_{d,2}'BZ_{d,2} + \mathrm{tr}(A)\big)^2}\bigg]\bigg) \\
&= E\bigg[\frac{\mathrm{tr}(A)\big(4\lambda_{\max}(A) - \mathrm{tr}(A)\big)}{Z_{d,2}'BZ_{d,2} + \mathrm{tr}(A)}\bigg] - E\bigg[\frac{\mathrm{tr}(A)^2\big(4\lambda_{\max}(A) + \mathrm{tr}(A)\big)}{\big(Z_{d,2}'BZ_{d,2} + \mathrm{tr}(A)\big)^2}\bigg]. \tag{A.83}
\end{align*}
For all $G_2$ and $\Omega_2$ such that $h = (d', \mathrm{vec}(G_2)', \mathrm{vech}(\Omega_2)') \in H$, we have $G_2 = G_{2,F}$ and $\Omega_2 = \Omega_{2,F}$ for some $F \in \mathcal{F}$ because $V$ is compact, which was verified above. Hence, we obtain the desired result if $\mathrm{tr}(A) > 0$ and $4\lambda_{\max}(A) - \mathrm{tr}(A) \le 0$ for all $F \in \mathcal{F}$.
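The matrix Stein identity applied in (A.78) can be verified by simulation for the specific $\eta$ in (A.74). The sketch below checks $E[\eta(Z_2 + d_0)'DZ_2] = E[\mathrm{tr}(\partial\eta(Z_{d,2})'/\partial x \cdot D\Omega_2)]$ using the derivative in (A.75), with small arbitrary matrices standing in for $B$, $D$, $\Omega_2$, and $d_0$ (illustrative choices, not the paper's objects):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical ingredients: Omega is the covariance of Z_2, B is positive
# semi-definite as in (A.74), and c plays the role of tr(A) > 0.
Omega = np.array([[1.0, 0.4], [0.4, 1.5]])
B = np.array([[0.5, 0.1], [0.1, 0.3]])
D = np.array([[0.2, -0.1], [0.3, 0.4]])
d0 = np.array([0.7, -0.2])
C = 1.0

L = np.linalg.cholesky(Omega)
Z = (L @ rng.standard_normal((2, 400_000))).T   # Z_2 ~ N(0, Omega)
X = Z + d0                                      # X plays the role of Z_{d,2}
q = np.einsum('ij,jk,ik->i', X, B, X) + C       # x'Bx + tr(A)
eta = X / q[:, None]                            # eta(x) in (A.74)

# Left side of the identity: E[eta(X)' D Z_2]
lhs = np.mean(np.einsum('ij,jk,ik->i', eta, D, Z))

# Right side: E[tr(deta(x)'/dx * D Omega)], with (A.75) giving the trace
# tr(D Omega)/q - 2 x' D Omega B x / q^2 for each draw.
DOm = D @ Omega
rhs = np.mean(np.trace(DOm) / q
              - 2 * np.einsum('ij,jk,ik->i', X, DOm, X @ B.T) / q**2)

print(lhs, rhs)  # two Monte Carlo estimates of the same quantity
```

The two averages agree up to simulation error, which is the content of the Stein-type step that converts $J_1$ into an expression amenable to the bound in (A.80).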
A.3 Asymptotic risk of the pre-test GMM estimator with the J-test statistic

To simulate the asymptotic risk of the pre-test estimator in Figure 2, we consider the asymptotic risk under (i) $n^{1/2}\delta_n \to d \in R^{r_\delta}$ and (ii) $\|n^{1/2}\delta_n\| \to \infty$. We consider case (i) first. By (A.17) and (A.20),
\begin{align*}
\bar{g}_2(\hat{\theta}_2) &= \bar{g}_2(\theta_n) + \big[G_{2,F_n}(\theta_n) + o_p(1)\big](\hat{\theta}_2 - \theta_n) + o_p(n^{-1/2}) \\
&= \bar{g}_2(\theta_n) + G_{2,F_n}\Gamma_{2,n}\bar{g}_2(\theta_n) + o_p(n^{-1/2}) \\
&= (I_{r_2} + G_{2,F_n}\Gamma_{2,n})\bar{g}_2(\theta_n) + o_p(n^{-1/2}), \tag{A.84}
\end{align*}
which together with (A.39), $G_{2,F_n} = G_2 + o(1)$ and $\Gamma_{2,n} = \Gamma_2 + o(1)$ implies that
\begin{align*}
J_n &\equiv n\,\bar{g}_2(\hat{\theta}_2)'(\hat{\Omega}_2)^{-1}\bar{g}_2(\hat{\theta}_2) \tag{A.85} \\
&= n\,\bar{g}_2(\theta_n)'\big[\Omega_2^{-1} - \Omega_2^{-1}G_2(G_2'\Omega_2^{-1}G_2)^{-1}G_2'\Omega_2^{-1}\big]\bar{g}_2(\theta_n) + o_p(1).
\end{align*}
By Lemma A.1(v), $n^{1/2}\delta_n \to d \in R^{r_\delta}$ and the Slutsky Theorem, $n^{-1/2}\sum_{i=1}^{n} g_2(W_i, \theta_n) \to_d Z_2 + d_0 = Z_{d,2}$, which combined with the CMT implies that
\[
J_n \to_d Z_{d,2}'\big[\Omega_2^{-1} - \Omega_2^{-1}G_2(G_2'\Omega_2^{-1}G_2)^{-1}G_2'\Omega_2^{-1}\big]Z_{d,2} \equiv J_\infty(h) \tag{A.86}
\]
under $n^{1/2}\delta_n \to d \in R^{r_\delta}$.

Let $\alpha$ be a pre-specified significance level. The critical value $c_\alpha$ is the $1-\alpha$ quantile of the chi-square distribution with $r_2 - d_\theta$ degrees of freedom, which is the asymptotic distribution of $J_n$ when $d_0 = 0_{r_2}$. The pre-test estimator based on $J_n$ can be written as
\[
\hat{\theta}_p = (1 - \tilde{\omega}_{\alpha,p})\hat{\theta}_1 + \tilde{\omega}_{\alpha,p}\hat{\theta}_2 \quad\text{with}\quad \tilde{\omega}_{\alpha,p} = I\{J_n < c_\alpha\}, \quad\text{where}
\]
\[
\tilde{\omega}_{\alpha,p} \to_D I\{J_\infty(G_2, \Omega_2, d_0) < c_\alpha\} \equiv \tilde{\omega}_{\alpha,\infty} \quad\text{and}\quad
n^{1/2}(\hat{\theta}_p - \theta_n) = n^{1/2}(\hat{\theta}_1 - \theta_n) + \tilde{\omega}_{\alpha,p}\, n^{1/2}(\hat{\theta}_2 - \hat{\theta}_1) \to_D \bar{\Gamma}_1 Z_{d,2} + \tilde{\omega}_{\alpha,\infty}(\Gamma_2 - \bar{\Gamma}_1)Z_{d,2} \tag{A.87}
\]
under $n^{1/2}\delta_n \to d$. Furthermore, by the CMT,
\[
\ell(\hat{\theta}_p) = n(\hat{\theta}_p - \theta_n)'\Upsilon(\hat{\theta}_p - \theta_n) \to_D \tilde{\xi}_{\alpha,\infty}, \quad\text{where}
\]
\begin{align*}
\tilde{\xi}_{\alpha,\infty} &= Z_{d,2}'\bar{\Gamma}_1'\Upsilon\bar{\Gamma}_1 Z_{d,2} + 2\tilde{\omega}_{\alpha,\infty}Z_{d,2}'(\Gamma_2 - \bar{\Gamma}_1)'\Upsilon\bar{\Gamma}_1 Z_{d,2} \\
&\quad + \tilde{\omega}_{\alpha,\infty}^2 Z_{d,2}'(\Gamma_2 - \bar{\Gamma}_1)'\Upsilon(\Gamma_2 - \bar{\Gamma}_1)Z_{d,2}. \tag{A.88}
\end{align*}
Its expectation is
\begin{align*}
E[\tilde{\xi}_{\alpha,\infty}] &= \mathrm{tr}(\Upsilon\Sigma_1) + 2E\big[\tilde{\omega}_{\alpha,\infty}Z_{d,2}'(\Gamma_2 - \bar{\Gamma}_1)'\Upsilon\bar{\Gamma}_1 Z_{d,2}\big] \\
&\quad + E\big[\tilde{\omega}_{\alpha,\infty}Z_{d,2}'(\Gamma_2 - \bar{\Gamma}_1)'\Upsilon(\Gamma_2 - \bar{\Gamma}_1)Z_{d,2}\big]. \tag{A.89}
\end{align*}
This verifies Condition (i) in Lemma A.3 for the pre-test estimator, with $R_\zeta(h) = \min\{E[\tilde{\xi}_{\alpha,\infty}], \zeta\}$. For $\|n^{1/2}\delta_n\| \to \infty$, the J-test is consistent and the pre-test estimator equals the conservative GMM estimator w.p.a.1.

The asymptotic risk of the pre-test estimator $\hat{\theta}_p$ in Figure 2 is simulated based on the formula in (A.89).
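A minimal sketch of that simulation, under a hypothetical toy design (one parameter, one valid moment plus one suspect moment, identity $\Upsilon$; the matrices below are illustrative choices rather than the paper's calibration), draws $Z_{d,2}$, applies the limiting J-test indicator $\tilde{\omega}_{\alpha,\infty} = I\{J_\infty < c_\alpha\}$, and averages the limiting loss:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy design (assumed, not from the paper): d_theta = 1, r_2 = 2 moments,
# the second moment is the suspect one with local drift d_0 = (0, d)'.
G2 = np.array([[1.0], [0.8]])
Omega2 = np.array([[1.0, 0.3], [0.3, 1.0]])
G1, Omega1 = G2[:1, :], Omega2[:1, :1]
Upsilon = np.eye(1)

def Gamma(G, Om):
    # Gamma_k = -(G_k' Om_k^{-1} G_k)^{-1} G_k' Om_k^{-1}
    Oi = np.linalg.inv(Om)
    return -np.linalg.solve(G.T @ Oi @ G, G.T @ Oi)

Gam2 = Gamma(G2, Omega2)                                   # 1 x 2
Gam1bar = np.hstack([Gamma(G1, Omega1), np.zeros((1, 1))]) # [Gamma_1, 0]

Oi2 = np.linalg.inv(Omega2)
P = Oi2 - Oi2 @ G2 @ np.linalg.solve(G2.T @ Oi2 @ G2, G2.T @ Oi2)
c_alpha = 3.841   # chi-square(r_2 - d_theta = 1) critical value at alpha = 0.05

def pretest_risk(d0, reps=200_000):
    L = np.linalg.cholesky(Omega2)
    Zd2 = (L @ rng.standard_normal((2, reps))).T + d0      # Z_{d,2} = Z_2 + d_0
    J = np.einsum('ij,jk,ik->i', Zd2, P, Zd2)              # J_infty(h) as in (A.86)
    w = (J < c_alpha).astype(float)                        # omega_tilde as in (A.87)
    xi = Zd2 @ Gam1bar.T + w[:, None] * (Zd2 @ (Gam2 - Gam1bar).T)
    return np.mean(np.einsum('ij,jk,ik->i', xi, Upsilon, xi))

print(pretest_risk(np.array([0.0, 0.0])))  # no misspecification
print(pretest_risk(np.array([0.0, 5.0])))  # large drift: near the conservative risk
```

In this design the conservative risk is $\mathrm{tr}(\Upsilon\Sigma_1) = 1$; under a large drift the J-test rejects with high probability, so the simulated pre-test risk is close to that value, consistent with the w.p.a.1 statement above.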