An Averaging GMM Estimator Robust to Misspecification*
Xu Cheng†  Zhipeng Liao‡  Ruoyao Shi§

First Draft: August 2013; This Version: April 2016
Abstract
This paper studies the averaging GMM estimator that combines a conservative GMM estimator based on valid moment conditions and an aggressive GMM estimator based on both valid and possibly misspecified moment conditions, where the weight is the sample analog of an infeasible optimal weight. We establish asymptotic theory on uniform approximation of the upper and lower bounds of the finite-sample risk difference between the two estimators, which is used to show that the averaging estimator uniformly dominates the conservative estimator by reducing the risk under any degree of misspecification. Extending seminal results on the James-Stein estimator, the uniform dominance is established in non-Gaussian semiparametric nonlinear models. The simulation results support our theoretical findings. The proposed averaging estimator is applied to estimate the human capital production function in a life-cycle labor supply model.

Keywords: Asymptotic Risk, Finite-Sample Risk, Generalized Shrinkage Estimator, GMM, Misspecification, Model Averaging, Non-Standard Estimator, Uniform Approximation
*We thank Donald Andrews, Denis Chetverikov, Patrik Guggenberger, Jinyong Hahn, Jia Li, Rosa Matzkin, Hyungsik Roger Moon, Ulrich Mueller, Joris Pinkse, Frank Schorfheide, Shuyang Sheng, and seminar participants at Brown University, Duke University, University of Pennsylvania, Pennsylvania State University, University of California Los Angeles, Yale University, and the 2014 New York Area Econometrics Colloquium for helpful comments.
†Department of Economics, University of Pennsylvania, 3718 Locust Walk, Philadelphia, PA 19104, USA. Email: [email protected]
‡Department of Economics, UCLA, 8379 Bunche Hall, Mail Stop: 147703, Los Angeles, CA 90095. Email: [email protected]
§Department of Economics, UCLA, Mail Stop: 147703, Los Angeles, CA 90095. Email: [email protected]
1 Introduction
The parameter of interest $\theta_0 \in \mathbb{R}^{d_\theta}$ satisfies the moment conditions
$$E_F[g_1(W,\theta_0)] = 0 \quad (1.1)$$
for some known function $g_1(\cdot,\cdot): \mathcal{W}\times\Theta \to \mathbb{R}^{r_1}$, where $\Theta$ is a compact subset of $\mathbb{R}^{d_\theta}$, $W$ is a random vector with support $\mathcal{W}$ and joint distribution $F$, and $E_F[\cdot]$ denotes the expectation operator under $F$. Suppose we have i.i.d. data $\{W_i\}_{i=1}^n$, where $W_i$ has distribution $F$ for any $i=1,\ldots,n$.¹ Let $\overline{g}_1(\theta) = n^{-1}\sum_{i=1}^n g_1(W_i,\theta)$. The efficient GMM estimator of $\theta_0$ is
$$\widehat{\theta}_1 \equiv \arg\min_{\theta\in\Theta}\ \overline{g}_1(\theta)'(\widehat{\Omega}_1)^{-1}\overline{g}_1(\theta), \quad (1.2)$$
where $\widehat{\Omega}_1 = n^{-1}\sum_{i=1}^n g_1(W_i,\widetilde{\theta}_1)g_1(W_i,\widetilde{\theta}_1)' - \overline{g}_1(\widetilde{\theta}_1)\overline{g}_1(\widetilde{\theta}_1)'$ is the efficient weighting matrix based on some preliminary consistent estimator $\widetilde{\theta}_1$.² In a linear instrumental variable (IV) example, $Y_i = X_i'\theta_0 + U_i$, where the IV $Z_{1,i}\in\mathbb{R}^{r_1}$ satisfies $E_F[Z_{1,i}U_i]=0$. The moments in (1.1) hold with $g_1(W_i,\theta_0) = Z_{1,i}(Y_i - X_i'\theta_0)$, and $\theta_0$ is identified if $E_F[Z_{1,i}X_i']$ has full column rank. Under certain regularity conditions, it is well known that $\widehat{\theta}_1$ is consistent and achieves the lowest asymptotic variance among GMM estimators based on the moments in (1.1); see Hansen (1982).
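To fix ideas, the two-step construction in (1.2) can be sketched in the linear IV example. The following is a minimal numpy illustration under our own simulation design, not code from the paper; the function name `gmm_linear_iv` and all tuning choices are ours.

```python
import numpy as np

def gmm_linear_iv(Y, X, Z):
    """Two-step efficient GMM for Y = X'theta + U with instruments Z.

    Step 1: identity-weighted GMM gives a preliminary consistent estimate.
    Step 2: re-estimates with the efficient weighting matrix built from
    the centered second moment of the step-1 moment contributions.
    """
    n = len(Y)
    ZX, ZY = Z.T @ X / n, Z.T @ Y / n
    # Step 1: preliminary estimator with identity weighting matrix
    theta_tilde = np.linalg.solve(ZX.T @ ZX, ZX.T @ ZY)
    # Efficient weighting matrix: Omega_hat as in the display below (1.2)
    G = Z * (Y - X @ theta_tilde)[:, None]      # g1(W_i, theta_tilde), row by row
    Omega = G.T @ G / n - np.outer(G.mean(0), G.mean(0))
    # Step 2: efficient GMM (closed form in the linear model)
    W = np.linalg.inv(Omega)
    return np.linalg.solve(ZX.T @ W @ ZX, ZX.T @ W @ ZY)
```

In the just-identified or homoskedastic case the second step reproduces 2SLS; the gain from the efficient weighting matrix appears under heteroskedasticity.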
If one has additional moments $E_F[g^*(W_i,\theta_0)]=0$ for some known function $g^*(\cdot,\cdot): \mathcal{W}\times\Theta\to\mathbb{R}^{r^*}$, imposing them together with (1.1) can further reduce the asymptotic variance of the GMM estimator. However, if these additional moments are misspecified in the sense that $E_F[g^*(W_i,\theta_0)]\neq 0$, imposing $E_F[g^*(W_i,\theta_0)]=0$ may result in inconsistent estimation. The choice of moment conditions is routinely faced by empirical researchers. Take the linear IV model as an example. One typically starts with a large number of candidate IVs but is confident that only a small number of them, denoted by $Z_i$, are valid. The rest, denoted by $Z_i^*$, are valid only under certain economic hypotheses that are yet to be tested. In this example, $g^*(W_i,\theta_0) = Z_i^*(Y_i - X_i'\theta_0)$. In contrast to the conservative estimator $\widehat{\theta}_1$, an aggressive estimator $\widehat{\theta}_2$ always imposes $E_F[g^*(W_i,\theta_0)]=0$ regardless of its validity. Let $g_2(W_i,\theta)=(g_1(W_i,\theta)', g^*(W_i,\theta)')'$ and $\overline{g}_2(\theta)=n^{-1}\sum_{i=1}^n g_2(W_i,\theta)$. The aggressive estimator $\widehat{\theta}_2$ takes the form
$$\widehat{\theta}_2 \equiv \arg\min_{\theta\in\Theta}\ \overline{g}_2(\theta)'(\widehat{\Omega}_2)^{-1}\overline{g}_2(\theta), \quad (1.3)$$
¹The main theory of the paper can be easily extended to time series models with dependent data, as long as the preliminary results in Lemma A.1 hold.
²For example, $\widetilde{\theta}_1$ could be the GMM estimator similar to $\widehat{\theta}_1$ but with an identity weighting matrix; see (A.5) in the Appendix.
where $\widehat{\Omega}_2$ takes the same form as $\widehat{\Omega}_1$ except that $g_1(W_i,\theta)$ is replaced by $g_2(W_i,\theta)$.³
Because imposing $E_F[g^*(W_i,\theta_0)]=0$ is a double-edged sword, a data-dependent decision is usually made to choose between $\widehat{\theta}_1$ and $\widehat{\theta}_2$. To study such a decision and the subsequent estimator, let
$$\delta_0 \equiv E_F[g^*(W_i,\theta_0)] \in \mathbb{R}^{r^*}. \quad (1.4)$$
The pre-testing approach tests the null hypothesis $H_0: \delta_0 = 0$ and constructs an estimator
$$\widehat{\theta}_{\mathrm{pre}} \equiv 1\{T_n > c_\alpha\}\widehat{\theta}_1 + 1\{T_n \le c_\alpha\}\widehat{\theta}_2 \quad (1.5)$$
for some test statistic $T_n$ with critical value $c_\alpha$ at significance level $\alpha$. One popular test is the J-test (see Hansen, 1982), where $c_\alpha$ is the $1-\alpha$ quantile of the chi-squared distribution with $r_1 + r^* - d_\theta$ degrees of freedom. Because the power of this test against most fixed alternatives is 1, $\widehat{\theta}_{\mathrm{pre}}$ equals $\widehat{\theta}_1$ with probability approaching 1 ($n\to\infty$) for these fixed misspecified models where $\delta_0 \neq 0$. Thus, it seems that $\widehat{\theta}_{\mathrm{pre}}$ is immune to moment misspecification. However, in practice we care about the finite-sample mean squared error (MSE) of $\widehat{\theta}_{\mathrm{pre}}$, and this standard pointwise asymptotic analysis ($\delta_0$ fixed and $n\to\infty$) provides a poor approximation to it.⁴ To compare $\widehat{\theta}_{\mathrm{pre}}$ and $\widehat{\theta}_1$, the dashed line in Figure 1 plots the finite-sample ($n=250$) MSE of $\widehat{\theta}_{\mathrm{pre}}$, with the MSE of $\widehat{\theta}_1$ normalized to 1. For some values of $\delta_0$, the MSE of $\widehat{\theta}_{\mathrm{pre}}$ is larger than that of $\widehat{\theta}_1$, sometimes by 40%.
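The pre-test rule in (1.5) is mechanically simple once the two estimators and the J statistic are in hand. A schematic sketch follows; the helper names are ours, and the critical value is passed in directly (it is the $1-\alpha$ chi-squared quantile with $r_1 + r^* - d_\theta$ degrees of freedom).

```python
import numpy as np

def j_statistic(gbar, Omega, n):
    """Hansen's J statistic: n * gbar' Omega^{-1} gbar, where gbar is the
    sample mean of the stacked moments evaluated at the aggressive estimate."""
    return float(n * gbar @ np.linalg.solve(Omega, gbar))

def pretest_estimator(theta1, theta2, J, c_alpha):
    """Pre-test rule (1.5): keep the aggressive estimator theta2 only if
    the J-test fails to reject the validity of the extra moments."""
    return theta2 if J <= c_alpha else theta1
```

The discontinuous switch between the two estimators is exactly what produces the risk bump of the dashed line in Figure 1.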
The goal of this paper is twofold. First, we propose a data-dependent averaging of $\widehat{\theta}_1$ and $\widehat{\theta}_2$ that takes the form
$$\widehat{\theta}_{\mathrm{eo}} = (1-\widetilde{\omega}_{\mathrm{eo}})\widehat{\theta}_1 + \widetilde{\omega}_{\mathrm{eo}}\widehat{\theta}_2, \quad (1.6)$$
where $\widetilde{\omega}_{\mathrm{eo}} \in [0,1]$ is a data-dependent weight specified in (4.7) below. The subscript in $\widetilde{\omega}_{\mathrm{eo}}$ is short for "empirical optimal" because this weight is an empirical analog of an infeasible optimal weight $\omega^*$ defined in (4.3) below. We plot the finite-sample MSE of this averaging estimator as the solid line in Figure 1. This averaging estimator is robust to misspecification in the sense that the solid line is below 1 for all values of $\delta_0$, in contrast to the bump in the dashed line that represents the pre-test estimator. Second, we develop a uniform asymptotic theory to justify the finite-sample robustness of this averaging estimator. We show that this averaging estimator dominates the conservative estimator uniformly over a large class of models with different degrees
³See the first line of equations (A.10) for the definition of $\widehat{\Omega}_2$.
⁴The poor approximation of the pointwise asymptotics to the finite-sample properties of the pre-test estimator has been noted in Shibata (1986), Pötscher (1991), Kabaila (1995, 2005), and Leeb and Pötscher (2005, 2008), among others.
Figure 1. Finite-Sample ($n=250$) MSEs of the Pre-test and the Averaging GMM Estimators
Note: The pre-test estimator is based on the J-test with a nominal size of 0.01. The location parameter on the horizontal axis is $\delta_0$ multiplied by $n^{1/2}$. Details of the simulation design for Figure 1 are provided in Subsection 5.1.
of misspecification.⁵ The standard asymptotic theory is pointwise and fails to reveal the fragile nature of the pre-test estimator. A stronger uniform notion of robustness is crucial for this model. Furthermore, we quantify the upper and lower bounds of the asymptotic risk differences between the averaging estimator and the conservative estimator.⁶
The rest of the paper is organized as follows. Section 2 discusses the literature related to our paper. Section 3 defines the parameter space over which the uniform result is established and defines uniform dominance. Section 4 introduces the averaging weight. Section 5 provides an analytical representation of the bounds of the asymptotic risk differences and applies it to show that the averaging GMM estimator uniformly dominates the conservative estimator. Section 6 investigates the finite-sample performance of our averaging estimator, and Section 7 applies it to estimate the human capital production function in a life-cycle labor supply model. Section 8 concludes. Proofs and technical arguments are given in the Appendix.

Notation. Let $C > 1$ be a generic positive constant whose value does not change with the sample size. For any real matrix $A$, we use $\|A\|$ to denote the Frobenius norm of $A$. We use the sup norm on the space of real functions. If $A$ is a real square matrix, we use $\mathrm{tr}(A)$ to denote the trace of $A$, and $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$ to denote the smallest and largest eigenvalues of $A$, respectively. For any positive integers $d_1$ and $d_2$, $I_{d_1}$ and $0_{d_1\times d_2}$ stand for the $d_1\times d_1$ identity matrix and the $d_1\times d_2$ zero matrix, respectively. Let $\mathrm{vec}(\cdot)$ denote the vectorization of a matrix and $\mathrm{vech}(\cdot)$ the half
⁵The uniform dominance is established under the truncated weighted loss function, which is defined in (3.7) below.
⁶The lower and upper bounds of the asymptotic risk difference are defined in (3.9) below.
vectorization of a symmetric matrix. For any (possibly random) positive sequences $\{a_n\}_{n=1}^\infty$ and $\{b_n\}_{n=1}^\infty$, $a_n = O_p(b_n)$ means that $\lim_{c\to\infty}\limsup_n \Pr(a_n/b_n > c) = 0$; $a_n = o_p(b_n)$ means that for all $\varepsilon > 0$, $\lim_{n\to\infty}\Pr(a_n/b_n > \varepsilon) = 0$. Let "$\to_p$" and "$\to_d$" stand for convergence in probability and convergence in distribution, respectively.
2 Related Literature
In this section, we discuss some related literature. Our uniform dominance result is related to Stein's phenomenon (Stein, 1956) in parametric models. The James-Stein (JS) estimator (James and Stein, 1961) is shown to dominate the maximum likelihood estimator in exact normal sampling. Hansen (2016) considers local asymptotic analysis of the JS-type averaging estimator in general parametric models and substantially extends its application in econometrics. The present paper focuses on the uniformity issue and studies Stein's phenomenon in non-Gaussian semiparametric nonlinear models. The proposed averaging estimator does not simply mimic the JS-type estimator for moment-based models, and we find that its finite-sample risk compares favorably to that of the latter. The asymptotic results are developed along drifting sequences of data generating processes (DGPs) with different degrees of misspecification. This class of DGPs includes the crucial $n^{-1/2}$ local sequences considered by Hjort and Claeskens (2003), Liu (2015), Hansen (2014, 2015, 2016), and DiTraglia (2014) for various averaging estimators, as well as some more distant sequences. The theoretical results glue all sequences together and show that they are sufficient to provide a uniform approximation of the finite-sample risk differences. The proof uses the techniques developed in Andrews and Guggenberger (2010) and Andrews, Cheng, and Guggenberger (2011) for uniform size control for inference in non-standard problems.
Measured by the MSE, the post-model-selection estimator based on a consistent model selection procedure usually does better than the unrestricted estimator in part of the parameter space and worse in other parts. One standard example is the Hodges estimator, whose scaled maximal MSE diverges to infinity as the sample size grows (see, e.g., Lehmann and Casella, 1998). Similar unbounded risk results for other post-model-selection estimators are established in Yang (2005) and Leeb and Pötscher (2008). The post-model-selection estimator has unbounded (scaled) maximal MSE because it is based on a non-smooth transition rule between the restricted and unrestricted estimators, and a consistent model selection procedure is employed in the transition rule. In contrast, the averaging estimator proposed in this paper is based on a smooth combination of the restricted and unrestricted estimators, and no model selection procedure is used in the smooth combination. Hence our averaging estimator is essentially different
5
from the post-model-selection estimator and the uniform dominance result established in this paper
does not contradict the unbounded risk property of the post-model-selection estimator found in
Yang (2005) and Leeb and Pötscher (2008).
The estimator proposed in this paper is a frequentist model averaging (FMA) estimator. FMA estimators have received much attention in recent years. Important works include Buckland, Burnham, and Augustin (1997), Hjort and Claeskens (2003, 2006), Leung and Barron (2006), Claeskens and Carroll (2007), Hansen (2007, 2008), Hansen and Racine (2012), Cheng and Hansen (2014), and Lu and Su (2015), to name only a few. Uniform asymptotic properties are important for frequentist estimators when standard pointwise asymptotic properties fail to capture their finite-sample behaviors. Our paper provides a uniform asymptotic framework to compare different FMA estimators in moment-based models.
Recently, DiTraglia (2014) and Hansen (2015) both consider averaging estimators that combine the ordinary least squares (OLS) estimator and the two-stage least squares (2SLS) estimator in linear IV models. In linear IV models with homoskedastic errors, our conservative estimator becomes the 2SLS estimator, and our aggressive estimator, which uses both the IVs and the endogenous variables, becomes the OLS estimator.⁷ However, our averaging weight is different from those in DiTraglia (2014) and Hansen (2015), and we study it in a different asymptotic framework.
There is a large literature studying the validity of GMM moment conditions. Many methods can be applied to detect validity, including over-identification tests (see, e.g., Sargan, 1958; Hansen, 1982; and Eichenbaum, Hansen, and Singleton, 1988), information criteria (see, e.g., Andrews, 1999; Andrews and Lu, 2003; Hong, Preston, and Shum, 2003), and penalized estimation methods (see, e.g., Liao, 2013, and Cheng and Liao, 2014). Recently, misspecified moments and their consequences have been considered by Ashley (2009), Berkowitz, Caner, and Fang (2012), Conley, Hansen, and Rossi (2012), Doko Tchatoka and Dufour (2008, 2014), Guggenberger (2012), Nevo and Rosen (2012), Kolesar, Chetty, Friedman, Glaeser, and Imbens (2014), Small (2007), and Small, Cai, Zhang, and Hyunseung (2015), among others. Moon and Schorfheide (2009) explore over-identifying moment inequalities to reduce the MSE. This paper contributes to this literature by providing new uniform results for potentially misspecified semiparametric models.
⁷Consider the linear IV model $Y_i = X_i'\theta_0 + u_i$ with instruments $Z_i$. The aggressive estimator is equivalent to the OLS estimator because $(X'P_{[X,Z]}X)^{-1}X'P_{[X,Z]}Y = (X'X)^{-1}X'Y$, where $Y=(Y_1,\ldots,Y_n)'$, $X=(X_1,\ldots,X_n)'$, $Z=(Z_1,\ldots,Z_n)'$, and $P_{[X,Z]} = (X,Z)[(X,Z)'(X,Z)]^{-1}(X,Z)'$ denotes the projection matrix.
3 Parameter Space and Uniform Dominance
We assume $g_2(w,\theta)$ is twice continuously differentiable with respect to (wrt) $\theta$ for any $w\in\mathcal{W}$. The first- and second-order derivatives are denoted by $g_{2\theta}(w,\theta)$ and $g_{2\theta\theta}(w,\theta)$, respectively. Let $\mathcal{F}$ be a set of distribution functions of $W$. For $k=1$ and $2$, define the Jacobian matrix and the variance-covariance matrix
$$G_{k,F}(\theta) \equiv E_F[g_{k,\theta}(W,\theta)] \in \mathbb{R}^{r_k\times d_\theta} \quad\text{and}\quad \Omega_{k,F}(\theta) \equiv \mathrm{Var}_F[g_k(W,\theta)] \in \mathbb{R}^{r_k\times r_k}, \quad (3.1)$$
for any $\theta\in\Theta$ and any $F\in\mathcal{F}$. These moments exist by Assumption 3.2 below. We consider the risk difference between two estimators uniformly over $F\in\mathcal{F}$ satisfying Assumptions 3.1-3.3 below. Let $G_k \equiv G_{k,F}(\theta_0)$, $\Omega_k \equiv \Omega_{k,F}(\theta_0)$, and $Q_F(\theta) \equiv E_F[g_2(W,\theta)]'\Omega_2^{-1}E_F[g_2(W,\theta)]$ for any $\theta\in\Theta$. For any $\theta\in\Theta$, define $B_\varepsilon^c(\theta) = \{\theta'\in\Theta : \|\theta'-\theta\| \ge \varepsilon\}$.
Assumption 3.1 The following conditions hold for any $F\in\mathcal{F}$.
(i) $E_F[g_1(W,\theta_0)] = 0$ for some $\theta_0 \in \mathrm{int}(\Theta)$.
(ii) For any $\varepsilon > 0$ there is $\delta_{1,\varepsilon} > 0$ such that $\inf_{\theta\in B_\varepsilon^c(\theta_0)} \|E_F[g_1(W,\theta)]\| \ge \delta_{1,\varepsilon}$.
(iii) For any $\varepsilon > 0$ there is $\delta_{2,\varepsilon} > 0$ such that $\inf_{\theta\in B_\varepsilon^c(\theta^*)} Q_F(\theta) - Q_F(\theta^*) \ge \delta_{2,\varepsilon}$ for some $\theta^* \in \mathrm{int}(\Theta)$.
(iv) $\|G_2'\Omega_2^{-1}\delta_{2,0}\| \ge C^{-1}\|\delta_{2,0}\|$, where $\delta_{2,0} = (0_{1\times r_1}, \delta_0')'$.

Assumptions 3.1.(i)-(ii) require that the true unknown parameter $\theta_0$ is identified by the moment conditions $E_F[g_1(W,\theta_0)] = 0$. Assumption 3.1.(iii) implies that a pseudo-true value $\theta^*$ is identified as the minimizer of the GMM criterion $Q_F(\theta)$ under misspecification. Assumption 3.1.(iv) requires that $\delta_{2,0}$ does not lie in the space spanned by the right eigenvectors associated with the zero eigenvalues of the matrix $G_2'\Omega_2^{-1}$, which rules out the special case that $\theta_0$ may be consistently estimable even with severely misspecified moment conditions.
Assumption 3.2 For some $\gamma > 0$, the following conditions hold for any $F\in\mathcal{F}$:
(i) $E_F[\sup_{\theta\in\Theta}(\|g_2(W,\theta)\|^{2+\gamma} + \|\mathrm{vec}(g_{2\theta}(W,\theta))\|^{2+\gamma} + \|\mathrm{vec}(g_{2\theta\theta}(W,\theta))\|^{2+\gamma})] \le C$.
(ii) For $k=1,2$, $C^{-1} \le \lambda_{\min}(\Omega_k) \le \lambda_{\max}(\Omega_k) \le C$.
(iii) For $k=1,2$, $C^{-1} \le \lambda_{\min}(G_k'G_k) \le \lambda_{\max}(G_k'G_k) \le C$.

Assumption 3.2 imposes moment conditions and non-singularity conditions on the moments, the Jacobian matrix, and the covariance matrix in the GMM estimation.
Uniform dominance is of interest only if we allow for different degrees of misspecification in the parameter space. If we only allow for correctly specified models or severely misspecified models, the desired dominance results hold trivially following a pointwise analysis. The next assumption
states that the parameter space contains a continuous perturbation from correctly specified models to misspecified models and that it forms a closed set. Write
$$v_0(\cdot) \equiv \big(\mathrm{vec}[G_{2,F}(\cdot)]',\ \mathrm{vech}[\Omega_{2,F}(\cdot)]',\ [M_{2,F}(\cdot) - M_{2,F}(\theta_0)]'\big)', \quad (3.2)$$
where $M_{2,F}(\theta) \equiv E_F[g_2(W,\theta)]$. Note that $G_{2,F}(\cdot)$, $\Omega_{2,F}(\cdot)$, and $M_{2,F}(\cdot)$ are all functions indexed by $\theta$ for any $F$. Let
$$\Lambda \equiv \{(\delta_0, v_0(\cdot)) : \theta\in\Theta \text{ and } F\in\mathcal{F}\}, \quad (3.3)$$
where $\delta_0$ and $v_0(\cdot)$ depend on $F$; $\delta_0$ characterizes the degree of misspecification, and $v_0(\cdot)$ collects the Jacobian matrix, the variance-covariance matrix, and the moments evaluated at each $\theta$.⁸
Assumption 3.3 (i) For some $\varepsilon > 0$, $(\delta, v(\cdot)) \in \Lambda$ with $0 < \|\delta\| < \varepsilon$ exists, and it implies that $(\widetilde{\delta}, v(\cdot)) \in \Lambda$ for all $\widetilde{\delta}\in\mathbb{R}^{r^*}$ with $0 \le \|\widetilde{\delta}\| < \varepsilon$; (ii) $\Lambda$ is a closed set.

Example 3.1 (Linear IV Model) We study a simple linear IV model and provide a set of simple
conditions that imply Assumptions 3.1, 3.2, and 3.3. The parameter of interest $\theta_0$ is the coefficient of the endogenous regressor $X_i$ in
$$Y_i = X_i'\theta_0 + U_i, \quad\text{where } E_{F^*}[Z_{1,i}U_i] = 0, \quad (3.4)$$
for some valid instruments $Z_{1,i}\in\mathbb{R}^{r_1}$. In addition, we have some potentially misspecified IVs $Z_i^*\in\mathbb{R}^{r^*}$. In a reduced-form equation obtained by projecting $Z_i^*$ on $U_i$, we can write
$$Z_i^* = U_i\delta_0 + V_i, \quad\text{where } E_{F^*}[U_iV_i] = 0, \quad (3.5)$$
and $\delta_0$ characterizes the degree of misspecification. These additional IVs are valid only if $\delta_0 = 0$. In this example, $E_F[g_1(W_i,\theta)] = E_F[Z_{1,i}(Y_i - X_i'\theta)]$ and $E_F[g^*(W_i,\theta)] = E_F[Z_i^*(Y_i - X_i'\theta)]$.

Let $F^*$ denote the joint distribution of $(U_i, V_i', Z_{1,i}', X_i')$ and $\mathcal{F}^*$ denote a class of distributions containing $F^*$. Let $Z_{2,i} = (Z_{1,i}', Z_i^{*\prime})'$. The Jacobian and the variance-covariance matrices are
$$G_k = -E_{F^*}[Z_{k,i}X_i'] \quad\text{and}\quad \Omega_k = E_{F^*}[U_i^2 Z_{k,i}Z_{k,i}'] - E_{F^*}[U_iZ_{k,i}]E_{F^*}[U_iZ_{k,i}'], \quad (3.6)$$
respectively. Note that $E_{F^*}[U_iZ_{k,i}] \neq 0$ if $\delta_0 \neq 0$. Without loss of generality, we assume $E_{F^*}[U_i^2] = 1$ so that $E_F[g^*(W_i,\theta_0)] = E_{F^*}[Z_i^*U_i] = \delta_0$, as in (1.4). Given the linear structures in (3.4) and
⁸For ease of notation, the dependence of $\delta_0$ and $v_0$ on $F$ is omitted.
(3.5), the joint distribution of $W_i = (Y_i, Z_{1,i}', Z_i^{*\prime}, X_i')$, denoted by $F$, is determined by $\theta_0$, $\delta_0$, and $F^*$.
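A sample from the design (3.4)-(3.5) can be drawn directly. The sketch below uses our own choices for $r_1$, the first-stage coefficients, and the error distributions, which the example leaves unspecified; the point is that the correlation between $Z_i^*$ and the structural error $U_i$ is exactly $\delta_0$ once $E[U_i^2]$ is normalized to 1.

```python
import numpy as np

def simulate_linear_iv(n, theta0, delta0, rng):
    """Draw one sample from (3.4)-(3.5): valid IVs Z1, and potentially
    invalid IVs Z* = U delta0' + V whose covariance with U equals delta0."""
    r1, rs = 3, len(delta0)                       # r1 is our choice, not the paper's
    Z1 = rng.normal(size=(n, r1))
    U = rng.normal(size=n)                        # E[U^2] = 1 normalization
    V = rng.normal(size=(n, rs))
    Zstar = U[:, None] * np.asarray(delta0) + V   # reduced form (3.5)
    X = Z1 @ np.full(r1, 0.5) + 0.5 * U + rng.normal(size=n)  # endogenous regressor
    Y = X * theta0 + U                            # structural equation (3.4)
    return Y, X, Z1, Zstar
```

Setting `delta0` to a vector of zeros recovers the correctly specified case in which the aggressive estimator is consistent.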
Assumption 3.4 For some $\gamma > 0$, the following conditions hold for any $F^*\in\mathcal{F}^*$:
(i) $E_{F^*}[Z_{1,i}U_i] = 0$ and $E_{F^*}[V_iU_i] = 0$.
(ii) For $k=1,2$, $\lambda_{\min}(G_k'G_k) \ge C^{-1}$ and $\lambda_{\min}(\Omega_k) \ge C^{-1}$.
(iii) $E_{F^*}[\|X_i\|^{4+\gamma} + \|Z_{1,i}\|^{4+\gamma} + \|Z_i^*\|^{4+\gamma} + U_i^{4+\gamma}] \le C$.
(iv) $\|G_2'\Omega_2^{-1}\delta_{2,0}\| \ge C^{-1}\|\delta_{2,0}\|$, where $\delta_{2,0} = (0_{1\times r_1}, \delta_0')'$.
(v) $\{G_2 : F^*\in\mathcal{F}^*\}$ and $\{\Omega_2 : F^*\in\mathcal{F}^*\}$ are closed sets.
Let $\theta^* = (G_2'\Omega_2^{-1}G_2)^{-1}G_2'\Omega_2^{-1}E_{F^*}[Z_{2,i}Y_i]$, which is well defined under Assumptions 3.4.(ii)-(iii).

Lemma 3.1 Let $\mathcal{F}$ be the collection of $F$ such that $W_i$ is generated by the linear model (3.4) and (3.5) with: (i) $\theta_0, \theta^* \in \mathrm{int}(\Theta)$ and $\Theta$ is compact; (ii) $\delta_0 \in [0,C]^{r^*} = [0,C]\times\cdots\times[0,C]$ for some $C > 0$; and (iii) $\mathcal{F}^*$ satisfies Assumption 3.4. Then $\mathcal{F}$ satisfies Assumptions 3.1, 3.2, and 3.3.

For the linear IV model, Lemma 3.1 provides simple conditions on $\theta_0$, $\delta_0$, and $\mathcal{F}^*$ under which the uniformity results are subsequently established.
Now we return to the general setup. For a generic estimator $\widehat{\theta}$, consider a weighted quadratic loss function
$$\ell(\widehat{\theta}) \equiv n(\widehat{\theta} - \theta_0)'\Upsilon(\widehat{\theta} - \theta_0), \quad (3.7)$$
where $\Upsilon$ is a $d_\theta\times d_\theta$ pre-determined positive semi-definite matrix. For example, if $\Upsilon = I_{d_\theta}$, $E_F[\ell(\widehat{\theta})]$ is the MSE of $\widehat{\theta}$. If $\Upsilon = (\Sigma_1-\Sigma_2)^{-1}$, the weighting matrix $\Upsilon$ first rescales $\widehat{\theta}$ by the scale of the variance reduction. If $\Upsilon = E_F[X_iX_i']$ for regressors $X_i$, $E_F[\ell(\widehat{\theta})]$ is the MSE of $X_i'\widehat{\theta}$, an estimator of $X_i'\theta_0$.
Below we compare the averaging estimator $\widehat{\theta}_{\mathrm{eo}}$ and the conservative estimator $\widehat{\theta}_1$. We are interested in the bounds of the finite-sample risk difference
$$\underline{\mathrm{RD}}_n(\widehat{\theta}_{\mathrm{eo}},\widehat{\theta}_1) \equiv \inf_{F\in\mathcal{F}} E_F[\ell(\widehat{\theta}_{\mathrm{eo}}) - \ell(\widehat{\theta}_1)], \qquad \overline{\mathrm{RD}}_n(\widehat{\theta}_{\mathrm{eo}},\widehat{\theta}_1) \equiv \sup_{F\in\mathcal{F}} E_F[\ell(\widehat{\theta}_{\mathrm{eo}}) - \ell(\widehat{\theta}_1)]. \quad (3.8)$$
We investigate these finite-sample objects by asymptotic analysis, with $n^{1/2}(\widehat{\theta} - \theta_0)$ in $\ell(\widehat{\theta})$ replaced by its asymptotic distribution. To apply the bounded convergence theorem, we approximate $\ell(\widehat{\theta})$ with the trimmed loss $\ell_\zeta(\widehat{\theta}) \equiv \min\{\ell(\widehat{\theta}), \zeta\}$ and consider arbitrarily large trimming ($\zeta\to\infty$). As
such, the finite-sample bounds in (3.8) are approximated by
$$\underline{\mathrm{AsyRD}}(\widehat{\theta}_{\mathrm{eo}},\widehat{\theta}_1) \equiv \lim_{\zeta\to\infty}\liminf_{n\to\infty}\inf_{F\in\mathcal{F}} E_F[\ell_\zeta(\widehat{\theta}_{\mathrm{eo}}) - \ell_\zeta(\widehat{\theta}_1)] \quad\text{and}$$
$$\overline{\mathrm{AsyRD}}(\widehat{\theta}_{\mathrm{eo}},\widehat{\theta}_1) \equiv \lim_{\zeta\to\infty}\limsup_{n\to\infty}\sup_{F\in\mathcal{F}} E_F[\ell_\zeta(\widehat{\theta}_{\mathrm{eo}}) - \ell_\zeta(\widehat{\theta}_1)], \quad (3.9)$$
which are called the lower and upper bounds of the asymptotic risk difference, respectively, in this paper. The averaging estimator $\widehat{\theta}_{\mathrm{eo}}$ asymptotically uniformly dominates the conservative estimator $\widehat{\theta}_1$ if
$$\underline{\mathrm{AsyRD}}(\widehat{\theta}_{\mathrm{eo}},\widehat{\theta}_1) < 0 \quad\text{and}\quad \overline{\mathrm{AsyRD}}(\widehat{\theta}_{\mathrm{eo}},\widehat{\theta}_1) \le 0. \quad (3.10)$$
The bounds of the asymptotic risk difference build uniformity over $F\in\mathcal{F}$ into the definition by taking $\inf_{F\in\mathcal{F}}$ and $\sup_{F\in\mathcal{F}}$ before $\liminf_{n\to\infty}$ and $\limsup_{n\to\infty}$. Uniformity is crucial for the asymptotic results to give a good approximation to their finite-sample counterparts. These uniform bounds are different from pointwise results, which are obtained either under a fixed DGP or under a particular sequence of drifting DGPs. The sequence of DGPs $\{F_n\}$ along which the supremum or the infimum is approached often varies with the sample size.⁹ Therefore, to determine the bounds of the asymptotic risk difference, one has to derive the asymptotic distributions of these estimators under various sequences $\{F_n\}$. For each sample size $n$, the true values $F$, $\theta_0$, $\delta_0$, $v_0$ are denoted by $F_n$, $\theta_n$, $\delta_n$, $v_n$, respectively. These true values satisfy the model specified in (1.1), (1.4), and (3.2) with the subscript 0 replaced by $n$. Under $\{F_n\}$, the observations $\{W_{n,i}\}_{i=1}^n$ form a triangular array. For notational simplicity, $W_{n,i}$ is abbreviated to $W_i$.
To study the bounds of the asymptotic risk difference, we consider sequences of DGPs $\{F_n\}$ such that $\delta_n$ satisfies
$$\text{(i) } n^{1/2}\delta_n \to d \in \mathbb{R}^{r^*} \quad\text{or}\quad \text{(ii) } \|n^{1/2}\delta_n\| \to \infty. \quad (3.11)$$
Case (i) models mild misspecification, where $\delta_n$ is a local deviation from 0. Case (ii) includes severe misspecification, where $\delta_n$ is bounded away from 0, as well as the intermediate case in which $\delta_n \to 0$ and $\|n^{1/2}\delta_n\| \to \infty$. To obtain a uniform approximation, all of these sequences are necessary. Once we study the bounds of the asymptotic risk difference along each of these sequences, we show that we can glue them together to obtain the bounds of the asymptotic risk difference.
⁹In the rest of the paper, we use $\{F_n\}$ to denote $\{F_n \in \mathcal{F} : n = 1, 2, \ldots\}$.
4 Averaging Weight
We start by deriving the joint asymptotic distribution of $\widehat{\theta}_1$ and $\widehat{\theta}_2$ under different degrees of misspecification. We consider sequences of DGPs $\{F_n\}$ such that (i) $n^{1/2}\delta_n \to d \in \mathbb{R}^{r^*}$ or $\|n^{1/2}\delta_n\| \to \infty$, and (ii) $G_{2,F_n}(\cdot)$, $\Omega_{2,F_n}(\cdot)$, and $M_{2,F_n}(\cdot)$ are convergent (under the uniform norm). The following lemma characterizes the limits of $G_{2,F_n}(\cdot)$, $\Omega_{2,F_n}(\cdot)$, and $M_{2,F_n}(\cdot)$ under Assumption 3.3.
Lemma 4.1 Suppose that $G_{2,F_n}(\cdot)\to G_2(\cdot)$, $\Omega_{2,F_n}(\cdot)\to\Omega_2(\cdot)$, and $M_{2,F_n}(\cdot)\to M_2(\cdot)$ for some functions $G_2(\cdot)$, $\Omega_2(\cdot)$, and $M_2(\cdot)$. Then under Assumption 3.3, there exists some $F\in\mathcal{F}$ such that
$$G_2(\cdot) = G_{2,F}(\cdot), \quad \Omega_2(\cdot) = \Omega_{2,F}(\cdot), \quad\text{and}\quad M_2(\cdot) = M_{2,F}(\cdot).$$

Let $M_{1,F}(\cdot)$ denote the first $r_1$ components of $M_{2,F}(\cdot)$. By Assumption 3.1.(i), there is a unique $\theta_0 \in \mathrm{int}(\Theta)$ which satisfies $M_{1,F}(\theta_0) = 0$. Under a fixed DGP, we use $F$ to denote the true DGP and $\theta_0$ to denote the true value under $F$. Under the drifting DGP approximation, $F$ is defined in Lemma 4.1 by the limit, and $\theta_0$ denotes the true value in the limit, i.e., the unique value identified by the limit of the valid moment conditions.
For $k=1,2$, define
$$G_k \equiv G_{k,F}(\theta_0), \quad \Omega_k \equiv \Omega_{k,F}(\theta_0), \quad\text{and}\quad \Gamma_k \equiv -\big(G_k'\Omega_k^{-1}G_k\big)^{-1}G_k'\Omega_k^{-1}. \quad (4.1)$$
Let $\mathcal{Z}_2$ denote a zero-mean normal random vector with variance-covariance matrix $\Omega_2$, and let $\mathcal{Z}_1$ denote its first $r_1$ components.
Lemma 4.2 Suppose Assumptions 3.1-3.3 hold. Consider $\{F_n\}$ such that $G_{2,F_n}(\cdot)\to G_2(\cdot)$, $\Omega_{2,F_n}(\cdot)\to\Omega_2(\cdot)$, and $M_{2,F_n}(\cdot)\to M_2(\cdot)$ for some functions $G_2(\cdot)$, $\Omega_2(\cdot)$, $M_2(\cdot)$.
(a) If $n^{1/2}\delta_n \to d$ for some $d\in\mathbb{R}^{r^*}$, then
$$\begin{pmatrix} n^{1/2}(\widehat{\theta}_1 - \theta_n) \\ n^{1/2}(\widehat{\theta}_2 - \theta_n) \end{pmatrix} \to_d \begin{pmatrix} \xi_1 \\ \xi_2 \end{pmatrix} \equiv \begin{pmatrix} \Gamma_1\mathcal{Z}_1 \\ \Gamma_2(\mathcal{Z}_2 + d_0) \end{pmatrix}, \quad\text{where } d_0 = (0_{1\times r_1}, d')'.$$
(b) If $\|n^{1/2}\delta_n\| \to \infty$, then $n^{1/2}(\widehat{\theta}_1 - \theta_n) \to_d \xi_1$ and $\|n^{1/2}(\widehat{\theta}_2 - \theta_n)\| \to_p \infty$.
Given the joint asymptotic distribution of $\widehat{\theta}_1$ and $\widehat{\theta}_2$, it is straightforward to study $\widehat{\theta}(\omega) = (1-\omega)\widehat{\theta}_1 + \omega\widehat{\theta}_2$ if $\omega$ is deterministic. Following Lemma 4.2(a),
$$n^{1/2}(\widehat{\theta}(\omega) - \theta_n) \to_d \overline{\xi}(\omega) \equiv (1-\omega)\xi_1 + \omega\xi_2 \quad (4.2)$$
for $n^{1/2}\delta_n \to d$, where $d\in\mathbb{R}^{r^*}$. In the Appendix, a simple calculation shows that $E[\ell(\overline{\xi}(\omega))]$ is minimized at the infeasible optimal weight
$$\omega^* \equiv \frac{\mathrm{tr}[\Upsilon(\Sigma_1 - \Sigma_2)]}{d_0'(\Gamma_2 - \overline{\Gamma}_1)'\Upsilon(\Gamma_2 - \overline{\Gamma}_1)d_0 + \mathrm{tr}[\Upsilon(\Sigma_1 - \Sigma_2)]}, \quad (4.3)$$
where
$$\Sigma_k \equiv \big(G_k'\Omega_k^{-1}G_k\big)^{-1} \text{ for } k=1,2 \quad\text{and}\quad \overline{\Gamma}_1 \equiv [\Gamma_1, 0_{d_\theta\times r^*}]. \quad (4.4)$$
To gain some intuition, consider the case where $\Upsilon = I_{d_\theta}$, so that the MSE of $\widehat{\theta}(\omega)$ is minimized at $\omega^*$. In this case, the infeasible optimal weight $\omega^*$ yields the ideal bias-variance trade-off. However, the bias depends on $d$, which cannot be consistently estimated. Hence, $\omega^*$ cannot be consistently estimated. Our solution to this problem is to provide an inconsistent estimator of $\omega^*$ such that the resulting averaging estimator reduces the MSE for any value of $d$.
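The calculation behind (4.3) can be reconstructed from Lemma 4.2(a) as follows. This is our sketch, not the Appendix derivation verbatim; it uses the standard efficient-GMM property that the covariance between the two limits equals the more efficient variance, $\mathrm{Cov}(\xi_1,\xi_2)=\Sigma_2$.

```latex
% Write \bar\xi(\omega) = \xi_1 + \omega(\xi_2 - \xi_1), so that
E[\ell(\bar\xi(\omega))]
  = E[\xi_1'\Upsilon\xi_1]
  + 2\omega\, E[\xi_1'\Upsilon(\xi_2-\xi_1)]
  + \omega^2\, E[(\xi_2-\xi_1)'\Upsilon(\xi_2-\xi_1)].
% Under Lemma 4.2(a): Var(\xi_1)=\Sigma_1, Var(\xi_2)=\Sigma_2,
% Cov(\xi_1,\xi_2)=\Sigma_2, E[\xi_1]=0, E[\xi_2-\xi_1]=(\Gamma_2-\bar\Gamma_1)d_0.
% Hence Var(\xi_2-\xi_1)=\Sigma_1-\Sigma_2 and
E[\xi_1'\Upsilon(\xi_2-\xi_1)] = -\operatorname{tr}[\Upsilon(\Sigma_1-\Sigma_2)],
\quad
E[(\xi_2-\xi_1)'\Upsilon(\xi_2-\xi_1)]
  = d_0'(\Gamma_2-\bar\Gamma_1)'\Upsilon(\Gamma_2-\bar\Gamma_1)d_0
  + \operatorname{tr}[\Upsilon(\Sigma_1-\Sigma_2)].
% The first-order condition in \omega (minimize a quadratic) then yields (4.3):
\omega^* = \frac{\operatorname{tr}[\Upsilon(\Sigma_1-\Sigma_2)]}
 {d_0'(\Gamma_2-\bar\Gamma_1)'\Upsilon(\Gamma_2-\bar\Gamma_1)d_0
  + \operatorname{tr}[\Upsilon(\Sigma_1-\Sigma_2)]}.
```

The numerator is the variance gain from the extra moments, and the extra term in the denominator is the squared (weighted) asymptotic bias, which makes the bias-variance trade-off explicit.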
The empirical analog of $\omega^*$ is constructed as follows. First, for $k=1$ and $2$, replace $\Sigma_k$ by its consistent estimator¹⁰ $\widehat{\Sigma}_k \equiv (\widehat{G}_k'\widehat{\Omega}_k^{-1}\widehat{G}_k)^{-1}$, where
$$\widehat{G}_k \equiv n^{-1}\sum_{i=1}^n g_{k,\theta}(W_i,\widehat{\theta}_1) \quad\text{and}\quad \widehat{\Omega}_k \equiv n^{-1}\sum_{i=1}^n g_k(W_i,\widehat{\theta}_1)g_k(W_i,\widehat{\theta}_1)' - \overline{g}_k(\widehat{\theta}_1)\overline{g}_k(\widehat{\theta}_1)'. \quad (4.5)$$
Second, replace $(\Gamma_2 - \overline{\Gamma}_1)d_0$ by its asymptotically unbiased estimator $n^{1/2}(\widehat{\theta}_2 - \widehat{\theta}_1)$, because
$$n^{1/2}(\widehat{\theta}_2 - \widehat{\theta}_1) \to_d (\Gamma_2 - \overline{\Gamma}_1)(\mathcal{Z}_2 + d_0) \quad (4.6)$$
for $d_0 = (0_{1\times r_1}, d')'$ and $d\in\mathbb{R}^{r^*}$, following Lemma 4.2(a). The empirical optimal weight takes the form
$$\widetilde{\omega}_{\mathrm{eo}} \equiv \frac{\mathrm{tr}\big[\Upsilon(\widehat{\Sigma}_1 - \widehat{\Sigma}_2)\big]}{n(\widehat{\theta}_2 - \widehat{\theta}_1)'\Upsilon(\widehat{\theta}_2 - \widehat{\theta}_1) + \mathrm{tr}\big[\Upsilon(\widehat{\Sigma}_1 - \widehat{\Sigma}_2)\big]}, \quad (4.7)$$
where $\Upsilon$ is the matrix specified in the loss function. The averaging GMM estimator takes the form
$$\widehat{\theta}_{\mathrm{eo}} = (1-\widetilde{\omega}_{\mathrm{eo}})\widehat{\theta}_1 + \widetilde{\omega}_{\mathrm{eo}}\widehat{\theta}_2. \quad (4.8)$$
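Given the two estimates and the variance estimates from (4.5), the weight (4.7) and the estimator (4.8) amount to a few matrix operations. A minimal sketch (the function name is ours):

```python
import numpy as np

def averaging_gmm(theta1, theta2, Sigma1_hat, Sigma2_hat, Upsilon, n):
    """Empirical optimal weight (4.7) and averaging estimator (4.8).

    theta1, theta2 : conservative and aggressive GMM estimates
    Sigma*_hat     : estimated asymptotic variances (G' Omega^{-1} G)^{-1}
    Upsilon        : weighting matrix of the loss (3.7)
    """
    diff = theta2 - theta1
    gain = np.trace(Upsilon @ (Sigma1_hat - Sigma2_hat))  # variance reduction
    bias = n * diff @ Upsilon @ diff                      # squared-bias proxy
    w = gain / (bias + gain)
    return (1.0 - w) * theta1 + w * theta2, w
```

When $\widehat{\theta}_2$ is far from $\widehat{\theta}_1$, evidence of misspecification inflates the denominator and drives the weight toward 0, so $\widehat{\theta}_{\mathrm{eo}}$ smoothly reverts to the conservative estimator instead of switching discontinuously as the pre-test rule does.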
Next we consider the asymptotic distribution of $\widehat{\theta}_{\mathrm{eo}}$ under different degrees of misspecification.

Lemma 4.3 Suppose that Assumptions 3.1-3.3 hold. Consider $\{F_n\}$ such that $G_{2,F_n}(\cdot)\to G_2(\cdot)$, $\Omega_{2,F_n}(\cdot)\to\Omega_2(\cdot)$, and $M_{2,F_n}(\cdot)\to M_2(\cdot)$ for some functions $G_2(\cdot)$, $\Omega_2(\cdot)$, $M_2(\cdot)$.

¹⁰The consistency is proved in Lemma 4.3.
(a) If $n^{1/2}\delta_n \to d$ for some $d\in\mathbb{R}^{r^*}$, then
$$\widetilde{\omega}_{\mathrm{eo}} \to_d \overline{\omega} \equiv \frac{\mathrm{tr}(\Upsilon(\Sigma_1-\Sigma_2))}{(\mathcal{Z}_2+d_0)'(\Gamma_2-\overline{\Gamma}_1)'\Upsilon(\Gamma_2-\overline{\Gamma}_1)(\mathcal{Z}_2+d_0) + \mathrm{tr}(\Upsilon(\Sigma_1-\Sigma_2))}$$
and
$$n^{1/2}(\widehat{\theta}_{\mathrm{eo}} - \theta_n) \to_d \overline{\xi} \equiv (1-\overline{\omega})\xi_1 + \overline{\omega}\xi_2.$$
(b) If $\|n^{1/2}\delta_n\| \to \infty$, then $\widetilde{\omega}_{\mathrm{eo}} \to_p 0$ and $n^{1/2}(\widehat{\theta}_{\mathrm{eo}} - \theta_n) \to_d \xi_1$.
To study the bounds of the asymptotic risk difference between $\widehat{\theta}_{\mathrm{eo}}$ and $\widehat{\theta}_1$, it is important to take into account the data-dependent nature of $\widetilde{\omega}_{\mathrm{eo}}$. Unlike $\widehat{\theta}_1$ and $\widehat{\theta}_2$, the randomness in $\widetilde{\omega}_{\mathrm{eo}}$ is non-negligible in the mild misspecification case (a). In consequence, $\widehat{\theta}_{\mathrm{eo}}$ does not achieve the same bounds of asymptotic risk difference as the ideal averaging estimator $(1-\omega^*)\widehat{\theta}_1 + \omega^*\widehat{\theta}_2$ does. Nevertheless, below we show that $\widehat{\theta}_{\mathrm{eo}}$ is insured against potentially misspecified moments because it uniformly dominates $\widehat{\theta}_1$.

5 Bounds of Asymptotic Risk Difference under Misspecification
In this section, we study the bounds of the asymptotic risk difference defined in (3.9). Note that the asymptotic distributions of $\widehat{\theta}_1$ and $\widehat{\theta}_{\mathrm{eo}}$ in Lemmas 4.2 and 4.3 depend only on $d$, $G_2$, and $\Omega_2$. For notational convenience, define
$$h \equiv (d', \mathrm{vec}(G_2)', \mathrm{vech}(\Omega_2)')'. \quad (5.1)$$
For the mild misspecification case, define the parameter space of $h$ as
$$H = \{h : d\in\mathbb{R}^{r^*},\ G_2 = G_{2,F}(\theta_0) \text{ and } \Omega_2 = \Omega_{2,F}(\theta_0) \text{ for some } F\in\mathcal{F} \text{ with } \delta_0 = 0\}, \quad (5.2)$$
where $\theta_0$ and $\delta_0$ are defined by (1.1) and (1.4) for a given $F$.
If $n^{1/2}\delta_n \to d\in\mathbb{R}^{r^*}$, then by the asymptotic distributions of $\widehat{\theta}_{\mathrm{eo}}$ and $\widehat{\theta}_1$ and the bounded convergence theorem,
$$E_F[\ell_\zeta(\widehat{\theta}_{\mathrm{eo}}) - \ell_\zeta(\widehat{\theta}_1)] \to g_\zeta(h) \equiv E\big[\min\{\overline{\xi}'\Upsilon\overline{\xi}, \zeta\} - \min\{\xi_1'\Upsilon\xi_1, \zeta\}\big], \quad (5.3)$$
where $\zeta$ is a constant which does not depend on $n$ and $E[\cdot]$ denotes expectation with respect to (wrt) the Lebesgue measure. As $\zeta\to\infty$, the limit of $g_\zeta(h)$ is
$$g(h) \equiv E[\overline{\xi}'\Upsilon\overline{\xi}] - E[\xi_1'\Upsilon\xi_1]. \quad (5.4)$$
Figure 2. The Finite-Sample Risk and the (Simulated) Asymptotic Risk
Note: The finite-sample and (simulated) asymptotic risks of the averaging GMM estimator are denoted "GMMA-FRisk" and "GMMA-SRisk", respectively. The finite-sample and (simulated) asymptotic risks of the pre-test GMM estimator are denoted "GMMP-FRisk" and "GMMP-SRisk", respectively.
Theorem 5.1 Suppose that Assumptions 3.1-3.3 hold. The bounds of the asymptotic risk difference satisfy
$$\underline{\mathrm{AsyRD}}(\widehat{\theta}_{\mathrm{eo}},\widehat{\theta}_1) = \lim_{\zeta\to\infty}\min\Big\{\inf_{h\in H}[g_\zeta(h)],\ 0\Big\} = \min\Big\{\inf_{h\in H}[g(h)],\ 0\Big\},$$
$$\overline{\mathrm{AsyRD}}(\widehat{\theta}_{\mathrm{eo}},\widehat{\theta}_1) = \lim_{\zeta\to\infty}\max\Big\{\sup_{h\in H}[g_\zeta(h)],\ 0\Big\} = \max\Big\{\sup_{h\in H}[g(h)],\ 0\Big\}.$$
By Theorem 5.1, to show that $\widehat{\theta}_{\mathrm{eo}}$ uniformly dominates $\widehat{\theta}_1$, it is sufficient to show that $\inf_{h\in H}[g(h)] < 0$ and $\sup_{h\in H}[g(h)] \le 0$. We can investigate $\inf_{h\in H}g(h)$ and $\sup_{h\in H}g(h)$ by simulating $g(h)$ in (5.4). In practice, we replace $G_2$ and $\Omega_2$ by their consistent estimators and plot $g(h)$ as a function of $d$. Even if the uniform dominance condition does not hold, $\min\{\inf_{h\in H}[g(h)], 0\}$ and $\max\{\sup_{h\in H}[g(h)], 0\}$ quantify the most- and least-favorable scenarios for the averaging estimator.

Following (5.3) and (5.4), we call $E[\overline{\xi}'\Upsilon\overline{\xi}]$ and $E[\xi_1'\Upsilon\xi_1]$ the asymptotic risks of $\widehat{\theta}_{\mathrm{eo}}$ and $\widehat{\theta}_1$, so that $g(h)$ represents the asymptotic risk difference. For any given $h$, Figure 2 uses simulation to demonstrate that these asymptotic risks provide good approximations to the finite-sample risks $E_F[\ell(\widehat{\theta}_{\mathrm{eo}})]$ and $E_F[\ell(\widehat{\theta}_1)]$. Figure 2 normalizes the finite-sample risk of $\widehat{\theta}_1$ to be 1 in all cases
and plots the (simulated) asymptotic risks and finite-sample risks of two estimators: one is the averaging estimator $\widehat{\theta}_{\mathrm{eo}}$ and the other is the pre-test GMM estimator based on the over-identification J-test with significance level 0.01.¹¹ The asymptotic risk for this pre-test estimator is given by (A.89) in the Appendix. It is clear that the finite-sample risk and the (simulated) asymptotic risk are fairly close, and the averaging GMM estimator uniformly dominates the conservative estimator while the pre-test estimator does not.
Theorem 5.2 Let $A \equiv \Upsilon(\Sigma_1 - \Sigma_2)$. Suppose that Assumptions 3.1-3.3 hold. If $\mathrm{tr}(A) > 0$ and $\mathrm{tr}(A) \ge 4\lambda_{\max}(A)$ for any $F\in\mathcal{F}$, we have
$$\underline{\mathrm{AsyRD}}(\widehat{\theta}_{\mathrm{eo}},\widehat{\theta}_1) < 0 \quad\text{and}\quad \overline{\mathrm{AsyRD}}(\widehat{\theta}_{\mathrm{eo}},\widehat{\theta}_1) = 0.$$
Thus, $\widehat{\theta}_{\mathrm{eo}}$ uniformly dominates $\widehat{\theta}_1$.
To shed light on the sufficient conditions in Theorem 5.2, consider a scenario similar to the James-Stein estimator: $\Sigma_1 = \sigma_1^2 I_{d_\theta}$, $\Sigma_2 = \sigma_2^2 I_{d_\theta}$, and $\Upsilon = I_{d_\theta}$. In this case, the sufficient conditions become $\sigma_1 > \sigma_2$ and $d_\theta \ge 4$. The first condition, $\mathrm{tr}(A) > 0$, which reduces to $\sigma_1 > \sigma_2$, requires that the additional moments $E_F[g^*(W_i,\theta_0)] = 0$ are non-redundant in the sense that they lead to a more efficient estimator of $\theta$. The second condition, $\mathrm{tr}(A) \ge 4\lambda_{\max}(A)$, which reduces to $d_\theta \ge 4$, requires that we are interested in the total risk of several parameters rather than that of a single one. In the more general case where $\Sigma_1$ and $\Sigma_2$ are not proportional to the identity matrix, the sufficient conditions reduce to $\Sigma_1 > \Sigma_2$ and $d_\theta \ge 4$ under the choice $\Upsilon = (\Sigma_1 - \Sigma_2)^{-1}$, which rescales $\widehat{\theta}$ by the variance reduction $\Sigma_1 - \Sigma_2$. In a simple linear IV model (Example 3.1) where $Z_i^*$ is independent of $Z_{1,i}$ and the regression error $U_i$ is homoskedastic conditional on the IVs, $\Sigma_1 > \Sigma_2$ requires that $E_{F^*}[Z_i^*X_i']$ and $E_{F^*}[Z_i^*Z_i^{*\prime}]$ both have full rank.
Note that these conditions are sufficient but not necessary. If they do not hold, we can still simulate the upper bound in Theorem 5.1 to check the uniform dominance condition. In fact, the simulation studies in the next section show that in many cases $\hat\theta_{eo}$ has a smaller finite-sample risk than $\hat\theta_1$ even when these sufficient conditions are violated. Nevertheless, the analytical sufficient conditions can be checked easily before the simulation-based methods are adopted.
¹¹For the finite-sample results, $\delta_n = c_0 n^{-1/2} 1_{r^*\times 1}$, where $c_0$ is a scalar ranging from 0 to 20. The finite-sample risks are calculated using 100,000 simulated samples, and the asymptotic risks are simulated by drawing 10,000 normal random vectors with mean zero and variance-covariance matrix $\hat\Omega_2$ in each simulated sample. No truncation is applied to the finite-sample risk.
6 Simulation Studies
In this section, we investigate the finite-sample performance of our averaging GMM estimator in linear IV models. In addition to the empirical optimal weight $\tilde\omega_{eo}$, we consider two other averaging estimators based on James-Stein (JS) type weights. The first is based on the positive part of the JS weight:¹²
$$\omega_{P,JS} = 1 - \left(1 - \frac{\mathrm{tr}(\hat A) - 2\lambda_{\max}(\hat A)}{n(\hat\theta_2 - \hat\theta_1)' H (\hat\theta_2 - \hat\theta_1)}\right)_+, \qquad (6.1)$$
where $(x)_+ = \max\{0, x\}$ and $\hat A$ is the estimator of $A$ using $\hat\Sigma_k$ for $k = 1, 2$. The second uses the restricted JS weight
$$\omega_{R,JS} = (\omega_{P,JS})_+. \qquad (6.2)$$
By construction, $\omega_{P,JS} \leq 1$ and $0 \leq \omega_{R,JS} \leq 1$. We compare the finite-sample MSEs of these three averaging estimators, the conservative GMM estimator $\hat\theta_1$, and the pre-test GMM estimator based on the J-test. The finite-sample MSE of the conservative GMM estimator is normalized to be 1.
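The two JS weights can be computed directly from the two estimates. The sketch below is our own reading of the formulas in (6.1)-(6.2), with `H` denoting the weight matrix in the quadratic form there:

```python
import numpy as np

def js_weights(theta1, theta2, A_hat, H, n):
    """Positive-part and restricted JS weights, following (6.1)-(6.2):
    omega_PJS = 1 - (1 - tau / (n q))_+ with tau = tr(A_hat) - 2 lambda_max(A_hat)
    and q = (theta2 - theta1)' H (theta2 - theta1); omega_RJS = (omega_PJS)_+."""
    diff = theta2 - theta1
    tau = np.trace(A_hat) - 2 * np.max(np.linalg.eigvals(A_hat).real)
    q = float(diff @ H @ diff)
    w_pjs = 1 - max(0.0, 1 - tau / (n * q))  # weight on the aggressive estimator
    w_rjs = max(0.0, w_pjs)                  # clipped to [0, 1]
    return w_pjs, w_rjs
```

When the two estimators are far apart (large $q$), both weights shrink toward zero, so the averaging estimator loads on the conservative $\hat\theta_1$.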
In Theorem 5.2, the key sufficient condition for uniform dominance is $\mathrm{tr}(A) \geq 4\lambda_{\max}(A)$. When this condition is not satisfied, however, it is still possible that our averaging GMM estimator has a smaller risk than the conservative GMM estimator. We therefore consider two models in the simulation studies. In the first model, $\mathrm{tr}(A) \geq 4\lambda_{\max}(A)$, so the sufficient condition in Theorem 5.2 is satisfied. In the second model, $2\lambda_{\max}(A) < \mathrm{tr}(A) < 4\lambda_{\max}(A)$, so the sufficient condition in Theorem 5.2 does not hold. In each model, we consider four sample sizes, $n = 50, 100, 250, 500$, and use 100,000 simulation repetitions.
6.1 Simulation in Model 1
Our first simulation model is
$$Y_i = \sum_{j=1}^{6} \beta_j X_{j,i} + \varepsilon_i, \qquad (6.3)$$
where the $X_{j,i}$ are generated by
$$X_{j,i} = \pi_j(Z_{j,i} + Z_{j+6,i}) + Z_{j+12,i} + u_{j,i} \quad \text{for } j = 1, \ldots, 6. \qquad (6.4)$$

¹²This formula is a GMM analog of the generalized JS type shrinkage estimator in Hansen (2016) for parametric models. The shrinkage scalar $\tau$ is set to $\mathrm{tr}(\hat A) - 2\lambda_{\max}(\hat A)$ in a fashion similar to the original JS estimator.
We draw i.i.d. random vectors $(Z_{1,i}, \ldots, Z_{18,i}, u_{1,i}, \ldots, u_{6,i}, \varepsilon_i)'$ from the normal distribution with mean zero and variance-covariance matrix $\mathrm{diag}(I_{18\times 18}, \Sigma_{7\times 7})$, where
$$\Sigma_{7\times 7} = \begin{pmatrix} I_{6\times 6} & 0.25 \times 1_{6\times 1} \\ 0.25 \times 1_{1\times 6} & 1 \end{pmatrix}. \qquad (6.5)$$
We set $(\pi_1, \ldots, \pi_6) = 2.5 \times 1_{1\times 6}$ and $(\beta_1, \ldots, \beta_6) = 0.5 \times 1_{1\times 6}$. The observed data are $W_i = (Y_i, X_{1,i}, \ldots, X_{6,i}, Z_{1,i}, \ldots, Z_{12,i}, \tilde Z_{13,i}, \ldots, \tilde Z_{18,i})'$, where
$$\tilde Z_{j,i} = Z_{j,i} + n^{-1/2} d_j \varepsilon_i \quad \text{for } j = 13, \ldots, 18. \qquad (6.6)$$
In the main regression equation (6.3), all regressors are endogenous because $E(X_{j,i}\varepsilon_i) = 0.25$ for $j = 1, \ldots, 6$. The instruments $(Z_{1,i}, \ldots, Z_{12,i})'$ are valid and $(\tilde Z_{13,i}, \ldots, \tilde Z_{18,i})'$ are misspecified because $E(\tilde Z_{j,i}\varepsilon_i) = n^{-1/2} d_j$ for $j = 13, \ldots, 18$. In the simulation studies, we consider $d = (d_{13}, \ldots, d_{18}) = c_0 \times 1_{1\times 6}$, where $c_0$ is a scalar that takes values on the grid points between 0 and 20 with grid length 0.1.
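The design in (6.3)-(6.6) is straightforward to simulate. The sketch below is our own illustration (the function name and interface are ours), drawing one sample with a common location parameter $c_0$:

```python
import numpy as np

def draw_model1(n, c0, rng):
    """Draw one sample from the Model 1 design (6.3)-(6.6), with d_j = c0 for all j."""
    Z = rng.standard_normal((n, 18))  # 18 i.i.d. N(0,1) instruments
    # (u_1,...,u_6, eps)' ~ N(0, Sigma_{7x7}) with Cov(u_j, eps) = 0.25, as in (6.5)
    Sigma = np.eye(7)
    Sigma[:6, 6] = Sigma[6, :6] = 0.25
    U = rng.multivariate_normal(np.zeros(7), Sigma, size=n)
    u, eps = U[:, :6], U[:, 6]
    X = 2.5 * (Z[:, :6] + Z[:, 6:12]) + Z[:, 12:18] + u      # regressors, eq. (6.4)
    Y = X @ (0.5 * np.ones(6)) + eps                         # outcome, eq. (6.3)
    Z_tilde = Z[:, 12:18] + c0 * eps[:, None] / np.sqrt(n)   # misspecified IVs, eq. (6.6)
    return Y, X, np.hstack([Z[:, :12], Z_tilde])
```

The correlation 0.25 between each $u_{j,i}$ and $\varepsilon_i$ makes all six regressors endogenous, while the $n^{-1/2}c_0\varepsilon_i$ term injects the drifting violation into the last six instruments.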
Figure 3 presents the MSEs of various estimators of the parameters in (6.3). Our findings in Model 1 are summarized as follows. First, the averaging GMM estimator $\hat\theta_{eo}$ has smaller MSE than $\hat\theta_1$ uniformly over $d$ for all sample sizes considered, as predicted by our theory, because the key sufficient condition is satisfied in this model. Second, the pre-test GMM estimator does not dominate the conservative GMM estimator $\hat\theta_1$. When the location parameter $c_0$ is close to zero, the pre-test GMM estimator has relative MSE as low as 0.4; however, its relative MSE is above 1 when $c_0$ is around 5. Third, among the three averaging estimators, the one based on $\tilde\omega_{eo}$ has the smallest MSE. Fourth, the positive JS averaging estimator has relative MSE above 1 when the sample size is small (e.g., $n = 50$ and $n = 100$), and relative MSE below 1 when the sample size becomes slightly larger. Fifth, it is interesting to see that as the sample size grows, the finite-sample MSEs of the positive and restricted JS averaging estimators converge to that of the averaging estimator based on $\tilde\omega_{eo}$.

6.2 Simulation in Model 2
The second model is
$$Y_i = \sum_{j=1}^{6} \beta_j X_{j,i} + \varepsilon_i, \qquad (6.7)$$
Figure 3. Finite-Sample MSEs of the Averaging Estimators in Model 1

Note: "Pre-test(0.01)" refers to the pre-test GMM estimator based on the J-test with nominal size 0.01; "Plug-opt" refers to the averaging GMM estimator based on the empirical optimal weight $\tilde\omega_{eo}$; "Posi-JS" and "ReSt-JS" refer to the averaging estimators based on the positive part of the JS weight and the restricted JS weight, respectively.
where $X_{1,i}$, $X_{2,i}$ and $X_{3,i}$ are exogenous variables generated by
$$X_{1,i} = 3^{-1/2}(Z_{1,i} + Z_{2,i} + Z_{4,i}), \quad X_{2,i} = 3^{-1/2}(Z_{2,i} + Z_{3,i} + Z_{6,i}), \quad X_{3,i} = 3^{-1/2}(Z_{3,i} + Z_{1,i} + Z_{8,i}), \qquad (6.8)$$
and $X_{j,i}$ ($j = 4, 5, 6$) are generated by
$$X_{j,i} = \pi_j(Z_{j,i} + Z_{j+3,i}) + Z_{j+6,i} + u_{j,i} \quad \text{for } j = 4, 5, 6. \qquad (6.9)$$
We draw i.i.d. random vectors $(Z_{1,i}, \ldots, Z_{12,i}, u_{4,i}, \ldots, u_{6,i}, \varepsilon_i)'$ from the normal distribution with mean zero and variance-covariance matrix $\mathrm{diag}(I_{12\times 12}, \Sigma_{4\times 4})$, where
$$\Sigma_{4\times 4} = \begin{pmatrix} I_{3\times 3} & 0.25 \times 1_{3\times 1} \\ 0.25 \times 1_{1\times 3} & 1 \end{pmatrix}. \qquad (6.10)$$
The observed data are $W_i = (Y_i, X_{1,i}, \ldots, X_{6,i}, Z_{4,i}, \ldots, Z_{9,i}, \tilde Z_{4,i}, \ldots, \tilde Z_{6,i})'$, where
$$\tilde Z_{j,i} = Z_{j+6,i} + n^{-1/2} d_j \varepsilon_i \quad \text{for } j = 4, 5, 6. \qquad (6.11)$$
We set $(\beta_1, \ldots, \beta_6) = 2.5 \times 1_{1\times 6}$ and $(\pi_4, \ldots, \pi_6) = 0.5 \times 1_{1\times 3}$. In this model, $X_{j,i}$ ($j = 4, 5, 6$) are endogenous regressors, $(Z_{4,i}, \ldots, Z_{9,i})'$ are valid IVs, and $(\tilde Z_{4,i}, \ldots, \tilde Z_{6,i})'$ are misspecified IVs. In the simulation, we consider $d = (d_4, \ldots, d_6) = c_0 \times 1_{1\times 3}$, where $c_0$ is a scalar that takes values on the grid points between 0 and 20 with grid length 0.1.
Figure 4. Finite-Sample Risks of the Averaging Estimators in Model 2

Our findings in Figure 4 are summarized as follows. First, even though the sufficient condition in Theorem 5.1(c) is not satisfied, the averaging estimator based on $\tilde\omega_{eo}$ has a smaller MSE than $\hat\theta_1$ uniformly over $d$ for all sample sizes considered. Moreover, its MSE is much smaller than that of the other two averaging estimators. Second, the properties of the pre-test estimator are similar to those in Model 1; that is, it does not dominate the conservative estimator. Third, the averaging estimator using $\omega_{P,JS}$ has very large and unstable MSE when the sample size is small (e.g., $n = 50$ and $100$). When the sample size is 50, its MSE is above 1.5 and hence does not show up in the first panel of Figure 4. When the sample size becomes slightly larger (e.g., $n = 250$ and $500$), the positive JS averaging estimator has larger MSE than $\hat\theta_1$ when the location parameter $c_0$ is close to zero. Fourth, the averaging estimator using $\omega_{R,JS}$ has almost identical MSE to $\hat\theta_1$ when the sample size is small (e.g., $n = 50$ and $100$), and its relative MSE becomes slightly lower than 1 when $c_0$ is close to zero and the sample size is large (e.g., $n = 250$ and $500$).
7 An Empirical Application
One important issue in the empirical analysis of life-cycle labor supply is to estimate the individual human capital production function. Knowledge of the human capital function allows researchers to estimate the household's utility function, and hence to evaluate how changes in policies, such as tax reductions, affect consumption, labor market outcomes, and welfare (see, e.g., Heckman, 1976; Shaw, 1989; and Imai and Keane, 2004). This section applies the averaging GMM estimator to estimate the human capital production function.
We follow the literature (see, e.g., Shaw, 1989) and specify the human capital production function as a quadratic function of $k_{i,t}$, the log of the human capital stock $K_{i,t}$, and $h_{i,t}$, the log of hours of work $H_{i,t}$:
$$f(k_{i,t}, h_{i,t}; \theta) = \gamma_1 h_{i,t} + \gamma_2 h_{i,t}^2 + \gamma_3 h_{i,t} k_{i,t} + \gamma_4 k_{i,t} + \gamma_5 k_{i,t}^2, \qquad (7.1)$$
where $\theta = (\gamma_1, \ldots, \gamma_5)$ are unknown parameters. Denote the regressors by
$$X_{i,t} = \left(h_{i,t},\ h_{i,t}^2,\ h_{i,t} k_{i,t},\ k_{i,t},\ k_{i,t}^2\right)'. \qquad (7.2)$$
The log human capital stock $k_{i,t}$ is accumulated through the equation
$$k_{i,t+1} = f(k_{i,t}, h_{i,t}; \theta) + \epsilon_{i,t}, \qquad (7.3)$$
where $\epsilon_{i,t} = \alpha_i + u_{i,t}$ is the unobservable residual term that contains an individual heterogeneity component $\alpha_i$ and a random shock $u_{i,t}$. To avoid unnecessary complications, we follow Shaw (1989) in specifying the real wage as $w_{i,t} = R_{i,t} K_{i,t}$ and follow Hokayem and Ziliak (2014) in assuming $R_{i,t} = 1$ for all $i$ and all $t$.¹³
We use the same data set as Hokayem and Ziliak (2014), drawn from the Panel Study of Income Dynamics (PSID). The sample includes biennial observations on 1654 men from 1999 to 2009. We further narrow the sample to individuals with at least three consecutive periods of observations, which gives us a data set with 5774 individual-year observations.

To eliminate the individual effect, we take the first difference of equation (7.3):
$$\Delta k_{i,t+1} = \Delta f(k_{i,t}, h_{i,t}; \theta) + \Delta u_{i,t}, \qquad (7.4)$$

¹³Another way to think of this specification is to use the real wage rate $w_t$ as a proxy for the human capital stock $K_t$.
Table 1. Estimates of the Human Capital Production Function

                    $\gamma_1$    $\gamma_2$    $\gamma_3$    $\gamma_4$    $\gamma_5$    J-test
$\hat\theta_1$       0.0236       -0.0070        0.0310        0.0656       -0.0381       0.8427
                    (0.0571)     (0.0444)      (0.0626)      (0.0621)     (0.0447)
$\hat\theta_2$       0.0009        0.0265       -0.0113       -0.2232       -0.0925       0
                    (0.0328)     (0.0240)      (0.0496)      (0.0529)     (0.0247)

Note: (i) Numbers in parentheses are standard errors; (ii) numbers in the last column are the p-values of the J-tests; (iii) the GMM estimators are based on the PSID sample for the years 2003, 2005, 2007 and 2009; (iv) four year dummy variables are included in the moment functions and are used as their own IVs in the GMM estimation.
where "$\Delta$" denotes the first-difference operator. The unknown parameter $\theta$ can be estimated by the GMM estimator $\hat\theta_1$ with the moment functions
$$g_1(\Delta k_{i,t+1}, \Delta X_{i,t}, Z_{1,t}; \theta) = \left[\Delta k_{i,t+1} - \Delta f(k_{i,t}, h_{i,t}; \theta)\right] Z_{1,t}, \qquad (7.5)$$
where $Z_{1,t} = (X_{i,t-1}', Z_{*,t}')'$ is a set of IVs including $X_{i,t-1}$ and
$$Z_{*,t} = \left(c_{i,t-1},\ c_{i,t-1}^2,\ c_{i,t-1} l_{i,t-1},\ l_{i,t-1},\ l_{i,t-1}^2\right)', \qquad (7.6)$$
where $c_{i,t-1} = \log C_{i,t-1}$, $l_{i,t-1} = \log L_{i,t-1}$, and $C_{i,t-1}$ and $L_{i,t-1}$ are, respectively, the consumption and leisure of individual $i$ at period $t-1$. The lagged consumption and leisure variables are included to provide extra identification restrictions for the human capital function.

In equation (7.4), the regressors $\Delta X_{i,t}$ may be endogenous because: (i) $k_{i,t}$ is correlated with $u_{i,t-1}$, and hence with $\Delta u_{i,t}$, in view of equation (7.3); and (ii) $h_{i,t}$ is partly determined by $k_{i,t}$ through the individual's labor supply decision. As a result, the LS estimator based on the moment function
$$g^*(\Delta k_{i,t+1}, \Delta X_{i,t}; \theta) = \left[\Delta k_{i,t+1} - \Delta f(k_{i,t}, h_{i,t}; \theta)\right] \Delta X_{i,t} \qquad (7.7)$$
may be inconsistent. The aggressive GMM estimator $\hat\theta_2$ is constructed using the moment conditions in both (7.5) and (7.7).
Table 1 reports the estimation results for the conservative and aggressive estimators. The conservative and aggressive GMM estimates of $\theta$ differ substantially. The J-test strongly rejects the validity of the moment conditions in (7.7), while it supports the validity of the moment conditions in (7.5). On the other hand, the aggressive GMM estimator $\hat\theta_2$ has much smaller standard errors than the conservative estimator $\hat\theta_1$.

Next, we consider the averaging GMM estimator under the quadratic loss function with $\Lambda = I_{d_\theta}$. The empirical weight $\tilde\omega_{eo}$ on the aggressive GMM estimator is 0.0770. It is interesting that the averaging estimator assigns nontrivial weight to $\hat\theta_2$, even though the J-test indicates misspecification of the moment conditions in (7.7).

Figure 5. Simulated Asymptotic Risk of the Averaging Estimator of the Human Capital Function
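For reference, the empirical optimal weight has a closed form (the sample analog of the optimal weight; see (A.43) in the Appendix). A minimal sketch, with our own function name:

```python
import numpy as np

def empirical_optimal_weight(theta1, theta2, Sigma1_hat, Sigma2_hat, Lam, n):
    """Sample analog of the optimal weight on the aggressive estimator:
    tr(Lam(S1 - S2)) / ( n (t2 - t1)' Lam (t2 - t1) + tr(Lam(S1 - S2)) )."""
    diff = theta2 - theta1
    gain = np.trace(Lam @ (Sigma1_hat - Sigma2_hat))  # estimated efficiency gain
    bias = n * float(diff @ Lam @ diff)               # scaled squared-difference term
    return gain / (bias + gain)
```

With $\Lambda = I_{d_\theta}$, a large discrepancy $\hat\theta_2 - \hat\theta_1$ drives the weight toward zero, while a large estimated efficiency gain pushes it up; the reported 0.0770 reflects the trade-off between these two forces.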
To evaluate the performance of the averaging GMM estimator, we simulate its asymptotic risk following the formula in (5.4). This exercise is the same as that for Figure 2, which shows that the simulated asymptotic risk is a good approximation to the finite-sample risk. As there are 5 moment conditions in (7.7), the risk of the averaging GMM estimator is a function of a 5-dimensional vector of location parameters $d = (d_1, d_2, d_3, d_4, d_5)' \in R^5$. We parameterize it as
$$d_1 = \sqrt{r}\cos\phi_1, \quad d_2 = \sqrt{r}\sin\phi_1\sin\phi_2\sin\phi_3, \quad d_3 = \sqrt{r}\sin\phi_1\sin\phi_2\cos\phi_3,$$
$$d_4 = \sqrt{r}\sin\phi_1\cos\phi_2\sin\phi_4, \quad d_5 = \sqrt{r}\sin\phi_1\cos\phi_2\cos\phi_4 \qquad (7.8)$$
for some $r \in [0, +\infty)$ and $\phi_1, \phi_2, \phi_3, \phi_4 \in [0, 2\pi]$, such that $\sum_{k=1}^{5} d_k^2 = r$. To simulate the risk, we consider 1001 equally spaced grid points for $r$ between 0 and 100, and for each grid point of $r$ we consider 30 equally spaced grid points (starting at 0) for each of $\phi_1$, $\phi_2$, $\phi_3$ and $\phi_4$ between 0 and $2\pi$. For each grid point of $r$, this gives $30^4$ values of the simulated asymptotic risk, and we record the minimum and maximum values. As in the Monte Carlo simulation studies, the asymptotic risk of the conservative GMM estimator is normalized to be 1 for each DGP.
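The parameterization in (7.8) maps $(r, \phi_1, \ldots, \phi_4)$ to a vector $d$ with squared length exactly $r$, which is easy to verify numerically (our own sketch):

```python
import numpy as np

def location_vector(r, phi1, phi2, phi3, phi4):
    """Spherical parameterization (7.8) of d in R^5, with sum(d_k^2) = r."""
    s = np.sqrt(r)
    return np.array([
        s * np.cos(phi1),
        s * np.sin(phi1) * np.sin(phi2) * np.sin(phi3),
        s * np.sin(phi1) * np.sin(phi2) * np.cos(phi3),
        s * np.sin(phi1) * np.cos(phi2) * np.sin(phi4),
        s * np.sin(phi1) * np.cos(phi2) * np.cos(phi4),
    ])

d = location_vector(9.0, 0.3, 1.1, 2.0, 0.7)
print(np.sum(d ** 2))  # equals r = 9.0 up to rounding
```

Looping `r` over the 1001-point grid and each angle over its 30-point grid reproduces the $30^4$ evaluations per value of $r$ described above.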
8 Conclusion
This paper studies an averaging GMM estimator that combines the conservative estimator and the aggressive estimator with a data-dependent weight. The averaging weight is the sample analog of an optimal non-random weight. We provide a sufficient class of drifting DGPs under which the pointwise asymptotic results combine to yield uniform approximations to the finite-sample risk difference between the two estimators. Using this asymptotic approximation, we show that the proposed averaging GMM estimator uniformly dominates the conservative GMM estimator.

Inference based on the averaging estimator is an interesting and challenging problem. In addition to uniform validity, a desirable confidence set should have smaller volume than the one obtained from the conservative moments alone. We leave the inference problem to future research.
References
[1] Andrews, D. W. K. (1999): "Consistent Moment Selection Procedures for Generalized Method of Moments Estimation," Econometrica, 67(3), 543-563.

[2] Andrews, D. W. K., X. Cheng and P. Guggenberger (2011): "Generic Results for Establishing the Asymptotic Size of Confidence Sets and Tests," Cowles Foundation Discussion Paper, No. 1813.

[3] Andrews, D. W. K. and P. Guggenberger (2006): "On the Asymptotic Maximal Risk of Estimators and the Hausman Pre-Test Estimator," unpublished manuscript.

[4] Andrews, D. W. K. and B. Lu (2001): "Consistent Model and Moment Selection Procedures for GMM Estimation with Application to Dynamic Panel Data Models," Journal of Econometrics, 101(1), 123-164.

[5] Ashley, R. (2009): "Assessing the Credibility of Instrumental Variables Inference with Imperfect Instruments via Sensitivity Analysis," Journal of Applied Econometrics, 24, 325-337.

[6] Berkowitz, D., M. Caner and Y. Fang (2012): "The Validity of Instruments Revisited," Journal of Econometrics, 166(2), 255-266.

[7] Buckland, S. T., K. P. Burnham and N. H. Augustin (1997): "Model Selection: An Integral Part of Inference," Biometrics, 53, 603-618.

[8] Cheng, X. and B. Hansen (2015): "Forecasting with Factor-Augmented Regression: A Frequentist Model Averaging Approach," Journal of Econometrics, 186(2), 280-293.

[9] Cheng, X. and Z. Liao (2015): "Select the Valid and Relevant Moments: An Information-Based LASSO for GMM with Many Moments," Journal of Econometrics, 186(2), 443-464.

[10] Claeskens, G. and R. J. Carroll (2007): "An Asymptotic Theory for Model Selection Inference in General Semiparametric Problems," Biometrika, 94, 249-265.

[11] Conley, T. G., C. B. Hansen and P. E. Rossi (2012): "Plausibly Exogenous," Review of Economics and Statistics, 94(1), 260-272.

[12] Doko Tchatoka, F. and J.-M. Dufour (2008): "Instrument Endogeneity and Identification-Robust Tests: Some Analytical Results," Journal of Statistical Planning and Inference, 138(9), 2649-2661.

[13] Doko Tchatoka, F. and J.-M. Dufour (2014): "Identification-Robust Inference for Endogeneity Parameters in Linear Structural Models," Econometrics Journal, 17, 165-187.

[14] DiTraglia, F. (2014): "Using Invalid Instruments on Purpose: Focused Moment Selection and Averaging for GMM," University of Pennsylvania, working paper.

[15] Eichenbaum, M., L. Hansen and K. Singleton (1988): "A Time Series Analysis of Representative Agent Models of Consumption and Leisure Choice under Uncertainty," Quarterly Journal of Economics, 103(1), 51-78.

[16] Guggenberger, P. (2012): "On the Asymptotic Size Distortion of Tests when Instruments Locally Violate the Exogeneity Assumption," Econometric Theory, 28(2), 387-421.

[17] Hansen, B. E. (2007): "Least Squares Model Averaging," Econometrica, 75, 1175-1189.

[18] Hansen, B. E. (2015): "A Stein-Like 2SLS Estimator," Econometric Reviews, forthcoming.

[19] Hansen, B. E. (2016): "Efficient Shrinkage in Parametric Models," Journal of Econometrics, 190(1), 115-132.

[20] Hansen, B. E. and J. Racine (2012): "Jackknife Model Averaging," Journal of Econometrics, 167, 38-46.

[21] Hansen, L. P. (1982): "Large Sample Properties of Generalized Method of Moments Estimators," Econometrica, 50, 1029-1054.

[22] Heckman, J. J. (1976): "Estimation of a Human Capital Production Function Embedded in a Life-Cycle Model of Labor Supply," in Household Production and Consumption, NBER, 225-264.

[23] Hong, H., B. Preston and M. Shum (2003): "Generalized Empirical Likelihood-Based Model Selection Criteria for Moment Condition Models," Econometric Theory, 19(6), 923-943.

[24] Hjort, N. L. and G. Claeskens (2003): "Frequentist Model Average Estimators," Journal of the American Statistical Association, 98, 879-899.

[25] Hjort, N. L. and G. Claeskens (2006): "Focused Information Criteria and Model Averaging for the Cox Hazard Regression Model," Journal of the American Statistical Association, 101, 1449-1464.

[26] Hokayem, C. and J. P. Ziliak (2014): "Health, Human Capital, and Life Cycle Labor Supply," American Economic Review: Papers & Proceedings, 104(5), 127-131.

[27] Imai, S. and M. P. Keane (2004): "Intertemporal Labor Supply and Human Capital Accumulation," International Economic Review, 45, 601-641.

[28] James, W. and C. Stein (1961): "Estimation with Quadratic Loss," Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, 361-380.

[29] Leeb, H. and B. M. Pötscher (2008): "Sparse Estimators and the Oracle Property, or the Return of the Hodges Estimator," Journal of Econometrics, 142(1), 201-211.

[30] Leung, G. and A. Barron (2006): "Information Theory and Mixing Least-Squares Regressions," IEEE Transactions on Information Theory, 52, 3396-3410.

[31] Kolesar, M., R. Chetty, J. Friedman, E. Glaeser and G. Imbens (2014): "Identification and Inference with Many Invalid Instruments," Journal of Business & Economic Statistics, forthcoming.

[32] Liao, Z. (2013): "Adaptive GMM Shrinkage Estimation with Consistent Moment Selection," Econometric Theory, 29, 1-48.

[33] Liu, C.-A. (2015): "Distribution Theory of the Least Squares Averaging Estimator," Journal of Econometrics, 186(1), 142-159.

[34] Lu, X. and L. Su (2015): "Jackknife Model Averaging for Quantile Regressions," Journal of Econometrics, forthcoming.

[35] Moon, H. R. and F. Schorfheide (2009): "Estimation with Overidentifying Inequality Moment Conditions," Journal of Econometrics, 153(2), 136-154.

[36] Nevo, A. and A. Rosen (2012): "Identification with Imperfect Instruments," Review of Economics and Statistics, 93(3), 659-671.

[37] Rudin, W. (1976): Principles of Mathematical Analysis, International Series in Pure and Applied Mathematics, McGraw-Hill, New York.

[38] Sargan, J. (1958): "The Estimation of Economic Relationships Using Instrumental Variables," Econometrica, 26(3), 393-415.

[39] Shaw, K. (1989): "Life-Cycle Labor Supply with Human Capital Accumulation," International Economic Review, 30, 431-456.

[40] Small, D. S. (2007): "Sensitivity Analysis for Instrumental Variables Regression with Overidentifying Restrictions," Journal of the American Statistical Association, 102(479), 1049-1058.

[41] Small, D. S., T. T. Cai, A. Zhang and H. Kang (2015): "Instrumental Variables Estimation with Some Invalid Instruments and Its Application to Mendelian Randomization," Journal of the American Statistical Association, forthcoming.

[42] Stein, C. M. (1956): "Inadmissibility of the Usual Estimator for the Mean of a Multivariate Normal Distribution," Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, 197-206.

[43] Yang, Y. (2005): "Can the Strengths of AIC and BIC Be Shared? A Conflict Between Model Identification and Regression Estimation," Biometrika, 92, 937-950.
A Appendix
Proof of Lemma 3.1. By Assumption 3.4.(i), $E_F[g_1(W_i, \theta_0)] = E_{F^*}[Z_{1,i} U_i] = 0$ for the true value $\theta_0$ in $Y_i = X_i'\theta_0 + U_i$. Also, $\theta_0 \in \mathrm{int}(\Theta)$ is assumed in Condition (i) of the Lemma. This verifies Assumption 3.1.(i). For any $\theta \in \Theta$ with $\|\theta - \theta_0\| \geq \varepsilon$,
$$\|E_F[g_1(W_i, \theta)]\| = \|G_1(\theta_0 - \theta)\| \geq (\lambda_{\min}(G_1'G_1))^{1/2}\|\theta_0 - \theta\| \geq C^{1/2}\varepsilon > 0 \qquad (A.1)$$
by Assumption 3.4.(ii), which verifies Assumption 3.1.(ii). To show Assumption 3.1.(iii), note that
$$Q_F(\theta) = E_F[Z_{2,i}(Y_i - X_i'\theta)]'\,\Omega_2^{-1}\, E_F[Z_{2,i}(Y_i - X_i'\theta)]$$
$$= \theta' G_2'\Omega_2^{-1} G_2 \theta - 2\theta' G_2' \Omega_2^{-1} E_{F^*}[Z_{2,i} Y_i] + E_{F^*}[Y_i Z_{2,i}']\,\Omega_2^{-1}\, E_{F^*}[Y_i Z_{2,i}], \qquad (A.2)$$
which uniquely identifies $\theta^*$ by Assumption 3.4.(ii). Moreover, $\theta^* \in \mathrm{int}(\Theta)$ by Condition (i) of the Lemma. Assumption 3.1.(iv) is imposed in Assumption 3.4.(iv).
To verify Assumption 3.2, note that $g_2(W_i, \theta) = Z_{2,i}[X_i'(\theta_0 - \theta) + U_i]$, $g_{2,\theta}(W_i, \theta) = -Z_{2,i} X_i'$ and $g_{2,\theta\theta}(W_i, \theta) = 0$. Assumption 3.2.(i) follows from Assumption 3.4.(iii) and the boundedness of $\Theta$ and $[0, C]^{r^*}$. Assumptions 3.2.(ii)-(iii) follow from Assumptions 3.4.(ii)-(iii).

Note that $M_{2,F}(\theta) - M_{2,F}(\theta_0) = E_F[g_2(W_i, \theta)] - E_F[g_2(W_i, \theta_0)] = G_2(\theta - \theta_0)$, which implies that
$$v_0(\theta) = \left(\mathrm{vec}[G_2]',\ \mathrm{vech}[\Omega_2]',\ (\theta - \theta_0)'G_2'\right)'. \qquad (A.3)$$
In this example, we have $\Phi = [0, C]^{r^*} \times V$, where $V = \{v_0(\theta) : F^* \in \mathcal{F}^*\ \text{and}\ \theta \in \Theta\}$. Assumption 3.3.(i) holds because $\Phi$ is a product space. Assumption 3.3.(ii) holds by Assumption 3.4.(v) and Condition (i) of the Lemma.
A.1 Proofs for the asymptotic distributions of the estimators in Section 4

Let $\nu_n(g_2(W, \theta)) = n^{-1/2}\sum_{i=1}^n \left(g_2(W_i, \theta) - E_{F_n}[g_2(W_i, \theta)]\right)$.

Lemma A.1 Suppose that Assumption 3.2.(i) holds and $\Theta$ is compact. Then we have:
(i) $\sup_{\theta\in\Theta}\left\|\bar g_2(\theta) - E_{F_n}[g_2(W_i,\theta)]\right\| = o_p(1)$;
(ii) $\sup_{\theta\in\Theta}\left\|n^{-1}\sum_{i=1}^n g_2(W_i,\theta)g_2(W_i,\theta)' - E_{F_n}[g_2(W_i,\theta)g_2(W_i,\theta)']\right\| = o_p(1)$;
(iii) $\sup_{\theta\in\Theta}\left\|n^{-1}\sum_{i=1}^n g_{2,\theta}(W_i,\theta) - E_{F_n}[g_{2,\theta}(W_i,\theta)]\right\| = o_p(1)$;
(iv) $\nu_n(g_2(W,\theta))$ is stochastically equicontinuous over $\theta\in\Theta$;
(v) $\Omega_{2,F_n}^{-1/2}(\theta_n)\,\nu_n(g_2(W,\theta_n)) \to_d N(0_{r_2\times 1}, I_{r_2})$.
Proof of Lemma A.1. See Lemmas 11.3-11.5 of Andrews and Cheng (2013).
Proof of Lemma 4.1. Define $v_0(\theta) \equiv (\mathrm{vec}[G_{2,F}(\theta)]', \mathrm{vech}[\Omega_{2,F}(\theta)]', M_{2,F}(\theta)')'$. By the Arzelà-Ascoli Theorem, $\{v_0(\theta) : \theta \in \Theta\ \text{and}\ F \in \mathcal{F}\}$ is compact because it is equicontinuous and bounded by Assumption 3.2 and the compactness of $\Theta$, and it is closed by Assumption 3.3. Hence, for the convergent sequences $G_{2,F_n}(\theta) \to G_2(\theta)$, $\Omega_{2,F_n}(\theta) \to \Omega_2(\theta)$ and $M_{2,F_n}(\theta) \to M_2(\theta)$, there exists $F \in \mathcal{F}$ such that
$$G_2(\theta) = G_{2,F}(\theta),\quad \Omega_2(\theta) = \Omega_{2,F}(\theta)\quad \text{and}\quad M_2(\theta) = M_{2,F}(\theta), \qquad (A.4)$$
which proves the Lemma.
Proof of Lemma 4.2. First, we provide formulae for $\tilde\theta_1$ and $\bar\Omega_2$, which are used in the construction of $\hat\theta_1$ and $\hat\theta_2$, and show the consistency of $\tilde\theta_1$ and $\bar\Omega_2$ under $\{F_n\}$. In the proof below, we use $G_{2,F_n}(\theta) \to G_{2,F}(\theta)$, $\Omega_{2,F_n}(\theta) \to \Omega_{2,F}(\theta)$, $M_{2,F_n}(\theta) \to M_{2,F}(\theta)$ following Lemma 4.1, and the fact that $G_{2,F}(\theta)$, $\Omega_{2,F}(\theta)$ and $M_{2,F}(\theta)$ are continuous functions of $\theta$ for any $F \in \mathcal{F}$. The continuity follows from Assumption 3.2.(i) and the dominated convergence theorem (DCT).

The preliminary estimator $\tilde\theta_1$ is defined as
$$\tilde\theta_1 \equiv \arg\min_{\theta\in\Theta}\ \bar g_1(\theta)'\bar g_1(\theta), \qquad (A.5)$$
which is based on the first $r_1$ valid moments and an identity weighting matrix. For ease of notation, we write
$$G_{k,F_n} \equiv G_{k,F_n}(\theta_n)\quad \text{and}\quad \Omega_{k,F_n} \equiv \Omega_{k,F_n}(\theta_n)\quad \text{for } k = 1, 2. \qquad (A.6)$$
By Lemma A.1.(i),
$$\bar g_2(\theta) = M_{2,F_n}(\theta) + \left[n^{-1}\sum_{i=1}^n g_2(W_i,\theta) - M_{2,F_n}(\theta)\right] = M_{2,F_n}(\theta) + o_p(1) \qquad (A.7)$$
uniformly over $\theta\in\Theta$. As $g_1(W,\theta)$ is a subvector of $g_2(W,\theta)$, by (A.7) and Assumption 3.2.(i),
$$\bar g_1(\theta)'\bar g_1(\theta) = M_{1,F_n}(\theta)'M_{1,F_n}(\theta) + o_p(1) \qquad (A.8)$$
uniformly over $\theta\in\Theta$.

By Assumption 3.1.(i) and $F_n\in\mathcal{F}$, $M_{1,F_n}(\theta)'M_{1,F_n}(\theta)$ is uniquely minimized at $\theta_n$, which together with the uniform convergence in (A.8) implies that $\tilde\theta_1 - \theta_n \to_p 0$. By $M_{1,F_n}(\theta)\to M_{1,F}(\theta)$ and Assumption 3.2.(i),
$$M_{1,F_n}(\theta)'M_{1,F_n}(\theta) = M_{1,F}(\theta)'M_{1,F}(\theta) + o(1). \qquad (A.9)$$
Because $\theta_n$ and $\theta_0$ are uniquely identified by $M_{1,F_n}(\theta) = 0$ and $M_{1,F}(\theta) = 0$, respectively, we also have $\theta_n = \theta_0 + o(1)$.
To show the consistency of $\bar\Omega_2$, note that
$$\bar\Omega_2 = n^{-1}\sum_{i=1}^n g_2(W_i,\tilde\theta_1)g_2(W_i,\tilde\theta_1)' - \bar g_2(\tilde\theta_1)\bar g_2(\tilde\theta_1)'$$
$$= E_{F_n}[g_2(W,\tilde\theta_1)g_2(W,\tilde\theta_1)'] - M_{2,F_n}(\tilde\theta_1)M_{2,F_n}(\tilde\theta_1)' + o_p(1)$$
$$= \Omega_{2,F_n}(\tilde\theta_1) + o_p(1) = \Omega_{2,F_n} + o_p(1), \qquad (A.10)$$
where the first equality is the definition, the second equality holds by (A.7), Lemma A.1.(ii) and Assumption 3.2.(i), the third equality follows from the definition of $\Omega_{2,F}(\theta)$, and the last equality holds because
$$\Omega_{2,F_n}(\tilde\theta_1) - \Omega_{2,F_n}(\theta_n) = \left[\Omega_{2,F_n}(\tilde\theta_1) - \Omega_{2,F}(\tilde\theta_1)\right] + \left[\Omega_{2,F}(\tilde\theta_1) - \Omega_{2,F}(\theta_n)\right] + \left[\Omega_{2,F}(\theta_n) - \Omega_{2,F_n}(\theta_n)\right] = o_p(1) \qquad (A.11)$$
by $\Omega_{2,F_n}(\theta) \to \Omega_{2,F}(\theta)$, the continuity of $\Omega_{2,F}(\theta)$, and the consistency of $\tilde\theta_1$. This shows the consistency of $\bar\Omega_2$.
We next show that under $\{F_n\}$,
$$\hat\theta_1 - \theta_n \to_p 0 \quad\text{and}\quad \hat\theta_2 - \theta^*_n \to_p 0, \qquad (A.12)$$
where $\theta^*_n$ denotes the pseudo-true value that uniquely minimizes the criterion function $Q_{F_n}(\theta) = M_{2,F_n}(\theta)'\Omega_{2,F_n}^{-1}M_{2,F_n}(\theta)$. By (A.7), (A.10) and Assumptions 3.2.(i)-(ii), we have
$$\bar g_2(\theta)'(\bar\Omega_2)^{-1}\bar g_2(\theta) = M_{2,F_n}(\theta)'\Omega_{2,F_n}^{-1}M_{2,F_n}(\theta) + o_p(1) \qquad (A.13)$$
uniformly over $\theta\in\Theta$. In addition, $M_{2,F_n}(\theta)'\Omega_{2,F_n}^{-1}M_{2,F_n}(\theta)$ uniquely identifies $\theta^*_n$. The consistency result $\hat\theta_2 - \theta^*_n \to_p 0$ follows from standard arguments for the consistency of an extremum estimator. As $\bar g_1(\theta)$ is a subvector of $\bar g_2(\theta)$, and $\bar\Omega_1$ is a submatrix of $\bar\Omega_2$, using (A.7), (A.10) and Assumptions 3.2.(i)-(ii), we have
$$\bar g_1(\theta)'(\bar\Omega_1)^{-1}\bar g_1(\theta) = M_{1,F_n}(\theta)'\Omega_{1,F_n}^{-1}M_{1,F_n}(\theta) + o_p(1) \qquad (A.14)$$
uniformly over $\theta\in\Theta$. As $M_{1,F_n}(\theta)'\Omega_{1,F_n}^{-1}M_{1,F_n}(\theta)$ is uniquely minimized at $\theta_n$, $\hat\theta_1 - \theta_n \to_p 0$ follows from the unique identification of $\theta_n$ and the uniform convergence in (A.14).

Next, we derive the asymptotic distributions under $\{F_n\}$. We start with the case $\delta_n = o(1)$ and show that $\hat\theta_2 - \theta_n = o_p(1)$ in this case. First, recall that the pseudo-true value $\theta^*_n$ for $n$ and the pseudo-true value $\theta^*$ in the limit are the unique minimizers of $Q_{F_n}(\theta)$ and $Q_F(\theta)$, respectively, both continuous functions of $\theta$. We have $\theta^*_n \to \theta^*$ by $Q_{F_n}(\theta) \to Q_F(\theta)$, which in turn follows from $M_{2,F_n}(\theta) \to M_{2,F}(\theta)$,
$$\Omega_{2,F_n}(\theta_n) \to \Omega_{2,F}(\theta_0), \qquad (A.15)$$
$\theta_n \to \theta_0$, and the continuity of $\Omega_{2,F}(\theta)$. The convergence in (A.15) follows from $\Omega_{2,F_n}(\theta) \to \Omega_{2,F}(\theta)$, the continuity of $\Omega_{2,F}(\theta)$, and $\theta_n \to \theta_0$. Second, we show that if $\delta_n = M_{2,F_n}(\theta_n) \to 0$, then $\theta^* = \theta_0$ because (i) $Q_F(\theta) = M_{2,F}(\theta)'\Omega_2^{-1}M_{2,F}(\theta)$ has a unique minimizer, namely $\theta^*$, and (ii) $M_{2,F}(\theta_0) = 0$. To see (ii), note that $M_{2,F_n}(\theta_n) \to 0_{r_2\times 1}$ and
$$M_{2,F_n}(\theta_n) \to M_{2,F}(\theta_0) = 0_{r_2\times 1}, \qquad (A.16)$$
where the convergence follows from the same arguments as for (A.15). Putting together (i) $\hat\theta_2 - \theta^*_n \to_p 0$, (ii) $\theta^*_n \to \theta^*$, (iii) $\theta^* = \theta_0$, and (iv) $\theta_n \to \theta_0$, we have $\hat\theta_2 - \theta_n \to_p 0$.
Using $\hat\theta_2 - \theta_n = o_p(1)$, Lemma A.1.(iv) and Assumption 3.2.(i),
$$\bar g_2(\hat\theta_2) = \bar g_2(\theta_n) + \left[M_{2,F_n}(\hat\theta_2) - M_{2,F_n}(\theta_n)\right] + o_p(n^{-1/2})$$
$$= \bar g_2(\theta_n) + \left[G_{2,F_n}(\theta_n) + o_p(1)\right](\hat\theta_2 - \theta_n) + o_p(n^{-1/2}). \qquad (A.17)$$
We have
$$n^{-1}\sum_{i=1}^n g_{2,\theta}(W_i,\hat\theta_2) = G_{2,F_n}(\hat\theta_2) + o_p(1) = G_{2,F_n}(\theta_n) + o_p(1), \qquad (A.18)$$
where the first equality follows from Lemma A.1.(iii) and the second follows from the same argument used to show (A.11). From the first-order condition for the GMM estimator $\hat\theta_2$, we deduce that
$$0 = \left[n^{-1}\sum_{i=1}^n g_{2,\theta}(W_i,\hat\theta_2)\right]'(\bar\Omega_2)^{-1}\bar g_2(\hat\theta_2)$$
$$= \left(G_{2,F_n}'\Omega_{2,F_n}^{-1} + o_p(1)\right)\left[\bar g_2(\theta_n) + (G_{2,F_n} + o_p(1))(\hat\theta_2 - \theta_n) + o_p(n^{-1/2})\right], \qquad (A.19)$$
where the second equality follows from (A.10), (A.17) and (A.18). By (A.19) and Assumption 3.2,
$$n^{1/2}(\hat\theta_2 - \theta_n) = (\Gamma_{2,n} + o_p(1))\left\{\nu_n(g_2(W,\theta_n)) + n^{1/2}E_{F_n}[g_2(W,\theta_n)]\right\} + o_p(1), \qquad (A.20)$$
where $\Gamma_{2,n} = -\left(G_{2,F_n}'\Omega_{2,F_n}^{-1}G_{2,F_n}\right)^{-1}G_{2,F_n}'\Omega_{2,F_n}^{-1}$. Similarly, we can show that
$$n^{1/2}(\hat\theta_1 - \theta_n) = \Gamma_{1,n}\,\nu_n(g_1(W,\theta_n)) + o_p(1), \qquad (A.21)$$
where $\Gamma_{1,n} = -\left(G_{1,F_n}'\Omega_{1,F_n}^{-1}G_{1,F_n}\right)^{-1}G_{1,F_n}'\Omega_{1,F_n}^{-1}$. Combining (A.20) and (A.21) yields
$$\begin{pmatrix} n^{1/2}(\hat\theta_1 - \theta_n) \\ n^{1/2}(\hat\theta_2 - \theta_n) \end{pmatrix} = \begin{pmatrix} \Gamma_{1,n}\,\nu_n(g_1(W,\theta_n)) \\ (\Gamma_{2,n} + o_p(1))\left[\nu_n(g_2(W,\theta_n)) + n^{1/2}\delta_{0,n}\right] \end{pmatrix} + o_p(1), \qquad (A.22)$$
where $\delta_{0,n} = (0_{r_1\times 1}', \delta_n')'$.

Suppose that $n^{1/2}\delta_n \to d$ for some $d\in R^{r^*}$. We have $\Gamma_{1,n} = \Gamma_1 + o(1)$ and $\Gamma_{2,n} = \Gamma_2 + o(1)$ by the arguments used to show (A.15), which together with $n^{1/2}\delta_n = d + o(1)$, (A.15), (A.22), Lemma A.1.(v) and the continuous mapping theorem (CMT) implies that
$$\begin{pmatrix} n^{1/2}(\hat\theta_1 - \theta_n) \\ n^{1/2}(\hat\theta_2 - \theta_n) \end{pmatrix} \to_d \begin{pmatrix} \bar\Gamma_1 \\ \Gamma_2 \end{pmatrix}(Z_2 + d_0), \quad\text{where } Z_2 \sim N(0_{r_2\times 1}, \Omega_2) \qquad (A.23)$$
and $\bar\Gamma_1 \equiv [\Gamma_1, 0_{d_\theta\times r^*}]$. Part (i) of the Lemma follows from (A.23) and the definitions of $\bar\Gamma_1$ and $Z_1$.
Now we consider the case where $\delta_n = o(1)$ and $\|n^{1/2}\delta_n\| \to \infty$. Note that (A.15) and (A.21) do not depend on $\delta_n$. By $\Gamma_{1,n} = \Gamma_1 + o(1)$, (A.15), (A.21), Lemma A.1.(v) and the CMT,
$$n^{1/2}(\hat\theta_1 - \theta_n) \to_d \Gamma_1 Z_1. \qquad (A.24)$$
Assumption 3.2, (A.20) and Lemma A.1.(v) imply that
$$n^{1/2}(\hat\theta_2 - \theta_n) = (\Gamma_{2,n} + o_p(1))\,n^{1/2}\delta_{0,n} + O_p(1). \qquad (A.25)$$
By Assumption 3.1.(iv) and $\|n^{1/2}\delta_n\| \to \infty$,
$$n\,\delta_{0,n}'\Gamma_{2,n}'\Gamma_{2,n}\delta_{0,n} \geq C^{-2}\, n\,\delta_n'\delta_n \to \infty, \qquad (A.26)$$
which together with (A.25) implies that $\|n^{1/2}(\hat\theta_2 - \theta_n)\| \to_p \infty$.

Finally, we consider the case where $\|\delta_n\| > C^{-1}$. By Assumption 3.1.(iv), $\|G_{2,F_n}'\Omega_{2,F_n}^{-1}\delta_{0,n}\| > C^{-1}\|\delta_n\| > C^{-2}$, which implies that
$$\left\|G_{2,F}'\Omega_2^{-1}M_{2,F}\right\| > 2^{-1}C^{-2} \qquad (A.27)$$
by $G_{2,F_n} \to G_{2,F}$ and $\Omega_{2,F_n} \to \Omega_{2,F}$ as in (A.15). Because $\theta^*$ is the unique minimizer of $M_{2,F}(\theta)'\Omega_2^{-1}M_{2,F}(\theta)$, we have $G_{2,F}(\theta^*)'\Omega_2^{-1}M_{2,F}(\theta^*) = 0_{d_\theta}$, which combined with (A.27) and the continuity of $G_{2,F}(\theta)$ and $M_{2,F}(\theta)$ implies that
$$\|\theta^* - \theta_0\| > \varepsilon \qquad (A.28)$$
for some $\varepsilon > 0$. Using $\hat\theta_2 = \theta^*_n + o_p(1)$, $\theta^*_n = \theta^* + o(1)$, $\theta_n = \theta_0 + o(1)$, (A.28) and the triangle inequality, we obtain
$$n^{1/2}\left\|\hat\theta_2 - \theta_n\right\| \geq n^{1/2}\left[\|\theta^* - \theta_0\| - \|\hat\theta_2 - \theta^*_n\| - \|\theta^*_n - \theta^*\| - \|\theta_n - \theta_0\|\right] = n^{1/2}\|\theta^* - \theta_0\|(1 + o_p(1)) \to_p \infty, \qquad (A.29)$$
which finishes the proof.
Proof of the claim in equation (4.3). Consider the case $n^{1/2}\delta_n \to d \in R^{r^*}$. By Lemma 4.2,
$$n^{1/2}\left[\hat\theta(\omega) - \theta_n\right] = n^{1/2}(\hat\theta_1 - \theta_n) + \omega\left[n^{1/2}(\hat\theta_2 - \theta_n) - n^{1/2}(\hat\theta_1 - \theta_n)\right] \to_d \bar\Gamma_1 Z_{d,2} + \omega(\Gamma_2 - \bar\Gamma_1)Z_{d,2}. \qquad (A.30)$$
This implies that
$$\ell_\Lambda(\hat\theta(\omega)) = n\left[\hat\theta(\omega) - \theta_n\right]'\Lambda\left[\hat\theta(\omega) - \theta_n\right] \to_d \xi(\omega), \qquad (A.31)$$
where $\xi(\omega) = Z_{d,2}'\bar\Gamma_1'\Lambda\bar\Gamma_1 Z_{d,2} + 2\omega Z_{d,2}'(\Gamma_2 - \bar\Gamma_1)'\Lambda\bar\Gamma_1 Z_{d,2} + \omega^2 Z_{d,2}'(\Gamma_2 - \bar\Gamma_1)'\Lambda(\Gamma_2 - \bar\Gamma_1)Z_{d,2}$.
Now we consider $E[\xi(\omega)]$ using the equalities in Lemma A.2 below. First,
$$E[Z_{d,2}'\bar\Gamma_1'\Lambda\bar\Gamma_1 Z_{d,2}] = \mathrm{tr}(\Lambda\Sigma_1) \qquad (A.32)$$
because $\bar\Gamma_1 Z_{d,2} = \Gamma_1 Z_1$ and $\Gamma_1 E(Z_1 Z_1')\Gamma_1' = \Sigma_1$ by definition. Second,
$$E\left[Z_{d,2}'(\Gamma_2 - \bar\Gamma_1)'\Lambda\bar\Gamma_1 Z_{d,2}\right] = \mathrm{tr}\left(\Lambda\bar\Gamma_1 E\left[Z_{d,2}Z_{d,2}'\right](\Gamma_2 - \bar\Gamma_1)'\right)$$
$$= \mathrm{tr}\left(\Lambda\bar\Gamma_1\left(d_0 d_0' + \Omega_2\right)(\Gamma_2 - \bar\Gamma_1)'\right) = \mathrm{tr}(\Lambda(\Sigma_2 - \Sigma_1)), \qquad (A.33)$$
where the last equality holds by Lemma A.2. Third,
$$E\left[Z_{d,2}'(\Gamma_2 - \bar\Gamma_1)'\Lambda(\Gamma_2 - \bar\Gamma_1)Z_{d,2}\right] = \mathrm{tr}\left(\Lambda(\Gamma_2 - \bar\Gamma_1)\left(d_0 d_0' + \Omega_2\right)(\Gamma_2 - \bar\Gamma_1)'\right)$$
$$= d_0'\Gamma_2'\Lambda\Gamma_2 d_0 + \mathrm{tr}(\Lambda(\Sigma_1 - \Sigma_2)) \qquad (A.34)$$
by Lemma A.2. Combining the results in (A.32)-(A.34), we obtain
$$E[\xi(\omega)] = \mathrm{tr}(\Lambda\Sigma_1) - 2\omega\,\mathrm{tr}\left(\Lambda(\Sigma_1 - \Sigma_2)\right) + \omega^2\left[d_0'\Gamma_2'\Lambda\Gamma_2 d_0 + \mathrm{tr}\left(\Lambda(\Sigma_1 - \Sigma_2)\right)\right]. \qquad (A.35)$$
Note that $d_0'\Gamma_2'\Lambda\Gamma_2 d_0 = d_0'(\Gamma_2 - \bar\Gamma_1)'\Lambda(\Gamma_2 - \bar\Gamma_1)d_0$ because $\bar\Gamma_1 d_0 = 0_{d_\theta}$. The optimal weight $\omega^*$ minimizes the quadratic function of $\omega$ in (A.35).
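For completeness, minimizing the quadratic in (A.35) gives the optimal weight in closed form (a short derivation we add for the reader; write $a = \mathrm{tr}(\Lambda(\Sigma_1-\Sigma_2))$ and $b = d_0'\Gamma_2'\Lambda\Gamma_2 d_0$, and assume $a + b > 0$):
$$\frac{d}{d\omega}\left[\mathrm{tr}(\Lambda\Sigma_1) - 2\omega a + \omega^2(a + b)\right] = -2a + 2\omega(a+b) = 0
\;\Longrightarrow\;
\omega^* = \frac{a}{a+b} = \frac{\mathrm{tr}\left(\Lambda(\Sigma_1-\Sigma_2)\right)}{d_0'\Gamma_2'\Lambda\Gamma_2 d_0 + \mathrm{tr}\left(\Lambda(\Sigma_1-\Sigma_2)\right)}.$$
Replacing $a$ and $b$ by their sample analogs yields the empirical weight $\tilde\omega_{eo}$.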
Lemma A.2 (a) $\bar\Gamma_1 d_0 = 0_{d_\theta}$; (b) $\bar\Gamma_1\Omega_2\bar\Gamma_1' = \Sigma_1$; (c) $\bar\Gamma_1\Omega_2\Gamma_2' = \Sigma_2$; (d) $\Gamma_2\Omega_2\Gamma_2' = \Sigma_2$.
Proof of Lemma A.2. By construction, $\bar\Gamma_1 d_0 = 0_{d_\theta}$. For ease of notation, we write $\Omega_2$ and $G_2$ as
$$\Omega_2 = \begin{pmatrix} \Omega_1 & \Omega_{1r} \\ \Omega_{r1} & \Omega_r \end{pmatrix} \quad\text{and}\quad G_2 = \begin{pmatrix} G_1 \\ G_r \end{pmatrix}. \qquad (A.36)$$
To prove part (b), we have
$$\bar\Gamma_1\Omega_2\bar\Gamma_1' = [\Gamma_1, 0_{d_\theta\times r^*}]\begin{pmatrix} \Omega_1 & \Omega_{1r} \\ \Omega_{r1} & \Omega_r \end{pmatrix}[\Gamma_1, 0_{d_\theta\times r^*}]' = \Gamma_1\Omega_1\Gamma_1' = \left(G_1'\Omega_1^{-1}G_1\right)^{-1} = \Sigma_1. \qquad (A.37)$$
To show part (c), note that
$$\bar\Gamma_1\Omega_2\Gamma_2' = -[\Gamma_1, 0_{d_\theta\times r^*}]\,\Omega_2\Omega_2^{-1}G_2\left(G_2'\Omega_2^{-1}G_2\right)^{-1} = -\Gamma_1 G_1\left(G_2'\Omega_2^{-1}G_2\right)^{-1} = \left(G_2'\Omega_2^{-1}G_2\right)^{-1} = \Sigma_2 \qquad (A.38)$$
because $-\Gamma_1 G_1 = I_{d_\theta\times d_\theta}$. Part (d) follows from the definition of $\Sigma_2$.
Proof of Lemma 4.3. We first prove the consistency of $\hat\Omega_k$, $\hat G_k$ and $\hat\Sigma_k$ for $k = 1, 2$. By Lemma 4.2, $\hat\theta_1$ satisfies $\hat\theta_1 = \theta_n + o_p(1)$ and $\hat\theta_1 = \theta_0 + o_p(1)$. Using the same arguments as in showing (A.10) and (A.15), we can show that
$$\hat\Omega_2 = \Omega_{2,F_n} + o_p(1) = \Omega_2 + o_p(1). \qquad (A.39)$$
As $\hat\Omega_1$ is a submatrix of $\hat\Omega_2$, by (A.39) we have
$$\hat\Omega_1 = \Omega_{1,F_n} + o_p(1) = \Omega_1 + o_p(1). \qquad (A.40)$$
By the consistency of $\hat\theta_1$ and the same arguments used to show (A.18), we have
$$n^{-1}\sum_{i=1}^n g_{2,\theta}(W_i,\hat\theta_1) = G_{2,F_n}(\theta_n) + o_p(1) = G_{2,F} + o_p(1). \qquad (A.41)$$
As $n^{-1}\sum_{i=1}^n g_{1,\theta}(W_i,\hat\theta_1)$ is a submatrix of $n^{-1}\sum_{i=1}^n g_{2,\theta}(W_i,\hat\theta_1)$, by (A.41) we have
$$n^{-1}\sum_{i=1}^n g_{1,\theta}(W_i,\hat\theta_1) = G_{1,F_n}(\theta_n) + o_p(1) = G_{1,F} + o_p(1). \qquad (A.42)$$
From (A.39), (A.40), (A.41) and (A.42), we see that $\hat\Omega_k$ and $\hat G_k$ are consistent estimators of $\Omega_k$ and $G_k$, respectively, for $k = 1, 2$. By the Slutsky theorem, $\hat\Sigma_k$ is a consistent estimator of $\Sigma_k$ for $k = 1, 2$.

In the case $n^{1/2}\delta_n \to d \in R^{r^*}$, the desired result follows from Lemma 4.2, the consistency of $\hat\Sigma_1$ and $\hat\Sigma_2$, and the CMT. In the case $\|n^{1/2}\delta_n\| \to \infty$, $\tilde\omega_{eo} \to_p 0$ because $n^{1/2}\|\hat\theta_2 - \hat\theta_1\| \to_p \infty$; hence
$$n^{1/2}(\hat\theta_{eo} - \theta_n) = n^{1/2}(\hat\theta_1 - \theta_n) + \tilde\omega_{eo}\,n^{1/2}(\hat\theta_2 - \hat\theta_1) = n^{1/2}(\hat\theta_1 - \theta_n) + \frac{n^{1/2}(\hat\theta_2 - \hat\theta_1)\,\mathrm{tr}\left[\Lambda(\hat\Sigma_1 - \hat\Sigma_2)\right]}{n(\hat\theta_2 - \hat\theta_1)'\Lambda(\hat\theta_2 - \hat\theta_1) + \mathrm{tr}\left[\Lambda(\hat\Sigma_1 - \hat\Sigma_2)\right]} \to_d \Gamma_1 Z_1 \qquad (A.43)$$
by Lemma 4.2.
A.2 Proofs for the results in Section 5

We first present some generic results on the bounds of the asymptotic risk difference between two estimators under high-level conditions. We then apply these generic results to the two specific estimators considered in this paper: $\hat{\theta}_{eo}$ and $\hat{\theta}_1$.
Let
\[
v_n(\theta) = \big(\mathrm{vec}(G_{2,F_n}(\theta))', \mathrm{vech}(\Omega_{2,F_n}(\theta))', [M_{2,F_n}(\theta) - M_{2,F_n}(\theta_n)]'\big). \tag{A.44}
\]
Define $\delta_{2,0} = [0_{r_1\times 1}', \delta_0']'$. Note that (i) $\theta_n \to \theta_0$ and $v_n(\theta) \to (\mathrm{vec}(G_2(\theta))', \mathrm{vech}(\Omega_2(\theta))', [M_2(\theta) - \delta_{2,0}]')$ and (ii) $G_{2,F_n}(\theta) \to G_2(\theta)$, $\Omega_{2,F_n}(\theta) \to \Omega_2(\theta)$, $M_{2,F_n}(\theta) \to M_2(\theta)$ are equivalent statements, and we interchange them below for convenience.
Under such convergent sequences, define $h = (d', \mathrm{vec}(G_2)', \mathrm{vech}(\Omega_2)')$, where $G_2$ and $\Omega_2$ are evaluated at the true value in the limit, as defined in (4.1). Define
\[
H^* = \{h : n^{1/2}\delta_n \to d \in R^{r_\delta},\ \delta_n \to 0,\ v_n(\theta) \to (\mathrm{vec}(G_2(\theta))', \mathrm{vech}(\Omega_2(\theta))', M_2(\theta)') \text{ for some } \{F_n\}\}. \tag{A.45}
\]
Similarly, define
\[
H^*_\infty = \{h : \|n^{1/2}\delta_n\| \to \infty,\ \delta_n \to \delta_0,\ v_n(\theta) \to (\mathrm{vec}(G_2(\theta))', \mathrm{vech}(\Omega_2(\theta))', [M_2(\theta) - \delta_{2,0}]') \text{ for some } \{F_n\}\}, \tag{A.46}
\]
where $\delta_{2,0} = [0_{r_1\times 1}', \delta_0']'$ and we define $d = \infty$ if $\|n^{1/2}\delta_n\| \to \infty$.
Lemma A.3 Suppose the following conditions hold.
(i) If $n^{1/2}\delta_n \to d$ for $d \in R^{r_\delta} \cup \{\infty\}$, $\delta_n \to \delta_0$, and $v_n(\theta) \to v_0(\theta) = (\mathrm{vec}(G_2(\theta))', \mathrm{vech}(\Omega_2(\theta))', [M_2(\theta) - \delta_{2,0}]')$, then
\[
\lim_{n\to\infty} E_{F_n}[\ell_\zeta(\hat{\theta})] = R_\zeta(h) \in (0, \infty) \quad\text{and}\quad \lim_{n\to\infty} E_{F_n}[\ell_\zeta(\tilde{\theta})] = \tilde{R}_\zeta(h) \in (0, \infty);
\]
(ii) $\Lambda = \{(\delta_0, v_0(\cdot)) : \theta \in \Theta \text{ and } F \in \mathcal{F}\}$ is a compact set;
(iii) for some $\varepsilon > 0$, $(\delta, v(\cdot)) \in \Lambda$ with $\|\delta\| < \varepsilon$ implies that $(\tilde{\delta}, v(\cdot)) \in \Lambda$ for all $\tilde{\delta} \in R^{r_\delta}$ with $\|\tilde{\delta}\| < \varepsilon$.
Then the lower and upper bounds of the asymptotic risk difference between $\hat{\theta}$ and $\tilde{\theta}$ have the following representations:
\begin{align*}
\underline{\mathrm{AsyRD}}(\hat{\theta}, \tilde{\theta}) &= \lim_{\zeta\to\infty}\bigg(\min\bigg\{\inf_{h\in H^*}\big[R_\zeta(h) - \tilde{R}_\zeta(h)\big],\ \inf_{h\in H^*_\infty}\big[R_\zeta(h) - \tilde{R}_\zeta(h)\big]\bigg\}\bigg), \\
\overline{\mathrm{AsyRD}}(\hat{\theta}, \tilde{\theta}) &= \lim_{\zeta\to\infty}\bigg(\max\bigg\{\sup_{h\in H^*}\big[R_\zeta(h) - \tilde{R}_\zeta(h)\big],\ \sup_{h\in H^*_\infty}\big[R_\zeta(h) - \tilde{R}_\zeta(h)\big]\bigg\}\bigg).
\end{align*}
Proof of Lemma A.3. The proof uses the subsequence techniques employed to establish the asymptotic size of a test in Andrews, Cheng, and Guggenberger (2011); we adapt the proof and notation to the current setup and extend the results from tests to estimators.
We first show that
\[
\overline{\mathrm{AsyR}}_\zeta(\hat{\theta}) \equiv \limsup_{n\to\infty} \sup_{F\in\mathcal{F}} E_F[\ell_\zeta(\hat{\theta})] = \max\bigg\{\sup_{h\in H^*} R_\zeta(h),\ \sup_{h\in H^*_\infty} R_\zeta(h)\bigg\}. \tag{A.47}
\]
For this purpose, we first show that
\[
\overline{\mathrm{AsyR}}_\zeta(\hat{\theta}) \le \max\bigg\{\sup_{h\in H^*} R_\zeta(h),\ \sup_{h\in H^*_\infty} R_\zeta(h)\bigg\}. \tag{A.48}
\]
Let $\{F_n\}$ be a sequence such that
\[
\limsup_{n\to\infty} E_{F_n}[\ell_\zeta(\hat{\theta})] = \limsup_{n\to\infty}\Big(\sup_{F\in\mathcal{F}} E_F[\ell_\zeta(\hat{\theta})]\Big). \tag{A.49}
\]
Such a sequence always exists by the definition of the supremum. The sequence $\{E_{F_n}[\ell_\zeta(\hat{\theta})] : n \ge 1\}$ may not converge. Now let $\{w_n : n \ge 1\}$ be a subsequence of $\{n\}$ such that $\{E_{F_{w_n}}[\ell_\zeta(\hat{\theta})] : n \ge 1\}$ converges and its limit equals $\overline{\mathrm{AsyR}}_\zeta(\hat{\theta})$. Such a subsequence always exists by the definition of the limsup. Below we show that there exists a subsequence $\{p_n\}$ of $\{w_n\}$ such that
\[
E_{F_{p_n}}[\ell_\zeta(\hat{\theta})] \to R_\zeta(h) \text{ for some } h \in H^* \tag{A.50}
\]
or
\[
E_{F_{p_n}}[\ell_\zeta(\hat{\theta})] \to R_\zeta(h) \text{ for some } h \in H^*_\infty. \tag{A.51}
\]
Provided that (A.50) or (A.51) holds, we obtain the desired result in (A.48). To show that there exists a subsequence $\{p_n\}$ of $\{w_n\}$ such that either (A.50) or (A.51) holds, it suffices to show claims (1) and (2): (1) for any sequence $\{F_n\}$ and any subsequence $\{w_n\}$ of $\{n\}$, there exists a subsequence $\{p_n\}$ of $\{w_n\}$ for which
\[
p_n^{1/2}\delta_{p_n} \to d \in R^{r_\delta},\quad \delta_{p_n} \to 0,\quad v_{p_n}(\theta) \to v_0(\theta) \text{ such that } h \in H^*, \tag{A.52}
\]
or
\[
\|p_n^{1/2}\delta_{p_n}\| \to \infty,\quad \delta_{p_n} \to \delta_0,\quad \text{and } v_{p_n}(\theta) \to v_0(\theta) \text{ such that } h \in H^*_\infty; \tag{A.53}
\]
and (2) for any subsequence $\{p_n\}$ of $\{n\}$ and any sequence $\{F_{p_n} : n \ge 1\}$, (A.52) together with Condition (i) implies (A.50), and (A.53) combined with Condition (i) implies (A.51). Below we write $v(\theta)$ as $v$ for notational simplicity.
To show (1), let $\delta_{w_n,j}$ denote the $j$-th component of $\delta_{w_n}$ and set $p_{1,n} = w_n$ for all $n \ge 1$. For $j = 1$, either (i) $\limsup_{n\to\infty} |p_{j,n}^{1/2}\delta_{p_{j,n},j}| < \infty$ or (ii) $\limsup_{n\to\infty} |p_{j,n}^{1/2}\delta_{p_{j,n},j}| = \infty$. If (i) holds, then for some subsequence $\{p_{j+1,n}\}$ of $\{p_{j,n}\}$, $p_{j+1,n}^{1/2}\delta_{p_{j+1,n},j} \to d_j$ for some $d_j \in R$. If (ii) holds, then for some subsequence $\{p_{j+1,n}\}$ of $\{p_{j,n}\}$, $p_{j+1,n}^{1/2}\delta_{p_{j+1,n},j} \to \infty$ or $-\infty$. As $r_\delta$ is a fixed positive integer, we can apply the same argument successively for $j = 1, \ldots, r_\delta$ to obtain a subsequence $\{p^*_n\}$ of $\{w_n\}$ such that $(p^*_n)^{1/2}\delta_{p^*_n} \to d^* \in R^{r_\delta}$ or $\|(p^*_n)^{1/2}\delta_{p^*_n}\| \to \infty$. Finally, there exists a subsequence $\{p_n\}$ of $\{p^*_n\}$ such that $\delta_{p_n} \to \delta^*$ and $v_{p_n} \to v^*$ because $\Lambda$ is a compact set.
We have thus constructed a subsequence $\{p_n\}$ of $\{n\}$ such that either (i) $p_n^{1/2}\delta_{p_n} \to d^* \in R^{r_\delta}$, $\delta_{p_n} \to \delta^* = 0$, and $v_{p_n} \to v^*$, or (ii) $\|p_n^{1/2}\delta_{p_n}\| \to \infty$, $\delta_{p_n} \to \delta^*$, and $v_{p_n} \to v^*$. To conclude that (A.52) holds in case (i), it remains to show $h^* = (d^{*\prime}, \mathrm{vec}(G_2^*)', \mathrm{vech}(\Omega_2^*)') \in H^*$, where $G_2^*$ and $\Omega_2^*$ are defined with the limits $\delta^*$ and $v^*$. Similarly, to show that (A.53) holds in case (ii), it remains to show $h^* = (\infty, \mathrm{vec}(G_2^*)', \mathrm{vech}(\Omega_2^*)') \in H^*_\infty$. This step is necessary because $d^*$, $\delta^*$, and $v^*$ are limits along a subsequence, whereas $H^*$ and $H^*_\infty$ are defined using limits of the full sequence. To close this gap, we show that for the subsequence $\{p_n\}$ constructed above there exists a full sequence with the same limit. For case (i), such a full sequence of DGPs $\{F^*_k \in \mathcal{F} : k \ge 1\}$ can be constructed as follows.
First, consider the case where $d^* \in R^{r_\delta}$. (i) For every $k = p_n$, define $F^*_k = F_{p_n}$; and (ii) for every $k \in (p_n, p_{n+1})$, define $F^*_k$ to be a true distribution such that
\[
\delta_{F^*_k} = (p_n/k)^{1/2}\delta_{p_n} \quad\text{and}\quad v_{F^*_k} = v_{p_n}, \tag{A.54}
\]
where $\delta_{F^*_k}$ and $v_{F^*_k}$ denote the values of $\delta_0$ and $v_0$ under $F^*_k$. There exists $F^*_k \in \mathcal{F}$ for which (A.54) holds for large $n$ by Condition (iii). To see this, first note that $(\delta_{p_n}, v_{p_n}) \in \Lambda$ because $F_{p_n} \in \mathcal{F}$. Moreover, we have $p_n/k < 1$ and $\|\delta_{p_n}\| < \varepsilon$ for large $n$ because $\delta_{p_n} \to 0_{r_\delta}$. Condition (iii) then ensures the existence of $F^*_k$ for any $k \in (p_n, p_{n+1})$. Along this constructed sequence $\{F^*_k \in \mathcal{F} : k \ge 1\}$, we have $k^{1/2}\delta_{F^*_k} \to d^*$, $\delta_{F^*_k} \to \delta^* = 0$, and $v_{F^*_k} \to v^*$, as desired. This shows that $h^* \in H^*$ in case (i). For case (ii), define $F^*_k = F_{p_n}$ for $k \in [p_n, p_{n+1})$. Then $k^{1/2}\|\delta_{F^*_k}\| \ge (p_n)^{1/2}\|\delta_{p_n}\|$ for all $k \in [p_n, p_{n+1})$. In consequence, $(p_n)^{1/2}\|\delta_{p_n}\| \to \infty$ as $n \to \infty$ implies that $k^{1/2}\|\delta_{F^*_k}\| \to \infty$ as $k \to \infty$. In addition, $\delta_{F^*_k} \to \delta^*$ and $v_{F^*_k} \to v^*$ as $k \to \infty$. Hence, in case (ii), $h^* \in H^*_\infty$. Combining the results for cases (i) and (ii) completes the proof of claim (1).
To show (2), note that we have proved that for any subsequence $\{p_n\}$ of $\{n\}$ and any sequence $\{F_{p_n} : n \ge 1\}$ such that (A.52) holds, there exists a full sequence $\{F^*_k \in \mathcal{F} : k \ge 1\}$ such that $k^{1/2}\delta_{F^*_k} \to d^* \in R^{r_\delta}$, $\delta_{F^*_k} \to \delta^*$, $v_{F^*_k} \to v^*$, and $F^*_{p_n} = F_{p_n}$ for all $n \ge 1$. Similarly, if (A.53) holds, there exists a full sequence $\{F^*_k \in \mathcal{F} : k \ge 1\}$ such that $\|k^{1/2}\delta_{F^*_k}\| \to \infty$, $\delta_{F^*_k} \to \delta^*$, $v_{F^*_k} \to v^*$, and $F^*_{p_n} = F_{p_n}$ for all $n \ge 1$. This together with Condition (i) of the Lemma implies (2). This proves that either (A.50) or (A.51) holds, which in turn implies (A.48).
Next, we show that
\[
\overline{\mathrm{AsyR}}_\zeta(\hat{\theta}) \ge \max\bigg\{\sup_{h\in H^*} R_\zeta(h),\ \sup_{h\in H^*_\infty} R_\zeta(h)\bigg\}. \tag{A.55}
\]
For any $h = (d', \mathrm{vec}(G_2)', \mathrm{vech}(\Omega_2)') \in H^*$, there exists a sequence $\{F_n \in \mathcal{F} : n \ge 1\}$ such that $n^{1/2}\delta_n \to d$, $\delta_n \to 0$, and $v_n \to v = (\mathrm{vec}(G_2)', \mathrm{vech}(\Omega_2)', M_2')$. Moreover,
\[
\overline{\mathrm{AsyR}}_\zeta(\hat{\theta}) = \limsup_{n\to\infty} \sup_{F\in\mathcal{F}} E_F[\ell_\zeta(\hat{\theta})] \ge \limsup_{n\to\infty} E_{F_n}[\ell_\zeta(\hat{\theta})] = R_\zeta(h), \tag{A.56}
\]
where the last equality holds by Condition (i). Similarly, for any $h \in H^*_\infty$, there exists a sequence $\{F_n \in \mathcal{F} : n \ge 1\}$ such that $n^{1/2}\|\delta_n\| \to \infty$, $\delta_n \to \delta_0$, and $v_{F_n} \to v = (\mathrm{vec}(G_2)', \mathrm{vech}(\Omega_2)', M_2')$, which together with Condition (i) implies that
\[
\overline{\mathrm{AsyR}}_\zeta(\hat{\theta}) = \limsup_{n\to\infty} \sup_{F\in\mathcal{F}} E_F[\ell_\zeta(\hat{\theta})] \ge \limsup_{n\to\infty} E_{F_n}[\ell_\zeta(\hat{\theta})] = R_\zeta(h). \tag{A.57}
\]
(A.56) combined with (A.57) immediately yields (A.55). Finally, (A.47) is implied by (A.48) and (A.55). The desired result follows from (A.47) with $E_F[\ell_\zeta(\hat{\theta})]$ replaced by $E_F[\ell_\zeta(\hat{\theta}) - \ell_\zeta(\tilde{\theta})]$.

Proof of Theorem 5.1. We obtain the two equalities by applying Lemma A.3 with $\hat{\theta} = \hat{\theta}_{eo}$, $\tilde{\theta} = \hat{\theta}_1$, and $H = H^*$. We now verify the conditions in Lemma A.3 under Assumptions 3.1-3.3. Condition (i) holds with $R_\zeta(h) = \min\{E[\bar{\xi}'\Upsilon\bar{\xi}], \zeta\}$ and $\tilde{R}_\zeta(h) = \min\{E[\xi_1'\Upsilon\xi_1], \zeta\}$ for $d \in R^{r_\delta}$, and with $R_\zeta(h) = \min\{E[\xi_1'\Upsilon\xi_1], \zeta\}$ and $\tilde{R}_\zeta(h) = \min\{E[\xi_1'\Upsilon\xi_1], \zeta\}$ for $d = \infty$, following Lemma 4.3 and the CMT. The set $\Lambda$ is compact because (i) $V_F = \{v_0(\cdot) : F \in \mathcal{F}\}$ is compact, as shown in the proof of Lemma 4.1, (ii) $M_{2,F}(\theta)$ is continuous, and (iii) $\delta_n \to \delta_0$. This verifies Condition (ii). Condition (iii) is assumed in Assumption 3.3.(i).
In addition, we show $H = H^*$, where $H$ and $H^*$ are defined in (5.2) and (A.45), respectively. Let $\{F_n\}$ be a sequence as in (A.45). By Conditions (ii)-(iii) of Lemma A.3, Lemma 4.1 implies
\[
G_{2,F_n}(\theta) \to G_{2,F}(\theta),\quad \Omega_{2,F_n}(\theta) \to \Omega_{2,F}(\theta),\quad M_{2,F_n}(\theta) \to M_{2,F}(\theta) \tag{A.58}
\]
for some $F \in \mathcal{F}$. Therefore, $G_2 = G_{2,F}(\theta_0)$ and $\Omega_2 = \Omega_{2,F}(\theta_0)$ for some $F \in \mathcal{F}$ by definition, which implies that $H^* \subseteq H$.

To show $H \subseteq H^*$, we need to show that for all $d \in R^{r_\delta}$ and all $F \in \mathcal{F}$ with $\delta_0 = 0$, there exists a sequence $\{F_n \in \mathcal{F} : n \ge 1\}$ such that
\[
n^{1/2}\delta_n \to d,\quad G_{2,F_n}(\theta) \to G_{2,F}(\theta),\quad \Omega_{2,F_n}(\theta) \to \Omega_{2,F}(\theta),\quad \text{and}\quad M_{2,F_n}(\theta) \to M_{2,F}(\theta). \tag{A.59}
\]
Take $\delta_n = d/n^{1/2}$ for $n$ large enough that $\|\delta_n\| < \varepsilon$. Condition (iii) in Lemma A.3 then implies that $(\delta_n, v_0(\cdot)) \in \Lambda$ for all large $n$, where $v_0(\cdot)$ is defined in (3.2). By the definition of $\Lambda$, there exists a sequence $\{F_n\}$ such that $\delta_n = d/n^{1/2}$ and $v_n(\theta) = v_0(\theta)$, which shows (A.59) and hence $H \subseteq H^*$.
Next, we show that the inequalities hold for $\zeta$ large enough. As $\min\{\bar{\xi}'\Upsilon\bar{\xi}, \zeta\} \le \bar{\xi}'\Upsilon\bar{\xi}$,
\[
E\big[\min\{\bar{\xi}'\Upsilon\bar{\xi}, \zeta\}\big] \le E\big[\bar{\xi}'\Upsilon\bar{\xi}\big]. \tag{A.60}
\]
The expectation $E[\bar{\xi}'\Upsilon\bar{\xi}]$ is finite because
\begin{align*}
E\big[\bar{\xi}'\Upsilon\bar{\xi}\big]
&\le 2E\big[Z_{d,2}'\bar{\Gamma}_1'\Upsilon\bar{\Gamma}_1 Z_{d,2} + \bar{\omega}^2 Z_{d,2}'(\Gamma_2 - \bar{\Gamma}_1)'\Upsilon(\Gamma_2 - \bar{\Gamma}_1)Z_{d,2}\big] \\
&= 2E\bigg[Z_1'\Gamma_1'\Upsilon\Gamma_1 Z_1 + \frac{\mathrm{tr}(A)^2\, Z_{d,2}'BZ_{d,2}}{\big(Z_{d,2}'BZ_{d,2} + \mathrm{tr}(A)\big)^2}\bigg]
\le 2E\big[Z_1'\Gamma_1'\Upsilon\Gamma_1 Z_1\big] + 2\,\mathrm{tr}(A) \le C, \tag{A.61}
\end{align*}
where the first inequality is by the Cauchy-Schwarz inequality, the second inequality is by
\[
\frac{\mathrm{tr}(A)}{Z_{d,2}'BZ_{d,2} + \mathrm{tr}(A)} \le 1 \quad\text{and}\quad \frac{Z_{d,2}'BZ_{d,2}}{Z_{d,2}'BZ_{d,2} + \mathrm{tr}(A)} \le 1 \quad\text{with probability 1}, \tag{A.62}
\]
and the last inequality is by the regularity conditions in Assumption 3.2. Similarly, we also have
\[
E\big[\xi_1'\Upsilon\xi_1\big] \le C\,E\big[\|\Gamma_1 Z_1\|^2\big] \le C^2. \tag{A.63}
\]
We can write
\[
g(h) \equiv E\big[\bar{\xi}'\Upsilon\bar{\xi}\big] - E\big[\xi_1'\Upsilon\xi_1\big] = 2\,\mathrm{tr}(A)J_1 + \mathrm{tr}(A)^2 J_2,
\]
where
\[
J_1 = E\bigg[\frac{Z_{d,2}'DZ_{d,2}}{Z_{d,2}'BZ_{d,2} + \mathrm{tr}(A)}\bigg] \quad\text{and}\quad J_2 = E\bigg[\frac{Z_{d,2}'BZ_{d,2}}{\big(Z_{d,2}'BZ_{d,2} + \mathrm{tr}(A)\big)^2}\bigg]. \tag{A.64}
\]
From the definitions of $g_\zeta(h)$ and $g(h)$, we use (A.60) to deduce that
\begin{align*}
g_\zeta(h) &= E\big[\min\{\bar{\xi}'\Upsilon\bar{\xi}, \zeta\}\big] - E\big[\min\{\xi_1'\Upsilon\xi_1, \zeta\}\big]
\le E\big[\bar{\xi}'\Upsilon\bar{\xi}\big] - E\big[\min\{\xi_1'\Upsilon\xi_1, \zeta\}\big] \\
&= E\big[\xi_1'\Upsilon\xi_1 - \min\{\xi_1'\Upsilon\xi_1, \zeta\}\big] + g(h) \quad\text{for any } h \in H. \tag{A.65}
\end{align*}
Next we show that
\[
\lim_{\zeta\to\infty} \sup_{h\in H} \big\{E\big[\xi_1'\Upsilon\xi_1 - \min\{\xi_1'\Upsilon\xi_1, \zeta\}\big]\big\} = 0. \tag{A.66}
\]
Recall that $\xi_1$ is a function of $G_1$ and $\Omega_1$. Define
\[
q(Z; G_1, \Omega_1) \equiv Z'\Omega_1^{1/2}\Gamma_1'\Upsilon\Gamma_1\Omega_1^{1/2}Z, \quad\text{where } Z \sim N(0_{r_1}, I_{r_1\times r_1}). \tag{A.67}
\]
Then we can write
\[
f_\zeta(G_1, \Omega_1) \equiv E\big[\xi_1'\Upsilon\xi_1 - \min\{\xi_1'\Upsilon\xi_1, \zeta\}\big] = E\big[q(Z; G_1, \Omega_1) - \min\{q(Z; G_1, \Omega_1), \zeta\}\big]. \tag{A.68}
\]
Let
\[
\Lambda_1 = \{(G_1, \Omega_1) : G_{1,F_n} \to G_1 \text{ and } \Omega_{1,F_n} \to \Omega_1 \text{ for some } \{F_n\}\}. \tag{A.69}
\]
We now have
\[
\lim_{\zeta\to\infty} \sup_{h\in H} f_\zeta(G_1, \Omega_1) \le \lim_{\zeta\to\infty} \sup_{(G_1, \Omega_1)\in\Lambda_1} f_\zeta(G_1, \Omega_1) \tag{A.70}
\]
because $h \in H$ requires the convergence listed in (A.69) as well as the convergence of some other functions.
We next show that $\lim_{\zeta\to\infty} \sup_{(G_1, \Omega_1)\in\Lambda_1} f_\zeta(G_1, \Omega_1) = 0$. First, $\lim_{\zeta\to\infty} f_\zeta(G_1, \Omega_1) = 0$ for any $(G_1, \Omega_1) \in \Lambda_1$ by the DCT, because
\[
0 \le q(Z; G_1, \Omega_1) - \min\{q(Z; G_1, \Omega_1), \zeta\} \le q(Z; G_1, \Omega_1) \tag{A.71}
\]
and $E[q(Z; G_1, \Omega_1)] = \mathrm{tr}(\Upsilon\Sigma_1) \le C$. Second, this convergence is uniform over $(G_1, \Omega_1) \in \Lambda_1$ by Dini's theorem (see Rudin (1976)) because (i) $f_\zeta(G_1, \Omega_1)$ is monotonically decreasing in $\zeta$; (ii) $\Lambda_1$ is compact; and (iii) $f_\zeta(G_1, \Omega_1)$ is continuous in $(G_1, \Omega_1)$. The continuity of $f_\zeta(G_1, \Omega_1)$ in $(G_1, \Omega_1)$ follows from the DCT because (a) $q(Z; G_1, \Omega_1)$ is continuous in $(G_1, \Omega_1)$ and (b) $E[\sup_{(G_1, \Omega_1)\in\Lambda_1} q(Z; G_1, \Omega_1)] < \infty$. To see (b), note that
\[
\sup_{(G_1, \Omega_1)\in\Lambda_1} q(Z; G_1, \Omega_1) \le \Big[\sup_{(G_1, \Omega_1)\in\Lambda_1} \lambda_{\max}\big(\Omega_1^{1/2}\Gamma_1'\Upsilon\Gamma_1\Omega_1^{1/2}\big)\Big] Z'Z \le C\,Z'Z \tag{A.72}
\]
by Assumption 3.2. This completes the verification of (A.66). It follows from (A.66) that
\[
\lim_{\zeta\to\infty} \sup_{h\in H} g_\zeta(h) \le \sup_{h\in H} g(h) \quad\text{and}\quad \lim_{\zeta\to\infty} \inf_{h\in H} g_\zeta(h) \le \inf_{h\in H} g(h), \tag{A.73}
\]
which proves the inequalities in Theorem 5.1.
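The truncation step behind (A.60)-(A.66) — that the truncated risk approaches the untruncated risk as $\zeta \to \infty$ — can be illustrated by simulation. The quadratic-form weight matrix below is an arbitrary positive-definite choice, not one from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# q(Z) is a positive quadratic form with E[q(Z)] = tr(W) < infinity; the gap
# E[q - min(q, zeta)] = E[(q - zeta)^+] is pathwise decreasing in zeta and
# vanishes as zeta grows, mirroring (A.66) and the domination bound (A.71).
W = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.2],
              [0.0, 0.2, 1.5]])     # arbitrary positive-definite weight
Z = rng.standard_normal((500_000, 3))
q = np.einsum('ij,jk,ik->i', Z, W, Z)

gaps = [np.mean(q - np.minimum(q, zeta)) for zeta in (1.0, 10.0, 100.0)]
print(gaps)  # decreasing toward zero as zeta grows
```

The monotonicity in $\zeta$ holds draw by draw, which is exactly the property that Dini's theorem exploits above.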
Proof of Theorem 5.2. We provide an upper bound for $J_1$ defined in (A.64). Let
\[
\eta(x) = \frac{x}{x'Bx + \mathrm{tr}(A)}, \quad\text{where } x = Z_{d,2} \text{ and } B = (\Gamma_2 - \bar{\Gamma}_1)'\Upsilon(\Gamma_2 - \bar{\Gamma}_1). \tag{A.74}
\]
Its derivative is
\[
\frac{\partial \eta(x)'}{\partial x} = \frac{1}{x'Bx + \mathrm{tr}(A)}\, I_{r_2} - \frac{2}{\big(x'Bx + \mathrm{tr}(A)\big)^2}\, Bxx'. \tag{A.75}
\]
Define
\[
D = (\Gamma_2 - \bar{\Gamma}_1)'\Upsilon\bar{\Gamma}_1, \tag{A.76}
\]
which satisfies $DZ_{d,2} = DZ_2$ by construction because $Dd_0 = (\Gamma_2 - \bar{\Gamma}_1)'\Upsilon\bar{\Gamma}_1 d_0 = 0$ by Lemma A.2(a). Applying Lemma A.2 yields
\[
\mathrm{tr}(D\Omega_2) = \mathrm{tr}\big((\Gamma_2 - \bar{\Gamma}_1)'\Upsilon\bar{\Gamma}_1\Omega_2\big) = \mathrm{tr}\big(\Upsilon(\bar{\Gamma}_1\Omega_2\Gamma_2' - \bar{\Gamma}_1\Omega_2\bar{\Gamma}_1')\big) = \mathrm{tr}\big(\Upsilon(\Sigma_2 - \Sigma_1)\big) = -\mathrm{tr}(A). \tag{A.77}
\]
By Lemma 1 of Hansen (2016), which is a matrix version of Stein's Lemma (Stein, 1981),
\[
J_1 = E\big[\eta(Z_{d,2})'DZ_{d,2}\big] = E\big[\eta(Z_{d,2})'DZ_2\big] = E\bigg[\mathrm{tr}\bigg(\frac{\partial \eta(Z_{d,2})'}{\partial x}\, D\Omega_2\bigg)\bigg]. \tag{A.78}
\]
Plugging (A.74)-(A.76) into (A.78), we have
\begin{align*}
J_1 &= E\bigg[\frac{\mathrm{tr}(D\Omega_2)}{Z_{d,2}'BZ_{d,2} + \mathrm{tr}(A)}\bigg] - 2E\bigg[\frac{\mathrm{tr}\big(BZ_{d,2}Z_{d,2}'D\Omega_2\big)}{\big(Z_{d,2}'BZ_{d,2} + \mathrm{tr}(A)\big)^2}\bigg] \\
&= E\bigg[\frac{-\mathrm{tr}(A)}{Z_{d,2}'BZ_{d,2} + \mathrm{tr}(A)}\bigg] + 2E\bigg[\frac{-Z_{d,2}'D\Omega_2 BZ_{d,2}}{\big(Z_{d,2}'BZ_{d,2} + \mathrm{tr}(A)\big)^2}\bigg], \tag{A.79}
\end{align*}
where the second equality is by (A.77). By definition,
\begin{align*}
-Z_{d,2}'D\Omega_2 BZ_{d,2} &= -Z_{d,2}'(\Gamma_2 - \bar{\Gamma}_1)'\Upsilon\bar{\Gamma}_1\Omega_2(\Gamma_2 - \bar{\Gamma}_1)'\Upsilon(\Gamma_2 - \bar{\Gamma}_1)Z_{d,2} \\
&= Z_{d,2}'(\Gamma_2 - \bar{\Gamma}_1)'\Upsilon(\Sigma_1 - \Sigma_2)\Upsilon(\Gamma_2 - \bar{\Gamma}_1)Z_{d,2} \\
&\le \lambda_{\max}\big(\Upsilon^{1/2}(\Sigma_1 - \Sigma_2)\Upsilon^{1/2}\big)\big(Z_{d,2}'(\Gamma_2 - \bar{\Gamma}_1)'\Upsilon(\Gamma_2 - \bar{\Gamma}_1)Z_{d,2}\big) \\
&= \lambda_{\max}(A)\, Z_{d,2}'BZ_{d,2}, \tag{A.80}
\end{align*}
where the last equality is by $\lambda_{\max}\big(\Upsilon^{1/2}(\Sigma_1 - \Sigma_2)\Upsilon^{1/2}\big) = \lambda_{\max}\big(\Upsilon(\Sigma_1 - \Sigma_2)\big) = \lambda_{\max}(A)$. Combining the results in (A.79) and (A.80), we get
\begin{align*}
J_1 &\le E\bigg[\frac{-\mathrm{tr}(A)}{Z_{d,2}'BZ_{d,2} + \mathrm{tr}(A)}\bigg] + 2E\bigg[\frac{\lambda_{\max}(A)\, Z_{d,2}'BZ_{d,2}}{\big(Z_{d,2}'BZ_{d,2} + \mathrm{tr}(A)\big)^2}\bigg] \\
&= E\bigg[\frac{-\mathrm{tr}(A)}{Z_{d,2}'BZ_{d,2} + \mathrm{tr}(A)}\bigg] + 2E\bigg[\frac{\big(Z_{d,2}'BZ_{d,2} + \mathrm{tr}(A)\big)\lambda_{\max}(A) - \mathrm{tr}(A)\lambda_{\max}(A)}{\big(Z_{d,2}'BZ_{d,2} + \mathrm{tr}(A)\big)^2}\bigg] \\
&= E\bigg[\frac{2\lambda_{\max}(A) - \mathrm{tr}(A)}{Z_{d,2}'BZ_{d,2} + \mathrm{tr}(A)}\bigg] - E\bigg[\frac{2\lambda_{\max}(A)\,\mathrm{tr}(A)}{\big(Z_{d,2}'BZ_{d,2} + \mathrm{tr}(A)\big)^2}\bigg]. \tag{A.81}
\end{align*}
Next, note that
\begin{align*}
J_2 &= E\bigg[\frac{Z_{d,2}'BZ_{d,2}}{\big(Z_{d,2}'BZ_{d,2} + \mathrm{tr}(A)\big)^2}\bigg] = E\bigg[\frac{Z_{d,2}'BZ_{d,2} + \mathrm{tr}(A) - \mathrm{tr}(A)}{\big(Z_{d,2}'BZ_{d,2} + \mathrm{tr}(A)\big)^2}\bigg] \\
&= E\bigg[\frac{1}{Z_{d,2}'BZ_{d,2} + \mathrm{tr}(A)}\bigg] - E\bigg[\frac{\mathrm{tr}(A)}{\big(Z_{d,2}'BZ_{d,2} + \mathrm{tr}(A)\big)^2}\bigg]. \tag{A.82}
\end{align*}
Combining (A.81) and (A.82), we obtain
\begin{align*}
g(h) &= 2\,\mathrm{tr}(A)J_1 + \mathrm{tr}(A)^2 J_2 \\
&\le 2\,\mathrm{tr}(A)\bigg(E\bigg[\frac{2\lambda_{\max}(A) - \mathrm{tr}(A)}{Z_{d,2}'BZ_{d,2} + \mathrm{tr}(A)}\bigg] - E\bigg[\frac{2\,\mathrm{tr}(A)\lambda_{\max}(A)}{\big(Z_{d,2}'BZ_{d,2} + \mathrm{tr}(A)\big)^2}\bigg]\bigg) \\
&\quad + \mathrm{tr}(A)^2\bigg(E\bigg[\frac{1}{Z_{d,2}'BZ_{d,2} + \mathrm{tr}(A)}\bigg] - E\bigg[\frac{\mathrm{tr}(A)}{\big(Z_{d,2}'BZ_{d,2} + \mathrm{tr}(A)\big)^2}\bigg]\bigg) \\
&= E\bigg[\frac{\mathrm{tr}(A)\big(4\lambda_{\max}(A) - \mathrm{tr}(A)\big)}{Z_{d,2}'BZ_{d,2} + \mathrm{tr}(A)}\bigg] - E\bigg[\frac{\mathrm{tr}(A)^2\big(4\lambda_{\max}(A) + \mathrm{tr}(A)\big)}{\big(Z_{d,2}'BZ_{d,2} + \mathrm{tr}(A)\big)^2}\bigg]. \tag{A.83}
\end{align*}
For all $G_2$ and $\Omega_2$ such that $h = (d', \mathrm{vec}(G_2)', \mathrm{vech}(\Omega_2)') \in H$, we have $G_2 = G_{2,F}$ and $\Omega_2 = \Omega_{2,F}$ for some $F \in \mathcal{F}$ because $V$ is compact, which was verified above. Hence, we obtain the desired result if $\mathrm{tr}(A) > 0$ and $4\lambda_{\max}(A) - \mathrm{tr}(A) \le 0$ for all $F \in \mathcal{F}$.
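The matrix Stein identity applied in (A.78) can be verified by simulation for the specific $\eta$ in (A.74). The sketch below checks $E[\eta(Z_2 + d_0)'DZ_2] = E[\mathrm{tr}(\partial\eta(Z_{d,2})'/\partial x \cdot D\Omega_2)]$ using the derivative in (A.75), with small arbitrary matrices standing in for $B$, $D$, $\Omega_2$, and $d_0$ (illustrative choices, not the paper's objects):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical ingredients: Omega is the covariance of Z_2, B is positive
# semi-definite as in (A.74), and c plays the role of tr(A) > 0.
Omega = np.array([[1.0, 0.4], [0.4, 1.5]])
B = np.array([[0.5, 0.1], [0.1, 0.3]])
D = np.array([[0.2, -0.1], [0.3, 0.4]])
d0 = np.array([0.7, -0.2])
C = 1.0

L = np.linalg.cholesky(Omega)
Z = (L @ rng.standard_normal((2, 400_000))).T   # Z_2 ~ N(0, Omega)
X = Z + d0                                      # X plays the role of Z_{d,2}
q = np.einsum('ij,jk,ik->i', X, B, X) + C       # x'Bx + tr(A)
eta = X / q[:, None]                            # eta(x) in (A.74)

# Left side of the identity: E[eta(X)' D Z_2]
lhs = np.mean(np.einsum('ij,jk,ik->i', eta, D, Z))

# Right side: E[tr(deta(x)'/dx * D Omega)], with (A.75) giving the trace
# tr(D Omega)/q - 2 x' D Omega B x / q^2 for each draw.
DOm = D @ Omega
rhs = np.mean(np.trace(DOm) / q
              - 2 * np.einsum('ij,jk,ik->i', X, DOm, X @ B.T) / q**2)

print(lhs, rhs)  # two Monte Carlo estimates of the same quantity
```

The two averages agree up to simulation error, which is the content of the Stein-type step that converts $J_1$ into an expression amenable to the bound in (A.80).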
A.3 Asymptotic risk of the pre-test GMM estimator with the J-test statistic

To simulate the asymptotic risk of the pre-test estimator in Figure 2, we consider the asymptotic risk under (i) $n^{1/2}\delta_n \to d \in R^{r_\delta}$ and (ii) $\|n^{1/2}\delta_n\| \to \infty$. We consider case (i) first. By (A.17) and (A.20),
\begin{align*}
\bar{g}_2(\hat{\theta}_2) &= \bar{g}_2(\theta_n) + \big[G_{2,F_n}(\theta_n) + o_p(1)\big](\hat{\theta}_2 - \theta_n) + o_p(n^{-1/2}) \\
&= \bar{g}_2(\theta_n) + G_{2,F_n}\Gamma_{2,n}\bar{g}_2(\theta_n) + o_p(n^{-1/2}) \\
&= (I_{r_2} + G_{2,F_n}\Gamma_{2,n})\bar{g}_2(\theta_n) + o_p(n^{-1/2}), \tag{A.84}
\end{align*}
which together with (A.39), $G_{2,F_n} = G_2 + o(1)$ and $\Gamma_{2,n} = \Gamma_2 + o(1)$ implies that
\begin{align*}
J_n &\equiv n\,\bar{g}_2(\hat{\theta}_2)'(\hat{\Omega}_2)^{-1}\bar{g}_2(\hat{\theta}_2) \tag{A.85} \\
&= n\,\bar{g}_2(\theta_n)'\big[\Omega_2^{-1} - \Omega_2^{-1}G_2(G_2'\Omega_2^{-1}G_2)^{-1}G_2'\Omega_2^{-1}\big]\bar{g}_2(\theta_n) + o_p(1).
\end{align*}
By Lemma A.1(v), $n^{1/2}\delta_n \to d \in R^{r_\delta}$ and the Slutsky Theorem, $n^{-1/2}\sum_{i=1}^{n} g_2(W_i, \theta_n) \to_d Z_2 + d_0 = Z_{d,2}$, which combined with the CMT implies that
\[
J_n \to_d Z_{d,2}'\big[\Omega_2^{-1} - \Omega_2^{-1}G_2(G_2'\Omega_2^{-1}G_2)^{-1}G_2'\Omega_2^{-1}\big]Z_{d,2} \equiv J_\infty(h) \tag{A.86}
\]
under $n^{1/2}\delta_n \to d \in R^{r_\delta}$.

Let $\alpha$ be a pre-specified significance level. The critical value $c_\alpha$ is the $1-\alpha$ quantile of the chi-square distribution with $r_2 - d_\theta$ degrees of freedom, which is the asymptotic distribution of $J_n$ when $d_0 = 0_{r_2}$. The pre-test estimator based on $J_n$ can be written as
\[
\hat{\theta}_p = (1 - \tilde{\omega}_{\alpha,p})\hat{\theta}_1 + \tilde{\omega}_{\alpha,p}\hat{\theta}_2 \quad\text{with}\quad \tilde{\omega}_{\alpha,p} = I\{J_n < c_\alpha\}, \quad\text{where}
\]
\[
\tilde{\omega}_{\alpha,p} \to_D I\{J_\infty(G_2, \Omega_2, d_0) < c_\alpha\} \equiv \tilde{\omega}_{\alpha,\infty} \quad\text{and}\quad
n^{1/2}(\hat{\theta}_p - \theta_n) = n^{1/2}(\hat{\theta}_1 - \theta_n) + \tilde{\omega}_{\alpha,p}\, n^{1/2}(\hat{\theta}_2 - \hat{\theta}_1) \to_D \bar{\Gamma}_1 Z_{d,2} + \tilde{\omega}_{\alpha,\infty}(\Gamma_2 - \bar{\Gamma}_1)Z_{d,2} \tag{A.87}
\]
under $n^{1/2}\delta_n \to d$. Furthermore, by the CMT,
\[
\ell(\hat{\theta}_p) = n(\hat{\theta}_p - \theta_n)'\Upsilon(\hat{\theta}_p - \theta_n) \to_D \tilde{\xi}_{\alpha,\infty}, \quad\text{where}
\]
\begin{align*}
\tilde{\xi}_{\alpha,\infty} &= Z_{d,2}'\bar{\Gamma}_1'\Upsilon\bar{\Gamma}_1 Z_{d,2} + 2\tilde{\omega}_{\alpha,\infty}Z_{d,2}'(\Gamma_2 - \bar{\Gamma}_1)'\Upsilon\bar{\Gamma}_1 Z_{d,2} \\
&\quad + \tilde{\omega}_{\alpha,\infty}^2 Z_{d,2}'(\Gamma_2 - \bar{\Gamma}_1)'\Upsilon(\Gamma_2 - \bar{\Gamma}_1)Z_{d,2}. \tag{A.88}
\end{align*}
Its expectation is
\begin{align*}
E[\tilde{\xi}_{\alpha,\infty}] &= \mathrm{tr}(\Upsilon\Sigma_1) + 2E\big[\tilde{\omega}_{\alpha,\infty}Z_{d,2}'(\Gamma_2 - \bar{\Gamma}_1)'\Upsilon\bar{\Gamma}_1 Z_{d,2}\big] \\
&\quad + E\big[\tilde{\omega}_{\alpha,\infty}Z_{d,2}'(\Gamma_2 - \bar{\Gamma}_1)'\Upsilon(\Gamma_2 - \bar{\Gamma}_1)Z_{d,2}\big]. \tag{A.89}
\end{align*}
This verifies Condition (i) in Lemma A.3 for the pre-test estimator, with $R_\zeta(h) = \min\{E[\tilde{\xi}_{\alpha,\infty}], \zeta\}$. For $\|n^{1/2}\delta_n\| \to \infty$, the J-test is consistent and the pre-test estimator equals the conservative GMM estimator w.p.a.1.

The asymptotic risk of the pre-test estimator $\hat{\theta}_p$ in Figure 2 is simulated based on the formula in (A.89).
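A minimal sketch of that simulation, under a hypothetical toy design (one parameter, one valid moment plus one suspect moment, identity $\Upsilon$; the matrices below are illustrative choices rather than the paper's calibration), draws $Z_{d,2}$, applies the limiting J-test indicator $\tilde{\omega}_{\alpha,\infty} = I\{J_\infty < c_\alpha\}$, and averages the limiting loss:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy design (assumed, not from the paper): d_theta = 1, r_2 = 2 moments,
# the second moment is the suspect one with local drift d_0 = (0, d)'.
G2 = np.array([[1.0], [0.8]])
Omega2 = np.array([[1.0, 0.3], [0.3, 1.0]])
G1, Omega1 = G2[:1, :], Omega2[:1, :1]
Upsilon = np.eye(1)

def Gamma(G, Om):
    # Gamma_k = -(G_k' Om_k^{-1} G_k)^{-1} G_k' Om_k^{-1}
    Oi = np.linalg.inv(Om)
    return -np.linalg.solve(G.T @ Oi @ G, G.T @ Oi)

Gam2 = Gamma(G2, Omega2)                                   # 1 x 2
Gam1bar = np.hstack([Gamma(G1, Omega1), np.zeros((1, 1))]) # [Gamma_1, 0]

Oi2 = np.linalg.inv(Omega2)
P = Oi2 - Oi2 @ G2 @ np.linalg.solve(G2.T @ Oi2 @ G2, G2.T @ Oi2)
c_alpha = 3.841   # chi-square(r_2 - d_theta = 1) critical value at alpha = 0.05

def pretest_risk(d0, reps=200_000):
    L = np.linalg.cholesky(Omega2)
    Zd2 = (L @ rng.standard_normal((2, reps))).T + d0      # Z_{d,2} = Z_2 + d_0
    J = np.einsum('ij,jk,ik->i', Zd2, P, Zd2)              # J_infty(h) as in (A.86)
    w = (J < c_alpha).astype(float)                        # omega_tilde as in (A.87)
    xi = Zd2 @ Gam1bar.T + w[:, None] * (Zd2 @ (Gam2 - Gam1bar).T)
    return np.mean(np.einsum('ij,jk,ik->i', xi, Upsilon, xi))

print(pretest_risk(np.array([0.0, 0.0])))  # no misspecification
print(pretest_risk(np.array([0.0, 5.0])))  # large drift: near the conservative risk
```

In this design the conservative risk is $\mathrm{tr}(\Upsilon\Sigma_1) = 1$; under a large drift the J-test rejects with high probability, so the simulated pre-test risk is close to that value, consistent with the w.p.a.1 statement above.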