Determining the Number of Groups in Latent Panel Structures
with an Application to Income and Democracy∗
Xun Lu and Liangjun Su
Department of Economics, Hong Kong University of Science & Technology; School of Economics, Singapore Management University, Singapore
September 15, 2016
Abstract
We consider a latent group panel structure as recently studied by Su, Shi, and Phillips
(2016), where the number of groups is unknown and has to be determined empirically. We
propose a testing procedure to determine the number of groups. Our test is a residual-based
LM-type test. We show that after being appropriately standardized, our test is asymptotically
normally distributed under the null hypothesis of a given number of groups and has power
to detect deviations from the null. Monte Carlo simulations show that our test performs
remarkably well in finite samples. We apply our method to study the effect of income on
democracy and find strong evidence of heterogeneity in the slope coefficients. Our testing
procedure determines three latent groups among seventy-four countries.
Key words: Classifier Lasso; Dynamic panel; Latent structure; Penalized least squares; Number of groups; Test
JEL Classification: C12, C23, C33, C38, C52.
1 Introduction
Recently latent group structures have received much attention in the panel data literature; see,
e.g., Sun (2005), Lin and Ng (2012), Deb and Trivedi (2013), Bonhomme and Manresa (2015,
∗The authors express their sincere appreciation to the co-editor, Frank Schorfheide, and three anonymous referees
for their many constructive comments on the previous version of the paper. They also thank Stéphane Bonhomme,
Xiaohong Chen, Han Hong, Cheng Hsiao, Hidehiko Ichimura, Peter C. B. Phillips, and Katsumi Shimotsu for
discussions on the subject matter and valuable comments on the paper. Su gratefully acknowledges the Singapore
Ministry of Education for Academic Research Fund under Grant MOE2012-T2-2-021 and the funding support
provided by the Lee Kong Chian Fund for Excellence. All errors are the authors’ sole responsibilities. Address
correspondence to: Liangjun Su, School of Economics, Singapore Management University, 90 Stamford Road,
Singapore, 178903; Phone: +65 6828 0386; e-mail: [email protected].
BM hereafter), Sarafidis and Weber (2015), Ando and Bai (2016), Bester and Hansen (2016), and
Su, Shi, and Phillips (2016, SSP hereafter). In comparison with some other popular approaches
to model unobserved heterogeneity in panel data models such as random coefficient models (see,
e.g., Hsiao (2014, chapter 6)), one important advantage of the latent group structure is that it
allows flexible forms of unobservable heterogeneity while remaining parsimonious. In addition, the
group structure has sound theoretical foundations in game theory or macroeconomic models
where multiplicity of Nash equilibria is expected (cf. Hahn and Moon (2010)). The key question
in latent group structures is how to identify each individual’s group membership. Bester and
Hansen (2016) assume that membership is known and determined by external information, say,
external classification or geographic location, while others assume that it is unrestricted and
unknown and propose statistical methods to achieve classification. Sun (2005) uses a parametric
multinomial logit regression to model membership. Lin and Ng (2012), BM, Sarafidis and Weber
(2015), and Ando and Bai (2016) extend K-means classification algorithms to the panel regression
framework. Deb and Trivedi (2013) propose EM algorithms to estimate finite mixture panel data
models with fixed effects. Motivated by the sparse feature of the individual regression coefficients
under latent group structures, SSP propose a novel variant of the Lasso procedure, i.e., classifier
Lasso (C-Lasso), to achieve classification. While these methods make important contributions
by empirically grouping individuals, to implement these methods, we often need to determine
the number of groups first. Some information criteria have been proposed to achieve this goal
(see, e.g., BM and SSP), which often rely on certain tuning parameters. This paper provides a
hypothesis-testing-based solution to determine the number of groups.
Specifically, we consider the panel data structure as in SSP:
$$y_{it} = \beta_i^{0\prime} x_{it} + \mu_i + u_{it}, \qquad i = 1,\dots,N \text{ and } t = 1,\dots,T, \qquad (1.1)$$
where $x_{it}$, $\mu_i$, and $u_{it}$ are the vector of regressors, the individual fixed effect, and the idiosyncratic error term, respectively, and $\beta_i^0$ is the slope coefficient that can depend on individual $i$. We assume that
the $N$ individuals belong to $K$ groups and all individuals in the same group share the same slope coefficients. That is, the $\beta_i^0$'s are homogeneous within each of the $K$ groups but heterogeneous across the groups. For a given $K$, we can apply SSP's C-Lasso procedure or the K-means algorithm to determine the group membership and estimate the $\beta_i^0$'s. However, in practice, $K$ is unknown and has to be determined from the data. This motivates us to test the following hypothesis:
$$H_0(K_0): K = K_0 \quad \text{versus} \quad H_1(K_0): K_0 < K \le K_{\max},$$
where $K_0$ and $K_{\max}$ are pre-specified by researchers. We can sequentially test the hypotheses $H_0(1), H_0(2), \dots$ until we fail to reject $H_0(K^*)$ for some $K^* \le K_{\max}$ and conclude that the number of groups is $K^*$. Onatski (2009) applies a similar procedure to determine the number of latent factors in panel factor structures.
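The sequential rule just described is mechanical enough to sketch in a few lines. In the sketch below, `pvalue_for_K` is a hypothetical helper (not from the paper) that returns the p-value of the test of $H_0(K)$ developed in Sections 2-3:

```python
# Sketch of the sequential testing rule; `pvalue_for_K` is a hypothetical helper
# that returns the p-value of the test of H0(K).
def select_num_groups(pvalue_for_K, K_min=1, K_max=5, level=0.05):
    """Return the first K with H0(K) not rejected, or None if all are rejected."""
    for K in range(K_min, K_max + 1):
        if pvalue_for_K(K) >= level:  # fail to reject H0(K): conclude K groups
            return K
    return None  # even H0(K_max) rejected: fall back to a random coefficient model

# Toy illustration with made-up p-values: H0(1) and H0(2) rejected, H0(3) not.
pvals = {1: 1e-6, 2: 3e-4, 3: 0.42, 4: 0.55, 5: 0.61}
K_hat = select_num_groups(pvals.get)  # K_hat == 3
```

The p-values here are invented purely to illustrate the stopping rule; in practice each one comes from the standardized LM statistic constructed below.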
In addition to helping to determine the number of groups, testing $H_0(K_0)$ itself is also useful for
empirical research. When $K_0 = 1$, the test becomes a test for homogeneity in the slope coefficients,
which is often assumed in empirical applications. When $K_0$ is some integer greater than 1, we
test whether the group structure is correctly specified. Although the group structure is flexible in
terms of modeling unobserved slope heterogeneity, it could still be misspecified. Inferences based
on misspecified models are often misleading. Thus conducting a formal specification test is highly
desirable. Although we work in the same framework as SSP, the questions of interest are different.
SSP study the classification and estimation problem, while we consider the specification testing.
To avoid overlapping with SSP’s paper, we omit the detailed discussions on the estimation in this
paper.
Our test is a residual-based LM-type test. We estimate the model under the null hypothesis
$H_0(K_0)$ to obtain the restricted residuals, and the test statistic is based on whether the regressors
have predictive power for the restricted residuals. Under the null of the correct number of latent
groups, the regressors should not contain any useful information about the restricted residuals. We
show that after being appropriately standardized, our test statistic is asymptotically normal under
the null. The p-values can be obtained based on the standard normal approximation, and thus
the test is easy to implement. Our test is related to the literature on testing slope homogeneity
and poolability for panel data models, in which case $K_0 = 1$; see, e.g., Pesaran, Smith, and Im (1996), Phillips and Sul (2003), Pesaran and Yamagata (2008), and Su and Chen (2013), among others. Nevertheless, none of the existing tests can be directly applied to test $K = K_0$ where $K_0 > 1$. Also, the test proposed in this paper substantially differs from the existing tests in
technical details, as we need to apply the C-Lasso method or K-means algorithm to estimate the
model under the null.
When there are no time fixed effects in the model, both SSP’s C-Lasso method and the K-
means algorithm can be used to estimate the model under $H_0(K_0)$. So the residuals that are used
to construct our LM statistic can be obtained from either method. When time fixed effects are
present in the model, we extend SSP’s method to allow for time fixed effects and show that the
LM statistic is asymptotically equivalent to that in the absence of time fixed effects.
We conduct Monte Carlo simulations to show the excellent finite sample performance of our
test. Both the level and the power of our test perform well in finite samples. With a high probability,
our method can determine the number of groups correctly.
Our method is applicable to a wide range of empirical studies. In this paper, we provide
detailed empirical analysis on the relationship between income and democracy. Specifically, $y_{it}$ is a measure of democracy for country $i$ in period $t$, and $x_{it}$ includes its income (the logarithm of its real GDP per capita) and lagged $y_{it}$. The main parameter of interest is the effect of income
on democracy. In an influential paper, Acemoglu, Johnson, Robinson, and Yared (2008, AJRY
hereafter) use the standard panel data model with common slope coefficients and find that the
effect of income on democracy is insignificant. We find that the slope coefficients (the effects
of income and lagged democracy on democracy) are actually heterogeneous. The hypothesis of
homogeneous slope coefficients is strongly rejected with p-values being less than 0.001. This
suggests that the common coefficient model is likely to be misspecified. Further, we determine the
number of heterogeneous groups to be three and find that the slope coefficients of the three groups
differ substantially. In particular, for one group, the effect of income on democracy is positive and
significant, and for the other two groups, the income effects are also significant, but negative with
different magnitudes. Therefore, our method provides the new insight that the effect of income
on democracy is heterogeneous and significant, in sharp contrast to the existing finding that
the income effect is insignificant. Our classification of groups is completely data-driven, and the resulting groups show no apparent pattern, such as geographic clustering. We further investigate
the determinants of the group pattern by running a cross-country multinomial logit regression
on some country-specific covariates. We find that two variables, namely, the constraints on the executive at independence and long-run economic growth, are important determinants.
There are numerous potential applications of our method. For example, SSP study the de-
terminants of saving rates across countries. That is, $y_{it}$ is the ratio of saving to GDP, and $x_{it}$ includes the lagged saving rates, inflation rates, real interest rates, and per capita GDP growth
rates. Using an information criterion, they find a two-group structure in the slope coefficients.
Our method is perfectly applicable to their setting. Another example is Hsiao and Tahmiscioglu
(1997, HT) who study the effect of firms’ liquidity constraints on their investment expenditure.
Specifically, their $y_{it}$ is the ratio of firm $i$'s capital investment to its capital stock at time $t$, and $x_{it}$ includes its liquidity (defined as cash flow minus dividends), sales, and Tobin's $q$. HT classify firms into two groups, less-capital-intensive firms and more-capital-intensive firms, using the
capital intensity ratio of 0.55 as a cutoff point. Our method can be applied to their setting by
classifying the firms into heterogeneous groups in a data-driven way, which allows unobservable
heterogeneity in the slope coefficients. In general, our method provides a powerful tool to detect
unobserved heterogeneity, which is an important phenomenon in panel data analysis. It can be
applied to any linear model for panel data where the time dimension is relatively long.
The remainder of the paper is organized as follows. In Section 2, we introduce the hypotheses
and the test statistic for panel data models with individual fixed effects only. In Section 3 we
derive the asymptotic distribution of our test statistic under the null and study the global power
of our test. In Section 4, we extend the analysis to allow for both individual and time fixed effects
in the models. We conduct Monte Carlo experiments to evaluate the finite sample performance
of our test in Section 5 and apply it to the income-democracy dataset in Section 6. Section 7
concludes. All proofs are relegated to the Appendix.
Notation. For an $m \times n$ real matrix $A$, we denote its transpose as $A'$ and its Frobenius norm as $\|A\|$ ($\equiv [\mathrm{tr}(AA')]^{1/2}$), where $\equiv$ means "is defined as". Let $P_A \equiv A(A'A)^{-1}A'$ and $M_A \equiv I_m - P_A$, where $I_m$ denotes an $m \times m$ identity matrix. When $A = A'$ is symmetric, we use $\mu_{\max}(A)$ and $\mu_{\min}(A)$ to denote its maximum and minimum eigenvalues, respectively, and $\mathrm{diag}(a_1,\dots,a_m)$ denotes a diagonal matrix whose $j$th diagonal element is given by $a_j$. Let $P_0 \equiv T^{-1} i_T i_T'$ and $M_0 \equiv I_T - T^{-1} i_T i_T'$, where $i_T$ is a $T \times 1$ vector of ones. Moreover, the operator $\xrightarrow{p}$ denotes convergence in probability, and $\xrightarrow{d}$ convergence in distribution. We use $(N,T) \to \infty$ to denote the joint convergence of $N$ and $T$ when $N$ and $T$ pass to infinity simultaneously. We abbreviate "positive semidefinite", "with probability approaching 1", and "without loss of generality" to "p.s.d.", "w.p.a.1", and "wlog", respectively.
2 Hypotheses and test statistic
In this section we introduce the hypotheses and test statistic.
2.1 Hypotheses
We consider the panel structure model
$$y_{it} = \beta_i^{0\prime} x_{it} + \mu_i + u_{it}, \qquad i = 1,\dots,N \text{ and } t = 1,\dots,T, \qquad (2.1)$$
where $x_{it}$ is a $p \times 1$ vector of strictly exogenous or predetermined regressors, $\mu_i$ an individual fixed effect, and $u_{it}$ the idiosyncratic error term. The model with both individual and time fixed effects will be studied in Section 4. We assume that $\beta_i^0$ has the following group structure:
$$\beta_i^0 = \begin{cases} \alpha_1^0 & \text{if } i \in G_1^0, \\ \;\vdots & \quad\vdots \\ \alpha_K^0 & \text{if } i \in G_K^0, \end{cases} \qquad (2.2)$$
where $K$ is an integer and $\{G_1^0,\dots,G_K^0\}$ forms a partition of $\{1,\dots,N\}$ such that $\cup_{k=1}^{K} G_k^0 = \{1,\dots,N\}$ and $G_j^0 \cap G_k^0 = \emptyset$ for any $j \neq k$. Further, $\alpha_j^0 \neq \alpha_k^0$ for any $j \neq k$. Let $N_k = \#G_k^0$ denote the cardinality of the set $G_k^0$. We assume that $\mathcal G^0 \equiv \{G_1^0,\dots,G_K^0\}$, $\alpha^0 \equiv (\alpha_1^0,\dots,\alpha_K^0)$, and $\beta^0 \equiv (\beta_1^0,\dots,\beta_N^0)$ are all unknown. One key step in estimating all these parameters is to first determine $K$, as once $K$ is determined, we can readily apply SSP's C-Lasso method or the K-means algorithm. This motivates us to test the following hypothesis:
$$H_0(K_0): K = K_0 \quad \text{versus} \quad H_1(K_0): K_0 < K \le K_{\max}. \qquad (2.3)$$
Here we assume that $K_{\max}$ and $K_0$ are fixed, i.e., they do not increase with the sample size $N$ or $T$.
The testing procedure developed below can be used to determine $K$. Suppose that we have a priori information such that $K_{\min} \le K \le K_{\max}$, where $K_{\min}$ is typically 1. Then we can first test $H_0(K_{\min})$ against $H_1(K_{\min})$. If we fail to reject the null, then we conclude that $K = K_{\min}$. Otherwise, we continue to test $H_0(K_{\min}+1)$ against $H_1(K_{\min}+1)$. We repeat this procedure until we fail to reject the null $H_0(K^*)$ and conclude that $K = K^*$. If we reject $K = K_{\max}$, then we can use the random coefficient model to estimate the model, i.e., $K = N$. This procedure is also used in other contexts, e.g., to determine the number of lags in autoregressive (AR) models, the cointegration rank, the rank of a matrix, and the number of latent factors in panel factor structures (Onatski, 2009).
In theory, $K_{\max}$ can be any finite number. However, in practice, we do not suggest choosing too large a value for $K_{\max}$. Our method is powerful when the number of groups is small. When the number of groups is large, the classification errors can be large. Therefore, if we reject $K = K_{\max}$ for a reasonably large $K_{\max}$, then we can simply adopt the random coefficient model to avoid classification errors.
2.2 Estimation under the null and test statistic
Our test is a residual-based test and so we only need to estimate the model under $H_0(K_0): K = K_0$. In the special case where $K_0 = 1$, the panel structure model reduces to a homogeneous panel data model so that $\beta_i^0 = \beta^0$ for all $i = 1,\dots,N$, and we can estimate the homogeneous slope coefficient using the usual within-group estimator.

In the general case where $K_0 > 1$, we can consider two popular ways to estimate the unknown group structure and the group-specific parameters. For example, we can apply SSP's C-Lasso procedure. Let $\tilde y_{it} = y_{it} - \bar y_{i\cdot}$, where $\bar y_{i\cdot} = T^{-1}\sum_{t=1}^T y_{it}$. Define $\tilde x_{it}$ and $\bar x_{i\cdot}$ analogously. Let $\hat\beta \equiv (\hat\beta_1,\dots,\hat\beta_N)$ and $\hat\alpha \equiv (\hat\alpha_1,\dots,\hat\alpha_{K_0})$ be the C-Lasso estimators proposed in SSP, which are defined as the minimizer of the following criterion function:
$$Q_{1NT,\lambda}^{(K_0)}(\beta,\alpha) = Q_{1NT}(\beta) + \frac{\lambda}{N}\sum_{i=1}^N \prod_{k=1}^{K_0}\|\beta_i - \alpha_k\|, \qquad (2.4)$$
where $\lambda \equiv \lambda_{NT}$ is a tuning parameter and
$$Q_{1NT}(\beta) = \frac{1}{NT}\sum_{i=1}^N\sum_{t=1}^T\left(\tilde y_{it} - \beta_i'\tilde x_{it}\right)^2.$$
Let $\hat G_k = \{i \in \{1,2,\dots,N\}: \hat\beta_i = \hat\alpha_k\}$ for $k = 1,\dots,K_0$. Let $\hat G_0 = \{1,2,\dots,N\}\setminus(\cup_{k=1}^{K_0}\hat G_k)$. Although SSP demonstrate that the number of elements in $\hat G_0$ shrinks to zero as $T \to \infty$, in finite samples $\hat G_0$ may not be empty. To fully impose the null hypothesis $H_0(K_0)$, we can force all the estimates of the slope coefficients to be grouped into $K_0$ groups. Specifically, we classify member $i$ in $\hat G_0$ to group $\hat G_{k^*}$, where $k^* \equiv \arg\min\{\|\hat\beta_i - \hat\alpha_k\|,\; k = 1,\dots,K_0\}$. Our final estimators of the $\beta_i^0$'s are the post-Lasso estimators. Specifically, for each of the classified groups, we re-estimate the homogeneous slope coefficient using the usual within-group estimator. Let $\hat\alpha_k$ ($k = 1,\dots,K_0$) be the estimator for group $\hat G_k$, where, with a slight abuse of notation, we continue to use $\hat\alpha_k$ for the post-Lasso estimate. Then the final estimators of the $\beta_i^0$'s are $\hat\beta \equiv (\hat\beta_1,\dots,\hat\beta_N)$, where $\hat\beta_i = \hat\alpha_k$ for $i \in \hat G_k$.
Alternatively, one can apply the K-means algorithm as advocated by Lin and Ng (2012), BM, Sarafidis and Weber (2015), and Ando and Bai (2016). Let $g_i \in \{1,\dots,K_0\}$ denote the group membership of individual $i$ and $g = (g_1,\dots,g_N)$. The K-means estimates of $\alpha$ and $g$ can be obtained as the minimizer of the following objective function:
$$Q_{NT}^{(K_0)}(\alpha, g) = \frac{1}{NT}\sum_{k=1}^{K_0}\sum_{i:\, g_i = k}\sum_{t=1}^T\left(\tilde y_{it} - \alpha_k'\tilde x_{it}\right)^2. \qquad (2.5)$$
Let $\hat g \equiv (\hat g_1,\dots,\hat g_N)$ denote the K-means estimate of $g$. With a little abuse of notation, we also use $\hat\alpha \equiv (\hat\alpha_1,\dots,\hat\alpha_{K_0})$ and $\hat\beta \equiv (\hat\beta_1,\dots,\hat\beta_N)$ to denote the K-means estimates of $\alpha^0$ and $\beta^0$, where $\hat\beta_i = \hat\alpha_{\hat g_i}$. Define $\hat G_k = \{i \in \{1,2,\dots,N\}: \hat g_i = k\}$ for $k = 1,\dots,K_0$. Note that the K-means algorithm forces all individuals to be classified into one of the $K_0$ groups automatically, so that $\hat G_0$ is an empty set in this case.
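To make the objective in (2.5) concrete, a minimal Lloyd-type iteration, alternating pooled within-group OLS with SSR-based reassignment, can be sketched as follows. This is illustrative code with hypothetical names, not the authors' implementation; `y` is $N \times T$ and `x` is $N \times T \times p$, both assumed already within-demeaned:

```python
import numpy as np

def kmeans_slopes(y, x, K, iters=20, seed=0):
    """Lloyd-type minimization of the objective (2.5): alternate (a) pooled OLS
    for each group-specific slope alpha_k and (b) reassigning each unit i to the
    group whose slope gives it the smallest sum of squared residuals."""
    N, T, p = x.shape
    g = np.random.default_rng(seed).integers(0, K, size=N)  # random initial membership
    alpha = np.zeros((K, p))
    for _ in range(iters):
        for k in range(K):                                   # update: pooled OLS per group
            idx = np.flatnonzero(g == k)
            if idx.size == 0:
                continue                                     # empty group: keep old alpha_k
            X = x[idx].reshape(-1, p)
            Y = y[idx].reshape(-1)
            alpha[k] = np.linalg.lstsq(X, Y, rcond=None)[0]
        ssr = np.array([((y - x @ a) ** 2).sum(axis=1) for a in alpha])  # K x N
        g = ssr.argmin(axis=0)                               # assignment step
    return alpha, g
```

On clean two-group data the iteration separates the units after a single reassignment pass; in practice multiple random starts are advisable, since the objective is non-convex.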
Given $\hat\beta_i$, we can estimate the individual fixed effects using $\hat\mu_i = \frac{1}{T}\sum_{t=1}^T (y_{it} - \hat\beta_i' x_{it})$.¹ The residuals are obtained by
$$\hat u_{it} \equiv y_{it} - \hat\beta_i' x_{it} - \hat\mu_i. \qquad (2.6)$$
It is easy to show that
$$\hat u_{it} = \left(y_{it} - \bar y_{i\cdot}\right) - \left(x_{it} - \bar x_{i\cdot}\right)'\hat\beta_i = u_{it} - \bar u_{i\cdot} + \left(x_{it} - \bar x_{i\cdot}\right)'\left(\beta_i^0 - \hat\beta_i\right), \qquad (2.7)$$
where $\bar u_{i\cdot} = T^{-1}\sum_{t=1}^T u_{it}$. Under the null hypothesis, $\hat\beta_i$ is a consistent estimator of $\beta_i^0$.² Hence, $\hat u_{it}$ should be close to $u_{it} - \bar u_{i\cdot}$. By the assumption that the regressors are predetermined, $x_{it}$ should not have any predictive power for $\hat u_{it}$. Specifically, we assume that the error term $u_{it}$ is a martingale difference sequence (m.d.s.; Assumption A.1(v) below), which implies that $E(u_{it}|x_{it}) = 0$. If we write $\hat u_{it}$ in linear regression form, we have
$$\hat u_{it} = a_i + b_i' x_{it} + v_{it}, \qquad i = 1,\dots,N, \; t = 1,\dots,T,$$
where $a_i$ and $b_i$ are the intercept and slope coefficients, respectively, and $v_{it}$ is the regression error. Then the m.d.s. assumption implies that $b_i = 0$ for all $i$. This motivates us to run the following auxiliary regression model:
$$\hat u_{it} = a_i + b_i' x_{it} + v_{it}, \qquad i = 1,\dots,N, \; t = 1,\dots,T, \qquad (2.8)$$
¹ If $K_0 = 1$, we set $\hat\beta_i = \hat\beta$, the within-group estimator of the homogeneous slope coefficient. Note that we also suppress the dependence of $\hat\beta_i$ on $K_0$.
² Strictly speaking, $\hat\beta_i$ is a consistent estimator of $\beta_i^0$ under the null. But because the cardinality of the set $\hat G_0$ shrinks to zero under the null as $(N,T) \to \infty$, the difference between $\hat\beta_i$ and $\beta_i^0$ is asymptotically negligible.
and test the null hypothesis
$$H_0^*: b_i = 0 \text{ for all } i = 1,\dots,N.$$
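For concreteness, the restricted residuals of (2.6) can be formed directly from any first-stage slope estimates. A numpy sketch, with hypothetical array shapes (`y` is $N \times T$, `x` is $N \times T \times p$, `beta_hat` is $N \times p$):

```python
import numpy as np

def restricted_residuals(y, x, beta_hat):
    """uhat_it = y_it - beta_hat_i'x_it - mu_hat_i, where mu_hat_i is the time
    average of (y_it - beta_hat_i'x_it), as in (2.6)."""
    fitted = np.einsum('itp,ip->it', x, beta_hat)      # beta_hat_i' x_it for each (i, t)
    mu_hat = (y - fitted).mean(axis=1, keepdims=True)  # mu_hat_i
    return y - fitted - mu_hat
```

By construction, each unit's residuals average to zero over time, which is why the auxiliary regression (2.8) is estimated with the intercept concentrated out.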
We construct an LM-type test statistic by concentrating the intercepts out in (2.8). Consider the Gaussian quasi-likelihood function for (2.8):
$$\mathcal L(\phi) = \sum_{i=1}^N \left(\hat u_i - x_i b_i\right)' M_0 \left(\hat u_i - x_i b_i\right),$$
where $\phi \equiv (b_1',\dots,b_N')'$, $\hat u_i \equiv (\hat u_{i1},\dots,\hat u_{iT})'$, $x_i \equiv (x_{i1},\dots,x_{iT})'$, $M_0 \equiv I_T - T^{-1} i_T i_T'$, and $i_T$ is a $T \times 1$ vector of ones. Define the LM statistic as
$$LM_{1NT}(K_0) = \left(-\frac{1}{2}\frac{\partial \mathcal L(0)}{\partial\phi}\right)'\left(-\frac{1}{2}\frac{\partial^2 \mathcal L(0)}{\partial\phi\,\partial\phi'}\right)^{-1}\left(-\frac{1}{2}\frac{\partial \mathcal L(0)}{\partial\phi}\right), \qquad (2.9)$$
where we make the dependence of $LM_{1NT}(K_0)$ on $K_0$ explicit. We can verify that
$$LM_{1NT}(K_0) = \sum_{i=1}^N \hat u_i' M_0 x_i \left(x_i' M_0 x_i\right)^{-1} x_i' M_0 \hat u_i, \qquad (2.10)$$
where the dependence of $LM_{1NT}(K_0)$ on $K_0$ is through that of $\hat u_i$ on $K_0$. We will show that, after being appropriately scaled and centered, $LM_{1NT}(K_0)$ is asymptotically normally distributed under $H_0(K_0)$ and diverges to infinity under $H_1(K_0)$.
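In matrix form, (2.10) is straightforward to compute. An illustrative numpy sketch (not the authors' code), with `uhat` the $N \times T$ matrix of restricted residuals and `x` the $N \times T \times p$ regressor array:

```python
import numpy as np

def lm_stat(uhat, x):
    """LM statistic of (2.10): sum over i of uhat_i'M0 x_i (x_i'M0 x_i)^{-1} x_i'M0 uhat_i."""
    N, T, p = x.shape
    M0 = np.eye(T) - np.ones((T, T)) / T         # M0 = I_T - i_T i_T' / T
    lm = 0.0
    for i in range(N):
        xm = M0 @ x[i]                           # T x p within-demeaned regressors
        um = M0 @ uhat[i]                        # demeaned residuals
        s = xm.T @ um                            # score x_i' M0 uhat_i
        lm += s @ np.linalg.solve(xm.T @ xm, s)  # quadratic form with (x_i'M0 x_i)^{-1}
    return lm
```

Each summand is a quadratic form in a projection of $\hat u_i$, so the statistic is nonnegative, and it is identically zero when the demeaned residuals are orthogonal to the demeaned regressors.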
Remark 2.1. We have included a constant term in the regression in (2.8). Under the assumption that $E(u_{it}) = 0$ and that $N$ and $T$ pass to infinity jointly, one can also omit the constant term and obtain the following LM test statistic:
$$LM_{1NT}^0(K_0) = \sum_{i=1}^N \hat u_i' x_i \left(x_i' x_i\right)^{-1} x_i' \hat u_i. \qquad (2.11)$$
The asymptotic distribution of $LM_{1NT}^0(K_0)$ can be similarly studied with little modification. In case $T$ is not very large, as in our empirical applications, we recommend including a constant term in the auxiliary regression in (2.8), and thus we only focus on the study of $LM_{1NT}(K_0)$ below.
3 Asymptotic properties
In this section we first present a set of assumptions that are necessary for the asymptotic analysis, and then study the asymptotic distributions of $LM_{1NT}(K_0)$ under both $H_0(K_0)$ and $H_1(K_0)$.
3.1 Assumptions
Let $\|\cdot\|_q \equiv [E(\|\cdot\|^q)]^{1/q}$ for $q \ge 1$. Let $\Omega_{i,T} \equiv T^{-1} x_i' M_0 x_i$ and $\Omega_i \equiv E(\Omega_{i,T})$. Define $\mathcal F_{i,t} \equiv \sigma(x_{i,t+1}, u_{it}, x_{it}, u_{i,t-1}, x_{i,t-1}, \dots)$. Let $C < \infty$ be a generic constant that may vary across lines. Following SSP, we define the following two types of classification errors:
$$\hat E_{kNT,i} = \left\{i \in \hat G_k \mid i \notin G_k^0\right\} \quad \text{and} \quad \hat F_{kNT,i} = \left\{i \in G_k^0 \mid i \notin \hat G_k\right\}, \qquad (3.1)$$
where $i = 1,\dots,N$ and $k = 1,\dots,K_0$. Let $\hat E_{kNT} = \cup_{i\in G_k^0}\hat E_{kNT,i}$ and $\hat F_{kNT} = \cup_{i\in \hat G_k}\hat F_{kNT,i}$.
We make the following assumptions.
Assumption A.1. (i) $\max_{1\le i\le N}\max_{1\le t\le T}\|w_{it}\|_{8+4\sigma} \le C$ for some $\sigma > 0$, for $w_{it} = x_{it}$ and $u_{it}$.
(ii) There exist positive constants $\underline c_{\Omega}$ and $\bar c_{\Omega}$ such that $\underline c_{\Omega} \le \min_{1\le i\le N}\mu_{\min}(\Omega_i) \le \max_{1\le i\le N}\mu_{\max}(\Omega_i) \le \bar c_{\Omega}$.
(iii) For each $i = 1,\dots,N$, $\{(x_{it}, u_{it}): t = 1,2,\dots\}$ is a strong mixing process with mixing coefficients $\alpha_i(\cdot)$. $\alpha(\cdot) \equiv \max_{1\le i\le N}\alpha_i(\cdot)$ satisfies $\alpha(\tau) = O(\tau^{-\kappa})$, where $\kappa = 3(2+\sigma)/\sigma + \epsilon$ for some $\epsilon > 0$. In addition, there exist integers $\tau_0, \tau^* \in (1,T)$ such that $\alpha(\tau_0) = o((NT^{1/2})^{-1})$, $NT^{1/2}\alpha(\tau^*)^{(1+\sigma)/(2+\sigma)} = o(1)$, and $N^{1/2}T^{-1/2}\tau^* = o(1)$.
(iv) Let $u_i \equiv (u_{i1},\dots,u_{iT})'$. $(x_i, u_i)$, $i = 1,\dots,N$, are mutually independent of each other.
(v) For each $i = 1,\dots,N$, $E(u_{it}|\mathcal F_{i,t-1}) = 0$ a.s.
Assumption A.2. Under $H_0(K_0)$, we have
(i) $N_k/N \to c_k \in (0,1)$ for each $k = 1,\dots,K_0$ as $N \to \infty$;
(ii) $\sum_{k=1}^{K_0}\|\hat\alpha_k - \alpha_k^0\|^2 = O_p\left((NT)^{-1} + T^{-2}\right)$;
(iii) $P\left(\cup_{k=1}^{K_0}\hat E_{kNT}\right) = o(1)$ and $P\left(\cup_{k=1}^{K_0}\hat F_{kNT}\right) = o(1)$.

Assumption A.3. There exist finite nonnegative numbers $c_1$ and $c_2$ such that $\limsup_{(N,T)\to\infty} N\log(NT)/T^2 = c_1$ and $\limsup_{(N,T)\to\infty} \log(NT)\, N^{(3+\sigma)/(4+2\sigma)} T^{-(5+3\sigma)/(4+2\sigma)} = c_2$.
A.1(i) imposes moment conditions on $x_{it}$ and $u_{it}$. A.1(ii) requires that $\Omega_i$ be positive definite uniformly in $i$. A.1(iii) requires that each individual time series $\{(x_{it}, u_{it}): t = 1,2,\dots\}$ be strong mixing. This condition can be verified if $x_{it}$ does not contain lagged dependent variables, no matter whether one treats the fixed effects $\mu_i$ as random or fixed. In the case of dynamic panel data models, Hahn and Kuersteiner (2011) assume that the $\mu_i$'s are nonrandom and uniformly bounded, in which case the strong mixing condition can also be verified. In the case of random fixed effects, they suggest adopting the concept of conditional strong mixing, where the mixing coefficient is defined by conditioning on the fixed effects. The dependence of the mixing rate $\kappa$ on $\sigma$ defined in A.1(i) reflects the trade-off between the degree of dependence and the moment bounds of the process $\{(x_{it}, u_{it}): t \ge 1\}$. The last set of conditions in A.1(iii) can easily be met. In particular, if the process is strong mixing with a geometric mixing rate, the conditions on $\alpha(\cdot)$ can be met simply by specifying $\tau_0 = \tau^* = \lfloor C\log T \rfloor$ for some sufficiently large $C$, where $\lfloor\cdot\rfloor$ denotes the integer part of a real number. A.1(iv) rules out cross-sectional dependence among the $(x_i, u_i)$'s and greatly facilitates our asymptotic analysis. A.1(v) requires that the error term $u_{it}$ be a martingale difference sequence (m.d.s.) with respect to the filtration $\mathcal F_{i,t}$, which allows for lagged dependent variables in $x_{it}$ and conditional heteroskedasticity, skewness, or kurtosis of unknown form in $u_{it}$.³
A.2(i) is typically imposed in the literature on panel data models with latent group structures; see BM, Ando and Bai (2016), and SSP, who provide rigorous asymptotic analyses of clustering estimators in different contexts. It implies that each group has asymptotically non-negligible membership as $N \to \infty$. A.2(ii)-(iii) can be verified for either SSP's C-Lasso estimators or the K-means algorithm estimators under suitable conditions. Note that neither BM nor Ando and Bai (2016) study the same model as ours, but it is not difficult to extend their analysis to our model and verify A.2(ii)-(iii). Our model is a special case of that considered by SSP. We can readily verify A.2(ii)-(iii) under Assumption A.1 and the following additional conditions: (1) $N^{1/2}T^{-1}(\ln T)^9 \to 0$ and $N^2 T^{-(3+2\sigma)} \to c \in [0,\infty)$ as $(N,T) \to \infty$, and (2) $T\lambda^2(\ln T)^{-(6+2\sigma)} \to \infty$ and $\lambda(\ln T)^{\nu} \to 0$ for some $\nu > 0$ as $(N,T) \to \infty$. Note that we allow $x_{it}$ to include lagged dependent variables, in which case both SSP's post-Lasso estimators $\hat\alpha_k$ and the K-means algorithm estimators have an asymptotic bias of order $O(T^{-1})$. In the absence of dynamics in the model, one would expect that $\hat\alpha_k - \alpha_k^0 = O_p((NT)^{-1/2})$.
A.3 imposes conditions on the rates at which $N$ and $T$ pass to infinity and on the interaction between $(N,T)$ and $\sigma$. Note that we allow $N$ and $T$ to pass to infinity at either identical or suitably restricted different rates. The appearance of the logarithm terms is due to the use of a Bernstein inequality for strong mixing processes. If the mixing process $\{(x_{it}, u_{it}): t \ge 1\}$ has a geometric decay rate, one can take an arbitrarily small $\sigma$ in A.1(i). In this case, A.3 puts the most stringent restrictions on $(N,T)$ by passing $\sigma \to 0$: $N/T^{5/3} \to 0$ as $(N,T) \to \infty$, ignoring the logarithm term. On the other hand, if $\sigma \ge 1$ in A.1(i), then the second condition in A.3 becomes redundant given the first condition. In the case of conventional panel data models with strictly exogenous regressors only, Pesaran and Yamagata (2008) require that either $\sqrt N/T \to 0$ or $\sqrt N/T^2 \to 0$ for two of their tests; but for stationary dynamic panel data models, they prove the asymptotic validity of their test only under the condition that $N/T \to c \in [0,\infty)$.
³ If the error terms are serially correlated in a static panel, it is well known that by adding lagged dependent and independent variables to the model, one can potentially ameliorate problems caused by such serial correlation; see Su and Chen (2013, p. 1090). For dynamic panel data models (e.g., the panel AR(1) model), we cannot allow for serial correlation in the error terms (e.g., AR(1) errors); otherwise, the error terms will be correlated with the lagged dependent variables, causing an endogeneity issue that we do not address in this paper.
3.2 Asymptotic null distribution
Let $h_{i,ts}$ denote the $(t,s)$th element of $H_i \equiv M_0 x_i (x_i' M_0 x_i)^{-1} x_i' M_0$. Let $x_{it}^\dagger \equiv x_{it} - T^{-1}\sum_{s=1}^T E(x_{is})$ and $z_{it} \equiv \Omega_i^{-1/2} x_{it}^\dagger$. Define
$$\mathbb B_{NT} \equiv \frac{1}{\sqrt N\, T}\sum_{i=1}^N\sum_{t=1}^T z_{it}' z_{it} u_{it}^2 \quad \text{and} \quad \mathbb V_{NT} \equiv \frac{4}{NT^2}\sum_{i=1}^N\sum_{t=2}^T E\left[u_{it} z_{it}' \sum_{s=1}^{t-1} z_{is} u_{is}\right]^2. \qquad (3.2)$$
The following theorem studies the asymptotic null distribution of the infeasible statistic $J_{1NT}(K_0)$.

Theorem 3.1 Suppose Assumptions A.1-A.3 hold. Then under $H_0(K_0)$,
$$J_{1NT}(K_0) \equiv \frac{N^{-1/2} LM_{1NT}(K_0) - \mathbb B_{NT}}{\sqrt{\mathbb V_{NT}}} \xrightarrow{d} N(0,1) \quad \text{as } (N,T) \to \infty.$$
The proof of the above theorem is tedious and is relegated to the appendix. To implement the test, we need consistent estimates of both $\mathbb B_{NT}$ and $\mathbb V_{NT}$. Let $\hat z_{it} = \hat\Omega_i^{-1/2}(x_{it} - T^{-1}\sum_{s=1}^T x_{is})$, where $\hat\Omega_i \equiv \Omega_{i,T} = T^{-1} x_i' M_0 x_i$, and let $\hat z_i \equiv (\hat z_{i1},\dots,\hat z_{iT})'$. Note that $\hat z_i = M_0 x_i \hat\Omega_i^{-1/2}$. We propose to estimate $\mathbb B_{NT}$ by
$$\hat{\mathbb B}_{NT}(K_0) = \frac{1}{\sqrt N\, T}\sum_{i=1}^N\sum_{t=1}^T \hat z_{it}' \hat z_{it} \hat u_{it}^2 \qquad (3.3)$$
and $\mathbb V_{NT}$ by
$$\hat{\mathbb V}_{NT}(K_0) = \frac{4}{NT^2}\sum_{i=1}^N\sum_{t=2}^T\left[\hat u_{it} \hat z_{it}' \sum_{s=1}^{t-1} \hat z_{is} \hat u_{is}\right]^2. \qquad (3.4)$$
Define
$$\hat J_{1NT}(K_0) \equiv \frac{N^{-1/2} LM_{1NT}(K_0) - \hat{\mathbb B}_{NT}(K_0)}{\sqrt{\hat{\mathbb V}_{NT}(K_0)}}. \qquad (3.5)$$
The following theorem establishes the consistency of $\hat{\mathbb B}_{NT}(K_0)$ and $\hat{\mathbb V}_{NT}(K_0)$ and the asymptotic distribution of $\hat J_{1NT}(K_0)$ under $H_0(K_0)$.
Theorem 3.2 Suppose Assumptions A.1-A.3 hold. Then under $H_0(K_0)$, $\hat{\mathbb B}_{NT}(K_0) = \mathbb B_{NT} + o_p(1)$, $\hat{\mathbb V}_{NT}(K_0) = \mathbb V_{NT} + o_p(1)$, and $\hat J_{1NT}(K_0) \xrightarrow{d} N(0,1)$ as $(N,T) \to \infty$.

Theorem 3.2 implies that the test statistic $\hat J_{1NT}(K_0)$ is asymptotically pivotal under $H_0(K_0)$, and we reject $H_0(K_0)$ for a sufficiently large value of $\hat J_{1NT}(K_0)$.
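Putting (2.10) and (3.3)-(3.5) together, the standardized statistic can be sketched as follows. This follows the sample formulas as written above and is an illustration only, not the authors' implementation; `uhat` is the $N \times T$ matrix of restricted residuals and `x` the $N \times T \times p$ regressor array:

```python
import numpy as np

def j_stat(uhat, x):
    """Feasible standardized statistic Jhat_1NT(K0) of (3.5), built from the
    sample quantities in (3.3)-(3.4); a sketch under the formulas above."""
    N, T, p = x.shape
    lm, B, V = 0.0, 0.0, 0.0
    for i in range(N):
        xd = x[i] - x[i].mean(axis=0)               # x_it - xbar_i (within-demeaned)
        u = uhat[i] - uhat[i].mean()                # M0 * uhat_i
        Om = xd.T @ xd / T                          # Omega_hat_i = T^{-1} x_i'M0 x_i
        w, P = np.linalg.eigh(Om)                   # inverse square root via eigh
        z = xd @ (P * w ** -0.5) @ P.T              # rows zhat_it = Omega_hat_i^{-1/2} (x_it - xbar_i)
        s = z.T @ u                                 # sum_t zhat_it * uhat_it
        lm += (s @ s) / T                           # i-th term of LM_1NT in (2.10)
        B += ((z ** 2).sum(axis=1) * u ** 2).sum()  # i-th term of Bhat (3.3), unscaled
        c = np.cumsum(z * u[:, None], axis=0)       # partial sums sum_{s<=t} zhat_is uhat_is
        V += ((u[1:] * (z[1:] * c[:-1]).sum(axis=1)) ** 2).sum()
    Bhat = B / (np.sqrt(N) * T)
    Vhat = 4.0 * V / (N * T ** 2)
    return (lm / np.sqrt(N) - Bhat) / np.sqrt(Vhat)
```

The bias correction removes exactly the "diagonal" terms of each quadratic form, so the numerator is a self-normalized sum of martingale-difference cross-products, which is what delivers the standard normal limit in Theorem 3.2.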
We obtain the distributional results in Theorems 3.1 and 3.2 despite the fact that the individual effects $\mu_i$ can only be estimated at the rate $T^{-1/2}$, which is slower than the rate $(NT)^{-1/2}$ or $(NT)^{-1/2} + T^{-1}$ at which the group-specific parameter estimates $\hat\alpha_k$, $k = 1,\dots,K_0$, converge to their true values under $H_0(K_0)$. The slow convergence rate of the individual effect estimates does not have adverse asymptotic effects on the estimation of the bias term $\mathbb B_{NT}$, the variance term $\mathbb V_{NT}$, or the asymptotic distribution of $\hat J_{1NT}(K_0)$. Nevertheless, they can play an important role in finite samples, which we verify through Monte Carlo simulations.
3.3 Consistency
Let $\mathcal G_K = \{\{G_1,\dots,G_K\}: \cup_{k=1}^K G_k = \{1,\dots,N\} \text{ and } G_j \cap G_k = \emptyset \text{ for any } j \neq k\}$. That is, $\mathcal G_K$ denotes the class of all possible $K$-group partitions of $\{1,\dots,N\}$. To study the consistency of our test, we add the following assumption.
Assumption A.4. (i) $N^{-1}\sum_{i=1}^N\|\beta_i^0\|^2 = O_p(1)$.
(ii) $\inf_{\{G_1,\dots,G_{K_0}\}\in\mathcal G_{K_0}}\min_{(\alpha_1,\dots,\alpha_{K_0})} N^{-1}\sum_{k=1}^{K_0}\sum_{i\in G_k}\|\beta_i^0 - \alpha_k\|^2 \xrightarrow{p} c_0 > 0$ as $N \to \infty$.
A.4(i) is trivially satisfied if the $\beta_i^0$'s are uniformly bounded or random with finite second moments. A.4(ii) essentially says that one cannot group the parameter vectors $\{\beta_i^0, 1 \le i \le N\}$ into $K_0$ groups by leaving out an insignificant number of unclassified individuals. It is satisfied for a variety of global alternatives:

1. The number of groups is $K = K_0 + r$ for some positive integer $r$ such that $N_k/N \to c_k \in (0,1)$ for each $k = 1,\dots,K_0+r$.

2. There is no grouped pattern among $\{\beta_i^0, 1 \le i \le N\}$, such that we have a completely heterogeneous population of individuals.

3. The regression model is actually a random coefficient model: $\beta_i^0 = \beta^0 + v_i$, where $\beta^0$ is a fixed parameter vector, and the $v_i$'s are independent and identical draws from a continuous distribution with zero mean and finite variance.

4. The regression model is a hierarchical random coefficient model:
$$\beta_i^0 = \begin{cases} \alpha_1^0 + v_{i1} & \text{if } i \in G_1^0, \\ \quad\vdots & \quad\vdots \\ \alpha_K^0 + v_{iK} & \text{if } i \in G_K^0, \end{cases}$$
where $G_1^0,\dots,G_K^0$ are defined as before, the $v_{ik}$'s (for $k = 1,\dots,K$) are independent and identical draws from a continuous distribution with zero mean and finite variance, and $K$ may be different from $K_0$.
The following theorem establishes the consistency of $\hat J_{1NT}(K_0)$.

Theorem 3.3 Suppose Assumptions A.1 and A.3-A.4 hold. Then under $H_1(K_0)$, with possibly diverging $K_{\max}$ and random coefficients, $P(\hat J_{1NT}(K_0) \ge c_{NT}) \to 1$ as $(N,T) \to \infty$ for any non-stochastic sequence $c_{NT} = o(N^{1/2}T)$.

The above theorem indicates that our test statistic $\hat J_{1NT}(K_0)$ is divergent at the rate $N^{1/2}T$ under $H_1(K_0)$ and thus has power to detect any alternatives such that A.4 is satisfied.
Remark 3.1. Our asymptotic theories here are "pointwise," as the unknown parameters are treated as fixed. We do not consider the uniformity issue, and so our procedures may suffer from the same problem as other post-selection inference procedures (see, e.g., Leeb and Pötscher, 2005, 2008, 2009, and Schneider and Pötscher, 2009). This is a well-known challenge in the literature on model selection and pretest estimation. Although uniform inference is a very important question, developing a thorough theory for it is beyond the scope of this paper.
4 Time fixed effects
In applications, we may also consider the model with both individual and time fixed effects:
$$y_{it} = \beta_i^{0\prime} x_{it} + \mu_i + \gamma_t + u_{it}, \qquad i = 1,\dots,N \text{ and } t = 1,\dots,T, \qquad (4.1)$$
where $\gamma_t$ is the time fixed effect, the other variables are defined as above, and we assume that the $\beta_i^0$'s have the unknown group structure defined in (2.2). We treat $\mu_i$ and $\gamma_t$ as unknown fixed parameters that are not separately identifiable. Despite this, it is possible to estimate the group-specific parameters $\alpha_k^0$ and identify each individual's group membership consistently.
4.1 Estimation under the null and test statistic
When $\beta_i^0 = \beta^0$ for all $i = 1,\dots,N$, one can follow Hsiao (2014, Chapter 3.6) and sweep out both the individual and time effects from (4.1) via suitable transformations. We do a similar thing in the presence of heterogeneous slope coefficients $\beta_i^0$. As before, we first eliminate the individual effects $\mu_i$ in (4.1) via the within-group transformation:
$$\tilde y_{it} = \beta_i^{0\prime}\tilde x_{it} + \tilde\gamma_t + \tilde u_{it}, \qquad (4.2)$$
where $\tilde y_{it} = y_{it} - \bar y_{i\cdot}$, $\tilde\gamma_t = \gamma_t - \bar\gamma$, and $\bar\gamma = T^{-1}\sum_{t=1}^T \gamma_t$. Then we eliminate $\tilde\gamma_t$ from the above model to obtain
$$\check y_{it} = \beta_i^{0\prime}\tilde x_{it} - \frac{1}{N}\sum_{j=1}^N \beta_j^{0\prime}\tilde x_{jt} + \check u_{it}, \qquad (4.3)$$
where $\check y_{it} = y_{it} - \bar y_{i\cdot} - \bar y_{\cdot t} + \bar y$, $\bar y_{\cdot t} = \frac{1}{N}\sum_{i=1}^N y_{it}$, $\bar y = \frac{1}{NT}\sum_{i=1}^N\sum_{t=1}^T y_{it}$, and $\bar u_{\cdot t}$ and $\check u_{it}$ are similarly defined. Under $H_0(K_0): K = K_0$, we can follow SSP and estimate $\beta \equiv (\beta_1,\dots,\beta_N)$ and $\alpha \equiv (\alpha_1,\dots,\alpha_{K_0})$ by minimizing the following criterion function:
$$Q_{2NT,\lambda}^{(K_0)}(\beta,\alpha) = Q_{2NT}(\beta) + \frac{\lambda}{N}\sum_{i=1}^N \prod_{k=1}^{K_0}\|\beta_i - \alpha_k\|, \qquad (4.4)$$
where
$$Q_{2NT}(\beta) = \frac{1}{NT}\sum_{i=1}^N\sum_{t=1}^T\left(\check y_{it} - \beta_i'\tilde x_{it} + \frac{1}{N}\sum_{j=1}^N \beta_j'\tilde x_{jt}\right)^2.$$
Let $\hat\beta \equiv (\hat\beta_1,\dots,\hat\beta_N)$ and $\hat\alpha \equiv (\hat\alpha_1,\dots,\hat\alpha_{K_0})$ denote the estimates of $\beta$ and $\alpha$, respectively. If necessary, we can estimate $\mu_i$ by $\tilde\mu_i = \frac{1}{T}\sum_{t=1}^T(y_{it} - \hat\beta_i' x_{it})$ for $i = 1,\dots,N$. Given $\hat\beta$ and $\hat\alpha$, we classify the individuals into $K_0$ groups as in Section 2. As before, we use $\hat G_1,\dots,\hat G_{K_0}$ to denote the $K_0$ estimated groups and $\hat N_k$ to denote the cardinality of $\hat G_k$.
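For a balanced panel, the two eliminations in (4.2)-(4.3) amount to subtracting unit means and time means and adding back the grand mean. A small illustrative numpy helper:

```python
import numpy as np

def two_way_demean(a):
    """Return a_it - abar_i. - abar_.t + abar for an N x T array a, i.e. the
    transformation taking y_it to the two-way demeaned variable in (4.3)."""
    return (a - a.mean(axis=1, keepdims=True)
              - a.mean(axis=0, keepdims=True) + a.mean())
```

This applies directly to $y_{it}$ and $u_{it}$ in (4.3); note that with heterogeneous slopes the regressor part does not simplify the same way and keeps the form $\beta_i'\tilde x_{it} - N^{-1}\sum_j \beta_j'\tilde x_{jt}$.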
The post-Lasso estimates can be obtained in two ways. One is to pool all the observations within each estimated group and estimate the group-specific parameters for each group separately after we demean over time and across individuals within the group. In this way, one can easily work out the standard error for each group-specific estimate as usual. Alternatively, we consider estimation based on (4.3) by using the estimated group membership. Assuming all individuals are classified into one of the $K_0$ groups, we obtain the post-Lasso estimates $\hat\alpha_k$ of $\alpha_k^0$ as the solution to the following minimization problem:
$$\min_{\alpha_1,\dots,\alpha_{K_0}} \frac{1}{NT}\sum_{k=1}^{K_0}\sum_{i\in\hat G_k}\sum_{t=1}^T\left(\check y_{it} - \alpha_k'\tilde x_{it} + \frac{1}{N}\sum_{l=1}^{K_0}\sum_{j\in\hat G_l}\alpha_l'\tilde x_{jt}\right)^2. \qquad (4.5)$$
Let $\hat\beta_i = \hat\alpha_k$ for $i \in \hat G_k$ and $k = 1,\dots,K_0$, and let $\hat\alpha \equiv (\hat\alpha_1,\dots,\hat\alpha_{K_0})$.

Given the above post-Lasso estimates, we define the restricted residuals as
$$\hat u_{it} = \check y_{it} - \hat\beta_i'\tilde x_{it} + \frac{1}{N}\sum_{j=1}^N \hat\beta_j'\tilde x_{jt}. \qquad (4.6)$$
Following the analysis in Section 2.2, we propose to test $H_0(K_0)$ by using the following LM-type test statistic:
$$LM_{2NT}(K_0) = \sum_{i=1}^N \hat u_i' M_0 x_i \left(x_i' M_0 x_i\right)^{-1} x_i' M_0 \hat u_i,$$
where $\hat u_i = (\hat u_{i1},\dots,\hat u_{iT})'$. We will show that, after being appropriately centered and scaled, $N^{-1/2}LM_{2NT}(K_0)$ follows the same asymptotic distribution as $N^{-1/2}LM_{1NT}(K_0)$ under the null.
4.2 Asymptotic properties of the C-Lasso estimates
Since SSP do not allow for time fixed effects in their model, their asymptotic theory does not apply to our C-Lasso estimates of the parameters and group identities. Fortunately, we can study the preliminary convergence rates of various estimates under Assumptions A.1, A.2(i), and A.3.
Theorem 4.1 Suppose that Assumptions A.1, A.2(i), and A.3 hold. Then under $H_0(K_0)$,
(i) $\frac{1}{N}\sum_{i=1}^{N}\big\|\hat\beta_i - \beta_i^0\big\|^2 = O_P(T^{-1})$;
(ii) $\big(\hat\alpha_{(1)}, \ldots, \hat\alpha_{(K_0)}\big) - \big(\alpha_1^0, \ldots, \alpha_{K_0}^0\big) = O_P(T^{-1/2})$;
(iii) $\max_{1\le i\le N}\big\|\hat\beta_i - \beta_i^0\big\| = O_P(c_{NT} + \lambda)$,
where $c_{NT} = \max(N,T)^{1/(4+2\sigma)}\log(N\vee T)\,(\log(N\vee T)/T)^{1/2}$ and $\big(\hat\alpha_{(1)}, \ldots, \hat\alpha_{(K_0)}\big)$ is some suitable permutation of $\big(\hat\alpha_1, \ldots, \hat\alpha_{K_0}\big)$.
Theorems 4.1(i)-(iii) establish the mean square convergence rate of the $\hat\beta_i$'s, the convergence rate of the group-specific estimates, and the uniform convergence rate of the $\hat\beta_i$'s, respectively. The findings are similar to those in SSP. In particular, the convergence rates of the estimates of the group-specific parameters do not depend on $N$. Note that $c_{NT} = O\big((\log(N\vee T)/T)^{1/2}\big)$ under the additional restriction that $\limsup_{(N,T)\to\infty} N T^{-(1+\sigma)} = c \in [0,\infty)$. For notational simplicity, hereafter we simply write $\hat\alpha_k$ for $\hat\alpha_{(k)}$ as the consistent estimator of $\alpha_k^0$.
Define the two types of classification errors $\hat{E}_{kNT}$ and $\hat{F}_{kNT}$ as before. To study the consistency of our classification method, we add the following assumption.

Assumption A.5. (i) $N^{1/2}T^{-1}(\ln T)^{9} \to 0$ and $\limsup_{(N,T)\to\infty} N T^{-(1+\sigma)} = c \in [0,\infty)$; (ii) $T\lambda^{2}(\ln T)^{-(6+2\nu)} \to \infty$ and $\lambda^{4}T(\ln T)^{\nu} \to 0$ for some $\nu > 0$ as $(N,T)\to\infty$.

Assumptions A.5(i)-(ii) strengthen the conditions on $N$, $T$, and $\lambda$. Note that A.5(ii) implies that it suffices to choose $\lambda \propto T^{-a}$ for some $a \in (1/4, 1/2)$. The following theorem establishes the uniform consistency of our classification method.
Theorem 4.2 Suppose that Assumptions A.1, A.2(i), A.3, and A.5 hold. Then under $H_0(K_0)$,
(i) $P\big(\cup_{k=1}^{K_0}\hat{E}_{kNT}\big) \le \sum_{k=1}^{K_0} P\big(\hat{E}_{kNT}\big) \to 0$ as $(N,T)\to\infty$;
(ii) $P\big(\cup_{k=1}^{K_0}\hat{F}_{kNT}\big) \le \sum_{k=1}^{K_0} P\big(\hat{F}_{kNT}\big) \to 0$ as $(N,T)\to\infty$.
Given the results in Theorems 4.1 and 4.2, we will show that, under some regularity conditions, the post-Lasso estimators are asymptotically oracle-efficient in the sense that they are as efficient as the infeasible estimators

$$\tilde{\boldsymbol{\alpha}}^{(K_0)} \equiv \arg\min_{\boldsymbol{\alpha}^{(K_0)}} \frac{1}{NT}\sum_{k=1}^{K_0}\sum_{i\in G_k^0}\sum_{t=1}^{T}\Big(\tilde{y}_{it} - \alpha_k'\tilde{x}_{it} + \frac{1}{N}\sum_{l=1}^{K_0}\sum_{j\in G_l^0}\alpha_l'\tilde{x}_{jt}\Big)^{2}. \qquad (4.7)$$
Let $\tilde{\boldsymbol{\alpha}}^{(K_0)} = (\tilde\alpha_1, \ldots, \tilde\alpha_{K_0})$. Let $\mathbb{Q}_{k} = \frac{1}{N_kT}\sum_{i\in G_k^0}\tilde{X}_i'\tilde{X}_i$, $\mathbb{Q}_{kl} = \frac{1}{NN_kT}\sum_{i\in G_k^0}\sum_{j\in G_l^0}\tilde{X}_i'\tilde{X}_j$, and $\mathbb{V}_{k} = \frac{1}{N_kT}\sum_{i\in G_k^0}\breve{X}_i'u_i$ for $k, l = 1, \ldots, K_0$, where $\breve{X}_i = \tilde{X}_i - N^{-1}\sum_{j\in G_k^0}\tilde{X}_j$ for $i \in G_k^0$ and $u_i = (u_{i1}, \ldots, u_{iT})'$. Define

$$\mathbb{Q} = \begin{pmatrix} \mathbb{Q}_1 - \mathbb{Q}_{11} & -\mathbb{Q}_{12} & \cdots & -\mathbb{Q}_{1K_0} \\ -\mathbb{Q}_{21} & \mathbb{Q}_2 - \mathbb{Q}_{22} & \cdots & -\mathbb{Q}_{2K_0} \\ \vdots & \vdots & \ddots & \vdots \\ -\mathbb{Q}_{K_01} & -\mathbb{Q}_{K_02} & \cdots & \mathbb{Q}_{K_0} - \mathbb{Q}_{K_0K_0} \end{pmatrix} \quad \text{and} \quad \mathbb{V} = \begin{pmatrix} \mathbb{V}_1 \\ \mathbb{V}_2 \\ \vdots \\ \mathbb{V}_{K_0} \end{pmatrix}. \qquad (4.8)$$

Let $\hat{\mathbb{Q}}$ and $\hat{\mathbb{V}}$ be defined analogously to $\mathbb{Q}$ and $\mathbb{V}$ with the $G_k^0$'s and $N_k$'s replaced by the $\hat{G}_k$'s and $\hat{N}_k$'s. We can verify that

$$\mathrm{vec}\big(\tilde{\boldsymbol{\alpha}}^{(K_0)} - \boldsymbol{\alpha}^{0(K_0)}\big) = \mathbb{Q}^{-1}\mathbb{V} \quad \text{and} \quad \mathrm{vec}\big(\hat{\boldsymbol{\alpha}}^{(K_0)} - \boldsymbol{\alpha}^{0(K_0)}\big) = \hat{\mathbb{Q}}^{-1}\hat{\mathbb{V}},$$

where $\boldsymbol{\alpha}^{0(K_0)} \equiv (\alpha_1^0, \ldots, \alpha_{K_0}^0)$.
Let $\mathbb{B}_k \equiv \frac{1}{N_kT^2}\sum_{i\in G_k^0}\sum_{t=1}^{T}\sum_{s=1}^{T}\mathrm{E}\big(x_{it}^{\perp}u_{is}\big)$, $\mathbb{U}_k \equiv \frac{1}{N_kT}\sum_{i\in G_k^0}\sum_{t=1}^{T}x_{it}^{\perp}u_{it}$, and $\Omega_k = \frac{1}{N_kT}\sum_{i\in G_k^0}\sum_{t=1}^{T}\mathrm{E}\big(x_{it}^{\perp}x_{it}^{\perp\prime}u_{it}^{2}\big)$ for $k = 1, \ldots, K_0$, where $x_{it}^{\perp} = \tilde{x}_{it} - \big(\bar{x}_{\cdot t}^{(k)} - \bar{x}^{(k)}\big)$, $\bar{x}_{\cdot t}^{(k)} = \frac{1}{N_k}\sum_{i\in G_k^0}\tilde{x}_{it}$, and $\bar{x}^{(k)} = \frac{1}{T}\sum_{t=1}^{T}\bar{x}_{\cdot t}^{(k)}$.4 Let $\mathbb{B} = \big(\mathbb{B}_1', \ldots, \mathbb{B}_{K_0}'\big)'$, $\mathbb{U} = \big(\mathbb{U}_1', \ldots, \mathbb{U}_{K_0}'\big)'$, and $\Omega = \mathrm{diag}(\Omega_1, \ldots, \Omega_{K_0})$.
We add the following assumption.
Assumption A.6. (i) $\mathbb{Q} \to \mathbb{Q}_0 > 0$ as $(N,T)\to\infty$;
(ii) $\sqrt{NT}\,\mathbb{U} \xrightarrow{d} N(0, \Omega_0)$ as $(N,T)\to\infty$, where $\Omega_0 = \lim_{(N,T)\to\infty}\Omega$.

Assumption A.6 imposes conditions to ensure the asymptotic normality of the post-Lasso estimators of the group-specific parameters. When $K_0 = 1$, $\mathbb{Q}$ reduces to $\mathbb{Q}_1 - \mathbb{Q}_{11}$, which is required to be asymptotically nonsingular for the usual fixed effects estimator in a two-way error component model to be asymptotically normal.
Theorem 4.3 Suppose Assumptions A.1, A.2(i), A.3, A.5, and A.6 hold. Then under $H_0(K_0)$,
(i) $\hat\alpha_k - \alpha_k^0 = \tilde\alpha_k - \alpha_k^0 + o_P\big((NT)^{-1/2}\big) = O_P\big((NT)^{-1/2} + T^{-1}\big)$ for $k = 1, \ldots, K_0$;
(ii) $\sqrt{NT}\left[\mathrm{vec}\big(\hat{\boldsymbol{\alpha}}^{(K_0)} - \boldsymbol{\alpha}^{0(K_0)}\big) + \mathbb{Q}^{-1}\mathbb{B}\right] \xrightarrow{d} N\big(0,\ \mathbb{Q}_0^{-1}\Omega_0\mathbb{Q}_0^{-1}\big)$.
Theorem 4.3(i) indicates that $\hat\alpha_k$ is asymptotically equivalent to the infeasible estimator $\tilde\alpha_k$. It also reports the convergence rate of $\hat\alpha_k$ and implies that $\frac{1}{N}\sum_{i=1}^{N}\big\|\hat\beta_i - \beta_i^0\big\|^2 = O_P\big((NT)^{-1} + T^{-2}\big)$. Theorem 4.3(ii) reports the asymptotic distribution of $\hat{\boldsymbol{\alpha}}^{(K_0)}$. By the Davydov inequality, we can readily show that $\mathbb{Q}^{-1}\mathbb{B}$ is $O(T^{-1})$, which suggests that the order of the asymptotic bias of $\mathrm{vec}\big(\hat{\boldsymbol{\alpha}}^{(K_0)}\big)$ is given by $O(T^{-1})$. Note that this bias term has the same form as the bias term in the linear panel data model with individual fixed effects alone considered by SSP. As in SSP, the bias is asymptotically vanishing in the case where $x_{it}$ only contains strictly exogenous regressors or $N = o(T)$.

For inference, one needs to estimate $\mathbb{Q}_0$, $\Omega_0$, and $\mathbb{Q}^{-1}\mathbb{B}$. When necessary, one can follow SSP to estimate $\mathbb{Q}^{-1}\mathbb{B}$. A consistent estimate of $\mathbb{Q}_0$ is given by $\hat{\mathbb{Q}}$. Similarly, we can consistently estimate $\Omega_0$ by a sample analogue of $\Omega$. For brevity, we do not give the details.
4.3 Asymptotic null distribution of $LM_{2NT}(K_0)$

Despite the presence of time fixed effects, we can show that $LM_{2NT}(K_0)$ shares the same asymptotic null distribution as $LM_{1NT}(K_0)$. This is summarized in the following theorem.

Theorem 4.4 Suppose Assumptions A.1, A.2(i), A.3, A.5, and A.6 hold. Then under $H_0(K_0)$,

$$J_{2NT}(K_0) \equiv \Big(N^{-1/2}LM_{2NT}(K_0) - \hat{\mathbb{B}}_{NT}\Big)\Big/\sqrt{\hat{\mathbb{V}}_{NT}} \xrightarrow{d} N(0,1) \quad \text{as } (N,T)\to\infty.$$
4If $\{x_{it}, t \ge 1\}$ is mean-stationary, then $x_{it}^{\perp} = \tilde{x}_{it}$.
That is, $N^{-1/2}LM_{2NT}(K_0)$ shares the same asymptotic bias $\mathbb{B}_{NT}$ and asymptotic variance $\mathbb{V}_{NT}$ as $N^{-1/2}LM_{1NT}(K_0)$. As before, we can consistently estimate both $\mathbb{B}_{NT}$ and $\mathbb{V}_{NT}$ by $\hat{\mathbb{B}}_{NT}(K_0)$ and $\hat{\mathbb{V}}_{NT}(K_0)$. The major difference is that we now need to use the residuals $\hat{u}_{it}$ defined in (4.6) in place of the earlier residuals throughout. We omit the details for brevity.
5 Monte Carlo simulations
In this section, we conduct Monte Carlo simulations to examine the finite sample performance of
our proposed testing method.
5.1 Data generating processes and implementation
We consider five data generating processes (DGPs). The first four DGPs only have individual fixed effects:

DGP 1: $y_{it} = \beta_{i1}^0 x_{it1} + \beta_{i2}^0 x_{it2} + \mu_i + u_{it}$,
DGPs 2-4: $y_{it} = \beta_{i1}^0 x_{it1} + \beta_{i2}^0 y_{i,t-1} + \mu_i + u_{it}$,

where $x_{itj} = \mu_i + e_{itj}$ for $j = 1, 2$, and $e_{it1}$, $e_{it2}$, $\mu_i$, and $u_{it}$ are IID $N(0,1)$ variables that are mutually independent of each other. DGP 5 has both individual and time fixed effects:

DGP 5: $y_{it} = \beta_{i1}^0 x_{it1} + \beta_{i2}^0 y_{i,t-1} + 0.5\mu_i + 0.5\lambda_t + u_{it}$,

where $x_{it1} = 1 + \mu_i + \lambda_t + e_{it1}$, and $e_{it1}$, $\mu_i$, $\lambda_t$, and $u_{it}$ are mutually independent IID $N(0,1)$ variables.
DGP 1 is a static panel structure, while DGPs 2-5 are dynamic panel structures. In DGPs 1, 2, and 5, $(\beta_{i1}^0, \beta_{i2}^0)$ has a group structure:

$$(\beta_{i1}^0, \beta_{i2}^0) = \begin{cases} (0.5,\ -0.5) & \text{with probability } 0.3, \\ (-0.5,\ 0.5) & \text{with probability } 0.3, \\ (0,\ 0) & \text{with probability } 0.4. \end{cases}$$

Therefore in DGPs 1, 2, and 5, the true number of groups is 3. In DGP 3, we consider a completely heterogeneous (random coefficient) panel structure where $\beta_{i1}^0$ and $\beta_{i2}^0$ follow $N(0.5, 1)$ and $U(-0.5, 0.5)$, respectively. In principle, the true number of groups equals the cross-sectional dimension $N$ in this case. In DGP 4, $(\beta_{i1}^0, \beta_{i2}^0)$ is similar to that in DGPs 1, 2, and 5 except that it has some additional small disturbance. Specifically,

$$(\beta_{i1}^0, \beta_{i2}^0) = \begin{cases} (0.5 + 0.1 v_{i1},\ -0.5 + 0.1 v_{i2}) & \text{with probability } 0.3, \\ (-0.5 + 0.1 v_{i1},\ 0.5 + 0.1 v_{i2}) & \text{with probability } 0.3, \\ (0.1 v_{i1},\ 0.1 v_{i2}) & \text{with probability } 0.4, \end{cases}$$

where $v_{i1}$ and $v_{i2}$ are each IID $N(0,1)$, mutually independent, and independent of $\mu_i$, the $e_{itj}$'s, and $u_{it}$. DGP 4 can be thought of as a small deviation from a group structure.
For each DGP, we first test the null hypotheses $H_0(1)$, $H_0(2)$, and $H_0(3)$ to examine the level and power of our test.

We then use our tests to determine the number of groups as described in Section 2.1. We set $K_{\max} = 8$ and let the nominal size decrease with the time series dimension $T$ to ensure that the type I error decreases with $T$. Specifically, we let the nominal size be $1/T$, which equals 0.10 and 0.025 for $T = 10$ and 40, respectively.5 If all eight hypotheses $H_0(1), \ldots, H_0(8)$ are rejected, then we stop and conclude that the number of groups is greater than 8, and we can adopt the random coefficient model.
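The sequential rule just described can be summarized in a few lines (an illustrative sketch of ours; the p-value delivered by the test at each step is treated as given):

```python
def determine_num_groups(pvalues, T, K_max=8):
    """Test H0(1), H0(2), ... at nominal size 1/T and return the first K
    whose null is not rejected, or K_max + 1 if all K_max hypotheses are
    rejected (in which case one may adopt the random coefficient model)."""
    alpha = 1.0 / T
    for K, p in enumerate(pvalues[:K_max], start=1):
        if p >= alpha:          # fail to reject H0(K): stop here
            return K
    return K_max + 1            # all rejected: more than K_max groups
```

For example, with p-values (0.0002, 0.027, 0.423) and T = 10, the rule rejects H0(1) and H0(2) at size 0.10 and stops at K = 3.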
For the combination of $N$ and $T$, we consider the typical case in empirical applications where $T$ is smaller than or comparable to $N$ and let $(N, T) = (40, 10)$, $(80, 10)$, and $(40, 40)$. The number of replications in the simulations is 1000.

One important step in implementing our testing procedure is to choose the tuning parameter $\lambda$. Following the theory in SSP, we let $\lambda = c \cdot \hat\sigma_Y^2 \cdot T^{-1/3}$, where $\hat\sigma_Y$ is the sample standard deviation of $y_{it}$ and $c$ is some constant. We use three different values of $c$ (0.25, 0.5, and 0.75) to examine the sensitivity of our results to $c$ (and thus $\lambda$).
As a comparison, we also consider the information criterion (IC) proposed in SSP to determine the number of groups. Specifically, we choose $K$ to minimize the following IC:

$$IC(K) = \ln\big(\hat\sigma^2_{(K,\lambda)}\big) + \rho_{NT}\,K, \quad K = 1, \ldots, K_{\max}, \qquad (5.1)$$

where $\hat\sigma^2_{(K,\lambda)} = \frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}\hat{u}_{it}^2$ and the $\hat{u}_{it}$'s are the residuals under the specification of $K$ groups with the tuning parameter $\lambda$; the $\hat{u}_{it}$'s are defined in (2.6), or in (4.6) in the presence of both individual and time fixed effects. $\rho_{NT}$ is a tuning parameter, and we set $\rho_{NT} = \frac{2}{3}(NT)^{-1/2}$ as in SSP. Note that SSP's IC can also be used to choose the tuning parameter $\lambda$ and $K$ jointly by minimizing $IC(K, \lambda)$ over both $K$ and $\lambda$.
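Given the residual variances for each candidate number of groups, the IC selection reduces to a one-line minimization. An illustrative sketch (ours, not the authors' code):

```python
import math

def choose_K_by_IC(sigma2, N, T):
    """Pick K minimizing IC(K) = ln(sigma2[K-1]) + rho * K, where
    rho = (2/3) * (N*T)**(-1/2) and sigma2 is indexed by K = 1, 2, ..."""
    rho = (2.0 / 3.0) * (N * T) ** (-0.5)
    ics = [math.log(s2) + rho * (K + 1) for K, s2 in enumerate(sigma2)]
    return ics.index(min(ics)) + 1
```

The penalty $\rho_{NT} K$ trades off fit against the number of groups: adding a group is chosen only when it reduces $\ln \hat\sigma^2$ by more than $\rho_{NT}$.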
5.2 Simulation results
Table 1 shows the level and power behavior of our test statistics for testing the three null hypotheses $H_0(1)$, $H_0(2)$, and $H_0(3)$. We choose three conventional nominal levels: 0.01, 0.05, and 0.10. For DGPs 1, 2, and 5, the true number of groups is 3. For $H_0(1)$, the rejection frequencies are almost all 1 for all combinations of $N$ and $T$ and all three DGPs at all three nominal levels. For $H_0(2)$, the power of the test increases rapidly with both $N$ and $T$. For example, when $N = 40$ and $T = 40$, the rejection frequencies are all 1 or nearly 1 at all three nominal levels. For $H_0(3)$, we examine the level of our test and find that the rejection frequencies are fairly close to the nominal levels.
5We also try fixing the nominal level at 0.05. The results are similar and available upon request.
For the heterogeneous DGP 3, our test rejects all three hypotheses with frequencies of 1 or nearly 1 at all three nominal levels. This reflects the power of our test against global alternatives. For DGP 4, which represents a small deviation from the group structure, our test shows reasonable power for large $T$, though it rejects $H_0(3)$ with a small frequency when $T$ is small, as expected. Also note that all the testing results are quite robust to the values of $c$ (and thus $\lambda$).
Table 2A shows how our method performs in terms of determining the number of groups. Specifically, the main entries report the proportions of the replications in which the number of groups determined by our method equals the value indicated in the column heading, with the boundary columns collecting the cases of more than 5 groups (DGPs 1, 2, 4, and 5) and of fewer than 5 or more than 8 groups (DGP 3). For DGPs 1, 2, and 5, our method determines the correct number of groups (3) with a large probability. For example, when $N = 40$ and $T = 40$ for DGP 5, the number of groups determined by our testing procedure equals the true number (3) with probabilities ranging from 0.986 to 0.991. For DGP 3, where the true number of groups is $N$, our method determines a large number of groups (greater than 8) with probabilities exceeding 0.99 when $N = 40$ and $T = 40$. DGP 4 represents a small deviation from a 3-group structure. When the sample size is small ($N = 40$ and $T = 10$), with a high probability the number of groups determined by our method is 2 or 3. When the sample size increases to $N = 40$ and $T = 40$, our method determines 4 groups with a probability larger than 0.25. These results are reasonable. Intuitively, if $N$ and $T$ are small, the data can only provide limited information on the underlying DGP, and it is reasonable to apply a small group structure to serve as a good approximation to the true model. As $N$ and $T$ become large, more information on the underlying DGP is revealed, and it is sensible to adopt a larger number of groups to approximate the true model more accurately.
Table 2B shows the performance of SSP's information criterion (IC). Comparing Tables 2A and 2B, we find that for DGPs 1, 2, and 5, the performances of our method and SSP's IC are comparable. However, for DGPs 3 and 4, our method dominates SSP's. Specifically, for DGP 3, where the number of groups is $N$, SSP's IC tends to determine a small number of groups (fewer than 6 even when $N = 40$ and $T = 40$), while our method determines that the number of groups is greater than 8 in most cases. For DGP 4, in large samples ($N = 40$ and $T = 40$), SSP's IC determines 3 groups with a high probability, while our method determines a relatively large number of groups.
To examine the performance of the C-Lasso estimator following our method to determine the number of groups, we compare the following four estimators: (i) the common coefficient estimator, (ii) the random coefficient estimator, (iii) the post C-Lasso estimator with the number of groups determined by our method, and (iv) the infeasible estimator for which the true number of groups and group classification are known. We report the mean squared errors (MSE), calculated as the average value of $\frac{1}{N}\sum_{i=1}^{N}\|\check\beta_i - \beta_i^0\|^2$ over 1000 replications, where $\check\beta_i$ denotes the estimator under consideration. Table 3 presents the results for DGP 1.6 It is clear that our estimator dominates the common coefficient estimator and the random coefficient estimator. However, it performs worse than the infeasible estimator, especially when the sample size is small. This reflects the effect of classification errors in finite samples.
6 Empirical application: income and democracy
The relationship between income and democracy has attracted much attention in empirical re-
search; see, e.g., Lipset (1959), Barro (1999), AJRY, and BM. To the best of our knowledge, none
of the existing studies allows for heterogeneity in the slope coefficients in their model specifica-
tions. As discussed in AJRY, “societies may embark on divergent political-economic development
paths”. Thus, ignoring the heterogeneity in the slope coefficients may result in model misspecification and invalidate subsequent inferences. Hence, it is important to know whether the data support the assumption of homogeneous slope coefficients. If not, then we need to determine the number of heterogeneous groups and classify the countries using statistical methods. We apply our new method to study this important question.
6.1 Data and implementation
We let $y_{it}$ be a measure of democracy for country $i$ in period $t$ and $x_{it}$ be the logarithm of its real GDP per capita. The measure of democracy and real GDP per capita are from the Freedom House and Penn World Tables, respectively.7 Note that the Freedom House measures of democracy ($y_{it}$) are normalized to be between 0 and 1.

We consider the specification with both individual and time fixed effects

$$y_{it} = \beta_{i1} x_{i,t-1} + \beta_{i2} y_{i,t-1} + \mu_i + \lambda_t + u_{it},$$

and assume that $(\beta_{i1}, \beta_{i2})$ has a group structure to account for possible heterogeneity. In a closely related paper, BM consider a group structure in the interactive fixed effects and assume $(\beta_{i1}, \beta_{i2})$ is constant for all $i$'s.

We use a balanced panel dataset similar to that in BM. The number of countries ($N$) is 74. The time index is $t = 1, \ldots, 8$, where each period corresponds to a five-year interval over 1961-2000. For example, $t = 1$ refers to years 1961-1965. Because the lagged $y_{it}$ is used as a regressor, the number of time periods ($T$) is 7, i.e., $t = 2, \ldots, 8$. The choice of countries is determined by data availability. In addition, we exclude the countries whose measures of democracy remain constant
6The results for other DGPs are available upon request.
7All the data are directly from AJRY: http://economics.mit.edu/faculty/acemoglu/data/ajry2008.
Table 1: Empirical rejection frequency
c = 0.25 c = 0.5 c = 0.75
N T 0.01 0.05 0.10 0.01 0.05 0.10 0.01 0.05 0.10
40 10 0.998 0.999 1 0.998 0.999 1 0.998 0.999 1
K = 1 80 10 1 1 1 1 1 1 1 1 1
40 40 1 1 1 1 1 1 1 1 1
40 10 0.126 0.312 0.431 0.121 0.297 0.414 0.115 0.291 0.410
DGP 1 K = 2 80 10 0.277 0.530 0.665 0.277 0.523 0.663 0.262 0.524 0.652
40 40 0.999 1 1 0.999 1 1 0.999 1 1
40 10 0.000 0.014 0.035 0.005 0.025 0.060 0.014 0.057 0.099
K = 3 80 10 0.001 0.013 0.036 0.003 0.029 0.067 0.015 0.059 0.124
40 40 0.003 0.021 0.041 0.008 0.032 0.069 0.016 0.065 0.118
40 10 1 1 1 1 1 1 1 1 1
K = 1 80 10 1 1 1 1 1 1 1 1 1
40 40 1 1 1 1 1 1 1 1 1
40 10 0.059 0.205 0.317 0.057 0.184 0.294 0.048 0.170 0.285
DGP 2 K = 2 80 10 0.164 0.383 0.539 0.143 0.362 0.513 0.136 0.350 0.497
40 40 1 1 1 1 1 1 1 1 1
40 10 0.001 0.005 0.013 0.001 0.009 0.018 0.001 0.017 0.035
K = 3 80 10 0.000 0.003 0.015 0.002 0.008 0.019 0.001 0.022 0.048
40 40 0.001 0.013 0.028 0.005 0.023 0.040 0.011 0.046 0.088
40 10 1 1 1 1 1 1 1 1 1
K = 1 80 10 1 1 1 1 1 1 1 1 1
40 40 1 1 1 1 1 1 1 1 1
40 10 0.998 1 1 0.998 1 1 0.999 1 1
DGP 3 K = 2 80 10 1 1 1 1 1 1 1 1 1
40 40 1 1 1 1 1 1 1 1 1
40 10 0.975 0.996 0.998 0.982 0.998 0.998 0.993 0.999 1
K = 3 80 10 1 1 1 1 1 1 1 1 1
40 40 1 1 1 1 1 1 1 1 1
40 10 1 1 1 1 1 1 1 1 1
K = 1 80 10 1 1 1 1 1 1 1 1 1
40 40 1 1 1 1 1 1 1 1 1
40 10 0.124 0.299 0.436 0.117 0.283 0.420 0.099 0.273 0.407
DGP 4 K = 2 80 10 0.340 0.598 0.717 0.322 0.565 0.700 0.306 0.533 0.687
40 40 1 1 1 1 1 1 1 1 1
40 10 0.001 0.018 0.043 0.004 0.018 0.050 0.004 0.034 0.080
K = 3 80 10 0.010 0.045 0.077 0.015 0.061 0.103 0.024 0.084 0.148
40 40 0.262 0.488 0.624 0.317 0.537 0.668 0.382 0.617 0.742
40 10 1 1 1 1 1 1 1 1 1
K = 1 80 10 1 1 1 1 1 1 1 1 1
40 40 1 1 1 1 1 1 1 1 1
40 10 0.380 0.610 0.748 0.364 0.604 0.737 0.370 0.611 0.735
DGP 5 K = 2 80 10 0.718 0.889 0.938 0.711 0.892 0.939 0.709 0.887 0.936
40 40 1 1 1 1 1 1 1 1 1
40 10 0.010 0.035 0.056 0.010 0.034 0.058 0.010 0.042 0.070
K = 3 80 10 0.008 0.036 0.070 0.012 0.047 0.085 0.025 0.069 0.127
40 40 0.002 0.020 0.052 0.003 0.022 0.058 0.006 0.030 0.066
Table 2A: Frequency of number of groups determined by our method
N T K = 1 K = 2 K = 3 K = 4 K = 5 K > 5
40 10 0 0.570 0.395 0.034 0.001 0
c = 0.25 80 10 0 0.336 0.628 0.034 0.001 0.001
40 40 0 0 0.987 0.012 0.001 0
40 10 0 0.586 0.354 0.049 0.008 0.003
DGP 1 c = 0.5 80 10 0 0.338 0.596 0.051 0.012 0.003
40 40 0 0 0.981 0.015 0.004 0
40 10 0 0.590 0.313 0.068 0.018 0.011
c = 0.75 80 10 0 0.349 0.529 0.076 0.038 0.008
40 40 0 0 0.963 0.029 0.008 0
N T K = 1 K = 2 K = 3 K = 4 K = 5 K > 5
40 10 0 0.683 0.304 0.013 0 0
c = 0.25 80 10 0 0.461 0.524 0.014 0.001 0
40 40 0 0 0.996 0.004 0 0
40 10 0 0.707 0.275 0.016 0.002 0
DGP 2 c = 0.5 80 10 0 0.487 0.494 0.013 0.006 0
40 40 0 0 0.991 0.009 0 0
40 10 0 0.716 0.251 0.025 0.007 0.001
c = 0.75 80 10 0 0.506 0.447 0.036 0.011 0
40 40 0 0 0.976 0.024 0 0
N T K < 5 K = 5 K = 6 K = 7 K = 8 K > 8
40 10 0.024 0.027 0.035 0.048 0.011 0.855
c = 0.25 80 10 0 0.002 0 0.001 0 0.997
40 40 0 0 0 0 0.002 0.998
40 10 0.009 0.012 0.013 0.021 0.009 0.936
DGP 3 c = 0.5 80 10 0 0.001 0 0 0 0.999
40 40 0 0 0 0 0.000 1
40 10 0.004 0.004 0.006 0.009 0.006 0.971
c = 0.75 80 10 0 0 0.001 0 0 0.999
40 40 0 0 0 0 0.001 0.999
N T K = 1 K = 2 K = 3 K = 4 K = 5 K > 5
40 10 0 0.564 0.393 0.038 0.005 0
c = 0.25 80 10 0 0.283 0.640 0.071 0.006 0
40 40 0 0 0.620 0.264 0.110 0.006
40 10 0 0.581 0.369 0.038 0.012 0
DGP 4 c = 0.5 80 10 0 0.301 0.597 0.085 0.016 0.001
40 40 0 0 0.576 0.283 0.132 0.009
40 10 0 0.594 0.327 0.057 0.019 0.003
c = 0.75 80 10 0 0.314 0.539 0.111 0.031 0.005
40 40 0 0 0.489 0.343 0.162 0.006
N T K = 1 K = 2 K = 3 K = 4 K = 5 K > 5
40 10 0 0.252 0.692 0.053 0.002 0.001
c = 0.25 80 10 0 0.062 0.868 0.063 0.006 0.001
40 40 0 0 0.991 0.007 0.002 0
40 10 0 0.263 0.679 0.054 0.004 0
DGP 5 c = 0.5 80 10 0 0.062 0.853 0.081 0.003 0.001
40 40 0 0 0.991 0.006 0.003 0
40 10 0 0.265 0.665 0.062 0.008 0
c = 0.75 80 10 0 0.064 0.811 0.118 0.007 0
40 40 0 0 0.986 0.012 0.002 0
Table 2B: Frequency of number of groups determined by the information criterion
N T K = 1 K = 2 K = 3 K = 4 K = 5 K > 5
40 10 0 0.486 0.500 0.012 0.002 0
c = 0.25 80 10 0 0.128 0.807 0.059 0.006 0
40 40 0 0.001 0.999 0 0 0
40 10 0 0.665 0.327 0.008 0 0
DGP 1 c = 0.5 80 10 0 0.280 0.680 0.037 0.003 0
40 40 0 0.001 0.999 0 0 0
40 10 0 0.795 0.201 0.004 0 0
c = 0.75 80 10 0 0.463 0.519 0.017 0.001 0
40 40 0 0.001 0.998 0.001 0 0
N T K = 1 K = 2 K = 3 K = 4 K = 5 K > 5
40 10 0 0.548 0.437 0.014 0.001 0
c = 0.25 80 10 0 0.174 0.786 0.039 0.001 0
40 40 0 0 1 0 0 0
40 10 0 0.697 0.298 0.004 0.001 0
DGP 2 c = 0.5 80 10 0 0.310 0.660 0.024 0.006 0
40 40 0 0 1 0 0 0
40 10 0 0.809 0.189 0.002 0 0
c = 0.75 80 10 0 0.510 0.458 0.030 0.002 0
40 40 0 0 0.999 0.001 0 0
N T K < 5 K = 5 K = 6 K = 7 K = 8 K > 8
40 10 0.962 0.033 0.005 0 0 0
c = 0.25 80 10 0.952 0.048 0 0 0 0
40 40 0.666 0.210 0.101 0.019 0.004 0
40 10 0.986 0.012 0.002 0 0 0
DGP 3 c = 0.5 80 10 0.979 0.021 0 0 0 0
40 40 0.808 0.128 0.049 0.012 0.003 0
40 10 0.992 0.006 0.001 0.001 0 0
c = 0.75 80 10 0.985 0.014 0 0.001 0 0
40 40 0.848 0.114 0.022 0.013 0.003 0
N T K = 1 K = 2 K = 3 K = 4 K = 5 K > 5
40 10 0 0.512 0.455 0.032 0.001 0
c = 0.25 80 10 0 0.155 0.738 0.096 0.011 0
40 40 0 0.002 0.996 0.002 0 0
40 10 0 0.645 0.337 0.017 0.001 0
DGP 4 c = 0.5 80 10 0 0.298 0.648 0.047 0.007 0
40 40 0 0.001 0.995 0.004 0 0
40 10 0 0.770 0.216 0.013 0.001 0
c = 0.75 80 10 0 0.503 0.453 0.039 0.005 0
40 40 0 0.002 0.985 0.013 0 0
N T K = 1 K = 2 K = 3 K = 4 K = 5 K > 5
40 10 0 0.159 0.785 0.056 0 0
c = 0.25 80 10 0 0.010 0.888 0.096 0.005 0.001
40 40 0 0 1 0 0 0
40 10 0 0.181 0.752 0.066 0.001 0
DGP 5 c = 0.5 80 10 0 0.016 0.826 0.155 0.002 0.001
40 40 0 0 1 0 0 0
40 10 0 0.220 0.699 0.079 0.002 0
c = 0.75 80 10 0 0.031 0.697 0.267 0.004 0.001
40 40 0 0 1 0 0 0
Table 3: Comparison of MSE of various estimators, DGP 1
Common Random Post C-Lasso
N T coefficient coefficient c = 0.25 c = 0.5 c = 0.75 Infeasible
40 10 0.308 0.334 0.175 0.178 0.183 0.017
80 10 0.304 0.334 0.160 0.166 0.173 0.008
40 40 0.302 0.056 0.019 0.024 0.030 0.004
Table 4: Summary statistics ( = 74)
Time period Years y: democracy x: logarithm of real GDP per capita
mean median s.d. mean median s.d.
1 1961 - 1965 0.558 0.540 0.261 7.694 7.657 0.811
2 1966 - 1970 0.367 0.333 0.333 7.832 7.759 0.853
3 1971 - 1975 0.340 0.333 0.311 7.926 7.958 0.886
4 1976 - 1980 0.419 0.333 0.317 8.031 8.022 0.927
5 1981 - 1985 0.455 0.333 0.349 8.055 8.072 0.942
6 1986 - 1990 0.505 0.500 0.340 8.098 8.158 1.009
7 1991 - 1995 0.543 0.500 0.331 8.143 8.233 1.102
8 1996 - 2000 0.604 0.667 0.325 - - -
over the seven periods ($t = 2, \ldots, 8$). A list of the 74 countries can be found in Table 8. Table 4
presents the summary statistics. The implementation details of our method are the same as in
the simulations.
6.2 Testing and estimation results
We first test the hypothesis $H_0(1)$, i.e., we test whether $(\beta_{i1}, \beta_{i2})$ is constant across all $i$. We soundly reject this hypothesis with a p-value less than 0.001. This provides strong evidence that the slope coefficients are not homogeneous. We then test the hypothesis $H_0(2)$ for three values of the tuning parameter $\lambda = c \cdot \hat\sigma_Y^2 \cdot T^{-1/3}$ ($c = 0.25$, 0.5, and 0.75). We reject this hypothesis at the 5% level with p-values of 0.027, 0.027, and 0.020 for $c = 0.25$, 0.5, and 0.75, respectively. We continue to test $H_0(3)$ and find that the p-values are 0.423, 0.204, and 0.143 for $c = 0.25$, 0.5, and 0.75, respectively. Considering that these p-values are above or close to the recommended nominal level $1/T$ (0.143), we stop the testing procedure and conclude that the number of groups is 3. Table 5 presents all the testing results and Table 8 shows the country membership of the three groups. To take into account the issue of multiple testing, we also report Holm's (1979) adjusted p-values in the last row of Table 5,8 which also lend strong support to the conclusion of three groups in the data.
8Suppose that we have tested $m^*$ individual hypotheses, and we order the individual p-values from the smallest to the largest as $p_{(1)} \le p_{(2)} \le \cdots \le p_{(m^*)}$, with their corresponding null hypotheses labeled accordingly as $H_0^{(1)}, H_0^{(2)}, \ldots, H_0^{(m^*)}$. We calculate the step-down Holm adjusted p-values for testing $H_0^{(j)}$ as

$$\text{adjusted-}p_{(j)} = \min\Big\{\max_{l \le j}\big[(m^* - l + 1)\,p_{(l)}\big],\ 1\Big\}.$$
We also consider SSP's IC (see equation (5.1) above) to determine the number of groups. Again, we try the three choices of $\lambda$ as above with $c = 0.25$, 0.5, and 0.75. When $c = 0.25$ and 0.5, the IC determines 3 groups, as shown in Figure 1. When $c = 0.75$, the IC determines 2 groups.9 Given that the majority of our results determine 3 groups, we conclude that $K = 3$.
Table 6 presents the estimation results. We report both C-Lasso estimates and post C-Lasso estimates. The C-Lasso estimates are defined in Section 4.1, and the post C-Lasso estimation is implemented as in (4.5). Both sets of estimates are bias-corrected, and the standard errors are obtained using the asymptotic theory developed in SSP. The C-Lasso and post C-Lasso estimation results are similar across the different values of $c$ = 0.25, 0.5, and 0.75. In the following discussion, we focus on the post C-Lasso estimates with $c = 0.5$. It is clear that the estimated slope coefficients exhibit substantial heterogeneity. For $\beta_1$, the estimates for the three groups are 0.257, -0.160, and -0.235. All of them are significant at the 1% level. It is interesting to note that not only the magnitudes but also the signs of the estimates differ among the three groups. For Group 1, income has a positive effect on democracy, while for Groups 2 and 3, the effects are negative with different magnitudes. For $\beta_2$, the three group estimates are 0.097, -0.237, and 0.412. The first estimate is not significant, while the second and third estimates are significant at the 1% level. We also present the point estimates of the cumulative income effect (CIE), which is defined as $\beta_1/(1 - \beta_2)$. The estimates of the CIE for the three groups are 0.284, -0.129, and -0.400, which imply that a 10% increase in income per capita is associated with changes of 0.0284, -0.0129, and -0.0400 in the measure of democracy, respectively.
Note that if we instead assume that $\beta_1$ and $\beta_2$ are homogeneous across $i$, then the common estimates of $\beta_1$ and $\beta_2$ are -0.017 and 0.273, with t-statistics of -0.450 and 5.950, respectively. This suggests that the effect of income on democracy is not statistically significant. This finding is consistent with that in AJRY. Nevertheless, the insignificance could be due to a model misspecification that ignores slope heterogeneity. Note that the common estimate falls in the middle of the group estimates. Here all group estimates of $\beta_1$ are significant, but with different signs. Therefore the common estimate, which can be thought of as a certain weighted average of the group estimates, could be close to zero and statistically insignificant.
To understand the heterogeneity in the data intuitively, we select three countries (Malaysia, Indonesia, and Nepal) and show their time-series data in Table 7. We simply calculate the correlations between the dependent variable $y_{it}$ and the key explanatory variable $x_{i,t-1}$. Even these simple correlations exhibit substantial heterogeneity, with values of -0.863, 0.069, and 0.658. This suggests that it is implausible for the slope coefficients to be the same for all countries, even without performing any sophisticated analysis.
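These correlations can be reproduced directly from the series in Table 7; for instance, for Malaysia (our code, with the data copied from the table):

```python
import numpy as np

# Democracy (y) and log income per capita (x) for Malaysia, periods 1-8 (Table 7)
y_malaysia = np.array([0.800, 0.833, 0.667, 0.667, 0.667, 0.333, 0.500, 0.333])
x_malaysia = np.array([7.823, 7.967, 8.186, 8.492, 8.603, 8.783, 9.072, 9.202])

# Correlation between y_t and the lagged x_{t-1}, t = 2, ..., 8
r = np.corrcoef(y_malaysia[1:], x_malaysia[:-1])[0, 1]
```

The computed value is approximately -0.863, matching the entry reported in Table 7.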
This application shows that ignoring the heterogeneity in the slope coefficients can mask the
true underlying relationship among economic variables.
9The values of IC for 2 groups and 3 groups are -3.3888 and -3.3862, respectively, which are actually quite close.
Table 5: Test statistics
c = 0.25 c = 0.5 c = 0.75
Null hypothesis K = 1 K = 2 K = 3 K = 1 K = 2 K = 3 K = 1 K = 2 K = 3
Statistics 3.566 1.930 0.195 3.566 1.925 0.829 3.566 2.060 1.065
p-values 0.0002 0.027 0.423 0.0002 0.027 0.204 0.0002 0.020 0.143
Holm adjusted 0.0006 0.054 0.423 0.0006 0.054 0.204 0.0006 0.040 0.143
p-values
Figure 1: Information criteria (IC) for different numbers of groups (IC with c = 0.25, c = 0.5, and c = 0.75)
6.3 Explaining the group pattern
As discussed above, the group estimates of $\beta_1$ and $\beta_2$ show substantial heterogeneity, though most of them are statistically significant at the 5% level. So far, our classification of the groups is completely statistical and does not use any a priori information. One natural question is how to explain the group membership. Apparently, the group membership shown in Table 8 does not match the countries' geographic locations. We further investigate the group pattern using a cross-sectional multinomial logit model.

We let the dependent variable be the group membership, which takes one of three values: 1, 2, or 3.10 The explanatory variables include (i) a measure of constraints on the executive at independence, (ii) the 500-year change in income per capita over 1500-2000, (iii) the 500-year change in democracy over 1500-2000, (iv) independence year/100, (v) initial education level in
10We only report the results for $c = 0.5$. The results for $c = 0.25$ and 0.75 are similar and available upon request.
Table 6: Estimation results
$\beta_1$ $\beta_2$ CIE
estimates s.e. t-stat estimates s.e. t-stat estimates
common estimation -0.017 0.038 -0.450 0.273 0.046 5.950 -0.023
Group 1 0.222 0.051 4.402 0.119 0.058 2.042 0.252
C-Lasso Group 2 -0.221 0.032 -6.920 -0.305 0.055 -5.519 -0.169
c = 0.25 Group 3 -0.250 0.102 -2.443 0.540 0.052 10.423 -0.543
Group 1 0.249 0.046 5.375 0.091 0.058 1.586 0.273
Post Group 2 -0.189 0.033 -5.670 -0.240 0.041 -5.896 -0.152
C-Lasso Group 3 -0.281 0.095 -2.966 0.493 0.049 9.991 -0.554
Group 1 0.273 0.041 6.594 0.122 0.067 1.826 0.311
C-Lasso Group 2 -0.176 0.031 -5.662 -0.324 0.068 -4.767 -0.133
c = 0.5 Group 3 -0.206 0.084 -2.451 0.480 0.065 7.420 -0.396
Group 1 0.257 0.045 5.663 0.097 0.064 1.510 0.284
Post Group 2 -0.160 0.033 -4.795 -0.237 0.047 -5.051 -0.129
C-Lasso Group 3 -0.235 0.078 -3.020 0.412 0.061 6.782 -0.400
Group 1 0.309 0.042 7.458 0.052 0.069 0.751 0.326
C-Lasso Group 2 -0.152 0.032 -4.752 -0.280 0.058 -4.800 -0.119
c = 0.75 Group 3 -0.145 0.077 -1.900 0.502 0.059 8.449 -0.292
Group 1 0.266 0.047 5.714 0.037 0.067 0.552 0.276
Post Group 2 -0.159 0.033 -4.764 -0.237 0.047 -5.002 -0.128
C-Lasso Group 3 -0.191 0.070 -2.721 0.432 0.054 8.061 -0.336
Note: CIE stands for cumulative income effect, which is defined as $\beta_1/(1 - \beta_2)$.
Table 7: Correlation between $y_{it}$ and $x_{i,t-1}$ for selected countries
Malaysia Indonesia Nepal
Time period $y$ $x$ $y$ $x$ $y$ $x$
1 0.800 7.823 0.100 6.798 0.290 6.652
2 0.833 7.967 0.333 6.992 0.167 6.704
3 0.667 8.186 0.333 7.257 0.167 6.785
4 0.667 8.492 0.333 7.547 0.667 6.756
5 0.667 8.603 0.333 7.731 0.667 6.908
6 0.333 8.783 0.167 7.955 0.500 6.991
7 0.500 9.072 0.000 8.201 0.667 7.125
8 0.333 9.202 0.667 8.200 0.667 7.286
Correlation between $y_{it}$ and $x_{i,t-1}$: -0.863 0.069 0.658
1965, (vi) initial income level in 1965, and (vii) initial democracy level in 1965. Acemoglu, Johnson, and Robinson (2005) argue that (i) is an important determinant of democracy. (ii) and (iii) represent long-run changes in income and democracy levels, respectively. (iv) measures how recently a country became independent. (v), (vi), and (vii) are the initial values of the key economic variables. All the data are taken directly from AJRY.
Table 9 provides summary statistics for each of the three groups. The sample averages of (i), the constraints on the executive at independence, are clearly substantially different among the three groups: for this variable, the average value for Group 1 is only about half of that for Group 3. The variable (ii), the 500-year change in income per capita, also differs noticeably among the three groups: Group 2 has the largest sample average of 2.258, while Group 3 has the smallest value of 1.501.
Table 10 presents the multinomial logit regression results for various model specifications. We choose Group 3 as the base group. Compared with Group 3, at the 5% level, a higher value of (i), the constraints on the executive at independence, leads to a reduced likelihood of being in Group 1. On the other hand, a higher value of (ii), the 500-year change in income per capita, leads to a higher likelihood of being in Group 2. For the case $c = 0.5$ reported here, the variable (iv) (independence year/100) is also significant at the 5% level. However, this result is not robust to other choices of $c$. In summary, we find that the constraints on the executive at independence and the long-run economic growth are important determinants of our group pattern.
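A multinomial logit with a designated base group works by fixing that group's coefficient vector at zero and modeling each remaining group's log-odds against it. A minimal self-contained sketch of the implied choice probabilities (our own illustration, not the authors' estimation code):

```python
import numpy as np

def mnl_probs(X, B):
    """Multinomial logit probabilities with the last category as base.

    X: (n, p) covariates; B: (p, J-1) coefficients for the non-base groups
    (the base group's coefficients are normalized to zero).
    """
    eta = np.hstack([X @ B, np.zeros((X.shape[0], 1))])
    e = np.exp(eta - eta.max(axis=1, keepdims=True))  # stabilized softmax
    return e / e.sum(axis=1, keepdims=True)
```

Choosing Group 3 as the base means each reported coefficient in Table 10 measures the effect of a covariate on the log-odds of membership in Group 1 or Group 2 relative to Group 3.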
7 Conclusion
We develop a data-driven testing procedure to determine the number of groups in a latent group
panel structure proposed in SSP. The procedure is based on conducting hypothesis testing on the
model specifications. The test is a residual-based LM-type test and is asymptotically normally
distributed under the null. We apply our new method to study the relationship between income
and democracy and find strong evidence that the slope coefficients are heterogeneous and form
three distinct groups. Further, we find that the constraints on the executive at independence and
long-run economic growth are important determinants of the group pattern.
There are several interesting topics for further research. Here we apply our testing procedure
to determine the number of groups for slope coefficients. The same idea can be applied to other
group structures, such as those considered in BM where fixed effects have a grouped pattern. We
may also extend our methods to non-linear panel data models such as discrete choice models.
Table 8: Classification of countries
Group 1 (“positive $\beta_1$ and insignificant $\beta_2$”) ($N_1 = 21$)
Brazil Cyprus Algeria Ecuador Spain
Ghana Greece Jordan Korea Rep. Malawi
Nepal Panama Philippines Portugal Rwanda
Thailand Taiwan Uruguay Venezuela RB Congo Dem. Rep
Zambia
Group 2 (“negative $\beta_1$ and negative $\beta_2$”) ($N_2 = 16$)
Argentina Congo Rep. Dominican Republic Finland Gabon
Indonesia Japan Luxembourg Mexico Nigeria
Singapore El Salvador Togo Tunisia Turkey
Uganda
Group 3 (“negative $\beta_1$ and positive $\beta_2$”) ($N_3 = 37$)
Burundi Benin Burkina Faso Bolivia Central African Republic
Chile China Cameroon Colombia Egypt Arab Rep
Guinea Guatemala Guyana Honduras India
Iran Israel Jamaica Kenya Sri Lanka
Morocco Madagascar Mali Mauritania Malaysia
Niger Nicaragua Peru Paraguay Romania
Sierra Leone Sweden Syrian Arab Republic Chad Trinidad and Tobago
Tanzania South Africa
Table 9: Summary statistics by group
                                                        Group 1         Group 2         Group 3
Variables    Variable description                       mean    s.d.    mean    s.d.    mean    s.d.
constraint   constraints on the executive
             at independence                            0.148   0.239   0.334   0.298   0.392   0.364
growth       500-year income per capita change          2.004   1.182   2.258   1.086   1.501   0.951
democ        500-year democracy change                  0.803   0.231   0.660   0.295   0.654   0.288
indcent      year of independence/100                  18.913   0.726  18.947   0.710  19.104   0.667
edu65        education level in 1965                    2.778   1.516   2.805   2.170   2.621   1.931
inc65        logarithm of real GDP
             per capita in 1965                         7.715   0.822   7.925   0.968   7.582   0.729
dem65        measure of democracy in 1965               0.507   0.271   0.634   0.245   0.554   0.261
Table 10: Determinants of the group pattern
                 (1)       (2)       (3)       (4)       (5)       (6)       (7)
Group 1
constraints   -3.011**  -3.259*** -3.044*** -5.201*** -5.546*** -5.924*** -5.990***
              (1.200)   (1.183)   (1.132)   (1.526)   (1.682)   (1.813)   (1.887)
growth                   0.599**   0.580*    0.958**   1.157*    1.668**   1.693**
                        (0.294)   (0.350)   (0.405)   (0.605)   (0.723)   (0.744)
democ                              1.651     3.936**   4.947*    4.694*    4.478*
                                  (1.399)   (1.854)   (2.661)   (2.704)   (2.661)
indcent                                      1.772**   2.116**   1.917**   1.973**
                                            (0.765)   (0.893)   (0.942)   (0.957)
edu65                                                 -0.344    -0.136    -0.080
                                                      (0.340)   (0.366)   (0.414)
inc65                                                           -1.406    -1.360
                                                                (0.932)   (0.955)
dem65                                                                     -0.533
                                                                          (2.019)
Group 2
constraints   -0.498    -1.032    -0.989    -1.391    -1.578    -1.991    -2.436
              (0.911)   (0.946)   (0.976)   (1.294)   (1.616)   (1.740)   (1.829)
growth                   0.724**   0.889**   1.016**   1.193**   1.760**   1.915**
                        (0.324)   (0.372)   (0.406)   (0.604)   (0.766)   (0.805)
democ                             -0.977    -0.713    -0.674    -0.730    -0.558
                                  (1.219)   (1.420)   (2.272)   (2.386)   (2.315)
indcent                                      0.345     0.242     0.055     0.102
                                            (0.698)   (0.923)   (0.990)   (0.998)
edu65                                                 -0.286    -0.089    -0.252
                                                      (0.353)   (0.377)   (0.421)
inc65                                                           -1.398    -1.618
                                                                (0.992)   (1.045)
dem65                                                                      1.966
                                                                          (2.167)
Note: *, ** and *** denote significance at the 10%, 5%, and 1% levels, respectively. The results are
based on a multinomial logit regression where Group 3 is taken as the reference group; columns (1)-(7)
are nested specifications that add regressors one at a time. The standard errors are calculated without
taking into account the fact that the dependent variables are estimated.
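The estimator behind Table 10 can be sketched in code. Below is a minimal multinomial logit fit by gradient descent, with the last category as the reference group (mirroring the Group-3 normalization noted under the table). The data, coefficients, and sample size are synthetic placeholders invented for illustration, not the paper's country data.

```python
import numpy as np

# Synthetic data: 3 groups, 2 covariates (purely illustrative).
rng = np.random.default_rng(0)
n, p = 400, 2
X = rng.normal(size=(n, p))
W_true = np.array([[2.0, 0.0], [0.0, 2.0], [-2.0, -2.0]])
y = np.argmax(X @ W_true.T, axis=1)      # group label = index of the largest score

Xb = np.hstack([X, np.ones((n, 1))])      # append an intercept column
B = np.zeros((2, p + 1))                  # coefficients for groups 1 and 2; group 3 is the reference

def probs(B, Xb):
    # Scores for the reference group are normalized to zero.
    scores = np.hstack([Xb @ B.T, np.zeros((Xb.shape[0], 1))])
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

Y = np.eye(3)[y]                          # one-hot labels
for _ in range(800):                      # gradient descent on the average negative log-likelihood
    P = probs(B, Xb)
    grad = (P - Y)[:, :2].T @ Xb / n
    B -= 1.0 * grad

acc = (probs(B, Xb).argmax(axis=1) == y).mean()
```

In Table 10 the analogue of `B` holds one coefficient vector per non-reference group; standard errors there would additionally require the information matrix and, as the table's note warns, ignore estimation error in the group assignments.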
APPENDIX
In this appendix we prove the main results in the paper. The proofs rely on some technical lemmas
given in Appendix B.
A Proof of the main results
Proof of Theorem 3.1. Let = (1 )0 and
= (00)
−1 0 Then by (2.7) and the fact
that 0 = , we have
=0 +0(0 − ) (A.1)
andp1 (0) =
Ã−12
X=1
000 −
!+−12
X=1
(0 − )0 0
0(0 − )
+2−12X=1
00(0 − )
≡ (1 − ) +2 + 23 , say. (A.2)
We complete the proof by showing that under H₀(K₀): (i) D₁NT →d N(0, V₀), (ii) D₂NT = o_P(1),
and (iii) D₃NT = o_P(1), where V₀ = lim₍N,T₎→∞ V_NT. We prove (i)-(iii) in Propositions A.1-A.3 below.
Proposition A.1 D₁NT →d N(0, V₀) under H₀(K₀).
Proof. Recall that = 00 and that denotes the ( )’th element of : =
−1P
=1
P=1 0
¡−1 0
0
¢−1 where = 1 − −1 and 1 = 1 = Let ≡
−1P
=1
P=1 0
Ω−1 Observe that 1 − = 2−12
P=1
P1≤≤ +
2−12P
=1
P1≤≤
¡ −
¢ ≡ 11 + 12 It suffices to show that: (i) 11→
(0 0) and (ii) 12 = (1)
First, we show (i). Using = 1 − −1 we have
11 =2
√
X=1
X1≤≤
X=1
X=1
0Ω−1
=2
√
X=1
X1≤≤
0Ω−1 − 2
2√
X=1
X1≤≤
X=1
0Ω−1
− 2
2√
X=1
X1≤≤
X=1
0Ω−1 +
2
3√
X=1
X1≤≤
X=1
X=1
0Ω−1
=2
√
X=1
X1≤≤
†0Ω−1
† −
2
2√
X=1
X1≤≤
X=1
[ − ()]0Ω−1
− 2
2√
X=1
X1≤≤
X=1
0Ω−1 [ − ()]
+2
3√
X=1
X1≤≤
X=1
X=1
[ − ()]0Ω−1 [ − ()]
+4
3√
X=1
X1≤≤
X=1
X=1
[ − ()]0Ω−1 ()
≡ 111 +112 +113 +114 +115 say.
By Lemmas B.4(ii)-(v), the last four terms above are o_P(1). We are left to show that the first term converges in distribution to N(0, V₀). Observe that
111 =2
√
X=1
X1≤≤
†0Ω−1
† =
X=2
where ≡ 2−1−12P
=1
P−1=1
0 and ≡ Ω−12
† By Assumption A.1(v)
(|F−1) ≡ 2−1−12X=1
−1X=1
0(|F−1) = 0
That is, {z_NT,t, F_t} is a martingale difference sequence (m.d.s.). By the martingale CLT (e.g., Pollard (1984, p. 171)), it suffices to
show that
Z ≡X=2
F−1h||4
i= (1) and
X=2
2 − = (1) (A.3)
where E_{F_{t−1}} denotes expectation conditional on F_{t−1}. Observing that Z_NT ≥ 0, it suffices to show that Z_NT = o_P(1) by showing that E(Z_NT) = o(1) and applying the Markov inequality. By Assumptions A.1(iv)-(v),
(Z) =16
42
X=2
X=1
X=1
X=1
X=1
X1≤≤−1
¡0
0
0
0
¢= 48Z1 + 16Z2
where
Z1 ≡ 1
42
X=2
X=1
X=1 6=
X1≤≤−1
¡0
0
2
¢¡0
0
2
¢ and(A.4)
Z2 ≡ 1
42
X=2
X=1
X1≤≤−1
¡0
0
0
0
4
¢ (A.5)
For the moment we assume that p = 1, so that the p × 1 vectors can be treated as scalars. (The general case
follows from the Slutsky lemma by decomposing the quadratic forms into sums over the elements of the vectors
and treating each elementwise sum as in the scalar case.) To bound the summation in (A.4), we consider three
cases according to the number of distinct time indices: (a) all five time indices are distinct, (b) exactly
four are distinct, and (c) at most three are distinct. We use Z1a, Z1b, and Z1c to denote
the corresponding summations when the time indices are restricted to cases (a), (b), and (c), respectively.
In case (a), applying Davydov inequality (e.g., Hall and Heyde (1980, p. 278)) yields¯¡
2
¢¯ ≤ 8 ( ) (− 1− ( ∨ ))(1+)(2+) (A.6)
where ∨ ≡ max ( ) and 1 ≡ max°°°°4+2 °°22°°4+2 A similar inequality holds
for ( 2) By the repeated use of Cauchy-Schwarz’s and Jensen’s inequalities and As-
sumption A.1(i),
|1| ≤ 1
2
h°°°°24+2 + °°22°°24+2i≤ 1
4
n°°°°28+4 + °°°°28+4 + 2°°22°°24+2o ≤ 1 (A.7)
for some 1 ∞ With this, we can readily show that under Assumption A.1(iii),
1 ≤ 6421
4
X=2
⎧⎨⎩ X1≤≤−1
(− 1− ( ∨ ))(1+)(2+)⎫⎬⎭2
= ¡−1
¢
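The covariance bound used in (A.6) and repeatedly below is the Davydov inequality for strong mixing sequences. For reference, its standard form (generic symbols, as in Hall and Heyde (1980, p. 278)) is

```latex
\bigl|\operatorname{Cov}(X,Y)\bigr| \;\le\; 8\,\alpha(m)^{1/r}\,\|X\|_{p}\,\|Y\|_{q},
\qquad \frac{1}{p}+\frac{1}{q}+\frac{1}{r}=1,
```

where X and Y are measurable with respect to sigma-fields separated by m periods and α(·) is the strong mixing coefficient. Taking p = q = 4 + 2δ gives 1/r = (2 + 2δ)/(4 + 2δ) = (1 + δ)/(2 + δ), the exponent appearing in the bounds above.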
In case (b), we consider two subcases: (b1) exactly one of the remaining time indices equals t − 1; (b2) exactly three time indices are distinct. We use Z1b1 and Z1b2 to denote the corresponding summations when the time indices are restricted
to subcases (b1) and (b2), respectively. In subcase (b1), wlog we assume that = − 1 and apply¯¡ −1−12
¢¯ ≤ 82 (− 1− )(1+)(2+)
for 2 ≡°°°°8+4D °°2−1−12°°(8+4)3D ≤ 2 for some 2 ∞ and (A.6)-(A.7) to obtain
11 ≤ 6412
3
X=2
⎧⎨⎩ 1 X1≤≤−1
(− 1− ( ∨ ))(1+)(2+)⎫⎬⎭⎧⎨⎩ X1≤≤−1
(− 1− )(1+)(2+)
⎫⎬⎭=
¡−2
¢
In subcase (b2), wlog we assume that = and − 1 We consider two subsubcases: (b21) either− 1− ∗ or − ∗, and (b22) − 1− ≤ ∗ and − ≤ ∗ In the first case, we have
¯¡
2
¢¯ ≤ ( 83 (∗)(1+)(2+)
if − 1− ∗84 ( ) (∗)
(1+)(2+)if − ∗
where 3 ≡°°2°°4+2 °°°°4+2 ≤ 3 ∞ and 4 ≡
°°2°°(8+4)3 °°°°8+4≤ 4 ∞ These results, in conjunction with the fact that the total number of terms in the summation in
subcase (b22) is of order ¡2 32∗
¢and Assumption A.1(iii), imply that
12 ≤ h 2 (∗)
(1+)(2+)+ −4−22 32∗
i=
³ 2 (∗)
(1+)(2+)+ −12∗
´= (1) .
Consequently, 1 = (1) In case (c), we have 1 = ¡−1
¢as the number of terms in the summation
is ¡2 3
¢and each term in absolute value has a bounded expectation. It follows that Z1 = (1)
To bound Z2 we consider two cases for the set of indices ≡ − 1: (a) # = 5, and (b)all the other cases. We use 2 and 2 to denote the corresponding summations when the individual
indices are restricted to subcases (a) and (b), respectively. In the first case, letting = max( ) we
have ¯¡4
4
¢¯ ≤ 85 ( ) (− 1− )(2+)
where 5 ≡°° °°2+ °°44°°2+ ≤ 5 ∞ Then2 ≤ 8−1
P=1 ()
(2+)
= ¡−1
¢ In case (b), we have 2 =
¡−1
¢ It follows that Z2 =
¡−1
¢and thus Z = (1)
Consequently the first part of (A.3) follows.
For the second part of (A.3), by Assumptions A.1(iv)-(v) we have
X=2
(2) = 4−2−1X=2
"X=1
−1X=1
0
#2
= 4−2−1X=2
X=1
−1X=1
−1X=1
(2 0
0) =
In addition, we can show by straightforward moment calculations that E(Σᵀₜ₌₂ z²_NT,t)² = V²_NT + o(1).
Thus Var(Σᵀₜ₌₂ z²_NT,t) = o(1), and the second part of (A.3) follows. This completes the proof of (i).
Finally, (ii) follows from Lemma B.4(i).
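To make the martingale CLT step above concrete, here is a small self-contained simulation (the conditional-variance recursion is invented purely for illustration and is unrelated to the paper's data): an m.d.s. sum, normalized by its conditional variance, is approximately standard normal.

```python
import numpy as np

rng = np.random.default_rng(1)
T, R = 300, 3000
S = np.empty(R)
for r in range(R):
    e = rng.standard_normal(T)
    h = np.empty(T)
    h[0] = 1.0
    h[1:] = 0.2 + 0.8 * np.minimum(e[:-1] ** 2, 4.0)  # F_{t-1}-measurable and bounded
    z = np.sqrt(h) * e                                # E[z_t | F_{t-1}] = 0: an m.d.s.
    S[r] = z.sum() / np.sqrt(h.sum())                 # normalize by the conditional variance
# S has sample mean near 0 and sample variance near 1
```

The two conditions checked in (A.3) play exactly the role of the boundedness and variance-stabilization requirements that make this normal approximation work.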
Proposition A.2 D₂NT = o_P(1) under H₀(K₀).
Proof. Noting that 1{i ∈ G_k⁰} = 1{i ∈ Ĝ_k} + 1{i ∈ G_k⁰\Ĝ_k} − 1{i ∈ Ĝ_k\G_k⁰} under H₀(K₀), we have
2 = −120X=1
X∈0
(0 − )0 0
0(0 − )
= −120X=1
X∈
¡0 −
¢0 00
¡0 −
¢+−12
0X=1
X∈0
\
(0 − )0 0
0(0 − )
−−120X=1
X∈\0
¡0 −
¢0 00
¡0 −
¢≡ 21 +22 −23 say.
Let ‖·‖_sp denote the spectral norm. Note that ‖A‖ ≤ √rank(A) ‖A‖_sp and ‖A‖_sp ≤ ‖A‖ for any matrix A.
By these properties, the submultiplicative property of the spectral norm, the fact that a projection matrix has unit spectral norm, and
Assumptions A.1(i) and A.2(ii),
21 ≤ −120X=1
°°0 − °°2 X
∈
k 00k ≤ −12
0X=1
°°0 − °°2 X
∈
kk2
= −12
¡( )−1 + −2
¢ ( ) = −12
¡1 +−1
¢= (1)
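The two matrix-norm inequalities just used (Frobenius norm at most √rank times the spectral norm, and spectral norm at most the Frobenius norm) are easy to verify numerically. A quick sketch with an arbitrary low-rank matrix (dimensions chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(5, 3)) @ rng.normal(size=(3, 7))  # a 5x7 matrix of rank 3
fro = np.linalg.norm(A, "fro")
sp = np.linalg.norm(A, 2)        # spectral norm = largest singular value
r = np.linalg.matrix_rank(A)
# fro <= sqrt(r) * sp  and  sp <= fro
```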
By Assumption A.2(iii), for any ε > 0, both P(D₂NT,2 ≥ ε) and P(D₂NT,3 ≥ ε) are bounded by the probability of a
misclassification event, which tends to zero as (N, T) → ∞. It follows that D₂NT,2 = o_P(1) and D₂NT,3 = o_P(1).
Consequently, D₂NT = o_P(1) under H₀(K₀).
Proposition A.3 D₃NT = o_P(1) under H₀(K₀).
Proof. As in the proof of Proposition A.2, we make the following decomposition:
3 = −120X=1
X∈0
00(0 − )
= −120X=1
X∈
00
¡0 −
¢+−12
0X=1
X∈0
\
00(0 − )
−−120X=1
X∈\0
00
¡0 −
¢≡ 31 +32 −33 say.
Using the same arguments as those used in the study of D₂NT,2 and D₂NT,3, we can show that D₃NT,2 =
o_P(1) and D₃NT,3 = o_P(1). Noting that α̂_k − α_k⁰ = O_P((NT)^(−1/2) + T^(−1)) for k = 1, ..., K₀ under H₀(K₀)
by Assumption A.2(ii), it suffices to prove that D₃NT,1 = o_P(1) by showing that
31 ≡ −12X∈
00 =
³min(( )12 )
´for = 1 0
By the fact that 1 ∈ = 1 ∈ 0 + 1 ∈ \0 − 1 ∈ 0\ and the arguments usedin the study of 22 and 23 we can show that 31 ≡ 31 + (1) where 31 =
−12P
∈0 00 Using
00 =
P=1( − ·) we can decompose 31 as follows
31 = −12X∈0
X=1
−−12−1X∈0
X=1
X=1
=¡1− −1
¢−12
X∈0
X=1
−−12−1X∈0
X1≤≤
−−12−1X∈0
X1≤≤
≡ 311 − 312 − 313 say.
Using Chebyshev inequality, we can readily show that 311 =
¡ 12
¢under Assumptions A.1(i), (iv)
and (v). Let = (1 )0 be an arbitrary × 1 nonrandom vector with kk = 1 By Assumptions
A.1(i), (iv) and (v), and Jensen inequality,
∙³0312
´2¸= −1−2
X∈0
X∈0
X1≤≤
X1≤≤
¡0
0
¢= −1−2
X∈0
X1≤≤
X1≤≤
¡0
0
¢= −1−2
X∈0
X1≤≤
¡0
0
2
¢= ( )
Then 312 =
¡ 12
¢by Chebyshev inequality. Next,
∙³0313
´2¸= −1−2
X∈0
X1≤≤
X1≤≤
¡0
0
¢+−1−2
X∈0
X∈0
6=
X1≤≤
X1≤≤
¡0
¢¡0
¢≡ + say.
Let ≡ To bound we consider two cases: (a) # = 4 and (b) # ≤ 3 and denote the
corresponding summations as and such that = + Apparently, = ( ) For wlog we
consider three subcases: (a1) (a2) (a3) , and denote the
corresponding summations as 1 2 and 3 respectively. (Note = 2(1 + 2 + 3)) In subcase
(a1), we apply Davydov inequality to obtain
|1| ≤ 8−1−2X∈0
X1≤≤
(− )(1+)(2+) ≤ 8
∞X=1
()(1+)(2+)
= ( )
where = kk8+4°°0
0
°°(8+4)3
≤ ∞ by Assumption A.1(i) and Jensen inequality.
Analogously, we can show that 2 = ( ) and 3 = ( ) It follows that = ( ) For we apply
Davydov inequality to obtain
|| ≤ −1−2
⎧⎨⎩X∈0
X1≤≤
¯¡0
¢¯⎫⎬⎭2
≤ −1−2
⎧⎨⎩X∈0
X1≤≤
(− )(3+2)(4+2)
⎫⎬⎭2
= −1−2¡2 2
¢= ()
where =°°0
°°8+4
°°0
°°8+4
≤ ∞ by Assumption A.1(i). Consequently, [0311 (3)]2
= ( + ) and 313 = (12 + 12)
In sum, we have 31 =
¡12 + 12
¢ It follows that 31 = (min(( )12 ))
Proof of Theorem 3.2. By Theorem 3.1 and the Slutsky lemma, it suffices to prove the first two parts
of the theorem.
Step 1. We prove (i): B̂_NT(K₀) = B_NT + o_P(1) under H₀(K₀). Let e_t denote a T × 1 vector with 1 in the t'th position and zeros everywhere else. Then
0 =P
=1
P=1
0 (0)
−1
Using û²ᵢₜ − u²ᵢₜ = (ûᵢₜ − uᵢₜ)² + 2uᵢₜ(ûᵢₜ − uᵢₜ), we decompose B̂_NT(K₀) − B_NT as follows:
(0)− =1√
X=1
X=1
( − )2 +
2√
X=1
X=1
( − )
≡ 1 + 22 say.
Noting that diag() is p.s.d., we have by (A.1) and Cauchy-Schwarz inequality
1 = −12X=1
( − )0diag () ( − )
≤ 2−12X=1
00diag ()0 + 2−12
X=1
( − 0 )0 0
0diag ()0( − 0 )
≡ 211 + 212 say.
We will show that 1 = (1) for = 1 and 2 By the fact thatP
=1 0 = and 0 is idempotent,
we have i0diag() i =tr[i0diag () i ] =
P=1tr
¡00
0¢=tr
¡0
0
¢=tr[0(
00)
−1
× 00] = This, in conjunction with Davydov inequality, implies that
¯11
¯= −2−12
X=1
[0i (i0diag () i ) i
0] = −2−12
X=1
(0i i0)
= −2−12X=1
X=1
¡2¢+ 2−2−12
X=1
X1≤≤
()
= (12−1) +(12−1) = (12−1).
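The fact tr(P) = p for the projection matrix P = x(x′x)⁻¹x′, used repeatedly above, is easy to sanity-check numerically (the dimensions below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
T, p = 50, 4
X = rng.normal(size=(T, p))
P = X @ np.linalg.solve(X.T @ X, X.T)  # projection onto the column space of X
# P is idempotent, tr(P) = p, and its eigenvalues are all 0 or 1
```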
Consequently 11 = (12−1) = (1) by Markov inequality. Using 1 ∈ 0 = 1 ∈ +
1 ∈ 0\− 1 ∈ \0 following similar arguments as those used in the proof of Proposition A.2,and by Assumptions A.1(i) and A.2(ii), we can show that under H0 (0)
12 = −120X=1
X∈
( − 0)0 0
0diag ()0( − 0) + (1)
≤ −120X=1
°° − 0°°2 X
∈
k 00diag ()0k+ (1)
= −12
³( )
−1+ −2
´ () + (1) = (1)
based on the fact that 00 ≤ for any p.s.d. matrix and
X∈
k 00diag ()0k ≤
X∈
k 0diag ()k ≤
X∈
X=1
°° 000
0
°°≤
X∈
X=1
°° 00
°° = X∈
X=1
°°° 0 (
00)
−1 0
°°°≤ max
1≤≤
°°°Ω−1 °°°−1 X∈
X=1
kk4 = () by Lemma B.3(v).
Consequently, we have shown that 1 = (1)
For 2 we first apply (A.1) to decompose it as follows
2 =1√
X=1
( − )0diag()
=1√
X=1
00diag() +1√
X=1
( − 0 )0 0
0diag() ≡ 21 + 22
Observe that
21 =1
2√
X=1
X=1
X=1
00Ω
−1 0
0
=1
2√
X=1
X=1
X=1
00Ω
−1 0
0
+1
2√
X=1
X=1
X=1
00
hΩ−1 −Ω−1
i 00 ≡ 211 + 212 say.
Noting that k00k ≤ k0k = kk we apply Lemma B.3(v) to obtain¯212
¯≤ max
1≤≤
°°°Ω−1 −Ω−1 °°° 1
2√
X=1
¯¯X=1
¯¯
X=1
kk2 kk
= ( )
³12−12
´= (1)
For 211 we have
211 =1
2√
X=1
X=1
X=1
X=1
X=1
0Ω−1
=1
2√
X=1
X=1
X=1
0Ω−1 − 2
3√
X=1
X=1
X=1
X=1
0Ω−1
+1
4√
X=1
X=1
X=1
X=1
X=1
0Ω−1
≡ 211 − 2211 + 221 say.
We further decompose 211 as follows 211 =1
2√
P=1
P=1
0Ω−1
2 +
1
2√
P=1
P1≤≤
0Ω−1 +
1
2√
P=1
P1≤≤
0Ω−1 = 211 (1) + 211 (2) + 211 (3) Ap-
parently, 211 (1) =
¡12−1
¢by Markov inequality. Noting that [211 (2)] = 0 by Davy-
dov inequality we can readily show that
h211 (2)
i2=
1
4
X=1
X1≤≤
X1≤≤
£
0Ω−1
0Ω−1
¤=
¡−1
¢
It follows that 211 (2) =
¡−12
¢ Similarly, 211 (3) =
¡−12
¢ Then 211 =
¡12−1 + −12
¢= (1) Analogously, we can show that 211 = (1) for = Then
we have 211 = (1) and 21 = (1)
For 22 using the same arguments as those used in the proof of Proposition A.2, we can show that
under H0 (0)
22 =1√
0X=1
( − 0)0 X∈
00diag() + (1)
=1√
0X=1
( − 0)0 X∈0
00diag() + (1) ≡ 22 + (1)
Let =1√
P∈0
00diag() Then as in the proof of Proposition A.2 and analysis of 211 (2),
we can show that
=1
√
X∈0
X=1
X=1
0Ω−1 0
0
=1
√
X∈0
X=1
X=1
0Ω−1 0
0 + (1)
=1
√
X∈0
X=1
X=1
X=1
X=1
0Ω−1 + (1) =
³12 + 12
´
It follows that 22 =
¡( )−12 + −1
¢
¡12 + 12
¢= (1) This completes the proof of
(i1).
Step 2. We prove (ii): V̂_NT(K₀) = V_NT + o_P(1). Observe that V̂_NT(K₀) − V_NT = 4Φ₁NT + 4Φ₂NT,
where
1 = −2−1X=2
X=1
⎧⎨⎩"
0
−1X=1
#2−"
0
−1X=1
#2⎫⎬⎭ and
2 = −2−1X=2
X=1
⎧⎨⎩"
0
−1X=1
#2−
"
0
−1X=1
#2⎫⎬⎭
Noting that the second term on the right-hand side has mean zero and variance o(1) by direct moment calculations, it is o_P(1)
by Chebyshev's inequality. Thus we are left to show that the first term is o_P(1). Again, using û² − u² = (û − u)² + 2u(û − u), we have
1 = −2−1X=2
X=1
"
0
−1X=1
− 0
−1X=1
#2
+2−2−1X=2
X=1
"
0
−1X=1
− 0
−1X=1
#
0
−1X=1
≡ 11 + 212
Let 12 ≡ −2−1P
=2
P=1[
0
P−1=1 ]
2 By Cauchy-Schwarz inequality 12 ≤ 1112©12
ª12 It is straightforward to show that 12 = (1) so that we can prove that 1 = (1)
by showing that 11 = (1) Using = ( − ) + and Cauchy-Schwarz inequality,
11 ≤ 3−2−1X=2
X=1
"³ −
´0 −1X=1
#2
+3−2−1X=2
X=1
"
0
−1X=1
³ −
´#2
+3−2−1X=2
X=1
"³ −
´0 −1X=1
³ −
´#2≡ 3111 + 3112 + 3113
We complete the proof of (ii) by showing that (ii1) 111 = (1) (ii2) 112 = (1) and (ii3)
113 = (1)
We first show (ii1) 111 = (1). Using − = ( − ) +(−)+( − ) (−) and Cauchy-Schwarz inequality, we have
111 ≤ 3−2−1X=2
X=1
"( − )
0
−1X=1
#2
+3−2−1X=2
X=1
"
³ −
´0 −1X=1
#2
+3−2−1X=2
X=1
"( − )
³ −
´0 −1X=1
#2≡ 3111 + 3111 + 3111
By Markov and Davydov inequalities we can show that −2−1P
=2
P=1[
0
P−1=1 ]
2 = (1)
By the Boole inequality, the Doob inequality for m.d.s. (e.g., Hall and Heyde (1980, pp. 14-15)), and then the Davydov
inequality, for any ε > 0 we have
Ãmax1≤≤
max2≤≤
°°°°°−12−1X=1
°°°°° 18
!≤
X=1
Ãmax2≤≤
°°°°°−12−1X=1
°°°°° 18
!
≤ 1
48
X=1
°°°°°−1X=1
°°°°°8
= (1)
It follows that
max1≤≤
max2≤≤
°°°°°−1X=1
°°°°° =
³ 1218
´ (A.8)
The same conclusion follows when one replaces by or 1. Let =diag(°°1°°2 °°°°2) By (A.1),
we can readily show that
−1X=1
k − k2 ≤ 2−1X=1
k0k2 + 2−1X=1
°°°0(0 − )
°°°2= (1) + (1) = (1) (A.9)
and
−1X=2
X=1
( − )2°°°°2 = −1
X=1
( − )0 ( − )
≤ 2−1X=1
000 + 2−1
X=1
(0 − )0 0
00(0 − )
= (1) + (1) = (1) (A.10)
By (A.8), (A.10), and Assumption A.3,
111 = −2−1X=2
X=1
( − )2
"0
−1X=1
#2
≤ −2 max1≤≤
max2≤≤
°°°°°−1X=1
°°°°°2(
−1X=2
X=1
( − )2°°°°2)
= −2
³14
´ (1) = (1)
To determine the probability order of 111 and 111 we use the uniform probability order of −We decompose − as follows:
− = Ω−12
" − −1
X=1
#−Ω−12
" − −1
X=1
()
#
= − −1
X=1
−Ω−12 −1X=1
[ − ()]
≡ 1 − 2 − 3 (A.11)
where ≡ Ω−12 −Ω−12 By Lemma B.3 and the fact thatmax1≤≤ max1≤≤ kk = (( )1(8+4)
)
by Boole and Markov inequalities, we have max1≤≤ max1≤≤ k1k = ( × ( )1(8+4)
) Follow-
ing the proof of Lemma B.3(v), we can show that
max1≤≤
°°°°°−1X=1
°°°°° ≤ max1≤≤
°°°°°−1X=1
[ − ()]
°°°°°+ max1≤≤
°°°°°−1X=1
()
°°°°°=
³max( )
1(8+4)log ( ) (log ( ) )12
´+ (1)
= (1) (A.12)
It follows that max1≤≤ max1≤≤ k2k = ( ) Also, max1≤≤ max1≤≤ k3k = ( )
Thus max1≤≤ max1≤≤°°° −
°°° = ( ( )1(8+4)
) In addition, using (A.11) and the above
bounds, we have
−1−1X=2
X=1
2
°°° −
°°°2≤ 3−1−1
X=2
X=1
2 k1k2 + 3−1−1X=2
X=1
2 k2k2 + 3−1−1X=2
X=1
2 k3k2
≤ 3−1−1X=2
X=1
2 k1k2 + (2 ) + (
2 ) = (
2 )
where the last equality follows from the fact that −1−1P=2
P=1
2 k1k2 ≤ max1≤≤ kk2 −1−1P=2
P=1
2 kk2 = (2 ) Then by Assumption A.3,
111 = −2−1X=2
X=1
"
³ −
´0 −1X=1
#2
≤ −1 max1≤≤
max1≤≤
"−1X=1
#2 "−1−1
X=2
X=1
2
°°° −
°°°2 #= −1
³14
´ (
2 ) = (
142 ) = (1)
and
111 = −2−1X=2
X=1
"( − )
³ −
´0 −1X=1
#2
= −2 max1≤≤
max1≤≤
°°° −
°°°2 max1≤≤
max1≤≤
°°°°°−1X=1
°°°°°2
−1X=2
X=1
( − )2
= −2 (2 ( )1(4+2)
)
³14
´ (1) = (
142 ) = (1)
It follows that 111 = (1)
To show (ii2) and (ii3), we find that it is convenient to bound ≡ −12P−1
=1(− ) Using
− = ( − ) + ( − ) + ( − ) ( − ), we have
≡ −12−1X=1
( − ) + −12−1X=1
( − ) + −12−1X=1
( − ) ( − )
≡ 1 + 2 + 3 say.
By (A.11), 1 = −12P−1
=1 (1 − 2 − 3) ≡ 11 − 12 − 13 say. By the remark after
(A.8), (A.12), and Lemma B.3,
max1≤≤
max1≤≤
k11k ≤ max1≤≤
kk max1≤≤
max1≤≤
°°°°°−12−1X=1
°°°°° = ( ) (14) = (1)
max1≤≤
max1≤≤
k12k ≤ max1≤≤
kk max1≤≤
°°°°°−1X=1
°°°°° max1≤≤max1≤≤
°°°°°−12−1X=1
°°°°°= ( ) (1) (
18) = (1)
and
max1≤≤
max1≤≤
k13k ≤ max1≤≤
°°°Ω−12
°°° max1≤≤
°°°°°−1X=1
[ − ()]
°°°°° max1≤≤max1≤≤
°°°°°−12−1X=1
°°°°°= (1) ( ) (
18) = (1)
It follows that max1≤≤ max1≤≤ k1k = (1) Similarly we can show that max1≤≤ max1≤≤k2k = (1) and max1≤≤ max1≤≤ k3k = (1). Hence max1≤≤ max1≤≤ kk = (1) It
follows that
112 = −2−1X=2
X=1
"
0
−1X=1
³ −
´#2
≤½max1≤≤
max1≤≤
kk2¾(
−1−1X=2
X=1
°°0°°2)= (1) (1) = (1)
and
113 = −2−1X=2
X=1
"³ −
´0 −1X=1
³ −
´#2
≤½max1≤≤
max1≤≤
kk2¾(
−1−1X=2
X=1
°°° −
°°°2) = (1) (1) = (1)
as one can readily show that −1−1P
=2
P=1
°°° −
°°°2 = (1). Thus 11 = (1) This
completes the proof of (ii). ∎
Proof of Theorem 3.3. Observe that
√V̂_NT(K₀) J₁NT(K₀) = D₁NT − B̂_NT(K₀) + D₂NT + 2D₃NT,
where D₁NT, D₂NT, and D₃NT are as defined in (A.2). We study the probability order of each term in the
last expression.
Noting that k 00k2 ≤ 2 k 0
k2 + 2 k 00k2 we have −1−2
P=1 k 0
0k2 ≤ 21 + 22where 1 = 2
−1−2P
=1
P=1
P=1
0 and 2 = −1−4
P=1
P=1
P=1
P=1
P=1
0
× By Assumption A.1 and Markov inequality, we can readily show that 1 = (−1) and
2 = (−1) It follows that −1−2
P=1 k 0
0k2 = (−1) Then by Lemma B.3(v),
−12−11 = −1−1X=1
000 ≤ max
1≤≤max
³
´−1−2
X=1
0000
= (1)
¡−1
¢=
¡−1
¢
By (A.1) and Cauchy-Schwarz inequality
−12−1 = −1−1X=1
X=1
2 = −1−1X=1
0diag ()
≤ 2−1−1X=1
00diag ()0
+2−1−1X=1
(0 − )0 0
0diag ()0(0 − )
≡ 21 + 22 say.
By Cauchy-Schwarz inequality, 1 ≤ 2−1−1P
=1 0diag() + 2
−1−1P
=1 00diag()0 ≡
211+212 say. By the fact =00 ≤ [min()]
−1−1000 and Lemma B.3(v), we have
11 ≤ [min()]−1−1−2
X=1
0diag (000)
= [min()]−1−1−2
X=1
X=1
X=1
X=1
0 =
¡−1
¢
as we can readily show that −1−2P
=1
P=1
P=1
P=1
0 = (
−1) based on thefact that = 1 = − −1 and Markov inequality. As in the analysis of 11 we can readily apply
the fact that i0diag() i = to obtain
12 = −1−3X=1
0i i0diag () i i
0 = −1−3
X=1
0i i0
= −1−3X=1
X=1
X=1
=
¡−2
¢
It follows that 1 =
¡−1
¢ By Cauchy-Schwarz inequality, 2 ≤ 2−1−1
P=1(
0−)0 0
diag()
×(0 − ) + 2−1−1
P=1(
0 − )
0 00diag()0(
0 − ) ≡ 221 + 222 say. Noting that
diag(000) ≤diag(
0) we have
21 ≤ [min()]−1−1−2
X=1
(0 − )0 0
diag (000)(
0 − )
≤ [min()]−1−1−2
X=1
(0 − )0 0
diag (0)(
0 − )
= [min()]−1−1−2
X=1
(0 − )0
0
0(
0 − )
≤ −1[min()]−1 max
1≤≤−1
X=1
kk4−1X=1
°°°0 −
°°°2= −1 (1) (1) (1) =
¡−1
¢
Using i0diag() i = we have
22 = −1−3X=1
(0 − )0 0
i i0diag () i i
0(
0 − )
= −1−3X=1
(0 − )0i i
0(
0 − )
≤ −1 max1≤≤
°°·°°2−1 X
=1
°°°0 −
°°°2 =
¡−1
¢
It follows that 2 =
¡−1
¢and −12−1 (0) =
¡−1
¢
By Assumption A.4(ii) and Lemma B.3(v), w.p.a.1
−12−12 = −1−1X=1
³0 −
´0 00
³0 −
´≥ min
³
´−1
0X=1
X∈
°°0 − °°2 ≥ 1
2min () 0
Now, we decompose −12−13 as follows: −12−13 = −1−1
P=1
0(
0−)−−1−2
×P=1
0i i
0(
0 − ) ≡ 31 + 32 say. For the first term, we apply the Cauchy-Schwarz
inequality to obtain
|31| ≤ −12(−1−1
X=1
k0k2)12
×⎧⎨⎩−1
0X=1
X∈
°°0 − °°2⎫⎬⎭
12
= −12 (1) (1) = (1)
where we use the fact that −1−1P
=1 k0k2 = (1) under Assumption A.1. Similarly,
|32| ≤ max1≤≤
°°−10i°° max1≤≤
°°−1 0i°°−1 X
=1
°°°0 −
°°°≤ max
1≤≤
°°−10i°° max1≤≤
°°−1 0i°°⎧⎨⎩−1
0X=1
X∈
°°0 − °°2⎫⎬⎭
12
= ( ) (1) (1) = (1)
It follows that −12−13 = (1)
In sum, we have −12−1q (0)1 (0) ≥ 1
2min () 0
+ (1) w.p.a.1. In addition, we
can show that (0) has a positive probability limit under H1 (0) It follows that under H1 (0)
(1 (0) ≥ )→ 1 as ( )→∞ for any = ¡12
¢ ¥
Proof of Theorem 4.1. (i) Let = (1 )0 = (1 )
0 and = (1 )0 The
minimization problem in (4.4) can be rewritten as
min
(0)
2 (βα) =1
X=1
°°°°°° − +1
X=1
°°°°°°2
+
X=1
Π0
=1 k − k (A.13)
Let β = β⁰ + T^(−1/2)v, where v = (v₁, ..., v_N) is a p × N matrix. We want to show that for any given ε* > 0, there exists a large constant L = L(ε*) such that for sufficiently large N and T we have
(inf
−1
=1kk2=
(0)
2
³β0 + −12v α
´
(0)
2
¡β0α0
¢) ≥ 1− ∗ (A.14)
where α ≡ α(v) is chosen such that (β⁰ + T^(−1/2)v, α) minimizes Q^(K₀)_2NT,λ(β, α) for the given v. This
implies that w.p.a.1 there is a local minimum (β̂, α̂) such that N^(−1) Σᴺᵢ₌₁ ‖β̂ᵢ − βᵢ⁰‖² = O_P(T^(−1)).
Observe that
h(0)
2
³β0 + −12v α
´−
(0)
2
¡β0α0
¢i≥ 1
X=1
⎧⎪⎨⎪⎩°°°°°° −
³0 + −12
´+1
X=1
³0 + −12
´°°°°°°2
− kk2⎫⎪⎬⎪⎭
=1
X=1
⎧⎪⎨⎪⎩°°°°°° − −12
⎛⎝ − 1
X=1
⎞⎠°°°°°°2
− kk2⎫⎪⎬⎪⎭
=1
X=1
°°°°°° − 1
X=1
°°°°°°2
− 2
12
X=1
⎛⎝ − 1
X=1
⎞⎠=
1
X=1
00 − 1
2
X=1
X=1
00 − 2
12
X=1
0
=1
0Ω − 1
0A − 2
12
X=1
0 (A.15)
where Ω ≡diag(Ω1 Ω ) A is an × matrix with a typical × block submatrix ( )−1
0
and we use the fact thatP
=1 = 0 By Lemma B.3(v) and Assumption A.1(ii),
1
0Ω ≥ 1
kk2 min
1≤≤min(Ω) ≥ 1
kk2 2 w.p.a.1. . (A.16)
Define the upper block-triangular matrix
1 =1
⎛⎜⎜⎜⎜⎝ 011 0
12 · · · 01
0 022 · · · 0
2
......
. . ....
0 0 · · · 0
⎞⎟⎟⎟⎟⎠
Note that A = 1 +01 − where = −1Ω By the fact that the eigenvalues of a block upper/lowertriangular matrix are the combined eigenvalues of its diagonal block matrices, Weyl’s inequality, Lemma
B.3(v), and Assumption A.1(ii), we have max (A) ≤ 2max (1)−min() ≤ 2−1max1≤≤ max(Ω) =
(−1) It follows that
1
0A ≤ 1
kk2 max (A) = 1
kk2 (
−1) (A.17)
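Weyl's inequality invoked above bounds the extreme eigenvalues of a sum of symmetric matrices; a quick numerical check on arbitrary symmetric matrices (illustrative only, unrelated to the paper's A and H matrices):

```python
import numpy as np

rng = np.random.default_rng(4)

def rand_sym(n):
    M = rng.normal(size=(n, n))
    return (M + M.T) / 2

A, B = rand_sym(6), rand_sym(6)
lam = lambda M: np.linalg.eigvalsh(M)   # eigenvalues in ascending order
val = lam(A + B)[-1]                    # mu_max(A + B)
upper = lam(A)[-1] + lam(B)[-1]         # mu_max(A) + mu_max(B)
lower = lam(A)[-1] + lam(B)[0]          # mu_max(A) + mu_min(B)
# Weyl: lower <= val <= upper
```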
In addition, by Cauchy-Schwarz and Markov inequalities, we can readily show that
1
12
¯¯X=1
0
¯¯ ≤
(1
X=1
0
)12(1
X=1
00
)12= −12 kk (1) (A.18)
Combining (A.15)-(A.18) yields
h(0)
2
³β0 + −12v α
´−
(0)
2
¡β0α0
¢i ≥ 1
kk2 [2− (1)]−−12 kk (1)
The first term dominates the second term in the last display for sufficiently large That is [(0)
2(β0+
−12v α) − (0)
2
¡β0α0
¢] 0 for sufficiently large Consequently, the minimizer β must satisfy
−1P
=1 ||||2 =
¡−1
¢
(ii) Given (i), we can readily argue the pointwise convergence of as in the proof of (iii) below. Let
(βα) =1
P=1Π
0
=1 k − k and (α) = Π0−1=1
°°° −
°°°+Π0−2=1
°°° −
°°°×°°0 − 0
°°++Π0
=2
°°0 − °° By the repeated use of Minkowski’s inequality (see, e.g., Su, Shi, and Phillips (2016,
SSP hereafter)), we have that as ( )→∞¯Π0
=1
°°° −
°°°−Π0
=1
°°°0 −
°°°¯ ≤ (α)°° − 0
°° and (α) ≤ 0 (α)³1 + 2
°°° − 0
°°°´ where 0 (α) = max1≤≤ max1≤≤≤0−1Π
=1
°°0 − °°0−1−
= max1≤≤0max1≤≤≤0−1
Π=1°°0 −
°°0−1−= (1) and ’s are finite integers. It follows that as ( )→∞
¯ (βα)− (β
0α)¯≤ 0 (α)
1
X=1
°°°°°°+ 20 (α)1
X=1
°°°°°°2≤ 0 (α)
(1
X=1
°°°°°°2)12 + (−1) = (
−12) (A.19)
By (A.19) and the fact that
¡β0α0
¢= 0, we have
0 ≥ (β α)− (βα0) =
¡β0 α
¢−
¡β0α0
¢+ (
−12)
=1
X=1
Π0
=1
°°0 − °°+ (
−12)
=1
Π0
=1
°° − 01°°+ +
0
Π0
=1
°° − 00
°°+ (−12) (A.20)
Then by Assumption A.2(i), we have Π0
=1
°° − 0°° = (
−12) for = 1 0 It follows that
((1) (0))− (01 00) = (
−12) for some suitable permutation ((1) (0)) of (1 0).
(iii) We invoke subdifferential calculus (e.g., Bertsekas (1995, Appendix B.5)). A necessary condition for
β̂ᵢ and α̂ₖ to minimize the objective function in (A.13) is that, for each i = 1, ..., N (resp. k = 1, ..., K₀),
0_{p×1} belongs to the subdifferential of Q^(K₀)_2NT,λ(β, α) with respect to βᵢ (resp. αₖ) evaluated at β̂ and α̂. That is, for each i = 1, ..., N and k = 1, ..., K₀, we have
0×1 = − 2
− 1
0
⎛⎝ − +1
X=1
⎞⎠+
0X=1
Π0
=16=°°° −
°°° (A.21)
where êᵢₖ = (β̂ᵢ − α̂ₖ)/‖β̂ᵢ − α̂ₖ‖ if ‖β̂ᵢ − α̂ₖ‖ ≠ 0, and êᵢₖ is any vector with ‖êᵢₖ‖ ≤ 1 if
‖β̂ᵢ − α̂ₖ‖ = 0. Substituting the expression for the residuals, (A.21) implies that
0 + (A.21) implies that
Ω =1
0 +
1
X=1
0 −
2 ( − 1)0X=1
Π0
=1 6=°°° −
°°° (A.22)
where = − 0 Ω is asymptotically nonsingular uniformly in by Lemma B.3(v) and Assumption
A.1(ii). Using the arguments as used in the proof of Lemma B.3, we can readily show that
max1≤≤
°°°° 1 0
°°°° = max1≤≤
°°°°° 1X=1
¡ − ·
¢( − ·)
°°°°°≤ max
1≤≤
°°°°° 1X=1
°°°°°+ max1≤≤
°°°°° 1X=1
·
°°°°°+ max1≤≤
°°·°° max1≤≤
|·|+ max1≤≤
°°·°°
= ( )
where = max( )1(4+2)
log ( ) (log ( ) )12 By Assumption A.1 and Lemma B.3,
1
°°°°°°X=1
0
°°°°°° ≤ max1≤≤
1
12
°°°
°°°⎧⎨⎩ 1
X=1
°°°
°°°2⎫⎬⎭12⎧⎨⎩ 1
X=1
°°°°°°2⎫⎬⎭12
≡ 1 = (−12)
Let =
2(−1)P0
=1 Π0
=16=°°° −
°°° By the fact that ¯Π0
=1 6=°°° −
°°°−Π0
=1 6=°°0 −
°°¯ ≤0
(α)³1 + 2
°°° − 0
°°°´ for some 0(α) = (1) we have
kk ≤
0X=1
Π0
=1 6=°°° −
°°°≤
0X=1
³Π0
=1 6=°°° −
°°°−Π0
=1 6=°°0 −
°°´+
0X=1
Π0
=1 6=°°0 −
°°≤ 200
(α)°°°°°°+
where = 00(α)+max1≤≤
P0
=1Π0
=1 6=°°0 −
°° = () as 0 ’s can only take 0 values.
It follows that
°°°°°° ≤°°°Ω−1 °°°
sp
⎛⎝°°°° 1 0
°°°°+°°°°°° 1
X=1
0
°°°°°°+ 200(α)
°°°°°°+
⎞⎠≤
°°°Ω−1 °°°sp
µ°°°° 1 0
°°°°+ 1 + 200(α)
°°°°°°+
¶ and
max1≤≤
°°°°°° ≤ 2
1− 2200(α)
µmax1≤≤
°°°° 1 0
°°°°+ 1 +
¶= (2 )
½max1≤≤
°°°° 1 0
°°°°+ 1 +
¾= ( + )
where k·ksp denotes the spectral norm and 2 = [min1≤≤ min(Ω)]−1 = (1) by Lemma B.3(v). ¥
Proof of Theorem 4.2. (i) Fix k ∈ {1, ..., K₀}. By the consistency of β̂ᵢ and α̂ⱼ in Theorem 4.1, we have
β̂ᵢ − α̂ⱼ →p αₖ⁰ − αⱼ⁰ ≠ 0 for all i ∈ Gₖ⁰ and j ≠ k, and ĉᵢₖ ≡ Π_{j=1, j≠k}^{K₀} ‖β̂ᵢ − α̂ⱼ‖ →p
cₖ ≡ Π_{j=1, j≠k}^{K₀} ‖αₖ⁰ − αⱼ⁰‖ > 0 for i ∈ Gₖ⁰. Now, suppose that ‖β̂ᵢ − α̂ₖ‖ ≠ 0 for some
i ∈ Gₖ⁰. Then the first-order condition (with respect to βᵢ) for the minimization problem implies that
0×1 = − 2√ 0
⎛⎝ − +1
X=1
⎞⎠+ √
− 10X=1
Π0
=16=°°° −
°°°= − 2√
0
⎡⎣ −
³ − 0
´+1
X=1
³ − 0
´⎤⎦+ √
− 10X=1
Π0
=1 6=°°° −
°°°= − 2√
0 +
⎛⎝ 2 0 +
− 1°°° −
°°°⎞⎠√ ³ −
´+2
0
√¡ − 0
¢
− 2√
X=1
0
³ − 0
´+
√
− 10X
=1 6=Π
0
=16=°°° −
°°°≡ 1 + 2 + 3 + 4 + 5 say, (A.23)
where =−k−k if
°°° −
°°° 6= 0 and kk ≤ 1 if °°° −
°°° = 0Let = −12 (ln )3++ (ln )
Following SSP and using Assumption A.5(i), we can strengthen
the result in 4.1 to obtain that
µmax1≤≤
°°°°°° ≥
¶= (−1) for any 0 (A.24)
This, in conjunction with the proof of Theorem 4.1(ii), implies that
³°° − 0
°° ≥ −12 (ln )´=
¡−1
¢and
µmax∈0
¯ − 0
¯≥ 02
¶=
¡−1
¢ (A.25)
By (A.24)-(A.25), Lemma B.4(v) and Assumption A.1(x),
µmax∈0
°°°3
°°° (ln )3+
¶=
¡−1
¢and
µmax∈0
°°°5
°°° √
¶=
¡−1
¢
Noting that 1√
°°°P=1
0
³ − 0
´°°° ≤ −12°°°
°°°½ 1
P=1
°°°
°°°2¾12½
P=1
°°° − 0
°°°2¾12= (1) we can readily show that
³max∈0
°°°4
°°° (ln )3+
´=
¡−1
¢ It follows that (Ξ ) =
1− ¡−1
¢ where
Ξ ≡½max∈0
¯ − 0
¯≤ 02
¾∩½max∈0
°°°3
°°° ≤ (ln )3+¾∩½max∈0
°°°4
°°° ≤ (ln )3+
¾∩½max∈0
°°°5
°°° ≤ √
¾
Then conditional on Ξ we have that uniformly in ∈ 0°°°°³ −
´0 ³2 + 3 + 4 + 5
´°°°° ≥°°°°³ −
´02
°°°°− °°°°³ −
´0 ³3 + 4 + 5
´°°°°≥
− 1√
°°° −
°°°− °°° −
°°° 2 (ln )3+ +√
≥√0
°°° −
°°° 4 for sufficiently large ( )
where the last inequality follows because ≥ 02 on Ξ √ À 3 (ln )
3++√ under
Assumption A.5(ii). It follows that for all ∈ 0
() = ³ ∈ | ∈ 0
´=
³−1 = 2 + 3 + 4 + 5
´≤
³¯( − )
01
¯≥¯( − )
0³2 + 3 + 4 + 5
´¯´≤
³°°° −
°°°°°°1
°°° ≥ √0 °°° −
°°° 4Ξ
´+ (Ξ∗ )
≤ ³°°°1
°°° ≥ √04´+ (Ξ∗ ) = ¡−1
¢
where Ξ∗ denotes the complement of Ξ and the convergence follows by Assumptions A.1(i), A.2(iv)
and A.3(i) (see the remark after Assumption A.3), and the fact that (Ξ∗ ) = ¡−1
¢ Consequently,
we can conclude that with probability 1− ¡−1
¢, − must be in a position where k − k is not
differentiable with respect to for any ∈ 0. That is, ³°°° −
°°° = 0 | ∈ 0
´= 1 −
¡−1
¢as
( )→∞ Then the rest of the proof follows SSP. ¥
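The classification property in Theorem 4.2 can be mimicked in a toy calculation: with well-separated group coefficients and accurate individual estimates, nearest-center assignment recovers the true grouping. The centers, noise level, and sample size below are invented for illustration; the C-Lasso itself shrinks each β̂ᵢ exactly onto a group estimate rather than merely close to one.

```python
import numpy as np

rng = np.random.default_rng(5)
alpha = np.array([[0.0, 0.0], [3.0, 3.0], [-3.0, 3.0]])     # true group coefficients
g_true = rng.integers(0, 3, size=120)
beta_hat = alpha[g_true] + 0.1 * rng.normal(size=(120, 2))  # accurate individual estimates
dist2 = ((beta_hat[:, None, :] - alpha[None, :, :]) ** 2).sum(axis=2)
g_hat = dist2.argmin(axis=1)                                # assign to the nearest center
```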
Proof of Theorem 4.3. For the post-Lasso estimates we have the following first order conditions:
1
X∈
bX0⎛⎝ − +
1
0X=1
X∈
⎞⎠ = 0 for = 1 0
wherebX = − 1
P∈
for ∈ It follows that vec(α0) = Q−1 V Let Q Q and V
be analogously defined as Q Q and V for = 1 0 By Theorem 4.2 and using similar
arguments as used in the proof of Proposition A.2, we can readily show that Q = Q+ (( )−12
)
Q = Q + (( )−12
) and V = V + (( )−12
) It follows that
vec (α0) = Q−1V + (( )
−12)
That is, α̂ is asymptotically equivalent to the infeasible oracle estimator α̂⁰ under H₀(K₀). Using
0 − 1
P0
=1
P∈0
0 + we can readily show that
vec¡α0−α00
¢= Q−1U
0 + (( )
−12)
where U0 =¡U001 U
000
¢0and U =
1
P∈0
X0 Noting that = 0 − · + i and
X = − 1
P∈0
for ∈ 0 we have
U0 =1
X∈0
X=1
⎛⎝ − 1
X∈0
⎞⎠ ( − ·)
It is easy to show that U⁰ = O_P((NT)^(−1/2) + T^(−1)) by moment calculations and the Chebyshev inequality
under Assumptions A.1 and A.2(i). Then vec(α̂⁰ − α⁰) = O_P((NT)^(−1/2) + T^(−1)) under Assumption A.6.
(ii) We make the following decomposition:
U0 =1
X∈0
X=1
¡ − ·
¢( − ·)− 1
X∈0
X=1
⎛⎝ 1
X∈0
¡ − ·
¢⎞⎠ ( − ·)
≡ U01 −U02
Under Assumptions A.1 and A.2(i), we can readily show that
U01 =1
X=1
X=1
½
1© ∈ 0
ª −(
()· − ())
¾− B +
³−1−12
´ and
U02 =1
X=1
X=1
£1© ∈ 0
ª− 1¤(()· − ()) +
³−1−12
´
where B is defined in Section 4.2. It follows that
U0 =1
X∈0
X=1
n −(
()· − ())
o − B +
³−1−12
´= U − B +
³−1−12
´
Then by Assumptions A.6(i)-(ii),√vec
¡α0−α00
+Q−1B
¢= Q−1
√U+ (1)
→ (0Q−10 Ω0Q−10 )
¥
Proof of Theorem 4.4. Let · = (·1 · )0 Noting that b = + (0 − )+
1
P=1 ( −0)
=0 − · + i +0(0 − ) +
1
P=10( − 0 )
0 = and 0i = 0 we have
−122 (0) = −12X=1
b000b
= −12X=1
(0 − ·)00
0 (0 − ·)
+−12X=1
(0 − )0 0
0(0 − )
+−12X=1
⎛⎝ 1
X=1
( − 0 )0 0
⎞⎠00
⎛⎝ 1
X=1
( − 0)
⎞⎠+2−12
X=1
(0 − ·)00(
0 − )
+2−12X=1
(0 − ·)00
0
⎛⎝ 1
X=1
( − 0)
⎞⎠+2−12
X=1
(0 − )00
⎛⎝ 1
X=1
( − 0)
⎞⎠≡ 1 +2 +3 + 24 + 25 + 26 say.
We analyze each of the six terms in turn.
(i) For 1 we make the following decomposition
1 = −12X=1
000 +−12
X=1
0·00· − 2−12
X=1
000·
≡ 11 +12 − 213
The first piece is identical to the leading term studied in the proof of Theorem 3.1 and thus, after recentering by the bias term,
converges in distribution to N(0, V₀). Using M₀ = I_T − T^(−1) ι_T ι_T′, we can decompose the second piece as follows
0 we can decompose 12 as follows
12 = −12X=1
0·00·
= −12X=1
0·· + −2−12
X=1
0·i i0
i i0 · − 2−1−12
X=1
0·i i
0 ·
≡ 12 +12 − 212 say.
Note that12 = −1−12P
=1 0·Ω
−1 0
· ≤ 2 12 where 2 ≡ [min1≤≤ min(Ω)]−1
= (1) and 12 = −1−12P
=1 0·
0· In view of the fact that 0
· =P
=1 · =1
P=1
P=1 and by straightforward moment calculations and Assumptions A.1(iv)-(v), we can
show that
¯12
¯= −1−12
X=1
(0·0·) = −1−52
X=1
X=1
X=1
X=1
X=1
(0)
= −1−52X=1
X=1
X=1
¡2
0
¢= (−12)
Then 12 =
¡−12
¢by Markov inequality and 12 =
¡−12
¢. For 12 we have
12 = −3−12X=1
0·i i0Ω
−1 0
i i0 · ≤ 3 12
where 3 ≡ max1≤≤
¡−1i0
¢22 = (1) and 12 = −1−12
P=1
0·i i
0 · Noting
that i0 · =1
P=1
P=1 we have
¯12
¯= −1−12
X=1
(0·i i0 ·) = −1−52
X=1
X=1
X=1
X=1
X=1
()
= −1−52X=1
X=1
X=1
¡2¢= (−12)
Then 12 =
¡−12
¢by Markov inequality and 12 =
¡−12
¢. By the Cauchy-Schwarz
inequality, |12| ≤ 1212 1212 =
¡−12
¢ Consequently we have shown that
12 =
¡−12
¢
Now, we study13Recall that denotes the ( ) element of00: = −1
P=1
P=1
0Ω−1 and ≡ −1
P=1
P=1
0Ω−1 where = 1 = − −1 Following
the proof of Lemma B.4(i), we can show that 13 = −12P
=1
P=1
P=1 · = 13 +
(1) where 13 = −12P
=1
P=1
P=1 · Now, in view of the fact that
= −1 0Ω−1 − −2
X=1
0Ω−1 − −2
X=1
0Ω−1 + −3
X=1
X=1
0Ω−1 (A.26)
we make the following decomposition
13 = −32X=1
X=1
X=1
X=1
= −1−32X=1
X=1
X=1
X=1
0Ω−1 − −2−32
X=1
X=1
X=1
X=1
X=1
0Ω−1
−−2−32X=1
X=1
X=1
X=1
X=1
0Ω−1
+−3−32X=1
X=1
X=1
X=1
X=1
X=1
0Ω−1
≡ 13 − 13 − 13 + 13
By Assumptions A.1(iv)-(v), we can show that ¡213
¢=
¡−1
¢ implying that 13 =
¡−12
¢ For 13 we have
¡13
¢= −2−32
X=1
X=1
X=1
X=1
¡
0Ω−1
¢= −2−32
X=1
X1≤6=6=≤
¡
0Ω−1
¢+(−12)
= 1 + 2 +(−12)
where 1 = −2−32P
=1
P1≤≤
¡
0Ω−1
¢and 2 = −2−32
P=1
P1≤≤
¡
0Ω−1
¢ By Assumptions A.1(i), (iii) and (v) and Davydov inequality
|1| ≤ −2−12X
1≤≤ (− )
(1+)(2+)= (−12)
Similarly, 2 = (−12) and ¡13
¢= (−12) Analogously, we can show that (2
13) =
(−1) It follows that 13 = (−12) By the same token, we can show that 13 = (1)
Noting that 13 = −3−32P
=1
P=1
P=1
P=1
P=1
P=1
0Ω−1 we have
¡213
¢= −6−3
X1≤1234≤
X1≤128≤
¡11223546
013Ω−11 14
037Ω−13 38
¢
Clearly, the expectation in the last summation is 0 if #1 2 3 4 ≥ 3With this observation, we can read-ily apply Assumptions A.1(i)-(v) and Davydov inequality to show that (2
13) = ¡−1 +−2
¢=
(1) Then 13 = (1) Consequently, we have 13 = (1)
In sum, we have shown that 1 −→ (0 0)
(ii) The second term is the same as the corresponding term in the proof of Theorem 3.1 and thus is o_P(1).
(iii) Noting that 00 is a projection matrix with maximum eigenvalue given by 1 and using the
Cauchy-Schwarz inequality, we have
3 ≤ 12
⎛⎝ 1
X=1
( − 0 )0 0
⎞⎠⎛⎝ 1
X=1
( − 0 )
⎞⎠= 12
⎧⎨⎩ 1
X=1
°°° − 0
°°°2⎫⎬⎭⎧⎨⎩ 1
X=1
kk2⎫⎬⎭ = 12 (( )
−1+ −2) (1) = (1)
(iv) For 4 we make the following decomposition:
4 = −12X=1
00(0 − )−−12
X=1
0·0(0 − ) ≡ 41 −42
41 is the same as 3 studied in the proof of Theorem 3.1 and thus 41 = (1) For 42
we can apply the Cauchy-Schwarz inequality to obtain
|42| ≤(
X=1
0·000·
)12(−1
X=1
°°°0 −
°°°2)12 = (12) (( )
−12+−1) = (1)
becauseP
=1 0·0
00· =
P=1tr(0
00·0·) ≤
P=1tr(
00·0·) ≤
P=1
0·
0· and
X=1
(0·0·) =
1
2
X=1
X=1
X=1
X=1
X=1
(0) =
1
2
X=1
X=1
X=1
¡2
0
¢= ( )
Hence 4 = (1)
(v) For 5 we make the following decomposition:
5 = −12X=1
000
⎛⎝ 1
X=1
( − 0 )
⎞⎠−−12X=1
0·00
⎛⎝ 1
X=1
( − 0)
⎞⎠≡ 51 −52
Using 0 = − 1i i
0 we can decompose 51 as follows
51 = −32X=1
X=1
000( − 0 )
=
0X=1
−32X∈
X=1
000( − 0) =
0X=1
51
¡ − 0
¢
where51 = −32P
∈
P=1
00
0 Let = min(( )12
) To show that 51 =
(1), it suffices to prove that 51 = ( ) for = 1, ..., 0, as − 0 = (( )−12
+ −1). Following the proofs of Propositions A.3 and A.1, we can show that
51 = −32X∈0
X=1
000 + (1)
= −32X∈0
X=1
X=1
X=1
+ (1) = 51 + ( )
where 51 = −32P
∈0
P=1
P=1
P=1 Now, using (A.26) we make the following
decomposition
51 = −1−32X∈0
X=1
X=1
X=1
0Ω−1 − −2−32
X∈0
X=1
X=1
X=1
X=1
0Ω−1
−−2−32X∈0
X=1
X=1
X=1
X=1
0Ω−1
+−3−32X∈0
X=1
X=1
X=1
X=1
X=1
0Ω−1
≡ 51 (1)− 51 (2)− 51 (3) + 51 (4)
As in the analysis of 13, we can show that 51 () = ( ) for = 1, ..., 4. For example, we can
easily show that 51 (1) has the same probability order as −1−12
P=1
P=1
P=1
0Ω−1
with =1
P∈0
() and the squared Frobenius norm of the last term has expectation given by
−2−1X=1
X=1
X=1
X=1
X=1
X=1
¡
0Ω−1
0Ω−1
¢0
= −2−1X=1
X=1
X=1
X=1
X=1
¡
0Ω−1
0Ω−1
¢0
+−2−1X=1
X=1 6=
X=1
X=1
X=1
X=1
¡
0Ω−1
¢¡
0Ω−1
¢0
= ( ) + ()
by repeated use of the Davydov inequality. It follows that 51 (1) =
¡12 + 12
¢= ( )
Then we have 51 = ( )
For 52 we apply 0 = − 1i i
0 to make the following decomposition:
52 = −12X=1
0· (00)
−1 00
⎛⎝ 1
X=1
( − 0 )
⎞⎠−−1−12
X=1
0·i i0 (
00)
−1 00
⎛⎝ 1
X=1
( − 0 )
⎞⎠ = 52 −52
By the Cauchy-Schwarz inequality, we have52 ≤ 52 (1)12 52 (2)12 where52 (1) =
−1P
=1 0·
0· and
52 (2) = −1X=1
⎛⎝ 1
X=1
( − 0 )
⎞⎠0
0 (00)
−2 00
⎛⎝ 1
X=1
( − 0 )
⎞⎠12
By the Markov inequality, we can readily show that 52 (1) = (1). In addition, noting that 00
is a projection matrix and has maximum eigenvalue 1, we have
52 (2) ≤ 2−1
X=1
⎛⎝ 1
X=1
( − 0 )
⎞⎠0
00
⎛⎝ 1
X=1
( − 0)
⎞⎠12
≤ 2
°°°°°° 1X=1
( − 0)0 0
°°°°°°2
≤ 2
1
X=1
°°° − 0
°°°2 1
X=1
kk2
= (1) (( )−1+ −2) (1) =
¡−1 + −1
¢
It follows that 52 =
¡−12 + −12
¢= (1) Similarly, we can show that 52 = (1)
Thus 52 = (1) and 5 = (1)
(vi) By the Cauchy-Schwarz inequality and the results in (ii)-(iii), 6 ≤ 212 3 12 = (1)
Combining the results in (i)-(vi) completes the proof of the theorem. ¥
B Some technical lemmas
Define the $m$th order U-statistic $\mathcal{U}_n = \binom{n}{m}^{-1} \sum_{1 \le i_1 < \cdots < i_m \le n} \varphi(\xi_{i_1},\ldots,\xi_{i_m})$, where $\varphi$ is symmetric in its arguments. Let $F(\cdot)$ denote the distribution function of $\xi_i$. Let $\varphi^{(0)} = \int \cdots \int \varphi(u_1,\ldots,u_m)\, \Pi_{j=1}^{m} dF(u_j)$ and $\varphi^{(c)}(u_1,\ldots,u_c) = \int \cdots \int \varphi(u_1,\ldots,u_c,u_{c+1},\ldots,u_m)\, \Pi_{j=c+1}^{m} dF(u_j)$ for $c = 1,\ldots,m$. Let $h^{(1)}(u_1) = \varphi^{(1)}(u_1) - \varphi^{(0)}$ and $h^{(c)}(u_1,\ldots,u_c) = \varphi^{(c)}(u_1,\ldots,u_c) - \sum_{j=1}^{c-1} \sum_{(c,j)} h^{(j)}(u_{i_1},\ldots,u_{i_j}) - \varphi^{(0)}$ for $c = 2,\ldots,m$, where the sum $\sum_{(c,j)}$ is taken over all subsets $1 \le i_1 < i_2 < \cdots < i_j \le c$ of $\{1,2,\ldots,c\}$. Let $H_n^{(c)} = \binom{n}{c}^{-1} \sum_{1 \le i_1 < \cdots < i_c \le n} h^{(c)}(\xi_{i_1},\ldots,\xi_{i_c})$. Then by Theorem 1 in Lee (1990, p. 26), we have the following Hoeffding decomposition:
$$\mathcal{U}_n = \varphi^{(0)} + \sum_{c=1}^{m} \binom{m}{c} H_n^{(c)}. \qquad \text{(B.1)}$$
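Since (B.1) is an exact algebraic identity rather than an asymptotic approximation, it can be verified numerically. The sketch below, in Python, does so for a second-order kernel $\varphi(x,y)=xy$; the sample values and the constant `mu` (which plays the role of $E\xi$ under $F$) are arbitrary illustrative numbers, and the identity holds for any choice of them.

```python
from itertools import combinations

# Toy check of the Hoeffding decomposition (B.1) for an m = 2 kernel
# phi(x, y) = x * y. Here mu stands in for E[xi] under F; since (B.1)
# is an algebraic identity, the check passes for any choice of mu.
xi = [0.3, 1.2, -0.7, 2.0, 0.5]  # illustrative sample values
mu = 1.5                         # stands in for E[xi] under F
n = len(xi)

phi0 = mu * mu                   # phi^(0) = E[phi(xi_1, xi_2)] = mu^2

def h1(x):
    # h^(1)(x) = phi^(1)(x) - phi^(0), with phi^(1)(x) = x * mu
    return x * mu - phi0

def h2(x, y):
    # h^(2)(x, y) = phi(x, y) - h^(1)(x) - h^(1)(y) - phi^(0)
    return x * y - x * mu - y * mu + phi0

pairs = list(combinations(range(n), 2))
U = sum(xi[i] * xi[j] for i, j in pairs) / len(pairs)      # U-statistic
H1 = sum(h1(x) for x in xi) / n                            # H_n^(1)
H2 = sum(h2(xi[i], xi[j]) for i, j in pairs) / len(pairs)  # H_n^(2)

# (B.1) with m = 2: U_n = phi^(0) + C(2,1) H_n^(1) + C(2,2) H_n^(2)
rhs = phi0 + 2 * H1 + H2
assert abs(U - rhs) < 1e-12
```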
To study the second moment of $H_n^{(c)}$ for $3 \le c \le m$, we need the following lemma.
Lemma B.1 Let $\{\xi_t,\, t \ge 1\}$ be a $d$-dimensional strong mixing process with mixing coefficient $\alpha(\cdot)$ and distribution function $F(\cdot)$. Let the integers $(t_1,\ldots,t_m)$ be such that $1 \le t_1 < t_2 < \cdots < t_m \le n$. Suppose that
$$\max\left\{ \int |\varphi(u_1,\ldots,u_m)|^{1+\delta}\, dF_{t_1,\ldots,t_m}(u_1,\ldots,u_m),\; \int |\varphi(u_1,\ldots,u_m)|^{1+\delta}\, dF_{t_1,\ldots,t_c}(u_1,\ldots,u_c)\, dF_{t_{c+1},\ldots,t_m}(u_{c+1},\ldots,u_m) \right\} \le M_\varphi$$
for some $\delta > 0$, where, e.g., $F_{t_1,\ldots,t_m}(u_1,\ldots,u_m)$ denotes the distribution function of $(\xi_{t_1},\ldots,\xi_{t_m})$. Then
$$\left| \int \varphi\, dF_{t_1,\ldots,t_m} - \int \varphi\, dF_{t_1,\ldots,t_c}\, dF_{t_{c+1},\ldots,t_m} \right| \le 4\, M_\varphi^{1/(1+\delta)}\, [\alpha(t_{c+1}-t_c)]^{\delta/(1+\delta)}.$$
Proof. See Lemma 2.1 in Sun and Chiang (1997).
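For later reference, the Davydov inequality invoked repeatedly in the proofs above can be stated in one standard form: for a strong mixing process with coefficient $\alpha(\cdot)$, if $X$ is measurable with respect to the $\sigma$-field generated by the process up to time $t$ and $Y$ with respect to that from time $t+\tau$ onward, then for $p, q, r > 1$ with $p^{-1}+q^{-1}+r^{-1}=1$,

```latex
|\operatorname{cov}(X, Y)| \;\le\; 8\,[\alpha(\tau)]^{1/r}\,\|X\|_p\,\|Y\|_q,
\qquad \|X\|_p \equiv \left(E|X|^p\right)^{1/p}.
```

Lemma B.1 with $m = 2$ and $c = 1$ is a covariance inequality of exactly this type.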
Lemma B.2 Let ≥ 1 be an -dimensional strong mixing process with mixing coefficient (·) and distribution function (·) Suppose that () =
¡−3(2+)−
¢ If there exists 0 such that
≡ max½Z
| (1 · · · )|2+ Π=1 () ¯¡1
¢¯2+¾ ≤ X=1
()
and −1P
=1
P=1
() = (1) then [H()
]2 =
¡−3
¢for 3 ≤ ≤
Proof. The proof is analogous to that of Lemma A.6 in Su and Chen (2013), who consider conditional
strong mixing processes instead.
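In the sequel, Lemma B.2 enters through the Chebyshev inequality: a second-moment bound of order $n^{-3}$ yields

```latex
H_n^{(c)} = O_p\!\left(n^{-3/2}\right) \ \text{ for } 3 \le c \le m,
\qquad \text{so} \qquad
\sum_{c=3}^{m} \binom{m}{c} H_n^{(c)} = O_p\!\left(n^{-3/2}\right),
```

i.e., the higher-order terms of the Hoeffding decomposition (B.1) are asymptotically negligible relative to the first two projections.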
Lemma B.3 Recall Ω ≡ 00 and Ω ≡ (Ω) Let Ω1 ≡ 0
and Ω1 ≡ (Ω1) Suppose
Assumptions A.1-A.3 hold. Then (i) max(Ω1) ≤ max (Ω1) +
¡−12
¢ (ii) min(Ω1) ≥ min (Ω1)−
¡−12
¢ (iii) max1≤≤
°°°Ω1 −Ω1°°° = ( ) (iv) max1≤≤°°°Ω−11 −Ω−11 °°° = ( ) (v)
max1≤≤°°°Ω −Ω°°° = ( ) and max1≤≤
°°°Ω−1 −Ω−1 °°° = ( ) where ≡ max( )1(4+2)
× log ( ) (log ( ) )12
Proof. The results in (i)-(ii) follow from Lemmas A.1(iv)-(v) in Su and Jin (2012). Su and Chen (2013,
Lemma A.7) prove (iii) for the conditional strong mixing process. The result also holds for strong mixing
processes with a simple application of the Bernstein-type inequality for strong mixing processes (see, e.g.,
Lemma 2.2 in Sun and Chiang (1997)). (iv) follows from (i)-(iii) and the submultiplicative property of the
Frobenius norm.
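The step behind (iv) is the standard resolvent identity; with subscripts as in the lemma statement (hats denoting sample versions),

```latex
\widehat{\Omega}_{1i}^{-1} - \Omega_{1i}^{-1}
  = -\,\widehat{\Omega}_{1i}^{-1}\,\bigl(\widehat{\Omega}_{1i} - \Omega_{1i}\bigr)\,\Omega_{1i}^{-1},
```

so by submultiplicativity the left side is bounded in norm by $\|\widehat{\Omega}_{1i}^{-1}\|\,\|\widehat{\Omega}_{1i}-\Omega_{1i}\|\,\|\Omega_{1i}^{-1}\|$, and (i)-(iii) bound each factor uniformly in $i$.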
Now we show (v). Using 0 = − −1i i0 , we can decompose Ω −Ω as follows:
Ω −Ω = −1 [ 00 − ( 0
0)]
=³Ω1 −Ω1
´− · 0
· +¡· 0
·¢
=³Ω1 −Ω1
´− £· −
¡·¢¤ £
· −¡·¢¤0 − £· −
¡·¢¤¡ 0·¢
− ¡·¢ £· −
¡·¢¤0+£¡· 0
·¢−
¡·¢¡ 0·¢¤
Following the proof of (iii), we can show that max1≤≤°°· −
¡·¢°° = (1 ) where 1 ≡
max( )1(8+4)
log ( ) (log ( ) )12 = ( ) max1≤≤°° ¡·
¢°° = (1) by Assumption A.1(i). Let denote the ( )th element of ¡· 0
·¢−
¡·¢¡ 0·¢for = 1 Then by
the triangle inequality, the Davydov inequality, and Assumption A.2(iii),
|| =1
2
¯¯X=1
X=1
cov ()
¯¯ ≤ 1
2
X=1
|cov ()|+ 1
2
X1≤6=≤
|cov ()|
≤ ¡−1
¢+8
∞X=1
()(3+2)(4+2)
= ¡−1
¢
where ≤ sup≥1 max1≤≤ max1≤≤ kk8+4. Then by the triangle inequality, we have max1≤≤ °°°Ω −Ω°°° = ( ). (vi) follows from (v) and Assumption A.1(ii).
Lemma B.4 Let and be as defined in the proof of Theorem 3.1. Suppose Assumptions A.1-A.3
hold. Then
(i) 1 ≡ −12P
=1
P1≤6=≤
¡ −
¢= (1)
(ii) 2 ≡ −2−12P
=1
P1≤≤
P=1 [ − ()]
0Ω−1 = (1)
(iii) 3 ≡ −2−12P
=1
P1≤≤
P=1
0Ω−1 [ − ()] = (1)
(iv) 4 ≡ −3−12P
=1
P1≤≤
P=1
P=1 [ − ()]
0Ω−1 [ − ()] = (1)
(v) 5 ≡ −3−12P
=1
P1≤≤
P=1
P=1 [ − ()]
0Ω−1 () = (1)
Proof. The proof of (i) is analogous to that of Lemma A.8 in Su and Chen (2013) except that we
replace their Lemmas A.5-A.7 by Lemmas B.1-B.3. To show (ii), letting ≡ [ − ()]0Ω−1
we can decompose 2 as follows
2 =1
2√
X=1
X1≤≤
X=1 6=
+1
2√
X=1
X1≤≤
+1
2√
X=1
X1≤≤
≡ 21 +22 +23 say.
Let = (0)0 0 ( ) = and ( ) = [0 ( ) + 0 ( )
+0 ( )+0 ( )+0 ( )+0 ( )]6 Let ≡Ã
3
!−1P1≤≤
( ) Then 21 =√
P=1 where =
(−1)(−2)2
By Assumption A.1 and Lemma
B.2, (221) =
2
P=1
¡2
¢= 2
¡−3
¢=
¡−1
¢ It follows that 21 =
¡−12
¢.
Noting that D (22) = 0 and (222) =
14
P=1
P1≤≤ (2) =
¡−1
¢
we have 22 =
¡−12
¢ Similarly, 23 =
¡−12
¢ Then (ii) follows. The proofs of (iii)-(v)
are analogous to that of (ii) and thus omitted.
REFERENCES

Acemoglu, D., Johnson, S., Robinson, J., 2005. The rise of Europe: Atlantic trade, institutional change, and economic growth. American Economic Review 95, 546-579.

Acemoglu, D., Johnson, S., Robinson, J., Yared, P., 2008. Income and democracy. American Economic Review 98, 808-842.

Ando, T., Bai, J., 2016. Panel data models with grouped factor structure under unknown group membership. Journal of Applied Econometrics 31, 163-191.

Barro, R. J., 1999. Determinants of democracy. Journal of Political Economy 107, 158-183.

Bertsekas, D. P., 1995. Nonlinear Programming. Athena Scientific, Belmont, MA.

Bester, C. A., Hansen, C. B., 2016. Grouped effects estimators in fixed effects models. Journal of Econometrics 190, 197-208.

Bonhomme, S., Manresa, E., 2015. Grouped patterns of heterogeneity in panel data. Econometrica 83, 1147-1184.

Deb, P., Trivedi, P. K., 2013. Finite mixture for panels with fixed effects. Journal of Econometric Methods 2, 35-51.

Hahn, J., Kuersteiner, G., 2011. Bias reduction for dynamic nonlinear panel models with fixed effects. Econometric Theory 27, 1152-1191.

Hahn, J., Moon, H. R., 2010. Panel data models with finite number of multiple equilibria. Econometric Theory 26, 863-881.

Hall, P., Heyde, C. C., 1980. Martingale Limit Theory and Its Application. Academic Press, New York.

Holm, S., 1979. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6, 65-70.

Hsiao, C., 2014. Analysis of Panel Data. Cambridge University Press, Cambridge.

Hsiao, C., Tahmiscioglu, A. K., 1997. A panel analysis of liquidity constraints and firm investment. Journal of the American Statistical Association 92, 455-465.

Lee, A. J., 1990. U-Statistics: Theory and Practice. Marcel Dekker, New York.

Leeb, H., Pötscher, B. M., 2005. Model selection and inference: facts and fiction. Econometric Theory 21, 21-59.

Leeb, H., Pötscher, B. M., 2008. Sparse estimators and the oracle property, or the return of Hodges' estimator. Journal of Econometrics 142, 201-211.

Leeb, H., Pötscher, B. M., 2009. On the distribution of penalized maximum likelihood estimators: the LASSO, SCAD, and thresholding. Journal of Multivariate Analysis 100, 2065-2082.

Lin, C-C., Ng, S., 2012. Estimation of panel data models with parameter heterogeneity when group membership is unknown. Journal of Econometric Methods 1, 42-55.

Lipset, S. M., 1959. Some social requisites of democracy: economic development and political legitimacy. American Political Science Review 53, 69-105.

Onatski, A., 2009. Testing hypotheses about the number of factors in large factor models. Econometrica 77, 1447-1479.

Pesaran, M. H., Smith, R., Im, K. S., 1996. Dynamic linear models for heterogeneous panels. In: L. Matyas and P. Sevestre (Eds.), The Econometrics of Panel Data: A Handbook of the Theory with Applications, second revised edition, pp. 145-195. Kluwer Academic Publishers, Dordrecht.

Pesaran, M. H., Yamagata, T., 2008. Testing slope homogeneity in large panels. Journal of Econometrics 142, 50-93.

Phillips, P. C. B., Sul, D., 2003. Dynamic panel estimation and homogeneity testing under cross section dependence. Econometrics Journal 6, 217-259.

Pollard, D., 1984. Convergence of Stochastic Processes. Springer-Verlag, New York.

Sarafidis, V., Weber, N., 2015. A partially heterogeneous framework for analyzing panel data. Oxford Bulletin of Economics and Statistics 77, 274-296.

Schneider, U., Pötscher, B. M., 2009. On the distribution of the adaptive LASSO estimator. Journal of Statistical Planning and Inference 139, 2775-2790.

Su, L., Chen, Q., 2013. Testing homogeneity in panel data models with interactive fixed effects. Econometric Theory 29, 1079-1135.

Su, L., Jin, S., 2012. Sieve estimation of panel data models with cross section dependence. Journal of Econometrics 169, 34-47.

Su, L., Shi, Z., Phillips, P. C. B., 2016. Identifying latent structures in panel data. Econometrica, forthcoming.

Sun, S., Chiang, C-Y., 1997. Limiting behavior of the perturbed empirical distribution functions evaluated at U-statistics for strongly mixing sequences of random variables. Journal of Applied Mathematics and Stochastic Analysis 10, 3-20.

Sun, Y., 2005. Estimation and inference in panel structure models. Working paper, Dept. of Economics, UCSD.