Determining the Number of Groups in Latent Panel Structures
with an Application to Income and Democracy∗
Xun Lu and Liangjun Su
Department of Economics, Hong Kong University of Science & Technology; School of Economics, Singapore Management University, Singapore
September 15, 2016
Abstract
We consider a latent group panel structure as recently studied by Su, Shi, and Phillips
(2016), where the number of groups is unknown and has to be determined empirically. We
propose a testing procedure to determine the number of groups. Our test is a residual-based
LM-type test. We show that after being appropriately standardized, our test is asymptotically
normally distributed under the null hypothesis of a given number of groups and has power
to detect deviations from the null. Monte Carlo simulations show that our test performs
remarkably well in finite samples. We apply our method to study the effect of income on
democracy and find strong evidence of heterogeneity in the slope coefficients. Our testing
procedure determines three latent groups among seventy-four countries.
Key words: Classifier Lasso; Dynamic panel; Latent structure; Penalized least squares; Number of groups; Test
JEL Classification: C12, C23, C33, C38, C52.
1 Introduction
Recently latent group structures have received much attention in the panel data literature; see,
e.g., Sun (2005), Lin and Ng (2012), Deb and Trivedi (2013), Bonhomme and Manresa (2015,
∗The authors express their sincere appreciation to the co-editor, Frank Schorfheide, and three anonymous referees
for their many constructive comments on the previous version of the paper. They also thank Stéphane Bonhomme,
Xiaohong Chen, Han Hong, Cheng Hsiao, Hidehiko Ichimura, Peter C. B. Phillips, and Katsumi Shimotsu for
discussions on the subject matter and valuable comments on the paper. Su gratefully acknowledges the Singapore
Ministry of Education for Academic Research Fund under Grant MOE2012-T2-2-021 and the funding support
provided by the Lee Kong Chian Fund for Excellence. All errors are the authors’ sole responsibilities. Address
correspondence to: Liangjun Su, School of Economics, Singapore Management University, 90 Stamford Road,
Singapore, 178903; Phone: +65 6828 0386; e-mail: [email protected].
BM hereafter), Sarafidis and Weber (2015), Ando and Bai (2016), Bester and Hansen (2016), and
Su, Shi, and Phillips (2016, SSP hereafter). In comparison with some other popular approaches
to model unobserved heterogeneity in panel data models such as random coefficient models (see,
e.g., Hsiao (2014, chapter 6)), one important advantage of the latent group structure is that it
allows flexible forms of unobservable heterogeneity while remaining parsimonious. In addition, the
group structure has sound theoretical foundations in game theory or macroeconomic models
where multiplicity of Nash equilibria is expected (cf. Hahn and Moon (2010)). The key question
in latent group structures is how to identify each individual’s group membership. Bester and
Hansen (2016) assume that membership is known and determined by external information, say,
external classification or geographic location, while others assume that it is unrestricted and
unknown and propose statistical methods to achieve classification. Sun (2005) uses a parametric
multinomial logit regression to model membership. Lin and Ng (2012), BM, Sarafidis and Weber
(2015), and Ando and Bai (2016) extend K-means classification algorithms to the panel regression
framework. Deb and Trivedi (2013) propose EM algorithms to estimate finite mixture panel data
models with fixed effects. Motivated by the sparse feature of the individual regression coefficients
under latent group structures, SSP propose a novel variant of the Lasso procedure, i.e., classifier
Lasso (C-Lasso), to achieve classification. While these methods make important contributions
by empirically grouping individuals, to implement these methods, we often need to determine
the number of groups first. Some information criteria have been proposed to achieve this goal
(see, e.g., BM and SSP), which often rely on certain tuning parameters. This paper provides a
hypothesis-testing-based solution to determine the number of groups.
Specifically, we consider the panel data structure as in SSP:
$$y_{it} = \beta_i^{0\prime} x_{it} + \mu_i + u_{it}, \qquad i = 1,\dots,N \text{ and } t = 1,\dots,T, \qquad (1.1)$$
where $x_{it}$, $\mu_i$, and $u_{it}$ are the vector of regressors, the individual fixed effect, and the idiosyncratic error term, respectively, and $\beta_i^0$ is the slope coefficient that can depend on individual $i$. We assume that
the $N$ individuals belong to $K$ groups and all individuals in the same group share the same slope coefficients. That is, the $\beta_i^0$'s are homogeneous within each of the $K$ groups but heterogeneous across the groups. For a given $K$, we can apply SSP's C-Lasso procedure or the K-means algorithm to determine the group membership and estimate the $\beta_i^0$'s. However, in practice, $K$ is unknown and has to be determined from the data. This motivates us to test the following hypothesis:
$$H_0(K_0): K = K_0 \quad \text{versus} \quad H_1(K_0): K_0 < K \le K_{\max},$$
where $K_0$ and $K_{\max}$ are pre-specified by researchers. We can sequentially test the hypotheses $H_0(1), H_0(2), \dots$ until we fail to reject $H_0(K^*)$ for some $K^* \le K_{\max}$ and conclude that the number of groups is $K^*$. Onatski (2009) applies a similar procedure to determine the number of latent factors in panel factor structures.
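The sequential rule just described is mechanical enough to sketch in a few lines. In the sketch below, `pvalue_for_K` is a hypothetical helper (not from the paper) that returns the p-value of the test of $H_0(K)$ developed in Sections 2-3:

```python
# Sketch of the sequential testing rule; `pvalue_for_K` is a hypothetical helper
# that returns the p-value of the test of H0(K).
def select_num_groups(pvalue_for_K, K_min=1, K_max=5, level=0.05):
    """Return the first K with H0(K) not rejected, or None if all are rejected."""
    for K in range(K_min, K_max + 1):
        if pvalue_for_K(K) >= level:  # fail to reject H0(K): conclude K groups
            return K
    return None  # even H0(K_max) rejected: fall back to a random coefficient model

# Toy illustration with made-up p-values: H0(1) and H0(2) rejected, H0(3) not.
pvals = {1: 1e-6, 2: 3e-4, 3: 0.42, 4: 0.55, 5: 0.61}
K_hat = select_num_groups(pvals.get)  # K_hat == 3
```

The p-values here are invented purely to illustrate the stopping rule; in practice each one comes from the standardized LM statistic constructed below.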
In addition to helping to determine the number of groups, testing $H_0(K_0)$ itself is also useful for
empirical research. When $K_0 = 1$, the test becomes a test for homogeneity in the slope coefficients,
which is often assumed in empirical applications. When $K_0$ is some integer greater than 1, we
test whether the group structure is correctly specified. Although the group structure is flexible in
terms of modeling unobserved slope heterogeneity, it could still be misspecified. Inferences based
on misspecified models are often misleading. Thus conducting a formal specification test is highly
desirable. Although we work in the same framework as SSP, the questions of interest are different.
SSP study the classification and estimation problem, while we consider the specification testing.
To avoid overlapping with SSP’s paper, we omit the detailed discussions on the estimation in this
paper.
Our test is a residual-based LM-type test. We estimate the model under the null hypothesis
$H_0(K_0)$ to obtain the restricted residuals, and the test statistic is based on whether the regressors
have predictive power for the restricted residuals. Under the null of the correct number of latent
groups, the regressors should not contain any useful information about the restricted residuals. We
show that after being appropriately standardized, our test statistic is asymptotically normal under
the null. The p-values can be obtained based on the standard normal approximation, and thus
the test is easy to implement. Our test is related to the literature on testing slope homogeneity
and poolability for panel data models, in which case $K_0 = 1$; see, e.g., Pesaran, Smith, and Im (1996), Phillips and Sul (2003), Pesaran and Yamagata (2008), and Su and Chen (2013), among others. Nevertheless, none of the existing tests can be directly applied to test $K = K_0$ where $K_0 > 1$. Also, the test proposed in this paper substantially differs from the existing tests in
technical details, as we need to apply the C-Lasso method or K-means algorithm to estimate the
model under the null.
When there are no time fixed effects in the model, both SSP’s C-Lasso method and the K-
means algorithm can be used to estimate the model under $H_0(K_0)$. So the residuals that are used
to construct our LM statistic can be obtained from either method. When time fixed effects are
present in the model, we extend SSP’s method to allow for time fixed effects and show that the
LM statistic is asymptotically equivalent to that in the absence of time fixed effects.
We conduct Monte Carlo simulations to show the excellent finite sample performance of our
test. Both the level and the power of our test perform well in finite samples. With a high probability,
our method can determine the number of groups correctly.
Our method is applicable to a wide range of empirical studies. In this paper, we provide
detailed empirical analysis on the relationship between income and democracy. Specifically, $y_{it}$ is a measure of democracy for country $i$ in period $t$, and $x_{it}$ includes its income (the logarithm of its real GDP per capita) and lagged $y_{it}$. The main parameter of interest is the effect of income
on democracy. In an influential paper, Acemoglu, Johnson, Robinson, and Yared (2008, AJRY
hereafter) use the standard panel data model with common slope coefficients and find that the
effect of income on democracy is insignificant. We find that the slope coefficients (the effects
of income and lagged democracy on democracy) are actually heterogeneous. The hypothesis of
homogeneous slope coefficients is strongly rejected with p-values being less than 0.001. This
suggests that the common coefficient model is likely to be misspecified. Further, we determine the
number of heterogeneous groups to be three and find that the slope coefficients of the three groups
differ substantially. In particular, for one group, the effect of income on democracy is positive and
significant, and for the other two groups, the income effects are also significant, but negative with
different magnitudes. Therefore, our method provides the new insight that the effect of income
on democracy is heterogeneous and significant, in sharp contrast to the existing finding that
the income effect is insignificant. Our classification of groups is completely data-driven, and the resulting groups show no apparent pattern, such as geographic clustering. We further investigate
the determinants of the group pattern by running a cross-country multinomial logit regression
on some country-specific covariates. We find that two variables, namely, the constraints on the executive at independence and long-run economic growth, are important determinants.
There are numerous potential applications of our method. For example, SSP study the de-
terminants of saving rates across countries. That is, $y_{it}$ is the ratio of saving to GDP, and $x_{it}$ includes the lagged saving rates, inflation rates, real interest rates, and per capita GDP growth
rates. Using an information criterion, they find a two-group structure in the slope coefficients.
Our method is perfectly applicable to their setting. Another example is Hsiao and Tahmiscioglu
(1997, HT) who study the effect of firms’ liquidity constraints on their investment expenditure.
Specifically, their $y_{it}$ is the ratio of firm $i$'s capital investment to its capital stock at time $t$, and $x_{it}$ includes its liquidity (defined as cash flow minus dividends), sales, and Tobin's $q$. HT classify firms into two groups, less-capital-intensive firms and more-capital-intensive firms, using the
capital intensity ratio of 0.55 as a cutoff point. Our method can be applied to their setting by
classifying the firms into heterogeneous groups in a data-driven way, which allows unobservable
heterogeneity in the slope coefficients. In general, our method provides a powerful tool to detect
unobserved heterogeneity, which is an important phenomenon in panel data analysis. It can be
applied to any linear model for panel data where the time dimension is relatively long.
The remainder of the paper is organized as follows. In Section 2, we introduce the hypotheses
and the test statistic for panel data models with individual fixed effects only. In Section 3 we
derive the asymptotic distribution of our test statistic under the null and study the global power
of our test. In Section 4, we extend the analysis to allow for both individual and time fixed effects
in the models. We conduct Monte Carlo experiments to evaluate the finite sample performance
of our test in Section 5 and apply it to the income-democracy dataset in Section 6. Section 7
concludes. All proofs are relegated to the Appendix.
Notation. For an $m \times n$ real matrix $A$, we denote its transpose as $A'$ and its Frobenius norm as $\|A\|$ ($\equiv [\mathrm{tr}(AA')]^{1/2}$), where $\equiv$ means "is defined as". Let $P_A \equiv A(A'A)^{-1}A'$ and $M_A \equiv I_m - P_A$, where $I_m$ denotes an $m \times m$ identity matrix. When $A = A'$ is symmetric, we use $\mu_{\max}(A)$ and $\mu_{\min}(A)$ to denote its maximum and minimum eigenvalues, respectively, and $\mathrm{diag}(a_1,\dots,a_m)$ denotes a diagonal matrix whose $j$th diagonal element is given by $a_j$. Let $P_0 \equiv T^{-1} i_T i_T'$ and $M_0 \equiv I_T - T^{-1} i_T i_T'$, where $i_T$ is a $T \times 1$ vector of ones. Moreover, the operator $\xrightarrow{p}$ denotes convergence in probability, and $\xrightarrow{d}$ convergence in distribution. We use $(N,T) \to \infty$ to denote the joint convergence of $N$ and $T$ when $N$ and $T$ pass to infinity simultaneously. We abbreviate "positive semidefinite", "with probability approaching 1", and "without loss of generality" to "p.s.d.", "w.p.a.1", and "wlog", respectively.
2 Hypotheses and test statistic
In this section we introduce the hypotheses and test statistic.
2.1 Hypotheses
We consider the panel structure model
$$y_{it} = \beta_i^{0\prime} x_{it} + \mu_i + u_{it}, \qquad i = 1,\dots,N \text{ and } t = 1,\dots,T, \qquad (2.1)$$
where $x_{it}$ is a $p \times 1$ vector of strictly exogenous or predetermined regressors, $\mu_i$ an individual fixed effect, and $u_{it}$ the idiosyncratic error term. The model with both individual and time fixed effects will be studied in Section 4. We assume that $\beta_i^0$ has the following group structure:
$$\beta_i^0 = \begin{cases} \alpha_1^0 & \text{if } i \in G_1^0, \\ \;\vdots & \quad\vdots \\ \alpha_K^0 & \text{if } i \in G_K^0, \end{cases} \qquad (2.2)$$
where $K$ is an integer and $\{G_1^0,\dots,G_K^0\}$ forms a partition of $\{1,\dots,N\}$ such that $\cup_{k=1}^{K} G_k^0 = \{1,\dots,N\}$ and $G_j^0 \cap G_k^0 = \emptyset$ for any $j \neq k$. Further, $\alpha_j^0 \neq \alpha_k^0$ for any $j \neq k$. Let $N_k = \#G_k^0$ denote the cardinality of the set $G_k^0$. We assume that $\mathcal G^0 \equiv \{G_1^0,\dots,G_K^0\}$, $\alpha^0 \equiv (\alpha_1^0,\dots,\alpha_K^0)$, and $\beta^0 \equiv (\beta_1^0,\dots,\beta_N^0)$ are all unknown. One key step in estimating all these parameters is to first determine $K$, as once $K$ is determined, we can readily apply SSP's C-Lasso method or the K-means algorithm. This motivates us to test the following hypothesis:
$$H_0(K_0): K = K_0 \quad \text{versus} \quad H_1(K_0): K_0 < K \le K_{\max}. \qquad (2.3)$$
Here we assume that $K_{\max}$ and $K_0$ are fixed, i.e., they do not increase with the sample size $N$ or $T$.
The testing procedure developed below can be used to determine $K$. Suppose that we have a priori information such that $K_{\min} \le K \le K_{\max}$, where $K_{\min}$ is typically 1. Then we can first test $H_0(K_{\min})$ against $H_1(K_{\min})$. If we fail to reject the null, then we conclude that $K = K_{\min}$. Otherwise, we continue to test $H_0(K_{\min}+1)$ against $H_1(K_{\min}+1)$. We repeat this procedure until we fail to reject the null $H_0(K^*)$ and conclude that $K = K^*$. If we reject $K = K_{\max}$, then we can use the random coefficient model to estimate the model, i.e., $K = N$. This procedure is also used in other contexts, e.g., to determine the number of lags in autoregressive (AR) models, the cointegration rank, the rank of a matrix, and the number of latent factors in panel factor structures (Onatski, 2009).
In theory, $K_{\max}$ can be any finite number. However, in practice, we do not suggest choosing too large a value for $K_{\max}$. Our method is powerful when the number of groups is small. When the number of groups is large, the classification errors can be large. Therefore, if we reject $K = K_{\max}$ for a reasonably large $K_{\max}$, then we can simply adopt the random coefficient model to avoid classification errors.
2.2 Estimation under the null and test statistic
Our test is a residual-based test and so we only need to estimate the model under $H_0(K_0): K = K_0$. In the special case where $K_0 = 1$, the panel structure model reduces to a homogeneous panel data model so that $\beta_i^0 = \beta^0$ for all $i = 1,\dots,N$, and we can estimate the homogeneous slope coefficient using the usual within-group estimator.

In the general case where $K_0 > 1$, we can consider two popular ways to estimate the unknown group structure and the group-specific parameters. For example, we can apply SSP's C-Lasso procedure. Let $\tilde y_{it} = y_{it} - \bar y_{i\cdot}$, where $\bar y_{i\cdot} = T^{-1}\sum_{t=1}^T y_{it}$. Define $\tilde x_{it}$ and $\bar x_{i\cdot}$ analogously. Let $\hat\beta \equiv (\hat\beta_1,\dots,\hat\beta_N)$ and $\hat\alpha \equiv (\hat\alpha_1,\dots,\hat\alpha_{K_0})$ be the C-Lasso estimators proposed in SSP, which are defined as the minimizer of the following criterion function:
$$Q_{1NT,\lambda}^{(K_0)}(\beta,\alpha) = Q_{1NT}(\beta) + \frac{\lambda}{N}\sum_{i=1}^N \prod_{k=1}^{K_0}\|\beta_i - \alpha_k\|, \qquad (2.4)$$
where $\lambda \equiv \lambda_{NT}$ is a tuning parameter and
$$Q_{1NT}(\beta) = \frac{1}{NT}\sum_{i=1}^N\sum_{t=1}^T\left(\tilde y_{it} - \beta_i'\tilde x_{it}\right)^2.$$
Let $\hat G_k = \{i \in \{1,2,\dots,N\}: \hat\beta_i = \hat\alpha_k\}$ for $k = 1,\dots,K_0$. Let $\hat G_0 = \{1,2,\dots,N\}\setminus(\cup_{k=1}^{K_0}\hat G_k)$. Although SSP demonstrate that the number of elements in $\hat G_0$ shrinks to zero as $T \to \infty$, in finite samples $\hat G_0$ may not be empty. To fully impose the null hypothesis $H_0(K_0)$, we can force all the estimates of the slope coefficients to be grouped into $K_0$ groups. Specifically, we classify member $i$ in $\hat G_0$ to group $\hat G_{k^*}$, where $k^* \equiv \arg\min\{\|\hat\beta_i - \hat\alpha_k\|,\; k = 1,\dots,K_0\}$. Our final estimators of the $\beta_i^0$'s are the post-Lasso estimators. Specifically, for each of the classified groups, we re-estimate the homogeneous slope coefficient using the usual within-group estimator. Let $\hat\alpha_k$ ($k = 1,\dots,K_0$) be the estimator for group $\hat G_k$, where, with a slight abuse of notation, we continue to use $\hat\alpha_k$ for the post-Lasso estimate. Then the final estimators of the $\beta_i^0$'s are $\hat\beta \equiv (\hat\beta_1,\dots,\hat\beta_N)$, where $\hat\beta_i = \hat\alpha_k$ for $i \in \hat G_k$.
Alternatively, one can apply the K-means algorithm as advocated by Lin and Ng (2012), BM, Sarafidis and Weber (2015), and Ando and Bai (2016). Let $g_i \in \{1,\dots,K_0\}$ denote the group membership of individual $i$ and $g = (g_1,\dots,g_N)$. The K-means estimates of $\alpha$ and $g$ can be obtained as the minimizer of the following objective function:
$$Q_{NT}^{(K_0)}(\alpha, g) = \frac{1}{NT}\sum_{k=1}^{K_0}\sum_{i:\, g_i = k}\sum_{t=1}^T\left(\tilde y_{it} - \alpha_k'\tilde x_{it}\right)^2. \qquad (2.5)$$
Let $\hat g \equiv (\hat g_1,\dots,\hat g_N)$ denote the K-means estimate of $g$. With a little abuse of notation, we also use $\hat\alpha \equiv (\hat\alpha_1,\dots,\hat\alpha_{K_0})$ and $\hat\beta \equiv (\hat\beta_1,\dots,\hat\beta_N)$ to denote the K-means estimates of $\alpha^0$ and $\beta^0$, where $\hat\beta_i = \hat\alpha_{\hat g_i}$. Define $\hat G_k = \{i \in \{1,2,\dots,N\}: \hat g_i = k\}$ for $k = 1,\dots,K_0$. Note that the K-means algorithm forces all individuals to be classified into one of the $K_0$ groups automatically, so that $\hat G_0$ is an empty set in this case.
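To make the objective in (2.5) concrete, a minimal Lloyd-type iteration, alternating pooled within-group OLS with SSR-based reassignment, can be sketched as follows. This is illustrative code with hypothetical names, not the authors' implementation; `y` is $N \times T$ and `x` is $N \times T \times p$, both assumed already within-demeaned:

```python
import numpy as np

def kmeans_slopes(y, x, K, iters=20, seed=0):
    """Lloyd-type minimization of the objective (2.5): alternate (a) pooled OLS
    for each group-specific slope alpha_k and (b) reassigning each unit i to the
    group whose slope gives it the smallest sum of squared residuals."""
    N, T, p = x.shape
    g = np.random.default_rng(seed).integers(0, K, size=N)  # random initial membership
    alpha = np.zeros((K, p))
    for _ in range(iters):
        for k in range(K):                                   # update: pooled OLS per group
            idx = np.flatnonzero(g == k)
            if idx.size == 0:
                continue                                     # empty group: keep old alpha_k
            X = x[idx].reshape(-1, p)
            Y = y[idx].reshape(-1)
            alpha[k] = np.linalg.lstsq(X, Y, rcond=None)[0]
        ssr = np.array([((y - x @ a) ** 2).sum(axis=1) for a in alpha])  # K x N
        g = ssr.argmin(axis=0)                               # assignment step
    return alpha, g
```

On clean two-group data the iteration separates the units after a single reassignment pass; in practice multiple random starts are advisable, since the objective is non-convex.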
Given $\hat\beta_i$, we can estimate the individual fixed effects using $\hat\mu_i = \frac{1}{T}\sum_{t=1}^T (y_{it} - \hat\beta_i' x_{it})$.¹ The residuals are obtained by
$$\hat u_{it} \equiv y_{it} - \hat\beta_i' x_{it} - \hat\mu_i. \qquad (2.6)$$
It is easy to show that
$$\hat u_{it} = \left(y_{it} - \bar y_{i\cdot}\right) - \left(x_{it} - \bar x_{i\cdot}\right)'\hat\beta_i = u_{it} - \bar u_{i\cdot} + \left(x_{it} - \bar x_{i\cdot}\right)'\left(\beta_i^0 - \hat\beta_i\right), \qquad (2.7)$$
where $\bar u_{i\cdot} = T^{-1}\sum_{t=1}^T u_{it}$. Under the null hypothesis, $\hat\beta_i$ is a consistent estimator of $\beta_i^0$.² Hence, $\hat u_{it}$ should be close to $u_{it} - \bar u_{i\cdot}$. By the assumption that the regressors are predetermined, $x_{it}$ should not have any predictive power for $\hat u_{it}$. Specifically, we assume that the error term $u_{it}$ is a martingale difference sequence (m.d.s.; Assumption A.1(v) below), which implies that $E(u_{it}|x_{it}) = 0$. If we write $\hat u_{it}$ in linear regression form, we have
$$\hat u_{it} = a_i + b_i' x_{it} + v_{it}, \qquad i = 1,\dots,N, \; t = 1,\dots,T,$$
where $a_i$ and $b_i$ are the intercept and slope coefficients, respectively, and $v_{it}$ is the regression error. Then the m.d.s. assumption implies that $b_i = 0$ for all $i$. This motivates us to run the following auxiliary regression model:
$$\hat u_{it} = a_i + b_i' x_{it} + v_{it}, \qquad i = 1,\dots,N, \; t = 1,\dots,T, \qquad (2.8)$$
¹ If $K_0 = 1$, we set $\hat\beta_i = \hat\beta$, the within-group estimator of the homogeneous slope coefficient. Note that we also suppress the dependence of $\hat\beta_i$ on $K_0$.
² Strictly speaking, $\hat\beta_i$ is a consistent estimator of $\beta_i^0$ under the null. But because the cardinality of the set $\hat G_0$ shrinks to zero under the null as $(N,T) \to \infty$, the difference between $\hat\beta_i$ and $\beta_i^0$ is asymptotically negligible.
and test the null hypothesis
$$H_0^*: b_i = 0 \text{ for all } i = 1,\dots,N.$$
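For concreteness, the restricted residuals of (2.6) can be formed directly from any first-stage slope estimates. A numpy sketch, with hypothetical array shapes (`y` is $N \times T$, `x` is $N \times T \times p$, `beta_hat` is $N \times p$):

```python
import numpy as np

def restricted_residuals(y, x, beta_hat):
    """uhat_it = y_it - beta_hat_i'x_it - mu_hat_i, where mu_hat_i is the time
    average of (y_it - beta_hat_i'x_it), as in (2.6)."""
    fitted = np.einsum('itp,ip->it', x, beta_hat)      # beta_hat_i' x_it for each (i, t)
    mu_hat = (y - fitted).mean(axis=1, keepdims=True)  # mu_hat_i
    return y - fitted - mu_hat
```

By construction, each unit's residuals average to zero over time, which is why the auxiliary regression (2.8) is estimated with the intercept concentrated out.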
We construct an LM-type test statistic by concentrating the intercepts out in (2.8). Consider the Gaussian quasi-likelihood function for (2.8):
$$\mathcal L(\phi) = \sum_{i=1}^N \left(\hat u_i - x_i b_i\right)' M_0 \left(\hat u_i - x_i b_i\right),$$
where $\phi \equiv (b_1',\dots,b_N')'$, $\hat u_i \equiv (\hat u_{i1},\dots,\hat u_{iT})'$, $x_i \equiv (x_{i1},\dots,x_{iT})'$, $M_0 \equiv I_T - T^{-1} i_T i_T'$, and $i_T$ is a $T \times 1$ vector of ones. Define the LM statistic as
$$LM_{1NT}(K_0) = \left(-\frac{1}{2}\frac{\partial \mathcal L(0)}{\partial\phi}\right)'\left(-\frac{1}{2}\frac{\partial^2 \mathcal L(0)}{\partial\phi\,\partial\phi'}\right)^{-1}\left(-\frac{1}{2}\frac{\partial \mathcal L(0)}{\partial\phi}\right), \qquad (2.9)$$
where we make the dependence of $LM_{1NT}(K_0)$ on $K_0$ explicit. We can verify that
$$LM_{1NT}(K_0) = \sum_{i=1}^N \hat u_i' M_0 x_i \left(x_i' M_0 x_i\right)^{-1} x_i' M_0 \hat u_i, \qquad (2.10)$$
where the dependence of $LM_{1NT}(K_0)$ on $K_0$ is through that of $\hat u_i$ on $K_0$. We will show that, after being appropriately scaled and centered, $LM_{1NT}(K_0)$ is asymptotically normally distributed under $H_0(K_0)$ and diverges to infinity under $H_1(K_0)$.
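In matrix form, (2.10) is straightforward to compute. An illustrative numpy sketch (not the authors' code), with `uhat` the $N \times T$ matrix of restricted residuals and `x` the $N \times T \times p$ regressor array:

```python
import numpy as np

def lm_stat(uhat, x):
    """LM statistic of (2.10): sum over i of uhat_i'M0 x_i (x_i'M0 x_i)^{-1} x_i'M0 uhat_i."""
    N, T, p = x.shape
    M0 = np.eye(T) - np.ones((T, T)) / T         # M0 = I_T - i_T i_T' / T
    lm = 0.0
    for i in range(N):
        xm = M0 @ x[i]                           # T x p within-demeaned regressors
        um = M0 @ uhat[i]                        # demeaned residuals
        s = xm.T @ um                            # score x_i' M0 uhat_i
        lm += s @ np.linalg.solve(xm.T @ xm, s)  # quadratic form with (x_i'M0 x_i)^{-1}
    return lm
```

Each summand is a quadratic form in a projection of $\hat u_i$, so the statistic is nonnegative, and it is identically zero when the demeaned residuals are orthogonal to the demeaned regressors.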
Remark 2.1. We have included a constant term in the regression in (2.8). Under the assumption that $E(u_{it}) = 0$ and that $N$ and $T$ pass to infinity jointly, one can also omit the constant term and obtain the following LM test statistic:
$$LM_{1NT}^0(K_0) = \sum_{i=1}^N \hat u_i' x_i \left(x_i' x_i\right)^{-1} x_i' \hat u_i. \qquad (2.11)$$
The asymptotic distribution of $LM_{1NT}^0(K_0)$ can be similarly studied with little modification. In case $T$ is not very large, as in our empirical applications, we recommend including a constant term in the auxiliary regression in (2.8), and thus we only focus on the study of $LM_{1NT}(K_0)$ below.
3 Asymptotic properties
In this section we first present a set of assumptions that are necessary for the asymptotic analysis, and then study the asymptotic distributions of $LM_{1NT}(K_0)$ under both $H_0(K_0)$ and $H_1(K_0)$.
3.1 Assumptions
Let $\|\cdot\|_q \equiv [E(\|\cdot\|^q)]^{1/q}$ for $q \ge 1$. Let $\Omega_{i,T} \equiv T^{-1} x_i' M_0 x_i$ and $\Omega_i \equiv E(\Omega_{i,T})$. Define $\mathcal F_{i,t} \equiv \sigma(x_{i,t+1}, u_{it}, x_{it}, u_{i,t-1}, x_{i,t-1}, \dots)$. Let $C < \infty$ be a generic constant that may vary across lines. Following SSP, we define the following two types of classification errors:
$$\hat E_{kNT,i} = \left\{i \in \hat G_k \mid i \notin G_k^0\right\} \quad \text{and} \quad \hat F_{kNT,i} = \left\{i \in G_k^0 \mid i \notin \hat G_k\right\}, \qquad (3.1)$$
where $i = 1,\dots,N$ and $k = 1,\dots,K_0$. Let $\hat E_{kNT} = \cup_{i\in G_k^0}\hat E_{kNT,i}$ and $\hat F_{kNT} = \cup_{i\in \hat G_k}\hat F_{kNT,i}$.
We make the following assumptions.
Assumption A.1. (i) $\max_{1\le i\le N}\max_{1\le t\le T}\|w_{it}\|_{8+4\sigma} \le C$ for some $\sigma > 0$, for $w_{it} = x_{it}$ and $u_{it}$.
(ii) There exist positive constants $\underline c_{\Omega}$ and $\bar c_{\Omega}$ such that $\underline c_{\Omega} \le \min_{1\le i\le N}\mu_{\min}(\Omega_i) \le \max_{1\le i\le N}\mu_{\max}(\Omega_i) \le \bar c_{\Omega}$.
(iii) For each $i = 1,\dots,N$, $\{(x_{it}, u_{it}): t = 1,2,\dots\}$ is a strong mixing process with mixing coefficients $\alpha_i(\cdot)$. $\alpha(\cdot) \equiv \max_{1\le i\le N}\alpha_i(\cdot)$ satisfies $\alpha(\tau) = O(\tau^{-\kappa})$, where $\kappa = 3(2+\sigma)/\sigma + \epsilon$ for some $\epsilon > 0$. In addition, there exist integers $\tau_0, \tau^* \in (1,T)$ such that $\alpha(\tau_0) = o((NT^{1/2})^{-1})$, $NT^{1/2}\alpha(\tau^*)^{(1+\sigma)/(2+\sigma)} = o(1)$, and $N^{1/2}T^{-1/2}\tau^* = o(1)$.
(iv) Let $u_i \equiv (u_{i1},\dots,u_{iT})'$. $(x_i, u_i)$, $i = 1,\dots,N$, are mutually independent of each other.
(v) For each $i = 1,\dots,N$, $E(u_{it}|\mathcal F_{i,t-1}) = 0$ a.s.
Assumption A.2. Under $H_0(K_0)$, we have
(i) $N_k/N \to c_k \in (0,1)$ for each $k = 1,\dots,K_0$ as $N \to \infty$;
(ii) $\sum_{k=1}^{K_0}\|\hat\alpha_k - \alpha_k^0\|^2 = O_p\left((NT)^{-1} + T^{-2}\right)$;
(iii) $P\left(\cup_{k=1}^{K_0}\hat E_{kNT}\right) = o(1)$ and $P\left(\cup_{k=1}^{K_0}\hat F_{kNT}\right) = o(1)$.

Assumption A.3. There exist finite nonnegative numbers $c_1$ and $c_2$ such that $\limsup_{(N,T)\to\infty} N\log(NT)/T^2 = c_1$ and $\limsup_{(N,T)\to\infty} \log(NT)\, N^{(3+\sigma)/(4+2\sigma)} T^{-(5+3\sigma)/(4+2\sigma)} = c_2$.
A.1(i) imposes moment conditions on $x_{it}$ and $u_{it}$. A.1(ii) requires that $\Omega_i$ be positive definite uniformly in $i$. A.1(iii) requires that each individual time series $\{(x_{it}, u_{it}): t = 1,2,\dots\}$ be strong mixing. This condition can be verified if $x_{it}$ does not contain lagged dependent variables, no matter whether one treats the fixed effects $\mu_i$ as random or fixed. In the case of dynamic panel data models, Hahn and Kuersteiner (2011) assume that the $\mu_i$'s are nonrandom and uniformly bounded, in which case the strong mixing condition can also be verified. In the case of random fixed effects, they suggest adopting the concept of conditional strong mixing, where the mixing coefficient is defined by conditioning on the fixed effects. The dependence of the mixing rate $\kappa$ on $\sigma$ defined in A.1(i) reflects the trade-off between the degree of dependence and the moment bounds of the process $\{(x_{it}, u_{it}): t \ge 1\}$. The last set of conditions in A.1(iii) can easily be met. In particular, if the process is strong mixing with a geometric mixing rate, the conditions on $\alpha(\cdot)$ can be met simply by specifying $\tau_0 = \tau^* = \lfloor C\log T \rfloor$ for some sufficiently large $C$, where $\lfloor\cdot\rfloor$ denotes the integer part of a real number. A.1(iv) rules out cross-sectional dependence among the $(x_i, u_i)$'s and greatly facilitates our asymptotic analysis. A.1(v) requires that the error term $u_{it}$ be a martingale difference sequence (m.d.s.) with respect to the filtration $\mathcal F_{i,t}$, which allows for lagged dependent variables in $x_{it}$ and conditional heteroskedasticity, skewness, or kurtosis of unknown form in $u_{it}$.³
A.2(i) is typically imposed in the literature on panel data models with latent group structures; see BM, Ando and Bai (2016), and SSP, who provide rigorous asymptotic analyses of clustering estimators in different contexts. It implies that each group has asymptotically non-negligible membership as $N \to \infty$. A.2(ii)-(iii) can be verified for either SSP's C-Lasso estimators or the K-means algorithm estimators under suitable conditions. Note that neither BM nor Ando and Bai (2016) study the same model as ours, but it is not difficult to extend their analysis to our model and verify A.2(ii)-(iii). Our model is a special case of that considered by SSP. We can readily verify A.2(ii)-(iii) under Assumption A.1 and the following additional conditions: (1) $N^{1/2}T^{-1}(\ln T)^9 \to 0$ and $N^2 T^{-(3+2\sigma)} \to c \in [0,\infty)$ as $(N,T) \to \infty$, and (2) $T\lambda^2(\ln T)^{-(6+2\sigma)} \to \infty$ and $\lambda(\ln T)^{\nu} \to 0$ for some $\nu > 0$ as $(N,T) \to \infty$. Note that we allow $x_{it}$ to include lagged dependent variables, in which case both SSP's post-Lasso estimators $\hat\alpha_k$ and the K-means algorithm estimators have an asymptotic bias of order $O(T^{-1})$. In the absence of dynamics in the model, one would expect that $\hat\alpha_k - \alpha_k^0 = O_p((NT)^{-1/2})$.
A.3 imposes conditions on the rates at which $N$ and $T$ pass to infinity and on the interaction between $(N,T)$ and $\sigma$. Note that we allow $N$ and $T$ to pass to infinity at either identical or suitably restricted different rates. The appearance of the logarithm terms is due to the use of a Bernstein inequality for strong mixing processes. If the mixing process $\{(x_{it}, u_{it}): t \ge 1\}$ has a geometric decay rate, one can take an arbitrarily small $\sigma$ in A.1(i). In this case, A.3 puts the most stringent restrictions on $(N,T)$ by passing $\sigma \to 0$: $N/T^{5/3} \to 0$ as $(N,T) \to \infty$, ignoring the logarithm term. On the other hand, if $\sigma \ge 1$ in A.1(i), then the second condition in A.3 becomes redundant given the first condition. In the case of conventional panel data models with strictly exogenous regressors only, Pesaran and Yamagata (2008) require that either $\sqrt N/T \to 0$ or $\sqrt N/T^2 \to 0$ for two of their tests; but for stationary dynamic panel data models, they prove the asymptotic validity of their test only under the condition that $N/T \to c \in [0,\infty)$.
³ If the error terms are serially correlated in a static panel, it is well known that by adding lagged dependent and independent variables to the model, one can potentially ameliorate problems caused by such serial correlation; see Su and Chen (2013, p. 1090). For dynamic panel data models (e.g., the panel AR(1) model), we cannot allow for serial correlation in the error terms (e.g., AR(1) errors); otherwise, the error terms will be correlated with the lagged dependent variables, causing an endogeneity issue that we do not address in this paper.
3.2 Asymptotic null distribution
Let $h_{i,ts}$ denote the $(t,s)$th element of $H_i \equiv M_0 x_i (x_i' M_0 x_i)^{-1} x_i' M_0$. Let $x_{it}^\dagger \equiv x_{it} - T^{-1}\sum_{s=1}^T E(x_{is})$ and $z_{it} \equiv \Omega_i^{-1/2} x_{it}^\dagger$. Define
$$\mathbb B_{NT} \equiv \frac{1}{\sqrt N\, T}\sum_{i=1}^N\sum_{t=1}^T z_{it}' z_{it} u_{it}^2 \quad \text{and} \quad \mathbb V_{NT} \equiv \frac{4}{NT^2}\sum_{i=1}^N\sum_{t=2}^T E\left[u_{it} z_{it}' \sum_{s=1}^{t-1} z_{is} u_{is}\right]^2. \qquad (3.2)$$
The following theorem studies the asymptotic null distribution of the infeasible statistic $J_{1NT}(K_0)$.

Theorem 3.1 Suppose Assumptions A.1-A.3 hold. Then under $H_0(K_0)$,
$$J_{1NT}(K_0) \equiv \frac{N^{-1/2} LM_{1NT}(K_0) - \mathbb B_{NT}}{\sqrt{\mathbb V_{NT}}} \xrightarrow{d} N(0,1) \quad \text{as } (N,T) \to \infty.$$
The proof of the above theorem is tedious and is relegated to the appendix. To implement the test, we need consistent estimates of both $\mathbb B_{NT}$ and $\mathbb V_{NT}$. Let $\hat z_{it} = \hat\Omega_i^{-1/2}(x_{it} - T^{-1}\sum_{s=1}^T x_{is})$, where $\hat\Omega_i \equiv \Omega_{i,T} = T^{-1} x_i' M_0 x_i$, and let $\hat z_i \equiv (\hat z_{i1},\dots,\hat z_{iT})'$. Note that $\hat z_i = M_0 x_i \hat\Omega_i^{-1/2}$. We propose to estimate $\mathbb B_{NT}$ by
$$\hat{\mathbb B}_{NT}(K_0) = \frac{1}{\sqrt N\, T}\sum_{i=1}^N\sum_{t=1}^T \hat z_{it}' \hat z_{it} \hat u_{it}^2 \qquad (3.3)$$
and $\mathbb V_{NT}$ by
$$\hat{\mathbb V}_{NT}(K_0) = \frac{4}{NT^2}\sum_{i=1}^N\sum_{t=2}^T\left[\hat u_{it} \hat z_{it}' \sum_{s=1}^{t-1} \hat z_{is} \hat u_{is}\right]^2. \qquad (3.4)$$
Define
$$\hat J_{1NT}(K_0) \equiv \frac{N^{-1/2} LM_{1NT}(K_0) - \hat{\mathbb B}_{NT}(K_0)}{\sqrt{\hat{\mathbb V}_{NT}(K_0)}}. \qquad (3.5)$$
The following theorem establishes the consistency of $\hat{\mathbb B}_{NT}(K_0)$ and $\hat{\mathbb V}_{NT}(K_0)$ and the asymptotic distribution of $\hat J_{1NT}(K_0)$ under $H_0(K_0)$.
Theorem 3.2 Suppose Assumptions A.1-A.3 hold. Then under $H_0(K_0)$, $\hat{\mathbb B}_{NT}(K_0) = \mathbb B_{NT} + o_p(1)$, $\hat{\mathbb V}_{NT}(K_0) = \mathbb V_{NT} + o_p(1)$, and $\hat J_{1NT}(K_0) \xrightarrow{d} N(0,1)$ as $(N,T) \to \infty$.

Theorem 3.2 implies that the test statistic $\hat J_{1NT}(K_0)$ is asymptotically pivotal under $H_0(K_0)$, and we reject $H_0(K_0)$ for a sufficiently large value of $\hat J_{1NT}(K_0)$.
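Putting (2.10) and (3.3)-(3.5) together, the standardized statistic can be sketched as follows. This follows the sample formulas as written above and is an illustration only, not the authors' implementation; `uhat` is the $N \times T$ matrix of restricted residuals and `x` the $N \times T \times p$ regressor array:

```python
import numpy as np

def j_stat(uhat, x):
    """Feasible standardized statistic Jhat_1NT(K0) of (3.5), built from the
    sample quantities in (3.3)-(3.4); a sketch under the formulas above."""
    N, T, p = x.shape
    lm, B, V = 0.0, 0.0, 0.0
    for i in range(N):
        xd = x[i] - x[i].mean(axis=0)               # x_it - xbar_i (within-demeaned)
        u = uhat[i] - uhat[i].mean()                # M0 * uhat_i
        Om = xd.T @ xd / T                          # Omega_hat_i = T^{-1} x_i'M0 x_i
        w, P = np.linalg.eigh(Om)                   # inverse square root via eigh
        z = xd @ (P * w ** -0.5) @ P.T              # rows zhat_it = Omega_hat_i^{-1/2} (x_it - xbar_i)
        s = z.T @ u                                 # sum_t zhat_it * uhat_it
        lm += (s @ s) / T                           # i-th term of LM_1NT in (2.10)
        B += ((z ** 2).sum(axis=1) * u ** 2).sum()  # i-th term of Bhat (3.3), unscaled
        c = np.cumsum(z * u[:, None], axis=0)       # partial sums sum_{s<=t} zhat_is uhat_is
        V += ((u[1:] * (z[1:] * c[:-1]).sum(axis=1)) ** 2).sum()
    Bhat = B / (np.sqrt(N) * T)
    Vhat = 4.0 * V / (N * T ** 2)
    return (lm / np.sqrt(N) - Bhat) / np.sqrt(Vhat)
```

The bias correction removes exactly the "diagonal" terms of each quadratic form, so the numerator is a self-normalized sum of martingale-difference cross-products, which is what delivers the standard normal limit in Theorem 3.2.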
We obtain the distributional results in Theorems 3.1 and 3.2 despite the fact that the individual effects $\mu_i$ can only be estimated at the rate $T^{-1/2}$, which is slower than the rate $(NT)^{-1/2}$ or $(NT)^{-1/2} + T^{-1}$ at which the group-specific parameter estimates $\hat\alpha_k$, $k = 1,\dots,K_0$, converge to their true values under $H_0(K_0)$. The slow convergence rate of the individual effect estimates does not have adverse asymptotic effects on the estimation of the bias term $\mathbb B_{NT}$, the variance term $\mathbb V_{NT}$, or the asymptotic distribution of $\hat J_{1NT}(K_0)$. Nevertheless, they can play an important role in finite samples, which we verify through Monte Carlo simulations.
3.3 Consistency
Let $\mathcal G_K = \{\{G_1,\dots,G_K\}: \cup_{k=1}^K G_k = \{1,\dots,N\} \text{ and } G_j \cap G_k = \emptyset \text{ for any } j \neq k\}$. That is, $\mathcal G_K$ denotes the class of all possible $K$-group partitions of $\{1,\dots,N\}$. To study the consistency of our test, we add the following assumption.
Assumption A.4. (i) $N^{-1}\sum_{i=1}^N\|\beta_i^0\|^2 = O_p(1)$.
(ii) $\inf_{\{G_1,\dots,G_{K_0}\}\in\mathcal G_{K_0}}\min_{(\alpha_1,\dots,\alpha_{K_0})} N^{-1}\sum_{k=1}^{K_0}\sum_{i\in G_k}\|\beta_i^0 - \alpha_k\|^2 \xrightarrow{p} c_0 > 0$ as $N \to \infty$.
A.4(i) is trivially satisfied if the $\beta_i^0$'s are uniformly bounded or random with finite second moments. A.4(ii) essentially says that one cannot group the parameter vectors $\{\beta_i^0, 1 \le i \le N\}$ into $K_0$ groups by leaving out an insignificant number of unclassified individuals. It is satisfied for a variety of global alternatives:

1. The number of groups is $K = K_0 + r$ for some positive integer $r$ such that $N_k/N \to c_k \in (0,1)$ for each $k = 1,\dots,K_0+r$.

2. There is no grouped pattern among $\{\beta_i^0, 1 \le i \le N\}$, such that we have a completely heterogeneous population of individuals.

3. The regression model is actually a random coefficient model: $\beta_i^0 = \beta^0 + v_i$, where $\beta^0$ is a fixed parameter vector, and the $v_i$'s are independent and identical draws from a continuous distribution with zero mean and finite variance.

4. The regression model is a hierarchical random coefficient model:
$$\beta_i^0 = \begin{cases} \alpha_1^0 + v_{i1} & \text{if } i \in G_1^0, \\ \quad\vdots & \quad\vdots \\ \alpha_K^0 + v_{iK} & \text{if } i \in G_K^0, \end{cases}$$
where $G_1^0,\dots,G_K^0$ are defined as before, the $v_{ik}$'s (for $k = 1,\dots,K$) are independent and identical draws from a continuous distribution with zero mean and finite variance, and $K$ may be different from $K_0$.
The following theorem establishes the consistency of $\hat J_{1NT}(K_0)$.

Theorem 3.3 Suppose Assumptions A.1 and A.3-A.4 hold. Then under $H_1(K_0)$, with possibly diverging $K_{\max}$ and random coefficients, $P(\hat J_{1NT}(K_0) \ge c_{NT}) \to 1$ as $(N,T) \to \infty$ for any non-stochastic sequence $c_{NT} = o(N^{1/2}T)$.

The above theorem indicates that our test statistic $\hat J_{1NT}(K_0)$ is divergent at the rate $N^{1/2}T$ under $H_1(K_0)$ and thus has power to detect any alternatives such that A.4 is satisfied.
Remark 3.1. Our asymptotic theories here are "pointwise," as the unknown parameters are treated as fixed. We do not consider the uniformity issue, and so our procedures may suffer from the same problem as other post-selection inference procedures (see, e.g., Leeb and Pötscher, 2005, 2008, 2009, and Schneider and Pötscher, 2009). This is a well-known challenge in the literature on model selection and pretest estimation. Although uniform inference is a very important question, developing a thorough theory for it is beyond the scope of this paper.
4 Time fixed effects
In applications, we may also consider the model with both individual and time fixed effects:
$$y_{it} = \beta_i^{0\prime} x_{it} + \mu_i + \gamma_t + u_{it}, \qquad i = 1,\dots,N \text{ and } t = 1,\dots,T, \qquad (4.1)$$
where $\gamma_t$ is the time fixed effect, the other variables are defined as above, and we assume that the $\beta_i^0$'s have the unknown group structure defined in (2.2). We treat $\mu_i$ and $\gamma_t$ as unknown fixed parameters that are not separately identifiable. Despite this, it is possible to estimate the group-specific parameters $\alpha_k^0$ and identify each individual's group membership consistently.
4.1 Estimation under the null and test statistic
When $\beta_i^0 = \beta^0$ for all $i = 1,\dots,N$, one can follow Hsiao (2014, Chapter 3.6) and sweep out both the individual and time effects from (4.1) via suitable transformations. We do a similar thing in the presence of heterogeneous slope coefficients $\beta_i^0$. As before, we first eliminate the individual effects $\mu_i$ in (4.1) via the within-group transformation:
$$\tilde y_{it} = \beta_i^{0\prime}\tilde x_{it} + \tilde\gamma_t + \tilde u_{it}, \qquad (4.2)$$
where $\tilde y_{it} = y_{it} - \bar y_{i\cdot}$, $\tilde\gamma_t = \gamma_t - \bar\gamma$, and $\bar\gamma = T^{-1}\sum_{t=1}^T \gamma_t$. Then we eliminate $\tilde\gamma_t$ from the above model to obtain
$$\check y_{it} = \beta_i^{0\prime}\tilde x_{it} - \frac{1}{N}\sum_{j=1}^N \beta_j^{0\prime}\tilde x_{jt} + \check u_{it}, \qquad (4.3)$$
where $\check y_{it} = y_{it} - \bar y_{i\cdot} - \bar y_{\cdot t} + \bar y$, $\bar y_{\cdot t} = \frac{1}{N}\sum_{i=1}^N y_{it}$, $\bar y = \frac{1}{NT}\sum_{i=1}^N\sum_{t=1}^T y_{it}$, and $\bar u_{\cdot t}$ and $\check u_{it}$ are similarly defined. Under $H_0(K_0): K = K_0$, we can follow SSP and estimate $\beta \equiv (\beta_1,\dots,\beta_N)$ and $\alpha \equiv (\alpha_1,\dots,\alpha_{K_0})$ by minimizing the following criterion function:
$$Q_{2NT,\lambda}^{(K_0)}(\beta,\alpha) = Q_{2NT}(\beta) + \frac{\lambda}{N}\sum_{i=1}^N \prod_{k=1}^{K_0}\|\beta_i - \alpha_k\|, \qquad (4.4)$$
where
$$Q_{2NT}(\beta) = \frac{1}{NT}\sum_{i=1}^N\sum_{t=1}^T\left(\check y_{it} - \beta_i'\tilde x_{it} + \frac{1}{N}\sum_{j=1}^N \beta_j'\tilde x_{jt}\right)^2.$$
Let $\hat\beta \equiv (\hat\beta_1,\dots,\hat\beta_N)$ and $\hat\alpha \equiv (\hat\alpha_1,\dots,\hat\alpha_{K_0})$ denote the estimates of $\beta$ and $\alpha$, respectively. If necessary, we can estimate $\mu_i$ by $\tilde\mu_i = \frac{1}{T}\sum_{t=1}^T(y_{it} - \hat\beta_i' x_{it})$ for $i = 1,\dots,N$. Given $\hat\beta$ and $\hat\alpha$, we classify the individuals into $K_0$ groups as in Section 2. As before, we use $\hat G_1,\dots,\hat G_{K_0}$ to denote the $K_0$ estimated groups and $\hat N_k$ to denote the cardinality of $\hat G_k$.
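For a balanced panel, the two eliminations in (4.2)-(4.3) amount to subtracting unit means and time means and adding back the grand mean. A small illustrative numpy helper:

```python
import numpy as np

def two_way_demean(a):
    """Return a_it - abar_i. - abar_.t + abar for an N x T array a, i.e. the
    transformation taking y_it to the two-way demeaned variable in (4.3)."""
    return (a - a.mean(axis=1, keepdims=True)
              - a.mean(axis=0, keepdims=True) + a.mean())
```

This applies directly to $y_{it}$ and $u_{it}$ in (4.3); note that with heterogeneous slopes the regressor part does not simplify the same way and keeps the form $\beta_i'\tilde x_{it} - N^{-1}\sum_j \beta_j'\tilde x_{jt}$.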
The post-Lasso estimates can be obtained in two ways. One is to pool all the observations within each estimated group and estimate the group-specific parameters for each group separately after we demean over time and across individuals within the group. In this way, one can easily work out the standard error for each group-specific estimate as usual. Alternatively, we consider estimation based on (4.3) by using the estimated group membership. Assuming all individuals are classified into one of the $K_0$ groups, we obtain the post-Lasso estimates $\hat\alpha_k$ of $\alpha_k^0$ as the solution to the following minimization problem:
$$\min_{\alpha_1,\dots,\alpha_{K_0}} \frac{1}{NT}\sum_{k=1}^{K_0}\sum_{i\in\hat G_k}\sum_{t=1}^T\left(\check y_{it} - \alpha_k'\tilde x_{it} + \frac{1}{N}\sum_{l=1}^{K_0}\sum_{j\in\hat G_l}\alpha_l'\tilde x_{jt}\right)^2. \qquad (4.5)$$
Let $\hat\beta_i = \hat\alpha_k$ for $i \in \hat G_k$ and $k = 1,\dots,K_0$, and let $\hat\alpha \equiv (\hat\alpha_1,\dots,\hat\alpha_{K_0})$.

Given the above post-Lasso estimates, we define the restricted residuals as
$$\hat u_{it} = \check y_{it} - \hat\beta_i'\tilde x_{it} + \frac{1}{N}\sum_{j=1}^N \hat\beta_j'\tilde x_{jt}. \qquad (4.6)$$
Following the analysis in Section 2.2, we propose to test $H_0(K_0)$ by using the following LM-type test statistic:
$$LM_{2NT}(K_0) = \sum_{i=1}^N \hat u_i' M_0 x_i \left(x_i' M_0 x_i\right)^{-1} x_i' M_0 \hat u_i,$$
where $\hat u_i = (\hat u_{i1},\dots,\hat u_{iT})'$. We will show that, after being appropriately centered and scaled, $N^{-1/2}LM_{2NT}(K_0)$ follows the same asymptotic distribution as $N^{-1/2}LM_{1NT}(K_0)$ under the null.
4.2 Asymptotic properties of the C-Lasso estimates
Since SSP do not allow for time fixed effects in their model, their asymptotic theory does not apply to our C-Lasso estimates of the parameters and group identities. Fortunately, we can study the preliminary convergence rates of various estimates under Assumptions A.1, A.2(i), and A.3.
Theorem 4.1 Suppose that Assumptions A.1, A.2(i), and A.3 hold. Then under $H_0(K_0)$,
(i) $\frac{1}{N}\sum_{i=1}^{N}\big\|\hat\beta_i - \beta_i^0\big\|^2 = O_P(T^{-1})$;
(ii) $\big(\hat\alpha_{(1)}, \ldots, \hat\alpha_{(K_0)}\big) - \big(\alpha_1^0, \ldots, \alpha_{K_0}^0\big) = O_P(T^{-1/2})$;
(iii) $\max_{1\le i\le N}\big\|\hat\beta_i - \beta_i^0\big\| = O_P(c_{NT} + \lambda)$,
where $c_{NT} = \max(N,T)^{1/(4+2\sigma)}\log(N\vee T)\,(\log(N\vee T)/T)^{1/2}$ and $\big(\hat\alpha_{(1)}, \ldots, \hat\alpha_{(K_0)}\big)$ is some suitable permutation of $\big(\hat\alpha_1, \ldots, \hat\alpha_{K_0}\big)$.
Theorems 4.1(i)-(iii) establish the mean square convergence rate of the $\hat\beta_i$'s, the convergence rate of the group-specific estimates, and the uniform convergence rate of the $\hat\beta_i$'s, respectively. The findings are similar to those in SSP. In particular, the convergence rates of the estimates of the group-specific parameters do not depend on $N$. Note that $c_{NT} = O\big((\log(N\vee T)/T)^{1/2}\big)$ under the additional restriction that $\limsup_{(N,T)\to\infty} N T^{-(1+\sigma)} = c \in [0,\infty)$. For notational simplicity, hereafter we simply write $\hat\alpha_k$ for $\hat\alpha_{(k)}$ as the consistent estimator of $\alpha_k^0$.
Define the two types of classification errors $\hat{E}_{kNT}$ and $\hat{F}_{kNT}$ as before. To study the consistency of our classification method, we add the following assumption.

Assumption A.5. (i) $N^{1/2}T^{-1}(\ln T)^{9} \to 0$ and $\limsup_{(N,T)\to\infty} N T^{-(1+\sigma)} = c \in [0,\infty)$; (ii) $T\lambda^{2}(\ln T)^{-(6+2\nu)} \to \infty$ and $\lambda^{4}T(\ln T)^{\nu} \to 0$ for some $\nu > 0$ as $(N,T)\to\infty$.

Assumptions A.5(i)-(ii) strengthen the conditions on $N$, $T$, and $\lambda$. Note that A.5(ii) implies that it suffices to choose $\lambda \propto T^{-a}$ for some $a \in (1/4, 1/2)$. The following theorem establishes the uniform consistency of our classification method.
Theorem 4.2 Suppose that Assumptions A.1, A.2(i), A.3, and A.5 hold. Then under $H_0(K_0)$,
(i) $P\big(\cup_{k=1}^{K_0}\hat{E}_{kNT}\big) \le \sum_{k=1}^{K_0} P\big(\hat{E}_{kNT}\big) \to 0$ as $(N,T)\to\infty$;
(ii) $P\big(\cup_{k=1}^{K_0}\hat{F}_{kNT}\big) \le \sum_{k=1}^{K_0} P\big(\hat{F}_{kNT}\big) \to 0$ as $(N,T)\to\infty$.
Given the results in Theorems 4.1 and 4.2, we will show that, under some regularity conditions, the post-Lasso estimators are asymptotically oracle-efficient in the sense that they are as efficient as the infeasible estimators

$$\tilde{\boldsymbol{\alpha}}^{(K_0)} \equiv \arg\min_{\boldsymbol{\alpha}^{(K_0)}} \frac{1}{NT}\sum_{k=1}^{K_0}\sum_{i\in G_k^0}\sum_{t=1}^{T}\Big(\tilde{y}_{it} - \alpha_k'\tilde{x}_{it} + \frac{1}{N}\sum_{l=1}^{K_0}\sum_{j\in G_l^0}\alpha_l'\tilde{x}_{jt}\Big)^{2}. \qquad (4.7)$$
Let $\tilde{\boldsymbol{\alpha}}^{(K_0)} = (\tilde\alpha_1, \ldots, \tilde\alpha_{K_0})$. Let $\mathbb{Q}_{k} = \frac{1}{N_kT}\sum_{i\in G_k^0}\tilde{X}_i'\tilde{X}_i$, $\mathbb{Q}_{kl} = \frac{1}{NN_kT}\sum_{i\in G_k^0}\sum_{j\in G_l^0}\tilde{X}_i'\tilde{X}_j$, and $\mathbb{V}_{k} = \frac{1}{N_kT}\sum_{i\in G_k^0}\breve{X}_i'u_i$ for $k, l = 1, \ldots, K_0$, where $\breve{X}_i = \tilde{X}_i - N^{-1}\sum_{j\in G_k^0}\tilde{X}_j$ for $i \in G_k^0$ and $u_i = (u_{i1}, \ldots, u_{iT})'$. Define

$$\mathbb{Q} = \begin{pmatrix} \mathbb{Q}_1 - \mathbb{Q}_{11} & -\mathbb{Q}_{12} & \cdots & -\mathbb{Q}_{1K_0} \\ -\mathbb{Q}_{21} & \mathbb{Q}_2 - \mathbb{Q}_{22} & \cdots & -\mathbb{Q}_{2K_0} \\ \vdots & \vdots & \ddots & \vdots \\ -\mathbb{Q}_{K_01} & -\mathbb{Q}_{K_02} & \cdots & \mathbb{Q}_{K_0} - \mathbb{Q}_{K_0K_0} \end{pmatrix} \quad \text{and} \quad \mathbb{V} = \begin{pmatrix} \mathbb{V}_1 \\ \mathbb{V}_2 \\ \vdots \\ \mathbb{V}_{K_0} \end{pmatrix}. \qquad (4.8)$$

Let $\hat{\mathbb{Q}}$ and $\hat{\mathbb{V}}$ be defined analogously to $\mathbb{Q}$ and $\mathbb{V}$ with the $G_k^0$'s and $N_k$'s replaced by the $\hat{G}_k$'s and $\hat{N}_k$'s. We can verify that

$$\mathrm{vec}\big(\tilde{\boldsymbol{\alpha}}^{(K_0)} - \boldsymbol{\alpha}^{0(K_0)}\big) = \mathbb{Q}^{-1}\mathbb{V} \quad \text{and} \quad \mathrm{vec}\big(\hat{\boldsymbol{\alpha}}^{(K_0)} - \boldsymbol{\alpha}^{0(K_0)}\big) = \hat{\mathbb{Q}}^{-1}\hat{\mathbb{V}},$$

where $\boldsymbol{\alpha}^{0(K_0)} \equiv (\alpha_1^0, \ldots, \alpha_{K_0}^0)$.
Let $\mathbb{B}_k \equiv \frac{1}{N_kT^2}\sum_{i\in G_k^0}\sum_{t=1}^{T}\sum_{s=1}^{T}\mathrm{E}\big(x_{it}^{\perp}u_{is}\big)$, $\mathbb{U}_k \equiv \frac{1}{N_kT}\sum_{i\in G_k^0}\sum_{t=1}^{T}x_{it}^{\perp}u_{it}$, and $\Omega_k = \frac{1}{N_kT}\sum_{i\in G_k^0}\sum_{t=1}^{T}\mathrm{E}\big(x_{it}^{\perp}x_{it}^{\perp\prime}u_{it}^{2}\big)$ for $k = 1, \ldots, K_0$, where $x_{it}^{\perp} = \tilde{x}_{it} - \big(\bar{x}_{\cdot t}^{(k)} - \bar{x}^{(k)}\big)$, $\bar{x}_{\cdot t}^{(k)} = \frac{1}{N_k}\sum_{i\in G_k^0}\tilde{x}_{it}$, and $\bar{x}^{(k)} = \frac{1}{T}\sum_{t=1}^{T}\bar{x}_{\cdot t}^{(k)}$.4 Let $\mathbb{B} = \big(\mathbb{B}_1', \ldots, \mathbb{B}_{K_0}'\big)'$, $\mathbb{U} = \big(\mathbb{U}_1', \ldots, \mathbb{U}_{K_0}'\big)'$, and $\Omega = \mathrm{diag}(\Omega_1, \ldots, \Omega_{K_0})$.
We add the following assumption.
Assumption A.6. (i) $\mathbb{Q} \to \mathbb{Q}_0 > 0$ as $(N,T)\to\infty$;
(ii) $\sqrt{NT}\,\mathbb{U} \xrightarrow{d} N(0, \Omega_0)$ as $(N,T)\to\infty$, where $\Omega_0 = \lim_{(N,T)\to\infty}\Omega$.

Assumption A.6 imposes conditions to ensure the asymptotic normality of the post-Lasso estimators of the group-specific parameters. When $K_0 = 1$, $\mathbb{Q}$ reduces to $\mathbb{Q}_1 - \mathbb{Q}_{11}$, which is required to be asymptotically nonsingular for the usual fixed effects estimator in a two-way error component model to be asymptotically normal.
Theorem 4.3 Suppose Assumptions A.1, A.2(i), A.3, A.5, and A.6 hold. Then under $H_0(K_0)$,
(i) $\hat\alpha_k - \alpha_k^0 = \tilde\alpha_k - \alpha_k^0 + o_P\big((NT)^{-1/2}\big) = O_P\big((NT)^{-1/2} + T^{-1}\big)$ for $k = 1, \ldots, K_0$;
(ii) $\sqrt{NT}\left[\mathrm{vec}\big(\hat{\boldsymbol{\alpha}}^{(K_0)} - \boldsymbol{\alpha}^{0(K_0)}\big) + \mathbb{Q}^{-1}\mathbb{B}\right] \xrightarrow{d} N\big(0,\ \mathbb{Q}_0^{-1}\Omega_0\mathbb{Q}_0^{-1}\big)$.
Theorem 4.3(i) indicates that $\hat\alpha_k$ is asymptotically equivalent to the infeasible estimator $\tilde\alpha_k$. It also reports the convergence rate of $\hat\alpha_k$ and implies that $\frac{1}{N}\sum_{i=1}^{N}\big\|\hat\beta_i - \beta_i^0\big\|^2 = O_P\big((NT)^{-1} + T^{-2}\big)$. Theorem 4.3(ii) reports the asymptotic distribution of $\hat{\boldsymbol{\alpha}}^{(K_0)}$. By the Davydov inequality, we can readily show that $\mathbb{Q}^{-1}\mathbb{B}$ is $O(T^{-1})$, which suggests that the order of the asymptotic bias of $\mathrm{vec}\big(\hat{\boldsymbol{\alpha}}^{(K_0)}\big)$ is given by $O(T^{-1})$. Note that this bias term has the same form as the bias term in the linear panel data model with individual fixed effects alone considered by SSP. As in SSP, the bias is asymptotically vanishing in the case where $x_{it}$ only contains strictly exogenous regressors or $N = o(T)$.

For inference, one needs to estimate $\mathbb{Q}_0$, $\Omega_0$, and $\mathbb{Q}^{-1}\mathbb{B}$. When necessary, one can follow SSP to estimate $\mathbb{Q}^{-1}\mathbb{B}$. A consistent estimate of $\mathbb{Q}_0$ is given by $\hat{\mathbb{Q}}$. Similarly, we can consistently estimate $\Omega_0$ by a sample analogue of $\Omega$. For brevity, we do not give the details.
4.3 Asymptotic null distribution of $LM_{2NT}(K_0)$

Despite the presence of time fixed effects, we can show that $LM_{2NT}(K_0)$ shares the same asymptotic null distribution as $LM_{1NT}(K_0)$. This is summarized in the following theorem.

Theorem 4.4 Suppose Assumptions A.1, A.2(i), A.3, A.5, and A.6 hold. Then under $H_0(K_0)$,

$$J_{2NT}(K_0) \equiv \Big(N^{-1/2}LM_{2NT}(K_0) - \hat{\mathbb{B}}_{NT}\Big)\Big/\sqrt{\hat{\mathbb{V}}_{NT}} \xrightarrow{d} N(0,1) \quad \text{as } (N,T)\to\infty.$$
4If $\{x_{it}, t \ge 1\}$ is mean-stationary, then $x_{it}^{\perp} = \tilde{x}_{it}$.
That is, $N^{-1/2}LM_{2NT}(K_0)$ shares the same asymptotic bias $\mathbb{B}_{NT}$ and asymptotic variance $\mathbb{V}_{NT}$ as $N^{-1/2}LM_{1NT}(K_0)$. As before, we can consistently estimate both $\mathbb{B}_{NT}$ and $\mathbb{V}_{NT}$ by $\hat{\mathbb{B}}_{NT}(K_0)$ and $\hat{\mathbb{V}}_{NT}(K_0)$. The major difference is that we now need to use the residuals $\hat{u}_{it}$ defined in (4.6) in place of the earlier residuals throughout. We omit the details for brevity.
5 Monte Carlo simulations
In this section, we conduct Monte Carlo simulations to examine the finite sample performance of
our proposed testing method.
5.1 Data generating processes and implementation
We consider five data generating processes (DGPs). The first four DGPs only have individual fixed effects:

DGP 1: $y_{it} = \beta_{i1}^0 x_{it1} + \beta_{i2}^0 x_{it2} + \mu_i + u_{it}$,
DGPs 2-4: $y_{it} = \beta_{i1}^0 x_{it1} + \beta_{i2}^0 y_{i,t-1} + \mu_i + u_{it}$,

where $x_{itj} = \mu_i + e_{itj}$ for $j = 1, 2$, and $e_{it1}$, $e_{it2}$, $\mu_i$, and $u_{it}$ are IID $N(0,1)$ variables that are mutually independent of each other. DGP 5 has both individual and time fixed effects:

DGP 5: $y_{it} = \beta_{i1}^0 x_{it1} + \beta_{i2}^0 y_{i,t-1} + 0.5\mu_i + 0.5\lambda_t + u_{it}$,

where $x_{it1} = 1 + \mu_i + \lambda_t + e_{it1}$, and $e_{it1}$, $\mu_i$, $\lambda_t$, and $u_{it}$ are mutually independent IID $N(0,1)$ variables.
DGP 1 is a static panel structure, while DGPs 2-5 are dynamic panel structures. In DGPs 1, 2, and 5, $(\beta_{i1}^0, \beta_{i2}^0)$ has a group structure:

$$(\beta_{i1}^0, \beta_{i2}^0) = \begin{cases} (0.5,\ -0.5) & \text{with probability } 0.3, \\ (-0.5,\ 0.5) & \text{with probability } 0.3, \\ (0,\ 0) & \text{with probability } 0.4. \end{cases}$$

Therefore in DGPs 1, 2, and 5, the true number of groups is 3. In DGP 3, we consider a completely heterogeneous (random coefficient) panel structure where $\beta_{i1}^0$ and $\beta_{i2}^0$ follow $N(0.5, 1)$ and $U(-0.5, 0.5)$, respectively. In principle, the true number of groups equals the cross-sectional dimension $N$ in this case. In DGP 4, $(\beta_{i1}^0, \beta_{i2}^0)$ is similar to that in DGPs 1, 2, and 5 except that it has some additional small disturbance. Specifically,

$$(\beta_{i1}^0, \beta_{i2}^0) = \begin{cases} (0.5 + 0.1 v_{i1},\ -0.5 + 0.1 v_{i2}) & \text{with probability } 0.3, \\ (-0.5 + 0.1 v_{i1},\ 0.5 + 0.1 v_{i2}) & \text{with probability } 0.3, \\ (0.1 v_{i1},\ 0.1 v_{i2}) & \text{with probability } 0.4, \end{cases}$$

where $v_{i1}$ and $v_{i2}$ are each IID $N(0,1)$, mutually independent, and independent of $\mu_i$, the $e_{itj}$'s, and $u_{it}$. DGP 4 can be thought of as a small deviation from a group structure.
For each DGP, we first test the null hypotheses $H_0(1)$, $H_0(2)$, and $H_0(3)$ to examine the level and power of our test.

We then use our tests to determine the number of groups as described in Section 2.1. We set $K_{\max} = 8$ and let the nominal size decrease with the time series dimension $T$ to ensure that the type I error decreases with $T$. Specifically, we let the nominal size be $1/T$, which equals 0.10 and 0.025 for $T = 10$ and 40, respectively.5 If all eight hypotheses $H_0(1), \ldots, H_0(8)$ are rejected, then we stop and conclude that the number of groups is greater than 8, and we can adopt the random coefficient model.
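The sequential rule just described can be summarized in a few lines (an illustrative sketch of ours; the p-value delivered by the test at each step is treated as given):

```python
def determine_num_groups(pvalues, T, K_max=8):
    """Test H0(1), H0(2), ... at nominal size 1/T and return the first K
    whose null is not rejected, or K_max + 1 if all K_max hypotheses are
    rejected (in which case one may adopt the random coefficient model)."""
    alpha = 1.0 / T
    for K, p in enumerate(pvalues[:K_max], start=1):
        if p >= alpha:          # fail to reject H0(K): stop here
            return K
    return K_max + 1            # all rejected: more than K_max groups
```

For example, with p-values (0.0002, 0.027, 0.423) and T = 10, the rule rejects H0(1) and H0(2) at size 0.10 and stops at K = 3.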
For the combination of $N$ and $T$, we consider the typical case in empirical applications where $T$ is smaller than or comparable to $N$ and let $(N, T) = (40, 10)$, $(80, 10)$, and $(40, 40)$. The number of replications in the simulations is 1000.

One important step in implementing our testing procedure is to choose the tuning parameter $\lambda$. Following the theory in SSP, we let $\lambda = c \cdot \hat\sigma_Y^2 \cdot T^{-1/3}$, where $\hat\sigma_Y$ is the sample standard deviation of $y_{it}$ and $c$ is some constant. We use three different values of $c$ (0.25, 0.5, and 0.75) to examine the sensitivity of our results to $c$ (and thus $\lambda$).
As a comparison, we also consider the information criterion (IC) proposed in SSP to determine the number of groups. Specifically, we choose $K$ to minimize the following IC:

$$IC(K) = \ln\big(\hat\sigma^2_{(K,\lambda)}\big) + \rho_{NT}\,K, \quad K = 1, \ldots, K_{\max}, \qquad (5.1)$$

where $\hat\sigma^2_{(K,\lambda)} = \frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}\hat{u}_{it}^2$ and the $\hat{u}_{it}$'s are the residuals under the specification of $K$ groups with the tuning parameter $\lambda$; the $\hat{u}_{it}$'s are defined in (2.6), or in (4.6) in the presence of both individual and time fixed effects. $\rho_{NT}$ is a tuning parameter, and we set $\rho_{NT} = \frac{2}{3}(NT)^{-1/2}$ as in SSP. Note that SSP's IC can also be used to choose the tuning parameter $\lambda$ and $K$ jointly by minimizing $IC(K, \lambda)$ over both $K$ and $\lambda$.
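Given the residual variances for each candidate number of groups, the IC selection reduces to a one-line minimization. An illustrative sketch (ours, not the authors' code):

```python
import math

def choose_K_by_IC(sigma2, N, T):
    """Pick K minimizing IC(K) = ln(sigma2[K-1]) + rho * K, where
    rho = (2/3) * (N*T)**(-1/2) and sigma2 is indexed by K = 1, 2, ..."""
    rho = (2.0 / 3.0) * (N * T) ** (-0.5)
    ics = [math.log(s2) + rho * (K + 1) for K, s2 in enumerate(sigma2)]
    return ics.index(min(ics)) + 1
```

The penalty $\rho_{NT} K$ trades off fit against the number of groups: adding a group is chosen only when it reduces $\ln \hat\sigma^2$ by more than $\rho_{NT}$.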
5.2 Simulation results
Table 1 shows the level and power behavior of our test statistics for testing the three null hypotheses $H_0(1)$, $H_0(2)$, and $H_0(3)$. We choose three conventional nominal levels: 0.01, 0.05, and 0.10. For DGPs 1, 2, and 5, the true number of groups is 3. For $H_0(1)$, the rejection frequencies are almost all 1 for all combinations of $N$ and $T$ and all three DGPs at all three nominal levels. For $H_0(2)$, the power of the test increases rapidly with both $N$ and $T$. For example, when $N = 40$ and $T = 40$, the rejection frequencies are all 1 or nearly 1 at all three nominal levels. For $H_0(3)$, we examine the level of our test and find that the rejection frequencies are fairly close to the nominal levels.
5We also try fixing the nominal level at 0.05. The results are similar and available upon request.
For the heterogeneous DGP 3, our test rejects all three hypotheses with frequencies of 1 or nearly 1 at all three nominal levels. This reflects the power of our test against global alternatives. For DGP 4, which represents a small deviation from the group structure, our test shows reasonable power for large $T$, though it rejects $H_0(3)$ with a small frequency when $T$ is small, as expected. Also note that all the testing results are quite robust to the values of $c$ (and thus $\lambda$).
Table 2A shows how our method performs in terms of determining the number of groups. Specifically, the main entries report the proportions of the replications in which the number of groups determined by our method equals the value indicated in the column heading, with the boundary columns collecting the cases of more than 5 groups (DGPs 1, 2, 4, and 5) and of fewer than 5 or more than 8 groups (DGP 3). For DGPs 1, 2, and 5, our method determines the correct number of groups (3) with a large probability. For example, when $N = 40$ and $T = 40$ for DGP 5, the number of groups determined by our testing procedure equals the true number (3) with probabilities ranging from 0.986 to 0.991. For DGP 3, where the true number of groups is $N$, our method determines a large number of groups (greater than 8) with probabilities exceeding 0.99 when $N = 40$ and $T = 40$. DGP 4 represents a small deviation from a 3-group structure. When the sample size is small ($N = 40$ and $T = 10$), with a high probability the number of groups determined by our method is 2 or 3. When the sample size increases to $N = 40$ and $T = 40$, our method determines 4 groups with a probability larger than 0.25. These results are reasonable. Intuitively, if $N$ and $T$ are small, the data can only provide limited information on the underlying DGP, and it is reasonable to apply a small group structure to serve as a good approximation to the true model. As $N$ and $T$ become large, more information on the underlying DGP is revealed, and it is sensible to adopt a larger number of groups to approximate the true model more accurately.
Table 2B shows the performance of SSP's information criterion (IC). Comparing Tables 2A and 2B, we find that for DGPs 1, 2, and 5, the performances of our method and SSP's IC are comparable. However, for DGPs 3 and 4, our method dominates SSP's. Specifically, for DGP 3, where the number of groups is $N$, SSP's IC tends to determine a small number of groups (fewer than 6 even when $N = 40$ and $T = 40$), while our method determines that the number of groups is greater than 8 in most cases. For DGP 4, in large samples ($N = 40$ and $T = 40$), SSP's IC determines 3 groups with a high probability, while our method determines a relatively large number of groups.
To examine the performance of the C-Lasso estimator following our method to determine the number of groups, we compare the following four estimators: (i) the common coefficient estimator, (ii) the random coefficient estimator, (iii) the post C-Lasso estimator with the number of groups determined by our method, and (iv) the infeasible estimator for which the true number of groups and group classification are known. We report the mean squared errors (MSE), calculated as the average value of $\frac{1}{N}\sum_{i=1}^{N}\|\check\beta_i - \beta_i^0\|^2$ over 1000 replications, where $\check\beta_i$ denotes the estimator under consideration. Table 3 presents the results for DGP 1.6 It is clear that our estimator dominates the common coefficient estimator and the random coefficient estimator. However, it performs worse than the infeasible estimator, especially when the sample size is small. This reflects the effect of classification errors in finite samples.
6 Empirical application: income and democracy
The relationship between income and democracy has attracted much attention in empirical re-
search; see, e.g., Lipset (1959), Barro (1999), AJRY, and BM. To the best of our knowledge, none
of the existing studies allows for heterogeneity in the slope coefficients in their model specifica-
tions. As discussed in AJRY, “societies may embark on divergent political-economic development
paths”. Thus, ignoring the heterogeneity in the slope coefficients may result in model misspecification and invalidate subsequent inferences. Hence, it is important to know whether the data support the assumption of homogeneous slope coefficients. If not, then we need to determine the number of heterogeneous groups and classify the countries using statistical methods. We apply our new method to study this important question.
6.1 Data and implementation
We let $y_{it}$ be a measure of democracy for country $i$ in period $t$ and $x_{it}$ be the logarithm of its real GDP per capita. The measure of democracy and real GDP per capita are from the Freedom House and Penn World Tables, respectively.7 Note that the Freedom House measures of democracy ($y_{it}$) are normalized to be between 0 and 1.

We consider the specification with both individual and time fixed effects

$$y_{it} = \beta_{i1} x_{i,t-1} + \beta_{i2} y_{i,t-1} + \mu_i + \lambda_t + u_{it},$$

and assume that $(\beta_{i1}, \beta_{i2})$ has a group structure to account for possible heterogeneity. In a closely related paper, BM consider a group structure in the interactive fixed effects and assume $(\beta_{i1}, \beta_{i2})$ is constant for all $i$'s.

We use a balanced panel dataset similar to that in BM. The number of countries ($N$) is 74. The time index is $t = 1, \ldots, 8$, where each period corresponds to a five-year interval over 1961-2000. For example, $t = 1$ refers to years 1961-1965. Because the lagged $y_{it}$ is used as a regressor, the number of time periods ($T$) is 7, i.e., $t = 2, \ldots, 8$. The choice of countries is determined by data availability. In addition, we exclude the countries whose measures of democracy remain constant
6The results for other DGPs are available upon request.
7All the data are directly from AJRY: http://economics.mit.edu/faculty/acemoglu/data/ajry2008.
Table 1: Empirical rejection frequency
c = 0.25 c = 0.5 c = 0.75
N T 0.01 0.05 0.10 0.01 0.05 0.10 0.01 0.05 0.10
40 10 0.998 0.999 1 0.998 0.999 1 0.998 0.999 1
K = 1 80 10 1 1 1 1 1 1 1 1 1
40 40 1 1 1 1 1 1 1 1 1
40 10 0.126 0.312 0.431 0.121 0.297 0.414 0.115 0.291 0.410
DGP 1 K = 2 80 10 0.277 0.530 0.665 0.277 0.523 0.663 0.262 0.524 0.652
40 40 0.999 1 1 0.999 1 1 0.999 1 1
40 10 0.000 0.014 0.035 0.005 0.025 0.060 0.014 0.057 0.099
K = 3 80 10 0.001 0.013 0.036 0.003 0.029 0.067 0.015 0.059 0.124
40 40 0.003 0.021 0.041 0.008 0.032 0.069 0.016 0.065 0.118
40 10 1 1 1 1 1 1 1 1 1
K = 1 80 10 1 1 1 1 1 1 1 1 1
40 40 1 1 1 1 1 1 1 1 1
40 10 0.059 0.205 0.317 0.057 0.184 0.294 0.048 0.170 0.285
DGP 2 K = 2 80 10 0.164 0.383 0.539 0.143 0.362 0.513 0.136 0.350 0.497
40 40 1 1 1 1 1 1 1 1 1
40 10 0.001 0.005 0.013 0.001 0.009 0.018 0.001 0.017 0.035
K = 3 80 10 0.000 0.003 0.015 0.002 0.008 0.019 0.001 0.022 0.048
40 40 0.001 0.013 0.028 0.005 0.023 0.040 0.011 0.046 0.088
40 10 1 1 1 1 1 1 1 1 1
K = 1 80 10 1 1 1 1 1 1 1 1 1
40 40 1 1 1 1 1 1 1 1 1
40 10 0.998 1 1 0.998 1 1 0.999 1 1
DGP 3 K = 2 80 10 1 1 1 1 1 1 1 1 1
40 40 1 1 1 1 1 1 1 1 1
40 10 0.975 0.996 0.998 0.982 0.998 0.998 0.993 0.999 1
K = 3 80 10 1 1 1 1 1 1 1 1 1
40 40 1 1 1 1 1 1 1 1 1
40 10 1 1 1 1 1 1 1 1 1
K = 1 80 10 1 1 1 1 1 1 1 1 1
40 40 1 1 1 1 1 1 1 1 1
40 10 0.124 0.299 0.436 0.117 0.283 0.420 0.099 0.273 0.407
DGP 4 K = 2 80 10 0.340 0.598 0.717 0.322 0.565 0.700 0.306 0.533 0.687
40 40 1 1 1 1 1 1 1 1 1
40 10 0.001 0.018 0.043 0.004 0.018 0.050 0.004 0.034 0.080
K = 3 80 10 0.010 0.045 0.077 0.015 0.061 0.103 0.024 0.084 0.148
40 40 0.262 0.488 0.624 0.317 0.537 0.668 0.382 0.617 0.742
40 10 1 1 1 1 1 1 1 1 1
K = 1 80 10 1 1 1 1 1 1 1 1 1
40 40 1 1 1 1 1 1 1 1 1
40 10 0.380 0.610 0.748 0.364 0.604 0.737 0.370 0.611 0.735
DGP 5 K = 2 80 10 0.718 0.889 0.938 0.711 0.892 0.939 0.709 0.887 0.936
40 40 1 1 1 1 1 1 1 1 1
40 10 0.010 0.035 0.056 0.010 0.034 0.058 0.010 0.042 0.070
K = 3 80 10 0.008 0.036 0.070 0.012 0.047 0.085 0.025 0.069 0.127
40 40 0.002 0.020 0.052 0.003 0.022 0.058 0.006 0.030 0.066
Table 2A: Frequency of number of groups determined by our method
N T K = 1 K = 2 K = 3 K = 4 K = 5 K > 5
40 10 0 0.570 0.395 0.034 0.001 0
c = 0.25 80 10 0 0.336 0.628 0.034 0.001 0.001
40 40 0 0 0.987 0.012 0.001 0
40 10 0 0.586 0.354 0.049 0.008 0.003
DGP 1 c = 0.5 80 10 0 0.338 0.596 0.051 0.012 0.003
40 40 0 0 0.981 0.015 0.004 0
40 10 0 0.590 0.313 0.068 0.018 0.011
c = 0.75 80 10 0 0.349 0.529 0.076 0.038 0.008
40 40 0 0 0.963 0.029 0.008 0
N T K = 1 K = 2 K = 3 K = 4 K = 5 K > 5
40 10 0 0.683 0.304 0.013 0 0
c = 0.25 80 10 0 0.461 0.524 0.014 0.001 0
40 40 0 0 0.996 0.004 0 0
40 10 0 0.707 0.275 0.016 0.002 0
DGP 2 c = 0.5 80 10 0 0.487 0.494 0.013 0.006 0
40 40 0 0 0.991 0.009 0 0
40 10 0 0.716 0.251 0.025 0.007 0.001
c = 0.75 80 10 0 0.506 0.447 0.036 0.011 0
40 40 0 0 0.976 0.024 0 0
N T K < 5 K = 5 K = 6 K = 7 K = 8 K > 8
40 10 0.024 0.027 0.035 0.048 0.011 0.855
c = 0.25 80 10 0 0.002 0 0.001 0 0.997
40 40 0 0 0 0 0.002 0.998
40 10 0.009 0.012 0.013 0.021 0.009 0.936
DGP 3 c = 0.5 80 10 0 0.001 0 0 0 0.999
40 40 0 0 0 0 0.000 1
40 10 0.004 0.004 0.006 0.009 0.006 0.971
c = 0.75 80 10 0 0 0.001 0 0 0.999
40 40 0 0 0 0 0.001 0.999
N T K = 1 K = 2 K = 3 K = 4 K = 5 K > 5
40 10 0 0.564 0.393 0.038 0.005 0
c = 0.25 80 10 0 0.283 0.640 0.071 0.006 0
40 40 0 0 0.620 0.264 0.110 0.006
40 10 0 0.581 0.369 0.038 0.012 0
DGP 4 c = 0.5 80 10 0 0.301 0.597 0.085 0.016 0.001
40 40 0 0 0.576 0.283 0.132 0.009
40 10 0 0.594 0.327 0.057 0.019 0.003
c = 0.75 80 10 0 0.314 0.539 0.111 0.031 0.005
40 40 0 0 0.489 0.343 0.162 0.006
N T K = 1 K = 2 K = 3 K = 4 K = 5 K > 5
40 10 0 0.252 0.692 0.053 0.002 0.001
c = 0.25 80 10 0 0.062 0.868 0.063 0.006 0.001
40 40 0 0 0.991 0.007 0.002 0
40 10 0 0.263 0.679 0.054 0.004 0
DGP 5 c = 0.5 80 10 0 0.062 0.853 0.081 0.003 0.001
40 40 0 0 0.991 0.006 0.003 0
40 10 0 0.265 0.665 0.062 0.008 0
c = 0.75 80 10 0 0.064 0.811 0.118 0.007 0
40 40 0 0 0.986 0.012 0.002 0
Table 2B: Frequency of number of groups determined by the information criterion
N T K = 1 K = 2 K = 3 K = 4 K = 5 K > 5
40 10 0 0.486 0.500 0.012 0.002 0
c = 0.25 80 10 0 0.128 0.807 0.059 0.006 0
40 40 0 0.001 0.999 0 0 0
40 10 0 0.665 0.327 0.008 0 0
DGP 1 c = 0.5 80 10 0 0.280 0.680 0.037 0.003 0
40 40 0 0.001 0.999 0 0 0
40 10 0 0.795 0.201 0.004 0 0
c = 0.75 80 10 0 0.463 0.519 0.017 0.001 0
40 40 0 0.001 0.998 0.001 0 0
N T K = 1 K = 2 K = 3 K = 4 K = 5 K > 5
40 10 0 0.548 0.437 0.014 0.001 0
c = 0.25 80 10 0 0.174 0.786 0.039 0.001 0
40 40 0 0 1 0 0 0
40 10 0 0.697 0.298 0.004 0.001 0
DGP 2 c = 0.5 80 10 0 0.310 0.660 0.024 0.006 0
40 40 0 0 1 0 0 0
40 10 0 0.809 0.189 0.002 0 0
c = 0.75 80 10 0 0.510 0.458 0.030 0.002 0
40 40 0 0 0.999 0.001 0 0
N T K < 5 K = 5 K = 6 K = 7 K = 8 K > 8
40 10 0.962 0.033 0.005 0 0 0
c = 0.25 80 10 0.952 0.048 0 0 0 0
40 40 0.666 0.210 0.101 0.019 0.004 0
40 10 0.986 0.012 0.002 0 0 0
DGP 3 c = 0.5 80 10 0.979 0.021 0 0 0 0
40 40 0.808 0.128 0.049 0.012 0.003 0
40 10 0.992 0.006 0.001 0.001 0 0
c = 0.75 80 10 0.985 0.014 0 0.001 0 0
40 40 0.848 0.114 0.022 0.013 0.003 0
N T K = 1 K = 2 K = 3 K = 4 K = 5 K > 5
40 10 0 0.512 0.455 0.032 0.001 0
c = 0.25 80 10 0 0.155 0.738 0.096 0.011 0
40 40 0 0.002 0.996 0.002 0 0
40 10 0 0.645 0.337 0.017 0.001 0
DGP 4 c = 0.5 80 10 0 0.298 0.648 0.047 0.007 0
40 40 0 0.001 0.995 0.004 0 0
40 10 0 0.770 0.216 0.013 0.001 0
c = 0.75 80 10 0 0.503 0.453 0.039 0.005 0
40 40 0 0.002 0.985 0.013 0 0
N T K = 1 K = 2 K = 3 K = 4 K = 5 K > 5
40 10 0 0.159 0.785 0.056 0 0
c = 0.25 80 10 0 0.010 0.888 0.096 0.005 0.001
40 40 0 0 1 0 0 0
40 10 0 0.181 0.752 0.066 0.001 0
DGP 5 c = 0.5 80 10 0 0.016 0.826 0.155 0.002 0.001
40 40 0 0 1 0 0 0
40 10 0 0.220 0.699 0.079 0.002 0
c = 0.75 80 10 0 0.031 0.697 0.267 0.004 0.001
40 40 0 0 1 0 0 0
Table 3: Comparison of MSE of various estimators, DGP 1
Common Random Post C-Lasso
N T coefficient coefficient c = 0.25 c = 0.5 c = 0.75 Infeasible
40 10 0.308 0.334 0.175 0.178 0.183 0.017
80 10 0.304 0.334 0.160 0.166 0.173 0.008
40 40 0.302 0.056 0.019 0.024 0.030 0.004
Table 4: Summary statistics ( = 74)
Time period Years y: democracy x: logarithm of real GDP per capita
mean median s.d. mean median s.d.
1 1961 - 1965 0.558 0.540 0.261 7.694 7.657 0.811
2 1966 - 1970 0.367 0.333 0.333 7.832 7.759 0.853
3 1971 - 1975 0.340 0.333 0.311 7.926 7.958 0.886
4 1976 - 1980 0.419 0.333 0.317 8.031 8.022 0.927
5 1981 - 1985 0.455 0.333 0.349 8.055 8.072 0.942
6 1986 - 1990 0.505 0.500 0.340 8.098 8.158 1.009
7 1991 - 1995 0.543 0.500 0.331 8.143 8.233 1.102
8 1996 - 2000 0.604 0.667 0.325 - - -
over the seven periods ($t = 2, \ldots, 8$). A list of the 74 countries can be found in Table 8. Table 4
presents the summary statistics. The implementation details of our method are the same as in
the simulations.
6.2 Testing and estimation results
We first test the hypothesis $H_0(1)$, i.e., we test whether $(\beta_{i1}, \beta_{i2})$ is constant across all $i$. We soundly reject this hypothesis with a p-value less than 0.001. This provides strong evidence that the slope coefficients are not homogeneous. We then test the hypothesis $H_0(2)$ for three values of the tuning parameter $\lambda = c \cdot \hat\sigma_Y^2 \cdot T^{-1/3}$ ($c = 0.25$, 0.5, and 0.75). We reject this hypothesis at the 5% level with p-values of 0.027, 0.027, and 0.020 for $c = 0.25$, 0.5, and 0.75, respectively. We continue to test $H_0(3)$ and find that the p-values are 0.423, 0.204, and 0.143 for $c = 0.25$, 0.5, and 0.75, respectively. Considering that these p-values are above or close to the recommended nominal level $1/T$ (0.143), we stop the testing procedure and conclude that the number of groups is 3. Table 5 presents all the testing results and Table 8 shows the country membership of the three groups. To take into account the issue of multiple testing, we also report Holm's (1979) adjusted p-values in the last row of Table 5,8 which also lend strong support to the conclusion of three groups in the data.
8Suppose that we have tested $m^*$ individual hypotheses, and we order the individual p-values from the smallest to the largest as $p_{(1)} \le p_{(2)} \le \cdots \le p_{(m^*)}$, with their corresponding null hypotheses labeled accordingly as $H_0^{(1)}, H_0^{(2)}, \ldots, H_0^{(m^*)}$. We calculate the step-down Holm adjusted p-values for testing $H_0^{(j)}$ as

$$\text{adjusted-}p_{(j)} = \min\Big\{\max_{l \le j}\big[(m^* - l + 1)\,p_{(l)}\big],\ 1\Big\}.$$
We also consider SSP's IC (see equation (5.1) above) to determine the number of groups. Again, we try the three choices of $\lambda$ as above with $c = 0.25$, 0.5, and 0.75. When $c = 0.25$ and 0.5, the IC determines 3 groups, as shown in Figure 1. When $c = 0.75$, the IC determines 2 groups.9 Given that the majority of our results determine 3 groups, we conclude that $K = 3$.
Table 6 presents the estimation results. We report both C-Lasso estimates and post C-Lasso estimates. The C-Lasso estimates are defined in Section 4.1, and the post C-Lasso estimation is implemented as in (4.5). Both sets of estimates are bias-corrected, and the standard errors are obtained using the asymptotic theory developed in SSP. The C-Lasso and post C-Lasso estimation results are similar across the different values of $c$ = 0.25, 0.5, and 0.75. In the following discussion, we focus on the post C-Lasso estimates with $c = 0.5$. It is clear that the estimated slope coefficients exhibit substantial heterogeneity. For $\beta_1$, the estimates for the three groups are 0.257, -0.160, and -0.235. All of them are significant at the 1% level. It is interesting to note that not only the magnitudes but also the signs of the estimates differ among the three groups. For Group 1, income has a positive effect on democracy, while for Groups 2 and 3, the effects are negative with different magnitudes. For $\beta_2$, the three group estimates are 0.097, -0.237, and 0.412. The first estimate is not significant, while the second and third estimates are significant at the 1% level. We also present the point estimates of the cumulative income effect (CIE), which is defined as $\beta_1/(1 - \beta_2)$. The estimates of the CIE for the three groups are 0.284, -0.129, and -0.400, which imply that a 10% increase in income per capita is associated with changes of 0.0284, -0.0129, and -0.0400 in the measure of democracy, respectively.
Note that if we instead assume that $\beta_1$ and $\beta_2$ are homogeneous across $i$, then the common estimates of $\beta_1$ and $\beta_2$ are -0.017 and 0.273, with t-statistics of -0.450 and 5.950, respectively. This suggests that the effect of income on democracy is not statistically significant. This finding is consistent with that in AJRY. Nevertheless, the insignificance could be due to a model misspecification that ignores slope heterogeneity. Note that the common estimate falls in the middle of the group estimates. Here all group estimates of $\beta_1$ are significant, but with different signs. Therefore the common estimate, which can be thought of as a certain weighted average of the group estimates, could be close to zero and statistically insignificant.
To understand the heterogeneity in the data intuitively, we select three countries (Malaysia, Indonesia, and Nepal) and show their time-series data in Table 7. We simply calculate the correlations between the dependent variable $y_{it}$ and the key explanatory variable $x_{i,t-1}$. Even these simple correlations exhibit substantial heterogeneity, with values of -0.863, 0.069, and 0.658. This suggests that it is implausible for the slope coefficients to be the same for all countries, even without performing any sophisticated analysis.
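These correlations can be reproduced directly from the series in Table 7; for instance, for Malaysia (our code, with the data copied from the table):

```python
import numpy as np

# Democracy (y) and log income per capita (x) for Malaysia, periods 1-8 (Table 7)
y_malaysia = np.array([0.800, 0.833, 0.667, 0.667, 0.667, 0.333, 0.500, 0.333])
x_malaysia = np.array([7.823, 7.967, 8.186, 8.492, 8.603, 8.783, 9.072, 9.202])

# Correlation between y_t and the lagged x_{t-1}, t = 2, ..., 8
r = np.corrcoef(y_malaysia[1:], x_malaysia[:-1])[0, 1]
```

The computed value is approximately -0.863, matching the entry reported in Table 7.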
This application shows that ignoring the heterogeneity in the slope coefficients can mask the
true underlying relationship among economic variables.
9The values of IC for 2 groups and 3 groups are -3.3888 and -3.3862, respectively, which are actually quite close.
Table 5: Test statistics
c = 0.25 c = 0.5 c = 0.75
Null hypothesis K = 1 K = 2 K = 3 K = 1 K = 2 K = 3 K = 1 K = 2 K = 3
Statistics 3.566 1.930 0.195 3.566 1.925 0.829 3.566 2.060 1.065
p-values 0.0002 0.027 0.423 0.0002 0.027 0.204 0.0002 0.020 0.143
Holm adjusted 0.0006 0.054 0.423 0.0006 0.054 0.204 0.0006 0.040 0.143
p-values
Figure 1: Information criteria (IC) for different numbers of groups (IC with c = 0.25, c = 0.5, and c = 0.75)
6.3 Explaining the group pattern
As discussed above, the group estimates of $\beta_1$ and $\beta_2$ show substantial heterogeneity, though most of them are statistically significant at the 5% level. So far, our classification of the groups is completely statistical and does not use any a priori information. One natural question is how to explain the group membership. Apparently, the group membership shown in Table 8 does not match the countries' geographic locations. We further investigate the group pattern using a cross-sectional multinomial logit model.

We let the dependent variable be the group membership, which takes one of three values: 1, 2, or 3.10 The explanatory variables include (i) a measure of constraints on the executive at independence, (ii) the 500-year change in income per capita over 1500-2000, (iii) the 500-year change in democracy over 1500-2000, (iv) independence year/100, (v) initial education level in
10We only report the results for $c = 0.5$. The results for $c = 0.25$ and 0.75 are similar and available upon request.
Table 6: Estimation results
$\beta_1$ $\beta_2$ CIE
estimates s.e. t-stat estimates s.e. t-stat estimates
common estimation -0.017 0.038 -0.450 0.273 0.046 5.950 -0.023
Group 1 0.222 0.051 4.402 0.119 0.058 2.042 0.252
C-Lasso Group 2 -0.221 0.032 -6.920 -0.305 0.055 -5.519 -0.169
c = 0.25 Group 3 -0.250 0.102 -2.443 0.540 0.052 10.423 -0.543
Group 1 0.249 0.046 5.375 0.091 0.058 1.586 0.273
Post Group 2 -0.189 0.033 -5.670 -0.240 0.041 -5.896 -0.152
C-Lasso Group 3 -0.281 0.095 -2.966 0.493 0.049 9.991 -0.554
Group 1 0.273 0.041 6.594 0.122 0.067 1.826 0.311
C-Lasso Group 2 -0.176 0.031 -5.662 -0.324 0.068 -4.767 -0.133
c = 0.5 Group 3 -0.206 0.084 -2.451 0.480 0.065 7.420 -0.396
Group 1 0.257 0.045 5.663 0.097 0.064 1.510 0.284
Post Group 2 -0.160 0.033 -4.795 -0.237 0.047 -5.051 -0.129
C-Lasso Group 3 -0.235 0.078 -3.020 0.412 0.061 6.782 -0.400
Group 1 0.309 0.042 7.458 0.052 0.069 0.751 0.326
C-Lasso Group 2 -0.152 0.032 -4.752 -0.280 0.058 -4.800 -0.119
c = 0.75 Group 3 -0.145 0.077 -1.900 0.502 0.059 8.449 -0.292
Group 1 0.266 0.047 5.714 0.037 0.067 0.552 0.276
Post Group 2 -0.159 0.033 -4.764 -0.237 0.047 -5.002 -0.128
C-Lasso Group 3 -0.191 0.070 -2.721 0.432 0.054 8.061 -0.336
Note: CIE stands for cumulative income effect, which is defined as $\beta_1/(1 - \beta_2)$.
Table 7: Correlation between $y_{it}$ and $x_{i,t-1}$ for selected countries
Malaysia Indonesia Nepal
Time period $y$ $x$ $y$ $x$ $y$ $x$
1 0.800 7.823 0.100 6.798 0.290 6.652
2 0.833 7.967 0.333 6.992 0.167 6.704
3 0.667 8.186 0.333 7.257 0.167 6.785
4 0.667 8.492 0.333 7.547 0.667 6.756
5 0.667 8.603 0.333 7.731 0.667 6.908
6 0.333 8.783 0.167 7.955 0.500 6.991
7 0.500 9.072 0.000 8.201 0.667 7.125
8 0.333 9.202 0.667 8.200 0.667 7.286
Correlation between $y_{it}$ and $x_{i,t-1}$: -0.863 0.069 0.658
1965, (vi) initial income level in 1965, and (vii) initial democracy level in 1965. Acemoglu, Johnson, and Robinson (2005) argue that (i) is an important determinant of democracy. (ii) and (iii) represent long-run changes in income and democracy levels, respectively. (iv) measures how recently a country became independent. (v), (vi), and (vii) are the initial values of the key economic variables. All the data are taken directly from AJRY.
Table 9 provides summary statistics for each of the three groups. The sample averages of (i), the constraints on the executive at independence, are clearly substantially different among the three groups: for this variable, the average value for Group 1 is only about half of that for Group 3. The variable (ii), the 500-year change in income per capita, also differs noticeably among the three groups: Group 2 has the largest sample average of 2.258, while Group 3 has the smallest value of 1.501.
Table 10 presents the multinomial logit regression results for various model specifications. We choose Group 3 as the base group. Compared with Group 3, at the 5% level, a higher value of (i), the constraints on the executive at independence, leads to a reduced likelihood of being in Group 1. On the other hand, a higher value of (ii), the 500-year change in income per capita, leads to a higher likelihood of being in Group 2. For the case $c = 0.5$ reported here, the variable (iv) (independence year/100) is also significant at the 5% level. However, this result is not robust to other choices of $c$. In summary, we find that the constraints on the executive at independence and the long-run economic growth are important determinants of our group pattern.
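A multinomial logit with a designated base group works by fixing that group's coefficient vector at zero and modeling each remaining group's log-odds against it. A minimal self-contained sketch of the implied choice probabilities (our own illustration, not the authors' estimation code):

```python
import numpy as np

def mnl_probs(X, B):
    """Multinomial logit probabilities with the last category as base.

    X: (n, p) covariates; B: (p, J-1) coefficients for the non-base groups
    (the base group's coefficients are normalized to zero).
    """
    eta = np.hstack([X @ B, np.zeros((X.shape[0], 1))])
    e = np.exp(eta - eta.max(axis=1, keepdims=True))  # stabilized softmax
    return e / e.sum(axis=1, keepdims=True)
```

Choosing Group 3 as the base means each reported coefficient in Table 10 measures the effect of a covariate on the log-odds of membership in Group 1 or Group 2 relative to Group 3.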
7 Conclusion
We develop a data-driven testing procedure to determine the number of groups in a latent group
panel structure proposed in SSP. The procedure is based on conducting hypothesis testing on the
model specifications. The test is a residual-based LM-type test and is asymptotically normally
distributed under the null. We apply our new method to study the relationship between income
and democracy and find strong evidence that the slope coefficients are heterogeneous and form
three distinct groups. Further, we find that the constraints on the executive at independence and
long-run economic growth are important determinants of the group pattern.
There are several interesting topics for further research. Here we apply our testing procedure
to determine the number of groups for slope coefficients. The same idea can be applied to other
group structures, such as those considered in BM where fixed effects have a grouped pattern. We
may also extend our methods to non-linear panel data models such as discrete choice models.
Table 8: Classification of countries
Group 1 (“positive $\beta_1$ and insignificant $\beta_2$”) ($N_1 = 21$)
Brazil Cyprus Algeria Ecuador Spain
Ghana Greece Jordan Korea Rep. Malawi
Nepal Panama Philippines Portugal Rwanda
Thailand Taiwan Uruguay Venezuela RB Congo Dem. Rep
Zambia
Group 2 (“negative $\beta_1$ and negative $\beta_2$”) ($N_2 = 16$)
Argentina Congo Rep. Dominican Republic Finland Gabon
Indonesia Japan Luxembourg Mexico Nigeria
Singapore El Salvador Togo Tunisia Turkey
Uganda
Group 3 (“negative $\beta_1$ and positive $\beta_2$”) ($N_3 = 37$)
Burundi Benin Burkina Faso Bolivia Central African Republic
Chile China Cameroon Colombia Egypt Arab Rep
Guinea Guatemala Guyana Honduras India
Iran Israel Jamaica Kenya Sri Lanka
Morocco Madagascar Mali Mauritania Malaysia
Niger Nicaragua Peru Paraguay Romania
Sierra Leone Sweden Syrian Arab Republic Chad Trinidad and Tobago
Tanzania South Africa
Table 9: Summary statistics by group
                                                        Group 1         Group 2         Group 3
Variables    Variable description                       mean    s.d.    mean    s.d.    mean    s.d.
constraint   constraints on the executive
             at independence                            0.148   0.239   0.334   0.298   0.392   0.364
growth       500-year income per capita change          2.004   1.182   2.258   1.086   1.501   0.951
democ        500-year democracy change                  0.803   0.231   0.660   0.295   0.654   0.288
indcent      year of independence/100                  18.913   0.726  18.947   0.710  19.104   0.667
edu65        education level in 1965                    2.778   1.516   2.805   2.170   2.621   1.931
inc65        logarithm of real GDP
             per capita in 1965                         7.715   0.822   7.925   0.968   7.582   0.729
dem65        measure of democracy in 1965               0.507   0.271   0.634   0.245   0.554   0.261
Table 10: Determinants of the group pattern
                 (1)       (2)       (3)       (4)       (5)       (6)       (7)
Group 1
constraints   -3.011**  -3.259*** -3.044*** -5.201*** -5.546*** -5.924*** -5.990***
              (1.200)   (1.183)   (1.132)   (1.526)   (1.682)   (1.813)   (1.887)
growth                   0.599**   0.580*    0.958**   1.157*    1.668**   1.693**
                        (0.294)   (0.350)   (0.405)   (0.605)   (0.723)   (0.744)
democ                              1.651     3.936**   4.947*    4.694*    4.478*
                                  (1.399)   (1.854)   (2.661)   (2.704)   (2.661)
indcent                                      1.772**   2.116**   1.917**   1.973**
                                            (0.765)   (0.893)   (0.942)   (0.957)
edu65                                                 -0.344    -0.136    -0.080
                                                      (0.340)   (0.366)   (0.414)
inc65                                                           -1.406    -1.360
                                                                (0.932)   (0.955)
dem65                                                                     -0.533
                                                                          (2.019)
Group 2
constraints   -0.498    -1.032    -0.989    -1.391    -1.578    -1.991    -2.436
              (0.911)   (0.946)   (0.976)   (1.294)   (1.616)   (1.740)   (1.829)
growth                   0.724**   0.889**   1.016**   1.193**   1.760**   1.915**
                        (0.324)   (0.372)   (0.406)   (0.604)   (0.766)   (0.805)
democ                             -0.977    -0.713    -0.674    -0.730    -0.558
                                  (1.219)   (1.420)   (2.272)   (2.386)   (2.315)
indcent                                      0.345     0.242     0.055     0.102
                                            (0.698)   (0.923)   (0.990)   (0.998)
edu65                                                 -0.286    -0.089    -0.252
                                                      (0.353)   (0.377)   (0.421)
inc65                                                           -1.398    -1.618
                                                                (0.992)   (1.045)
dem65                                                                      1.966
                                                                          (2.167)
Note: *, ** and *** denote significance at the 10%, 5%, and 1% levels, respectively. The results are
based on a multinomial logit regression where Group 3 is taken as the reference group; columns (1)-(7)
are nested specifications that add regressors one at a time. The standard errors are calculated without
taking into account the fact that the dependent variables are estimated.
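The estimator behind Table 10 can be sketched in code. Below is a minimal multinomial logit fit by gradient descent, with the last category as the reference group (mirroring the Group-3 normalization noted under the table). The data, coefficients, and sample size are synthetic placeholders invented for illustration, not the paper's country data.

```python
import numpy as np

# Synthetic data: 3 groups, 2 covariates (purely illustrative).
rng = np.random.default_rng(0)
n, p = 400, 2
X = rng.normal(size=(n, p))
W_true = np.array([[2.0, 0.0], [0.0, 2.0], [-2.0, -2.0]])
y = np.argmax(X @ W_true.T, axis=1)      # group label = index of the largest score

Xb = np.hstack([X, np.ones((n, 1))])      # append an intercept column
B = np.zeros((2, p + 1))                  # coefficients for groups 1 and 2; group 3 is the reference

def probs(B, Xb):
    # Scores for the reference group are normalized to zero.
    scores = np.hstack([Xb @ B.T, np.zeros((Xb.shape[0], 1))])
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

Y = np.eye(3)[y]                          # one-hot labels
for _ in range(800):                      # gradient descent on the average negative log-likelihood
    P = probs(B, Xb)
    grad = (P - Y)[:, :2].T @ Xb / n
    B -= 1.0 * grad

acc = (probs(B, Xb).argmax(axis=1) == y).mean()
```

In Table 10 the analogue of `B` holds one coefficient vector per non-reference group; standard errors there would additionally require the information matrix and, as the table's note warns, ignore estimation error in the group assignments.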
APPENDIX
In this appendix we prove the main results in the paper. The proofs rely on some technical lemmas
given in Appendix B.
A Proof of the main results
Proof of Theorem 3.1. Let = (1 )0 and
= (00)
−1 0 Then by (2.7) and the fact
that 0 = , we have
=0 +0(0 − ) (A.1)
andp1 (0) =
Ã−12
X=1
000 −
!+−12
X=1
(0 − )0 0
0(0 − )
+2−12X=1
00(0 − )
≡ (1 − ) +2 + 23 , say. (A.2)
We complete the proof by showing that under H₀(K₀): (i) D₁NT →d N(0, V₀), (ii) D₂NT = o_P(1),
and (iii) D₃NT = o_P(1), where V₀ = lim₍N,T₎→∞ V_NT. We prove (i)-(iii) in Propositions A.1-A.3 below.
Proposition A.1 D₁NT →d N(0, V₀) under H₀(K₀).
Proof. Recall that = 00 and that denotes the ( )’th element of : =
−1P
=1
P=1 0
¡−1 0
0
¢−1 where = 1 − −1 and 1 = 1 = Let ≡
−1P
=1
P=1 0
Ω−1 Observe that 1 − = 2−12
P=1
P1≤≤ +
2−12P
=1
P1≤≤
¡ −
¢ ≡ 11 + 12 It suffices to show that: (i) 11→
(0 0) and (ii) 12 = (1)
First, we show (i). Using = 1 − −1 we have
11 =2
√
X=1
X1≤≤
X=1
X=1
0Ω−1
=2
√
X=1
X1≤≤
0Ω−1 − 2
2√
X=1
X1≤≤
X=1
0Ω−1
− 2
2√
X=1
X1≤≤
X=1
0Ω−1 +
2
3√
X=1
X1≤≤
X=1
X=1
0Ω−1
=2
√
X=1
X1≤≤
†0Ω−1
† −
2
2√
X=1
X1≤≤
X=1
[ − ()]0Ω−1
− 2
2√
X=1
X1≤≤
X=1
0Ω−1 [ − ()]
+2
3√
X=1
X1≤≤
X=1
X=1
[ − ()]0Ω−1 [ − ()]
+4
3√
X=1
X1≤≤
X=1
X=1
[ − ()]0Ω−1 ()
≡ 111 +112 +113 +114 +115 say.
By Lemmas B.4(ii)-(v), the last four terms above are o_P(1). We are left to show that the first term converges in distribution to N(0, V₀). Observe that
111 =2
√
X=1
X1≤≤
†0Ω−1
† =
X=2
where ≡ 2−1−12P
=1
P−1=1
0 and ≡ Ω−12
† By Assumption A.1(v)
(|F−1) ≡ 2−1−12X=1
−1X=1
0(|F−1) = 0
That is, {z_NT,t, F_t} is a martingale difference sequence (m.d.s.). By the martingale CLT (e.g., Pollard (1984, p. 171)), it suffices to
show that
Z ≡X=2
F−1h||4
i= (1) and
X=2
2 − = (1) (A.3)
where E_{F_{t−1}} denotes expectation conditional on F_{t−1}. Observing that Z_NT ≥ 0, it suffices to show that Z_NT = o_P(1) by showing that E(Z_NT) = o(1) and applying the Markov inequality. By Assumptions A.1(iv)-(v),
(Z) =16
42
X=2
X=1
X=1
X=1
X=1
X1≤≤−1
¡0
0
0
0
¢= 48Z1 + 16Z2
where
Z1 ≡ 1
42
X=2
X=1
X=1 6=
X1≤≤−1
¡0
0
2
¢¡0
0
2
¢ and(A.4)
Z2 ≡ 1
42
X=2
X=1
X1≤≤−1
¡0
0
0
0
4
¢ (A.5)
For the moment we assume that p = 1, so that the p × 1 vectors can be treated as scalars. (The general case
follows from the Slutsky lemma by decomposing the quadratic forms into sums over the elements of the vectors
and treating each elementwise sum as in the scalar case.) To bound the summation in (A.4), we consider three
cases according to the number of distinct time indices: (a) all five time indices are distinct, (b) exactly
four are distinct, and (c) at most three are distinct. We use Z1a, Z1b, and Z1c to denote
the corresponding summations when the time indices are restricted to cases (a), (b), and (c), respectively.
In case (a), applying Davydov inequality (e.g., Hall and Heyde (1980, p. 278)) yields¯¡
2
¢¯ ≤ 8 ( ) (− 1− ( ∨ ))(1+)(2+) (A.6)
where ∨ ≡ max ( ) and 1 ≡ max°°°°4+2 °°22°°4+2 A similar inequality holds
for ( 2) By the repeated use of Cauchy-Schwarz’s and Jensen’s inequalities and As-
sumption A.1(i),
|1| ≤ 1
2
h°°°°24+2 + °°22°°24+2i≤ 1
4
n°°°°28+4 + °°°°28+4 + 2°°22°°24+2o ≤ 1 (A.7)
for some 1 ∞ With this, we can readily show that under Assumption A.1(iii),
1 ≤ 6421
4
X=2
⎧⎨⎩ X1≤≤−1
(− 1− ( ∨ ))(1+)(2+)⎫⎬⎭2
= ¡−1
¢
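The covariance bound used in (A.6) and repeatedly below is the Davydov inequality for strong mixing sequences. For reference, its standard form (generic symbols, as in Hall and Heyde (1980, p. 278)) is

```latex
\bigl|\operatorname{Cov}(X,Y)\bigr| \;\le\; 8\,\alpha(m)^{1/r}\,\|X\|_{p}\,\|Y\|_{q},
\qquad \frac{1}{p}+\frac{1}{q}+\frac{1}{r}=1,
```

where X and Y are measurable with respect to sigma-fields separated by m periods and α(·) is the strong mixing coefficient. Taking p = q = 4 + 2δ gives 1/r = (2 + 2δ)/(4 + 2δ) = (1 + δ)/(2 + δ), the exponent appearing in the bounds above.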
In case (b), we consider two subcases: (b1) exactly one of the remaining time indices equals t − 1; (b2) exactly three time indices are distinct. We use Z1b1 and Z1b2 to denote the corresponding summations when the time indices are restricted
to subcases (b1) and (b2), respectively. In subcase (b1), wlog we assume that = − 1 and apply¯¡ −1−12
¢¯ ≤ 82 (− 1− )(1+)(2+)
for 2 ≡°°°°8+4D °°2−1−12°°(8+4)3D ≤ 2 for some 2 ∞ and (A.6)-(A.7) to obtain
11 ≤ 6412
3
X=2
⎧⎨⎩ 1 X1≤≤−1
(− 1− ( ∨ ))(1+)(2+)⎫⎬⎭⎧⎨⎩ X1≤≤−1
(− 1− )(1+)(2+)
⎫⎬⎭=
¡−2
¢
In subcase (b2), wlog we assume that = and − 1 We consider two subsubcases: (b21) either− 1− ∗ or − ∗, and (b22) − 1− ≤ ∗ and − ≤ ∗ In the first case, we have
¯¡
2
¢¯ ≤ ( 83 (∗)(1+)(2+)
if − 1− ∗84 ( ) (∗)
(1+)(2+)if − ∗
where 3 ≡°°2°°4+2 °°°°4+2 ≤ 3 ∞ and 4 ≡
°°2°°(8+4)3 °°°°8+4≤ 4 ∞ These results, in conjunction with the fact that the total number of terms in the summation in
subcase (b22) is of order ¡2 32∗
¢and Assumption A.1(iii), imply that
12 ≤ h 2 (∗)
(1+)(2+)+ −4−22 32∗
i=
³ 2 (∗)
(1+)(2+)+ −12∗
´= (1) .
Consequently, 1 = (1) In case (c), we have 1 = ¡−1
¢as the number of terms in the summation
is ¡2 3
¢and each term in absolute value has a bounded expectation. It follows that Z1 = (1)
To bound Z2 we consider two cases for the set of indices ≡ − 1: (a) # = 5, and (b)all the other cases. We use 2 and 2 to denote the corresponding summations when the individual
indices are restricted to subcases (a) and (b), respectively. In the first case, letting = max( ) we
have ¯¡4
4
¢¯ ≤ 85 ( ) (− 1− )(2+)
where 5 ≡°° °°2+ °°44°°2+ ≤ 5 ∞ Then2 ≤ 8−1
P=1 ()
(2+)
= ¡−1
¢ In case (b), we have 2 =
¡−1
¢ It follows that Z2 =
¡−1
¢and thus Z = (1)
Consequently the first part of (A.3) follows.
For the second part of (A.3), by Assumptions A.1(iv)-(v) we have
X=2
(2) = 4−2−1X=2
"X=1
−1X=1
0
#2
= 4−2−1X=2
X=1
−1X=1
−1X=1
(2 0
0) =
In addition, we can show by straightforward moment calculations that E(Σᵀₜ₌₂ z²_NT,t)² = V²_NT + o(1).
Thus Var(Σᵀₜ₌₂ z²_NT,t) = o(1), and the second part of (A.3) follows. This completes the proof of (i).
Finally, (ii) follows from Lemma B.4(i).
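To make the martingale CLT step above concrete, here is a small self-contained simulation (the conditional-variance recursion is invented purely for illustration and is unrelated to the paper's data): an m.d.s. sum, normalized by its conditional variance, is approximately standard normal.

```python
import numpy as np

rng = np.random.default_rng(1)
T, R = 300, 3000
S = np.empty(R)
for r in range(R):
    e = rng.standard_normal(T)
    h = np.empty(T)
    h[0] = 1.0
    h[1:] = 0.2 + 0.8 * np.minimum(e[:-1] ** 2, 4.0)  # F_{t-1}-measurable and bounded
    z = np.sqrt(h) * e                                # E[z_t | F_{t-1}] = 0: an m.d.s.
    S[r] = z.sum() / np.sqrt(h.sum())                 # normalize by the conditional variance
# S has sample mean near 0 and sample variance near 1
```

The two conditions checked in (A.3) play exactly the role of the boundedness and variance-stabilization requirements that make this normal approximation work.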
Proposition A.2 D₂NT = o_P(1) under H₀(K₀).
Proof. Noting that 1{i ∈ G_k⁰} = 1{i ∈ Ĝ_k} + 1{i ∈ G_k⁰\Ĝ_k} − 1{i ∈ Ĝ_k\G_k⁰} under H₀(K₀), we have
2 = −120X=1
X∈0
(0 − )0 0
0(0 − )
= −120X=1
X∈
¡0 −
¢0 00
¡0 −
¢+−12
0X=1
X∈0
\
(0 − )0 0
0(0 − )
−−120X=1
X∈\0
¡0 −
¢0 00
¡0 −
¢≡ 21 +22 −23 say.
Let ‖·‖_sp denote the spectral norm. Note that ‖A‖ ≤ √rank(A) ‖A‖_sp and ‖A‖_sp ≤ ‖A‖ for any matrix A.
By these properties, the submultiplicative property of the spectral norm, the fact that a projection matrix has unit spectral norm, and
Assumptions A.1(i) and A.2(ii),
21 ≤ −120X=1
°°0 − °°2 X
∈
k 00k ≤ −12
0X=1
°°0 − °°2 X
∈
kk2
= −12
¡( )−1 + −2
¢ ( ) = −12
¡1 +−1
¢= (1)
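The two matrix-norm inequalities just used (Frobenius norm at most √rank times the spectral norm, and spectral norm at most the Frobenius norm) are easy to verify numerically. A quick sketch with an arbitrary low-rank matrix (dimensions chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(5, 3)) @ rng.normal(size=(3, 7))  # a 5x7 matrix of rank 3
fro = np.linalg.norm(A, "fro")
sp = np.linalg.norm(A, 2)        # spectral norm = largest singular value
r = np.linalg.matrix_rank(A)
# fro <= sqrt(r) * sp  and  sp <= fro
```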
By Assumption A.2(iii), for any ε > 0, both P(D₂NT,2 ≥ ε) and P(D₂NT,3 ≥ ε) are bounded by the probability of a
misclassification event, which tends to zero as (N, T) → ∞. It follows that D₂NT,2 = o_P(1) and D₂NT,3 = o_P(1).
Consequently, D₂NT = o_P(1) under H₀(K₀).
Proposition A.3 D₃NT = o_P(1) under H₀(K₀).
Proof. As in the proof of Proposition A.2, we make the following decomposition:
3 = −120X=1
X∈0
00(0 − )
= −120X=1
X∈
00
¡0 −
¢+−12
0X=1
X∈0
\
00(0 − )
−−120X=1
X∈\0
00
¡0 −
¢≡ 31 +32 −33 say.
Using the same arguments as those used in the study of D₂NT,2 and D₂NT,3, we can show that D₃NT,2 =
o_P(1) and D₃NT,3 = o_P(1). Noting that α̂_k − α_k⁰ = O_P((NT)^(−1/2) + T^(−1)) for k = 1, ..., K₀ under H₀(K₀)
by Assumption A.2(ii), it suffices to prove that D₃NT,1 = o_P(1) by showing that
31 ≡ −12X∈
00 =
³min(( )12 )
´for = 1 0
By the fact that 1 ∈ = 1 ∈ 0 + 1 ∈ \0 − 1 ∈ 0\ and the arguments usedin the study of 22 and 23 we can show that 31 ≡ 31 + (1) where 31 =
−12P
∈0 00 Using
00 =
P=1( − ·) we can decompose 31 as follows
31 = −12X∈0
X=1
−−12−1X∈0
X=1
X=1
=¡1− −1
¢−12
X∈0
X=1
−−12−1X∈0
X1≤≤
−−12−1X∈0
X1≤≤
≡ 311 − 312 − 313 say.
Using Chebyshev inequality, we can readily show that 311 =
¡ 12
¢under Assumptions A.1(i), (iv)
and (v). Let = (1 )0 be an arbitrary × 1 nonrandom vector with kk = 1 By Assumptions
A.1(i), (iv) and (v), and Jensen inequality,
∙³0312
´2¸= −1−2
X∈0
X∈0
X1≤≤
X1≤≤
¡0
0
¢= −1−2
X∈0
X1≤≤
X1≤≤
¡0
0
¢= −1−2
X∈0
X1≤≤
¡0
0
2
¢= ( )
Then 312 =
¡ 12
¢by Chebyshev inequality. Next,
∙³0313
´2¸= −1−2
X∈0
X1≤≤
X1≤≤
¡0
0
¢+−1−2
X∈0
X∈0
6=
X1≤≤
X1≤≤
¡0
¢¡0
¢≡ + say.
Let ≡ To bound we consider two cases: (a) # = 4 and (b) # ≤ 3 and denote the
corresponding summations as and such that = + Apparently, = ( ) For wlog we
consider three subcases: (a1) (a2) (a3) , and denote the
corresponding summations as 1 2 and 3 respectively. (Note = 2(1 + 2 + 3)) In subcase
(a1), we apply Davydov inequality to obtain
|1| ≤ 8−1−2X∈0
X1≤≤
(− )(1+)(2+) ≤ 8
∞X=1
()(1+)(2+)
= ( )
where = kk8+4°°0
0
°°(8+4)3
≤ ∞ by Assumption A.1(i) and Jensen inequality.
Analogously, we can show that 2 = ( ) and 3 = ( ) It follows that = ( ) For we apply
Davydov inequality to obtain
|| ≤ −1−2
⎧⎨⎩X∈0
X1≤≤
¯¡0
¢¯⎫⎬⎭2
≤ −1−2
⎧⎨⎩X∈0
X1≤≤
(− )(3+2)(4+2)
⎫⎬⎭2
= −1−2¡2 2
¢= ()
where =°°0
°°8+4
°°0
°°8+4
≤ ∞ by Assumption A.1(i). Consequently, [0311 (3)]2
= ( + ) and 313 = (12 + 12)
In sum, we have 31 =
¡12 + 12
¢ It follows that 31 = (min(( )12 ))
Proof of Theorem 3.2. By Theorem 3.1 and the Slutsky lemma, it suffices to prove the first two parts
of the theorem.
Step 1. We prove (i): B̂_NT(K₀) = B_NT + o_P(1) under H₀(K₀). Let e_t denote a T × 1 vector with 1 in the t'th position and zeros everywhere else. Then
0 =P
=1
P=1
0 (0)
−1
Using û²ᵢₜ − u²ᵢₜ = (ûᵢₜ − uᵢₜ)² + 2uᵢₜ(ûᵢₜ − uᵢₜ), we decompose B̂_NT(K₀) − B_NT as follows:
(0)− =1√
X=1
X=1
( − )2 +
2√
X=1
X=1
( − )
≡ 1 + 22 say.
Noting that diag() is p.s.d., we have by (A.1) and Cauchy-Schwarz inequality
1 = −12X=1
( − )0diag () ( − )
≤ 2−12X=1
00diag ()0 + 2−12
X=1
( − 0 )0 0
0diag ()0( − 0 )
≡ 211 + 212 say.
We will show that 1 = (1) for = 1 and 2 By the fact thatP
=1 0 = and 0 is idempotent,
we have i0diag() i =tr[i0diag () i ] =
P=1tr
¡00
0¢=tr
¡0
0
¢=tr[0(
00)
−1
× 00] = This, in conjunction with Davydov inequality, implies that
¯11
¯= −2−12
X=1
[0i (i0diag () i ) i
0] = −2−12
X=1
(0i i0)
= −2−12X=1
X=1
¡2¢+ 2−2−12
X=1
X1≤≤
()
= (12−1) +(12−1) = (12−1).
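The fact tr(P) = p for the projection matrix P = x(x′x)⁻¹x′, used repeatedly above, is easy to sanity-check numerically (the dimensions below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
T, p = 50, 4
X = rng.normal(size=(T, p))
P = X @ np.linalg.solve(X.T @ X, X.T)  # projection onto the column space of X
# P is idempotent, tr(P) = p, and its eigenvalues are all 0 or 1
```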
Consequently 11 = (12−1) = (1) by Markov inequality. Using 1 ∈ 0 = 1 ∈ +
1 ∈ 0\− 1 ∈ \0 following similar arguments as those used in the proof of Proposition A.2,and by Assumptions A.1(i) and A.2(ii), we can show that under H0 (0)
12 = −120X=1
X∈
( − 0)0 0
0diag ()0( − 0) + (1)
≤ −120X=1
°° − 0°°2 X
∈
k 00diag ()0k+ (1)
= −12
³( )
−1+ −2
´ () + (1) = (1)
based on the fact that 00 ≤ for any p.s.d. matrix and
X∈
k 00diag ()0k ≤
X∈
k 0diag ()k ≤
X∈
X=1
°° 000
0
°°≤
X∈
X=1
°° 00
°° = X∈
X=1
°°° 0 (
00)
−1 0
°°°≤ max
1≤≤
°°°Ω−1 °°°−1 X∈
X=1
kk4 = () by Lemma B.3(v).
Consequently, we have shown that 1 = (1)
For 2 we first apply (A.1) to decompose it as follows
2 =1√
X=1
( − )0diag()
=1√
X=1
00diag() +1√
X=1
( − 0 )0 0
0diag() ≡ 21 + 22
Observe that
21 =1
2√
X=1
X=1
X=1
00Ω
−1 0
0
=1
2√
X=1
X=1
X=1
00Ω
−1 0
0
+1
2√
X=1
X=1
X=1
00
hΩ−1 −Ω−1
i 00 ≡ 211 + 212 say.
Noting that k00k ≤ k0k = kk we apply Lemma B.3(v) to obtain¯212
¯≤ max
1≤≤
°°°Ω−1 −Ω−1 °°° 1
2√
X=1
¯¯X=1
¯¯
X=1
kk2 kk
= ( )
³12−12
´= (1)
For 211 we have
211 =1
2√
X=1
X=1
X=1
X=1
X=1
0Ω−1
=1
2√
X=1
X=1
X=1
0Ω−1 − 2
3√
X=1
X=1
X=1
X=1
0Ω−1
+1
4√
X=1
X=1
X=1
X=1
X=1
0Ω−1
≡ 211 − 2211 + 221 say.
We further decompose 211 as follows 211 =1
2√
P=1
P=1
0Ω−1
2 +
1
2√
P=1
P1≤≤
0Ω−1 +
1
2√
P=1
P1≤≤
0Ω−1 = 211 (1) + 211 (2) + 211 (3) Ap-
parently, 211 (1) =
¡12−1
¢by Markov inequality. Noting that [211 (2)] = 0 by Davy-
dov inequality we can readily show that
h211 (2)
i2=
1
4
X=1
X1≤≤
X1≤≤
£
0Ω−1
0Ω−1
¤=
¡−1
¢
It follows that 211 (2) =
¡−12
¢ Similarly, 211 (3) =
¡−12
¢ Then 211 =
¡12−1 + −12
¢= (1) Analogously, we can show that 211 = (1) for = Then
we have 211 = (1) and 21 = (1)
For 22 using the same arguments as those used in the proof of Proposition A.2, we can show that
under H0 (0)
22 =1√
0X=1
( − 0)0 X∈
00diag() + (1)
=1√
0X=1
( − 0)0 X∈0
00diag() + (1) ≡ 22 + (1)
Let =1√
P∈0
00diag() Then as in the proof of Proposition A.2 and analysis of 211 (2),
we can show that
=1
√
X∈0
X=1
X=1
0Ω−1 0
0
=1
√
X∈0
X=1
X=1
0Ω−1 0
0 + (1)
=1
√
X∈0
X=1
X=1
X=1
X=1
0Ω−1 + (1) =
³12 + 12
´
It follows that 22 =
¡( )−12 + −1
¢
¡12 + 12
¢= (1) This completes the proof of
(i1).
Step 2. We prove (ii): V̂_NT(K₀) = V_NT + o_P(1). Observe that V̂_NT(K₀) − V_NT = 4Φ₁NT + 4Φ₂NT,
where
1 = −2−1X=2
X=1
⎧⎨⎩"
0
−1X=1
#2−"
0
−1X=1
#2⎫⎬⎭ and
2 = −2−1X=2
X=1
⎧⎨⎩"
0
−1X=1
#2−
"
0
−1X=1
#2⎫⎬⎭
Noting that the second term on the right-hand side has mean zero and variance o(1) by direct moment calculations, it is o_P(1)
by Chebyshev's inequality. Thus we are left to show that the first term is o_P(1). Again, using û² − u² = (û − u)² + 2u(û − u), we have
1 = −2−1X=2
X=1
"
0
−1X=1
− 0
−1X=1
#2
+2−2−1X=2
X=1
"
0
−1X=1
− 0
−1X=1
#
0
−1X=1
≡ 11 + 212
Let 12 ≡ −2−1P
=2
P=1[
0
P−1=1 ]
2 By Cauchy-Schwarz inequality 12 ≤ 1112©12
ª12 It is straightforward to show that 12 = (1) so that we can prove that 1 = (1)
by showing that 11 = (1) Using = ( − ) + and Cauchy-Schwarz inequality,
11 ≤ 3−2−1X=2
X=1
"³ −
´0 −1X=1
#2
+3−2−1X=2
X=1
"
0
−1X=1
³ −
´#2
+3−2−1X=2
X=1
"³ −
´0 −1X=1
³ −
´#2≡ 3111 + 3112 + 3113
We complete the proof of (ii) by showing that (ii1) 111 = (1) (ii2) 112 = (1) and (ii3)
113 = (1)
We first show (ii1) 111 = (1). Using − = ( − ) +(−)+( − ) (−) and Cauchy-Schwarz inequality, we have
111 ≤ 3−2−1X=2
X=1
"( − )
0
−1X=1
#2
+3−2−1X=2
X=1
"
³ −
´0 −1X=1
#2
+3−2−1X=2
X=1
"( − )
³ −
´0 −1X=1
#2≡ 3111 + 3111 + 3111
By Markov and Davydov inequalities we can show that −2−1P
=2
P=1[
0
P−1=1 ]
2 = (1)
By the Boole inequality, the Doob inequality for m.d.s. (e.g., Hall and Heyde (1980, pp. 14-15)), and then the Davydov
inequality, for any ε > 0 we have
Ãmax1≤≤
max2≤≤
°°°°°−12−1X=1
°°°°° 18
!≤
X=1
Ãmax2≤≤
°°°°°−12−1X=1
°°°°° 18
!
≤ 1
48
X=1
°°°°°−1X=1
°°°°°8
= (1)
It follows that
max1≤≤
max2≤≤
°°°°°−1X=1
°°°°° =
³ 1218
´ (A.8)
The same conclusion follows when one replaces by or 1. Let =diag(°°1°°2 °°°°2) By (A.1),
we can readily show that
−1X=1
k − k2 ≤ 2−1X=1
k0k2 + 2−1X=1
°°°0(0 − )
°°°2= (1) + (1) = (1) (A.9)
and
−1X=2
X=1
( − )2°°°°2 = −1
X=1
( − )0 ( − )
≤ 2−1X=1
000 + 2−1
X=1
(0 − )0 0
00(0 − )
= (1) + (1) = (1) (A.10)
By (A.8), (A.10), and Assumption A.3,
111 = −2−1X=2
X=1
( − )2
"0
−1X=1
#2
≤ −2 max1≤≤
max2≤≤
°°°°°−1X=1
°°°°°2(
−1X=2
X=1
( − )2°°°°2)
= −2
³14
´ (1) = (1)
To determine the probability order of 111 and 111 we use the uniform probability order of −We decompose − as follows:
− = Ω−12
" − −1
X=1
#−Ω−12
" − −1
X=1
()
#
= − −1
X=1
−Ω−12 −1X=1
[ − ()]
≡ 1 − 2 − 3 (A.11)
where ≡ Ω−12 −Ω−12 By Lemma B.3 and the fact thatmax1≤≤ max1≤≤ kk = (( )1(8+4)
)
by Boole and Markov inequalities, we have max1≤≤ max1≤≤ k1k = ( × ( )1(8+4)
) Follow-
ing the proof of Lemma B.3(v), we can show that
max1≤≤
°°°°°−1X=1
°°°°° ≤ max1≤≤
°°°°°−1X=1
[ − ()]
°°°°°+ max1≤≤
°°°°°−1X=1
()
°°°°°=
³max( )
1(8+4)log ( ) (log ( ) )12
´+ (1)
= (1) (A.12)
It follows that max1≤≤ max1≤≤ k2k = ( ) Also, max1≤≤ max1≤≤ k3k = ( )
Thus max1≤≤ max1≤≤°°° −
°°° = ( ( )1(8+4)
) In addition, using (A.11) and the above
bounds, we have
−1−1X=2
X=1
2
°°° −
°°°2≤ 3−1−1
X=2
X=1
2 k1k2 + 3−1−1X=2
X=1
2 k2k2 + 3−1−1X=2
X=1
2 k3k2
≤ 3−1−1X=2
X=1
2 k1k2 + (2 ) + (
2 ) = (
2 )
where the last equality follows from the fact that −1−1P=2
P=1
2 k1k2 ≤ max1≤≤ kk2 −1−1P=2
P=1
2 kk2 = (2 ) Then by Assumption A.3,
111 = −2−1X=2
X=1
"
³ −
´0 −1X=1
#2
≤ −1 max1≤≤
max1≤≤
"−1X=1
#2 "−1−1
X=2
X=1
2
°°° −
°°°2 #= −1
³14
´ (
2 ) = (
142 ) = (1)
and
111 = −2−1X=2
X=1
"( − )
³ −
´0 −1X=1
#2
= −2 max1≤≤
max1≤≤
°°° −
°°°2 max1≤≤
max1≤≤
°°°°°−1X=1
°°°°°2
−1X=2
X=1
( − )2
= −2 (2 ( )1(4+2)
)
³14
´ (1) = (
142 ) = (1)
It follows that 111 = (1)
To show (ii2) and (ii3), we find that it is convenient to bound ≡ −12P−1
=1(− ) Using
− = ( − ) + ( − ) + ( − ) ( − ), we have
≡ −12−1X=1
( − ) + −12−1X=1
( − ) + −12−1X=1
( − ) ( − )
≡ 1 + 2 + 3 say.
By (A.11), 1 = −12P−1
=1 (1 − 2 − 3) ≡ 11 − 12 − 13 say. By the remark after
(A.8), (A.12), and Lemma B.3,
max1≤≤
max1≤≤
k11k ≤ max1≤≤
kk max1≤≤
max1≤≤
°°°°°−12−1X=1
°°°°° = ( ) (14) = (1)
max1≤≤
max1≤≤
k12k ≤ max1≤≤
kk max1≤≤
°°°°°−1X=1
°°°°° max1≤≤max1≤≤
°°°°°−12−1X=1
°°°°°= ( ) (1) (
18) = (1)
and
max1≤≤
max1≤≤
k13k ≤ max1≤≤
°°°Ω−12
°°° max1≤≤
°°°°°−1X=1
[ − ()]
°°°°° max1≤≤max1≤≤
°°°°°−12−1X=1
°°°°°= (1) ( ) (
18) = (1)
It follows that max1≤≤ max1≤≤ k1k = (1) Similarly we can show that max1≤≤ max1≤≤k2k = (1) and max1≤≤ max1≤≤ k3k = (1). Hence max1≤≤ max1≤≤ kk = (1) It
follows that
112 = −2−1X=2
X=1
"
0
−1X=1
³ −
´#2
≤½max1≤≤
max1≤≤
kk2¾(
−1−1X=2
X=1
°°0°°2)= (1) (1) = (1)
and
113 = −2−1X=2
X=1
"³ −
´0 −1X=1
³ −
´#2
≤½max1≤≤
max1≤≤
kk2¾(
−1−1X=2
X=1
°°° −
°°°2) = (1) (1) = (1)
as one can readily show that −1−1P
=2
P=1
°°° −
°°°2 = (1). Thus 11 = (1) This
completes the proof of (ii). ∎
Proof of Theorem 3.3. Observe that
√V̂_NT(K₀) J₁NT(K₀) = D₁NT − B̂_NT(K₀) + D₂NT + 2D₃NT,
where D₁NT, D₂NT, and D₃NT are as defined in (A.2). We study the probability order of each term in the
last expression.
Noting that k 00k2 ≤ 2 k 0
k2 + 2 k 00k2 we have −1−2
P=1 k 0
0k2 ≤ 21 + 22where 1 = 2
−1−2P
=1
P=1
P=1
0 and 2 = −1−4
P=1
P=1
P=1
P=1
P=1
0
× By Assumption A.1 and Markov inequality, we can readily show that 1 = (−1) and
2 = (−1) It follows that −1−2
P=1 k 0
0k2 = (−1) Then by Lemma B.3(v),
−12−11 = −1−1X=1
000 ≤ max
1≤≤max
³
´−1−2
X=1
0000
= (1)
¡−1
¢=
¡−1
¢
By (A.1) and Cauchy-Schwarz inequality
−12−1 = −1−1X=1
X=1
2 = −1−1X=1
0diag ()
≤ 2−1−1X=1
00diag ()0
+2−1−1X=1
(0 − )0 0
0diag ()0(0 − )
≡ 21 + 22 say.
By Cauchy-Schwarz inequality, 1 ≤ 2−1−1P
=1 0diag() + 2
−1−1P
=1 00diag()0 ≡
211+212 say. By the fact =00 ≤ [min()]
−1−1000 and Lemma B.3(v), we have
11 ≤ [min()]−1−1−2
X=1
0diag (000)
= [min()]−1−1−2
X=1
X=1
X=1
X=1
0 =
¡−1
¢
as we can readily show that −1−2P
=1
P=1
P=1
P=1
0 = (
−1) based on thefact that = 1 = − −1 and Markov inequality. As in the analysis of 11 we can readily apply
the fact that i0diag() i = to obtain
12 = −1−3X=1
0i i0diag () i i
0 = −1−3
X=1
0i i0
= −1−3X=1
X=1
X=1
=
¡−2
¢
It follows that 1 =
¡−1
¢ By Cauchy-Schwarz inequality, 2 ≤ 2−1−1
P=1(
0−)0 0
diag()
×(0 − ) + 2−1−1
P=1(
0 − )
0 00diag()0(
0 − ) ≡ 221 + 222 say. Noting that
diag(000) ≤diag(
0) we have
21 ≤ [min()]−1−1−2
X=1
(0 − )0 0
diag (000)(
0 − )
≤ [min()]−1−1−2
X=1
(0 − )0 0
diag (0)(
0 − )
= [min()]−1−1−2
X=1
(0 − )0
0
0(
0 − )
≤ −1[min()]−1 max
1≤≤−1
X=1
kk4−1X=1
°°°0 −
°°°2= −1 (1) (1) (1) =
¡−1
¢
Using i0diag() i = we have
22 = −1−3X=1
(0 − )0 0
i i0diag () i i
0(
0 − )
= −1−3X=1
(0 − )0i i
0(
0 − )
≤ −1 max1≤≤
°°·°°2−1 X
=1
°°°0 −
°°°2 =
¡−1
¢
It follows that 2 =
¡−1
¢and −12−1 (0) =
¡−1
¢
By Assumption A.4(ii) and Lemma B.3(v), w.p.a.1
−12−12 = −1−1X=1
³0 −
´0 00
³0 −
´≥ min
³
´−1
0X=1
X∈
°°0 − °°2 ≥ 1
2min () 0
Now, we decompose −12−13 as follows: −12−13 = −1−1
P=1
0(
0−)−−1−2
×P=1
0i i
0(
0 − ) ≡ 31 + 32 say. For the first term, we apply the Cauchy-Schwarz
inequality to obtain
|31| ≤ −12(−1−1
X=1
k0k2)12
×⎧⎨⎩−1
0X=1
X∈
°°0 − °°2⎫⎬⎭
12
= −12 (1) (1) = (1)
where we use the fact that −1−1P
=1 k0k2 = (1) under Assumption A.1. Similarly,
|32| ≤ max1≤≤
°°−10i°° max1≤≤
°°−1 0i°°−1 X
=1
°°°0 −
°°°≤ max
1≤≤
°°−10i°° max1≤≤
°°−1 0i°°⎧⎨⎩−1
0X=1
X∈
°°0 − °°2⎫⎬⎭
12
= ( ) (1) (1) = (1)
It follows that −12−13 = (1)
In sum, we have −12−1q (0)1 (0) ≥ 1
2min () 0
+ (1) w.p.a.1. In addition, we
can show that (0) has a positive probability limit under H1 (0) It follows that under H1 (0)
(1 (0) ≥ )→ 1 as ( )→∞ for any = ¡12
¢ ¥
Proof of Theorem 4.1. (i) Let = (1 )0 = (1 )
0 and = (1 )0 The
minimization problem in (4.4) can be rewritten as
min
(0)
2 (βα) =1
X=1
°°°°°° − +1
X=1
°°°°°°2
+
X=1
Π0
=1 k − k (A.13)
Let β = β⁰ + T^(−1/2)v, where v = (v₁, ..., v_N) is a p × N matrix. We want to show that for any given ε* > 0, there exists a large constant L = L(ε*) such that for sufficiently large N and T we have
(inf
−1
=1kk2=
(0)
2
³β0 + −12v α
´
(0)
2
¡β0α0
¢) ≥ 1− ∗ (A.14)
where α ≡ α(v) is chosen such that (β⁰ + T^(−1/2)v, α) minimizes Q^(K₀)_2NT,λ(β, α) for the given v. This
implies that w.p.a.1 there is a local minimum (β̂, α̂) such that N^(−1) Σᴺᵢ₌₁ ‖β̂ᵢ − βᵢ⁰‖² = O_P(T^(−1)).
Observe that
h(0)
2
³β0 + −12v α
´−
(0)
2
¡β0α0
¢i≥ 1
X=1
⎧⎪⎨⎪⎩°°°°°° −
³0 + −12
´+1
X=1
³0 + −12
´°°°°°°2
− kk2⎫⎪⎬⎪⎭
=1
X=1
⎧⎪⎨⎪⎩°°°°°° − −12
⎛⎝ − 1
X=1
⎞⎠°°°°°°2
− kk2⎫⎪⎬⎪⎭
=1
X=1
°°°°°° − 1
X=1
°°°°°°2
− 2
12
X=1
⎛⎝ − 1
X=1
⎞⎠=
1
X=1
00 − 1
2
X=1
X=1
00 − 2
12
X=1
0
=1
0Ω − 1
0A − 2
12
X=1
0 (A.15)
where Ω ≡diag(Ω1 Ω ) A is an × matrix with a typical × block submatrix ( )−1
0
and we use the fact thatP
=1 = 0 By Lemma B.3(v) and Assumption A.1(ii),
1
0Ω ≥ 1
kk2 min
1≤≤min(Ω) ≥ 1
kk2 2 w.p.a.1. . (A.16)
Define the upper block-triangular matrix
1 =1
⎛⎜⎜⎜⎜⎝ 011 0
12 · · · 01
0 022 · · · 0
2
......
. . ....
0 0 · · · 0
⎞⎟⎟⎟⎟⎠
Note that A = 1 +01 − where = −1Ω By the fact that the eigenvalues of a block upper/lowertriangular matrix are the combined eigenvalues of its diagonal block matrices, Weyl’s inequality, Lemma
B.3(v), and Assumption A.1(ii), we have max (A) ≤ 2max (1)−min() ≤ 2−1max1≤≤ max(Ω) =
(−1) It follows that
1
0A ≤ 1
kk2 max (A) = 1
kk2 (
−1) (A.17)
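Weyl's inequality invoked above bounds the extreme eigenvalues of a sum of symmetric matrices; a quick numerical check on arbitrary symmetric matrices (illustrative only, unrelated to the paper's A and H matrices):

```python
import numpy as np

rng = np.random.default_rng(4)

def rand_sym(n):
    M = rng.normal(size=(n, n))
    return (M + M.T) / 2

A, B = rand_sym(6), rand_sym(6)
lam = lambda M: np.linalg.eigvalsh(M)   # eigenvalues in ascending order
val = lam(A + B)[-1]                    # mu_max(A + B)
upper = lam(A)[-1] + lam(B)[-1]         # mu_max(A) + mu_max(B)
lower = lam(A)[-1] + lam(B)[0]          # mu_max(A) + mu_min(B)
# Weyl: lower <= val <= upper
```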
In addition, by Cauchy-Schwarz and Markov inequalities, we can readily show that
1
12
¯¯X=1
0
¯¯ ≤
(1
X=1
0
)12(1
X=1
00
)12= −12 kk (1) (A.18)
Combining (A.15)-(A.18) yields
h(0)
2
³β0 + −12v α
´−
(0)
2
¡β0α0
¢i ≥ 1
kk2 [2− (1)]−−12 kk (1)
The first term dominates the second term in the last display for sufficiently large That is [(0)
2(β0+
−12v α) − (0)
2
¡β0α0
¢] 0 for sufficiently large Consequently, the minimizer β must satisfy
−1P
=1 ||||2 =
¡−1
¢
(ii) Given (i), we can readily argue the pointwise convergence of as in the proof of (iii) below. Let
(βα) =1
P=1Π
0
=1 k − k and (α) = Π0−1=1
°°° −
°°°+Π0−2=1
°°° −
°°°×°°0 − 0
°°++Π0
=2
°°0 − °° By the repeated use of Minkowski’s inequality (see, e.g., Su, Shi, and Phillips (2016,
SSP hereafter)), we have that as ( )→∞¯Π0
=1
°°° −
°°°−Π0
=1
°°°0 −
°°°¯ ≤ (α)°° − 0
°° and (α) ≤ 0 (α)³1 + 2
°°° − 0
°°°´ where 0 (α) = max1≤≤ max1≤≤≤0−1Π
=1
°°0 − °°0−1−
= max1≤≤0max1≤≤≤0−1
Π=1°°0 −
°°0−1−= (1) and ’s are finite integers. It follows that as ( )→∞
¯ (βα)− (β
0α)¯≤ 0 (α)
1
X=1
°°°°°°+ 20 (α)1
X=1
°°°°°°2≤ 0 (α)
(1
X=1
°°°°°°2)12 + (−1) = (
−12) (A.19)
By (A.19) and the fact that
¡β0α0
¢= 0, we have
0 ≥ (β α)− (βα0) =
¡β0 α
¢−
¡β0α0
¢+ (
−12)
=1
X=1
Π0
=1
°°0 − °°+ (
−12)
=1
Π0
=1
°° − 01°°+ +
0
Π0
=1
°° − 00
°°+ (−12) (A.20)
Then by Assumption A.2(i), we have Π0
=1
°° − 0°° = (
−12) for = 1 0 It follows that
((1) (0))− (01 00) = (
−12) for some suitable permutation ((1) (0)) of (1 0).
(iii) We invoke subdifferential calculus (e.g., Bertsekas (1995, Appendix B.5)). A necessary condition for
β̂ᵢ and α̂ₖ to minimize the objective function in (A.13) is that, for each i = 1, ..., N (resp. k = 1, ..., K₀),
0_{p×1} belongs to the subdifferential of Q^(K₀)_2NT,λ(β, α) with respect to βᵢ (resp. αₖ) evaluated at β̂ and α̂. That is, for each i = 1, ..., N and k = 1, ..., K₀, we have
0×1 = − 2
− 1
0
⎛⎝ − +1
X=1
⎞⎠+
0X=1
Π0
=16=°°° −
°°° (A.21)
where êᵢₖ = (β̂ᵢ − α̂ₖ)/‖β̂ᵢ − α̂ₖ‖ if ‖β̂ᵢ − α̂ₖ‖ ≠ 0, and êᵢₖ is any vector with ‖êᵢₖ‖ ≤ 1 if
‖β̂ᵢ − α̂ₖ‖ = 0. Substituting the expression for the residuals, (A.21) implies that
0 + (A.21) implies that
Ω =1
0 +
1
X=1
0 −
2 ( − 1)0X=1
Π0
=1 6=°°° −
°°° (A.22)
where = − 0 Ω is asymptotically nonsingular uniformly in by Lemma B.3(v) and Assumption
A.1(ii). Using the arguments as used in the proof of Lemma B.3, we can readily show that
max1≤≤
°°°° 1 0
°°°° = max1≤≤
°°°°° 1X=1
¡ − ·
¢( − ·)
°°°°°≤ max
1≤≤
°°°°° 1X=1
°°°°°+ max1≤≤
°°°°° 1X=1
·
°°°°°+ max1≤≤
°°·°° max1≤≤
|·|+ max1≤≤
°°·°°
= ( )
where = max( )1(4+2)
log ( ) (log ( ) )12 By Assumption A.1 and Lemma B.3,
1
°°°°°°X=1
0
°°°°°° ≤ max1≤≤
1
12
°°°
°°°⎧⎨⎩ 1
X=1
°°°
°°°2⎫⎬⎭12⎧⎨⎩ 1
X=1
°°°°°°2⎫⎬⎭12
≡ 1 = (−12)
Let =
2(−1)P0
=1 Π0
=16=°°° −
°°° By the fact that ¯Π0
=1 6=°°° −
°°°−Π0
=1 6=°°0 −
°°¯ ≤0
(α)³1 + 2
°°° − 0
°°°´ for some 0(α) = (1) we have
kk ≤
0X=1
Π0
=1 6=°°° −
°°°≤
0X=1
³Π0
=1 6=°°° −
°°°−Π0
=1 6=°°0 −
°°´+
0X=1
Π0
=1 6=°°0 −
°°≤ 200
(α)°°°°°°+
where = 00(α)+max1≤≤
P0
=1Π0
=1 6=°°0 −
°° = () as 0 ’s can only take 0 values.
It follows that
°°°°°° ≤°°°Ω−1 °°°
sp
⎛⎝°°°° 1 0
°°°°+°°°°°° 1
X=1
0
°°°°°°+ 200(α)
°°°°°°+
⎞⎠≤
°°°Ω−1 °°°sp
µ°°°° 1 0
°°°°+ 1 + 200(α)
°°°°°°+
¶ and
max1≤≤
°°°°°° ≤ 2
1− 2200(α)
µmax1≤≤
°°°° 1 0
°°°°+ 1 +
¶= (2 )
½max1≤≤
°°°° 1 0
°°°°+ 1 +
¾= ( + )
where k·ksp denotes the spectral norm and 2 = [min1≤≤ min(Ω)]−1 = (1) by Lemma B.3(v). ¥
Proof of Theorem 4.2. (i) Fix k ∈ {1, ..., K₀}. By the consistency of β̂ᵢ and α̂ⱼ in Theorem 4.1, we have
β̂ᵢ − α̂ⱼ →p αₖ⁰ − αⱼ⁰ ≠ 0 for all i ∈ Gₖ⁰ and j ≠ k, and ĉᵢₖ ≡ Π_{j=1, j≠k}^{K₀} ‖β̂ᵢ − α̂ⱼ‖ →p
cₖ ≡ Π_{j=1, j≠k}^{K₀} ‖αₖ⁰ − αⱼ⁰‖ > 0 for i ∈ Gₖ⁰. Now, suppose that ‖β̂ᵢ − α̂ₖ‖ ≠ 0 for some
i ∈ Gₖ⁰. Then the first-order condition (with respect to βᵢ) for the minimization problem implies that
0×1 = − 2√ 0
⎛⎝ − +1
X=1
⎞⎠+ √
− 10X=1
Π0
=16=°°° −
°°°= − 2√
0
⎡⎣ −
³ − 0
´+1
X=1
³ − 0
´⎤⎦+ √
− 10X=1
Π0
=1 6=°°° −
°°°= − 2√
0 +
⎛⎝ 2 0 +
− 1°°° −
°°°⎞⎠√ ³ −
´+2
0
√¡ − 0
¢
− 2√
X=1
0
³ − 0
´+
√
− 10X
=1 6=Π
0
=16=°°° −
°°°≡ 1 + 2 + 3 + 4 + 5 say, (A.23)
where =−k−k if
°°° −
°°° 6= 0 and kk ≤ 1 if °°° −
°°° = 0Let = −12 (ln )3++ (ln )
Following SSP and using Assumption A.5(i), we can strengthen
the result in 4.1 to obtain that
µmax1≤≤
°°°°°° ≥
¶= (−1) for any 0 (A.24)
This, in conjunction with the proof of Theorem 4.1(ii), implies that
³°° − 0
°° ≥ −12 (ln )´=
¡−1
¢and
µmax∈0
¯ − 0
¯≥ 02
¶=
¡−1
¢ (A.25)
By (A.24)-(A.25), Lemma B.4(v) and Assumption A.1(x),
µmax∈0
°°°3
°°° (ln )3+
¶=
¡−1
¢and
µmax∈0
°°°5
°°° √
¶=
¡−1
¢
Noting that 1√
°°°P=1
0
³ − 0
´°°° ≤ −12°°°
°°°½ 1
P=1
°°°
°°°2¾12½
P=1
°°° − 0
°°°2¾12= (1) we can readily show that
³max∈0
°°°4
°°° (ln )3+
´=
¡−1
¢ It follows that (Ξ ) =
1− ¡−1
¢ where
Ξ ≡½max∈0
¯ − 0
¯≤ 02
¾∩½max∈0
°°°3
°°° ≤ (ln )3+¾∩½max∈0
°°°4
°°° ≤ (ln )3+
¾∩½max∈0
°°°5
°°° ≤ √
¾
Then conditional on Ξ we have that uniformly in ∈ 0°°°°³ −
´0 ³2 + 3 + 4 + 5
´°°°° ≥°°°°³ −
´02
°°°°− °°°°³ −
´0 ³3 + 4 + 5
´°°°°≥
− 1√
°°° −
°°°− °°° −
°°° 2 (ln )3+ +√
≥√0
°°° −
°°° 4 for sufficiently large ( )
where the last inequality follows because ≥ 02 on Ξ √ À 3 (ln )
3++√ under
Assumption A.5(ii). It follows that for all ∈ 0
() = ³ ∈ | ∈ 0
´=
³−1 = 2 + 3 + 4 + 5
´≤
³¯( − )
01
¯≥¯( − )
0³2 + 3 + 4 + 5
´¯´≤
³°°° −
°°°°°°1
°°° ≥ √0 °°° −
°°° 4Ξ
´+ (Ξ∗ )
≤ ³°°°1
°°° ≥ √04´+ (Ξ∗ ) = ¡−1
¢
where Ξ∗ denotes the complement of Ξ and the convergence follows by Assumptions A.1(i), A.2(iv)
and A.3(i) (see the remark after Assumption A.3), and the fact that (Ξ∗ ) = ¡−1
¢ Consequently,
we can conclude that with probability 1− ¡−1
¢, − must be in a position where k − k is not
differentiable with respect to for any ∈ 0. That is, ³°°° −
°°° = 0 | ∈ 0
´= 1 −
¡−1
¢as
( )→∞ Then the rest of the proof follows SSP. ¥
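The classification property in Theorem 4.2 can be mimicked in a toy calculation: with well-separated group coefficients and accurate individual estimates, nearest-center assignment recovers the true grouping. The centers, noise level, and sample size below are invented for illustration; the C-Lasso itself shrinks each β̂ᵢ exactly onto a group estimate rather than merely close to one.

```python
import numpy as np

rng = np.random.default_rng(5)
alpha = np.array([[0.0, 0.0], [3.0, 3.0], [-3.0, 3.0]])     # true group coefficients
g_true = rng.integers(0, 3, size=120)
beta_hat = alpha[g_true] + 0.1 * rng.normal(size=(120, 2))  # accurate individual estimates
dist2 = ((beta_hat[:, None, :] - alpha[None, :, :]) ** 2).sum(axis=2)
g_hat = dist2.argmin(axis=1)                                # assign to the nearest center
```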
Proof of Theorem 4.3. For the post-Lasso estimates we have the following first order conditions:
1
X∈
bX0⎛⎝ − +
1
0X=1
X∈
⎞⎠ = 0 for = 1 0
wherebX = − 1
P∈
for ∈ It follows that vec(α0) = Q−1 V Let Q Q and V
be analogously defined as Q Q and V for = 1 0 By Theorem 4.2 and using similar
arguments as used in the proof of Proposition A.2, we can readily show that Q = Q+ (( )−12
)
Q = Q + (( )−12
) and V = V + (( )−12
) It follows that
vec (α0) = Q−1V + (( )
−12)
That is, α̂ is asymptotically equivalent to the infeasible oracle estimator α̂⁰ under H₀(K₀). Using
0 − 1
P0
=1
P∈0
0 + we can readily show that
vec¡α0−α00
¢= Q−1U
0 + (( )
−12)
where U0 =¡U001 U
000
¢0and U =
1
P∈0
X0 Noting that = 0 − · + i and
X = − 1
P∈0
for ∈ 0 we have
U0 =1
X∈0
X=1
⎛⎝ − 1
X∈0
⎞⎠ ( − ·)
It is easy to show that U⁰ = O_P((NT)^(−1/2) + T^(−1)) by moment calculations and the Chebyshev inequality
under Assumptions A.1 and A.2(i). Then vec(α̂⁰ − α⁰) = O_P((NT)^(−1/2) + T^(−1)) under Assumption A.6.
(ii) We make the following decomposition:
U0 =1
X∈0
X=1
¡ − ·
¢( − ·)− 1
X∈0
X=1
⎛⎝ 1
X∈0
¡ − ·
¢⎞⎠ ( − ·)
≡ U01 −U02
Under Assumptions A.1 and A.2(i), we can readily show that
U01 =1
X=1
X=1
½
1© ∈ 0
ª −(
()· − ())
¾− B +
³−1−12
´ and
U02 =1
X=1
X=1
£1© ∈ 0
ª− 1¤(()· − ()) +
³−1−12
´
where B is defined in Section 4.2. It follows that
U0 =1
X∈0
X=1
n −(
()· − ())
o − B +
³−1−12
´= U − B +
³−1−12
´
Then by Assumptions A.6(i)-(ii),√vec
¡α0−α00
+Q−1B
¢= Q−1
√U+ (1)
→ (0Q−10 Ω0Q−10 )
¥
Proof of Theorem 4.4. Let · = (·1 · )0 Noting that b = + (0 − )+
1
P=1 ( −0)
=0 − · + i +0(0 − ) +
1
P=10( − 0 )
0 = and 0i = 0 we have
−122 (0) = −12X=1
b000b
= −12X=1
(0 − ·)00
0 (0 − ·)
+−12X=1
(0 − )0 0
0(0 − )
+−12X=1
⎛⎝ 1
X=1
( − 0 )0 0
⎞⎠00
⎛⎝ 1
X=1
( − 0)
⎞⎠+2−12
X=1
(0 − ·)00(
0 − )
+2−12X=1
(0 − ·)00
0
⎛⎝ 1
X=1
( − 0)
⎞⎠+2−12
X=1
(0 − )00
⎛⎝ 1
X=1
( − 0)
⎞⎠≡ 1 +2 +3 + 24 + 25 + 26 say.
We analyze each of the six terms in turn.
(i) For 1 we make the following decomposition
1 = −12X=1
000 +−12
X=1
0·00· − 2−12
X=1
000·
≡ 11 +12 − 213
The first piece is identical to the leading term studied in the proof of Theorem 3.1 and thus, after recentering by the bias term,
converges in distribution to N(0, V₀). Using M₀ = I_T − T^(−1) ι_T ι_T′, we can decompose the second piece as follows
0 we can decompose 12 as follows
12 = −12X=1
0·00·
= −12X=1
0·· + −2−12
X=1
0·i i0
i i0 · − 2−1−12
X=1
0·i i
0 ·
≡ 12 +12 − 212 say.
Note that12 = −1−12P
=1 0·Ω
−1 0
· ≤ 2 12 where 2 ≡ [min1≤≤ min(Ω)]−1
= (1) and 12 = −1−12P
=1 0·
0· In view of the fact that 0
· =P
=1 · =1
P=1
P=1 and by straightforward moment calculations and Assumptions A.1(iv)-(v), we can
show that
¯12
¯= −1−12
X=1
(0·0·) = −1−52
X=1
X=1
X=1
X=1
X=1
(0)
= −1−52X=1
X=1
X=1
¡2
0
¢= (−12)
Then 12 =
¡−12
¢by Markov inequality and 12 =
¡−12
¢. For 12 we have
12 = −3−12X=1
0·i i0Ω
−1 0
i i0 · ≤ 3 12
where 3 ≡ max1≤≤
¡−1i0
¢22 = (1) and 12 = −1−12
P=1
0·i i
0 · Noting
that i0 · =1
P=1
P=1 we have
¯12
¯= −1−12
X=1
(0·i i0 ·) = −1−52
X=1
X=1
X=1
X=1
X=1
()
= −1−52X=1
X=1
X=1
¡2¢= (−12)
Then 12 =
¡−12
¢by Markov inequality and 12 =
¡−12
¢. By the Cauchy-Schwarz
inequality, |12| ≤ 1212 1212 =
¡−12
¢ Consequently we have shown that
12 =
¡−12
¢
Now, we study13Recall that denotes the ( ) element of00: = −1
P=1
P=1
0Ω−1 and ≡ −1
P=1
P=1
0Ω−1 where = 1 = − −1 Following
the proof of Lemma B.4(i), we can show that 13 = −12P
=1
P=1
P=1 · = 13 +
(1) where 13 = −12P
=1
P=1
P=1 · Now, in view of the fact that
= −1 0Ω−1 − −2
X=1
0Ω−1 − −2
X=1
0Ω−1 + −3
X=1
X=1
0Ω−1 (A.26)
we make the following decomposition
13 = −32X=1
X=1
X=1
X=1
= −1−32X=1
X=1
X=1
X=1
0Ω−1 − −2−32
X=1
X=1
X=1
X=1
X=1
0Ω−1
−−2−32X=1
X=1
X=1
X=1
X=1
0Ω−1
+−3−32X=1
X=1
X=1
X=1
X=1
X=1
0Ω−1
≡ 13 − 13 − 13 + 13
By Assumptions A.1(iv)-(v), we can show that ¡213
¢=
¡−1
¢ implying that 13 =
¡−12
¢ For 13 we have
¡13
¢= −2−32
X=1
X=1
X=1
X=1
¡
0Ω−1
¢= −2−32
X=1
X1≤6=6=≤
¡
0Ω−1
¢+(−12)
= 1 + 2 +(−12)
where 1 = −2−32P
=1
P1≤≤
¡
0Ω−1
¢and 2 = −2−32
P=1
P1≤≤
¡
0Ω−1
¢ By Assumptions A.1(i), (iii) and (v) and Davydov inequality
|1| ≤ −2−12X
1≤≤ (− )
(1+)(2+)= (−12)
Similarly, 2 = (−12) and ¡13
¢= (−12) Analogously, we can show that (2
13) =
(−1) It follows that 13 = (−12) By the same token, we can show that 13 = (1)
Noting that 13 = −3−32P
=1
P=1
P=1
P=1
P=1
P=1
0Ω−1 we have
¡213
¢= −6−3
X1≤1234≤
X1≤128≤
¡11223546
013Ω−11 14
037Ω−13 38
¢
Clearly, the expectation in the last summation is 0 if #1 2 3 4 ≥ 3With this observation, we can read-ily apply Assumptions A.1(i)-(v) and Davydov inequality to show that (2
13) = ¡−1 +−2
¢=
(1) Then 13 = (1) Consequently, we have 13 = (1)
In sum, we have shown that 1 −→ (0 0)
(ii) The second term is the same as the corresponding term in the proof of Theorem 3.1 and thus is o_P(1).
(iii) Noting that 00 is a projection matrix with maximum eigenvalue given by 1 and using the
Cauchy-Schwarz inequality, we have
3 ≤ 12
⎛⎝ 1
X=1
( − 0 )0 0
⎞⎠⎛⎝ 1
X=1
( − 0 )
⎞⎠= 12
⎧⎨⎩ 1
X=1
°°° − 0
°°°2⎫⎬⎭⎧⎨⎩ 1
X=1
kk2⎫⎬⎭ = 12 (( )
−1+ −2) (1) = (1)
(iv) For 4 we make the following decomposition:
4 = −12X=1
00(0 − )−−12
X=1
0·0(0 − ) ≡ 41 −42
41 is the same as 3 studied in the proof of Theorem 3.1 and thus 41 = (1) For 42
we can apply the Cauchy-Schwarz inequality to obtain
|42| ≤(
X=1
0·000·
)12(−1
X=1
°°°0 −
°°°2)12 = (12) (( )
−12+−1) = (1)
becauseP
=1 0·0
00· =
P=1tr(0
00·0·) ≤
P=1tr(
00·0·) ≤
P=1
0·
0· and
X=1
(0·0·) =
1
2
X=1
X=1
X=1
X=1
X=1
(0) =
1
2
X=1
X=1
X=1
¡2
0
¢= ( )
Hence 4 = (1)
(v) For 5 we make the following decomposition:
5 = −12X=1
000
⎛⎝ 1
X=1
( − 0 )
⎞⎠−−12X=1
0·00
⎛⎝ 1
X=1
( − 0)
⎞⎠≡ 51 −52
Using 0 = − 1i i
0 we can decompose 51 as follows
51 = −32X=1
X=1
000( − 0 )
=
0X=1
−32X∈
X=1
000( − 0) =
0X=1
51
¡ − 0
¢
where51 = −32P
∈
P=1
00
0 Let = min(( )12
) To show that 51 =
(1), it suffices to prove that 51 = ( ) for = 1, ..., 0, as − 0 = (( )−12
+ −1). Following the proofs of Propositions A.3 and A.1, we can show that
51 = −32X∈0
X=1
000 + (1)
= −32X∈0
X=1
X=1
X=1
+ (1) = 51 + ( )
where 51 = −32P
∈0
P=1
P=1
P=1 Now, using (A.26) we make the following
decomposition
51 = −1−32X∈0
X=1
X=1
X=1
0Ω−1 − −2−32
X∈0
X=1
X=1
X=1
X=1
0Ω−1
−−2−32X∈0
X=1
X=1
X=1
X=1
0Ω−1
+−3−32X∈0
X=1
X=1
X=1
X=1
X=1
0Ω−1
≡ 51 (1)− 51 (2)− 51 (3) + 51 (4)
As in the analysis of 13, we can show that 51 () = ( ) for = 1, ..., 4. For example, we can
easily show that 51 (1) has the same probability order as −1−12
P=1
P=1
P=1
0Ω−1
with =1
P∈0
() and the squared Frobenius norm of the last term has expectation given by
−2−1X=1
X=1
X=1
X=1
X=1
X=1
¡
0Ω−1
0Ω−1
¢0
= −2−1X=1
X=1
X=1
X=1
X=1
¡
0Ω−1
0Ω−1
¢0
+−2−1X=1
X=1 6=
X=1
X=1
X=1
X=1
¡
0Ω−1
¢¡
0Ω−1
¢0
= ( ) + ()
by repeated use of the Davydov inequality. It follows that 51 (1) =
¡12 + 12
¢= ( )
Then we have 51 = ( )
For 52 we apply 0 = − 1i i
0 to make the following decomposition:
52 = −12X=1
0· (00)
−1 00
⎛⎝ 1
X=1
( − 0 )
⎞⎠−−1−12
X=1
0·i i0 (
00)
−1 00
⎛⎝ 1
X=1
( − 0 )
⎞⎠ = 52 −52
By the Cauchy-Schwarz inequality, we have52 ≤ 52 (1)12 52 (2)12 where52 (1) =
−1P
=1 0·
0· and
52 (2) = −1X=1
⎛⎝ 1
X=1
( − 0 )
⎞⎠0
0 (00)
−2 00
⎛⎝ 1
X=1
( − 0 )
⎞⎠12
By the Markov inequality, we can readily show that 52 (1) = (1). In addition, noting that 00
is a projection matrix and has maximum eigenvalue 1, we have
52 (2) ≤ 2−1
X=1
⎛⎝ 1
X=1
( − 0 )
⎞⎠0
00
⎛⎝ 1
X=1
( − 0)
⎞⎠12
≤ 2
°°°°°° 1X=1
( − 0)0 0
°°°°°°2
≤ 2
1
X=1
°°° − 0
°°°2 1
X=1
kk2
= (1) (( )−1+ −2) (1) =
¡−1 + −1
¢
It follows that 52 =
¡−12 + −12
¢= (1) Similarly, we can show that 52 = (1)
Thus 52 = (1) and 5 = (1)
(vi) By the Cauchy-Schwarz inequality and the results in (ii)-(iii), 6 ≤ 212 3 12 = (1)
Combining the results in (i)-(vi) completes the proof of the theorem. ¥
B Some technical lemmas
Define the $m$th order U-statistic $\mathcal{U}_n = \binom{n}{m}^{-1} \sum_{1 \le i_1 < \cdots < i_m \le n} \varphi(\xi_{i_1},\ldots,\xi_{i_m})$, where $\varphi$ is symmetric in its arguments. Let $F(\cdot)$ denote the distribution function of $\xi_i$. Let $\varphi^{(0)} = \int \cdots \int \varphi(u_1,\ldots,u_m)\, \Pi_{j=1}^{m} dF(u_j)$ and $\varphi^{(c)}(u_1,\ldots,u_c) = \int \cdots \int \varphi(u_1,\ldots,u_c,u_{c+1},\ldots,u_m)\, \Pi_{j=c+1}^{m} dF(u_j)$ for $c = 1,\ldots,m$. Let $h^{(1)}(u_1) = \varphi^{(1)}(u_1) - \varphi^{(0)}$ and $h^{(c)}(u_1,\ldots,u_c) = \varphi^{(c)}(u_1,\ldots,u_c) - \sum_{j=1}^{c-1} \sum_{(c,j)} h^{(j)}(u_{i_1},\ldots,u_{i_j}) - \varphi^{(0)}$ for $c = 2,\ldots,m$, where the sum $\sum_{(c,j)}$ is taken over all subsets $1 \le i_1 < i_2 < \cdots < i_j \le c$ of $\{1,2,\ldots,c\}$. Let $H_n^{(c)} = \binom{n}{c}^{-1} \sum_{1 \le i_1 < \cdots < i_c \le n} h^{(c)}(\xi_{i_1},\ldots,\xi_{i_c})$. Then by Theorem 1 in Lee (1990, p. 26), we have the following Hoeffding decomposition:
$$\mathcal{U}_n = \varphi^{(0)} + \sum_{c=1}^{m} \binom{m}{c} H_n^{(c)}. \qquad \text{(B.1)}$$
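Since (B.1) is an exact algebraic identity rather than an asymptotic approximation, it can be verified numerically. The sketch below, in Python, does so for a second-order kernel $\varphi(x,y)=xy$; the sample values and the constant `mu` (which plays the role of $E\xi$ under $F$) are arbitrary illustrative numbers, and the identity holds for any choice of them.

```python
from itertools import combinations

# Toy check of the Hoeffding decomposition (B.1) for an m = 2 kernel
# phi(x, y) = x * y. Here mu stands in for E[xi] under F; since (B.1)
# is an algebraic identity, the check passes for any choice of mu.
xi = [0.3, 1.2, -0.7, 2.0, 0.5]  # illustrative sample values
mu = 1.5                         # stands in for E[xi] under F
n = len(xi)

phi0 = mu * mu                   # phi^(0) = E[phi(xi_1, xi_2)] = mu^2

def h1(x):
    # h^(1)(x) = phi^(1)(x) - phi^(0), with phi^(1)(x) = x * mu
    return x * mu - phi0

def h2(x, y):
    # h^(2)(x, y) = phi(x, y) - h^(1)(x) - h^(1)(y) - phi^(0)
    return x * y - x * mu - y * mu + phi0

pairs = list(combinations(range(n), 2))
U = sum(xi[i] * xi[j] for i, j in pairs) / len(pairs)      # U-statistic
H1 = sum(h1(x) for x in xi) / n                            # H_n^(1)
H2 = sum(h2(xi[i], xi[j]) for i, j in pairs) / len(pairs)  # H_n^(2)

# (B.1) with m = 2: U_n = phi^(0) + C(2,1) H_n^(1) + C(2,2) H_n^(2)
rhs = phi0 + 2 * H1 + H2
assert abs(U - rhs) < 1e-12
```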
To study the second moment of $H_n^{(c)}$ for $3 \le c \le m$, we need the following lemma.
Lemma B.1 Let $\{\xi_t,\, t \ge 1\}$ be a $d$-dimensional strong mixing process with mixing coefficient $\alpha(\cdot)$ and distribution function $F(\cdot)$. Let the integers $(t_1,\ldots,t_m)$ be such that $1 \le t_1 < t_2 < \cdots < t_m \le n$. Suppose that
$$\max\left\{ \int |\varphi(u_1,\ldots,u_m)|^{1+\delta}\, dF_{t_1,\ldots,t_m}(u_1,\ldots,u_m),\; \int |\varphi(u_1,\ldots,u_m)|^{1+\delta}\, dF_{t_1,\ldots,t_c}(u_1,\ldots,u_c)\, dF_{t_{c+1},\ldots,t_m}(u_{c+1},\ldots,u_m) \right\} \le M_\varphi$$
for some $\delta > 0$, where, e.g., $F_{t_1,\ldots,t_m}(u_1,\ldots,u_m)$ denotes the distribution function of $(\xi_{t_1},\ldots,\xi_{t_m})$. Then
$$\left| \int \varphi\, dF_{t_1,\ldots,t_m} - \int \varphi\, dF_{t_1,\ldots,t_c}\, dF_{t_{c+1},\ldots,t_m} \right| \le 4\, M_\varphi^{1/(1+\delta)}\, [\alpha(t_{c+1}-t_c)]^{\delta/(1+\delta)}.$$
Proof. See Lemma 2.1 in Sun and Chiang (1997).
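For later reference, the Davydov inequality invoked repeatedly in the proofs above can be stated in one standard form: for a strong mixing process with coefficient $\alpha(\cdot)$, if $X$ is measurable with respect to the $\sigma$-field generated by the process up to time $t$ and $Y$ with respect to that from time $t+\tau$ onward, then for $p, q, r > 1$ with $p^{-1}+q^{-1}+r^{-1}=1$,

```latex
|\operatorname{cov}(X, Y)| \;\le\; 8\,[\alpha(\tau)]^{1/r}\,\|X\|_p\,\|Y\|_q,
\qquad \|X\|_p \equiv \left(E|X|^p\right)^{1/p}.
```

Lemma B.1 with $m = 2$ and $c = 1$ is a covariance inequality of exactly this type.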
Lemma B.2 Let ≥ 1 be an -dimensional strong mixing process with mixing coefficient (·) and distribution function (·) Suppose that () =
¡−3(2+)−
¢ If there exists 0 such that
≡ max½Z
| (1 · · · )|2+ Π=1 () ¯¡1
¢¯2+¾ ≤ X=1
()
and −1P
=1
P=1
() = (1) then [H()
]2 =
¡−3
¢for 3 ≤ ≤
Proof. The proof is analogous to that of Lemma A.6 in Su and Chen (2013), who consider conditional
strong mixing processes instead.
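In the sequel, Lemma B.2 enters through the Chebyshev inequality: a second-moment bound of order $n^{-3}$ yields

```latex
H_n^{(c)} = O_p\!\left(n^{-3/2}\right) \ \text{ for } 3 \le c \le m,
\qquad \text{so} \qquad
\sum_{c=3}^{m} \binom{m}{c} H_n^{(c)} = O_p\!\left(n^{-3/2}\right),
```

i.e., the higher-order terms of the Hoeffding decomposition (B.1) are asymptotically negligible relative to the first two projections.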
Lemma B.3 Recall Ω ≡ 00 and Ω ≡ (Ω) Let Ω1 ≡ 0
and Ω1 ≡ (Ω1) Suppose
Assumptions A.1-A.3 hold. Then (i) max(Ω1) ≤ max (Ω1) +
¡−12
¢ (ii) min(Ω1) ≥ min (Ω1)−
¡−12
¢ (iii) max1≤≤
°°°Ω1 −Ω1°°° = ( ) (iv) max1≤≤°°°Ω−11 −Ω−11 °°° = ( ) (v)
max1≤≤°°°Ω −Ω°°° = ( ) and max1≤≤
°°°Ω−1 −Ω−1 °°° = ( ) where ≡ max( )1(4+2)
× log ( ) (log ( ) )12
Proof. The results in (i)-(ii) follow from Lemmas A.1(iv)-(v) in Su and Jin (2012). Su and Chen (2013,
Lemma A.7) prove (iii) for the conditional strong mixing process. The result also holds for strong mixing
processes with a simple application of the Bernstein-type inequality for strong mixing processes (see, e.g.,
Lemma 2.2 in Sun and Chiang (1997)). (iv) follows from (i)-(iii) and the submultiplicative property of the
Frobenius norm.
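The step behind (iv) is the standard resolvent identity; with subscripts as in the lemma statement (hats denoting sample versions),

```latex
\widehat{\Omega}_{1i}^{-1} - \Omega_{1i}^{-1}
  = -\,\widehat{\Omega}_{1i}^{-1}\,\bigl(\widehat{\Omega}_{1i} - \Omega_{1i}\bigr)\,\Omega_{1i}^{-1},
```

so by submultiplicativity the left side is bounded in norm by $\|\widehat{\Omega}_{1i}^{-1}\|\,\|\widehat{\Omega}_{1i}-\Omega_{1i}\|\,\|\Omega_{1i}^{-1}\|$, and (i)-(iii) bound each factor uniformly in $i$.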
Now we show (v). Using 0 = − −1i i0 , we can decompose Ω −Ω as follows:
Ω −Ω = −1 [ 00 − ( 0
0)]
=³Ω1 −Ω1
´− · 0
· +¡· 0
·¢
=³Ω1 −Ω1
´− £· −
¡·¢¤ £
· −¡·¢¤0 − £· −
¡·¢¤¡ 0·¢
− ¡·¢ £· −
¡·¢¤0+£¡· 0
·¢−
¡·¢¡ 0·¢¤
Following the proof of (iii), we can show that max1≤≤°°· −
¡·¢°° = (1 ) where 1 ≡
max( )1(8+4)
log ( ) (log ( ) )12 = ( ) max1≤≤°° ¡·
¢°° = (1) by Assumption A.1(i). Let denote the ( )th element of ¡· 0
·¢−
¡·¢¡ 0·¢for = 1 Then by
the triangle inequality, the Davydov inequality, and Assumption A.2(iii),
|| =1
2
¯¯X=1
X=1
cov ()
¯¯ ≤ 1
2
X=1
|cov ()|+ 1
2
X1≤6=≤
|cov ()|
≤ ¡−1
¢+8
∞X=1
()(3+2)(4+2)
= ¡−1
¢
where ≤ sup≥1 max1≤≤ max1≤≤ kk8+4. Then by the triangle inequality, we have max1≤≤ °°°Ω −Ω°°° = ( ). (vi) follows from (v) and Assumption A.1(ii).
Lemma B.4 Let and be as defined in the proof of Theorem 3.1. Suppose Assumptions A.1-A.3
hold. Then
(i) 1 ≡ −12P
=1
P1≤6=≤
¡ −
¢= (1)
(ii) 2 ≡ −2−12P
=1
P1≤≤
P=1 [ − ()]
0Ω−1 = (1)
(iii) 3 ≡ −2−12P
=1
P1≤≤
P=1
0Ω−1 [ − ()] = (1)
(iv) 4 ≡ −3−12P
=1
P1≤≤
P=1
P=1 [ − ()]
0Ω−1 [ − ()] = (1)
(v) 5 ≡ −3−12P
=1
P1≤≤
P=1
P=1 [ − ()]
0Ω−1 () = (1)
Proof. The proof of (i) is analogous to that of Lemma A.8 in Su and Chen (2013) except that we
replace their Lemmas A.5-A.7 by Lemmas B.1-B.3. To show (ii), letting ≡ [ − ()]0Ω−1
we can decompose 2 as follows
2 =1
2√
X=1
X1≤≤
X=1 6=
+1
2√
X=1
X1≤≤
+1
2√
X=1
X1≤≤
≡ 21 +22 +23 say.
Let = (0)0 0 ( ) = and ( ) = [0 ( ) + 0 ( )
+0 ( )+0 ( )+0 ( )+0 ( )]6 Let ≡Ã
3
!−1P1≤≤
( ) Then 21 =√
P=1 where =
(−1)(−2)2
By Assumption A.1 and Lemma
B.2, (221) =
2
P=1
¡2
¢= 2
¡−3
¢=
¡−1
¢ It follows that 21 =
¡−12
¢.
Noting that D (22) = 0 and (222) =
14
P=1
P1≤≤ (2) =
¡−1
¢
we have 22 =
¡−12
¢ Similarly, 23 =
¡−12
¢ Then (ii) follows. The proofs of (iii)-(v)
are analogous to that of (ii) and thus omitted.
REFERENCES

Acemoglu, D., Johnson, S., Robinson, J., 2005. The rise of Europe: Atlantic trade, institutional change, and economic growth. American Economic Review 95, 546-579.

Acemoglu, D., Johnson, S., Robinson, J., Yared, P., 2008. Income and democracy. American Economic Review 98, 808-842.

Ando, T., Bai, J., 2016. Panel data models with grouped factor structure under unknown group membership. Journal of Applied Econometrics 31, 163-191.

Barro, R. J., 1999. Determinants of democracy. Journal of Political Economy 107, 158-183.

Bertsekas, D. P., 1995. Nonlinear Programming. Athena Scientific, Belmont, MA.

Bester, C. A., Hansen, C. B., 2016. Grouped effects estimators in fixed effects models. Journal of Econometrics 190, 197-208.

Bonhomme, S., Manresa, E., 2015. Grouped patterns of heterogeneity in panel data. Econometrica 83, 1147-1184.

Deb, P., Trivedi, P. K., 2013. Finite mixture for panels with fixed effects. Journal of Econometric Methods 2, 35-51.

Hahn, J., Kuersteiner, G., 2011. Bias reduction for dynamic nonlinear panel models with fixed effects. Econometric Theory 27, 1152-1191.

Hahn, J., Moon, H. R., 2010. Panel data models with finite number of multiple equilibria. Econometric Theory 26, 863-881.

Hall, P., Heyde, C. C., 1980. Martingale Limit Theory and Its Application. Academic Press, New York.

Holm, S., 1979. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6, 65-70.

Hsiao, C., 2014. Analysis of Panel Data. Cambridge University Press, Cambridge.

Hsiao, C., Tahmiscioglu, A. K., 1997. A panel analysis of liquidity constraints and firm investment. Journal of the American Statistical Association 92, 455-465.

Lee, A. J., 1990. U-Statistics: Theory and Practice. Marcel Dekker, New York.

Leeb, H., Pötscher, B. M., 2005. Model selection and inference: facts and fiction. Econometric Theory 21, 21-59.

Leeb, H., Pötscher, B. M., 2008. Sparse estimators and the oracle property, or the return of Hodges' estimator. Journal of Econometrics 142, 201-211.

Leeb, H., Pötscher, B. M., 2009. On the distribution of penalized maximum likelihood estimators: the LASSO, SCAD, and thresholding. Journal of Multivariate Analysis 100, 2065-2082.

Lin, C-C., Ng, S., 2012. Estimation of panel data models with parameter heterogeneity when group membership is unknown. Journal of Econometric Methods 1, 42-55.

Lipset, S. M., 1959. Some social requisites of democracy: economic development and political legitimacy. American Political Science Review 53, 69-105.

Onatski, A., 2009. Testing hypotheses about the number of factors in large factor models. Econometrica 77, 1447-1479.

Pesaran, M. H., Smith, R., Im, K. S., 1996. Dynamic linear models for heterogeneous panels. In: L. Matyas and P. Sevestre (Eds.), The Econometrics of Panel Data: A Handbook of the Theory with Applications, second revised edition, pp. 145-195. Kluwer Academic Publishers, Dordrecht.

Pesaran, M. H., Yamagata, T., 2008. Testing slope homogeneity in large panels. Journal of Econometrics 142, 50-93.

Phillips, P. C. B., Sul, D., 2003. Dynamic panel estimation and homogeneity testing under cross section dependence. Econometrics Journal 6, 217-259.

Pollard, D., 1984. Convergence of Stochastic Processes. Springer-Verlag, New York.

Sarafidis, V., Weber, N., 2015. A partially heterogeneous framework for analyzing panel data. Oxford Bulletin of Economics and Statistics 77, 274-296.

Schneider, U., Pötscher, B. M., 2009. On the distribution of the adaptive LASSO estimator. Journal of Statistical Planning and Inference 139, 2775-2790.

Su, L., Chen, Q., 2013. Testing homogeneity in panel data models with interactive fixed effects. Econometric Theory 29, 1079-1135.

Su, L., Jin, S., 2012. Sieve estimation of panel data models with cross section dependence. Journal of Econometrics 169, 34-47.

Su, L., Shi, Z., Phillips, P. C. B., 2016. Identifying latent structures in panel data. Econometrica, forthcoming.

Sun, S., Chiang, C-Y., 1997. Limiting behavior of the perturbed empirical distribution functions evaluated at U-statistics for strongly mixing sequences of random variables. Journal of Applied Mathematics and Stochastic Analysis 10, 3-20.

Sun, Y., 2005. Estimation and inference in panel structure models. Working paper, Dept. of Economics, UCSD.