
Bias Reduction in Nonlinear and Dynamic Panels in the Presence of Cross-Section Dependence∗

Cavit Pakel†

Department of Economics, Bilkent University

First Version: 1 November 2011. This Version: 6 August 2014

Abstract

Nonlinear and dynamic panel models that contain individual-specific parameters are well-known to suffer from the incidental parameter bias. This bias and methods for correcting it have been studied extensively for panels with time-series dependence. However, a general analysis under cross-section dependence is missing. This paper extends the literature to dependence across both dimensions in large-N large-T panels. Our analysis is based on the integrated likelihood method, which nests many common estimation approaches. We show that the composition of the first-order bias changes only under strong cross-section dependence, but not under weak or cluster-type dependence. Using bias correction techniques, we also propose a novel estimation approach for GARCH-type financial volatility modelling in small samples. Simulation analysis and a forecast exercise suggest that this method achieves success with as little as 150 time-series observations, which is much less than what is required by standard volatility estimation methods. We also use this approach to analyse the volatility characteristics of monthly hedge fund returns.

Keywords: Nonlinear and dynamic panels, incidental parameter bias, integrated likelihood method, composite likelihood method, GARCH, hedge funds.

∗ This paper has previously been circulated under the title “Bias Reduction under Dependence, in a Nonlinear and Dynamic Panel Setting: the Case of GARCH Panels” and is based on two chapters of my DPhil thesis at the University of Oxford. I would like to thank Neil Shephard for his support and guidance. I would also like to express my gratitude to Manuel Arellano, Stéphane Bonhomme, Banu Demir, Shin Kanaya, Kasper Lund-Jensen, Bent Nielsen, Whitney Newey, Andrew Patton, Anders Rahbek, Johannes Ruf, Enrique Sentana, Kevin Sheppard, Michael Streatfield, Martin Weidner and Kemal Yıldız for stimulating discussions. I also thank seminar participants at various conferences and seminars. The hedge fund dataset used in this paper has been consolidated by Michael Streatfield and Sushant Vale. Part of this work has been undertaken during my visit to CEMFI and their unbounded hospitality is greatly acknowledged. All errors are mine.

† Contact address: Department of Economics, Bilkent University, 06800 Ankara, Turkey. Email address: [email protected].


1 Introduction

Estimation of panel data models where the parameter space consists of common and individual-specific parameters is known to suffer from the incidental parameter bias. To illustrate, consider some estimation function f(λ_i, θ; X_iT) which depends on the data X_iT = {x_it : t = 1, ..., T}, the individual-specific parameter λ_i and the common parameter θ, while i and t are the individual and time indices, respectively. Our focus will be on large-N large-T asymptotics.1 Typically, f(·) would be an appropriate log-likelihood function. A well-known example for λ_i would be individual fixed effects. Under cross-section independence, the likelihood estimator of the common parameter is given by

$$\hat{\theta} = \arg\max_{\theta} \sum_{i=1}^{N} f(\hat{\lambda}_i(\theta), \theta; X_{iT}), \qquad (1)$$

where λ̂_i(θ) = arg max_{λ_i} f(λ_i, θ; X_iT). In large-N large-T panels, as T tends to infinity the estimation error in each λ̂_i(θ) vanishes. However, at the same time, as the number of incidental parameters approaches infinity with N, the accumulated error grows. Eventually, this latter effect dominates, leading to a mis-centred likelihood function in (1). This is an instance of the well-known incidental parameter bias (Neyman and Scott (1948); see Lancaster (2000) for a survey). Under general assumptions,

$$E[\hat{\theta} - \theta_0] = \frac{A_{\theta_0}}{T} + o_p\left(\frac{1}{T}\right), \qquad (2)$$

where θ_0 is the true parameter value and the O(1/T) term A_{θ_0}/T is known as the first-order bias term. The remainder is standardly considered negligible and the interest lies in the first-order bias.
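
As a rough, hedged illustration of (1)-(2) (a toy linear dynamic panel of our own choosing, not the general setting of this paper), the sketch below computes the fixed-effects (within) estimator of θ and shows a bias that shrinks at roughly rate 1/T; all names and parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
theta0, N = 0.5, 200

def within_estimator(T):
    """Fixed-effects (LSDV) estimate of theta in y_it = mu_i + theta*y_{i,t-1} + e_it."""
    mu = rng.standard_normal(N)
    y = np.zeros((N, T + 1))
    for t in range(1, T + 1):
        y[:, t] = mu + theta0 * y[:, t - 1] + rng.standard_normal(N)
    y_lag, y_cur = y[:, :-1], y[:, 1:]
    # within transformation: subtracting individual means plays the role of estimating each lambda_i
    y_lag = y_lag - y_lag.mean(axis=1, keepdims=True)
    y_cur = y_cur - y_cur.mean(axis=1, keepdims=True)
    return (y_lag * y_cur).sum() / (y_lag ** 2).sum()

for T in (10, 25, 50, 100):
    bias = np.mean([within_estimator(T) for _ in range(200)]) - theta0
    print(f"T={T:4d}  mean bias of theta-hat: {bias:+.3f}")
```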

In a recent important contribution, Arellano and Bonhomme (2009) analyse the integrated likelihood method as a unifying framework which encompasses some well-known panel estimation approaches, including (1), as special cases. The integrated likelihood function is given by

$$\ell^{I}_{iT}(\theta) = \frac{1}{T} \ln \int_{\lambda_i} L_{iT}(\theta, \lambda_i)\, \pi_i(\lambda_i \mid \theta)\, d\lambda_i,$$

where L_iT(θ, λ_i) = L(θ, λ_i; X_iT) is the likelihood function for individual i and π_i(λ_i|θ) is some weight function. For instance, letting π_i(λ_i|θ) = 1 for λ_i = λ̂_i(θ) and π_i(λ_i|θ) = 0 otherwise gives the fixed-effects likelihood used in (1). Similarly, by choosing π_i(λ_i|θ) appropriately, random effects and Bayesian likelihoods can also be obtained. Then, under cross-section independence, an estimator for θ_0 is given by θ̂_IL = arg max_θ Σ_{i=1}^N ℓ^I_iT(θ).

Arellano and Bonhomme (2009) show that the integrated likelihood estimator exhibits the same pattern as in (2) except, of course, with a different specification for A_{θ_0}. Once an analytical expression of A_{θ_0} for the relevant estimation problem is obtained, it can be used to bias correct θ̂. It is also possible to correct the first-order bias of the likelihood function or the score. For examples see, among others, Hahn and Kuersteiner (2002, 2011), Woutersen (2002), Sartori (2003), Hahn and Newey (2004), Arellano and Hahn (2006), Carro (2007), Arellano and Bonhomme (2009), Bester and Hansen (2009), Fernandez-Val (2009), Dhaene and Jochmans (2011) and Fernandez-Val and Weidner (2013). Arellano and Hahn (2007) and Arellano and Bonhomme (2011) provide detailed surveys.

1 This asymptotic setting has been becoming increasingly more relevant due to the growing availability of panels with large time-series and cross-section dimensions. Examples are firm data (e.g. studies of insider trading activity (Bester and Hansen (2009))) and earnings studies (Carro (2007), Fernandez-Val (2009), Hospido (2012)).
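
To make the integrated likelihood concrete, here is a minimal numerical sketch (an assumed toy model with x_it ~ N(λ_i, θ) and a flat weight π_i(λ|θ) ∝ 1, not the paper's application), which approximates ℓ^I_iT(θ) for one individual on a grid of λ values.

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(1)
T, lam_true, theta_true = 50, 1.0, 2.0            # theta plays the role of a common variance
x = rng.normal(lam_true, np.sqrt(theta_true), T)

def ell_iT(theta, lam):
    """Average per-period Gaussian log-likelihood (1/T) * sum_t log f(x_it; lam, theta)."""
    return np.mean(-0.5 * np.log(2 * np.pi * theta) - (x - lam) ** 2 / (2 * theta))

def integrated_loglik(theta, lam_grid):
    """(1/T) * log of the integral of exp(T * ell_iT) over lambda, via a simple grid approximation."""
    vals = np.array([T * ell_iT(theta, lam) for lam in lam_grid])
    dlam = lam_grid[1] - lam_grid[0]
    return (logsumexp(vals) + np.log(dlam)) / T

lam_grid = np.linspace(x.mean() - 5, x.mean() + 5, 400)
for theta in (1.0, 2.0, 3.0):
    print(theta, round(integrated_loglik(theta, lam_grid), 4))
```

In an actual application the flat weight would be replaced by a bias-reducing prior of the kind discussed below, and the integral would typically be handled by a Laplace approximation rather than a grid.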

Cross-section dependence is generally assumed away in the analytical bias correction literature, except for a few studies focusing on particular models rather than the general likelihood estimation problem.2 When the interest is in microeconomic applications, as has generally been the case so far, this can be considered not a central issue, especially if dependence is of weak type. However, with financial and macroeconomic data, it is reasonable to expect some type of cross-section dependence, in addition to at least weak temporal dependence. Bai (2009), for example, suggests different scenarios for macro, micro and financial data that would cause factor-type dependence. Spatial and network dependence are other possible patterns. Analysing the bias under different types of dependence is an important task because, if the specification of A_{θ_0} changes with dependence, standard bias correction methods will not be appropriate anymore.

This paper extends the literature to panels that are both serially and cross-sectionally dependent. Particularly, we consider a generic likelihood function L_iT(θ, λ_i) and investigate the bias of the integrated likelihood estimator in the presence of weak time-series dependence and several types of cross-section dependence. Our work is most closely related to Arellano and Bonhomme (2009) and Hahn and Kuersteiner (2011). The latter contribution analyses the incidental parameter bias for the estimation problem in (1), allowing for weak dependence across time only. Our main differences are that we focus on the unifying framework of the integrated likelihood method and, more importantly, allow for both time-series and cross-section dependence. As such, we also extend the results of Arellano and Bonhomme (2009) to dependence across both panel dimensions.

Under cross-section dependence, the appropriate joint likelihood function will be different and probably more complicated than that in (1) or Σ_{i=1}^N ℓ^I_iT(θ). In the case of multivariate normality, for example, there will be O(N²) additional parameters from the covariance matrix, posing a formidable estimation task when N is large. As such, large-scale modelling is likely to suffer from the curse of dimensionality. This can be solved by parameter reduction methods such as factor modelling. However, even then computational issues are likely to arise. In volatility modelling, for example, estimation is based on numerical optimisation, which becomes increasingly more time consuming as the cross-section size gets larger. To circumvent such issues we use the composite likelihood method, which has recently been becoming popular in the statistics literature; see Varin, Reid and Firth (2011) for a survey. The idea is to approximate the joint density by combinations of simpler lower dimensional (such as univariate or bivariate) marginal densities. Beyond its computational advantages, this method will also be more robust to misspecification in cases where one can correctly specify lower dimensional marginal densities but not the full joint density (Xu and Reid (2011)).

2 See e.g. Phillips and Sul (2007) and Bai (2009).

We provide the conditions for large-N large-T consistency of the integrated composite likelihood estimator and then analyse the asymptotic behaviour of E[θ̂_IL − θ_0] under three types of cross-section dependence: (i) weak dependence in the form of mixing random fields, (ii) strong dependence and (iii) clustered samples. Mixing random fields are similar to mixing time-series sequences, except that the mixing property applies to the cross-section, as well. This yields the same results as independence. Strong dependence, on the other hand, is understood to be strong enough to imply √T-consistency. This is the most severe type of dependence, where a new type of bias emerges which is of the same order as the incidental parameter bias. Finally, as would be expected, clustering yields a spectrum of results between the two polar cases of strong dependence and weak dependence/independence. Weak time-series dependence is shown not to lead to any changes in the first-order bias. At a more general level, the analysis in this part also contributes to the integrated likelihood and composite likelihood literatures.

In the case of independence across both dimensions, it is well-known that the asymptotic distribution of the estimator in (1) is biased in the sense that

$$\sqrt{NT}\,(\hat{\theta} - \theta_0) \xrightarrow{d} \mathcal{N}(\beta, \Sigma),$$

where β = O(1) as N, T → ∞ and Σ is some asymptotic variance matrix. Our analysis for the integrated composite likelihood estimator reveals that under any convergence rate √(N^ρ T) where 0 < ρ ≤ 1, as long as N tends to infinity at a slower rate than T^{1/ρ}, the asymptotic distribution will be centred at zero. When it comes to √T-consistency, on the other hand, the asymptotic distribution turns out to be unbiased no matter how fast N tends to infinity. Intuitively, the first-order bias becomes a pure time-series small-sample bias. These considerations are, of course, all asymptotic and bias correction would still be required in small samples.

Another contribution of this paper is a novel method for modelling volatility in small samples. The GARCH-type models due to Engle (1982) and Bollerslev (1986) are standardly used in modelling volatility at daily or lower frequencies. However, owing to the persistent nature of financial data, these models require hundreds of observations to fit volatility. Smaller samples suffer from identification issues and small sample bias. One might not always possess such large datasets: data might be scarce because of a recent structural break or simply because it is recorded at low frequency. A prime example of the latter is hedge fund returns, which are almost always recorded at monthly frequency and only since 1994, implying less than 250 time-series observations. This is too small for standard GARCH-type estimators. As a remedy, we propose to estimate volatility using a bias-corrected panel GARCH model. Simulation results reveal that with as little as 150 observations, this approach can fit volatility with little bias. This is a substantial improvement, potentially implying that the GARCH machinery can be brought to bear to deal with datasets which were previously outside its capabilities. The predictive success of the proposed method is illustrated in a forecast exercise using stock market data. Also, a novel analysis of volatility characteristics of monthly hedge fund returns is presented.

A related work is the recent paper by Engle, Pakel, Shephard and Sheppard (2014). We derive the consistency theorem for the integrated composite likelihood estimator by using their consistency result for the standard composite likelihood estimator. Their asymptotic theory, on the other hand, builds upon the theory developed here for the strong dependence case. However, unlike the present study, they are not interested in first-order bias analysis per se, as this does not turn out to be an issue for the data sizes they consider. Rather, their main objective is to develop a computationally feasible method for large-scale multivariate volatility modelling. In contrast, volatility modelling is a secondary contribution here.

The rest of this study is organised as follows: Section 2 introduces the notation and briefly discusses relevant concepts. Key assumptions are listed and discussed in Section 3. The main theoretical results are given in Sections 4 and 5. Section 5 also contains a short simulation analysis for the dynamic autoregressive panel model. The Panel GARCH application is introduced and investigated in a simulation analysis in Section 6. This is followed by two empirical applications in Section 7. Section 8 concludes. The main proofs are given in the Mathematical Appendix, while other proofs and additional discussions can be found in the Supplementary Appendix.

2 Main Concepts and Notation

Let x_it be the data, indexed by individuals and time (i = 1, ..., N and t = 1, ..., T, respectively). The empirical application we will later consider contains a vector parameter of interest and so, we let θ be P-dimensional while λ_i remains a scalar. The corresponding (pseudo) true parameter values are given by θ_0 and λ_i0. Following Arellano and Hahn (2006), x_it can be defined flexibly; for example, for a variable of interest y_it, one can have x_it = y_it and ℓ(θ, λ_i; x_it) = ℓ(θ, λ_i; y_it), or x_it = (y_it, y_i,t−1, ..., y_i,t−q) and ℓ(θ, λ_i; x_it) = ℓ(θ, λ_i; y_it | y_i,t−1, ..., y_i,t−q), where ℓ(·) is the (conditional) log-likelihood function (although ℓ(·) can be any appropriate objective function). For conciseness, henceforth we let ℓ_it(θ, λ_i) = ℓ(θ, λ_i; x_it) and use the following notation:

$$\ell_{iT}(\theta, \lambda_i) = \frac{1}{T}\sum_{t=1}^{T} \ell_{it}(\theta, \lambda_i), \qquad \ell_{NT}(\theta, \lambda) = \frac{1}{N}\sum_{i=1}^{N} \ell_{iT}(\theta, \lambda_i),$$

$$\ell^{\lambda}_{iT}(\theta, \lambda_i) = \frac{\partial \ell_{iT}(\theta, \lambda_i)}{\partial \lambda_i}, \qquad \ell^{\lambda\lambda}_{iT}(\theta, \lambda_i) = \frac{\partial^2 \ell_{iT}(\theta, \lambda_i)}{\partial \lambda_i^2}, \qquad \text{etc.}$$

Hence, λ appearing as a superscript denotes differentiation with respect to λ_i.


It is implicitly assumed that ℓ_it(θ, λ_i) is an appropriate likelihood function, in the sense that it yields fixed-N large-T consistent estimators. For example, in the presence of cross-section independence and weak serial dependence, under standard assumptions,

$$(\hat{\theta}, \hat{\lambda}_1, ..., \hat{\lambda}_N) = \arg\max_{\theta, \lambda_1, ..., \lambda_N} \frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T} \ell_{it}(\theta, \lambda_i) \qquad (3)$$

is fixed-N large-T consistent for (θ_0, λ_10, ..., λ_N0). However, when N → ∞ at the same speed as or faster than T, the accumulated estimation error due to λ̂_1, ..., λ̂_N contaminates estimation of θ, which leads to the incidental parameter bias.

When data exhibit cross-section dependence, the objective function in (3) or Σ_{i=1}^N ℓ^I_iT(θ) is clearly not appropriate anymore. Instead, one has to specify a joint density function which incorporates the cross-section dependence. As explained previously, for panels with large cross-sections, estimation of such models can be difficult both statistically and computationally. The composite likelihood method side-steps these issues by using an approximation to the joint density based on combinations of lower dimensional marginal densities (see Lindsay (1988) and Cox and Reid (2004)). There will of course be some efficiency loss, but the severity of this will depend on the particular application at hand.

The simplest composite likelihood function is given by the equally weighted average of all univariate marginal densities across i. This coincides with the objective function in (3) and is equivalent to neglecting cross-section dependence. Neglecting dependence is of course not a new idea but, nevertheless, it still is a special case of the composite likelihood approach. Using combinations of bivariate or trivariate marginal likelihoods is also a possibility, which will preserve some of the dependence structure. Engle, Pakel, Shephard and Sheppard (2014), for example, use bivariate densities to construct composite likelihood functions and model large covariance matrices for financial data. We will not follow these avenues, given that in our particular application the simple composite likelihood function based on the marginal (integrated) likelihoods works sufficiently well.3
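
As a hedged illustration of the two composite likelihood variants just mentioned (a generic equicorrelated Gaussian toy example, not the paper's GARCH application), the sketch below builds (i) the univariate-marginal composite log-likelihood, which coincides with (3), and (ii) a pairwise version based on adjacent bivariate margins.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

rng = np.random.default_rng(2)
N, T, rho = 5, 200, 0.4
Sigma = rho + (1 - rho) * np.eye(N)                        # equicorrelated cross-section
x = rng.multivariate_normal(np.zeros(N), Sigma, size=T)    # T x N panel

def cl_univariate(sigma2):
    """Composite log-likelihood from univariate margins: ignores cross-section dependence."""
    return sum(norm.logpdf(x[:, i], scale=np.sqrt(sigma2)).sum() for i in range(N))

def cl_pairwise(sigma2, r):
    """Composite log-likelihood from bivariate margins of adjacent pairs (i, i+1)."""
    cov = sigma2 * np.array([[1.0, r], [r, 1.0]])
    return sum(multivariate_normal(mean=np.zeros(2), cov=cov).logpdf(x[:, [i, i + 1]]).sum()
               for i in range(N - 1))

print(cl_univariate(1.0), cl_pairwise(1.0, rho))
```

In the setting of this paper, each univariate margin would itself be an integrated likelihood ℓ^I_iT(θ), and the chosen composite objective would then be maximised over θ.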

We now introduce the several likelihood functions used in the theoretical analysis. The concentrated (or profile) likelihood function is given by

$$\ell_{iT}(\theta, \hat{\lambda}_i(\theta)), \qquad \text{where } \hat{\lambda}_i(\theta) = \arg\max_{\lambda_i} \sum_{t=1}^{T} \ell_{it}(\theta, \lambda_i).$$

The concentrated likelihood estimator of θ_0 is given by θ̂ = arg max_θ Σ_{i=1}^N Σ_{t=1}^T ℓ_it(θ, λ̂_i(θ)). Essentially, this is a theoretical approximation to the estimation problem in (3); see Barndorff-Nielsen and Cox (1994) for an excellent treatment of likelihood estimation.
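
The following sketch (an assumed toy Gaussian model, not from the paper) shows the profiling step numerically: for each candidate θ the inner maximiser λ̂_i(θ) is obtained first, and the resulting concentrated likelihood is then maximised over θ.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(3)
T = 50
x = rng.normal(1.0, np.sqrt(2.0), T)   # one individual; theta = variance, lambda_i = mean

def ell_sum(theta, lam):
    return np.sum(-0.5 * np.log(2 * np.pi * theta) - (x - lam) ** 2 / (2 * theta))

def lam_hat(theta):
    """Inner maximisation over the individual-specific parameter."""
    return minimize_scalar(lambda lam: -ell_sum(theta, lam),
                           bounds=(x.min() - 5, x.max() + 5), method="bounded").x

def concentrated(theta):
    return ell_sum(theta, lam_hat(theta))

theta_hat = minimize_scalar(lambda th: -concentrated(th),
                            bounds=(1e-3, 20.0), method="bounded").x
print("profile MLE of theta:", round(theta_hat, 4), " vs np.var(x):", round(float(np.var(x)), 4))
```

For this toy model the two printed numbers coincide: the profile MLE is the divisor-T sample variance, whose expectation is (T − 1)/T times the true value, which is exactly the kind of O(1/T) distortion that accumulates across i in (1).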

3 It is important to underline that we are content with neglecting dependence (given that a consistent estimator still exists), because estimating the dependence structure is not of interest in this study.


Next, we consider the benchmark target likelihood function, given by

$$\ell_{iT}(\theta, \lambda_{iT}(\theta)), \qquad \text{where } \lambda_{iT}(\theta) = \arg\max_{\lambda_i} \frac{1}{T}\sum_{t=1}^{T} E[\ell_{it}(\theta, \lambda_i)].$$

Note that here and elsewhere, all moments are taken with respect to the underlying density evaluated at (θ_0, λ_10, ..., λ_N0). This is an appropriate benchmark for several reasons. The target likelihood is maximised at θ_0 and the target likelihood estimator of θ_0 is asymptotically efficient. The curve defined by (θ, λ_iT(θ)) is such that, by definition, λ_iT(θ_0) = λ_i0 for all T. In essence, the concentrated likelihood and target likelihood functions are asymptotically equivalent around θ = θ_0. Crucially, the target likelihood function guarantees orthogonality of the score with respect to θ to the plane spanned by the scores with respect to (λ_1, ..., λ_N), and this leads to an unbiased estimator for θ_0. For a detailed discussion, see Severini and Wong (1992). Notice that this is an infeasible benchmark as λ_iT(θ) is based on θ_0 and λ_i0, due to the expectation term, as well as θ.4

Finally, the integrated likelihood function is given by

$$\ell^{I}_{iT}(\theta) = \frac{1}{T} \ln \int_{\lambda_i} \exp\left[T \ell_{iT}(\theta, \lambda_i)\right] \pi_i(\lambda_i \mid \theta)\, d\lambda_i.$$

Our main objective is to analyse, under time-series and different types of cross-section dependence, the first-order bias of the integrated composite likelihood estimator, given by

$$\hat{\theta}_{IL} = \arg\max_{\theta} \sum_{i=1}^{N} \ell^{I}_{iT}(\theta). \qquad (4)$$

The choice of weights/priors, π_i(λ_i|θ), is key to successfully removing the incidental parameter bias.5 Arellano and Bonhomme (2009) provide the analytical specification of the bias-correcting prior, which they call the robust prior, for the case of time-series and cross-section independence. In this study, we extend their results, allowing for dependence across both dimensions. One can of course choose the prior function based on subjective judgement. However, our analysis will be entirely frequentist.

Finally, we introduce some more notation. The operator ∇_{θ^{(k)}} is used to take the kth order total derivative with respect to θ. For example,

$$\nabla_{\theta}\,\ell_{iT}(\theta, \hat{\lambda}_i(\theta)) = \frac{d\,\ell_{iT}(\theta, \hat{\lambda}_i(\theta))}{d\theta}, \qquad \nabla_{\theta^{(2)}}\,\ell_{iT}(\theta, \hat{\lambda}_i(\theta)) = \frac{d^2\,\ell_{iT}(\theta, \hat{\lambda}_i(\theta))}{d\theta\, d\theta'}, \qquad \text{etc.}$$

4 Dependence of the target likelihood estimator of λ_i0 on T is not non-standard. In essence, λ̂_i(θ) also depends on T, although this is never made explicit. Nevertheless, dependence of λ_iT(θ) on T is not suppressed here as it is a less common term than λ̂_i(θ).

5 Importantly, some seminal bias correction approaches developed in the statistics literature, such as the modified profile likelihood method (Barndorff-Nielsen (1983)) and the adjusted profile likelihood method (Cox and Reid (1987)), are known to have integrated likelihood representations (Severini (1999, 2007)). A practical advantage of this method is that it reduces the dimension of the problem by integrating out the incidental parameters. Bias correction for the standard likelihood method still requires one to estimate λ_i, which can lead to computational issues.

Also, V^λλ_iT(θ, λ_i) is used as shorthand for ℓ^λλ_iT(θ, λ_i) − E[ℓ^λλ_iT(θ, λ_i)]. Furthermore, whenever a likelihood function is evaluated at the curve (θ, λ_iT(θ)), the argument will be written in shorthand as (θ). For example, ℓ_it(θ) = ℓ_it(θ, λ_iT(θ)), V^λλ_iT(θ) = V^λλ_iT(θ, λ_iT(θ)) and ℓ_NT(θ) = ℓ_NT(θ, λ_1T(θ), ..., λ_NT(θ)). Finally,

$$S = \nabla_{\theta}\,\ell_{NT}(\theta_0), \qquad H = \nabla_{\theta\theta}\,\ell_{NT}(\theta_0), \qquad \nu = E[H],$$

$$Z_j = E\left[\nabla_{\theta\theta}\,\frac{d\ell_{NT}(\theta)}{d\theta_j}\Big|_{\theta=\theta_0}\right], \qquad \text{and} \qquad M = \begin{pmatrix} S'\nu^{-1}Z_1\nu^{-1}S \\ \vdots \\ S'\nu^{-1}Z_P\nu^{-1}S \end{pmatrix},$$

where j ∈ {1, ..., P}. S is also sometimes referred to as the projected score, which is equal to S = ℓ^θ_NT(θ_0) − {E[ℓ^λλ_NT(θ_0)]}^{-1} E[ℓ^λθ_NT(θ_0)] ℓ^λ_NT(θ_0), where the superscript θ denotes partial differentiation with respect to θ.
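
Purely as a notation check (a sketch with made-up inputs, not part of the formal analysis), the following lines show how the P-vector M is assembled from the projected score S, the Hessian expectation ν and the arrays Z_j defined above.

```python
import numpy as np

rng = np.random.default_rng(4)
P = 3
S = rng.standard_normal(P)                             # projected score, length P
nu = -np.eye(P) - 0.1 * np.ones((P, P))                # stand-in for E[H]; any invertible matrix works here
Z = [rng.standard_normal((P, P)) for _ in range(P)]    # stand-ins for Z_j, j = 1, ..., P

nu_inv = np.linalg.inv(nu)
M = np.array([S @ nu_inv @ Z[j] @ nu_inv @ S for j in range(P)])   # jth entry is S' nu^{-1} Z_j nu^{-1} S
print(M)
```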

3 Assumptions

The analysis is based on a generic likelihood function which is α-mixing across time. This is a commonly used type of weak serial dependence; see Doukhan (1994).

Definition 3.1 (α-mixing) Define the σ-fields G^t_{i,−∞} = σ(x_it, x_i,t−1, ...) and H^∞_{i,t} = σ(x_it, x_i,t+1, ...). The sequence of random vectors (x_it, x_i,t−1, x_i,t−2, ...) is called α-mixing if the α-mixing coefficient

$$\alpha_i(m) = \sup_t\ \sup_{G \in G^t_{i,-\infty},\ H \in H^{\infty}_{i,t+m}} |P(G \cap H) - P(G)P(H)|$$

tends to zero as m → ∞. Moreover, for s ∈ ℝ, if α(m) = O(m^{−s−ε}) for some ε > 0, then α(m) is said to be of size −s.

To represent likelihood derivatives concisely, we use a notation similar to Hahn and Kuersteiner (2011). Let k = (k_1, ..., k_P) be a set of P non-negative integers and let |k| = Σ_{p=1}^P k_p. Then, any derivative and its centred version are represented by

$$Z^{(m,k)}_{it}(\theta,\lambda) = \frac{d^{\,m+|k|}\,\ell_{it}(\theta,\lambda)}{d\lambda^{m}\,d\theta_1^{k_1}\,d\theta_2^{k_2}\cdots d\theta_P^{k_P}} \qquad \text{and} \qquad \tilde{Z}^{(m,k)}_{it}(\theta,\lambda) = Z^{(m,k)}_{it}(\theta,\lambda) - E[Z^{(m,k)}_{it}(\theta,\lambda)],$$

for some selection of indices (m, k). For example, the set of Z^(m,k)_it(θ, λ) for m + |k| ≤ 3 yields all derivatives of the likelihood function up to and including the third order. Choosing m = 0 and/or k with |k| = 0 implies that no derivative with respect to λ and/or θ is taken. For example, Z^(m,k)_it(θ, λ) where m = 0 and |k| = 0 yields ℓ_it(θ, λ). Similarly to before, Z^(m,k)_iT(θ, λ) = T^{-1} Σ_{t=1}^T Z^(m,k)_it(θ, λ) and Z^(m,k)_NT(θ, λ) = N^{-1} Σ_{i=1}^N Z^(m,k)_iT(θ, λ). Also, Z^(m,k)_iT(θ) = Z^(m,k)_iT(θ, λ_iT(θ)) and similarly for other terms.

Assumption 3.1 (i) N, T → ∞ jointly and, for 0 < c < ∞, N/T → c; (ii) ℓ_it(θ, λ) ∈ C^8 for all i, t, λ and θ, where C^c is the class of functions whose derivatives up to and including order c are continuous; (iii) the parameter spaces for θ and λ_i are given by Θ and Λ, which are compact convex subsets of ℝ^P and ℝ, respectively; (iv) ℓ_iT(θ, λ_i) has a unique maximum at λ̂_i(θ) for all i, T and θ; (v) inf_{i,T} |E[ℓ^λλ_iT(θ)]| > 0 for all θ and E[−∇_{θ^{(2)}} ℓ_NT(θ)] is positive definite for all T, N and θ.

Assumption 3.2 For any η > 0, (i)

$$\varepsilon = \inf_{1\leq i\leq N}\Big\{E[\ell^{I}_{iT}(\theta_0)] - \sup_{\{\theta:\,\|\theta-\theta_0\|>\eta\}} E[\ell^{I}_{iT}(\theta)]\Big\} > 0,$$

and (ii)

$$\varepsilon = \inf_{\theta\in\Theta}\inf_{1\leq i\leq N}\Big\{E[\ell_{iT}(\theta,\lambda_{iT}(\theta))] - \sup_{\{\lambda:\,\|\lambda-\lambda_{iT}(\theta)\|>\eta\}} E[\ell_{iT}(\theta,\lambda)]\Big\} > 0.$$

Note that η does not necessarily take on the same value in both of its appearances above.

Assumption 3.3 (i) For all θ′, θ′′ ∈ Θ and λ′, λ′′ ∈ Λ we have

$$\left|\ell_{it}(\theta', \lambda') - \ell_{it}(\theta'', \lambda'')\right| \leq d(x_{it})\left\|(\theta', \lambda') - (\theta'', \lambda'')\right\|,$$

where d(·) is a measurable function of x_it and sup_{i,t} E[||d(x_it)||] < ∞; (ii) the same assumption holds for ℓ^λλλ_it(θ, λ), ℓ^λλλλ_it(θ, λ) and Z^(0,k)_it(θ, λ) where |k| = 5.

Assumption 3.4 (i) For any i = 1, ..., N, {x_it}_{t=1}^T is an α-mixing sequence where the mixing coefficients are of size −(2 + ε)/ε for some ε > 0 and satisfy Σ_{m=1}^∞ m α_i(m)^{δ/(3+δ)} < ∞ for some δ > 0; (ii) for all (m, k) such that m ∈ {1, 2} and |k| ≤ 4, or m = 3 and |k| = 0, Var(√T Z^(m,k)_iT(θ)) > 0 for all i, θ and sufficiently large T.

Assumption 3.5 For c ∈ {1, 2, 3}, let k_c be a set of P non-negative integers such that |k_c| = c. For each c there is some 0 ≤ ρ_c ≤ 1 such that σ²_{NT,k_c} = Var(√(N^{ρ_c}T) Z̃^(0,k_c)_NT(θ_0, λ_0)) > 0 for all N, T, and

$$\sqrt{N^{\rho_c}T}\;\frac{\tilde{Z}^{(0,k_c)}_{NT}(\theta_0,\lambda_0)}{\sigma_{NT,k_c}} \xrightarrow{d} \mathcal{N}(0,1) \quad \text{as } N,T\to\infty.$$

Finally, Var(√(N^{ρ_1}T) dZ̃_NT(θ_0, λ_0)/dθ) is positive definite for all N and T.

Assumption 3.1(i) formalises the large-N large-T asymptotic setting. The continuity assumption in (ii) ensures that the likelihood function is smooth enough for the asymptotic expansions to exist. This also implies that all the mixing properties of x_it are inherited by the likelihood function and its derivatives. This is because for any measurable function g(·), if a sequence (x_it, x_i,t−1, x_i,t−2, ...) is mixing, then, for finite τ, g(x_it, x_i,t−1, ..., x_i,t−τ) is also mixing with the same size. Assumption (iii) is used standardly in proving consistency (see e.g. Hahn and Kuersteiner (2011)). The uniqueness of λ̂_i(θ) is required for the existence of Laplace expansions for the integrated likelihood function.6 Finally, (v) guarantees the existence of the inverse terms appearing in the expansions.

6 In the case of multiple maxima, the solution would be to divide the parameter space into subintervals with one global maximum in each.

Assumption 3.2 ensures unique identification of θ_0 and λ_iT(θ). That λ_iT(θ) is uniquely identifiable for all θ implies unique identification of λ_iT(θ_0) = λ_i0, as well. We need this stronger identification condition because bias correction is with respect to the target likelihood estimator. Of course, a necessary implicit assumption is that the support of π_i(λ_i|θ) contains an open neighbourhood of the true parameters λ_i0 and θ_0 (Arellano and Bonhomme (2009)). The Lipschitz continuity condition of Assumption 3.3 is required for consistency proofs. Part (ii) of this assumption is necessary for bounding the remainder terms in the asymptotic expansions. Assumption 3.4(i) establishes the α-mixing characteristics of the data. The summability condition determines the decay rate of dependence. In addition to part (ii) of this assumption, a series of existence and moment conditions is given in Assumptions A.1 and A.2 in the Appendix.7

Assumption 3.5 is the key assumption that controls the double asymptotic behaviour as N, T → ∞. Essentially, this assumption states that the first three centred derivatives of the likelihood function ℓ_NT(θ, λ(θ)) with respect to θ possess central limit theorems (CLTs) and converge at rates √(N^{ρ_1}T), √(N^{ρ_2}T) and √(N^{ρ_3}T), respectively. The particular value of ρ_c determines the nature of cross-section dependence. In the case of cross-section independence, under standard assumptions, ρ_c = 1 for all c. On the other hand, ρ_c = 0 implies that cross-section dependence is so strong that cross-section variation does not make any information contribution. Other combinations of ρ_1, ρ_2 and ρ_3 can be used to achieve a variety of dependence structures.

4 Consistency and Characterisation of the Incidental Parameter Bias

In this and the following sections, we turn to the main contribution of this study. Our objective is to prove the consistency of the integrated composite likelihood estimator in (4) and investigate its first-order bias under time-series and cross-section dependence.

Theorem 4.1 Under Assumptions 3.1, 3.2, 3.3(i), 3.4(i), A.1 and A.2, for all η > 0,

$$P\left[\max_{1\leq i\leq N}\left\|\hat{\lambda}_i(\theta) - \lambda_{iT}(\theta)\right\| > \eta\right] = o(1) \quad \text{for all } \theta\in\Theta \text{ as } T\to\infty,$$

$$P\left[\left\|\hat{\theta}_{IL} - \theta_0\right\| > \eta\right] = o(1) \quad \text{as } N,T\to\infty,$$

where the ηs appearing in both expressions are not necessarily identical.

The first part of this result is based on Theorem 4.1 of Engle et al. (2014). The main novelty here is the consistency result for the integrated composite likelihood estimator of θ_0. Intuitively, the integrated composite likelihood function is asymptotically equal to the composite likelihood function. This asymptotic equivalence is used to obtain a uniform convergence result which leads to θ̂_IL →_p θ_0.

7 In what follows, for the sake of brevity, we do not individually spell out the particular moment assumptions required for a Theorem or Lemma. Instead, we simply state that Assumption A.1 holds. The same practice applies to Assumptions 3.4(ii) and A.2.

The next objective is to analyse the first-order bias of θ̂_IL, which is done in two steps. First, we characterise the incidental parameter bias. Since each incidental parameter λ_i affects estimation through the marginal likelihood for i, this bias is captured by the first-order bias of the marginal integrated likelihood function for i, ℓ^I_iT(θ) − ℓ_iT(θ, λ_iT(θ)); see Theorem 4.2 below. In the second step, we derive the first-order bias of θ̂_IL. This depends on the asymptotic behaviour of the composite integrated likelihood function Σ_{i=1}^N ℓ^I_iT(θ), which incorporates cross-sectional information. As such, it is here where the effect of cross-section dependence on the first-order bias reveals itself; see Theorem 5.1.

Theorem 4.2 Under Assumptions 3.1-3.4, A.1 and A.2, for any i, as T → ∞,

$$\ell^{I}_{iT}(\theta) - \ell_{iT}(\theta) = \frac{\ln(2\pi T^{-1})}{2T} + B^{(1)}_{iT}(\theta) + B^{(2)}_{iT}(\theta) + O_p\left(\frac{1}{T^2}\right), \qquad (5)$$

where

$$B^{(1)}_{iT}(\theta) = \frac{1}{2}\frac{(\ell^{\lambda}_{iT}(\theta))^2}{E[-\ell^{\lambda\lambda}_{iT}(\theta)]} - \frac{1}{2T}\ln\{-\ell^{\lambda\lambda}_{iT}[\theta, \hat{\lambda}_i(\theta)]\} + \frac{1}{T}\ln \pi_i(\hat{\lambda}_i(\theta)\mid\theta) = O_p\left(\frac{1}{T}\right),$$

$$E[B^{(1)}_{iT}(\theta)] = \frac{1}{2}\frac{E[(\ell^{\lambda}_{iT}(\theta))^2]}{E[-\ell^{\lambda\lambda}_{iT}(\theta)]} - \frac{1}{2T}\ln\{-E[\ell^{\lambda\lambda}_{iT}(\theta)]\} + \frac{1}{T}\ln \pi_i(\lambda_{iT}(\theta)\mid\theta) = O\left(\frac{1}{T}\right),$$

$$B^{(2)}_{iT}(\theta) = \frac{1}{2}\frac{V^{\lambda\lambda}_{iT}(\theta)\,(\ell^{\lambda}_{iT}(\theta))^2}{\{E[\ell^{\lambda\lambda}_{iT}(\theta)]\}^2} - \frac{1}{6}\frac{(\ell^{\lambda}_{iT}(\theta))^3\,E[\ell^{\lambda\lambda\lambda}_{iT}(\theta)]}{\{E[\ell^{\lambda\lambda}_{iT}(\theta)]\}^3} = O_p\left(\frac{1}{T^{3/2}}\right),$$

$$E[B^{(2)}_{iT}(\theta)] = O\left(\frac{1}{T^2}\right),$$

implying that

$$E[\ell^{I}_{iT}(\theta) - \ell_{iT}(\theta)] = \frac{\ln(2\pi T^{-1})}{2T} + E[B^{(1)}_{iT}(\theta)] + O\left(\frac{1}{T^2}\right). \qquad (6)$$

The first-order bias term in (6) is exactly the same as that derived by Arellano and Bonhomme (2009) (see their Theorem 1). Hence, the Arellano-Bonhomme robust priors are still valid. The first part of Theorem 4.2 also provides a higher order expansion for ℓ^I_iT(θ) − ℓ_iT(θ), which includes the O_p(T^{−3/2}) term B^(2)_iT(θ). This term does not appear in the bias formula in (6), as its expectation is O(T^{−2}), due to the strong mixing assumption. The analytical expression for B^(2)_iT(θ) could still be useful for higher-order correction, though this will not be pursued here.

Since the bias here is the same as the one found by Arellano and Bonhomme (2009), the prior specification which yields a first-order unbiased score is also identical (see their equations (12) and (14)). They provide two prior specifications, where one depends on the information inequality and the other does not. We will use the latter one as it is robust to parametric misspecification. This is given by

$$\pi^{R}_i(\lambda_i \mid \theta) \propto \{E[-\ell^{\lambda\lambda}_{iT}(\theta, \lambda_i)]\}^{1/2} \exp\left(-\frac{T}{2}\{E[-\ell^{\lambda\lambda}_{iT}(\theta, \lambda_i)]\}^{-1} E\{[\ell^{\lambda}_{iT}(\theta, \lambda_i)]^2\}\right). \qquad (7)$$

This formula can be obtained easily by taking the derivative of E[B^(1)_iT(θ)] with respect to θ and choosing π^R_i(λ_i|θ) such that the resulting term is equal to zero.
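
To fix ideas, here is a minimal plug-in sketch of the robust prior in (7). It is only an illustration: the plug-in estimators of E[−ℓ^λλ_iT] and E[(ℓ^λ_iT)²], including the Newey-West-type long-run variance, are assumptions of the sketch rather than the paper's prescription. It takes the per-period score and Hessian contributions with respect to λ_i, evaluated at a candidate (θ, λ_i), and returns the log robust prior up to an additive constant.

```python
import numpy as np

def log_robust_prior(score_t, hess_t, n_lags=4):
    """log pi_R(lambda_i | theta) + const, from per-period derivatives w.r.t. lambda_i.

    score_t : array of d l_it / d lambda_i, t = 1, ..., T, at (theta, lambda_i)
    hess_t  : array of d^2 l_it / d lambda_i^2 at the same point
    """
    T = len(score_t)
    J = -np.mean(hess_t)                  # plug-in for E[-l^{ll}_iT(theta, lambda_i)]
    s_bar = np.mean(score_t)
    # plug-in for E[(l^{l}_iT)^2] = (E[l^{l}_iT])^2 + Var(l^{l}_iT), using a
    # Newey-West-type long-run variance to allow for weak serial dependence
    u = score_t - s_bar
    lrv = np.mean(u * u)
    for m in range(1, n_lags + 1):
        w = 1 - m / (n_lags + 1)
        lrv += 2 * w * np.mean(u[m:] * u[:-m])
    V = s_bar ** 2 + lrv / T
    return 0.5 * np.log(J) - 0.5 * T * V / J

rng = np.random.default_rng(5)
print(log_robust_prior(rng.standard_normal(200), -np.ones(200) + 0.1 * rng.standard_normal(200)))
```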

5 Bias in the Presence of Cross-Section Dependence

5.1 The General Result

In the absence of cross-section dependence, it is well-known that a first-order corrected likelihood function yields a first-order unbiased estimator for θ_0 (Arellano and Hahn (2007)). This result will break down if cross-section dependence leads to extra O(T^{−1}) bias terms, since a bias correction method which is designed to deal with the incidental parameter bias only will be ineffective against bias terms generated through different mechanisms. Whether such extra bias terms do indeed exist is investigated in the next result.

Theorem 5.1 Under Assumptions 3.1-3.5, A.1 and A.2, as N, T → ∞,

$$(\hat{\theta}_{IL} - \theta_0) = A_{NT} + B_{NT} + C_{NT} + D_{NT} + O_p\left(\frac{1}{T^{3/2}}\right), \qquad (8)$$

where

$$A_{NT} = -\nu^{-1}S = O_p\left(\frac{1}{\sqrt{N^{\rho_1}T}}\right),$$

$$B_{NT} = -\nu^{-1}\frac{1}{N}\sum_{i=1}^{N}\nabla_{\theta}\left\{\frac{1}{T}\ln E[\ell^{\lambda\lambda}_{iT}(\theta)] + \frac{1}{T}\ln \pi_i(\lambda_{iT}(\theta)\mid\theta) - \frac{[\ell^{\lambda}_{iT}(\theta)]^2}{E[\ell^{\lambda\lambda}_{iT}(\theta)]}\right\}\Bigg|_{\theta=\theta_0} = O_p\left(\frac{1}{T}\right),$$

$$C_{NT} = \nu^{-1}(H - \nu)S\nu^{-1} = O_p\left(\frac{1}{N^{(\rho_1+\rho_2)/2}\,T}\right),$$

$$D_{NT} = -\frac{1}{2}\nu^{-1}M = O_p\left(\frac{1}{N^{\rho_1}T}\right).$$

In addition,

$$\sqrt{N^{\rho_1}T}\,(\hat{\theta}_{IL} - \theta_0) \xrightarrow{d} \mathcal{N}(\rho\beta,\ \bar{\nu}^{-1}\sigma^2_S\,\bar{\nu}^{-1}), \qquad (9)$$

where β = plim_{N,T→∞} T B_NT, ν̄ = plim_{N,T→∞} ν, ρ = lim_{N,T→∞} √(N^{ρ_1}/T) and σ²_S = plim_{N,T→∞} Var(√(N^{ρ_1}T) S). If θ̂_IL is based on the robust prior, then

$$\sqrt{N^{\rho_1}T}\,(\hat{\theta}_{IL} - \theta_0) \xrightarrow{d} \mathcal{N}(0,\ \bar{\nu}^{-1}\sigma^2_S\,\bar{\nu}^{-1}). \qquad (10)$$

The first part of the theorem provides a higher order expansion for θ̂_IL − θ_0. The O_p(T^{−3/2}) remainder is composed of a multitude of terms, the exact magnitudes of which depend on ρ_1, ρ_2 and ρ_3. Hence, the exact order of the remainder term changes with the level of dependence. For example, for the case of cross-section independence, which implies ρ_1 = ρ_2 = ρ_3 = 1, (8) actually becomes an expansion up to an O_p(N^{−2}T^{−2}) remainder. See Lemma A.13 in the Mathematical Appendix for a more detailed version of this expansion. As standard, A_NT drives the CLT and determines the speed of convergence. B_NT, on the other hand, is the cross-sectional average of the incidental parameter biases of individual scores. Bias correction based on (7) will correct for this term.

It is C_NT and D_NT that constitute the main novelty in (8). These terms reflect the effect of cross-section dependence on the behaviour of θ̂_IL. The first key result of Theorem 5.1 is that, when cross-section dependence is strong enough to imply ρ_1 = ρ_2 = 0, the first-order bias term will change from B_NT to B_NT + C_NT + D_NT. In other words, the incidental parameter bias will not be the sole source of the O(T^{−1}) bias anymore. Therefore, the robust prior, which is designed to deal with B_NT only, will not be able to fully correct the bias. It is, of course, an entirely different question whether the actual value of C_NT + D_NT will be large enough to cause a substantial change in the value of the bias term. For any ρ_1 > 0 or, equivalently, for any convergence rate faster than √T, the extra term becomes first-order negligible and the incidental parameter bias remains the only source of bias.

Some more subtle insight is provided by the CLT in (9). The standard result that under √(NT)-consistency the large-N large-T asymptotic distribution of the common parameter estimator has an asymptotic bias of order O(√(N/T)) is a special case of (9) with ρ_1 = 1. More generally, under √(N^{ρ_1}T)-consistency, the asymptotic bias will be negligible as long as N = o(T^{1/ρ_1}). Hence, as the level of dependence increases, N can be allowed to tend to infinity at a faster rate than T. The second key result of Theorem 5.1, however, is that under √T-consistency, the asymptotic distribution will be unbiased no matter how N is related to T. Intuitively, in this case the incidental parameter bias becomes a pure time-series small-sample bias. A subtle message of (8) and (9) together, then, is that although √T-consistency leads to an extra first-order bias term, there is no reason to worry about this as long as T is large enough. The extra bias will become a problem only when T is small. The available bias correction methods can then still be useful even in the absence of the classical incidental parameter bias, to correct the small-sample bias this time. This is indeed the main motivation behind the empirical application considered in this paper.

Finally, (10) confirms that the robust prior successfully removes the bias of the asymptotic distribution. This is not in conflict with the earlier observation that at √T-consistency the first-order bias includes extra terms that cannot be corrected by the robust prior. These terms do not appear in (10), simply because at √T-consistency both √T C_NT and √T D_NT are asymptotically equal to zero.

Theorem 5.1 provides a flexible overview of the asymptotic behaviour of θ̂_IL under different levels of dependence, determined by the particular values of ρ_1 and ρ_2. This encompasses a spectrum of possibilities between the two polar cases of √T- and √(NT)-consistency. Independent of whether incidental parameter estimation leads to non-vanishing asymptotic bias or not, in small samples one will still most likely have to deal with the O(T^{−1}) bias. This requires one to obtain an analytical expression for the first-order term in the asymptotic expansion for E[θ̂_IL − θ_0]. However, formally obtaining this expansion based on (8) requires additional assumptions on the dependence structure, even if one is prepared to take a stand on the particular values of ρ_1, ρ_2 and ρ_3. Next, we provide these conditions for some commonly used types of dependence considered in the literature.

5.2 Weak Cross-Section Dependence

Recently, there has been an increased interest in modelling weak dependence in the form of mixing random fields; see, e.g., Conley (1999), Kelejian and Prucha (2007) and Bester, Conley and Hansen (2011). In this setting, although the random variables are assumed to be dependent, the magnitude of dependence approaches zero as their distance increases. This type of dependence is relevant in particular for the growing literature on networks. Chandrasekhar and Lewis (2011) have recently considered this dependence setting for social networks. Goldsmith-Pinkham and Imbens (2013) also mention that if individuals exhibit homophily in some covariates, then the network can be characterised by a weak dependence setting based on mixing assumptions, very much like the case considered here. Examples of other relevant fields are urban, agricultural, development and labour economics.

Defining a meaningful notion of economic distance is an important task in econometric applications when assuming this type of spatial dependence. However, we abstract away from this issue since our analysis does not require knowledge of the particular notion of distance used in the application at hand. The crucial element here is the assumption that the data constitute a mixing random field, the precise definition of which is given next.

The theoretical setting here is largely based on Jenish and Prucha (2009).8 Formally, the location indices (i, t) are assumed to be located on an integer lattice D ⊆ ℤ². The distance between two locations j′ and j′′ is measured by d(j′, j′′) = max_{l∈{1,2}} |j′_l − j′′_l|, where j′_l is the lth component of j′ (and likewise for j′′_l). The distance between subsets of D, in turn, is given by d(D′, D′′) = inf{d(j′, j′′) : j′ ∈ D′ and j′′ ∈ D′′}, for any D′, D′′ ⊆ D. We next define the α-mixing coefficient for random fields.

Definition 5.1 Let D_NT be the location set. For D′ ⊆ D_NT and D′′ ⊆ D_NT define the respective σ-fields as A_NT = σ(Y_j : j ∈ D′) and B_NT = σ(Y_j : j ∈ D′′). Let |D| be the cardinality of a finite set D. Then, the α-mixing coefficient for random fields is given by

$$\alpha_{k,l,N,T}(m) = \sup_{S_{NT}} |P(A \cap B) - P(A)P(B)|,$$

where S_NT = {(A, B) : A ∈ A_NT, B ∈ B_NT, |D′| ≤ k, |D′′| ≤ l, d(D′, D′′) ≥ m}.

8 In particular, the analysis is based on increasing domain asymptotics. See Assumption 1 of Jenish and Prucha (2009).

This definition differs from the standard time-series version in that it depends on the cardinalities of the index sets D′ and D′′, as well. Intuitively, given a fixed distance between two sets, the dependence between larger sets will be at least as high as the dependence between smaller sets, due to accumulation of dependence. Hence, α_{k,l,N,T}(m) is increasing in k and l (see Doukhan (1994)). Note that since the location set D_NT depends on N, T, it makes sense to define the mixing coefficient dependent on these as well. We will not worry about this in the remainder since the focus will be on α_{k,l}(m) = sup_{N,T} α_{k,l,N,T}(m). Then, the underlying random field is α-mixing if lim_{m→∞} α_{k,l}(m) = 0. Assumptions on this mixing rate are listed next.

Assumption 5.1 (i) For some δ > 0, Σ_{m=1}^∞ m³ α_{1,2}(m)^{δ/(3+δ)} < ∞; (ii) ρ_1 = ρ_2 = ρ_3 = 1 in Assumption 3.5.

Assumption 5.1(i) is strong enough to provide bounds for all higher order moments appearing in the asymptotic expansions. By setting ρ_1 = 1, Assumption 5.1(ii) effectively implies that the integrated likelihood estimator is √(NT)-consistent. Primitive conditions for such a CLT can be found in Jenish and Prucha (2009). Based on these assumptions, the small sample bias of the integrated likelihood estimator can now be determined.

Theorem 5.2 Under Assumptions 3.1-3.5, 5.1, A.1 and A.2,

$$E[\hat{\theta}_{IL} - \theta_0] - E[B_{NT}] = O\left(\frac{1}{T^2}\right) \quad \text{as } N, T \to \infty.$$

It is shown in the proof of this theorem that, under the stated assumptions, both E[C_NT] and E[D_NT] are O(N^{−1}T^{−1}) = O(T^{−2}). Hence, under the particular weak time-series and cross-section dependence setting considered here, the sole bias term is B_NT and the Arellano-Bonhomme robust priors are still effective.

5.3 Strong Cross-Section Dependence

The setting of the previous section allows dependence to decrease across either dimension as the distance increases. Another possibility is to allow for this only across time, while cross-sectional distance does not have any relationship with the magnitude of dependence. To illustrate, consider the normalised cross-sectional sum of covariances for some random variable x_it at some point in time: N^{−2} Σ_{i=1}^N Σ_{j=1}^N Cov(x_it, x_jt) = O(N^{−α}), where α ∈ [0, 1], assuming that the covariances are all finite. If data are cross-sectionally mixing or independent, then under standard conditions α = 1. The other extreme, where α = 0, is what we consider as strong cross-section dependence.9 The magnitude of dependence between any two observations across time (x_is and x_jt), on the other hand, is assumed to depend entirely on the time distance |t − s|, for all i, j, s, t. To illustrate, let r_IBM,t and r_MSFT,t be the daily returns on the IBM and Microsoft equities, respectively, on day t. Then, for example, Cov(r_IBM,t, r_MSFT,t+m) and Cov(r_MSFT,t, r_MSFT,t+m) are assumed to be of the same order of magnitude. This setting would make sense for financial variables, especially during financial downturns when such variables are known to exhibit increased correlation.

9 This idea of strong dependence is similar to the one used by Chudik, Pesaran and Tosetti (2011) and Bailey, Kapetanios and Pesaran (2012) in a factor modelling framework. Unfortunately, although it is a highly popular option for modelling dependence, it is not possible to consider a factor-type dependence in this study. The main reason is that, unless one knows the particular model which the likelihood function is based on (e.g. probit), it is impossible to deduce the dependence structure of the likelihood function from the factor structure that governs the data, x_it. As our theoretical analysis is based on a generic likelihood function, using a factor dependence structure is therefore not an option.
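
As a rough, hedged diagnostic for the N^{−2} Σ_i Σ_j Cov(x_it, x_jt) = O(N^{−α}) distinction above (an illustration of ours, using time-series sample covariances as plug-ins for the contemporaneous covariances), the sketch below compares an independent panel with one driven by a common component.

```python
import numpy as np

rng = np.random.default_rng(6)
T = 500

def normalised_cov_sum(x):
    """N^{-2} * sum over (i, j) of the sample covariance of x_it and x_jt."""
    N = x.shape[1]
    C = np.cov(x, rowvar=False)     # N x N sample covariance matrix
    return C.sum() / N ** 2

for N in (25, 50, 100, 200):
    indep = rng.standard_normal((T, N))
    common = rng.standard_normal((T, 1))
    strong = np.sqrt(0.5) * common + np.sqrt(0.5) * rng.standard_normal((T, N))
    print(N, round(normalised_cov_sum(indep), 4), round(normalised_cov_sum(strong), 4))
```

Under independence the statistic shrinks roughly like 1/N, whereas with a common component it stabilises, which corresponds to the α = 0 case driving the extra bias terms in Theorem 5.3 below.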

We now formalise these ideas. Let

$$\alpha_{i,j}(m) = \sup_t \alpha(\mathcal{F}^t_{i,-\infty}, \mathcal{F}^{\infty}_{j,t+m}),$$

$$\alpha_{ij,k}(m) = \sup_t \alpha(\mathcal{F}^t_{ij,-\infty}, \mathcal{F}^{\infty}_{k,t+m}) \qquad \text{and} \qquad \alpha_{i,jk}(m) = \sup_t \alpha(\mathcal{F}^t_{i,-\infty}, \mathcal{F}^{\infty}_{jk,t+m}),$$

where i, j, k = 1, ..., N, F^t_{i,−∞} = σ(..., x_i,t−1, x_it), F^∞_{i,t+m} = σ(x_i,t+m, x_i,t+m+1, ...), F^t_{ij,−∞} = σ(..., x_i,t−1, x_j,t−1, x_it, x_jt) and F^∞_{ij,t+m} = σ(x_i,t+m, x_j,t+m, x_i,t+m+1, x_j,t+m+1, ...). Intuitively, α_{i,j}(m) measures the dependence between m-period apart observations on individuals i and j. On the other hand, α_{ij,k}(m) measures the dependence between m-period apart observations that belong to (i) the sigma-field generated by individuals i and j together and (ii) the sigma-field generated by individual k. Such coefficients will be useful for understanding the dependence properties of non-standard covariances, such as Cov(x_it x_jt, x_k,t+m). Notice that, by definition, α_{ii,j} = α_{i,j}.

Assumption 5.2 (i) For all i, j, k = 1, ..., N we have (a) lim_{m→∞} α_{ij,k}(m) = 0 and lim_{m→∞} α_{i,jk}(m) = 0, and (b) Σ_{m=1}^∞ m α_{ij,k}(m)^{δ/(3+δ)} < ∞ and Σ_{m=1}^∞ m α_{i,jk}(m)^{δ/(3+δ)} < ∞ for some δ > 0 (which is not necessarily the same across all of the preceding summations); (ii) ρ_1 = ρ_2 = ρ_3 = 0 in Assumption 3.5.

Assumption 5.2(i) is strong enough to provide bounds for all the higher order expectations that appear in the asymptotic expansion for E[θ̂_IL − θ_0]. The mixing assumptions take care of the dependence structure across time. As for contemporaneous dependence, the existence of all contemporaneous covariances that appear in the asymptotic expansions is already guaranteed by Assumption A.1. In the absence of any further restrictions, these terms will all be uniformly O(1), implying strong dependence. In line with this strong dependence setting, Assumption 5.2(ii) imposes √T-consistency, implying that cross-sectional information makes no contribution to the speed of convergence. Similar assumptions on the convergence rate have also been considered by Goncalves (2011) and Engle, Pakel, Shephard and Sheppard (2014). Providing primitive conditions for an appropriate CLT is beyond our scope. However, given that N = O(T), one possibility would be to consider a triangular array setting and then utilise the literature on triangular array CLTs for mixing processes (e.g. Bosq, Merlevede and Peligrad (1999)).

Theorem 5.3 Under Assumptions 3.1-3.5, 5.2, A.1 and A.2,

$$E[\hat{\theta}_{IL} - \theta_0] - E[B_{NT}] = \nu^{-1}E[(H - \nu)S]\nu^{-1} - \frac{1}{2}\nu^{-1}E[M] + O\left(\frac{1}{T^2}\right) \quad \text{as } N, T \to \infty,$$

where ν^{-1}E[(H − ν)S]ν^{-1} and E[M] are both O(T^{−1}).

Hence, too much cross-section dependence leads to extra first-order bias terms. The robust prior will not correct for these terms. However, it is still possible to bias correct θ̂_IL in a subsequent stage, by using the analytical expressions provided here.

5.4 Clustered Samples

Finally, we consider the common setting of clustered samples; see e.g. Bertrand, Duflo and Mullainathan (2004), Hansen (2007) and, for surveys, Wooldridge (2003) and Cameron and Miller (2011). Let the dataset consist of G_N = O(N^α) cross-sectional groups (clusters), where 0 ≤ α ≤ 1. We define G_g as the index set for group g, where g = 1, ..., G_N. Let, furthermore, L_g,N be the number of members of group g. For simplicity, we focus on the case where L_g,N = L_N for all g. Hence, L_N = N/G_N.

Assumption 5.3 Let i ∈ G_a, j ∈ G_b and k ∈ G_c. Let, furthermore, f(·), g(·) and h(·) be some zero-mean and continuous functions. We assume that (i) f(x_is), g(x_jt) and h(x_kq) are mutually independent for all s, t, q unless a = b = c; (ii) the dependence structure between f(x_is) and g(x_jt) is determined by Assumptions 3.4 and 5.2(i) whenever a = b (and similarly for other pairs of functions); (iii) ρ_1 = ρ_2 = ρ_3 = α in Assumption 3.5.

Hence, following the standard convention, we assume dependence within clusters and independence between clusters. In particular, within-cluster dependence is assumed to be as in Section 5.3. By Assumption 5.3(iii), the integrated likelihood estimator is now assumed to be √(N^α T)-consistent. Hence, this setting can be considered as a mix of the two polar cases analysed before, with a convergence rate somewhere between √T and √(NT).

Theorem 5.4 Under Assumptions 3.1-3.5, 5.3, A.1 and A.2,

$$E[\hat{\theta}_{IL} - \theta_0] - E[B_{NT}] = \nu^{-1}E[(H - \nu)S]\nu^{-1} - \frac{1}{2}\nu^{-1}E[M] + O\left(\frac{1}{T^2}\right) \quad \text{as } N, T \to \infty,$$

where ν^{-1}E[(H − ν)S]ν^{-1} and ν^{-1}E[M] are both O(N^{−α}T^{−1}).

Hence, the orders of the extra bias terms depend on the asymptotic ratio of the number of clusters to the sample size. This is not surprising, given that the total amount of cross-section dependence also depends on the same ratio. Since L_N = O(N^{1−α}), α = 1 implies that as N → ∞ group size remains fixed while the number of groups tends to infinity. Hence, cross-section dependence vanishes asymptotically. This brings us back to the independence/weak dependence setting. When α = 0, the number of groups remains fixed while group size inflates with N. Asymptotically, there will be a few groups with infinitely many members, leading to accumulation of cross-section dependence. The ensuing setting is reminiscent of strong dependence. Any other α will yield something in between.

5.5 Simulation Analysis

In this section, we conduct a short simulation analysis for the dynamic autoregressive panel model, before moving to the main empirical application of volatility modelling. The data generating process is given by

$$y_{it} = \mu_i + \theta y_{i,t-1} + \varepsilon_{it}, \qquad \varepsilon_{it} = \frac{\tilde{\varepsilon}_{it}}{\sqrt{1 + \gamma^2\sum_{j=1}^{N}A_{ij}}}, \qquad \tilde{\varepsilon}_{it} = \gamma\sum_{j=1}^{N}\eta_{jt}A_{ij} + \eta_{it}, \qquad (11)$$

where η_it ~ iid N(0, 1) and ε̃_it is normalised to ε_it to obtain unit variance. Here, A_ij = 1 if i and j are connected (dependent) and zero otherwise. By convention, A_ii = 0 for all i. Therefore, {A_ij}_{i,j=1}^N yields an adjacency matrix. If, for example, A_ij = 0 for all i, j, then data are cross-sectionally independent. If, on the other hand, A_ij = 1 for all i ≠ j, strong dependence ensues. We consider independence, cluster dependence and strong dependence. The cluster dependence setting is such that G_N = O(N^{1/2}), implying N^{3/2} connections between N observations.10
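
For concreteness, a minimal simulation sketch of (11) follows (a sketch of ours, with illustrative sizes; the cluster-assignment rule only mirrors the description above and footnote 10, it is not the paper's exact code).

```python
import numpy as np

def make_adjacency(N, design):
    """Adjacency matrix A for the three designs: 'independence', 'cluster', 'strong'."""
    A = np.zeros((N, N))
    if design == "strong":
        A = np.ones((N, N))
    elif design == "cluster":
        G = max(1, round(np.sqrt(N)))              # number of groups, approximately sqrt(N)
        groups = np.arange(N) % G
        A = (groups[:, None] == groups[None, :]).astype(float)
    np.fill_diagonal(A, 0.0)                       # A_ii = 0 by convention
    return A

def simulate_panel(N, T, theta=0.5, gamma=0.5, design="cluster", seed=0):
    rng = np.random.default_rng(seed)
    A = make_adjacency(N, design)
    mu = rng.standard_normal(N)
    scale = np.sqrt(1.0 + gamma ** 2 * A.sum(axis=1))   # normalisation making eps_it unit variance
    y = np.zeros((N, T + 1))
    for t in range(1, T + 1):
        eta = rng.standard_normal(N)
        eps = (gamma * A @ eta + eta) / scale
        y[:, t] = mu + theta * y[:, t - 1] + eps
    return y[:, 1:]

y = simulate_panel(N=50, T=100)
print(y.shape)
```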

1,000 replications of the above data generating process are simulated for θ = 0.5 and γ = 0.5, while the μ_i are iid draws from the standard normal distribution. The parameter θ is estimated by the maximum likelihood and robust integrated likelihood methods. In each case, we report the average bias, root-mean-square error (RMSE) and the implied value of ρ. The implied ρ is calculated as follows: let A_1 = K_1/(N^{1/2}T^{1/2}) be the leading term in (8) under weak dependence/independence and A_2 = K_2/(N^ρ T^{1/2}) be the corresponding term under N^ρ T^{1/2}-consistency, where K_1, K_2 are O_p(1). Then, for ln K_1 − ln K_2 ≈ 0,

$$\frac{1}{2} + \frac{\ln A_1 - \ln A_2}{\ln N} \approx \rho. \qquad (12)$$

The implied ρ is given by the average of this term across all replications. If the assumed convergence rates are correct, then ρ calculated using (12) should be on average around 0.25 for cluster dependence and 0 for strong dependence.
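
A hedged sketch of the implied-ρ computation in (12) follows. It reflects one reading of the procedure, namely that A_1 and A_2 are replication-level magnitudes of the leading term under the independence and dependence designs respectively, averaged across replications; the toy inputs are purely illustrative.

```python
import numpy as np

def implied_rho(A1, A2, N):
    """Average of 1/2 + (ln A1 - ln A2)/ln N across replications, as in (12)."""
    A1, A2 = np.abs(np.asarray(A1)), np.abs(np.asarray(A2))   # magnitudes of the leading terms
    return np.mean(0.5 + (np.log(A1) - np.log(A2)) / np.log(N))

# toy check with the orders of magnitude (12) is built on:
# A1 ~ N^{-1/2} T^{-1/2} and A2 ~ N^{-rho} T^{-1/2}
rng = np.random.default_rng(7)
N, T, rho_true, R = 100, 100, 0.25, 1000
K1 = np.exp(0.1 * rng.standard_normal(R))
K2 = np.exp(0.1 * rng.standard_normal(R))
A1 = K1 / (np.sqrt(N) * np.sqrt(T))
A2 = K2 / (N ** rho_true * np.sqrt(T))
print(round(implied_rho(A1, A2, N), 3))   # should be close to 0.25
```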

The results are presented in Table 1. As expected, the robust prior is quite successful in removing the incidental parameter bias and leading to a lower RMSE at the same time. Importantly, these observations hold not only in the independence setting but also across the two dependence settings considered here. As for the magnitude of the extra

10 In practice, the exact number of groups has to be slightly different than √N, as for some N the exact value of √N will not be an integer. In this case, the number of groups will be equal to √N rounded to the nearest integer.


[Table 1 reports, for each combination of T ∈ {10, 25, 50, 100, 200} and N ∈ {50, 100, 200}, the average bias and RMSE of the estimate of θ under the independence, cluster and strong dependence designs, for both the maximum likelihood and the robust integrated likelihood methods, together with the implied value of ρ for the cluster and strong dependence cases under the robust integrated likelihood method.]

Table 1: Average bias and root-mean-square error (RMSE) results for estimation of the dynamic autoregressive panel model (11): y_it = μ_i + θ y_{i,t−1} + ε̃_it, where ε̃_it = ε_it/√(1 + γ² ∑_{j=1}^N A_ij), ε_it = γ ∑_{j=1}^N η_jt A_ij + η_it and η_it ~ iid N(0, 1). A_ij is the row-i, column-j entry of the adjacency matrix A. For each replication, µ_i are drawn from the N(0, 1) distribution while θ = 0.5 and γ = 0.5 across all replications. The "robust integrated likelihood" method is the integrated likelihood method based on the Arellano-Bonhomme robust prior. The "independence" case is based on A_ij = 0 for all i, j. The "cluster" case is based on clustered data where the number of groups is equal to √N, implying O(N^{3/2}) connections in the adjacency matrix. Whenever √N is not an integer, it is rounded to the nearest integer. Finally, the "strong" dependence case is based on samples where all of the N observations are connected. The implied value of ρ is calculated using (12). All results are based on 1,000 replications.


bias due to cross-section dependence, there are several observations. Firstly, although

cross-section dependence indeed leads to some extra bias, this bias is only minimal in the

cluster-dependence setting used here. It is the strong-dependence setting which leads to a

substantial increase in the bias. This is in line with the theoretical results that the order

of the extra bias term is O_p(T^{−1}) under strong dependence while in a cluster setting with G_N = O_p(√N), the order of this bias decreases to a negligible O_p(T^{−3/2}). Secondly and

more importantly, even in the strong dependence case, the magnitude of the new type of

bias for θ̂_IL is smaller than the incidental parameter bias. In fact, the worst average bias

across all sample sizes is −.035, which is acceptable. For the fixed effects estimator, on

the other hand, the combined bias can be quite substantial for small T : when T = 10, the

combined bias is almost 50%. These points are in favour of using the integrated composite

likelihood method in small samples, even when data are strongly dependent. Finally, the implied ρ is generally in line with the conjectured values, which further confirms that the assumed

convergence rates are valid in the case at hand.

6 Volatility Modelling in Small Samples

Variants of generalised autoregressive conditional heteroskedasticity (GARCH) type models

(Engle (1982) and Bollerslev (1986)) are commonly employed to model the volatility of

financial and macro series. As explained previously, GARCH-type models require hundreds

of observations, due to high levels of persistence in financial data. This is a problem for

datasets that are inherently short (e.g. hedge fund returns, industrial production, etc.)

as lack of data causes identification issues and small sample bias. Here, we propose a

novel method for modelling volatility in small samples based on a bias-corrected panel

estimation approach.

Our analysis is based on the standard GARCH(1,1) model.11 Let

yt = µt + εt, µt = E[yt|Ft−1], εt = σtηt,

where Ft is the information set at time t and ηt is some zero-mean and unit-variance iid

innovation process. For simplicity, henceforth we let µ_t = 0. This is a reasonable assumption for, e.g., daily stock returns. The GARCH(1,1) model for the conditional

variance σ²_t = Var(y_t|F_{t−1}) is given by

σ²_t = ω + α ε²_{t−1} + β σ²_{t−1},   where ω > 0; α, β ≥ 0 and α + β < 1.   (13)

Commonly, (ω, α, β) is estimated by the quasi maximum likelihood (QML) method, by

assuming that ηt ∼ N(0, 1). Under regularity conditions, the QML estimator is consistent

11There are virtually countless GARCH-type models and, unfortunately, considering even a selection of

these would be beyond the limits of this work. The standard GARCH(1,1) model has the appeal that it is a successful and yet simple variant of this large family. See Francq and Zakoïan (2010) for a detailed treatment of and more references on GARCH-type models.


even if the normality assumption is violated (Bollerslev and Wooldridge (1992)).
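For reference, a minimal sketch of Gaussian QML estimation of the GARCH(1,1) model in (13) is given below; the recursion initialisation and the optimiser are illustrative choices and this is not the estimation code used in the paper.

```python
import numpy as np
from scipy.optimize import minimize

def garch11_variances(y, omega, alpha, beta):
    """Run the GARCH(1,1) recursion (13) and return the conditional variances."""
    sig2 = np.empty(y.shape[0])
    sig2[0] = y.var()                         # initialise at the sample variance
    for t in range(1, y.shape[0]):
        sig2[t] = omega + alpha * y[t - 1] ** 2 + beta * sig2[t - 1]
    return sig2

def negative_qml(params, y):
    """Negative Gaussian quasi log-likelihood (up to constants)."""
    omega, alpha, beta = params
    if omega <= 0 or alpha < 0 or beta < 0 or alpha + beta >= 1:
        return np.inf                         # impose the restrictions in (13)
    sig2 = garch11_variances(y, omega, alpha, beta)
    return 0.5 * np.sum(np.log(sig2) + y ** 2 / sig2)

def fit_garch11(y):
    start = np.array([0.05 * y.var(), 0.05, 0.90])
    return minimize(negative_qml, start, args=(y,), method="Nelder-Mead").x
```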

Univariate volatility modelling is traditionally a time-series topic, as evident from (13)

being a time-series model. In a recent paper, Pakel, Shephard and Sheppard (2011) propose

a panel data approach, by assuming homogeneity of (α, β) across assets, motivated by the

empirical observation that estimates of these parameters take on similar values across

equities. Their model is given by

y_it = ε_it,   ε_it = σ_it η_it,   E[η_it] = 0,   Var(η_it) = 1,   (14)

σ²_it = λ_i(1 − α − β) + α ε²_{i,t−1} + β σ²_{i,t−1},   (15)

where similar parameter restrictions apply: λ_i > 0 for all i; α, β ≥ 0 and α + β < 1. The

cross-section dependence structure considered by Pakel, Shephard and Sheppard (2011)

is similar to the strong dependence setting considered in Section 5.3. In particular, they

assume √T-consistency. The implication of the model in (15) is that long-run variances

(determined by λi) are individual-specific while volatility dynamics (determined by (α, β))

are uniform. Possible examples of this would be returns of firms operating in the same

industry or of hedge funds following similar investment strategies. The parameterisa-

tion of the intercept term in (15) is called the “variance-targeting” representation (Engle

and Mezrich (1996)). This monotonic transformation greatly simplifies estimation since

E[y²_it] = λ_i. Then, (α, β) can be estimated by a two-step composite likelihood estimator:

(α̂, β̂) = arg max_{α,β} (1/NT) ∑_{i=1}^N ∑_{t=1}^T ℓ_it(α, β, λ̂_i),   where λ̂_i = (1/T) ∑_{t=1}^T y²_it.   (16)
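A minimal sketch of the two-step estimator in (16) follows: λ̂_i is the sample second moment of each series, and (α, β) maximise the pooled Gaussian likelihood built from the variance-targeting recursion (15). The optimiser and starting values are illustrative choices, not those of Pakel, Shephard and Sheppard (2011).

```python
import numpy as np
from scipy.optimize import minimize

def panel_variances(Y, alpha, beta, lam):
    """Variance-targeting recursion (15) for an N x T panel Y with long-run variances lam."""
    N, T = Y.shape
    sig2 = np.empty((N, T))
    sig2[:, 0] = lam
    for t in range(1, T):
        sig2[:, t] = lam * (1 - alpha - beta) + alpha * Y[:, t - 1] ** 2 + beta * sig2[:, t - 1]
    return sig2

def negative_cl(params, Y, lam):
    """(NT)^-1 times the negative pooled Gaussian log-likelihood in (16), up to constants."""
    alpha, beta = params
    if alpha < 0 or beta < 0 or alpha + beta >= 1:
        return np.inf
    sig2 = panel_variances(Y, alpha, beta, lam)
    return 0.5 * np.mean(np.log(sig2) + Y ** 2 / sig2)

def two_step_cl(Y):
    lam_hat = (Y ** 2).mean(axis=1)                    # step 1: lambda_hat_i = T^-1 sum_t y_it^2
    res = minimize(negative_cl, x0=np.array([0.05, 0.90]), args=(Y, lam_hat), method="Nelder-Mead")
    return res.x                                       # step 2: (alpha_hat, beta_hat)
```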

As a novel method for modelling volatility in small samples, we propose to estimate

the above GARCH panel model by the integrated composite likelihood method. Essen-

tially, the GARCH panel model in (14)-(15) is a nonlinear dynamic panel model with

the incidental parameter λi. This means that bias correction techniques can be used to

deal with the small-T bias in GARCH estimation. As the simulation results will reveal,

this approach is able to estimate the common parameters with as little as 150 time-series

observations, which is well below the sample sizes that standard GARCH estimation methods require.

This is our main contribution to the financial econometrics literature.

We will not investigate the validity of or find primitive conditions for Assumptions

3.1-3.5 and 5.2 for the particular case of GARCH(1,1) panels as this is well beyond our

scope. Primitive conditions nevertheless do exist for the central assumption of α-mixing

dependence across time. In addition to the parameter restrictions we have already as-

sumed, if ηit is iid across time and has a positive continuous density, then the likelihood

derivatives will be β-mixing (which implies α-mixing) for any i across time (Carrasco and

Chen (2002)).12 Finding primitive conditions for the remaining assumptions remains an

12Corradi and Iglesias (2008), for instance, use this result to obtain the α-mixing property for likelihood

derivatives.


open task. Our approach here is to investigate the validity of the proposed approach by

simulation analysis and comparison of predictive ability.

Correction of the cross-section dependence bias, in addition to the incidental parameter

bias, is an option that will not be considered in this application. There are two reasons

for this. Firstly, correction of this bias would only make sense under √T-consistency, as Theorem 5.3 shows. However, if the correct speed of convergence is higher than √T and one incorrectly assumes √T-consistency, then bias correction using Theorem 5.3 will lead to an extra O(T^{−1}) bias term. In the simulation analysis, dependence is generated

by a single-factor structure across the cross-section. However, although it is known that

this factor structure leads to strongly dependent data, there is no guarantee that this

will translate into strong dependence in the likelihood function and its derivatives, as

well. Secondly and more importantly, as the simulation analysis will later reveal, in our

particular case the cross-section dependence bias does not have a major impact.

Utilisation of cross-sectional information in modelling conditional variances has pre-

viously also been considered by e.g. Engle and Mezrich (1996), Bauwens and Rombouts

(2007) and Engle (2009). There are also other panel data studies which involve GARCH

effects, such as Cermeno and Grier (2006) and Hospido (2012). However, this is the first

study to model conditional volatility explicitly in a panel structure using a bias-correction

approach. Hospido (2012) also considers GARCH errors in analysing earning dynamics;

however, that study assumes cross-section independence and does not analyse the effects

of time-series dependence on the incidental parameter bias.

6.1 Simulation Setting

We consider the following estimation methods: (i) (CL) the composite likelihood method

of Pakel, Shephard and Sheppard (2011) as in (16), (ii) (IPCL) the integrated (pseudo)

likelihood method based on prior (7), which is the method proposed in this study, and

(iii) the infeasible composite likelihood (InCL) method, which is the same as CL except

that the composite likelihood function is based on λi0, rather than its estimate.13 This

last option is used as the benchmark. In addition, a second version of IPCL, which will

be introduced below, will also be considered.
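To fix ideas about what "integrating out" λ_i involves computationally, the sketch below approximates an integrated likelihood for a single unit by quadrature over a grid of λ values. A flat prior is used purely for illustration; the IPCL estimator in the paper is based on the robust prior in (7), which is not reproduced here, and the panel objective would average this quantity over i before maximising over (α, β).

```python
import numpy as np

def unit_loglik(y_i, alpha, beta, lam):
    """Gaussian GARCH(1,1) log-likelihood (up to constants) for one series, given lambda."""
    T = y_i.shape[0]
    sig2 = np.empty(T)
    sig2[0] = lam
    for t in range(1, T):
        sig2[t] = lam * (1 - alpha - beta) + alpha * y_i[t - 1] ** 2 + beta * sig2[t - 1]
    return -0.5 * np.sum(np.log(sig2) + y_i ** 2 / sig2)

def integrated_loglik(y_i, alpha, beta, lam_grid):
    """log of the integral of exp(loglik) against a flat prior on lam_grid (illustrative only)."""
    logs = np.array([unit_loglik(y_i, alpha, beta, lam) for lam in lam_grid])
    m = logs.max()                                     # log-sum-exp for numerical stability
    w = np.gradient(lam_grid)                          # approximate quadrature weights (grid spacing)
    return m + np.log(np.sum(np.exp(logs - m) * w) / (lam_grid[-1] - lam_grid[0]))
```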

Data are generated according to (14)-(15). The unconditional variance, λi0, is used

as the initial value for the conditional variance, σ2i0. For all replications, θ0 = (α0, β0) =

(0.05, 0.93) while λi0 are drawn from a uniform distribution such that the corresponding

annual volatility is between 15% and 80%, which provides a reasonable interval for most

stock returns. Cross-section dependence is generated by a single-factor model where

η_it = ρ_i u_t + √(1 − ρ_i²) τ_it,   u_t ~ iid N(0, 1) and τ_it ~ iid N(0, 1).

For any i ≠ j, this implies a time-invariant correlation of ρ_i ρ_j between η_it and η_jt while

13For details of integrated likelihood estimation, see Section 2 in the Supplementary Appendix.


α = 0.05, β = 0.93, α + β = 0.98
             CL                  InCL                IPCL                IPCL*
T        α     β    α+β      α     β    α+β      α     β    α+β      α     β    α+β
N=100
400    .045  .924  .969    .048  .932  .980    .046  .935  .981    .050  .929  .979
200    .038  .913  .951    .046  .932  .978    .042  .940  .982    .049  .926  .976
150    .034  .901  .935    .048  .927  .975    .040  .940  .980    .051  .919  .969
100    .017  .886  .902    .046  .925  .972    .031  .947  .978    .049  .904  .953
75     .009  .850  .860    .048  .920  .967    .026  .950  .976    .051  .868  .919
N=50
400    .046  .924  .969    .048  .932  .979    .046  .935  .981    .050  .929  .979
200    .039  .912  .950    .047  .930  .977    .042  .939  .981    .050  .924  .974
150    .033  .900  .933    .047  .929  .976    .039  .942  .981    .050  .920  .969
100    .019  .876  .895    .048  .923  .971    .032  .943  .975    .050  .896  .945
75     .008  .854  .863    .046  .920  .967    .024  .943  .967    .050  .857  .906
N=25
400    .045  .923  .969    .048  .932  .979    .046  .935  .981    .050  .928  .978
200    .039  .911  .950    .047  .931  .977    .042  .939  .980    .050  .924  .976
150    .034  .893  .927    .047  .927  .974    .039  .938  .978    .051  .915  .966
100    .020  .864  .884    .048  .923  .970    .031  .936  .968    .050  .891  .941
75     .012  .833  .844    .046  .921  .967    .025  .936  .961    .051  .846  .897

Table 2: Average parameter estimates for α, β and α+β by Composite Likelihood (CL), InfeasibleCL (InCL), standard Integrated CL (IPCL) and Integrated CL based on integrating out of theinitial value (IPCL*). Based on 500 replications of cross-sectionally dependent GARCH panels,where dependence is generated using the single-factor model outlined in Section 6.1 with thecorrelation parameter ρi drawn from the uniform distribution U(0.5, 0.9). The true parametervalues are given by α0 = 0.05 and β0 = 0.93 while λi0 are drawn from a uniform distribution suchthat the corresponding annual volatility is between 15% and 80%.

Cov(η_it, η_js|ρ_i, ρ_j) = 0 for all t ≠ s. Serial dependence is generated by the autoregressive

structure of σ2it. The ρi are drawn from a Uniform distribution, ρi ∼ U(0.5, 0.9). Therefore,

the correlation between any two series can be between 25% and 81%.
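The factor structure above is straightforward to generate; a minimal sketch of the innovations η_it used in the simulations (names are illustrative):

```python
import numpy as np

def factor_innovations(N, T, rho_low=0.5, rho_high=0.9, seed=0):
    """eta_it = rho_i * u_t + sqrt(1 - rho_i^2) * tau_it with u_t, tau_it iid N(0, 1)."""
    rng = np.random.default_rng(seed)
    rho = rng.uniform(rho_low, rho_high, size=N)       # loadings rho_i ~ U(0.5, 0.9)
    u = rng.standard_normal(T)                         # common factor u_t
    tau = rng.standard_normal((N, T))                  # idiosyncratic shocks tau_it
    eta = rho[:, None] * u[None, :] + np.sqrt(1.0 - rho[:, None] ** 2) * tau
    return eta, rho
```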

In order to construct the likelihood function, one requires an estimate of the initial value σ²_i0. It is common to assume that σ²_i0 = λ_i0, in which case the obvious choice is σ̂²_i0 = T^{−1} ∑_{t=1}^T y²_it. However, the integrated likelihood method offers an interesting op-

tion to bypass this step: since the initial value is assumed to be the same as the incidental

parameter, it is possible to integrate out not only the λi that acts as the incidental pa-

rameter in (15), but also the λi that acts as the initial value. We denote this option as

IPCL* and investigate it alongside CL, IPCL and InCL. If successful, this approach will

be useful in small sample applications where initial value selection is more influential.

6.2 Simulation Results

Simulation results are given in Tables 2 and 3. All results are based on 500 replications.

Table 2 presents the average values of α, β and α + β across replications; the last one


[Table 3 reports, for N = 100, 50, 25 and T = 400, 200, 150, 100, 75, the sample standard errors (left panel) and root-mean-square errors (right panel) of the α and β estimates under CL, InCL, IPCL and IPCL*.]

Table 3: Sample standard error (left panel) and root-mean-square error (right panel) for α and β by Composite Likelihood (CL), Infeasible CL (InCL), standard Integrated CL (IPCL) and Integrated CL based on integrating out of the initial value (IPCL*). Based on 500 replications of cross-sectionally dependent GARCH panels, where dependence is generated using the single-factor model outlined in Section 6.1 with the correlation parameter ρ_i drawn from the uniform distribution U(0.5, 0.9). The true parameter values are given by α0 = 0.05 and β0 = 0.93 while λ_i0 are drawn from a uniform distribution such that the corresponding annual volatility is between 15% and 80%.



Figure 1: Average likelihood plots for T = 150 and N = 100. Based on the average likelihoodsfor 500 replications of cross-sectionally dependent GARCH panels. In the plot for α, β is fixed at0.93 while the plot for β is based on α = 0.05. CL is evaluated at the estimated values of λi, whileInfeasible CL is evaluated at the true values of λi. Vertical lines are drawn at the true parametervalues of α0 = 0.05 and β0 = 0.93.

is generally considered as a measure of “persistence” and is an important parameter in

volatility estimation. These results confirm that GARCH panels suffer from a small-T

bias which is not systematically alleviated by increasing N . As soon as estimation of λi is

side-stepped (InCL), this bias almost vanishes. IPCL is effective in removing a substantial

portion of the bias, especially when T is small. Note that the bias correction here is based

on calculation of the bias term by numerical methods - as is recognised in the literature,

this approach is generally unable to remove bias completely. The better alternative of

obtaining the bias in closed form is not possible here as no closed form expression of

likelihood derivatives exists for GARCH(1,1). The sample standard errors of estimates

across replications in Table 3 demonstrate that bias correction does not increase estimator

variance. This also is in line with common observations in the literature. The likelihood

plots given in Figure 1, which are based on the simulation results, also confirm that CL

suffers from a miscentred likelihood function and that the integrated likelihood method is

able to re-centre the likelihood function correctly.

IPCL* delivers interesting results: the average bias of α virtually disappears, inde-

pendent of T . The performance of β is mixed although IPCL* exhibits better performance

than IPCL as T and N increase. In estimating α+ β, IPCL is the clear winner. An inter-

esting observation is that IPCL* leads to an increase in variance for β, which is uncommon

for analytical bias correction. Nevertheless, this is not too surprising given that IPCL*

involves more than standard bias correction. For small T , in some cases this increase is

large enough to lead to a higher root-mean-square error (RMSE) compared to even CL

(see the right panel of Table 3). On the other hand, IPCL* performs better than IPCL

when it comes to the RMSE for α. Therefore, IPCL*’s combined performance in estimat-


[Figure 2 panels: sample distributions of α, β and α+β for T = 100, 150, 200 with N = 100; each panel compares CL, InCL, IPCL and IPCL*.]

Figure 2: Sample distributions of α, β and α+ β using the Composite Likelihood (CL), InfeasibleCL (InCL), Integrated Pseudo CL (IPCL) and the Integrated Pseudo CL with adjustment forinitial value selection (IPCL*). The vertical lines are drawn at the true parameter values. Basedon 500 replications of cross-sectionally dependent GARCH panels where α0 = 0.05 and β0 = 0.93.

ing α and β can potentially turn out to be superior compared to IPCL. This will further

be investigated in a forecast exercise.

To have a better understanding of estimator properties, some sample distributions of

the four estimators are plotted in Figure 2. The striking observation is that the sample

distributions of IPCL* estimators are centred on or very close to the true parameter values.

This is not reflected in the averages of parameter estimates as the sample distributions

are (apparently) asymmetric. In general, simulation results suggest that using robust

priors to take care of initial value selection does indeed have a systematic effect on the

estimator. Despite the increase in the variance of β, “integrating out the initial value”

centres the distribution of the estimator very close to the true parameter values. This is

not observed for IPCL, where the location of the distribution improves compared to CL,

but not in the same fashion as IPCL*. In addition, IPCL systematically underestimates α and overestimates β, which is not observed for IPCL*, at least not in terms of the medians

of the sample distributions of α and β.

We note that our results for another run of simulations based on cross-sectionally

independent panels suggest that the real impact of cross-section dependence is on variance.

This is expected, due to the change in the rate of convergence. The effect on bias, on the

other hand, is generally not substantial, except for IPCL* - however, even then the impact

diminishes as T increases. See Tables 1 and 2 in the Supplementary Appendix.

Clearly, the integrated likelihood method stands out as a viable option for modelling

volatility in samples with as little as 150 time-series observations. Importantly, although

the robust priors were originally suggested for cross-sectionally independent panels, simu-

lations suggest that they work well in a possibly strongly dependent framework, as well.

7 Empirical Analysis

This section presents two empirical applications. First, we consider stock market volatility

forecasting. This is followed by an analysis of monthly hedge fund volatility.

7.1 Analysis of Predictive Ability

7.1.1 Dataset

We use daily returns data for a selection of stocks traded in the Dow Jones Industrial

Average between 1 February 2001 and 28 September 2009. The included stocks are Alcoa,

American Express, Bank of America, Coca Cola, Du Pont, Exxon Mobil, General Electric,

JP Morgan, IBM and Microsoft. The dataset has been downloaded from the Oxford-Man

Institute’s Realized Library (produced by Heber, Lunde, Shephard and Sheppard (2009))

and is the same as the data used by Noureldin, Shephard and Sheppard (2012).

7.1.2 The Test Procedure

We will compare the available methods on the basis of how well they can predict one-day

ahead volatility. These predictions are given by

σ̂²_it = λ̂_i(1 − α̂ − β̂) + α̂ ε²_{i,t−1} + β̂ σ̂²_{i,t−1}.

In addition to CL, IPCL and IPCL*, we also include forecasts based on the standard

method of estimating the GARCH parameters individually for each stock, using time-series

information only (QML). This automatically allows all parameters to be heterogeneous.
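Given estimates (λ̂_i, α̂, β̂) from any of these methods, the one-day-ahead forecast in the display above is produced by a simple recursion; a minimal sketch (names are illustrative):

```python
import numpy as np

def one_step_forecast(y_i, lam_hat, alpha_hat, beta_hat):
    """Run sigma2_t = lam*(1-a-b) + a*y_{t-1}^2 + b*sigma2_{t-1} through the estimation window;
    the final value is the out-of-sample one-day-ahead variance forecast."""
    sig2 = lam_hat                                     # initialise at the long-run variance
    for t in range(1, y_i.shape[0] + 1):
        sig2 = lam_hat * (1 - alpha_hat - beta_hat) + alpha_hat * y_i[t - 1] ** 2 + beta_hat * sig2
    return sig2
```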

Forecasts are constructed using a rolling window scheme, where the in-sample size is

fixed at 150. Specifically, the first forecast is calculated using estimates that are based on

observations t = 1 to t = 150. The second forecast is then calculated using estimates that

are based on observations t = 2 to t = 151, and so on. The dataset consists of 2,176 observations, implying a total of 2,026 forecasts for each of the ten stocks. As standard,


predictive ability is measured by the accuracy (or the lack thereof) of forecasts with respect

to the actual volatility. For this purpose, we use the QLIKE loss function:

L(σ²_{i,t+1}, σ̂²_{i,t+1}) = log σ̂²_{i,t+1} + σ²_{i,t+1} / σ̂²_{i,t+1}.

As volatility cannot be observed, even ex-post, a proxy has to be employed instead. We use

realised variance for this purpose, which is a highly accurate estimator of ex-post volatility

based on high-frequency intra-daily data (e.g. Barndorff-Nielsen and Shephard (2002)).14
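The QLIKE loss is straightforward to compute from a volatility proxy and a forecast; a minimal sketch, with realised variance as the proxy:

```python
import numpy as np

def qlike(proxy, forecast):
    """QLIKE loss: log(forecast) + proxy / forecast, evaluated element-wise."""
    proxy = np.asarray(proxy, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return np.log(forecast) + proxy / forecast
```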

We use the Giacomini and White (2006) test (GW test henceforth) to compare forecast performances of different methods (QML, CL, IPCL and IPCL*) that estimate the same model (GARCH). Let RV_it be the realised variance for stock i at time t and let ΔL_{i,t+1} = L(RV_{i,t+1}, σ̂²_{1,i,t+1}) − L(RV_{i,t+1}, σ̂²_{2,i,t+1}), where σ̂²_{1,i,t+1} and σ̂²_{2,i,t+1} are the one-step-ahead forecasts by two competing methods. The GW test is used to test the null hypothesis that the two methods have equal predictive ability, that is, E[ΔL_{i,t+1}] = 0 across the forecast horizon. See Giacomini and White (2006) for details.
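In its simplest unconditional, one-step-ahead form, a test of equal predictive ability reduces to a t-test on the mean loss differential with a heteroskedasticity- and autocorrelation-robust variance. The sketch below implements that simplified version with a Newey-West long-run variance; it is not a full implementation of the conditional GW framework.

```python
import numpy as np
from scipy import stats

def equal_predictive_ability(delta_L, lags=5):
    """t-test of H0: E[delta_L_t] = 0 using a Newey-West (Bartlett-kernel) long-run variance.

    delta_L holds the loss differentials L1_t - L2_t over the forecast sample; a positive
    statistic indicates a higher average loss for the first method."""
    d = np.asarray(delta_L, dtype=float)
    n = d.shape[0]
    d_bar = d.mean()
    u = d - d_bar
    lrv = u @ u / n                                    # lag-0 variance
    for k in range(1, lags + 1):
        gamma_k = u[k:] @ u[:-k] / n
        lrv += 2.0 * (1.0 - k / (lags + 1.0)) * gamma_k
    t_stat = d_bar / np.sqrt(lrv / n)
    p_value = 2.0 * (1.0 - stats.norm.cdf(abs(t_stat)))
    return t_stat, p_value
```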

7.1.3 Results

Table 4 presents the results of the GW test and the associated test statistics in parentheses. All tests are done individually for each of the ten stocks and at the 5% level of significance. For a given comparison, a dash means that the null of equal predictive ability cannot be rejected. A positive test statistic means that the first method under consideration (method (i)) leads to a higher loss than the second method (method (ii)). Similarly, a negative test statistic implies a higher loss by the second method. For

example, in the comparison between CL and IPCL, the test statistic is positive for Bank

of America indicating that for this stock CL leads to higher forecast loss. When the null

of equal predictive ability is rejected, the sign of the test-statistic indicates which method

leads to smaller loss or, equivalently, which method has better predictive ability.

The comparison between QML and CL indicates that there are gains from using the

panel rather than the time-series structure. However, the picture is not entirely sharp,

as the test remains inconclusive in the majority of cases. Still, the test statistics indicate

that QML almost always leads to higher loss compared to CL. The comparison between

bias-corrected panel estimation and standard time-series estimation, on the other hand,

leaves little room for doubt: IPCL and IPCL* clearly dominate QML. Another interesting

comparison is that of panel estimators with and without bias correction. The test results

indicate that IPCL* is more successful in that comparison: IPCL beats CL in five cases

14An important advantage of the dataset we use is that it includes the realised variances for each stock.

Calculation of realised variance is a non-trivial and time-consuming task. Also, further complications arise when a given stock is not liquidly traded. Therefore, although it would be desirable to base the analysis on a larger cross-section of assets, we restrict ourselves to the available dataset. For more information on the dataset see Noureldin, Shephard and Sheppard (2012). We choose not to use the more convenient proxy of squared returns as this is a potentially misleading measure (Patton (2011)).


                     (i) QML        (i) CL         (i) QML        (i) CL          (i) QML         (i) IPCL
                     vs             vs             vs             vs              vs              vs
Stock                (ii) CL        (ii) IPCL      (ii) IPCL      (ii) IPCL*      (ii) IPCL*      (ii) IPCL*
Alcoa                -    (1.123)   IPCL  (2.596)  IPCL  (2.198)  IPCL*  (2.709)  IPCL*  (2.229)  -      (0.728)
American Express     -    (1.878)   IPCL  (3.662)  IPCL  (3.888)  IPCL*  (2.981)  IPCL*  (3.390)  -      (−0.726)
Bank of America      -    (1.367)   IPCL  (2.807)  IPCL  (3.302)  IPCL*  (3.570)  IPCL*  (4.145)  -      (1.931)
Coca Cola            CL   (2.075)   -     (1.558)  IPCL  (2.145)  IPCL*  (2.755)  IPCL*  (3.006)  -      (1.536)
Du Pont              -    (1.215)   IPCL  (2.114)  IPCL  (2.389)  IPCL*  (2.350)  IPCL*  (2.636)  -      (1.226)
General Electric     CL   (1.995)   -     (1.300)  IPCL  (2.009)  IPCL*  (2.093)  IPCL*  (2.558)  -      (1.195)
IBM                  -    (−0.179)  IPCL  (3.321)  IPCL  (3.344)  IPCL*  (4.365)  IPCL*  (4.427)  IPCL*  (2.108)
JP Morgan            CL   (3.202)   -     (−0.911) -     (1.267)  -      (−0.318) -      (1.670)  -      (1.696)
Microsoft            CL   (2.806)   -     (−1.375) -     (0.920)  -      (−0.317) IPCL*  (2.036)  IPCL*  (2.786)
Exxon Mobil          -    (1.557)   -     (1.198)  -     (1.826)  IPCL*  (2.172)  IPCL*  (2.550)  IPCL*  (2.511)

Table 4: Giacomini-White test results at the 5% level of significance. The methods under consideration are quasi maximum likelihood (QML), composite likelihood (CL) and integrated composite likelihood. For the integrated composite likelihood method, estimation of α and β is based on two different methods for choosing the initial value of the volatility process. The first version (IPCL) uses the moment estimator of the long-run variance, T^{−1} ∑_{t=1}^T r²_it, as the initial value, whereas the second version (IPCL*) integrates the initial values out along with λ_i, as explained in Section 6.1. Each column contains results of the comparison of predictive ability between two chosen methods, (i) and (ii), across all stocks. The value of the test statistic for each comparison is stated in parentheses, beneath the result of the test. For a given comparison, a positive test statistic means that method (i) achieves a higher loss than method (ii). Similarly, a negative test statistic implies a higher loss by method (ii). A dash signifies that the test is inconclusive. Realised volatility is used as the volatility proxy.

while IPCL* achieves superior performance in eight cases against CL. Finally, in the

comparison between the two bias-corrected estimators, IPCL* is chosen over IPCL three

out of ten times, while the remaining comparisons are inconclusive. However, almost all

test-statistics are in favour of IPCL*. Hence, on the whole our analysis suggests that

although there are gains from moving to a panel estimation framework, one has to employ

bias correction in order to reap these benefits fully. Moreover, the test results indicate a

preference towards IPCL* over IPCL. This encourages future research into understanding

the mechanism behind robust priors that deal with the issue of initial value selection.


7.2 Hedge Fund Analysis

Hedge funds are alternative investment vehicles that are the subject of an active research area

in financial economics; for a selection of references see Lo (2010, Chapter 1.4). Hedge fund

returns are usually reported at monthly frequency, and datasets do not go back any earlier

than 1994. Consequently, GARCH-type volatility analysis has hitherto been impossible

for hedge funds.15 In this section, we use the bias-corrected panel GARCH estimator as a

solution to this problem and conduct a short novel analysis of hedge fund volatilities.

Even when one is not directly interested in hedge fund volatilities, an accurate volatility

estimator might still be called for. For example, the conditional mean of hedge fund

returns is usually modelled linearly as a factor model (e.g. Fung and Hsieh (2004)). There

the generalised least squares estimation method, although desirable, is not available as

shortness of data makes it impossible to explicitly model heteroskedastic error terms.

Moreover, in some applications volatility is also used as a proxy for risk exposure (e.g.

Agarwal, Daniel and Naik (2011)). Again, insufficient time-series variation forces the use

of more basic estimators, such as the cross-sectional sample standard errors of monthly

returns. In both examples, our method offers better modelling capabilities.

7.2.1 Data Description

The dataset consists of monthly returns for 27,396 funds between February 1994 and

April 2011, implying at most 207 monthly returns for any fund. The funds are classified

into ten vendor-reported investment strategies: Security Selection, Global Macro, Rela-

tive Value, Directional Trading, Fund of Funds, Multi-Process, Emerging Markets, Fixed

Income, Commodity Trading Advisors (CTA) and Other. This provides a convenient cri-

terion for grouping funds into separate panels. For details of the dataset, investment

strategies and mapping of the funds to investment strategies see Ramadorai (2013).

For a chosen panel length T , we only consider the funds that have been reporting in

the last T periods. Therefore, all fund panels are balanced. We will report parameter

estimates for panels with T = 150, 175, 195, but we will focus on T = 150 only, to achieve

maximum cross-section variation. The strategies Relative Value and Other are excluded

from the analysis as only a handful of funds are available for the considered panel lengths.

Further information on the data construction process is provided in the Supplementary

Appendix. In particular, data are treated in a preliminary step for standard issues in

hedge fund data, such as the incubation and backfill biases and performance smoothing.

7.2.2 Results

Parameter estimates and the number of included funds for the three sample sizes are

reported in Table 5. Estimates of α vary between .061 and .249, while β takes on values

15An earlier example that tries to overcome this issue is Huggler (2004) who constructs representative

proxies (that have GARCH errors) for hedge fund portfolios.


                          T = 150              T = 175              T = 195
Strategy                  #     α     β        #     α     β        #     α     β
Security selection        52   .202  .788      34   .174  .820      26   .179  .815
Macro                     25   .114  .884      17   .093  .907      15   .105  .893
Directional Traders       51   .208  .771      24   .153  .840      16   .161  .832
Fund of funds             78   .153  .847      41   .143  .857      25   .152  .836
Multi-process             28   .176  .824      19   .165  .835      15   .230  .770
Emerging                  19   .220  .772      11   .176  .794       7   .176  .801
Fixed income              13   .249  .751       8   .195  .805       5   .229  .768
CTA                       41   .090  .910      22   .061  .939      15   .072  .928

Table 5: Integrated composite likelihood parameter estimates for hedge fund data. Estimation is based on the following three sample periods: (i) November 1998 - April 2011 (150 time-series observations) given in columns 2-4, (ii) October 1996 - April 2011 (175 time-series observations) given in columns 5-7 and (iii) February 1995 - April 2011 (195 time-series observations) given in columns 8-10. The number of funds included in each case is given in column '#'.

between .751 and .939. All strategies exhibit high memory as α + β is generally close to 1, across all T (note that although the parameter estimates seem to add up to 1, α + β is restricted to be strictly less than 1, following the standard GARCH parameter restrictions). Furthermore, values of (α, β) tend to change as T varies. However, this

should not entirely be attributed to changes in the sample size. The composition of

the panel changes too, as funds with fewer than the necessary number of observations

are dropped from the sample. Some strategies such as Fixed Income, Emerging Markets

and Security Selection are typically more responsive to past shocks as their estimated α

parameters are larger. CTA, Macro and Fund of Funds, on the other hand, have the lowest

sensitivity to past shocks and higher responsiveness to past conditional variance.

We now focus more closely on the results for T = 150. Figure 3 presents the mini-

mum, 10th percentile, median, 90th percentile and maximum of the sample distribution of

fitted monthly volatility across funds at each point in time. The sample distribution plots

indicate volatility clustering. With the exception of Emerging Markets and Directional

Traders, median volatility is generally not higher than 5%. Moreover, across all strate-

gies, the sample distribution of volatility is right-skewed. Also, all volatility distributions

exhibit time-varying behaviour in their right tails. This behaviour is generally correlated

with major economic and political events of the 2000s, such as the burst of the dotcom bub-

ble (2000), the start of the Second Gulf War (2003) and the credit crunch (2007-2008).

Not surprisingly, economically or politically turbulent periods are marked by higher lev-

els of volatility. The time-varying behaviour is most discernible for the 90th and 100th

percentiles, although other percentiles exhibit some reaction, as well. Another interesting

implication of the percentile plots is that even within the same strategy funds exhibit

variation in terms of volatility. One reason could be that the “self-reported” strategies

provide too general a classification and funds following the same strategy simply do not

have similar volatility characteristics. Another possibility is that, despite following the


[Figure 3 panels (1999-2011): Security Selection (52), Macro (25), Directional Traders (51), Fund of Funds (78), Multi-Process (28), Emerging (19), Fixed Income (13), CTA (41); each panel shows the Max, 90%, Median, 10% and Min of fitted monthly volatility across funds.]

Figure 3: Plots of the 0, 10, 50 (median), 90 and 100 percentiles of the sample distribution of fittedmonthly volatilities across funds. Fitted volatilities are reported on the vertical axis. The numberof funds in each strategy is given in parentheses. Only those funds that have reported returnsbetween November 1998 and April 2011 are included (150 observations). Volatility estimation isbased on the version of the integrated composite likelihood method that integrates out the initialvalue (IPCL*).

same strategy, some funds are more liable to be volatile due to specific market conditions

or manager characteristics.
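The percentile fan in Figure 3 is computed directly from the panel of fitted volatilities; a minimal sketch, where fitted_vol is an N × T array of fitted monthly volatilities in per cent:

```python
import numpy as np

def volatility_fan(fitted_vol):
    """Min, 10th, 50th (median), 90th percentiles and max across funds at each month."""
    q = np.percentile(fitted_vol, [0, 10, 50, 90, 100], axis=0)
    return dict(zip(["min", "p10", "median", "p90", "max"], q))
```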


8 Conclusion

This paper has made a number of theoretical and empirical contributions. Firstly, we

have extended the analysis of the incidental parameter bias in nonlinear and dynamic

panels to various types of cross-section dependence. Theoretical analysis reveals that

cross-section dependence leads to a new type of first-order bias only if dependence is strong

enough to induce√T -consistency. Interestingly, at this rate of convergence the incidental

parameter bias becomes a pure time-series small-sample bias. The dependence types we

have considered are relevant for various strands of the literature. The analysis is conducted

with respect to a generic likelihood estimation problem and our results are potentially

relevant for a multitude of applications. Our results also contribute to the literatures

on the integrated and composite likelihood methods. Secondly, we have proposed a new

volatility estimation method which works with as little as 150 time-series observations.

This is well below the several hundreds of observations required by standard estimation

methods. The proposed method is shown to perform well against its alternatives in a

volatility forecasting exercise. Finally, we have conducted a novel volatility analysis

of monthly hedge fund returns. Estimation results suggest that the volatility behaviour of

funds exhibits variation, both within and between investment strategies. In addition,

volatility distributions generally exhibit large right tails, which tend to become larger

during episodes marked by important economic and political events.

References

Agarwal, V., N. D. Daniel, and N. Y. Naik (2011): “Do hedge funds manage their reported returns?,”Review of Financial Studies, 24, 3281–3320.

Arellano, M., and S. Bonhomme (2009): “Robust priors in nonlinear panel data models,” Economet-rica, 77, 489–536.

(2011): “Nonlinear panel data analysis,” Annual Review of Economics, 3, 395–424.

Arellano, M., and J. Hahn (2006): “A likelihood-based approximate solution to the incidental param-eter problem in dynamic nonlinear models with multiple effects,” working paper.

(2007): “Understanding bias in nonlinear panel models: some recent developments,” in Advancesin Economics and Econometrics: Theory and Applications, Ninth World Congress - Volume III, ed. byR. Blundell, W. Newey, and T. Persson, pp. 381–409. Cambridge University Press.

Bai, J. (2009): “Panel data models with interactive fixed effects,” Econometrica, 77, 1229–1279.

Bailey, N., G. Kapetanios, and M. H. Pesaran (2012): “Exponent of cross-sectional dependence:estimation and inference,” working paper.

Barndorff-Nielsen, O. E. (1983): “On a formula for the distribution of the maximum likelihood esti-mator,” Biometrika, 70, 343–65.

Barndorff-Nielsen, O. E., and D. R. Cox (1994): Inference and Asymptotics. Chapman & Hall,London.

Barndorff-Nielsen, O. E., and N. Shephard (2002): “Econometric analysis of realized volatility andits use in estimating stochastic volatility models,” Journal of the Royal Statistical Society, Series B, 64,253–280.

Bauwens, L., and J. V. K. Rombouts (2007): “Bayesian clustering of many GARCH models,” Econo-metric Reviews, 26, 365–386.

Bertrand, M., E. Duflo, and S. Mullainathan (2004): “How much should we trust differences-in-differences estimates,” Quarterly Journal of Economics, 119, 249–275.


Bester, C. A., T. G. Conley, and C. B. Hansen (2011): “Inference with dependent data using clustercovariance estimators,” Journal of Econometrics, 165, 137–151.

Bester, C. A., and C. B. Hansen (2009): “A penalty function approach to bias reduction in nonlinearpanel models with fixed effects,” Journal of Business and Economic Statistics, 27, 131–148.

Bollerslev, T. (1986): “Generalised autoregressive conditional heteroskedasticity,” Journal of Econo-metrics, 51, 307–327.

Bollerslev, T., and J. M. Wooldridge (1992): “Quasi maximum likelihood estimation and inferencein dynamic models with time varying covariances,” Econometric Reviews, 11, 143–172.

Bosq, D., F. Merlevede, and M. Peligrad (1999): “Asymptotic normality for density kernel estima-tors in discrete and continuous time,” Journal of Multivariate Analysis, 68, 78–95.

Cameron, A. C., and D. L. Miller (2011): “Robust inference with clustered data,” in Handbook ofEmpirical Economics and Finance, ed. by A. Ullah, and D. E. Giles, pp. 1–28. CRC Press.

Carrasco, M., and X. Chen (2002): “Mixing and moment properties of various GARCH and stochasticvolatility models,” Econometric Theory, 18, 17–39.

Carro, J. M. (2007): “Estimating dynamic panel data discrete choice models with fixed effects,” Journalof Econometrics, 140, 503–528.

Cermeno, R., and K. B. Grier (2006): “Conditional heteroskedasticity and cross-sectional dependencein panel data: an empirical study of inflation uncertainty in the G7 countries,” in Panel Data Econo-metrics, ed. by B. H. Baltagi, vol. 10, pp. 259–278. Elsevier.

Chandrasekhar, A., and R. Lewis (2011): “Econometrics of sampled networks,” Working paper.

Chudik, A., M. H. Pesaran, and E. Tosetti (2011): “Weak and strong cross-section dependence andestimation of large panels,” Econometrics Journal, 14, C45–C90.

Conley, T. G. (1999): “GMM estimation with cross sectional dependence,” Journal of Econometrics, 92,1–45.

Corradi, V., and E. M. Iglesias (2008): “Bootstrap refinements for QML estimators of theGARCH(1,1) parameters,” Journal of Econometrics, 144, 500–510.

Cox, D. R., and N. Reid (1987): “Parameter orthogonality and approximate conditional inference (withdiscussion),” Journal of the Royal Statistical Society, Series B, 49, 1–39.

(2004): “A note on pseudolikelihood constructed from marginal densities,” Biometrika, 91, 729–737.

Dhaene, G., and K. Jochmans (2011): “An adjusted profile likelihood for non-stationary panel datamodels with fixed effects,” working paper.

Doukhan, P. (1994): Mixing, Properties and Examples. Springer.

Engle, R. F. (1982): “Autoregressive conditional heteroskedasticity with estimates of the variance of theUnited Kingdom inflation,” Econometrica, 50, 987–1007.

(2009): “High dimensional dynamic correlations,” in The Methodology and Practice of Economet-rics: Papers in Honour of David F Hendry, ed. by J. L. Castle, and N. Shephard, pp. 122–148. OxfordUniversity Press.

Engle, R. F., and J. Mezrich (1996): “GARCH for groups,” Risk, 9, 36–40.

Engle, R. F., C. Pakel, N. Shephard, and K. K. Sheppard (2014): “Fitting vast dimensional time-varying covariance models,” Working paper.

Fernandez-Val, I. (2009): “Fixed effects estimation of structural parameters and marginal effects inpanel probit models,” Journal of Econometrics, 150, 71–85.

Fernandez-Val, I., and M. Weidner (2013): “Individual and time effects in nonlinear panel datamodels with large N,T,” working paper.

Francq, C., and J.-M. Zakoïan (2010): GARCH Models: Structure, Statistical Inference and Financial Applications. Wiley.

Fung, D. A., and W. Hsieh (2004): "Hedge fund benchmarks: a risk based approach," Financial Analysts Journal, 60, 65–80.

Giacomini, R., and H. White (2006): “Tests of conditional predictive ability,” Econometrica, 74, 1545–1578.


Goldsmith-Pinkham, P., and G. W. Imbens (2013): “Social networks and the identification of peereffects,” Journal of Business and Economic Statistics, 31, 253–264.

Goncalves, S. (2011): “The moving blocks bootstrap for panel linear regression models with individualfixed effects,” Econometric Theory, 27, 1048–1082.

Hahn, J., and G. Kuersteiner (2002): “Asymptotically unbiased inference for a dynamic panel modelwith fixed effects when both n and T are large,” Econometrica, 70, 1639–1657.

(2011): “Bias reduction for dynamic nonlinear panel models with fixed effects,” EconometricTheory, 27, 1152–1191.

Hahn, J., and W. K. Newey (2004): “Jackknife and analytical bias reduction for nonlinear panelmodels,” Econometrica, 72(4), 1295–1319.

Hansen, C. B. (2007): “Asymptotic properties of a robust variance matrix estimator for panel data whent is large,” Journal of Econometrics, 141, 597–620.

Heber, G., A. Lunde, N. Shephard, and K. K. Sheppard (2009): Oxford Man Institute’s RealizedLibrary. Oxford-Man Institute: University of Oxford, Version 0.1.

Hospido, L. (2012): “Modelling heterogeneity and dynamics in the volatility of individual wages,” Journalof Applied Econometrics, 27, 386–414.

Huggler, B. (2004): “Modelling hedge fund returns,” University of Zurich masters thesis.

Jenish, N., and I. R. Prucha (2009): “Central limit theorems and uniform laws of large numbers forarrays of random fields,” Journal of Econometrics, 150, 86–98.

Kelejian, H. H., and I. R. Prucha (2007): “HAC estimation in a spatial framework,” Journal ofEconometrics, 140, 131–154.

Lancaster, T. (2000): “The incidental parameter problem since 1948,” Journal of Econometrics, 95,391–413.

Lindsay, B. G. (1988): "Composite likelihood methods," in Statistical Inference from Stochastic Processes, ed. by N. U. Prabhu, pp. 221–239. American Mathematical Society, Providence, RI.

Lo, A. W. (2010): Hedge Funds: An Analytic Perspective. Princeton University Press.

McCullagh, P. (1987): Tensor Methods in Statistics. Chapman & Hall, London.

Neyman, J., and E. L. Scott (1948): “Consistent estimates based on partially consistent observations,”Econometrica, 16, 1–16.

Noureldin, D., N. Shephard, and K. K. Sheppard (2012): “Multivariate high-frequency-based volatil-ity (HEAVY) models,” Journal of Applied Econometrics, 27, 907–933.

Pace, L., and A. Salvan (1997): Principles of Statistical Inference from a Neo-Fisherian Perspective.World Scientific, Singapore.

Pakel, C., N. Shephard, and K. K. Sheppard (2011): “Nuisance parameters, composite likelihoodsand a panel of GARCH models,” Statistica Sinica, 21, 307–329.

Patton, A. J. (2011): “Volatility forecast comparison using imperfect volatility proxies,” Journal ofEconometrics, 160, 246–256.

Phillips, P. C. B., and D. Sul (2007): “Bias in dynamic panel estimation with fixed effects, incidentaltrends and cross section dependence,” Journal of Econometrics, 137, 162–188.

Ramadorai, T. (2013): “Capacity constraints, investor information, and hedge fund returns,” Journal ofFinancial Economics, 107, 401–416.

Sartori, N. (2003): “Modified profile likelihoods in models with stratum nuisance parameters,”Biometrika, 90, 533–549.

Severini, T. A. (1999): “On the relationship between Bayesian and non-Bayesian elimination of nuisanceparameters,” Statistica Sinica, 9, 713–724.

(2005): Elements of Distribution Theory. Cambridge University Press, New York.

(2007): “Integrated likelihood functions for non-Bayesian inference,” Biometrika, 94, 529–542.

Severini, T. A., and W. H. Wong (1992): “Profile likelihood and conditionally parametric models,”Annals of Statistics, 20, 1768–1802.

Tierney, L., R. E. Kass, and J. B. Kadane (1989): “Fully exponential laplace approximations toexpectations and variances of nonpositive functions,” Journal of the American Statistical Association,84, 710–716.


Varin, C., N. Reid, and D. Firth (2011): “An overview of composite likelihood methods,” StatisticaSinica, 21, 5–42.

White, H. (2001): Asymptotic Theory for Econometricians. Academic Press, Orlando, 2 edn.

Wooldridge, J. M. (2003): “Cluster-sample methods in applied econometrics,” American EconomicReview, 93, 133–188.

Woutersen, T. (2002): “Robustness against incidental parameters,” working paper.

Xu, X., and N. Reid (2011): “On the robustness of maximum composite likelihood estimate,” Journalof Statistical Planning and Inference, 141, 3047–3054.

A Mathematical Appendix

A.1 Existence and Moment Assumptions

Assumption A.1 For some δ > 0 (which is not necessarily the same across all of thefollowing expressions) we have, for all combinations of p1, p2, p3, p4, p5 ∈ {1, ..., P},

E[|ℓ_it(θ, λ)|^{2+δ}] < ∞,   E[|d³ℓ_it(θ, λ)/dλ³|^{2+δ}] < ∞,   E[|d⁴ℓ_it(θ, λ)/dλ⁴|^{2+δ}] < ∞,

E[|d⁴ℓ_it(θ, λ)/(dλ² dθ_{p1} dθ_{p2})|^{2+δ}] < ∞,   E[|d⁴ℓ_it(θ, λ)/(dλ dθ_{p1} dθ_{p2} dθ_{p3})|^{2+δ}] < ∞,

E[|d⁵ℓ_it(θ, λ)/(dλ dθ_{p1} dθ_{p2} dθ_{p3} dθ_{p4})|^{2+δ}] < ∞,   E[|d⁵ℓ_it(θ, λ)/(dλ² dθ_{p1} dθ_{p2} dθ_{p3})|^{2+δ}] < ∞,

E[|d⁵ℓ_it(θ, λ)/(dθ_{p1} dθ_{p2} dθ_{p3} dθ_{p4} dθ_{p5})|^{2+δ}] < ∞,   E[|d⁶ℓ_it(θ, λ)/(dλ² dθ_{p1} dθ_{p2} dθ_{p3} dθ_{p4})|^{2+δ}] < ∞,

E[|dℓ_it(θ, λ)/dθ_{p1}|^{3+δ}] < ∞,   E[|dℓ_it(θ, λ)/dλ|^{3+δ}] < ∞,

E[|d²ℓ_it(θ, λ)/dλ²|^{3+δ}] < ∞,   E[|d²ℓ_it(θ, λ)/(dλ dθ_{p1})|^{3+δ}] < ∞,   E[|d²ℓ_it(θ, λ)/(dθ_{p1} dθ_{p2})|^{3+δ}] < ∞,

E[|d³ℓ_it(θ, λ)/(dλ dθ_{p1} dθ_{p2})|^{3+δ}] < ∞,   E[|d³ℓ_it(θ, λ)/(dλ² dθ_{p1})|^{3+δ}] < ∞,   E[|d³ℓ_it(θ, λ)/(dθ_{p1} dθ_{p2} dθ_{p3})|^{3+δ}] < ∞,

uniformly for all i, t, θ and λ.

Assumption A.2 (i) For all (m, k) such that m ≤ 4 and |k| ≤ 4, or m = 0 and |k| = 5, there exist functions M^{(m,k)}_it(θ, λ), possibly dependent on x_it, such that

|Z^{(m,k)}_it(θ, λ)| ≤ M^{(m,k)}_it(θ, λ) < ∞,

for all i, t, λ and θ; (ii) for all (m, k) such that m ≤ 2 and |k| ≤ 4 there exist functions H^{(m,k)}_iT(θ, λ), possibly dependent on {x_i1, ..., x_iT}, such that

|d^{m+|k|} ln π_i(λ|θ) / (dλ^m dθ_1^{k_1} ··· dθ_P^{k_P})| ≤ H^{(m,k)}_iT(θ, λ) < ∞,   for all i, T, λ and θ.


A.2 Proof of Theorem 4.1

The first consistency result that max_{1≤i≤N} |λ_i(θ) − λ_iT(θ)| →p 0 follows from Theorem 4.1 of Engle et al. (2014), which applies to the general case of a vector nuisance parameter.

To prove the main consistency result, first we have to prove the uniform convergence result that for all η > 0,

P[ max_{1≤i≤N} sup_{θ∈Θ} |ℓ^I_iT(θ) − E[ℓ^I_iT(θ)]| > η ] = o(1).   (17)

To prove (17), notice that from Lemma B.2 of Engle et al. (2014), for all η > 0,

P[ max_{1≤i≤N} sup_{(θ,λ)∈Ψ} |ℓ_iT(θ, λ) − E[ℓ_iT(θ, λ)]| > η ] = o(1),

where Ψ = Θ × Λ. This, in turn, implies that for all η > 0,

P[ max_{1≤i≤N} sup_{θ∈Θ} |ℓ_iT(θ) − E[ℓ_iT(θ)]| > η ] = o(1).

Now, by Lemma A.7 introduced below, Assumptions 3.1(iv) and A.2 imply that

ℓ^I_iT(θ) = ℓ_iT(θ, λ_i(θ)) + o_p(1) as T → ∞.

Moreover, given that max_{1≤i≤N} |λ_i(θ) − λ_iT(θ)| →p 0, by standard arguments ℓ_iT(θ, λ_i(θ)) = ℓ_iT(θ) + o_p(1). Therefore, ℓ^I_iT(θ) = ℓ_iT(θ) + o_p(1) and E[ℓ^I_iT(θ)] = E[ℓ_iT(θ)] + o(1) as T → ∞, implying ℓ^I_iT(θ) − E[ℓ^I_iT(θ)] = ℓ_iT(θ) − E[ℓ_iT(θ)] + Z_iT(θ), where Z_iT(θ) = o_p(1) for all i and θ as T → ∞. Now,

max_{1≤i≤N} sup_{θ∈Θ} |ℓ^I_iT(θ) − E[ℓ^I_iT(θ)]| = max_{1≤i≤N} sup_{θ∈Θ} |ℓ_iT(θ) − E[ℓ_iT(θ)] + Z_iT(θ)|
    ≤ max_{1≤i≤N} sup_{θ∈Θ} |ℓ_iT(θ) − E[ℓ_iT(θ)]| + max_{1≤i≤N} sup_{θ∈Θ} |Z_iT(θ)|.

Hence,

P[ max_{1≤i≤N} sup_{θ∈Θ} |ℓ^I_iT(θ) − E[ℓ^I_iT(θ)]| > η ]
    ≤ P[ max_{1≤i≤N} sup_{θ∈Θ} |ℓ_iT(θ) − E[ℓ_iT(θ)]| + max_{1≤i≤N} sup_{θ∈Θ} |Z_iT(θ)| > η ]
    ≤ P[ max_{1≤i≤N} sup_{θ∈Θ} |ℓ_iT(θ) − E[ℓ_iT(θ)]| > η/2 ] + P[ max_{1≤i≤N} sup_{θ∈Θ} |Z_iT(θ)| > η/2 ] = o(1),

which proves (17).

The rest proceeds along the same lines as the proof of Theorem 3 of Hahn and Kuer-


steiner (2011) and Theorem 4.1 of Engle et al. (2014). By (17) we have for any ε > 0,

P

[max

1≤i≤Nsupθ∈Θ

∣∣∣`IiT (θ)− E[`IiT (θ)]∣∣∣ < ε/2

]= 1− o(1).

Then, with probability 1− o(1),

sup{θ:||θ−θ0||>η}

1

N

N∑i=1

`IiT (θ) ≤ 1

N

N∑i=1

sup{θ:||θ−θ0||>η}

`IiT (θ)

<1

N

N∑i=1

sup{θ:||θ−θ0||>η}

E[`IiT (θ)] +ε

2

<1

N

N∑i=1

E[`IiT (θ0)]− ε

2

<1

N

N∑i=1

`IiT (θ0)

≤ maxθ∈Θ

1

N

N∑i=1

`IiT (θ),

where we repeatedly use (17) and the unique identification condition of Assumption 3.2.This implies that with probability 1− o(1), ||θ − θ0|| < η for any η > 0.

A.3 Proof of Theorem 4.2

Remark A.1 The continuity condition of Assumption 3.1(ii) ensures that the derivatives of the likelihood function are all measurable, which in turn ensures that all likelihood derivatives considered here inherit the mixing properties of $x_{it}$. By the size, $2+\epsilon$ order moment existence and variance conditions of Assumptions 3.4 and A.1, $\alpha$-mixing CLTs apply to all the zero-mean likelihood derivatives that appear in the proofs of Lemmas A.6, A.7, A.8 and A.9. As such, these terms are all $O_p(T^{-1/2})$ as $T\to\infty$, uniformly in $i$. Similarly, the size and $1+\epsilon$ moment existence assumptions ensure the existence of $\alpha$-mixing LLNs. These results are used heavily in the proofs. See, for example, Corollary 3.48 and Theorem 5.20 in White (2001). Notice that, by Assumption A.2, all likelihood terms used in this paper are assumed to be bounded, which implies that their averages (and the averages of their expectations) will be $O(1)$ in any case. The orders of other terms which are not mixing, such as the prior function, are already determined by the bound conditions of Assumption A.2. Finally, the existence of $3+\epsilon$ order moments for several derivatives in Assumption A.1 is necessary to bound expectations of higher order terms.

A.3.1 Some Preliminary Lemmas

The mathematical proofs are based on asymptotic expansions involving second- and third-order moments of likelihood terms. The mixing assumption provides a convenient tool for determining the orders of such terms. The following results will be useful for this purpose. By Assumption 3.4, data are always assumed to be $\alpha$-mixing across time, independent of the particular type of cross-section dependence considered.


Lemma A.1 Let $x_{it}$, $t=1,\ldots,T$ be an $\alpha$-mixing random sequence such that
$$
\sum_{m=1}^{\infty}m\,\alpha_i(m)^{\delta/(3+\delta)}<\infty,
$$
uniformly for all $i=1,\ldots,N$, for some $\delta>0$. Let $f(\cdot)$, $g(\cdot)$ and $h(\cdot)$ be some measurable functions of $x_{it}$ where $E[f(\cdot)]=E[g(\cdot)]=E[h(\cdot)]=0$. In addition, assume that $E[|f(x_{it})|^{3+\delta}]<\infty$, $E[|g(x_{it})|^{3+\delta}]<\infty$ and $E[|h(x_{it})|^{3+\delta}]<\infty$ for all $i,t$. Then,
$$
E\left[\frac{1}{T^2}\sum_{s=1}^T\sum_{t=1}^T f(x_{is})g(x_{it})\right]=O\left(\frac{1}{T}\right),\tag{18}
$$
$$
E\left[\frac{1}{T^3}\sum_{s=1}^T\sum_{t=1}^T\sum_{q=1}^T f(x_{is})g(x_{it})h(x_{iq})\right]=O\left(\frac{1}{T^2}\right).\tag{19}
$$

Proof. See the Supplementary Appendix.
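As a quick numerical illustration of the $O(1/T)$ rate in (18) (not part of the formal argument), the following minimal sketch simulates a stationary AR(1) sequence, which is $\alpha$-mixing, and takes $f=g$ to be the identity; the autoregressive coefficient, sample sizes and replication count are arbitrary choices made only for the example.

```python
import numpy as np

# Sketch of Lemma A.1, eq. (18), with f = g = identity: for a stationary AR(1) process,
# E[(1/T^2) sum_s sum_t x_s x_t] = E[xbar^2] = O(1/T), so T * E[xbar^2] should stabilise
# (near the long-run variance, here 1/(1-phi)^2 = 4) as T grows.
rng = np.random.default_rng(0)
phi, n_rep = 0.5, 5_000

def ar1_paths(T, phi, n, rng):
    x = np.empty((n, T))
    x[:, 0] = rng.standard_normal(n) / np.sqrt(1 - phi**2)   # start in stationarity
    for t in range(1, T):
        x[:, t] = phi * x[:, t - 1] + rng.standard_normal(n)
    return x

for T in (50, 100, 200, 400):
    xbar = ar1_paths(T, phi, n_rep, rng).mean(axis=1)
    print(f"T = {T:4d}:  T * E[xbar^2] ~ {T * np.mean(xbar**2):.3f}")
```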

Lemma A.2 Let $x_{it}$, $i=1,\ldots,N$, $t=1,\ldots,T$ be an $\alpha$-mixing random field such that, for some $\delta>0$, $\sum_{m=1}^{\infty}m\,\alpha_i(m)^{\delta/(3+\delta)}<\infty$ uniformly for all $i$ and $\sum_{m=1}^{\infty}m^3\alpha_{1,2}(m)^{\delta/(3+\delta)}<\infty$. Let $f(\cdot)$, $g(\cdot)$ and $h(\cdot)$ be some measurable functions of $x_{it}$ where $E[f(\cdot)]=E[g(\cdot)]=E[h(\cdot)]=0$. In addition, let $E[|f(x_{it})|^{3+\delta}]<\infty$, $E[|g(x_{it})|^{3+\delta}]<\infty$ and $E[|h(x_{it})|^{3+\delta}]<\infty$ for all $i,t$. Then,
$$
E\left[\frac{1}{N^2T^2}\sum_{i=1}^N\sum_{j=1}^N\sum_{s=1}^T\sum_{t=1}^T f(x_{is})g(x_{jt})\right]=O\left(\frac{1}{NT}\right),\tag{20}
$$
$$
E\left[\frac{1}{N^2T^3}\sum_{i=1}^N\sum_{j=1}^N\sum_{s=1}^T\sum_{t=1}^T\sum_{q=1}^T f(x_{is})g(x_{jt})h(x_{jq})\right]=O\left(\frac{1}{NT^2}\right),\tag{21}
$$
$$
E\left[\frac{1}{N^3T^3}\sum_{i=1}^N\sum_{j=1}^N\sum_{k=1}^N\sum_{s=1}^T\sum_{t=1}^T\sum_{q=1}^T f(x_{is})g(x_{jt})h(x_{kq})\right]=O\left(\frac{1}{N^2T^2}\right).\tag{22}
$$

Proof. See the Supplementary Appendix.

Lemma A.3 Let, for any $i$, $x_{it}$, $t=1,\ldots,T$ be an $\alpha$-mixing random sequence such that $\lim_{m\to\infty}\alpha_{ij,k}(m)=0$, $\lim_{m\to\infty}\alpha_{i,jk}(m)=0$, and, for some $\delta>0$,
$$
\sum_{m=1}^{\infty}m\,\alpha_{ij,k}(m)^{\delta/(3+\delta)}<\infty
\qquad\text{and}\qquad
\sum_{m=1}^{\infty}m\,\alpha_{i,jk}(m)^{\delta/(3+\delta)}<\infty,
$$
uniformly for all $i,j,k=1,\ldots,N$. Let $f(\cdot)$, $g(\cdot)$ and $h(\cdot)$ be some measurable functions of $x_{it}$ where $E[f(\cdot)]=E[g(\cdot)]=E[h(\cdot)]=0$. In addition, assume that $E[|f(x_{it})|^{3+\delta}]<\infty$, $E[|g(x_{it})|^{3+\delta}]<\infty$ and $E[|h(x_{it})|^{3+\delta}]<\infty$ for all $i,t$. Then,
$$
E\left[\frac{1}{T^2}\sum_{s=1}^T\sum_{t=1}^T f(x_{is})g(x_{jt})\right]=O\left(\frac{1}{T}\right),\tag{23}
$$
$$
E\left[\frac{1}{T^3}\sum_{s=1}^T\sum_{t=1}^T\sum_{q=1}^T f(x_{is})g(x_{jt})h(x_{kq})\right]=O\left(\frac{1}{T^2}\right),\tag{24}
$$
$$
E\left[\frac{1}{N^2T^2}\sum_{i=1}^N\sum_{j=1}^N\sum_{s=1}^T\sum_{t=1}^T f(x_{is})g(x_{jt})\right]=O\left(\frac{1}{T}\right),\tag{25}
$$
$$
E\left[\frac{1}{N^3T^3}\sum_{i=1}^N\sum_{j=1}^N\sum_{k=1}^N\sum_{s=1}^T\sum_{t=1}^T\sum_{q=1}^T f(x_{is})g(x_{jt})h(x_{kq})\right]=O\left(\frac{1}{T^2}\right),\tag{26}
$$
where (23) and (24) hold for all $i,j,k=1,\ldots,N$.

Proof. See the Supplementary Appendix.

Lemma A.4 Let the assumptions of Lemma A.3 hold. In addition, assume that the data can be divided into $G_N$ groups of size $L_N$, where $G_N=O(N^{\alpha})$ and $\mathcal{G}_g$ is the index set for group $g$, $g=1,\ldots,G_N$. Finally, assume that for any $i\in\mathcal{G}_{g_1}$, $j\in\mathcal{G}_{g_2}$ and $k\in\mathcal{G}_{g_3}$, $f(x_{is})$, $g(x_{jt})$ and $h(x_{kq})$ are mutually independent unless $g_1=g_2=g_3$. Then,
$$
E\left[\frac{1}{N^2T^2}\sum_{i=1}^N\sum_{j=1}^N\sum_{s=1}^T\sum_{t=1}^T f(x_{is})g(x_{jt})\right]=O\left(\frac{1}{N^{\alpha}T}\right),\tag{27}
$$
$$
E\left[\frac{1}{N^3T^3}\sum_{i=1}^N\sum_{j=1}^N\sum_{k=1}^N\sum_{s=1}^T\sum_{t=1}^T\sum_{q=1}^T f(x_{is})g(x_{jt})h(x_{kq})\right]=O\left(\frac{1}{N^{2\alpha}T^2}\right).\tag{28}
$$

Proof. See the Supplementary Appendix.
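To fix intuition for the group structure in Lemma A.4, the following Monte Carlo sketch (an illustration only; the extreme within-group dependence, the group sizes and the replication count are assumptions made for the example) checks that the double average in (27) has second moment of order $1/(G_NT)=1/(N^{\alpha}T)$ rather than $1/(NT)$.

```python
import numpy as np

# Sketch of Lemma A.4, eq. (27): data are split into G independent groups and, as an extreme
# illustrative choice, all L units within a group share one group-level shock; f = g = identity.
# Then E[(1/(NT) sum_i sum_t x_it)^2] should be O(1/(G*T)), so G*T*E[xbar^2] stays roughly flat.
rng = np.random.default_rng(2)
T, L, n_rep = 50, 10, 2_000

for G in (5, 20, 80):
    N = G * L
    vals = np.empty(n_rep)
    for r in range(n_rep):
        w = rng.standard_normal((G, T))     # independent shocks for each group and period
        x = np.repeat(w, L, axis=0)         # every unit inherits its group's shock
        vals[r] = x.mean() ** 2
    print(f"G = {G:3d}, N = {N:4d}:  G*T*E[xbar^2] ~ {G * T * vals.mean():.3f}")
```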

Lemma A.5 Let Assumptions 3.1(i)-(iii), 3.3(ii), 3.4(i), A.1 and A.2 hold. Let, furthermore, $\tilde\lambda_{iT}(\theta)$ be the mean value between $\hat\lambda_i(\theta)$ and $\lambda_{iT}(\theta)$, where $\max_{1\le i\le N}|\hat\lambda_i(\theta)-\lambda_{iT}(\theta)|\xrightarrow{p}0$. Then,
$$
\ell^{\lambda\lambda\lambda}_{iT}(\theta,\tilde\lambda_{iT}(\theta))=O_p(1)
\qquad\text{and}\qquad
\ell^{\lambda\lambda\lambda\lambda}_{iT}(\theta,\tilde\lambda_{iT}(\theta))=O_p(1).
$$
Note that the mean value is not necessarily the same for both of the terms considered above.

Proof. By standard arguments (see, for example, Lemma B.2 of Engle, Pakel, Shephard and Sheppard (2014)), under the given assumptions one can obtain uniform laws of large numbers for $\ell^{\lambda\lambda\lambda}_{iT}(\theta,\lambda)$ and $\ell^{\lambda\lambda\lambda\lambda}_{iT}(\theta,\lambda)$. Then, since $\tilde\lambda_{iT}(\theta)$ is the mean value between $\hat\lambda_i(\theta)$ and $\lambda_{iT}(\theta)$, pointwise convergence of $\hat\lambda_i(\theta)-\lambda_{iT}(\theta)$ implies that $\tilde\lambda_{iT}(\theta)-\lambda_{iT}(\theta)\xrightarrow{p}0$. Combining these results, one obtains $\ell^{\lambda\lambda\lambda}_{iT}(\theta,\tilde\lambda_{iT}(\theta))=E[\ell^{\lambda\lambda\lambda}_{iT}(\theta,\lambda_{iT}(\theta))]+o_p(1)$ and $\ell^{\lambda\lambda\lambda\lambda}_{iT}(\theta,\tilde\lambda_{iT}(\theta))=E[\ell^{\lambda\lambda\lambda\lambda}_{iT}(\theta,\lambda_{iT}(\theta))]+o_p(1)$, proving that both terms are $O_p(1)$ as $T\to\infty$.


Lemma A.6 Suppose $\max_{1\le i\le N}|\hat\lambda_i(\theta)-\lambda_{iT}(\theta)|\xrightarrow{p}0$. Under Assumptions 3.4, A.1 and A.2,
$$
\hat\lambda_i(\theta)-\lambda_{iT}(\theta)
=\frac{1}{E[\ell^{\lambda\lambda}_{iT}(\theta)]}
\left[-\ell^{\lambda}_{iT}(\theta)
+\frac{V^{\lambda\lambda}_{iT}(\theta)\ell^{\lambda}_{iT}(\theta)}{E[\ell^{\lambda\lambda}_{iT}(\theta)]}
-\frac{1}{2}\frac{[\ell^{\lambda}_{iT}(\theta)]^2E[\ell^{\lambda\lambda\lambda}_{iT}(\theta)]}{\{E[\ell^{\lambda\lambda}_{iT}(\theta)]\}^2}\right]
+O_p(T^{-3/2}),\tag{29}
$$
$$
[\hat\lambda_i(\theta)-\lambda_{iT}(\theta)]^2
=\frac{1}{\{E[\ell^{\lambda\lambda}_{iT}(\theta)]\}^2}
\left[[\ell^{\lambda}_{iT}(\theta)]^2
-2\frac{V^{\lambda\lambda}_{iT}(\theta)[\ell^{\lambda}_{iT}(\theta)]^2}{E[\ell^{\lambda\lambda}_{iT}(\theta)]}
+\frac{[\ell^{\lambda}_{iT}(\theta)]^3E[\ell^{\lambda\lambda\lambda}_{iT}(\theta)]}{\{E[\ell^{\lambda\lambda}_{iT}(\theta)]\}^2}\right]
+O_p(T^{-2}),\tag{30}
$$
$$
[\hat\lambda_i(\theta)-\lambda_{iT}(\theta)]^3
=-\frac{[\ell^{\lambda}_{iT}(\theta)]^3}{\{E[\ell^{\lambda\lambda}_{iT}(\theta)]\}^3}
+O_p(T^{-2}).\tag{31}
$$

Proof. Expanding $\ell^{\lambda}_{iT}(\theta,\hat\lambda_i(\theta))$ around $\hat\lambda_i(\theta)=\lambda_{iT}(\theta)$ yields
$$
\begin{aligned}
\ell^{\lambda}_{iT}(\theta,\hat\lambda_i(\theta))
={}&\ell^{\lambda}_{iT}(\theta)
+V^{\lambda\lambda}_{iT}(\theta)[\hat\lambda_i(\theta)-\lambda_{iT}(\theta)]
+E[\ell^{\lambda\lambda}_{iT}(\theta)][\hat\lambda_i(\theta)-\lambda_{iT}(\theta)]\\
&+\frac{1}{2}\ell^{\lambda\lambda\lambda}_{iT}(\theta)[\hat\lambda_i(\theta)-\lambda_{iT}(\theta)]^2
+\frac{1}{6}\ell^{\lambda\lambda\lambda\lambda}_{iT}(\theta,\tilde\lambda_i(\theta))[\hat\lambda_i(\theta)-\lambda_{iT}(\theta)]^3,
\end{aligned}
$$
where $\tilde\lambda_i(\theta)\in[\min(\hat\lambda_i(\theta),\lambda_{iT}(\theta)),\max(\hat\lambda_i(\theta),\lambda_{iT}(\theta))]$. By Lemma A.5, $\ell^{\lambda\lambda\lambda\lambda}_{iT}(\theta,\tilde\lambda_i(\theta))$ is $O_p(1)$. The left-hand side is zero, since $\hat\lambda_i(\theta)$ maximises $\ell_{iT}(\theta,\lambda)$ over $\lambda$. Then, by rearranging and recursively substituting for $\hat\lambda_i(\theta)-\lambda_{iT}(\theta)$,
$$
\hat\lambda_i(\theta)-\lambda_{iT}(\theta)
=\frac{1}{E[\ell^{\lambda\lambda}_{iT}(\theta)]}
\left(-\ell^{\lambda}_{iT}(\theta)
+\frac{V^{\lambda\lambda}_{iT}(\theta)\ell^{\lambda}_{iT}(\theta)}{E[\ell^{\lambda\lambda}_{iT}(\theta)]}
-\frac{1}{2}E[\ell^{\lambda\lambda\lambda}_{iT}(\theta)][\hat\lambda_i(\theta)-\lambda_{iT}(\theta)]^2\right)
+O_p\left(\frac{1}{T^{3/2}}\right).\tag{32}
$$
Similarly, taking the square of (32) and recursively substituting for $[\hat\lambda_i(\theta)-\lambda_{iT}(\theta)]^2$ up to order $O_p(T^{-2})$ yields
$$
\begin{aligned}
[\hat\lambda_i(\theta)-\lambda_{iT}(\theta)]^2
&=\frac{1}{\{E[\ell^{\lambda\lambda}_{iT}(\theta)]\}^2}
\left[[\ell^{\lambda}_{iT}(\theta)]^2
-2\frac{[\ell^{\lambda}_{iT}(\theta)]^2V^{\lambda\lambda}_{iT}(\theta)}{E[\ell^{\lambda\lambda}_{iT}(\theta)]}
+\ell^{\lambda}_{iT}(\theta)E[\ell^{\lambda\lambda\lambda}_{iT}(\theta)][\hat\lambda_i(\theta)-\lambda_{iT}(\theta)]^2\right]
+O_p\left(\frac{1}{T^{2}}\right)\\
&=\frac{1}{\{E[\ell^{\lambda\lambda}_{iT}(\theta)]\}^2}
\left[[\ell^{\lambda}_{iT}(\theta)]^2
-2\frac{[\ell^{\lambda}_{iT}(\theta)]^2V^{\lambda\lambda}_{iT}(\theta)}{E[\ell^{\lambda\lambda}_{iT}(\theta)]}
+\frac{[\ell^{\lambda}_{iT}(\theta)]^3E[\ell^{\lambda\lambda\lambda}_{iT}(\theta)]}{\{E[\ell^{\lambda\lambda}_{iT}(\theta)]\}^2}\right]
+O_p\left(\frac{1}{T^{2}}\right),
\end{aligned}
$$
which proves (30). Next, substituting (30) for $[\hat\lambda_i(\theta)-\lambda_{iT}(\theta)]^2$ in (32) gives (29). Finally, taking the product of (29) and (30) up to an $O_p(T^{-2})$ remainder yields (31).
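The recursive-substitution step leading to (29) can be mimicked symbolically on a truncated version of the first-order condition. The sketch below is an illustration only: it drops the fourth-derivative remainder and the stochastic part of the third derivative, and all symbols are hypothetical stand-ins for $\ell^{\lambda}_{iT}$, $V^{\lambda\lambda}_{iT}$, $E[\ell^{\lambda\lambda}_{iT}]$ and $E[\ell^{\lambda\lambda\lambda}_{iT}]$.

```python
import sympy as sp

# Treat U = l^lam and V = V^{lamlam} as O(eps) quantities (eps ~ T^{-1/2}) and E, F as O(1).
# Solve the truncated first-order condition 0 = U + (V + E) d + (1/2) F d^2 for
# d = hat(lam) - lam_iT as a power series in eps, and compare with the expansion in (29).
eps, u, v, E, F = sp.symbols("eps u v E F")
a1, a2 = sp.symbols("a1 a2")

U, V = eps * u, eps * v
d = a1 * eps + a2 * eps**2                      # unknown expansion coefficients
foc = U + (V + E) * d + sp.Rational(1, 2) * F * d**2

poly = sp.Poly(sp.expand(foc), eps)
sol = sp.solve([poly.coeff_monomial(eps), poly.coeff_monomial(eps**2)], [a1, a2], dict=True)[0]

# Claimed expansion from (29): d = (1/E) * [-U + V*U/E - (1/2)*U^2*F/E^2] + higher order
claimed = sp.expand((-U + V * U / E - sp.Rational(1, 2) * U**2 * F / E**2) / E)
ours = sp.expand(sol[a1] * eps + sol[a2] * eps**2)
print(sp.simplify(ours - claimed))              # prints 0: the two expansions agree
```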

A.3.2 The Proof

The theorem will be proved by using a series of results.


Lemma A.7 Under Assumptions 3.1(iv) and A.2,
$$
\ell^I_{iT}(\theta)-\ell_{iT}(\theta,\hat\lambda_i(\theta))
=\frac{1}{2T}\ln\left(\frac{2\pi}{T}\right)
-\frac{1}{2T}\ln[-\ell^{\lambda\lambda}_{iT}(\theta,\hat\lambda_i(\theta))]
+\frac{1}{T}\ln\pi_i(\hat\lambda_i(\theta)|\theta)
+O_p\left(\frac{1}{T^2}\right).\tag{33}
$$

Proof. By Assumption 3.1(iv) and the smoothness conditions of Assumption A.2 on the likelihood and prior functions, a Laplace approximation yields
$$
\begin{aligned}
\ell^I_{iT}(\theta)
&=\frac{1}{T}\ln\left\{\frac{\sqrt{2\pi/T}\,\exp[T\ell_{iT}(\theta,\hat\lambda_i(\theta))]}{\sqrt{-\ell^{\lambda\lambda}_{iT}(\theta,\hat\lambda_i(\theta))}}
\left[\pi_i(\hat\lambda_i(\theta)|\theta)+O_p\left(\frac{1}{T}\right)\right]\right\}\\
&=\frac{1}{2T}\ln\frac{2\pi}{T}+\ell_{iT}(\theta,\hat\lambda_i(\theta))
-\frac{1}{2T}\ln[-\ell^{\lambda\lambda}_{iT}(\theta,\hat\lambda_i(\theta))]
+\frac{1}{T}\ln\pi_i(\hat\lambda_i(\theta)|\theta)
+O_p\left(\frac{1}{T^2}\right).
\end{aligned}
$$
The second equality is obtained by expanding $\ln(x+a)$ around $x$ (see the proof of Lemma 1 of Arellano and Bonhomme (2009) and also Tierney, Kass and Kadane (1989) and Theorem 9.14 of Severini (2005)). Subtracting $\ell_{iT}(\theta,\hat\lambda_i(\theta))$ from this expression yields (33).
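The Laplace approximation step above can be checked numerically on a toy example. In the sketch below, the "average log-likelihood" $\ell(\lambda)=-\cosh(\lambda-1)$ and the standard normal "prior" are illustrative assumptions, not objects from the paper; the gap between a quadrature evaluation of $\ell^I$ and the approximation in (33) should shrink at roughly the $O(T^{-2})$ rate.

```python
import numpy as np

# Toy check of the Laplace approximation behind (33): compare
#   (1/T) ln int exp(T*l(lam)) pi(lam) dlam   (quadrature)
# with l(lam_hat) + ln(2*pi/T)/(2T) - ln(-l''(lam_hat))/(2T) + ln pi(lam_hat)/T.
def l(lam):            # toy average log-likelihood, maximised at lam = 1
    return -np.cosh(lam - 1.0)

def l_dd(lam):         # its second derivative in lambda
    return -np.cosh(lam - 1.0)

def log_pi(lam):       # log standard normal "prior"
    return -0.5 * lam**2 - 0.5 * np.log(2 * np.pi)

lam_grid = np.linspace(-6.0, 8.0, 200_001)
dlam = lam_grid[1] - lam_grid[0]
lam_hat = 1.0          # argmax of l, known in closed form here

for T in (50, 200, 800):
    # subtract T*l(lam_hat) inside the exponent for numerical stability
    integrand = np.exp(T * (l(lam_grid) - l(lam_hat)) + log_pi(lam_grid))
    exact = (np.log(integrand.sum() * dlam) + T * l(lam_hat)) / T
    laplace = (l(lam_hat) + np.log(2 * np.pi / T) / (2 * T)
               - np.log(-l_dd(lam_hat)) / (2 * T) + log_pi(lam_hat) / T)
    print(f"T = {T:4d}:  exact - Laplace = {exact - laplace:.2e}")
```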

Lemma A.8 Suppose $\max_{1\le i\le N}|\hat\lambda_i(\theta)-\lambda_{iT}(\theta)|\xrightarrow{p}0$. Let Assumptions 3.1(ii), 3.4, A.1 and A.2 hold. Then, for any $i$, as $T\to\infty$,
$$
E[\ln\{-\ell^{\lambda\lambda}_{iT}(\theta,\hat\lambda_i(\theta))\}]
=\ln\{-E[\ell^{\lambda\lambda}_{iT}(\theta)]\}+O\left(\frac{1}{T}\right),
$$
$$
E[\ln\pi_i(\hat\lambda_i(\theta)|\theta)]
=\ln\pi_i(\lambda_{iT}(\theta)|\theta)+O\left(\frac{1}{T}\right).
$$

Proof. By Lemma A.6, we have
$$
\hat\lambda_i(\theta)-\lambda_{iT}(\theta)
=-\frac{\ell^{\lambda}_{iT}(\theta)}{E[\ell^{\lambda\lambda}_{iT}(\theta)]}+O_p\left(\frac{1}{T}\right).\tag{34}
$$
Expanding $\ell^{\lambda\lambda}_{iT}(\theta,\hat\lambda_i(\theta))$ around $\hat\lambda_i(\theta)=\lambda_{iT}(\theta)$ and some rearrangement yields
$$
\frac{\ell^{\lambda\lambda}_{iT}(\theta,\hat\lambda_i(\theta))}{E[\ell^{\lambda\lambda}_{iT}(\theta)]}
=1+\frac{V^{\lambda\lambda}_{iT}(\theta)
+\ell^{\lambda\lambda\lambda}_{iT}(\theta)(\hat\lambda_i(\theta)-\lambda_{iT}(\theta))
+\frac{1}{2}\ell^{\lambda\lambda\lambda\lambda}_{iT}(\theta,\tilde\lambda_i(\theta))(\hat\lambda_i(\theta)-\lambda_{iT}(\theta))^2}{E[\ell^{\lambda\lambda}_{iT}(\theta)]},
$$
where, as before, $\ell^{\lambda\lambda\lambda\lambda}_{iT}(\theta,\tilde\lambda_i(\theta))=O_p(1)$ by Lemma A.5. Let $A_{iT}(\theta)$ denote the second term on the right-hand side of the above equation. Notice that $A_{iT}(\theta)=O_p(T^{-1/2})$. Now, by a mean value expansion of $\ln[1+A_{iT}(\theta)]$ about $A_{iT}(\theta)=0$,
$$
\ln[1+A_{iT}(\theta)]=\ln(1)+A_{iT}(\theta)-\frac{1}{2}\frac{[A_{iT}(\theta)]^2}{(1+\bar A)^2},
$$
where $\bar A$ is the mean value between $0$ and $A_{iT}(\theta)$. Hence, $\bar A=o_p(1)$ and $\ln[1+A_{iT}(\theta)]=A_{iT}(\theta)+O_p(T^{-1})$. This yields
$$
\begin{aligned}
\ln\left\{\frac{\ell^{\lambda\lambda}_{iT}(\theta,\hat\lambda_i(\theta))}{E[\ell^{\lambda\lambda}_{iT}(\theta)]}\right\}
&=\frac{V^{\lambda\lambda}_{iT}(\theta)+\ell^{\lambda\lambda\lambda}_{iT}(\theta)(\hat\lambda_i(\theta)-\lambda_{iT}(\theta))}{E[\ell^{\lambda\lambda}_{iT}(\theta)]}+O_p\left(\frac{1}{T}\right)\\
&=\frac{V^{\lambda\lambda}_{iT}(\theta)-\ell^{\lambda}_{iT}(\theta)E[\ell^{\lambda\lambda\lambda}_{iT}(\theta)]E[\ell^{\lambda\lambda}_{iT}(\theta)]^{-1}}{E[\ell^{\lambda\lambda}_{iT}(\theta)]}+O_p\left(\frac{1}{T}\right),
\end{aligned}
$$
by using (34). Let $\mathcal{A}_{iT}(\theta)=\sqrt{T}\{V^{\lambda\lambda}_{iT}(\theta)-\ell^{\lambda}_{iT}(\theta)E[\ell^{\lambda\lambda\lambda}_{iT}(\theta)]E[\ell^{\lambda\lambda}_{iT}(\theta)]^{-1}\}$ and notice that $\mathcal{A}_{iT}(\theta)$ is a zero-mean $O_p(1)$ term. Then,
$$
\ln\{-\ell^{\lambda\lambda}_{iT}(\theta,\hat\lambda_i(\theta))\}
=\ln\{-E[\ell^{\lambda\lambda}_{iT}(\theta)]\}
+\frac{\mathcal{A}_{iT}(\theta)}{\sqrt{T}\,E[\ell^{\lambda\lambda}_{iT}(\theta)]}
+O_p\left(\frac{1}{T}\right),\tag{35}
$$
implying $E[\ln\{-\ell^{\lambda\lambda}_{iT}(\theta,\hat\lambda_i(\theta))\}]=\ln\{-E[\ell^{\lambda\lambda}_{iT}(\theta)]\}+O(T^{-1})$. Also, expanding $\ln\pi_i(\hat\lambda_i(\theta)|\theta)$ around $\hat\lambda_i(\theta)=\lambda_{iT}(\theta)$ yields
$$
\ln\pi_i(\hat\lambda_i(\theta)|\theta)
=\ln\pi_i(\lambda_{iT}(\theta)|\theta)
+\frac{\partial\ln\pi_i(\lambda_{iT}(\theta)|\theta)}{\partial\lambda_i}(\hat\lambda_i(\theta)-\lambda_{iT}(\theta))
+\frac{1}{2}\left.\frac{\partial^2\ln\pi_i(\lambda|\theta)}{\partial\lambda_i^2}\right|_{\lambda=\tilde\lambda_i(\theta)}(\hat\lambda_i(\theta)-\lambda_{iT}(\theta))^2,\tag{36}
$$
where, again, $\tilde\lambda_i(\theta)$ is the mean value between $\hat\lambda_i(\theta)$ and $\lambda_{iT}(\theta)$. Then $\partial^2\ln\pi_i(\tilde\lambda_i(\theta)|\theta)/\partial\lambda_i^2=O(1)$ by Assumption A.2(ii) and the remainder term is $O_p(T^{-1})$. Hence,
$$
\begin{aligned}
E[\ln\pi_i(\hat\lambda_i(\theta)|\theta)]
&=E[\ln\pi_i(\lambda_{iT}(\theta)|\theta)]
+\frac{\partial\ln\pi_i(\lambda_{iT}(\theta)|\theta)}{\partial\lambda_i}E[\hat\lambda_i(\theta)-\lambda_{iT}(\theta)]+O\left(\frac{1}{T}\right)\\
&=\ln\pi_i(\lambda_{iT}(\theta)|\theta)+O\left(\frac{1}{T}\right),
\end{aligned}
$$
as desired.

Lemma A.9 Suppose $\max_{1\le i\le N}|\hat\lambda_i(\theta)-\lambda_{iT}(\theta)|\xrightarrow{p}0$. Let Assumptions 3.1(ii), 3.4, A.1 and A.2 hold. Then, for any $i$, as $T\to\infty$,
$$
\ell_{iT}(\theta,\hat\lambda_i(\theta))-\ell_{iT}(\theta)
=-\frac{1}{2}\frac{[\ell^{\lambda}_{iT}(\theta)]^2}{E[\ell^{\lambda\lambda}_{iT}(\theta)]}
+\frac{1}{2}\frac{V^{\lambda\lambda}_{iT}(\theta)[\ell^{\lambda}_{iT}(\theta)]^2}{\{E[\ell^{\lambda\lambda}_{iT}(\theta)]\}^2}
-\frac{1}{6}\frac{[\ell^{\lambda}_{iT}(\theta)]^3E[\ell^{\lambda\lambda\lambda}_{iT}(\theta)]}{\{E[\ell^{\lambda\lambda}_{iT}(\theta)]\}^3}
+O_p\left(\frac{1}{T^2}\right).\tag{37}
$$

Proof. Expanding $\ell_{iT}(\theta,\hat\lambda_i(\theta))$ around $\hat\lambda_i(\theta)=\lambda_{iT}(\theta)$ gives
$$
\begin{aligned}
\ell_{iT}(\theta,\hat\lambda_i(\theta))-\ell_{iT}(\theta)
={}&\ell^{\lambda}_{iT}(\theta)(\hat\lambda_i(\theta)-\lambda_{iT}(\theta))
+\frac{1}{2}\ell^{\lambda\lambda}_{iT}(\theta)(\hat\lambda_i(\theta)-\lambda_{iT}(\theta))^2
+\frac{1}{6}\ell^{\lambda\lambda\lambda}_{iT}(\theta)(\hat\lambda_i(\theta)-\lambda_{iT}(\theta))^3\\
&+\frac{1}{24}\ell^{\lambda\lambda\lambda\lambda}_{iT}(\theta,\tilde\lambda_i(\theta))(\hat\lambda_i(\theta)-\lambda_{iT}(\theta))^4,
\end{aligned}\tag{38}
$$
where the remainder is $O_p(T^{-2})$ by Lemma A.5. Then, using Lemma A.6,
$$
\ell^{\lambda}_{iT}(\theta)(\hat\lambda_i(\theta)-\lambda_{iT}(\theta))
=\underbrace{-\frac{[\ell^{\lambda}_{iT}(\theta)]^2}{E[\ell^{\lambda\lambda}_{iT}(\theta)]}}_{O_p(T^{-1})}
+\underbrace{\frac{V^{\lambda\lambda}_{iT}(\theta)[\ell^{\lambda}_{iT}(\theta)]^2}{\{E[\ell^{\lambda\lambda}_{iT}(\theta)]\}^2}}_{O_p(T^{-3/2})}
-\underbrace{\frac{1}{2}\frac{E[\ell^{\lambda\lambda\lambda}_{iT}(\theta)][\ell^{\lambda}_{iT}(\theta)]^3}{\{E[\ell^{\lambda\lambda}_{iT}(\theta)]\}^3}}_{O_p(T^{-3/2})}
+O_p\left(\frac{1}{T^2}\right),
$$
$$
\ell^{\lambda\lambda}_{iT}(\theta)(\hat\lambda_i(\theta)-\lambda_{iT}(\theta))^2
=\underbrace{\frac{[\ell^{\lambda}_{iT}(\theta)]^2}{E[\ell^{\lambda\lambda}_{iT}(\theta)]}}_{O_p(T^{-1})}
-\underbrace{\frac{[\ell^{\lambda}_{iT}(\theta)]^2V^{\lambda\lambda}_{iT}(\theta)}{\{E[\ell^{\lambda\lambda}_{iT}(\theta)]\}^2}}_{O_p(T^{-3/2})}
+\underbrace{\frac{[\ell^{\lambda}_{iT}(\theta)]^3E[\ell^{\lambda\lambda\lambda}_{iT}(\theta)]}{\{E[\ell^{\lambda\lambda}_{iT}(\theta)]\}^3}}_{O_p(T^{-3/2})}
+O_p\left(\frac{1}{T^2}\right),
$$
$$
\ell^{\lambda\lambda\lambda}_{iT}(\theta)(\hat\lambda_i(\theta)-\lambda_{iT}(\theta))^3
=\underbrace{-\frac{[\ell^{\lambda}_{iT}(\theta)]^3E[\ell^{\lambda\lambda\lambda}_{iT}(\theta)]}{\{E[\ell^{\lambda\lambda}_{iT}(\theta)]\}^3}}_{O_p(T^{-3/2})}
+O_p\left(\frac{1}{T^2}\right),
$$
and substituting these expressions in (38) yields the desired result.

Finally, the proof of Theorem 4.2 follows.

Proof of Theorem 4.2. Taking the sum of (33) and (37) yields (5). The expected value of $B^{(1)}_{iT}$ follows from Lemma A.8. By Assumption A.2 and standard arguments, $\ln[-\ell^{\lambda\lambda}_{iT}(\theta,\hat\lambda_i(\theta))]$, $\ln\{-E[\ell^{\lambda\lambda}_{iT}(\theta)]\}$, $\ln\pi_i(\hat\lambda_i(\theta)|\theta)$ and $\ln\pi_i(\lambda_{iT}(\theta)|\theta)$ are all $O(1)$. By Lemma A.1, $E[(\ell^{\lambda}_{iT}(\theta))^2]$ is $O(T^{-1})$, while $E[(\ell^{\lambda}_{iT}(\theta))^2V^{\lambda\lambda}_{iT}(\theta)]$ and $E[(\ell^{\lambda}_{iT}(\theta))^3]$ are both $O(T^{-2})$. Hence, the orders of $E[B^{(1)}_{iT}(\theta)]$ and $E[B^{(2)}_{iT}(\theta)]$ are confirmed. The final expression for the bias in (6) follows directly.

A.4 Proof of Theorem 5.1

The proof is based on higher order asymptotic expansions of the integrated likelihood function. As $\theta$ is a $(P\times1)$ vector-valued parameter, these expansions get complicated very quickly. To alleviate this problem, we use the index notation. The main advantage of this notation is that it enables working on multi-dimensional arrays in almost the same fashion as scalars. A short overview of this notation is given next. See also McCullagh (1987) and, for a more approachable treatment, Pace and Salvan (1997, Chapter 9).

A.4.1 A Short Overview of Index Notation

The index notation technique is used to represent arrays concisely. For example, take some $P$-dimensional vector $\nu=(\nu_1,\ldots,\nu_P)'$. Using the index notation, this vector can be written more concisely as $[\nu_p]$, $p=1,\ldots,P$. Similarly, for a $P\times Q$ matrix $A$, where the row $i$, column $j$ entry is denoted by $A_{ij}$, the index notation representation is given by $[A_{i,j}]$. In the following, we will also drop the brackets whenever the meaning is clear from context. Although the convenience of this notation is not immediately obvious for one- or two-dimensional arrays, it becomes very useful with higher order arrays.

In the case at hand, $\theta=[\theta_p]$, where $p\in\{1,\ldots,P\}$. To make the likelihood notation less cumbersome, we use some further notational simplifications. Firstly, for a given function $f(\theta)$ we define the generic $m$th order derivative as
$$
f_{p_1,\ldots,p_m}=\frac{d^m f(\theta)}{d\theta_{p_1}d\theta_{p_2}\cdots d\theta_{p_m}},
\qquad\text{where } p_1,p_2,\ldots,p_m\in\{1,\ldots,P\}.
$$
Notice that $[f_{p_1,\ldots,p_m}]$ gives an $m$-dimensional array. In what follows, we will also drop some indices and superscripts whenever variables can be distinguished by context. Specifically,


let $r,s,t\in\{1,\ldots,P\}$. We use the following convention:
$$
\ell_i=\ell_{iT}(\theta_0),\qquad
\ell_{i;r}=\left.\frac{d\ell_{iT}(\theta)}{d\theta_r}\right|_{\theta=\theta_0},\qquad
\ell_{i;r,s}=\left.\frac{d^2\ell_{iT}(\theta)}{d\theta_r d\theta_s}\right|_{\theta=\theta_0},
$$
$$
\ell=\frac{1}{N}\sum_{i=1}^N\ell_{iT}(\theta_0),\qquad
\ell_r=\frac{1}{N}\sum_{i=1}^N\left.\frac{d\ell_{iT}(\theta)}{d\theta_r}\right|_{\theta=\theta_0},\qquad
\ell_{r,s}=\frac{1}{N}\sum_{i=1}^N\left.\frac{d^2\ell_{iT}(\theta)}{d\theta_r d\theta_s}\right|_{\theta=\theta_0},
$$
and similarly for higher order derivatives. Notice that by this convention, the arguments of the function are dropped whenever the function is evaluated at $(\theta_0,\lambda_{iT}(\theta_0))=(\theta_0,\lambda_{i0})$. We also employ the following shorthand notation:
$$
\nu_{r,s}=E[\ell_{r,s}],\qquad \nu_{r,s,t}=E[\ell_{r,s,t}],\qquad
H_{r,s}=\ell_{r,s}-\nu_{r,s},\qquad H_{r,s,t}=\ell_{r,s,t}-\nu_{r,s,t},
$$
$$
U_i=\ell^{\lambda}_{iT}(\theta_0),\qquad V_i=V^{\lambda\lambda}_{iT}(\theta_0),\qquad
E_i=E[\ell^{\lambda\lambda}_{iT}(\theta_0)],\qquad F_i=E[\ell^{\lambda\lambda\lambda}_{iT}(\theta_0)],
$$
$$
\Pi_i=\ln\pi_i(\lambda_{iT}(\theta_0)|\theta_0)
\qquad\text{and}\qquad
\Pi'_i=\frac{\partial\ln\pi_i(\lambda_{iT}(\theta_0)|\theta_0)}{\partial\lambda_i}.
$$
Then, for example,
$$
V_{i;p_1,\ldots,p_m}=\left.\frac{d^m V^{\lambda\lambda}_{iT}(\theta)}{d\theta_{p_1}\cdots d\theta_{p_m}}\right|_{\theta=\theta_0}
\qquad\text{and}\qquad
E_{i;p_1,\ldots,p_m}=E\left[\left.\frac{d^m\ell^{\lambda\lambda}_{iT}(\theta)}{d\theta_{p_1}\cdots d\theta_{p_m}}\right|_{\theta=\theta_0}\right].
$$
In what follows, when $[\nu_{r,s}]$ is concerned, superscripts indicate the corresponding entry of the inverse of $[\nu_{r,s}]$. Then, the row $r$ and column $s$ entry of the inverse of $[\nu_{r,s}]$ is given by $\nu^{r,s}$, where $r,s\in\{1,\ldots,P\}$. Lastly, define $\delta_r=(\hat\theta_{IL}-\theta_0)_r$, which is the $r$th entry of the $(P\times1)$ vector $(\hat\theta_{IL}-\theta_0)$.

Another convention used here is the Einstein summation convention, which allows for a concise representation of complicated summation operations. To illustrate the main idea, let $p=1,\ldots,P$, $q=1,\ldots,Q$ and consider the summation $\sum_{p=1}^P x_p y_{p,q}$, where $[x_p]$ is a $(P\times1)$ vector while $[y_{p,q}]$ is a $(P\times Q)$ matrix. Using the Einstein notation, this summation would be written as $x_p y_{p,q}$, making the summation across $p$ implicit. More generally, the idea is that, whenever an index appears twice in a product of arrays, it is to be implicitly understood that the product is summed across this index. Indices that are not repeated within the same product are called free indices, and the number of these indices determines the dimension of the resulting array. Indices that are repeated, on the other hand, are called dummy indices. As such, $x_p y_{p,q}$ is a vector (one free index, $q$), while $x_{p,r,s,t}y_{p,q}z_{r,t}=\sum_{p=1}^P\sum_{r=1}^R\sum_{t=1}^T x_{p,r,s,t}y_{p,q}z_{r,t}$ is a two-dimensional matrix (two free indices, $q$ and $s$). Note that the letters assigned to different indices can be changed freely as long as their relationship is left intact. For example, $x_{p,q}y_{q,r}$ is identical to $x_{q,p}y_{p,r}$, which is equal to $xy$ in standard matrix notation; however, $x_{p,q}y_{p,r}$ is equal to $x'y$.
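As a concrete check of this convention, the short sketch below reproduces the index-notation products above numerically; it is purely illustrative, and numpy.einsum as well as the array shapes are assumptions made for the example, not objects from the paper.

```python
import numpy as np

# Numerical illustration of the index/Einstein-summation notation of Section A.4.1.
rng = np.random.default_rng(0)
P, Q, R, T = 3, 4, 5, 6
x = rng.standard_normal(P)            # [x_p]
y = rng.standard_normal((P, Q))       # [y_{p,q}]

# x_p y_{p,q}: p is a dummy (summed) index, q is free -> a (Q,) vector, i.e. x'y.
assert np.allclose(np.einsum("p,pq->q", x, y), x @ y)

# x_{p,r,s,t} y_{p,q} z_{r,t}: p, r, t are dummy indices; q and s are free,
# so the result is a two-dimensional (Q, S) array (here S = P for concreteness).
S = P
x4 = rng.standard_normal((P, R, S, T))
z = rng.standard_normal((R, T))
print(np.einsum("prst,pq,rt->qs", x4, y, z).shape)   # (Q, S)

# x_{p,q} y_{q,r} is the matrix product xy, whereas x_{p,q} y_{p,r} is x'y.
A, B = rng.standard_normal((P, P)), rng.standard_normal((P, P))
assert np.allclose(np.einsum("pq,qr->pr", A, B), A @ B)
assert np.allclose(np.einsum("pq,pr->qr", A, B), A.T @ B)
```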

A.4.2 The Proof

The proof consists of several steps. We start with a preliminary Lemma.

Lemma A.10 Let Assumptions 3.1(ii), 3.4, A.1 and A.2 hold. Then, for any $i$, as $T\to\infty$,
$$
\begin{aligned}
\nabla_{\theta}\ln\{-E[\ell^{\lambda\lambda}_{iT}]\}&=-\frac{E_{i;r_1}}{E_i}=O(1),\\
\nabla_{\theta\theta}\ln\{-E[\ell^{\lambda\lambda}_{iT}]\}&=-\frac{E_{i;r_1,r_2}}{E_i}+\frac{E_{i;r_1}E_{i;r_2}}{E_i^2}=O(1),\\
\nabla_{\theta}\left(\frac{V^{\lambda\lambda}_{iT}}{E[\ell^{\lambda\lambda}_{iT}]}\right)&=\frac{V_{i;r_1}}{E_i}-\frac{V_iE_{i;r_1}}{E_i^2}=O_p\left(\frac{1}{\sqrt{T}}\right),\\
\nabla_{\theta\theta}\left(\frac{V^{\lambda\lambda}_{iT}}{E[\ell^{\lambda\lambda}_{iT}]}\right)&=\frac{V_{i;r_1,r_2}}{E_i}-\frac{V_{i;r_1}E_{i;r_2}[2]+V_iE_{i;r_1,r_2}}{E_i^2}+2\frac{V_iE_{i;r_1}E_{i;r_2}}{E_i^3}=O_p\left(\frac{1}{\sqrt{T}}\right),\\
\nabla_{\theta}\left(\frac{\ell^{\lambda}_{iT}E[\ell^{\lambda\lambda\lambda}_{iT}]}{\{E[\ell^{\lambda\lambda}_{iT}]\}^2}\right)&=\frac{U_{i;r_1}F_i+U_iF_{i;r_1}}{E_i^2}-2\frac{U_iF_iE_{i;r_1}}{E_i^3}=O_p\left(\frac{1}{\sqrt{T}}\right),\\
\nabla_{\theta\theta}\left(\frac{\ell^{\lambda}_{iT}E[\ell^{\lambda\lambda\lambda}_{iT}]}{\{E[\ell^{\lambda\lambda}_{iT}]\}^2}\right)&=\frac{U_{i;r_1,r_2}F_i+U_{i;r_1}F_{i;r_2}[2]+U_iF_{i;r_1,r_2}}{E_i^2}
-2\frac{U_{i;r_1}E_{i;r_2}F_i[2]+U_iE_{i;r_2}F_{i;r_1}[2]+U_iF_iE_{i;r_1,r_2}}{E_i^3}
+6\frac{U_iF_iE_{i;r_1}E_{i;r_2}}{E_i^4}=O_p\left(\frac{1}{\sqrt{T}}\right),\\
\nabla_{\theta}\left(\frac{\ell^{\lambda}_{iT}}{E[\ell^{\lambda\lambda}_{iT}]}\Pi'_i\right)&=\frac{U_{i;r_1}\Pi'_i+U_i\Pi'_{i;r_1}}{E_i}-\frac{U_i\Pi'_iE_{i;r_1}}{E_i^2}=O_p\left(\frac{1}{\sqrt{T}}\right),\\
\nabla_{\theta\theta}\left(\frac{\ell^{\lambda}_{iT}}{E[\ell^{\lambda\lambda}_{iT}]}\Pi'_i\right)&=\frac{U_{i;r_1,r_2}\Pi'_i+U_{i;r_1}\Pi'_{i;r_2}[2]+U_i\Pi'_{i;r_1,r_2}}{E_i}
+2\frac{U_i\Pi'_iE_{i;r_1}E_{i;r_2}}{E_i^3}
-\frac{U_{i;r_1}\Pi'_iE_{i;r_2}[2]+U_i\Pi'_{i;r_1}E_{i;r_2}[2]+U_i\Pi'_iE_{i;r_1,r_2}}{E_i^2}=O_p\left(\frac{1}{\sqrt{T}}\right),\\
\nabla_{\theta}\left[\frac{(\ell^{\lambda}_{iT})^2}{E[\ell^{\lambda\lambda}_{iT}]}\right]&=2\frac{U_iU_{i;r_1}}{E_i}-\frac{U_i^2E_{i;r_1}}{E_i^2}=O_p\left(\frac{1}{T}\right),\\
\nabla_{\theta\theta}\left[\frac{(\ell^{\lambda}_{iT})^2}{E[\ell^{\lambda\lambda}_{iT}]}\right]&=2\frac{U_{i;r_2}U_{i;r_1}+U_iU_{i;r_1,r_2}}{E_i}
-\frac{2U_iU_{i;r_2}E_{i;r_1}[2]+U_i^2E_{i;r_1,r_2}}{E_i^2}
+2\frac{U_i^2E_{i;r_1}E_{i;r_2}}{E_i^3}=O_p\left(\frac{1}{T}\right),\\
\nabla_{\theta}\left[\frac{V^{\lambda\lambda}_{iT}(\ell^{\lambda}_{iT})^2}{\{E[\ell^{\lambda\lambda}_{iT}]\}^2}\right]&=\frac{V_{i;r_1}U_i^2+2V_iU_iU_{i;r_1}}{E_i^2}-2\frac{V_iU_i^2E_{i;r_1}}{E_i^3}=O_p\left(\frac{1}{T^{3/2}}\right),\\
\nabla_{\theta\theta}\left[\frac{V^{\lambda\lambda}_{iT}(\ell^{\lambda}_{iT})^2}{\{E[\ell^{\lambda\lambda}_{iT}]\}^2}\right]&=\frac{V_{i;r_1,r_2}U_i^2+2V_{i;r_1}U_iU_{i;r_2}[2]+2V_iU_{i;r_2}U_{i;r_1}+2V_iU_iU_{i;r_1,r_2}}{E_i^2}
-2\frac{V_{i;r_1}U_i^2E_{i;r_2}[2]+2V_iU_iU_{i;r_1}E_{i;r_2}[2]+V_iU_i^2E_{i;r_1,r_2}}{E_i^3}
+6\frac{V_iU_i^2E_{i;r_1}E_{i;r_2}}{E_i^4}=O_p\left(\frac{1}{T^{3/2}}\right),\\
\nabla_{\theta}\left(\frac{(\ell^{\lambda}_{iT})^3E[\ell^{\lambda\lambda\lambda}_{iT}]}{\{E[\ell^{\lambda\lambda}_{iT}]\}^3}\right)&=\frac{3U_i^2U_{i;r_1}F_i+U_i^3F_{i;r_1}}{E_i^3}-3\frac{U_i^3F_iE_{i;r_1}}{E_i^4}=O_p\left(\frac{1}{T^{3/2}}\right),\\
\nabla_{\theta\theta}\left(\frac{(\ell^{\lambda}_{iT})^3E[\ell^{\lambda\lambda\lambda}_{iT}]}{\{E[\ell^{\lambda\lambda}_{iT}]\}^3}\right)&=\frac{6U_iU_{i;r_2}U_{i;r_1}F_i+3U_i^2U_{i;r_1,r_2}F_i+3U_i^2U_{i;r_1}F_{i;r_2}[2]+U_i^3F_{i;r_1,r_2}}{E_i^3}
-3\frac{3U_i^2U_{i;r_1}F_iE_{i;r_2}[2]+U_i^3F_{i;r_1}E_{i;r_2}[2]+U_i^3F_iE_{i;r_1,r_2}}{E_i^4}
+12\frac{U_i^3F_iE_{i;r_1}E_{i;r_2}}{E_i^5}=O_p\left(\frac{1}{T^{3/2}}\right).
\end{aligned}
$$
Moreover, the third and fourth derivatives satisfy
$$
\begin{aligned}
&\nabla_{\theta\theta\theta}\ln\{-E[\ell^{\lambda\lambda}_{iT}]\}=O(1), &
&\nabla_{\theta\theta\theta\theta}\ln\{-E[\ell^{\lambda\lambda}_{iT}]\}=O(1),\\
&\nabla_{\theta\theta\theta}\left(\frac{V^{\lambda\lambda}_{iT}}{E[\ell^{\lambda\lambda}_{iT}]}\right)=O_p\left(\frac{1}{\sqrt{T}}\right), &
&\nabla_{\theta\theta\theta\theta}\left(\frac{V^{\lambda\lambda}_{iT}}{E[\ell^{\lambda\lambda}_{iT}]}\right)=O_p\left(\frac{1}{\sqrt{T}}\right),\\
&\nabla_{\theta\theta\theta}\left(\frac{\ell^{\lambda}_{iT}E[\ell^{\lambda\lambda\lambda}_{iT}]}{\{E[\ell^{\lambda\lambda}_{iT}]\}^2}\right)=O_p\left(\frac{1}{\sqrt{T}}\right), &
&\nabla_{\theta\theta\theta\theta}\left(\frac{\ell^{\lambda}_{iT}E[\ell^{\lambda\lambda\lambda}_{iT}]}{\{E[\ell^{\lambda\lambda}_{iT}]\}^2}\right)=O_p\left(\frac{1}{\sqrt{T}}\right),\\
&\nabla_{\theta\theta\theta}\left(\frac{\ell^{\lambda}_{iT}}{E[\ell^{\lambda\lambda}_{iT}]}\Pi'_i\right)=O_p\left(\frac{1}{\sqrt{T}}\right), &
&\nabla_{\theta\theta\theta\theta}\left(\frac{\ell^{\lambda}_{iT}}{E[\ell^{\lambda\lambda}_{iT}]}\Pi'_i\right)=O_p\left(\frac{1}{\sqrt{T}}\right),\\
&\nabla_{\theta\theta\theta}\left[\frac{(\ell^{\lambda}_{iT})^2}{E[\ell^{\lambda\lambda}_{iT}]}\right]=O_p\left(\frac{1}{T}\right), &
&\nabla_{\theta\theta\theta\theta}\left[\frac{(\ell^{\lambda}_{iT})^2}{E[\ell^{\lambda\lambda}_{iT}]}\right]=O_p\left(\frac{1}{T}\right),\\
&\nabla_{\theta\theta\theta}\left[\frac{V^{\lambda\lambda}_{iT}(\ell^{\lambda}_{iT})^2}{\{E[\ell^{\lambda\lambda}_{iT}]\}^2}\right]=O_p\left(\frac{1}{T^{3/2}}\right), &
&\nabla_{\theta\theta\theta\theta}\left[\frac{V^{\lambda\lambda}_{iT}(\ell^{\lambda}_{iT})^2}{\{E[\ell^{\lambda\lambda}_{iT}]\}^2}\right]=O_p\left(\frac{1}{T^{3/2}}\right),\\
&\nabla_{\theta\theta\theta}\left(\frac{(\ell^{\lambda}_{iT})^3E[\ell^{\lambda\lambda\lambda}_{iT}]}{\{E[\ell^{\lambda\lambda}_{iT}]\}^3}\right)=O_p\left(\frac{1}{T^{3/2}}\right), &
&\nabla_{\theta\theta\theta\theta}\left(\frac{(\ell^{\lambda}_{iT})^3E[\ell^{\lambda\lambda\lambda}_{iT}]}{\{E[\ell^{\lambda\lambda}_{iT}]\}^3}\right)=O_p\left(\frac{1}{T^{3/2}}\right).
\end{aligned}
$$
The numbers in brackets denote all possible permutations of the free indices. For example,
$$
E_{i;r_1,r_2}E_{i;r_3}[3]=E_{i;r_1,r_2}E_{i;r_3}+E_{i;r_1,r_3}E_{i;r_2}+E_{i;r_2,r_3}E_{i;r_1}.
$$

Proof. See the Supplementary Appendix.

The standard procedure to obtain an asymptotic expansion for $(\hat\theta_{IL}-\theta_0)_r$ would be to expand $N^{-1}\sum_{i=1}^N\ell^I_{i;r_1}(\hat\theta_{IL})$ about $\hat\theta_{IL}=\theta_0$ and obtain
$$
\begin{aligned}
\frac{1}{N}\sum_{i=1}^N\ell^I_{i;r_1}(\hat\theta_{IL})
={}&\frac{1}{N}\sum_{i=1}^N\ell^I_{i;r_1}(\theta_0)
+\delta_{r_2}\frac{1}{N}\sum_{i=1}^N\ell^I_{i;r_1,r_2}(\theta_0)
+\delta_{r_2}\delta_{r_3}\frac{1}{2N}\sum_{i=1}^N\ell^I_{i;r_1,r_2,r_3}(\theta_0)\\
&+\delta_{r_2}\delta_{r_3}\delta_{r_4}\frac{1}{6N}\sum_{i=1}^N\ell^I_{i;r_1,r_2,r_3,r_4}(\theta_0)
+\delta_{r_2}\delta_{r_3}\delta_{r_4}\delta_{r_5}\frac{1}{24N}\sum_{i=1}^N\ell^I_{i;r_1,r_2,r_3,r_4,r_5}(\tilde\theta),
\end{aligned}\tag{39}
$$
where $r_1,\ldots,r_5\in\{1,\ldots,P\}$ and $\tilde\theta$ is the mean value between $\hat\theta_{IL}$ and $\theta_0$. This can then be rearranged to obtain an asymptotic representation for $(\hat\theta_{IL}-\theta_0)_r$. Unfortunately, the difficulty with this approach is that the integrated likelihood is not a familiar concept like the concentrated likelihood and, therefore, the asymptotic behaviour of its derivatives is not immediately clear. However, by using the results obtained earlier, it is possible to represent the integrated likelihood function and its derivatives in terms of objects whose asymptotic properties are known.

Lemma A.11 Suppose $\max_{1\le i\le N}|\hat\lambda_i(\theta)-\lambda_{iT}(\theta)|\xrightarrow{p}0$ and the conditions of Lemma A.7 hold. Let, also, Assumptions 3.1(ii), 3.4, A.1 and A.2 hold. Then, for any $i$, as $T\to\infty$,
$$
\begin{aligned}
\ell^I_{iT}(\theta)
={}&\ell_{iT}(\theta)+\frac{\ln(2\pi/T)}{2T}
-\frac{1}{2T}\left[\ln\{-E[\ell^{\lambda\lambda}_{iT}(\theta)]\}
+\frac{V^{\lambda\lambda}_{iT}(\theta)}{E[\ell^{\lambda\lambda}_{iT}(\theta)]}
-\frac{\ell^{\lambda}_{iT}(\theta)E[\ell^{\lambda\lambda\lambda}_{iT}(\theta)]}{\{E[\ell^{\lambda\lambda}_{iT}(\theta)]\}^2}\right]\\
&+\frac{1}{T}\left[\ln\pi_i(\lambda_{iT}(\theta)|\theta)
-\frac{\ell^{\lambda}_{iT}(\theta)}{E[\ell^{\lambda\lambda}_{iT}(\theta)]}
\left.\frac{\partial\ln\pi_i(\lambda_i|\theta)}{\partial\lambda_i}\right|_{\lambda_i=\lambda_{iT}(\theta)}\right]
-\frac{1}{2}\frac{[\ell^{\lambda}_{iT}(\theta)]^2}{E[\ell^{\lambda\lambda}_{iT}(\theta)]}\\
&+\frac{1}{2}\frac{V^{\lambda\lambda}_{iT}(\theta)[\ell^{\lambda}_{iT}(\theta)]^2}{\{E[\ell^{\lambda\lambda}_{iT}(\theta)]\}^2}
-\frac{1}{6}\frac{[\ell^{\lambda}_{iT}(\theta)]^3E[\ell^{\lambda\lambda\lambda}_{iT}(\theta)]}{\{E[\ell^{\lambda\lambda}_{iT}(\theta)]\}^3}
+O_p(T^{-2}).
\end{aligned}\tag{40}
$$

Proof of Lemma A.11. This follows directly from (i) combining (33) and (37) to obtain an expression for $\ell^I_{iT}(\theta)-\ell_{iT}(\theta)$; (ii) replacing $\ln[-\ell^{\lambda\lambda}_{iT}(\theta,\hat\lambda_i(\theta))]$ by (35) and $\ln\pi_i(\hat\lambda_i(\theta)|\theta)$ by (36) in this expression; and, finally, (iii) substituting (29), up to an $O_p(T^{-1})$ remainder, for $[\hat\lambda_i(\theta)-\lambda_{iT}(\theta)]$.

Now, (40) and its derivatives can be used as approximations to the integrated likelihood terms appearing in (39). Substituting these will yield an asymptotic representation for $(\hat\theta_{IL}-\theta_0)_r$ in terms of the familiar concentrated/target likelihood terms. This is done in the following lemmas.

Lemma A.12 Let Assumptions 3.1-3.5, A.1 and A.2 hold. Then, as $N,T\to\infty$,
$$
-\delta_{r_2}\nu_{r_1,r_2}
=\ell_{r_1}+D_{1;r_1}+\delta_{r_2}H_{r_1,r_2}
+\frac{1}{2}\delta_{r_2}\delta_{r_3}\nu_{r_1,r_2,r_3}
+D_{3;r_1}+\delta_{r_2}D_{2;r_1,r_2}
+\frac{1}{2}\delta_{r_2}\delta_{r_3}H_{r_1,r_2,r_3}
+\frac{1}{6}\delta_{r_2}\delta_{r_3}\delta_{r_4}\ell_{r_1,r_2,r_3,r_4}
+O_p\left(\frac{1}{T^2}\right),\tag{41}
$$
where
$$
D_{1;r_1}
=\underbrace{\frac{1}{TN}\sum_{i=1}^N\frac{E_{i;r_1}}{2E_i}}_{O_p(1/T)}
+\underbrace{\frac{1}{TN}\sum_{i=1}^N\Pi_{i;r_1}}_{O_p(1/T)}
-\underbrace{\frac{1}{N}\sum_{i=1}^N\frac{U_iU_{i;r_1}}{E_i}}_{O_p(1/T)}
+\underbrace{\frac{1}{N}\sum_{i=1}^N\frac{U_i^2E_{i;r_1}}{2E_i^2}}_{O_p(1/T)}
=O_p\left(\frac{1}{T}\right),
$$
$$
\begin{aligned}
D_{2;r_1,r_2}
={}&\underbrace{\frac{1}{TN}\sum_{i=1}^N\frac{E_{i;r_1,r_2}}{2E_i}}_{O_p(1/T)}
-\underbrace{\frac{1}{TN}\sum_{i=1}^N\frac{E_{i;r_1}E_{i;r_2}}{2E_i^2}}_{O_p(1/T)}
+\underbrace{\frac{1}{TN}\sum_{i=1}^N\Pi_{i;r_1,r_2}}_{O_p(1/T)}
-\underbrace{\frac{1}{N}\sum_{i=1}^N\frac{U_{i;r_2}U_{i;r_1}+U_iU_{i;r_1,r_2}}{E_i}}_{O_p(1/T)}\\
&+\underbrace{\frac{1}{N}\sum_{i=1}^N\frac{2U_i\left(U_{i;r_1}E_{i;r_2}+U_{i;r_2}E_{i;r_1}\right)+U_i^2E_{i;r_1,r_2}}{2E_i^2}}_{O_p(1/T)}
-\underbrace{\frac{1}{N}\sum_{i=1}^N\frac{U_i^2E_{i;r_1}E_{i;r_2}}{E_i^3}}_{O_p(1/T)}
=O_p\left(\frac{1}{T}\right),
\end{aligned}
$$
$$
\begin{aligned}
D_{3;r_1}
={}&\underbrace{\frac{1}{TN}\sum_{i=1}^N\frac{V_iE_{i;r_1}+U_{i;r_1}F_i+U_iF_{i;r_1}+U_i\Pi'_iE_{i;r_1}}{2E_i^2}}_{O_p(1/T^{3/2})}
-\underbrace{\frac{1}{TN}\sum_{i=1}^N\frac{U_iF_iE_{i;r_1}}{E_i^3}}_{O_p(1/T^{3/2})}
-\underbrace{\frac{1}{TN}\sum_{i=1}^N\frac{V_{i;r_1}+U_{i;r_1}\Pi'_i+U_i\Pi'_{i;r_1}}{E_i}}_{O_p(1/T^{3/2})}\\
&+\underbrace{\frac{1}{N}\sum_{i=1}^N\frac{V_{i;r_1}U_i^2+2V_iU_iU_{i;r_1}}{2E_i^2}}_{O_p(1/T^{3/2})}
-\underbrace{\frac{1}{N}\sum_{i=1}^N\frac{3U_i^2U_{i;r_1}F_i+U_i^3F_{i;r_1}+V_iU_i^2E_{i;r_1}}{6E_i^3}}_{O_p(1/T^{3/2})}
+\underbrace{\frac{1}{N}\sum_{i=1}^N\frac{U_i^3F_iE_{i;r_1}}{2E_i^4}}_{O_p(1/T^{3/2})}
=O_p\left(\frac{1}{T^{3/2}}\right).
\end{aligned}
$$

Proof. To prove the lemma, the derivatives of (40) with respect to $\theta$ first have to be obtained. This is achieved by substituting the results given in Lemma A.10 as necessary. Then,
$$
\begin{aligned}
\ell^I_{i;r_1}(\theta_0)
={}&\ell_{i;r_1}
+\frac{1}{T}\left[\frac{E_{i;r_1}}{2E_i}+\Pi_{i;r_1}\right]
-\frac{U_iU_{i;r_1}}{E_i}
+\frac{U_i^2E_{i;r_1}}{2E_i^2}\\
&+\frac{1}{T}\left[\frac{V_iE_{i;r_1}}{2E_i^2}
-\frac{V_{i;r_1}}{2E_i}
+\frac{U_{i;r_1}F_i+U_iF_{i;r_1}}{2E_i^2}
-\frac{U_iF_iE_{i;r_1}}{E_i^3}
+\frac{U_i\Pi'_iE_{i;r_1}}{E_i^2}
-\frac{U_{i;r_1}\Pi'_i+U_i\Pi'_{i;r_1}}{E_i}\right]\\
&+\frac{V_{i;r_1}U_i^2+2V_iU_iU_{i;r_1}}{2E_i^2}
-\frac{V_iU_i^2E_{i;r_1}}{E_i^3}
-\frac{3U_i^2U_{i;r_1}F_i+U_i^3F_{i;r_1}}{6E_i^3}
+\frac{U_i^3F_iE_{i;r_1}}{2E_i^4}
+O_p\left(\frac{1}{T^2}\right),
\end{aligned}
$$
$$
\begin{aligned}
\ell^I_{i;r_1,r_2}(\theta_0)
={}&\ell_{i;r_1,r_2}
+\frac{1}{T}\left[\frac{E_{i;r_1,r_2}}{2E_i}-\frac{E_{i;r_1}E_{i;r_2}}{2E_i^2}+\Pi_{i;r_1,r_2}\right]
-\frac{U_{i;r_2}U_{i;r_1}+U_iU_{i;r_1,r_2}}{E_i}\\
&+\frac{2U_i\left(U_{i;r_1}E_{i;r_2}+U_{i;r_2}E_{i;r_1}\right)+U_i^2E_{i;r_1,r_2}}{2E_i^2}
-\frac{U_i^2E_{i;r_1}E_{i;r_2}}{E_i^3}
+O_p\left(\frac{1}{T^{3/2}}\right),
\end{aligned}
$$
$$
\ell^I_{i;r_1,r_2,r_3}(\theta_0)=\ell_{i;r_1,r_2,r_3}+O_p\left(\frac{1}{T}\right)
\qquad\text{and}\qquad
\ell^I_{i;r_1,r_2,r_3,r_4}(\theta_0)=\ell_{i;r_1,r_2,r_3,r_4}+O_p\left(\frac{1}{T}\right).
$$


Noting that $\ell^I_{r_1}(\hat\theta_{IL})=0$ for $r_1\in\{1,\ldots,P\}$ and substituting the above expansions for the integrated likelihood derivatives into (39) gives, after some rearrangement,
$$
\begin{aligned}
-\delta_{r_2}\nu_{r_1,r_2}
={}&\underbrace{\ell_{r_1}}_{O_p(1/\sqrt{N^{\rho_1}T})}
+\underbrace{\delta_{r_2}H_{r_1,r_2}}_{O_p(1/\sqrt{N^{\rho_1+\rho_2}T^2})}
+\underbrace{\frac{1}{T}\left[\frac{1}{N}\sum_{i=1}^N\frac{E_{i;r_1}}{2E_i}+\frac{1}{N}\sum_{i=1}^N\Pi_{i;r_1}\right]}_{O_p(1/T)}
-\underbrace{\frac{1}{N}\sum_{i=1}^N\frac{U_iU_{i;r_1}}{E_i}}_{O_p(1/T)}
+\underbrace{\frac{1}{N}\sum_{i=1}^N\frac{U_i^2E_{i;r_1}}{2E_i^2}}_{O_p(1/T)}\\
&+\underbrace{\frac{1}{2}\delta_{r_2}\delta_{r_3}\nu_{r_1,r_2,r_3}}_{O_p(1/\sqrt{N^{2\rho_1}T^2})}
+\underbrace{\frac{1}{2}\delta_{r_2}\delta_{r_3}H_{r_1,r_2,r_3}}_{O_p(1/\sqrt{N^{2\rho_1+\rho_3}T^3})}
+\underbrace{\frac{1}{6}\delta_{r_2}\delta_{r_3}\delta_{r_4}\nu_{r_1,r_2,r_3,r_4}}_{O_p(1/\sqrt{N^{3\rho_1}T^3})}\\
&+\frac{1}{T}\Bigg[\underbrace{\frac{1}{N}\sum_{i=1}^N\frac{V_iE_{i;r_1}+U_{i;r_1}F_i+U_iF_{i;r_1}+U_i\Pi'_iE_{i;r_1}}{2E_i^2}}_{O_p(1/\sqrt{T})}
-\underbrace{\frac{1}{N}\sum_{i=1}^N\frac{V_{i;r_1}+U_{i;r_1}\Pi'_i+U_i\Pi'_{i;r_1}}{2E_i}}_{O_p(1/\sqrt{T})}
-\underbrace{\frac{1}{N}\sum_{i=1}^N\frac{U_iF_iE_{i;r_1}}{E_i^3}}_{O_p(1/\sqrt{T})}\Bigg]\\
&+\underbrace{\frac{1}{N}\sum_{i=1}^N\frac{V_{i;r_1}U_i^2+2V_iU_iU_{i;r_1}}{2E_i^2}}_{O_p(1/T^{3/2})}
-\underbrace{\frac{1}{N}\sum_{i=1}^N\frac{3U_i^2U_{i;r_1}F_i+U_i^3F_{i;r_1}+V_iU_i^2E_{i;r_1}}{6E_i^3}}_{O_p(1/T^{3/2})}
+\underbrace{\frac{1}{N}\sum_{i=1}^N\frac{U_i^3F_iE_{i;r_1}}{2E_i^4}}_{O_p(1/T^{3/2})}\\
&+\underbrace{\delta_{r_2}}_{O_p(1/\sqrt{N^{\rho_1}T})}
\Bigg\{\underbrace{\frac{1}{TN}\sum_{i=1}^N\frac{E_{i;r_1,r_2}}{2E_i}}_{O_p(1/T)}
-\underbrace{\frac{1}{TN}\sum_{i=1}^N\frac{E_{i;r_1}E_{i;r_2}}{2E_i^2}}_{O_p(1/T)}
+\underbrace{\frac{1}{TN}\sum_{i=1}^N\Pi_{i;r_1,r_2}}_{O_p(1/T)}
-\underbrace{\frac{1}{N}\sum_{i=1}^N\frac{U_{i;r_2}U_{i;r_1}+U_iU_{i;r_1,r_2}}{E_i}}_{O_p(1/T)}\\
&\qquad\qquad+\underbrace{\frac{1}{N}\sum_{i=1}^N\frac{2U_i\left(U_{i;r_1}E_{i;r_2}+U_{i;r_2}E_{i;r_1}\right)+U_i^2E_{i;r_1,r_2}}{2E_i^2}}_{O_p(1/T)}
-\underbrace{\frac{1}{N}\sum_{i=1}^N\frac{U_i^2E_{i;r_1}E_{i;r_2}}{E_i^3}}_{O_p(1/T)}\Bigg\}
+O_p\left(\frac{1}{T^2}\right),
\end{aligned}
$$

which is the same as (41). The order of the remainder term can be found as follows: under Assumptions 3.1(i)-(iii), 3.4(i), Lipschitz continuity of $\ell_{it;r_1,r_2,r_3,r_4,r_5}(\theta)$ and $2+\delta$ order moment existence for $\ell_{it;r_1,r_2,r_3,r_4,r_5}(\theta)$, Lemma B.2 of Engle et al. (2014) shows that
$$
P\left[\max_{1\le i\le N}\sup_{\theta\in\Theta}\left|\ell_{iT;r_1,r_2,r_3,r_4,r_5}(\theta)-E[\ell_{iT;r_1,r_2,r_3,r_4,r_5}(\theta)]\right|>\eta\right]=o(1).
$$
Then, using Theorem 4.2, which requires Assumption 3.1(iv) for the existence of a Laplace expansion, it can be shown that $\ell^I_{iT;r_1,r_2,r_3,r_4,r_5}(\theta)=\ell_{iT;r_1,r_2,r_3,r_4,r_5}(\theta)+o_p(1)$. Finally, following the same route as in the proof of Theorem 4.1 yields
$$
P\left[\max_{1\le i\le N}\sup_{\theta\in\Theta}\left|\ell^I_{iT;r_1,r_2,r_3,r_4,r_5}(\theta)-E[\ell^I_{iT;r_1,r_2,r_3,r_4,r_5}(\theta)]\right|>\eta\right]=o(1).
$$
Combining this uniform convergence result with $\hat\theta_{IL}\xrightarrow{p}\theta_0$ (which implies that $\tilde\theta\xrightarrow{p}\theta_0$, where $\tilde\theta$ is the mean value between $\hat\theta_{IL}$ and $\theta_0$) finally gives $\ell^I_{iT;r_1,r_2,r_3,r_4,r_5}(\tilde\theta)=E[\ell^I_{iT;r_1,r_2,r_3,r_4,r_5}(\theta_0)]+o_p(1)=O_p(1)$ as $T\to\infty$. Therefore, the remainder term is of order $O_p(N^{-4\rho_1}T^{-2})\le O_p(T^{-2})$. The orders of magnitude of the remaining terms on the right-hand side can be found by utilising standard results on mixing variables (White (2001)) and Assumption 3.5.

Notice that the expansion given by (41) is a polynomial in $(\hat\theta_{IL}-\theta_0)=[\delta_r]$, where $r\in\{1,\ldots,P\}$. To obtain an expansion for $(\hat\theta_{IL}-\theta_0)$ that is not a function of itself, (41) has to be inverted. This is achieved by repeatedly substituting for $\delta_{r_2}$, $\delta_{r_3}$ and $\delta_{r_4}$.

Lemma A.13 Let Assumptions 3.1-3.5, A.1 and A.2 hold. Then, as $N,T\to\infty$,
$$
\begin{aligned}
\delta_m={}&-\ell_a\nu^{a,m}+\ell_a\nu^{a,b}H_{c,b}\nu^{c,m}-D_{1;a}\nu^{a,m}
-\frac{1}{2}\ell_a\nu^{a,b}\ell_c\nu^{c,d}\nu_{e,b,d}\nu^{e,m}\\
&+\frac{1}{2}\ell_a\nu^{a,b}\ell_c\nu^{c,d}\nu_{e,b,d}\nu^{e,f}H_{g,f}\nu^{g,m}
-\ell_a\nu^{a,b}H_{c,b}\nu^{c,d}H_{e,d}\nu^{e,m}\\
&-\frac{1}{2}D_{1;a}\nu^{a,b}\ell_c\nu^{c,d}\nu_{e,b,d}\nu^{e,m}
+\frac{1}{2}\ell_a\nu^{a,b}H_{c,b}\nu^{c,d}\ell_e\nu^{e,f}\nu_{g,d,f}\nu^{g,m}\\
&-\frac{1}{4}\ell_a\nu^{a,b}\ell_c\nu^{c,d}\nu_{e,b,d}\nu^{e,f}\ell_g\nu^{g,h}\nu_{i,f,h}\nu^{i,m}
-D_{3;a}\nu^{a,m}+\ell_a\nu^{a,b}D_{2;c,b}\nu^{c,m}\\
&-\frac{1}{2}\ell_a\nu^{a,b}\ell_c\nu^{c,d}H_{e,b,d}\nu^{e,m}
+\frac{1}{6}\ell_a\nu^{a,b}\ell_c\nu^{c,d}\ell_e\nu^{e,f}\ell_{g,b,d,f}\nu^{g,m}
+O_p\left(\frac{1}{T^2}\right),
\end{aligned}
$$
where $a,b,c,d,e,f,g,h,i,j,k,l,m\in\{1,\ldots,P\}$.

Proof. The objective is to invert (41) and obtain an expression for $\delta_m$ as a function of likelihood terms only. To do this, $\delta_m$ first has to be isolated on the left-hand side. This cannot be done simply by replacing $r_2$ by $m$, as $\delta_m$ appears on both sides of (41). However, notice that if $x^{-1}=[x^{rs}]$ is the inverse of $x=[x_{rs}]$, then
$$
x^{rs}x_{st}=\kappa^r_t=
\begin{cases}
1 & \text{if } r=t,\\
0 & \text{if } r\neq t.
\end{cases}
$$
The array $\kappa^r_t$ is known as the Kronecker delta, and $[\kappa^r_t]$ is the identity matrix. Hence,
$$
\delta_{r_2}\nu_{r_1,r_2}\nu^{r_1,m}=\delta_{r_2}\kappa^m_{r_2}=
\begin{cases}
\delta_m & \text{if } r_2=m,\\
0 & \text{if } r_2\neq m.
\end{cases}
$$
Then, multiplying both sides of (41) by $\nu^{r_1,m}$ gives
$$
\begin{aligned}
\delta_m=-\Big(&\ell_{r_1}\nu^{r_1,m}+D_{1;r_1}\nu^{r_1,m}+\delta_{r_2}H_{r_1,r_2}\nu^{r_1,m}
+\frac{1}{2}\delta_{r_2}\delta_{r_3}\nu_{r_1,r_2,r_3}\nu^{r_1,m}+D_{3;r_1}\nu^{r_1,m}\\
&+\delta_{r_2}D_{2;r_1,r_2}\nu^{r_1,m}
+\frac{1}{2}\delta_{r_2}\delta_{r_3}H_{r_1,r_2,r_3}\nu^{r_1,m}
+\frac{1}{6}\delta_{r_2}\delta_{r_3}\delta_{r_4}\ell_{r_1,r_2,r_3,r_4}\nu^{r_1,m}\Big)
+O_p\left(\frac{1}{T^2}\right).
\end{aligned}\tag{42}
$$

In the iterative substitution step, we will use the following copies of (42):
$$
\delta_{r_2}=-\left(\ell_a\nu^{a,r_2}+D_{1;a}\nu^{a,r_2}+\delta_bH_{a,b}\nu^{a,r_2}
+\frac{1}{2}\delta_b\delta_c\nu_{a,b,c}\nu^{a,r_2}\right)+O_p\left(\frac{1}{T^{3/2}}\right),
$$
$$
\delta_{r_3}=-\left(\ell_e\nu^{e,r_3}+D_{1;e}\nu^{e,r_3}+\delta_fH_{e,f}\nu^{e,r_3}
+\frac{1}{2}\delta_f\delta_g\nu_{e,f,g}\nu^{e,r_3}\right)+O_p\left(\frac{1}{T^{3/2}}\right),
$$
$$
\delta_{r_4}=-\left(\ell_i\nu^{i,r_4}+D_{1;i}\nu^{i,r_4}+\delta_jH_{i,j}\nu^{i,r_4}
+\frac{1}{2}\delta_j\delta_k\nu_{i,j,k}\nu^{i,r_4}\right)+O_p\left(\frac{1}{T^{3/2}}\right),
$$
as well as $\delta_b=-\ell_y\nu^{y,b}+O_p(T^{-1})$, $\delta_c=-\ell_z\nu^{z,c}+O_p(T^{-1})$, $\delta_f=-\ell_y\nu^{y,f}+O_p(T^{-1})$, and $\delta_g=-\ell_z\nu^{z,g}+O_p(T^{-1})$. Substituting $\delta_{r_2}$, $\delta_{r_3}$, $\delta_{r_4}$, $\delta_b$, $\delta_c$, $\delta_f$ and $\delta_g$ into the right-hand side of $\delta_m$, arranging the resulting terms according to their orders and redefining the dummy indices to simplify the final expression gives

$$
\begin{aligned}
\delta_m={}&\underbrace{-\ell_a\nu^{a,m}}_{O_p(1/\sqrt{N^{\rho_1}T})}
+\underbrace{\ell_a\nu^{a,b}H_{c,b}\nu^{c,m}}_{O_p(1/\sqrt{N^{\rho_1+\rho_2}T^2})}
-\underbrace{D_{1;a}\nu^{a,m}}_{O_p(1/T)}
-\underbrace{\frac{1}{2}\ell_a\nu^{a,b}\ell_c\nu^{c,d}\nu_{e,b,d}\nu^{e,m}}_{O_p(1/\sqrt{N^{2\rho_1}T^2})}\\
&+\underbrace{\frac{1}{2}\ell_a\nu^{a,b}\ell_c\nu^{c,d}\nu_{e,b,d}\nu^{e,f}H_{g,f}\nu^{g,m}}_{O_p(1/\sqrt{N^{2\rho_1+\rho_2}T^3})}
-\underbrace{\ell_a\nu^{a,b}H_{c,b}\nu^{c,d}H_{e,d}\nu^{e,m}}_{O_p(1/\sqrt{N^{\rho_1+2\rho_2}T^3})}
-\underbrace{\frac{1}{2}\ell_a\nu^{a,b}D_{1;c}\nu^{c,d}\nu_{e,b,d}\nu^{e,m}}_{O_p(1/\sqrt{N^{\rho_1}T^3})}\\
&+\underbrace{\frac{1}{2}\ell_a\nu^{a,b}\ell_c\nu^{c,d}H_{e,d}\nu^{e,f}\nu_{g,b,f}\nu^{g,m}}_{O_p(1/\sqrt{N^{2\rho_1+\rho_2}T^3})}
-\underbrace{\frac{1}{4}\ell_a\nu^{a,b}\ell_c\nu^{c,d}\ell_e\nu^{e,f}\nu_{g,d,f}\nu^{g,h}\nu_{i,b,h}\nu^{i,m}}_{O_p(1/\sqrt{N^{3\rho_1}T^3})}
+\underbrace{D_{1;a}\nu^{a,b}H_{c,b}\nu^{c,m}}_{O_p(1/\sqrt{N^{\rho_2}T^3})}\\
&-\underbrace{\frac{1}{2}D_{1;a}\nu^{a,b}\ell_c\nu^{c,d}\nu_{e,b,d}\nu^{e,m}}_{O_p(1/\sqrt{N^{\rho_1}T^3})}
+\underbrace{\frac{1}{2}\ell_a\nu^{a,b}H_{c,b}\nu^{c,d}\ell_e\nu^{e,f}\nu_{g,d,f}\nu^{g,m}}_{O_p(1/\sqrt{N^{2\rho_1+\rho_2}T^3})}
-\underbrace{\frac{1}{4}\ell_a\nu^{a,b}\ell_c\nu^{c,d}\nu_{e,b,d}\nu^{e,f}\ell_g\nu^{g,h}\nu_{i,f,h}\nu^{i,m}}_{O_p(1/\sqrt{N^{3\rho_1}T^3})}\\
&-\underbrace{D_{3;a}\nu^{a,m}}_{O_p(1/\sqrt{T^3})}
+\underbrace{\ell_a\nu^{a,b}D_{2;c,b}\nu^{c,m}}_{O_p(1/\sqrt{N^{\rho_1}T^3})}
-\underbrace{\frac{1}{2}\ell_a\nu^{a,b}\ell_c\nu^{c,d}H_{e,b,d}\nu^{e,m}}_{O_p(1/\sqrt{N^{2\rho_1+\rho_3}T^3})}
+\underbrace{\frac{1}{6}\ell_a\nu^{a,b}\ell_c\nu^{c,d}\ell_e\nu^{e,f}\ell_{g,b,d,f}\nu^{g,m}}_{O_p(1/\sqrt{N^{3\rho_1}T^3})}
+O_p\left(\frac{1}{T^2}\right),
\end{aligned}\tag{43}
$$

which proves Lemma A.13.

Based on these results, the proof of Theorem 5.1 now follows.

Proof of Theorem 5.1. It follows directly from Lemma A.13 that
$$
\delta_m=\underbrace{-\ell_a\nu^{a,m}}_{O_p(1/\sqrt{N^{\rho_1}T})}
\underbrace{-D_{1;a}\nu^{a,m}}_{O_p(1/T)}
+\underbrace{\ell_a\nu^{a,b}H_{c,b}\nu^{c,m}}_{O_p(1/\sqrt{N^{\rho_1+\rho_2}T^2})}
-\underbrace{\frac{1}{2}\ell_a\nu^{a,b}\ell_c\nu^{c,d}\nu_{e,b,d}\nu^{e,m}}_{O_p(1/\sqrt{N^{2\rho_1}T^2})}
+O_p\left(\frac{1}{T^{3/2}}\right),
$$


independent of the particular values of $\rho_1$, $\rho_2$ and $\rho_3$. Remember that, since we are using the Einstein summation notation, whenever an index appears twice in an expression it is to be understood that the expression is summed across that index. Now, $\ell_a$ is row $a$ of the $(P\times1)$ vector $S$, while $\nu^{a,m}$ is the row $a$, column $m$ entry of $\nu^{-1}$. Therefore, $\ell_a\nu^{a,m}=\sum_{a=1}^P\ell_a\nu^{a,m}=(\nu^{-1})'S=\nu^{-1}S$, since $\nu^{-1}$ is by definition symmetric. Consider now $\ell_a\nu^{a,b}\ell_c\nu^{c,d}\nu_{e,b,d}\nu^{e,m}$. Notice that $\nu_{e,b,d}$ is a three-dimensional array of the expectations of third order derivatives of $\ell_{NT}$. Now, let $P_b$ and $P'_d$ be the row $b$ entry of $\nu^{-1}S$ and the column $d$ entry of $S'\nu^{-1}$, respectively. Then, $\ell_a\nu^{a,b}\ell_c\nu^{c,d}\nu_{e,b,d}$ is equal to $\sum_{b=1}^P\sum_{d=1}^P P_b\nu_{e,b,d}P'_d=M_e$, the row $e$ entry of $M$ using previously defined notation. Hence, $\ell_a\nu^{a,b}\ell_c\nu^{c,d}\nu_{e,b,d}\nu^{e,m}=\sum_{e=1}^P M_e\nu^{e,m}=\nu^{-1}M$. The remaining two terms can be analysed in exactly the same way. See also Examples C.1-C.4 in Engle, Pakel, Shephard and Sheppard (2014) for further illustrations of moving from the index to matrix notation. Then, (8) follows.
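The index-to-matrix mapping just described can be verified numerically on small arbitrary arrays. The sketch below is illustrative only: the arrays are random stand-ins for $S=[\ell_a]$, $\nu^{-1}=[\nu^{a,m}]$ and $[\nu_{e,b,d}]$, and numpy.einsum is simply an assumed tool for spelling out the index contractions.

```python
import numpy as np

# Check that l_a nu^{a,m} corresponds to nu^{-1} S and that
# l_a nu^{a,b} l_c nu^{c,d} nu_{e,b,d} nu^{e,m} corresponds to nu^{-1} M,
# where M_e = sum_{b,d} (nu^{-1}S)_b nu_{e,b,d} (nu^{-1}S)_d.
rng = np.random.default_rng(1)
P = 3
S = rng.standard_normal(P)                     # stand-in for [l_a]
A = rng.standard_normal((P, P))
nu = -(A @ A.T + P * np.eye(P))                # symmetric, negative definite [nu_{r,s}]
nu_inv = np.linalg.inv(nu)                     # [nu^{r,s}]
nu3 = rng.standard_normal((P, P, P))           # stand-in for [nu_{e,b,d}]

assert np.allclose(np.einsum("a,am->m", S, nu_inv), nu_inv @ S)

term_index = np.einsum("a,ab,c,cd,ebd,em->m", S, nu_inv, S, nu_inv, nu3, nu_inv)
Pvec = nu_inv @ S
M = np.einsum("b,ebd,d->e", Pvec, nu3, Pvec)
assert np.allclose(term_index, nu_inv @ M)
print("index-notation and matrix-notation forms agree")
```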

As for the CLT result in (9), notice that by Assumption 3.5, $\sqrt{N^{\rho_1}T}\,\ell_a\xrightarrow{d}N(0,\sigma^2_a)$, where $\sigma^2_a=\operatorname{plim}_{N,T\to\infty}\operatorname{Var}(\sqrt{N^{\rho_1}T}\,\ell_a)$ for all $a\in\{1,\ldots,P\}$. Consider the $(P\times1)$-dimensional zero-mean vector $S$. Let $\sigma^2_{S,NT}=\operatorname{Var}(\sqrt{N^{\rho_1}T}\,S)$. By the Cramer-Wold device, if we can show that for any vector $\gamma$ such that $\gamma'\gamma=1$, $\gamma'\sigma^{-1}_{S,NT}\sqrt{N^{\rho_1}T}\,S\xrightarrow{d}\gamma'Z$ where $Z\sim N(0,I)$, then it follows that $\sigma^{-1}_{S,NT}\sqrt{N^{\rho_1}T}\,S\xrightarrow{d}N(0,I)$. Now, $\gamma'\sigma^{-1}_{S,NT}S$ is a linear combination of the terms $\ell_a=\partial\sum_{i=1}^N\sum_{t=1}^T\ell_{it}(\theta_0,\lambda_{iT}(\theta_0))/\partial\theta_a$ for $a=1,\ldots,P$. Therefore, a CLT applies to $\gamma'\sigma^{-1}_{S,NT}S$ as well. In addition,
$$
\operatorname{Var}\left(\gamma'\sigma^{-1}_{S,NT}\sqrt{N^{\rho_1}T}\,S\right)
=\gamma'\sigma^{-1}_{S,NT}\sqrt{N^{\rho_1}T}\,E[SS']\sqrt{N^{\rho_1}T}\,\sigma^{-1}_{S,NT}\gamma
=\gamma'\gamma=1.
$$
Therefore, $\gamma'\sigma^{-1}_{S,NT}\sqrt{N^{\rho_1}T}\,S\xrightarrow{d}N(0,1)$ for any $\gamma$. Thus, $\sigma^{-1}_{S,NT}\sqrt{N^{\rho_1}T}\,S\xrightarrow{d}N(0,I)$ or, equivalently, $\sqrt{N^{\rho_1}T}\,S\xrightarrow{d}N(0,\sigma^2_S)$, where $\sigma^2_S=\operatorname{plim}_{N,T\to\infty}\sigma^2_{S,NT}$. Hence, by Slutsky's Theorem,
$$
\sqrt{N^{\rho_1}T}\,\nu^{-1}S\xrightarrow{d}N(0,\nu^{-1}\sigma^2_S\nu^{-1}).
$$
Now, if $\rho_1=\rho_2=0$, then
$$
\sqrt{T}(\hat\theta_{IL}-\theta_0)=-\nu^{-1}\sqrt{T}\,S+O_p\left(\frac{1}{\sqrt{T}}\right)
\xrightarrow{d}N(0,\nu^{-1}\sigma^2_S\nu^{-1})\quad\text{as } N,T\to\infty.
$$
On the other hand, for all other values of $\rho_1$ and $\rho_2$,
$$
\sqrt{N^{\rho_1}T}(\hat\theta_{IL}-\theta_0)
=-\nu^{-1}\sqrt{N^{\rho_1}T}\,S
+\frac{\sqrt{N^{\rho_1}T}}{T}\underbrace{TB_{NT}}_{O_p(1)}+o_p(1)
\xrightarrow{d}N(\rho\beta,\nu^{-1}\sigma^2_S\nu^{-1}),
$$
as $N,T\to\infty$, as desired. The use of the robust prior removes the incidental parameter bias term and (10) follows.

A.5 Proofs of Theorems 5.2, 5.3 and 5.4

In the following proofs, the idea is to take (43) and then use Lemmas A.2-A.4 to determine the order of the bias. For the sake of conciseness, define (at the cost of some notational abuse)
$$
\begin{aligned}
M_1&=\ell_a\nu^{a,b}H_{c,b}\nu^{c,m}; &
M_2&=\ell_a\nu^{a,b}\ell_c\nu^{c,d}\nu_{e,b,d}\nu^{e,m};\\
M_3&=\ell_a\nu^{a,b}\ell_c\nu^{c,d}\nu_{e,b,d}\nu^{e,f}H_{g,f}\nu^{g,m}; &
M_4&=\ell_a\nu^{a,b}H_{c,b}\nu^{c,d}H_{e,d}\nu^{e,m};\\
M_5&=\ell_a\nu^{a,b}H_{c,b}\nu^{c,d}\ell_e\nu^{e,f}\nu_{g,d,f}\nu^{g,m}; &
M_6&=\ell_a\nu^{a,b}\ell_c\nu^{c,d}\nu_{e,b,d}\nu^{e,f}\ell_g\nu^{g,h}\nu_{i,f,h}\nu^{i,m};\\
M_7&=\ell_a\nu^{a,b}\ell_c\nu^{c,d}H_{e,b,d}\nu^{e,m}; &
M_8&=\ell_a\nu^{a,b}\ell_c\nu^{c,d}\ell_e\nu^{e,f}\ell_{g,b,d,f}\nu^{g,m}.
\end{aligned}
$$


Remark A.2 The Proofs of Theorems 5.2-5.4 are based on similar ideas. Therefore, only the Proof of Theorem 5.2 is presented here.

Proof of Theorem 5.2. When calculating the orders of the expected values of $M_1$-$M_8$, all non-zero-mean terms can be bounded by the suprema of their absolute values, which are finite by Assumption A.2. This leaves out the zero-mean likelihood derivatives, which are dealt with by Lemma A.2 (the same idea applies to the remaining dependence types). Then, by (20) in Lemma A.2, the expected values of $M_1$ and $M_2$ are $O(N^{-1}T^{-1})$. Similarly, by (22), $M_3$-$M_8$ are all $O(N^{-2}T^{-2})$ in expectation. Now,
$$
E[D_{1;a}\ell_c]
=E\left[\frac{1}{TN}\sum_{i=1}^N\frac{\ell_cE_{i;a}}{2E_i}
+\frac{1}{TN}\sum_{i=1}^N\ell_c\Pi_{i;a}
-\frac{1}{N}\sum_{i=1}^N\frac{\ell_cU_iU_{i;a}}{E_i}
+\frac{1}{N}\sum_{i=1}^N\frac{\ell_cU_i^2E_{i;a}}{2E_i^2}\right].
$$
Notice that the first two terms on the right-hand side disappear, since $E_{i;a}$ and $\Pi_{i;a}$ are both non-random and $\ell_c$ is zero-mean. The remaining terms are of the form
$$
E\left[\frac{1}{N^2T^3}\sum_{1\le i,j\le N}\;\sum_{1\le s,t,q\le T} f(x_{is})g(x_{jt})h(x_{jq})\phi(x_{jT})\right]
\le\phi_{N,T}\,E\left[\frac{1}{N^2T^3}\sum_{1\le i,j\le N}\;\sum_{1\le s,t,q\le T} f(x_{is})g(x_{jt})h(x_{jq})\right],
$$
where $x_{jT}=(x_{j1},\ldots,x_{jT})$, $\phi(x_{jT})=O(1)$ as $T\to\infty$ uniformly for all $j$, and $\phi_{N,T}=\sup_j\phi(x_{jT})$. By (21), the above term, and hence $E[D_{1;a}\ell_c]$, are $O(N^{-1}T^{-2})$. $E[\ell_a\nu^{a,b}D_{2;c,b}\nu^{c,m}]$ can be shown to be of the same order by using the same method. Finally,
$$
E[D_{3;a}\nu^{a,m}]
=\nu^{a,m}E\left[\frac{1}{N}\sum_{i=1}^N\frac{V_{i;a}U_i^2+2V_iU_iU_{i;a}}{2E_i^2}
-\frac{1}{N}\sum_{i=1}^N\frac{3U_i^2U_{i;a}F_i+U_i^3F_{i;a}+V_iU_i^2E_{i;a}}{6E_i^3}
+\frac{1}{N}\sum_{i=1}^N\frac{U_i^3F_iE_{i;a}}{2E_i^4}\right],
$$
where each term is of the form
$$
\frac{1}{N}\sum_{i=1}^N E\left[\frac{1}{T^3}\sum_{1\le s,t,q\le T} f(x_{is})g(x_{it})h(x_{iq})\phi(x_{iT})\right]
\le\phi_{N,T}\,\frac{1}{N}\sum_{i=1}^N E\left[\frac{1}{T^3}\sum_{1\le s,t,q\le T} f(x_{is})g(x_{it})h(x_{iq})\right],\tag{44}
$$
where $x_{iT}$, $\phi(x_{iT})$ and $\phi_{N,T}$ are defined similarly as before. Then, (44) is $O(T^{-2})$ by (19). Hence, since $N=O(T)$, $E[\delta_m]-E[-D_{1;a}\nu^{a,m}]=O(T^{-2})$, as desired.

Proof of Theorem 5.3. See the Supplementary Appendix.

Proof of Theorem 5.4. See the Supplementary Appendix.
