
Consistent Model Selection for Marginal GeneralizedAdditive Model for Correlated Data

Lan Xue 1

Department of Statistics, Oregon State University

Corvallis, OR 97331

Annie Qu

Department of Statistics, University of Illinois at Urbana-Champaign

Champaign, IL 61820

Jianhui Zhou

Department of Statistics, University of Virginia

Charlottesville, VA 22904

1Xue’s research was supported by the National Science Foundation (DMS-0906739). Qu’s research was supported by

the National Science Foundation (DMS-0906660). Zhou’s research was supported by the National Science Foundation

(DMS-0906665).


Abstract

We consider the generalized additive model when responses from the same cluster are correlated.

Incorporating correlation in the estimation of nonparametric components for the generalized additive

model is important since it improves estimation efficiency and increases statistical power for model

selection. In our setting, there is no specified likelihood function for the generalized additive model

since the outcomes could be non-normal and discrete, which makes estimation and model selection

very challenging problems. We propose consistent estimation and model selection which incorpo-

rate the correlation structure. We establish an asymptotic property with L2-norm consistency for the

nonparametric components, which achieves the optimal rate of convergence. In addition, the pro-

posed model selection strategy is able to select the correct generalized additive model consistently.

That is, with probability approaching 1, the estimators for the zero function components converge

to 0 almost surely. We illustrate our method in numerical studies with both continuous and

binary responses, and in a real data application to binary periodontal data.

Supplemental materials including technical details are available online.

Keywords: L2-norm consistency; model selection; nonparametric function; oracle property; poly-

nomial spline; quadratic inference function; SCAD.

Running title: Correlated Generalized Additive Model.


1 Introduction

Longitudinal or correlated data occur very frequently in biomedical studies. The challenge in an-

alyzing correlated data is that the likelihood function is difficult to specify or formulate for non-

normal responses with large cluster size. To allow richer and more flexible model structures, the

generalized additive model (Hastie and Tibshirani, 1990) takes the additive form of nonparametric

functions. This is appealing for both model interpretation and prediction. However, when some

of the predictor variables are redundant, traditional estimation methods for the generalized additive

model (Stone 1985, Linton and Härdle 1996, Horowitz and Mammen 2007) are unable to produce

an accurate and efficient estimator. Since even a single predictor variable is typically associated

with a large number of unknown parameters in the nonparametric functions, inclusion of redundant

variables can hinder accuracy and efficiency for both estimation and inference. Therefore, variable

selection is a crucial step in generalized additive modeling.

In general, even when the data is not correlated, variable selection with discrete responses is

quite challenging since it involves numerical approximation and can be computationally intensive

(Meier, van de Geer and Bühlmann, 2008). Variable selection for the generalized additive model is

more challenging since the dimension for the nonparametric components could be infinite. Huang,

Horowitz and Wei (2009) propose additive model selection using B-spline bases when the number of

variables and additive components is larger than the sample size. Meier, van de Geer and Bühlmann

(2009) develop estimation procedures for high-dimensional generalized additive models by penaliz-

ing the sparse terms. Other approaches to variable selection in additive models include Huang and

Yang (2004) and Xue (2009). However, these approaches either require the specification of the like-

lihood function, or mainly target continuous outcomes using least squares. In general, these methods

cannot handle correlated discrete types of outcomes in model selection and estimation since there

is no closed form of the likelihood function, and/or it is practically infeasible to perform numeri-

cal approximation for infinite dimension problems. In this paper, we are interested in developing

a general framework for estimation and model selection for generalized additive models which can

handle correlated categorical responses in addition to continuous ones.


Furthermore, the current literature on incorporating correlation for the generalized additive

model is rather limited. This is mainly because nonparametric modeling can be highly computa-

tionally intensive and introduces high dimensional nuisance parameters associated with nonpara-

metric forms. This makes it difficult to incorporate additional correlation structure into the model.

However, ignoring correlation could lead to inefficient estimation and impair statistical power for

the selection of correct models. In addition, for smoothing spline and seemingly unrelated kernel

approaches (Wang, 2003), ignoring the correlation could also lead to biased estimation (Zhu, Fung

and He, 2008). Specifically, Wang (2003) shows that selection of the smoothing parameter can fail,

since it is rather sensitive to even a small departure from the true correlation structure; this

is reflected in over-fitting of the nonparametric estimator in order to reduce the overall bias. In contrast

to the parametric setting, these problems could be more serious for the nonparametric generalized

additive model since the true model might not be easily verified.

To incorporate correlation, Berhane and Tibshirani (1998) proposed estimation of the gener-

alized additive model using a smoothing spline generalized estimating equation (GEE; Liang and

Zeger, 1986) approach. However, it does not address the variable selection problem. Wang, Li and

Huang (2008) proposed a nonparametric varying coefficient model for longitudinal data, but it is

mainly applicable for continuous outcomes.

The quadratic inference function approach (QIF; Qu, Lindsay and Li, 2000) can efficiently take

the within-cluster correlation into account and is more efficient than the GEE approach when the

working correlation is misspecified. In this paper, we propose consistent estimation and variable

selection based on the penalized quadratic inference function for the generalized additive model

for correlated data. The proposed method is able to shrink small coefficients of the nonparametric

functional components to zero. In addition, it effectively takes the within-cluster correlation into

consideration and provides efficient estimators for the non-zero components.

We provide an asymptotic consistency property for nonparametric function estimation, and es-

tablish the optimal rate of convergence. Huang (1998) and Xue and Yang (2006) established asymp-

totic theory for additive models with polynomial splines when the responses are continuous and

independent. However, the asymptotic property has not been well studied in the current literature


for the correlated generalized additive model using polynomial splines. The theoretical technique for

the spline approach is very different from the kernel approach (Horowitz, 2001) for the generalized

additive model. In addition, since the objective function we minimize is a penalized quadratic dis-

tance function, the theoretical proof is very different from the penalized least squares approach (Xue,

2009). The techniques involved in developing the asymptotic properties are quite challenging because

of the curse of dimensionality of the nonparametric functions, the nonlinear relation between the

response and the covariates, and the correlated nature of the measurements.

The advantage of our approach is that the model selection is accomplished through SCAD penal-

ization (Fan and Li, 2001), and it achieves the oracle property. That is, the proposed method is able to

select the correct generalized additive model consistently, and the estimators of the non-zero components

achieve the same rate of $L_2$ convergence as if the true model were known in advance. In addition,

since the nonparametric function forms are selected group-wise for each functional component, the

choice of the basis functions of the polynomial spline is not critical to the final model selection in

our approach. Moreover, our numerical studies show that our approach performs extremely well in

estimation efficiency and model selection when the dimension of functional components is high.

The paper is organized as follows. Section 2 describes a marginal framework for the generalized

additive model for correlated data. Section 3 introduces a penalized quadratic inference function

method for simultaneous estimation and variable selection of marginal generalized additive models.

Asymptotic theories are developed and issues for implementation are discussed. Section 4 illustrates

simulation studies for continuous and binary responses. Section 5 demonstrates model selection and

estimation for binary periodontal data. The last section provides concluding remarks and discus-

sion. The proofs are given in the Appendix, and more technical details are available in the online

supplemental files.

2 A marginal generalized additive model for longitudinal data

Suppose there are $n$ clusters, with the $i$th ($i = 1, \ldots, n$) cluster containing $m_i$ observations. Let $Y_{ij}$ be a response variable, and $\mathbf{X}_{ij} = \big(X^{(1)}_{ij}, \ldots, X^{(d)}_{ij}\big)^T$ a $d \times 1$ vector of covariates for the $j$th ($j = 1, \ldots, m_i$) observation in the $i$th cluster. Denote
\[
\mathbf{Y}_i = \begin{pmatrix} Y_{i1} \\ \vdots \\ Y_{im_i} \end{pmatrix}, \qquad
\mathbf{X}_i = \begin{pmatrix} \mathbf{X}_{i1}^T \\ \vdots \\ \mathbf{X}_{im_i}^T \end{pmatrix}.
\]
Then the basic assumption of the marginal generalized additive model is that observations from different clusters are independent, and that the first two moments of $Y_{ij}$ are specified as $E(Y_{ij} \mid \mathbf{X}_{ij}, \mathbf{X}_i) = E(Y_{ij} \mid \mathbf{X}_{ij}) = \mu^0_{ij}$ and $\mathrm{Var}(Y_{ij} \mid \mathbf{X}_{ij}, \mathbf{X}_i) = \mathrm{Var}(Y_{ij} \mid \mathbf{X}_{ij}) = \phi V\big(\mu^0_{ij}\big)$, where $\phi$ is a scale parameter, $V(\cdot)$ is a known variance function, and the marginal mean $\mu^0_{ij}$ relates to the covariates through the known link function $g(\cdot)$,
\[
\eta^0_{ij} = g\big(\mu^0_{ij}\big) = \alpha_0 + \sum_{l=1}^{d} \alpha_l\big(x^{(l)}_{ij}\big), \qquad
\mu^0_{ij} = g^{-1}\big(\eta^0_{ij}\big). \tag{1}
\]
Here $\alpha_0$ is an unknown constant, and $\{\alpha_l(\cdot)\}_{l=1}^{d}$ are unknown smooth functions. Since the estimation of nonparametric functions is usually carried out on a compact support, we assume without loss of generality that the covariates $X^{(l)}_{ij}$ can be scaled into the interval $[0, 1]$, for $1 \le l \le d$. For model (1) to be identifiable, we assume, again without loss of generality, that each $\alpha_l(\cdot)$ is centered, with $\int_0^1 \alpha_l(x)\,dx = 0$.

In model (1), the additive nonparametric functions are formulated to model the covariate effects.

If each $\alpha_l(\cdot)$ is a linear function, then model (1) reduces to the generalized linear model for clustered

data. Berhane and Tibshirani (1998) extended the generalized estimating equation approach

and applied smoothing splines to estimate model (1). We propose an efficient estimation

procedure which approximates the nonparametric functions using polynomial splines. In addition,

within-cluster correlation is incorporated to obtain efficient parameter estimation via the quadratic

inference function (Qu et al., 2000). Unlike the GEE approach (Berhane and Tibshirani, 1998), our

proposed method does not require estimation of the working correlation parameters, and is more

efficient than the GEE approach when the working correlation is misspecified.

Another focus of this paper is to perform variable selection for the marginal generalized additive

model (1). With advanced technology, we frequently encounter massive high-throughput data with

high-dimensional covariates. Identification of important variables is a crucial step in

analyzing high-dimensional data, since each redundant variable involves an infinite dimension of

parameters for nonparametric components. Here a predictor variable $X^{(l)}$ is said to be redundant in

model (1) if and only if $\alpha_l\big(X^{(l)}\big) = 0$ almost surely; otherwise, $X^{(l)}$ is said to be

relevant. Suppose only an unknown subset of the predictor variables in model (1) is relevant. Our goal

is to consistently identify this subset of relevant variables and, at the same time, estimate their

unknown function components in (1).

3 Methodology and theory

In our estimation procedure, we approximate the smooth functions $\{\alpha_l(\cdot)\}_{l=1}^{d}$ in (1) by polynomial splines. For each $1 \le l \le d$, let $\upsilon_l$ be a partition of the interval $[0, 1]$ with $N_n$ interior knots,
\[
\upsilon_l = \big\{0 = \upsilon_{l,0} < \upsilon_{l,1} < \cdots < \upsilon_{l,N_n} < \upsilon_{l,N_n+1} = 1\big\}.
\]
Using $\upsilon_l$ as knots, the polynomial splines of order $p + 1$ are functions that are polynomials of degree $p$ (or less) on the intervals $[\upsilon_{l,i}, \upsilon_{l,i+1})$, $i = 0, \ldots, N_n - 1$, and $[\upsilon_{l,N_n}, \upsilon_{l,N_n+1}]$, and that have $p - 1$ continuous derivatives globally. We denote the space of such spline functions by $\varphi_l = \varphi^p([0, 1], \upsilon_l)$, and denote by $\varphi^0_l = \big\{s \in \varphi_l : \int_0^1 s(x)\,dx = 0\big\}$ the subspace of centered spline functions. The advantage of polynomial splines is that they often provide a good approximation of smooth functions with a limited number of knots.
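As a concrete, hypothetical illustration of this approximation, the sketch below builds a centered spline basis on $[0, 1]$ (using a truncated-power construction for simplicity, rather than the B-spline bases used later in the paper) and fits a smooth centered function by least squares; with $N_n = 5$ interior knots and $p = 3$, the basis has $J_n = N_n + p = 8$ functions. The knot locations, sample size, and target function are all assumptions for the example.

```python
import numpy as np

def truncated_power_basis(x, knots, degree=3):
    """Spline basis on [0, 1] via truncated powers:
    {x, ..., x^p, (x - v_1)_+^p, ..., (x - v_Nn)_+^p}."""
    cols = [x**k for k in range(1, degree + 1)]
    cols += [np.clip(x - v, 0.0, None)**degree for v in knots]
    return np.column_stack(cols)

def centered_basis(x, knots, degree=3):
    """Center each basis function so its (numerically approximated)
    integral over [0, 1] is zero, mimicking int_0^1 s(x) dx = 0."""
    grid = np.linspace(0.0, 1.0, 2001)
    shift = truncated_power_basis(grid, knots, degree).mean(axis=0)
    return truncated_power_basis(x, knots, degree) - shift

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 500)
alpha = np.sin(2 * np.pi * x)                          # a smooth centered component
B = centered_basis(x, knots=np.linspace(0.1, 0.9, 5))  # J_n = 5 + 3 = 8 columns
beta, *_ = np.linalg.lstsq(B, alpha, rcond=None)
print(np.max(np.abs(B @ beta - alpha)))                # small approximation error
```

A handful of knots already drives the maximum error well below the noise level of typical data, which is the point made in the text.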

3.1 An Initial Estimator

Let $\{B_{lj}(\cdot)\}_{j=1}^{J_n}$ be a set of spline bases of $\varphi^0_l$, with $J_n = N_n + p$. Then
\[
\alpha_l(\cdot) \approx s_l(\cdot) = \sum_{j=1}^{J_n} \beta_{lj} B_{lj}(\cdot),
\]
with a set of coefficients $\beta_l = \big(\beta_{l1}, \ldots, \beta_{lJ_n}\big)^T$. As a result,
\[
\eta^0_{ij} \approx \eta_{ij}(\beta) = \beta_0 + \sum_{l=1}^{d} \sum_{j=1}^{J_n} \beta_{lj} B_{lj}\big(x^{(l)}_{ij}\big),
\]


where $\beta = \big(\beta_0, \beta_1^T, \ldots, \beta_d^T\big)^T$. Equivalently, the mean function $\mu^0_{ij}$ in (1) can be approximated by
\[
\mu^0_{ij} \approx \mu_{ij}(\beta) = g^{-1}\Big\{\beta_0 + \sum_{l=1}^{d} \sum_{j=1}^{J_n} \beta_{lj} B_{lj}\big(x^{(l)}_{ij}\big)\Big\}.
\]

In matrix notation, we write $\mu_i(\beta) = \big(\mu_{i1}(\beta), \ldots, \mu_{im_i}(\beta)\big)^T$. To estimate the unknown coefficients $\beta$, we apply the quadratic inference function (Qu et al., 2000), which efficiently incorporates the within-cluster correlation. To simplify notation, we first assume equal cluster sizes with $m_i = m < \infty$; the implementation for data with unequal cluster sizes is discussed in Subsection 3.5. Let $R$ be a common working correlation matrix. The QIF approach approximates the inverse of $R$ by a linear combination of basis matrices, i.e.,
\[
R^{-1} \approx a_0 I + a_1 M_1 + \cdots + a_K M_K,
\]
where $I$ is the identity matrix and the $M_k$ are known symmetric basis matrices. For example, if $R$ has the compound symmetry structure with correlation $\rho$, then $R^{-1}$ can be represented as $a_0 I + a_1 M_1$, with $M_1$ the matrix with 0 on the diagonal and 1 off the diagonal, $a_0 = -\{(m-2)\rho + 1\}/k_1$ and $a_1 = \rho/k_1$, where $k_1 = (m-1)\rho^2 - (m-2)\rho - 1$ and $m$ is the dimension of $R$. The linear representation of $R^{-1}$ is also applicable for the AR-1 and block-diagonal correlation structures. The advantage of the QIF approach is that it does not require estimation of the linear coefficients $a_k$ associated with the working correlation matrix, which are treated here as nuisance parameters.
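The compound symmetry representation is exact and easy to verify numerically; a small sketch with hypothetical values of $m$ and $\rho$:

```python
import numpy as np

m, rho = 4, 0.3
R = (1 - rho) * np.eye(m) + rho * np.ones((m, m))   # exchangeable correlation
I = np.eye(m)
M1 = np.ones((m, m)) - I                            # 0 diagonal, 1 off-diagonal

k1 = (m - 1) * rho**2 - (m - 2) * rho - 1
a0 = -((m - 2) * rho + 1) / k1
a1 = rho / k1

# R^{-1} is reproduced exactly by the two-term linear combination
print(np.allclose(np.linalg.inv(R), a0 * I + a1 * M1))
```

The same check can be repeated for any $m \ge 2$ and $|\rho| < 1$ for which $R$ is positive definite.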

For any $\mathbf{x} = (x_1, \ldots, x_d)^T$, let
\[
\mathbf{B}(\mathbf{x}) = \big(1,\, B_{11}(x_1), \ldots, B_{1J_n}(x_1),\ \ldots,\ B_{d1}(x_d), \ldots, B_{dJ_n}(x_d)\big)^T,
\]
and denote $\mathbf{B}_i = \big(\mathbf{B}(\mathbf{x}_{i1}), \ldots, \mathbf{B}(\mathbf{x}_{im_i})\big)^T$ for $i = 1, \ldots, n$. Define
\[
G_n(\beta) = \frac{1}{n} \sum_{i=1}^{n} g_i(\beta) = \frac{1}{n} \sum_{i=1}^{n}
\begin{pmatrix}
\mathbf{B}_i^T \Delta_i^T(\beta)\, A_i^{-1} \big(\mathbf{Y}_i - \mu_i(\beta)\big) \\
\mathbf{B}_i^T \Delta_i^T(\beta)\, A_i^{-1/2} M_1 A_i^{-1/2} \big(\mathbf{Y}_i - \mu_i(\beta)\big) \\
\vdots \\
\mathbf{B}_i^T \Delta_i^T(\beta)\, A_i^{-1/2} M_K A_i^{-1/2} \big(\mathbf{Y}_i - \mu_i(\beta)\big)
\end{pmatrix},
\]
with $\Delta_i(\beta) = \mathrm{diag}\big\{\dot{\mu}_{i1}(\beta), \ldots, \dot{\mu}_{im_i}(\beta)\big\}$, where $\dot{\mu}_{ij}(\beta)$ is the first derivative of $g^{-1}$ evaluated at $\mathbf{B}^T(\mathbf{x}_{ij})\beta$, and $A_i = \mathrm{diag}\big\{V(\mu_{i1}), \ldots, V(\mu_{im_i})\big\}$. Since there are more estimating equations than unknown parameters, the QIF approach estimates $\beta$ by setting $G_n$ as close to


zero as possible, in the sense of minimizing the quadratic inference function $Q_n(\beta)$; that is,
\[
\tilde{\beta} = \arg\min_{\beta} Q_n(\beta) = \arg\min_{\beta}\ n\, G_n^T(\beta)\, C_n^{-1}(\beta)\, G_n(\beta), \tag{2}
\]
where
\[
C_n(\beta) = \frac{1}{n} \sum_{i=1}^{n} g_i(\beta)\, g_i^T(\beta).
\]
As a result, for any $\mathbf{x} \in [0, 1]^d$, the estimator of the unknown components in (1) is given by
\[
\tilde{\alpha}_0 = \tilde{\beta}_0, \qquad
\tilde{\alpha}_l\big(x^{(l)}\big) = \sum_{j=1}^{J_n} \tilde{\beta}_{lj} B_{lj}\big(x^{(l)}\big), \quad l = 1, \ldots, d,
\]
and
\[
\tilde{\alpha}(\mathbf{x}) = \tilde{\beta}_0 + \sum_{l=1}^{d} \tilde{\alpha}_l\big(x^{(l)}\big). \tag{3}
\]

Using a spline basis we convert a problem in the continuum to one that is governed by only

a finite number of parameters. For any given set of spline bases, eβ can be obtained using the

Newton-Raphson algorithm developed in Qu et al. (2000). QIF estimation through spline basis

expansion produces functional estimators that achieve the optimal nonparametric rate of

convergence, as shown in Theorem 1. This estimator can be used in its own right, or as the initial value for

the iterative scheme developed in the next subsection for simultaneous variable selection and

functional estimation.
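To make the construction of $G_n$, $C_n$, and $Q_n$ concrete, here is a minimal numerical sketch for a simplified special case: identity link with unit variance function (so $\Delta_i$ and $A_i$ are identity matrices) and $K = 1$ with the exchangeable basis matrix $M_1$. The design matrices, true coefficients, and dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, p = 50, 4, 3                       # clusters, cluster size, basis dimension
B = rng.uniform(size=(n, m, p))          # B_i: rows play the role of B(x_ij)^T
beta_true = np.array([1.0, -0.5, 2.0])
Y = B @ beta_true + rng.normal(size=(n, m))

I = np.eye(m)
M1 = np.ones((m, m)) - I                 # exchangeable basis matrix

def qif(beta):
    """Q_n(beta) = n G_n^T C_n^{-1} G_n for identity link, Delta_i = A_i = I."""
    g = np.zeros((n, 2 * p))
    for i in range(n):
        r = Y[i] - B[i] @ beta           # residual Y_i - mu_i(beta)
        g[i, :p] = B[i].T @ r            # estimating equation from basis I
        g[i, p:] = B[i].T @ (M1 @ r)     # estimating equation from basis M_1
    G = g.mean(axis=0)                   # G_n(beta)
    C = g.T @ g / n                      # C_n(beta)
    return n * G @ np.linalg.solve(C, G)

print(qif(beta_true) < qif(beta_true + 1.0))  # Q_n is smaller near the truth
```

In practice $\tilde{\beta}$ would be found by Newton-Raphson as in Qu et al. (2000); the sketch only evaluates the objective.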

To study the rates of convergence of $\tilde{\alpha}_0$, $\tilde{\alpha}_l$, and $\tilde{\alpha}$, we first introduce some notation and present regularity conditions. We assume equal cluster sizes ($m_i = m < \infty$), and that $(\mathbf{Y}_i, \mathbf{X}_i)$, $i = 1, \ldots, n$, are i.i.d. copies of $(\mathbf{Y}, \mathbf{X})$, with $\mathbf{Y} = (Y_1, \ldots, Y_m)^T$ and $\mathbf{X} = \big(\mathbf{X}_1^T, \ldots, \mathbf{X}_m^T\big)^T$. The asymptotic results still hold for data with unequal cluster sizes $m_i$, using the cluster-specific transformation discussed later in Subsection 3.5. For any matrix $A$, $\|A\|$ denotes the modulus of the largest singular value of $A$. To prove the theoretical results, we need the following assumptions:

(C1) The covariates $\mathbf{X} = \big(\mathbf{X}_1^T, \ldots, \mathbf{X}_m^T\big)^T$ are compactly supported, and without loss of generality each $\mathbf{X}_j$ has support $\chi = [0, 1]^d$. The density of $\mathbf{X}_j$, denoted by $f_j(\mathbf{x})$, is absolutely continuous, and there exist constants $c_1$ and $c_2$ such that $0 < c_1 \le \min_{\mathbf{x} \in \chi} f_j(\mathbf{x}) \le \max_{\mathbf{x} \in \chi} f_j(\mathbf{x}) \le c_2 < \infty$ for all $j = 1, \ldots, m$.


(C2) For each $l = 1, \ldots, d$, $\alpha_l(\cdot)$ has $r$ continuous derivatives for some $r \ge 2$.

(C3) Let $\mathbf{e} = \mathbf{Y} - \mu^0(\mathbf{X})$. Then $\Sigma = E\,\mathbf{e}\mathbf{e}^T$ is positive definite, and for some $\delta > 0$, $E\|\mathbf{e}\|^{2+\delta} < +\infty$.

(C4) The knot sequences $\upsilon_l = \big\{0 = \upsilon_{l,0} \le \upsilon_{l,1} \le \cdots \le \upsilon_{l,N_n} \le \upsilon_{l,N_n+1} = 1\big\}$, $l = 1, \ldots, d$, are quasi-uniform; that is, there exists $c_3 > 0$ such that
\[
\max_{l=1,\ldots,d} \frac{\max\big\{\upsilon_{l,j+1} - \upsilon_{l,j},\ j = 0, \ldots, N_n\big\}}{\min\big\{\upsilon_{l,j+1} - \upsilon_{l,j},\ j = 0, \ldots, N_n\big\}} \le c_3.
\]
Furthermore, the number of interior knots $N_n \asymp n^{1/(2r+1)}$, where $\asymp$ means both sides have the same order. In particular, let $h = 1/N_n \asymp n^{-1/(2r+1)}$.

(C5) Let $\Gamma^{(k)}_0 = \Delta_0^T V^{(k)}_0 \Sigma V^{(k)}_0 \Delta_0$, where $V^{(k)}_0 = A_0^{-1/2} M_k A_0^{-1/2}$, and $\Delta_0$, $A_0$ are evaluated at $\mu = \mu^0(\mathbf{X})$. Then, for sufficiently large $n$, the eigenvalues of $E\big(\Gamma^{(k)}_0\big)$ are bounded away from 0 and infinity for every $k = 0, \ldots, K$.

(C6) There exists a positive constant $c_4$ such that $0 < c_4 \le \inf_{i,j} V(\mu_{ij}) \le \sup_{i,j} V(\mu_{ij}) < \infty$. The functions $V(\cdot)$ and $g^{-1}(\cdot)$ have bounded second derivatives.

(C7) Let $M = \big(M_0^T, \ldots, M_K^T\big)^T$. Assume the moduli of the singular values of $M$ are bounded away from 0 and infinity.

These conditions are common in the polynomial spline estimation literature. Assumptions simi-

lar to (C1)-(C4) are also considered in Huang (1998), Huang (2003) and Xue (2009). In particular,

the smoothness condition in (C2) determines the rate of convergence of the spline estimators eα0, eαl ,

and eα. Conditions (C5) and (C6) are similar to assumptions (A3) and (A4) in He et al. (2005), and

can be easily verified for a broad family of distributions. Condition (C7) is needed for the weighting

matrix Cn in (2) so it maintains positive definite asymptotically. This holds for the choice of basis

matrices of exchangeable and AR-1 correlation structures discussed in this subsection.

Theorem 1. Under conditions (C1)-(C7), there exists a local minimizer of (2) such that
\[
|\tilde{\alpha}_0 - \alpha_0| = O_p\big(n^{-r/(2r+1)}\big),
\]
\[
\max_{1 \le l \le d}\ \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{m_i} \big\{\tilde{\alpha}_l\big(x^{(l)}_{ij}\big) - \alpha_l\big(x^{(l)}_{ij}\big)\big\}^2 = O_p\big(n^{-2r/(2r+1)}\big).
\]


For $m_i = m = 1$, this reduces to the special case where the responses are i.i.d. The rate of

convergence given here is the same optimal rate as that obtained for polynomial spline regression for

independent data (Huang 1998, Xue 2009). The main advantage of the QIF approach is that it incor-

porates within-cluster correlation by optimally combining estimating equations without estimating

the correlation parameters. However, when there are redundant predictor variables, it fails to pro-

duce efficient and accurate estimators due to model complexity. In the next subsection, we propose

the penalized QIF method which automatically sets small estimated functions to zero and selects a

parsimonious model.

3.2 Penalized QIF for Marginal Generalized Additive Models

We propose a new variable selection approach by penalizing the quadratic inference function in

(2). This enables one to perform simultaneous estimation and variable selection in the generalized

additive model. The proposed method is able to shrink small components of estimated functions to

zero, thus performing variable selection. In addition, it produces efficient estimators of the non-zero

components by taking the within-cluster correlation into consideration. The penalized quadratic

inference function estimator is defined as
\[
\hat{\beta} = \arg\min_{\beta}\ \Big\{Q_n(\beta) + n \sum_{l=1}^{d} p_{\lambda_n}\big(\|\beta_l\|_{K_l}\big)\Big\}, \tag{4}
\]

in which $p_{\lambda_n}(\cdot)$ is a given penalty function depending on a tuning parameter $\lambda_n$, and $\|\beta_l\|^2_{K_l} = \beta_l^T K_l \beta_l$, where
\[
K_l = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{m_i} \sum_{j=1}^{m_i} \mathbf{B}_l\big(X^{(l)}_{ij}\big)\, \mathbf{B}_l^T\big(X^{(l)}_{ij}\big),
\]
with $\mathbf{B}_l(\cdot) = \big(B_{l1}(\cdot), \ldots, B_{lJ_n}(\cdot)\big)^T$. Note that
\[
\|\beta_l\|_{K_l} = \bigg\{\frac{1}{n} \sum_{i=1}^{n} \frac{1}{m_i} \sum_{j=1}^{m_i} s_l^2\big(X^{(l)}_{ij}\big)\bigg\}^{1/2} = \|s_l\|_n,
\]
which is the empirical norm of the spline function $s_l$. The reason we choose to penalize $\|s_l\|_n$ is that it does not depend on a particular choice of spline bases and, intuitively, shrinking $\|s_l\|_n$ to 0 entails that $s_l$ is 0 almost surely. Note that the proposed penalization strategy achieves the same effect as the group-wise model selection approach (Yuan and Lin, 2006), since each additive component, containing many coefficients, is treated as a whole group in model selection.
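The identity $\|\beta_l\|_{K_l} = \|s_l\|_n$ can be checked numerically. The sketch below uses a hypothetical monomial basis in place of spline bases, which does not affect the algebra, and assumes equal cluster sizes for simplicity.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, Jn = 30, 5, 4
X = rng.uniform(size=(n, m))          # values X_ij^(l) for one component l

def Bl(x):
    """Toy basis B_l(x) = (x, x^2, ..., x^Jn)^T, a stand-in for splines."""
    return np.column_stack([x**j for j in range(1, Jn + 1)])

# K_l = (1/n) sum_i (1/m_i) sum_j B_l(X_ij) B_l(X_ij)^T, with m_i = m here
Phi = Bl(X.ravel())                   # all n*m basis evaluations, stacked
K = Phi.T @ Phi / (n * m)

beta = rng.normal(size=Jn)
s_vals = Phi @ beta                   # s_l(X_ij) = B_l(X_ij)^T beta_l
norm_K = np.sqrt(beta @ K @ beta)     # ||beta_l||_{K_l}
norm_emp = np.sqrt(np.mean(s_vals**2))  # empirical norm ||s_l||_n
print(np.isclose(norm_K, norm_emp))
```

Because the equality is algebraic, it holds exactly for any basis and any coefficient vector.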


There are a number of choices for the penalty function $p_{\lambda_n}(\cdot)$. For linear models, the $L_1$ penalty $p_{\lambda_n}(|\cdot|) = \lambda_n |\cdot|$ yields a LASSO-type estimator, and the $L_2$ penalty $p_{\lambda_n}(|\cdot|) = \lambda_n |\cdot|^2$ gives a ridge-type estimator. Here we choose the smoothly clipped absolute deviation (SCAD; Fan and Li, 2001) penalty, which is defined through the derivative of $p_{\lambda_n}$,
\[
p'_{\lambda_n}(\theta) = \lambda_n \bigg\{I\big(\theta \le \lambda_n\big) + \frac{(a\lambda_n - \theta)_+}{(a-1)\lambda_n}\, I\big(\theta > \lambda_n\big)\bigg\},
\]
where $a$ is a constant, usually set to $a = 3.7$ (Fan and Li 2001, Xue 2009), and $\lambda_n > 0$ is a tuning parameter. After obtaining the estimator $\hat{\beta}$ through the penalization in (4), for any given $\mathbf{x} \in [0, 1]^d$, an estimator of the unknown components in (1) is given by
\[
\hat{\alpha}_0 = \hat{\beta}_0, \qquad
\hat{\alpha}_l\big(x^{(l)}\big) = \sum_{j=1}^{J_n} \hat{\beta}_{lj} B_{lj}\big(x^{(l)}\big), \quad l = 1, \ldots, d.
\]
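The SCAD derivative can be transcribed directly from its definition; the sketch below does so, with $a = 3.7$ exposed as an argument and the evaluation points chosen arbitrarily for illustration.

```python
import numpy as np

def scad_deriv(theta, lam, a=3.7):
    """Derivative p'_lambda(theta) of the SCAD penalty, for theta >= 0:
    constant rate lam for theta <= lam, linearly decaying on (lam, a*lam],
    and exactly 0 beyond a*lam."""
    theta = np.asarray(theta, dtype=float)
    return lam * (
        (theta <= lam)
        + np.clip(a * lam - theta, 0.0, None) / ((a - 1) * lam) * (theta > lam)
    )

lam = 0.5
print(scad_deriv([0.2, 0.5, 1.0, 2.0], lam))
```

The vanishing derivative for large arguments is what leaves large components unpenalized and yields the oracle property discussed below.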

Fan and Li (2001) established the asymptotic properties of the regression parameter estimator using SCAD for the linear model. In the following, we establish the asymptotic properties of the estimators of the nonparametric components for the marginal generalized additive model. We write
\[
\eta^0(\mathbf{x}_{ij}) = \alpha_0 + \sum_{l=1}^{d} \alpha_l\big(x^{(l)}_{ij}\big)
= \alpha_0 + \sum_{l=1}^{s} \alpha_l\big(x^{(l)}_{ij}\big) + \sum_{l=s+1}^{d} \alpha_l\big(x^{(l)}_{ij}\big),
\]
where, without loss of generality, $\alpha_l = 0$ almost surely for $l = s+1, \ldots, d$, and $s$ is the total number of nonzero function components. We first show that the penalized QIF estimators $\{\hat{\alpha}_l\}_{l=1}^{d}$ achieve the same rate of convergence as the unpenalized ones $\{\tilde{\alpha}_l\}_{l=1}^{d}$. Furthermore, we prove that the penalized estimators $\{\hat{\alpha}_l\}_{l=1}^{d}$ of Theorem 2 possess the sparsity property; that is, $\hat{\alpha}_l = 0$ almost surely for $l = s+1, \ldots, d$. The sparsity property ensures that the proposed model selection is consistent: it selects the correct variables with probability tending to 1 as the sample size goes to infinity.

Theorem 2. Under the same assumptions as Theorem 1, if the tuning parameter $\lambda_n \to 0$, then there exists a local minimizer of (4) such that
\[
|\hat{\alpha}_0 - \alpha_0| = O_p\big(n^{-r/(2r+1)}\big),
\]
\[
\max_{1 \le l \le d}\ \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{m_i} \big\{\hat{\alpha}_l\big(x^{(l)}_{ij}\big) - \alpha_l\big(x^{(l)}_{ij}\big)\big\}^2 = O_p\big(n^{-2r/(2r+1)}\big).
\]

Theorem 3. Under the same assumptions as Theorem 1, if the tuning parameter $\lambda_n \to 0$ and $\lambda_n n^{r/(2r+1)} \to +\infty$, then, with probability approaching 1, $\hat{\alpha}_l = 0$ almost surely for $l = s+1, \ldots, d$.


Theorem 3 also implies that the above generalized additive model selection possesses the con-

sistency property. The results in Theorems 2 and 3 are similar to those for penalized least squares

established in Xue (2009). However, the theoretical proof is very different from the penalized least

squares approach, owing to the nonlinear structure of the problem. Therefore, a more thorough study of

the properties of quadratic inference functions in the infinite-dimensional case is indeed necessary

here to overcome these difficulties.

3.3 An Algorithm Using Local Quadratic Approximation

In this subsection, we provide an algorithm to minimize the penalized quadratic inference function defined in (4) using local quadratic approximation (Fan and Li, 2001). Let $\beta^0$ be an initial value close to the true minimizer of (4); for example, one could take the QIF estimator $\tilde{\beta}$ as the initial value. Denote by $\beta^k = \big(\beta^k_0, \beta^{kT}_1, \ldots, \beta^{kT}_d\big)^T$ the value at the $k$th iteration. If $\beta^k_l$ is close to zero, in the sense that $\|\beta^k_l\|_{K_l} \le \epsilon$ for some small threshold value $\epsilon$, then set $\beta^{k+1}_l$ to 0; in our implementation we used $\epsilon = 10^{-6}$. Without loss of generality, suppose $\beta^{k+1}_l = 0$ for $l = d_k + 1, \ldots, d$, and write
\[
\beta^{k+1} = \Big(\beta^{k+1}_0, \big(\beta^{k+1}_1\big)^T, \ldots, \big(\beta^{k+1}_{d_k}\big)^T, \big(\beta^{k+1}_{d_k+1}\big)^T, \ldots, \big(\beta^{k+1}_d\big)^T\Big)^T
= \Big(\beta^{k+1}_0, \big(\beta^{k+1}_{11}\big)^T, \big(\beta^{k+1}_{22}\big)^T\Big)^T,
\]
with $\beta^{k+1}_{22}$ containing the $d - d_k$ zero components. Let $\beta = \big(\beta_0, \beta^T_{11}, \beta^T_{22}\big)^T$ be the same partition of any $\beta$ of the same length as $\beta^{k+1}$.

We obtain a value for the non-zero component $\beta^{k+1}_{11}$ using the following quadratic approximation. For $\|\beta^k_l\|_{K_l} > \epsilon$, one can locally approximate the penalty function by
\[
p_{\lambda_n}\big(\|\beta_l\|_{K_l}\big)
\approx p_{\lambda_n}\big(\|\beta^k_l\|_{K_l}\big)
+ \frac{1}{2}\, p'_{\lambda_n}\big(\|\beta^k_l\|_{K_l}\big)\, \|\beta^k_l\|^{-1}_{K_l}
\big(\beta^T_l K_l \beta_l - \beta^{kT}_l K_l \beta^k_l\big),
\]
where $p'_{\lambda_n}$ is the first-order derivative of $p_{\lambda_n}$. Therefore, the objective function in (4) can be locally approximated (up to a constant) by the quadratic function
\[
Q_n\big(\beta^k\big) + \nabla Q_n\big(\beta^k\big)^T \big(\beta_{11} - \beta^k_{11}\big)
+ \frac{1}{2} \big(\beta_{11} - \beta^k_{11}\big)^T \nabla^2 Q_n\big(\beta^k\big) \big(\beta_{11} - \beta^k_{11}\big)
+ \frac{1}{2}\, n\, \beta^T_{11}\, \Lambda\big(\beta^k\big)\, \beta_{11},
\]


where $\nabla Q_n\big(\beta^k\big) = \partial Q_n\big(\beta^k\big)/\partial \beta_{11}$, $\nabla^2 Q_n\big(\beta^k\big) = \partial^2 Q_n\big(\beta^k\big)/\partial \beta_{11} \partial \beta^T_{11}$, and $\Lambda\big(\beta^k\big)$ is the block-diagonal matrix
\[
\Lambda\big(\beta^k\big) = \mathrm{diag}\Big\{p'_{\lambda_n}\big(\|\beta^k_1\|_{K_1}\big)\, \|\beta^k_1\|^{-1}_{K_1} K_1,\ \ldots,\ p'_{\lambda_n}\big(\|\beta^k_{d_k}\|_{K_{d_k}}\big)\, \|\beta^k_{d_k}\|^{-1}_{K_{d_k}} K_{d_k}\Big\}. \tag{5}
\]

One can obtain $\beta^{k+1}_{11}$ by minimizing the above quadratic function. Specifically,
\[
\beta^{k+1}_{11} = \beta^k_{11}
- \Big\{\nabla^2 Q_n\big(\beta^k\big) + n \Lambda\big(\beta^k\big)\Big\}^{-1}
\Big\{\nabla Q_n\big(\beta^k\big) + n \Lambda\big(\beta^k\big)\, \beta^k_{11}\Big\}.
\]

One can repeat the above iteration process until convergence is reached. Here we use the convergence

criterion $\big\{(\beta^{k+1} - \beta^k)^T (\beta^{k+1} - \beta^k)\big\}^{1/2} \le 10^{-6}$. In our experience, the algorithm

is very stable and fast to compute. It usually reaches a reasonable convergence tolerance

within a few iterations. However, the computational time increases as the number of covariates in-

creases. In one of our numerical examples in Section 4, it takes about 25 iterations on average to

converge when there are 100 covariates in the model.
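The update for $\beta^{k+1}_{11}$ is a ridge-adjusted Newton step. The sketch below applies it once, using hypothetical random stand-ins for $\nabla Q_n(\beta^k)$, $\nabla^2 Q_n(\beta^k)$, and $\Lambda(\beta^k)$, and checks that the update zeroes the gradient of the local quadratic surrogate it minimizes.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 6                              # sample size, dimension of beta_11
A = rng.normal(size=(p, p))
H = A @ A.T + p * np.eye(p)                # stand-in for the Hessian grad^2 Q_n
grad = rng.normal(size=p)                  # stand-in for the gradient grad Q_n
Lam = np.diag(rng.uniform(0.1, 1.0, p))    # stand-in for Lambda(beta^k)
beta_k = rng.normal(size=p)

# beta^{k+1}_{11} = beta^k_{11} - {H + n Lam}^{-1} {grad + n Lam beta^k_{11}}
beta_next = beta_k - np.linalg.solve(H + n * Lam, grad + n * Lam @ beta_k)

# Gradient of the quadratic surrogate vanishes at the updated point:
# grad + H (beta - beta^k) + n Lam beta = 0.
surrogate_grad = grad + H @ (beta_next - beta_k) + n * Lam @ beta_next
print(np.allclose(surrogate_grad, 0.0))
```

In the actual algorithm, $\Lambda$ is rebuilt from the SCAD derivative at each iteration and the thresholded components are dropped before the step is taken.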

3.4 Tuning Parameter Selection

It is important to select sensible tuning parameters in the implementation, since the performance of the proposed method depends critically on their choice. The QIF method in Subsection 3.1 requires an appropriate choice of the knot sequences $\{\upsilon_l\}_{l=1}^{d}$ for the spline approximation. For the penalized QIF method in Subsection 3.2, in addition to $\{\upsilon_l\}_{l=1}^{d}$, one also needs to choose the tuning parameter $\lambda_n$ in the SCAD penalty function.

For knot selection in the QIF method, we use equally spaced knots and select only the number of interior knots Nn. A similar strategy can be found in Huang, Wu and Zhou (2004), Xue and Yang (2006) and Xue (2009). For the penalized QIF, we use the same knot sequences selected in the QIF procedure and select only the tuning parameter λn for simplicity. For any given Nn, denote the solution which minimizes (2) by β̃_{Nn}. Similarly, for any given λn, denote the estimator which minimizes (4) by β̂_{λn}. Motivated by Qu and Li (2006) and Wang, Li and Tsai (2008), we use a consistent BIC procedure to select the optimal tuning parameters. Since the quadratic inference function Qn behaves equivalently to minus twice the log-likelihood function (Qu et al. 2000), we


define the BIC in the QIF procedure as

BIC₁(Nn) = Qn(β̃_{Nn}) + log(n)·DF_{Nn}/n,

where DF_{Nn} = 1 + dJn is the total number of unknown parameters in (2). One then selects the optimal knot number N̂n = argmin_{Nn} BIC₁(Nn). In our experience, only a small number of knots is needed; in the numerical examples in Section 4, the selected optimal knot numbers are in the range of 2 to 5. Similarly, for the penalized QIF procedure, we define

BIC₂(λn) = Qn(β̂_{λn}) + log(n)·DF_{λn}/n.

The effective degrees of freedom is defined as

DF_{λn} = trace{ [∇²Qn(β̂_{λn}) + nΛ(β̂_{λn})]^{−1} ∇²Qn(β̂_{λn}) },

where ∇²Qn(·) and Λ(·) are defined similarly as in (5) for the non-zero components of β̂_{λn}. Consequently, we choose the optimal λn at which the BIC value reaches its minimum, that is, λ̂n = argmin_{λn} BIC₂(λn).
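The BIC grid search above can be sketched generically. In the toy illustration below, a simple average Gaussian deviance log(RSS/n) stands in for Qn (the paper's Qn is the quadratic inference function, not this quantity), and polynomial degree stands in for the knot number or λn; all names are illustrative.

```python
import numpy as np

def bic_select(candidates, fit_fn, n):
    """Pick the tuning value minimizing BIC(c) = Q(c) + log(n) * df(c) / n,
    mirroring the form of BIC_1 and BIC_2; fit_fn(c) returns (Q value, df)."""
    scores = {}
    for c in candidates:
        q, df = fit_fn(c)
        scores[c] = q + np.log(n) * df / n
    return min(scores, key=scores.get), scores

# Illustrative stand-in: select a polynomial degree for noisy sine data.
rng = np.random.default_rng(1)
n = 300
x = rng.uniform(size=n)
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=n)

def fit_fn(degree):
    X = np.vander(x, degree + 1)                     # polynomial design matrix
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ coef) ** 2)
    return np.log(rss / n), degree + 1               # average deviance, df

best, scores = bic_select(range(1, 11), fit_fn, n)
```

The same wrapper applies unchanged when `fit_fn` instead fits the (penalized) QIF for a candidate knot number or λn and returns the corresponding Qn value and degrees of freedom.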

3.5 Unequal cluster sizes

When clusters have unequal sizes, we use cluster-specific transformation matrices to reformulate the data. That is, for the ith cluster with cluster size m_i, the m × m_i transformation matrix T_i is defined by deleting from the m × m identity matrix the columns indexed by the missing observations in the cluster, where m is the largest size of a fully observed cluster and is assumed to be bounded. Using the transformation matrices, we define Y*_i = T_i Y_i, ∆*_i(β) = T_i ∆_i(β), B*_i = T_i B_i, µ*_i(β) = T_i µ_i(β) and A*_i = T_i A_i for the ith cluster. The entries of Y*_i, ∆*_i(β), B*_i, µ*_i(β) and A*_i equal the observed values when the measurements are available, and 0 when the corresponding observations are missing. We substitute Y_i, ∆_i(β), B_i, µ_i(β) and A_i with Y*_i, ∆*_i(β), B*_i, µ*_i(β) and A*_i, respectively, in the previous procedures and in Lemmas 10 and 11 provided in the supplemental material. Following the arguments in the proofs of the theorems, it can be shown that the asymptotic results of Theorems 1, 2 and 3 still hold for correlated data with unequal cluster sizes, for both the parameter estimation and the variable selection procedures.
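The construction of T_i is simple to state in code. The sketch below builds T_i by column deletion and shows how it pads an incomplete cluster's response vector with zeros; the function name and example data are illustrative.

```python
import numpy as np

def cluster_transform(observed_idx, m):
    """Build the m x m_i matrix T_i by deleting from the m x m identity the
    columns indexed by the missing observations; observed_idx lists the
    (0-based) observed positions of cluster i."""
    return np.eye(m)[:, observed_idx]

m = 5                              # largest fully observed cluster size
obs = [0, 2, 3]                    # cluster i observed at positions 1, 3, 4
Ti = cluster_transform(obs, m)     # shape (5, 3)
yi = np.array([1.2, -0.4, 2.0])    # the m_i = 3 observed responses
y_star = Ti @ yi                   # length-m vector, zeros at missing slots
```

Applying the same multiplication to ∆_i(β), B_i, µ_i(β) and A_i yields the starred quantities used in the reformulated procedures.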


4 Simulation studies

In this section, we conduct simulation studies for both continuous and binary outcomes. The total averaged integrated squared error (TAISE) is evaluated to assess estimation efficiency. Let α̂^(r) be the estimator of a nonparametric function α in the r-th (1 ≤ r ≤ R) replication, and let {x_m}_{m=1}^{n_grid} be the grid points where α̂^(r) is evaluated. We define

AISE(α̂) = (1/R) Σ_{r=1}^R (1/n_grid) Σ_{m=1}^{n_grid} {α(x_m) − α̂^(r)(x_m)}²,

and TAISE = Σ_{l=1}^d AISE(α̂_l). Let S and S₀ be the selected and true index sets containing the relevant variables, respectively. We say S is correct if S = S₀; S overfits if S₀ ⊂ S and S₀ ≠ S; and S underfits if S₀ ⊄ S. In all of the simulation studies, the number of replications is 100.
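The AISE and TAISE criteria above translate directly into code. The sketch below evaluates them on a grid for lists of fitted functions; the toy "estimates" offset by ±0.1 are purely illustrative.

```python
import numpy as np

def aise(alpha, alpha_hats, grid):
    """Average over replications of the mean squared error of the function
    estimates on a fixed grid; alpha_hats is one fitted callable per replication."""
    errs = [np.mean((alpha(grid) - ah(grid)) ** 2) for ah in alpha_hats]
    return np.mean(errs)

def taise(alphas, alpha_hats_by_component, grid):
    # Sum of the component-wise AISEs
    return sum(aise(a, ahs, grid) for a, ahs in zip(alphas, alpha_hats_by_component))

grid = np.linspace(0, 1, 101)
true1 = lambda x: 2 * x - 1
# two mock "replications" whose estimates are off by constants +0.1 and -0.1
hats1 = [lambda x: 2 * x - 0.9, lambda x: 2 * x - 1.1]
val = aise(true1, hats1, grid)          # (0.1**2 + 0.1**2) / 2 = 0.01
total = taise([true1], [hats1], grid)   # single component, so equals val
```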

4.1 Example 1: Continuous Response and Moderate Dimension of Covariates

In this example, the continuous responses {Y_ij} are generated from

Y_ij = Σ_{l=1}^d α_l(X^(l)_ij) + ε_ij,  i = 1, …, n,  j = 1, …, 5,   (6)

where d = 10 and the number of clusters n = 100, 250, or 500. The additive functions are

α₁(x) = 2x − 1,  α₂(x) = 8(x − 0.5)³,  α₃(x) = sin(2πx),

and α_l(x) = 0 for l = 4, …, 10. Thus the last 7 variables in this model are null variables and do not contribute to the model. The covariates X_ij = (X^(1)_ij, …, X^(10)_ij)^T are generated independently from Uniform([0,1]^10). The error ε_i = (ε_i1, …, ε_i5)^T follows a multivariate normal distribution with mean 0, a common marginal variance σ² = 1 and an exchangeable correlation structure with correlation ρ = 0.7.
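The data-generating design of this example can be sketched as follows. This is a reconstruction of the stated design, not the authors' simulation code; the function name and seed are illustrative.

```python
import numpy as np

def simulate_example1(n, rho=0.7, m=5, d=10, seed=0):
    """Generate data from model (6): Uniform[0,1]^10 covariates, the first
    three additive components non-zero, and exchangeable N(0,1) errors with
    within-cluster correlation rho."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(size=(n, m, d))
    # exchangeable correlation matrix with unit marginal variance
    Sigma = rho * np.ones((m, m)) + (1 - rho) * np.eye(m)
    eps = rng.multivariate_normal(np.zeros(m), Sigma, size=n)
    f = (2 * X[..., 0] - 1) + 8 * (X[..., 1] - 0.5) ** 3 + np.sin(2 * np.pi * X[..., 2])
    return X, f + eps

X, Y = simulate_example1(n=100)
```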

We apply the penalized QIF with the SCAD penalty (SCAD) and fit both linear splines (p = 1) and cubic splines (p = 3). To illustrate the effect of incorporating within-cluster correlation on estimation efficiency, we compare the estimation efficiency of using basis matrices from different working correlation structures: exchangeable (EC), AR-1 and independent (IND). In addition, we compare the penalized QIF approach with the QIF estimations of a full model (FULL) and an oracle model (ORACLE). Here the full model consists of all ten variables, and the oracle model contains only the first three relevant variables. The oracle model is available only in simulation studies, where the true information is known.

Table 1 summarizes the estimation results of all procedures. It shows that the estimation procedures with the correct EC working correlation have the smallest TAISEs, and therefore the estimators are more efficient than their counterparts with the IND working correlation, which ignores within-cluster correlation. For example, the efficiency gained by incorporating correlation can be twofold in the cubic spline approach. Estimation based on a misspecified AR-1 correlation structure leads to some efficiency loss compared to using the true EC structure, but it is still more efficient than assuming an independent structure. In general, estimation using cubic splines is more efficient than estimation using linear splines. Furthermore, with the same type of working correlation, ORACLE has the smallest TAISEs compared to the SCAD and FULL model approaches. This is not surprising, since ORACLE takes advantage of the true information from the data generation and contains only the three relevant variables. However, the efficiency of SCAD is close to that of ORACLE, with much smaller TAISEs than those of the FULL model.

For one selected simulated data set with n = 250, Figure 1 plots the first four estimated functional components from the SCAD, FULL and ORACLE models using linear splines and the exchangeable working correlation. Note that for the fourth variable, both the true and the SCAD-estimated component functions are zero. This indicates that the proposed method estimates the unknown functions reasonably well.

Table 2 provides variable selection results for the SCAD procedures. It shows that for the EC, AR-1 and IND working correlations, the percentage of correct model fitting approaches 1 quickly as the sample size increases. This confirms the consistency of variable selection established in subsection 3.2. Furthermore, for the cubic spline approach, the correct model-fitting percentage using the EC correlation structure is noticeably higher than that using IND when the sample size is moderate. Lastly, we report the computing time of the algorithm proposed in subsection 3.3. For SCAD with p = 1 and EC working correlation, it takes 0.5, 1.2, and 3.8 seconds, respectively, to complete one run of the simulation for n = 100, 250, and 500. The computing times for cubic splines and the other correlation structures are similar.

4.2 Example 2: Continuous Response with High Dimension of Covariates

To assess our method in more challenging, high-dimensional cases, we consider a model with the dimension of functional components d = 100 in (6). Only the first three variables are relevant, and they take the same functional forms as in Example 1. We consider model (6) with the number of clusters n = 200 and cluster size 5, with errors {ε_i = (ε_i1, …, ε_i5)^T}_{i=1}^{200} generated independently from a multivariate normal distribution with mean 0 and an AR-1 correlation structure with corr(ε_ij₁, ε_ij₂) = 0.7^{|j₁−j₂|} for 1 ≤ j₁, j₂ ≤ 5.
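The AR-1 error structure 0.7^{|j₁−j₂|} is easy to build explicitly. The sketch below constructs that correlation matrix and draws the corresponding errors; names are illustrative.

```python
import numpy as np

def ar1_corr(m, rho):
    """AR-1 correlation matrix with (j1, j2) entry rho^{|j1 - j2|}."""
    idx = np.arange(m)
    return rho ** np.abs(idx[:, None] - idx[None, :])

R = ar1_corr(5, 0.7)
rng = np.random.default_rng(2)
# 200 independent clusters of AR-1 correlated N(0, 1) errors
eps = rng.multivariate_normal(np.zeros(5), R, size=200)
```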

We apply the linear spline QIF to estimate the full (FULL) and oracle (ORACLE) models with independent (IND), exchangeable (EC) or AR-1 working correlation. For variable selection, we consider the penalized linear spline QIF with the SCAD penalty and basis matrices from the IND, EC or AR-1 working correlations. Table 3 reports the TAISEs for FULL, ORACLE and SCAD, and the variable selection results (correct, overfit and underfit percentages) of the SCAD approach for both IND and AR-1 working correlations. Table 3 clearly indicates that the improvement from incorporating within-cluster correlation is very significant. In particular, the estimation procedures with a correctly specified AR-1 structure always give smaller TAISEs than those with a misspecified EC or IND working correlation. For variable selection, SCAD with the AR-1 working correlation also performs noticeably better than SCAD with the EC or independent working correlation. Furthermore, Table 3 shows that the SCAD procedure dramatically improves estimation accuracy in this high-dimensional case, with TAISEs less than one tenth of those from the FULL model. Finally, it takes 19.8 seconds to compute SCAD with p = 1 and EC working correlation in one run of the simulation; the computing times for the other correlation structures are similar. Compared with Example 1, however, the computational burden increases dramatically as the number of covariates increases.


4.3 Example 3: Binary Response

To assess the performance of our method for binary outcomes, we generate a random sample of 250 clusters in each simulation run. Within each cluster, binary responses {Y_ij}_{j=1}^{20} are generated from the marginal logit model

logit P(Y_ij = 1 | X_ij = x_ij) = Σ_{l=1}^{10} α_l(x^(l)_ij),

where α₁(x) = {exp(x + 1) − (exp(2) − exp(1))}/16, α₂(x) = cos(2πx)/4, α₃(x) = x(1 − x) − 1/6, α₄(x) = 2(x − 0.5)³, and the remaining 6 covariates are null variables with α_l(x) = 0 for l = 5, …, 10. The covariates X_ij = (X^(1)_ij, …, X^(10)_ij)^T are generated independently from Uniform([0,1]^10). The algorithm of Macke et al. (2008) is applied to generate correlated binary responses with an exchangeable correlation structure with correlation parameter 0.5.
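The Macke et al. (2008) construction thresholds a latent multivariate Gaussian whose correlation is calibrated so the *binary* correlation hits its target. The sketch below shows the idea in the special case of a constant marginal probability of 0.5, where the calibration has the closed form corr_binary = (2/π) arcsin(ρ_latent); the simulation itself uses covariate-dependent probabilities from the logit model, so this is an illustration, not a reproduction.

```python
import numpy as np

def correlated_bernoulli_half(latent_corr, m, size, seed=0):
    """Exchangeable correlated Bernoulli(0.5) draws obtained by thresholding an
    exchangeable latent Gaussian at 0 (a dichotomized-Gaussian construction)."""
    rng = np.random.default_rng(seed)
    Sigma = latent_corr * np.ones((m, m)) + (1 - latent_corr) * np.eye(m)
    Z = rng.multivariate_normal(np.zeros(m), Sigma, size=size)
    return (Z > 0).astype(int)

target = 0.5                            # target binary correlation
latent = np.sin(np.pi * target / 2)     # invert corr = (2/pi) * arcsin(latent)
Y = correlated_bernoulli_half(latent, m=20, size=4000)
pair_corr = np.corrcoef(Y[:, 0], Y[:, 1])[0, 1]   # close to the 0.5 target
```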

We apply the penalized QIF with the SCAD penalty and linear splines (SCAD) for variable selection, and the linear spline QIF for estimation of the full (FULL) and oracle (ORACLE) models. To illustrate how different working correlations affect the estimation and variable selection results, we consider minimizing (2) and (4) using the AR-1 and independent (IND) working correlation structures, in addition to the true exchangeable (EC) correlation structure.

Table 4 gives the TAISEs for the SCAD, ORACLE and FULL models with the three working correlations. As in the continuous simulation study, estimation based on the correctly specified exchangeable correlation structure has the smallest TAISEs. For the SCAD approach, the efficiency is almost doubled when the correct correlation information is incorporated. Estimation based on the misspecified AR-1 correlation structure leads to some efficiency loss compared to using the true structure, but it is still more efficient than assuming an independent structure. The GEE, however, could behave differently if the true exchangeable correlation were misspecified as AR-1: GEE requires estimating the correlation parameter ρ under the misspecified AR-1 structure, and the resulting estimator of ρ may not be valid, so the estimator under the misspecified correlation structure could be less efficient than the one assuming independence.

Furthermore, as in the previous study, the TAISEs of the SCAD approach are close to those of ORACLE and much smaller than those of the FULL model; the relative efficiency ratio between the SCAD and FULL model approaches is close to 2. This shows that the SCAD approach gains estimation accuracy significantly by effectively removing the zero-component variables. Table 5 gives the frequency of appearance of each of the 10 variables in the selected model. Overall, the SCAD procedures work reasonably well, and SCAD with the EC and AR-1 working correlation structures provides better variable selection results than SCAD with the IND working structure. Additionally, the SCAD approach with the EC working correlation performs best in selecting the non-zero component variables (X₁, …, X₄). Note that model selection assuming the IND working structure performs poorly in selecting variable X₄.

For one simulation run, Figure 2 plots the first four estimated functional components from the SCAD approach with the three different working correlations. For the fourth functional component, SCAD with the IND structure does not estimate the non-zero component correctly, since it apparently shrinks all coefficients of the functional form to zero. In general, estimation assuming the IND structure gives poorer results than estimation assuming the EC or AR-1 working correlations, for which the SCAD approach provides reasonable estimates of the unknown functions. Lastly, it takes 9.7 seconds to compute SCAD with p = 1 and EC working correlation in one run of the simulation. Overall, the proposed algorithm is fast to compute.

5 Real data analysis

To illustrate the proposed method, we apply it to an observational study of periodontal disease (Stoner, 2000). The partial data set consists of 528 patients with chronic periodontal disease who had an initial periodontal exam during 1988-1992. Each patient was followed annually for five years after the initial exam. One of the study goals is to identify risk factors that influence tooth loss, in order to improve the treatment of periodontal disease. Let the binary response Y_ij = 1 if patient i has at least one tooth extraction in the jth year, and Y_ij = 0 otherwise. The covariates of interest include age at the time of the initial exam (Age), number of teeth present at the initial exam (Teeth), number of diseased sites (Sites), mean pocket depth in diseased sites (Pddis), mean pocket depth in all sites (Pdall), number of non-surgical periodontal procedures in a year (Nonsurg), number of non-periodontal dental treatments in a year (Dent), and date of initial exam in fractional years (Entry). The last variable (Entry) should clearly be unrelated to the response; it is included so we can verify whether the model selection procedure excludes it as expected.

To model the binary responses Y_ij, we consider a marginal generalized additive model with the logit link in (1). We apply the proposed penalized QIF with the SCAD penalty, fitting both linear and cubic splines to select relevant variables, in addition to the QIF generalized additive model with linear splines based on the full model with all eight covariates. In both procedures, we use the exchangeable working correlation. Procedures with other types of working correlation were also considered and led to similar conclusions, so they are not reported here.

Figure 3 plots the estimated function components from the SCAD approach with linear splines (dashed) and cubic splines (dot-dash), and the QIF estimation using the full model (solid). Figure 3 shows that for both SCAD procedures, the selected relevant covariates are Age, Sites, Pdall and Dent. The others are not selected, since their function components are shrunk to zero by the penalized QIF procedures. Note that the variable Entry is not selected, as expected. Furthermore, we also calculate 95% point-wise confidence intervals for the estimated component functions using the QIF procedure with linear splines, based on 500 bootstrapped samples. For the four variables which are not selected, the 95% bootstrap confidence intervals contain the zero line completely. This confirms our variable selection findings based on the penalized QIF procedure.

6 Discussion

We propose efficient estimation and model selection procedures for generalized additive models when the responses are correlated. We apply a SCAD penalty to select the additive functional forms, which allows the functional components to be selected groupwise. We incorporate correlation from the clustered data by optimally combining moment conditions which contain the correlation information. The advantage of this approach is that it does not require explicit estimation of the correlation parameters; this improves estimation efficiency significantly and saves computational cost. We found, however, that the method can be quite computationally intensive in high-dimensional variable selection settings, since the dimension of the parameters involved in the nonparametric forms increases significantly compared to parametric model selection settings.

We show that the nonparametric estimator is consistent with the optimal L₂-norm convergence rate n^{−r/(2r+1)}, the same optimal L₂-norm rate as in nonparametric models for independent and continuous data. In addition, we show that the model selection approach is consistent; that is, the probability of selecting the correct model containing the non-zero functional forms tends to 1.

Another advantage of our approach is that estimation and model selection are achieved simultaneously, in contrast to other model selection procedures such as AIC or BIC. Our simulation studies show that the proposed estimator recovers a significant amount of efficiency by performing model selection: the total averaged integrated squared error of the SCAD model selection approach is much closer to that of the true ORACLE model, and is smaller than that of the full model. In addition, we show that estimation is more accurate when the true correlation structure is taken into consideration.

Finally, the theoretical techniques for deriving the asymptotic properties of the proposed estimator are quite different from the existing nonparametric approaches of which we are aware. We should also mention that, because there is no closed form of the likelihood function, the numerical integration technique used to obtain a maximum likelihood estimator in settings with a small number of parameters is not applicable here, since the parameters involved in each functional form are infinite-dimensional. In addition, the conventional penalized least squares approach for continuous data is not applicable to correlated categorical response data, since it could result in biased estimation and inconsistent model selection.

Appendix: Proofs of theorems

The necessary lemmas for the following proofs are given in the supplemental materials. We first

introduce some notation.


For any real-valued function f on [0, 1], let ‖f‖_∞ = sup_{x∈[0,1]} |f(x)|, and let ‖·‖₂ be the usual L₂ norm for functions and vectors. Let L₂([0,1]) be the space of all square-integrable functions defined on [0,1]. For any α₁, α₂ ∈ L₂([0,1]), let

⟨α₁, α₂⟩ = ∫₀¹ α₁(x)α₂(x) dx,  and  ‖α‖₂² = ∫₀¹ α²(x) dx.   (A.1)

Let H₀ be the space of square-integrable constant functions on [0,1], and let L₂⁰ be the space of square-integrable functions on [0,1] orthogonal to H₀; that is, L₂⁰ = {α : ⟨α, 1⟩ = 0, α ∈ L₂}, where 1 is the identity function defined on [0,1]. Define the additive model space M as

M = { α(x) = α₀ + Σ_{l=1}^d α_l(x_l); α₀ ∈ H₀, α_l ∈ L₂⁰ }.

Then the regression function η(x) in (1) of the main text is modeled as an element of M. Define the polynomial spline approximation space M_n as

M_n = { s(x) = s₀ + Σ_{l=1}^d s_l(x_l); s₀ ∈ H₀, s_l ∈ φ⁰_{n,l} },

in which φ⁰_{n,l} = {s_l(·) : s_l ∈ φ_l, ⟨s_l, 1⟩ = 0} is the centered polynomial spline space. Note that the definitions of M and M_n do not depend on the constraints α_l ∈ L₂⁰ and s_l ∈ φ⁰_{n,l}; we impose these constraints to ensure that the functions in M and M_n have a unique additive decomposition almost surely. Furthermore, for any α ∈ M, define the two norms

‖α‖² = E{α^T(X)α(X)}  and  ‖α‖²_n = (1/n) Σ_{i=1}^n α^T(X_i)α(X_i),

which will be used later, where α(X) = (α(X₁), …, α(X_m))^T for X = (X₁^T, …, X_m^T)^T.

A.1 Proof of Theorem 1.

Lemma A.7 in the supplement and the triangle inequality give that, for each l = 1, …, d,

(1/n) Σ_{i=1}^n Σ_{j=1}^m {α̃_l(x^(l)_ij) − α_l(x^(l)_ij)}² ≤ (2/n) Σ_{i=1}^n Σ_{j=1}^m {B_l^T(x^(l)_ij)(β̃_l − β_{0l})}² + c h^{2r}.

Therefore, it is sufficient to show that

‖B_l^T(β̃_l − β_{0l})‖²_n = (1/n) Σ_{i=1}^n Σ_{j=1}^m {B_l^T(x^(l)_ij)(β̃_l − β_{0l})}² = O_p((nh)^{−1}).   (A.2)

Note that Lemmas A.10 and A.11 in the supplement entail that, for any ε > 0, there exists a sufficiently large C > 0 such that, as n → ∞,

P( inf_{β: ‖B^T(β − β₀)‖_n = C(nh)^{−1/2}} Qn(β) > Qn(β₀) ) > 1 − ε.

Therefore

P( ‖B^T(β̃ − β₀)‖_n ≤ C(nh)^{−1/2} ) > 1 − ε,

which entails that ‖B^T(β̃ − β₀)‖_n = O_p((nh)^{−1/2}). Furthermore, Lemma A.5 in the supplement entails that for each l = 1, …, d there exists a constant C > 0 such that ‖B_l^T(β̃_l − β_{0l})‖_n ≤ C‖B^T(β̃ − β₀)‖_n = O_p((nh)^{−1/2}). Therefore, (A.2) is proven.

A.2 Proof of Theorem 2.

Let Ln(β) = Qn(β) + n Σ_{l=1}^d p_{λn}(‖β_l‖_{K_l}) be the objective function in (4). Let

β* = argmin_{β = (β₀, β₁^T, …, β_s^T, 0^T, …, 0^T)^T} Qn(β),

which is the spline QIF estimator of the first s function components when the remaining components are known to be zero. As a special case of Theorem 1, one has ‖B^T(β* − β₀)‖_n = O_p(1/√(nh)). We want to show that for large n and any ε > 0, there exists a constant C large enough such that

P( inf_{β: ‖B^T(β − β*)‖_n = C(nh)^{−1/2}} Ln(β) > Ln(β*) ) ≥ 1 − ε.   (A.3)

This implies that Ln(·) has a local minimizer β̂ in the ball {β : ‖B^T(β − β*)‖_n ≤ C(nh)^{−1/2}}, and thus ‖B^T(β̂ − β*)‖_n = O_p(1/√(nh)). Further, the triangle inequality gives

‖B^T β̂ − α‖_n ≤ ‖B^T(β̂ − β*)‖_n + ‖B^T(β* − β₀)‖_n + ‖B^T β₀ − α‖_n = O_p(1/√(nh) + h^r).

The theorem follows by noting that h = n^{−1/(2r+1)}.

To show (A.3), using p_{λn}(0) = 0 and p_{λn}(·) ≥ 0, one has

Ln(β) − Ln(β*) ≥ Qn(β) − Qn(β*) + n Σ_{l=1}^s { p_{λn}(‖β_l‖_{K_l}) − p_{λn}(‖β*_l‖_{K_l}) }.

By arguments similar to those in the proof of Theorem 2 in Xue (2009), if λn → 0, then for any β with ‖B^T(β − β*)‖_n = C(nh)^{−1/2}, one has ‖β_l‖_{K_l} ≥ aλn and ‖β*_l‖_{K_l} ≥ aλn for each l = 1, …, s, when n is large enough. Therefore, when n is large enough,

Σ_{l=1}^s { p_{λn}(‖β_l‖_{K_l}) − p_{λn}(‖β*_l‖_{K_l}) } = 0,

by the definition of the SCAD penalty function. Furthermore,

Qn(β) − Qn(β*) = (β − β*)^T ∇Qn(β*) + (1/2)(β − β*)^T ∇²Qn(β*)(β − β*){1 + o_p(1)},   (A.4)

with ∇Qn and ∇²Qn being the gradient vector and Hessian matrix of Qn, respectively. Following Qu et al. (2000) and Lemma A.9 in the supplement, for any β with ‖B^T(β − β*)‖_n = C(nh)^{−1/2}, one has

(β − β*)^T ∇Qn(β*) = n(β − β*)^T Ġn^T(β*) Cn^{−1}(β*) Gn(β*){1 + o_p(1)} ≍ C h^{−1},

and

(β − β*)^T ∇²Qn(β*)(β − β*) = n(β − β*)^T Ġn^T(β*) Cn^{−1}(β*) Ġn(β*)(β − β*){1 + o_p(1)} ≍ C² h^{−1},

where Ġn is the first-order derivative of Gn. Therefore, by choosing C large enough, the second term on the right-hand side of (A.4) dominates the first. Hence (A.3) holds when C and n are large enough. This completes the proof of Theorem 2.

A.3 Proof of Theorem 3.

Let Θ₁ = {β : β = (β₀, β₁^T, …, β_s^T, 0^T, …, 0^T)^T, ‖B^T(β − β₀)‖_n = O_p(1/√(nh))}. For l = s+1, …, d, define Θ_l = {β : β = (0, 0^T, …, 0^T, β_l^T, 0^T, …, 0^T)^T, ‖B^T β‖_n = O_p(1/√(nh))}.

It is sufficient to show that, uniformly for any β ∈ Θ₁ and β*_l ∈ Θ_l, Ln(β) ≤ Ln(β + β*_l) with probability tending to 1 as n → ∞. Note that, similarly to the proof of Theorem 2,

Ln(β + β*_l) − Ln(β) = Qn(β + β*_l) − Qn(β) + n p_{λn}(‖β*_l‖_{K_l})
= β*_l^T ∇Qn(β) + (1/2) β*_l^T ∇²Qn(β) β*_l {1 + o_p(1)} + n p_{λn}(‖β*_l‖_{K_l})
= n λn ‖B^T β*_l‖_n { Rn/λn + (p′_{λn}(w)/λn){1 + o_p(1)} },

where

Rn = { β*_l^T ∇Qn(β) + (1/2) β*_l^T ∇²Qn(β) β*_l } / ( n ‖B^T β*_l‖_n ) = O_p(1/√(nh)),

and w is a value between 0 and ‖β*_l‖_{K_l}. We complete the proof by observing that plim_{n→∞} Rn/λn = 0, while lim inf_{n→∞} lim inf_{w→0+} p′_{λn}(w)/λn = 1.

Acknowledgements

This work is supported in part by National Science Foundation grants. The authors thank Dr. Julie Stoner of the University of Oklahoma for providing the periodontal disease data for the data analysis. The paper is substantially improved thanks to the constructive suggestions and comments from two referees, an Associate Editor and the Editor.

Supplemental Materials

WebAppendix.pdf: The web appendix provides the necessary lemmas and their proofs to support the proofs of the theorems in the Appendix.


References

Berhane, K., and Tibshirani, R. J. (1998), "Generalized additive models for longitudinal data," Canadian Journal of Statistics, 26, 517-535.

de Boor, C. (2001), A Practical Guide to Splines, New York: Springer.

Fan, J., and Li, R. (2001), "Variable selection via nonconcave penalized likelihood and its oracle properties," Journal of the American Statistical Association, 96, 1348-1360.

Hastie, T. J., and Tibshirani, R. J. (1990), Generalized additive models, London: Chapman and Hall.

He, X., Fung, W. K., and Zhu, Z. (2005), "Robust estimation in generalized partial linear models for clustered data," Journal of the American Statistical Association, 100, 1176-1184.

He, X., and Shi, P. (1996), "Bivariate tensor-product B-splines in a partly linear model," Journal of

Multivariate Analysis, 58, 162-181.

Horowitz, J. L. (2001), "Nonparametric estimation and a generalized additive model with an un-

known link function," Econometrica, 69, 499-513.

Horowitz, J. L., and Mammen, E. (2007), "Rate-optimal estimation for a general class of nonpara-

metric regression models with unknown link functions," The Annals of Statistics, 35, 2589-2619.

Huang, J. Z. (1998), "Projection estimation in multiple regression with application to functional

ANOVA models," The Annals of Statistics, 26, 242-272.

Huang, J. Z. (2003), "Local asymptotics for polynomial spline regression," The Annals of Statistics,

31, 1600-1635.

Huang, J. Z., Wu, C. O., and Zhou, L. (2004), "Polynomial spline estimation and inference for

varying coefficient models with longitudinal data," Statistica Sinica, 14, 763-788.

Huang, J. Z., and Yang, L. (2004), "Identification of nonlinear additive autoregressive models,"

Journal of the Royal Statistical Society, Ser. B, 66, 463-477.

Huang, J., Horowitz, J. L., and Wei, F. (2010+), "Variable selection in nonparametric additive models," The Annals of Statistics, under revision.

Liang, K. Y., and Zeger, S. L. (1986), "Longitudinal data analysis using generalized linear models," Biometrika, 73, 13-22.

Linton, O. B., and Härdle, W. (1996), "Estimation of additive regression models with known links,"

Biometrika, 83, 529-540.

Macke, J. H., Berens, P., Ecker, A. S., Tolias, A. S., and Bethge, M. (2008), "Generating spike-trains

with specified correlation-coefficients," Neural Computation, 21, 397-423.

Meier, L., van de Geer, S., and Bühlmann, P. (2008), "The group Lasso for logistic regression,"

Journal of the Royal Statistical Society, Ser. B, 70, 53-71.

Meier, L., van de Geer, S., and Bühlmann, P. (2009), "High-dimensional additive modeling," The

Annals of Statistics, 37, 3779-3821.

Qu, A., Lindsay, B. G., and Li, B. (2000), "Improving generalised estimating equations using

quadratic inference functions," Biometrika, 87, 823-836.

Qu, A., and Li, R. (2006), "Quadratic inference functions for varying-coefficient models with longi-

tudinal data," Biometrics, 62, 379-391.

Stone, C. J. (1985), "Additive regression and other nonparametric models," The Annals of Statistics, 13, 689-705.

Stoner, J. A. (2000), "Analysis of clustered data: A combined estimating equations approach," Ph.D.

thesis, University of Washington.

Wang, H., Li, R., and Tsai, C. (2008), "Tuning parameter selectors for the smoothly clipped absolute

deviation method," Biometrika, 94, 553-568.

Wang, L., Li, H., and Huang, J. Z. (2008), "Variable selection in nonparametric varying-coefficient models for analysis of repeated measurements," Journal of the American Statistical Association, 103, 1556-1569.

Wang, N. (2003), "Marginal nonparametric kernel regression accounting for within-subject correlation," Biometrika, 90, 43-52.

Xue, L. (2009), "Variable selection in additive models," Statistica Sinica, 19, 1281-1296.

Xue, L., and Yang, L. (2006), "Additive coefficient modeling via polynomial spline," Statistica

Sinica, 16, 1423-1446.

Yuan, M., and Lin, Y. (2006), "Model selection and estimation in regression with grouped variables,"

Journal of the Royal Statistical Society, Ser. B, 68, 49-67.

Zhu, Z., Fung, W. K., and He, X. (2008), "On the asymptotics of marginal regression splines with

longitudinal data," Biometrika, 95, 907-917.


Table 1: Example 1: Continuous response and moderate dimension of covariates. The TAISEs (standard errors in parentheses) of SCAD, ORACLE, and FULL with exchangeable (EC), AR-1, or independent (IND) working correlation, using linear or cubic splines. The number of replications is 100.

                 n    Method   EC             AR-1           IND
Linear spline    100  SCAD     0.0385(.0017)  0.0468(.0019)  0.0496(.0018)
                      ORACLE   0.0263(.0004)  0.0311(.0005)  0.0377(.0007)
                      FULL     0.0522(.0010)  0.0649(.0012)  0.0776(.0015)
                 250  SCAD     0.0121(.0002)  0.0143(.0004)  0.0173(.0006)
                      ORACLE   0.0103(.0001)  0.0117(.0002)  0.0164(.0004)
                      FULL     0.0205(.0004)  0.0247(.0005)  0.0385(.0007)
                 500  SCAD     0.0093(.0001)  0.0095(.0001)  0.0119(.0002)
                      ORACLE   0.0085(.0001)  0.0089(.0001)  0.0117(.0002)
                      FULL     0.0134(.0001)  0.0150(.0002)  0.0218(.0003)
Cubic spline     100  SCAD     0.0378(.0042)  0.0422(.0032)  0.0454(.0038)
                      ORACLE   0.0172(.0009)  0.0217(.0010)  0.0262(.0011)
                      FULL     0.0778(.0022)  0.0843(.0024)  0.0867(.0044)
                 250  SCAD     0.0062(.0003)  0.0083(.0004)  0.0099(.0004)
                      ORACLE   0.0052(.0003)  0.0067(.0003)  0.0099(.0004)
                      FULL     0.0220(.0007)  0.0284(.0008)  0.0362(.0008)
                 500  SCAD     0.0038(.0001)  0.0043(.0001)  0.0075(.0002)
                      ORACLE   0.0026(.0001)  0.0032(.0001)  0.0056(.0002)
                      FULL     0.0075(.0001)  0.0096(.0002)  0.0174(.0003)
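As a point of reference for the tables above, the sketch below computes a total averaged integrated squared error in the way such a criterion is commonly defined (the integrated squared error of each component, summed over components and averaged over Monte Carlo replications). The function name `taise` and the assumption that the paper's TAISE is exactly this quantity are ours, not taken from the main text.

```python
import numpy as np

def taise(est_reps, true_funcs, grid):
    """Total averaged integrated squared error over Monte Carlo replications.

    est_reps   : list of replications; each replication is a list of d
                 estimated component functions (callables on [0, 1]).
    true_funcs : list of the d true component functions.
    grid       : 1-D array of points approximating the integral over [0, 1].

    Returns (mean, standard error) of sum_l ISE_l across replications.
    """
    totals = []
    for est_funcs in est_reps:
        # ISE_l is approximated by a grid average of (est - true)^2.
        ise = sum(np.mean((ef(grid) - tf(grid)) ** 2)
                  for ef, tf in zip(est_funcs, true_funcs))
        totals.append(ise)
    totals = np.asarray(totals)
    return totals.mean(), totals.std(ddof=1) / np.sqrt(len(totals))
```

With a perfect estimator the TAISE is zero; a constant bias of size b in one component contributes b^2 to each replication's total.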


Table 2: Example 1: Continuous response and moderate dimension of covariates. Variable selection results of SCAD with exchangeable, AR-1, or independent working correlation and linear or cubic splines. The columns C, O, and U give the percentages of correct fitting, overfitting, and underfitting from 100 replications.

                        EC                AR-1              IND
                 n      C     O     U     C     O     U     C     O     U
Linear spline    100    0.93  0.00  0.07  0.93  0.01  0.06  0.91  0.01  0.08
                 250    0.99  0.00  0.01  0.98  0.00  0.02  0.98  0.00  0.02
                 500    1.00  0.00  0.00  1.00  0.00  0.00  1.00  0.00  0.00
Cubic spline     100    0.81  0.09  0.10  0.79  0.17  0.04  0.73  0.18  0.09
                 250    0.99  0.00  0.01  0.99  0.00  0.01  0.95  0.00  0.05
                 500    1.00  0.00  0.00  1.00  0.00  0.00  1.00  0.00  0.00

Table 3: Example 2: Continuous response and high dimension of covariates. The TAISEs (standard errors in parentheses) and the percentages of correct fitting (C), overfitting (O), and underfitting (U) from 100 replications.

        SCAD            ORACLE          FULL            C     O     U
AR-1    0.03185(.0007)  0.0189(.0005)   0.4125(.0005)   0.92  0.02  0.06
EC      0.03544(.0008)  0.0214(.0006)   0.5241(.0006)   0.89  0.05  0.06
IND     0.04499(.0009)  0.0237(.0006)   0.7278(.0006)   0.85  0.04  0.11

Table 4: Example 3: Binary response. The TAISEs (standard errors in parentheses) from 100 simulations.

        SCAD           ORACLE         FULL
EC      0.0062(.0004)  0.0068(.0003)  0.0191(.0008)
AR-1    0.0089(.0005)  0.0083(.0004)  0.0229(.0008)
IND     0.0123(.0006)  0.0112(.0005)  0.0276(.0010)


Table 5: Example 3: Binary response. The appearance frequency of each variable in the selected model from 100 simulations.

        X1   X2   X3   X4   X5   X6   X7   X8   X9   X10
EC      100  100  90   98   0    0    0    2    0    0
AR-1    96   100  79   94   0    1    0    0    0    0
IND     89   100  75   66   0    1    0    1    0    0



Figure 1: Example 1: Continuous response and moderate dimension of covariates. The estimated component functions of (a) α1(x) = 2x − 1, (b) α2(x) = 8(x − 0.5)^3, (c) α3(x) = sin(2πx), and (d) α4(x) = 0 from SCAD (dot-dash), FULL (dot), and ORACLE (dash) using a linear spline and the exchangeable working correlation. The true functions are plotted as solid lines.



Figure 2: Example 3: Binary response. The estimated component functions of (a) α1(x) = (e^(x+1) − e^2 + e)/16, (b) α2(x) = cos(2πx)/4, (c) α3(x) = x(1 − x) − 1/6, and (d) α4(x) = 2(x − 0.5)^3 from linear-spline SCAD with exchangeable (dot), AR-1 (dash), and independent (dot-dash) working correlations. The true functions are plotted as solid lines.


[Figure 3: eight panels of estimated component functions, one per covariate: Age, Teeth, Sites, Pddis, Pdall, Dent, Entry, and Nonsurg.]

Figure 3: Periodontal disease application. The estimated component functions from the SCAD procedure with linear spline (dashed line) and cubic spline (dot-dash), and from the QIF of the full model (solid line) using a linear spline and the exchangeable working correlation. Both SCAD procedures give exactly zero estimates for the variables Teeth, Pddis, Entry, and Nonsurg. Also plotted are 95% bootstrap confidence intervals of the component functions from the QIF procedure.


Web Appendix of "Consistent Model Selection for Marginal Generalized Additive Model for Correlated Data"

Lan Xue, Department of Statistics, Oregon State University
Annie Qu, Department of Statistics, University of Illinois at Urbana-Champaign
Jianhui Zhou, Department of Statistics, University of Virginia

This document gives the technical lemmas and their proofs. The lemmas are used in the proofs of Theorems 1-3 in the paper.

With $J_n = N_n + p$, let $\mathbf{B}_l = (B_{l,1}, \ldots, B_{l,J_n})^T$ be a set of orthonormal bases of $\varphi_l^{0,n}$ with respect to the inner product defined in (A.1) of the main text, for $l = 1, \ldots, d$. Let $\mathbf{B} = (1, \mathbf{B}_1^T, \ldots, \mathbf{B}_d^T)^T$, in which $1$ denotes the identity function defined on $\chi$. Then $\mathbf{B}$ is a set of bases for $\mathcal{M}_n$. The next several lemmas present some preliminary results regarding $\mathcal{M}$ and $\mathcal{M}_n$.
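For intuition, a centered and orthonormalized spline basis of this kind can be constructed numerically. The sketch below is our own illustration, not the paper's construction: it uses a linear hat-function basis on uniform knots and an empirical inner product $\langle f, g \rangle_n = n^{-1}\sum_i f(X_i)g(X_i)$ in place of (A.1), centers each basis function so $\langle s, 1 \rangle_n = 0$, and orthonormalizes with a symmetric inverse square root of the Gram matrix.

```python
import numpy as np

def hat_basis(x, knots):
    """Piecewise-linear (hat) basis: column j is 1 at knots[j], 0 at the others."""
    return np.column_stack([np.interp(x, knots, np.eye(len(knots))[j])
                            for j in range(len(knots))])

def centered_orthonormal_basis(x, knots):
    """Center each hat function to have empirical mean zero, then
    orthonormalize under <f, g>_n = mean(f * g) via G^{-1/2}.
    The first hat is dropped: after centering, the full hat basis is
    rank deficient because the hats sum to the constant function."""
    B = hat_basis(x, knots)[:, 1:]
    Bc = B - B.mean(axis=0)                      # <s, 1>_n = 0 per column
    G = Bc.T @ Bc / len(x)                       # empirical Gram matrix
    w, V = np.linalg.eigh(G)
    return Bc @ (V @ np.diag(1.0 / np.sqrt(w)) @ V.T)   # apply G^{-1/2}
```

The returned columns have empirical mean zero and an identity empirical Gram matrix, which is the discrete analogue of orthonormality in $\varphi_l^{0,n}$.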

Lemma 1. Under condition (C1), the two norms $\|\cdot\|$ and $\|\cdot\|_2$ are equivalent on $\mathcal{M}$. That is, there exist constants $C \geq c > 0$ such that, for any $\alpha \in \mathcal{M}$, $c\,\|\alpha\|_2 \leq \|\alpha\| \leq C\,\|\alpha\|_2$.

Proof: Notice that $\|\alpha\|^2 = E\{\alpha^T(\mathbf{X})\,\alpha(\mathbf{X})\} = \sum_{j=1}^{m} E\{\alpha^2(X_j)\}$. Condition (C1) entails that $c_1\,\|\alpha\|_2^2 \leq E\{\alpha^2(X_j)\} \leq c_2\,\|\alpha\|_2^2$ for each $j$. The lemma then follows by taking $c = \sqrt{m c_1}$ and $C = \sqrt{m c_2}$.

Lemma 2. Under condition (C1), there exists a constant $c > 0$ such that
$$\|\alpha\|^2 \geq c\Big(\alpha_0^2 + \sum_{l=1}^{d} \|\alpha_l\|^2\Big), \qquad \forall\, \alpha = \alpha_0 + \sum_{l=1}^{d} \alpha_l \in \mathcal{M}.$$

Proof: The result follows immediately from Lemma 1 and Lemma 1 of Stone (1985).

Lemma 3. Under conditions (C1) and (C4), there exist constants $C \geq c > 0$ such that $c\,\|\beta\|_2^2 \leq \|\beta^T \mathbf{B}\|^2 \leq C\,\|\beta\|_2^2$ for any vector $\beta$ of length $1 + dJ_n$.

Proof: Write $\beta = (\beta_0, \beta_1^T, \ldots, \beta_d^T)^T$, where $\beta_0$ is a scalar and $\beta_l = (\beta_{l1}, \ldots, \beta_{lJ_n})^T$. Lemma 1 entails that there exists a constant $C^*$ such that
$$\|\beta^T \mathbf{B}\|^2 \leq 2^{d+1}\Big(\beta_0^2 + \sum_{l=1}^{d} \|\beta_l^T \mathbf{B}_l\|^2\Big) \leq 2^{d+1} C^*\Big(\beta_0^2 + \sum_{l=1}^{d} \|\beta_l^T \mathbf{B}_l\|_2^2\Big) = 2^{d+1} C^*\,\|\beta\|_2^2,$$
where the last equality uses the orthonormality of each $\mathbf{B}_l$, so that $\|\beta_l^T \mathbf{B}_l\|_2^2 = \|\beta_l\|_2^2$. On the other hand, Lemmas 1 and 2 entail that there exist constants $c_1^*$ and $c_2^*$ such that
$$\|\beta^T \mathbf{B}\|^2 \geq c_1^*\Big(\beta_0^2 + \sum_{l=1}^{d} \|\beta_l^T \mathbf{B}_l\|^2\Big) \geq c_1^* c_2^*\Big(\beta_0^2 + \sum_{l=1}^{d} \|\beta_l^T \mathbf{B}_l\|_2^2\Big) = c\Big(\beta_0^2 + \sum_{l=1}^{d} \|\beta_l\|_2^2\Big) = c\,\|\beta\|_2^2,$$
with $c = c_1^* c_2^*$. This completes the proof.

Lemma 4. As $n \to \infty$, one has
$$\sup_{s_1, s_2 \in \mathcal{M}_n} \left| \frac{\langle s_1, s_2 \rangle_n - \langle s_1, s_2 \rangle}{\|s_1\|\,\|s_2\|} \right| = O_p\left(\sqrt{\frac{\log^2(n)}{nh}}\right).$$
In particular, there exist constants $0 < c < 1 < C$ such that, except on an event whose probability tends to zero as $n \to \infty$, $c\,\|s\| \leq \|s\|_n \leq C\,\|s\|$ for all $s \in \mathcal{M}_n$.

Proof: The proof is similar to the proof of Lemma 4 in Xue and Yang (2006).
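Lemma 4 says the empirical inner product concentrates around its population counterpart uniformly over the spline space. The following is our own, non-rigorous Monte Carlo check of the pointwise version only: for a fixed function $s$, the empirical squared norm $\|s\|_n^2 = n^{-1}\sum_i s^2(X_i)$ approaches $\|s\|^2 = E\,s^2(X)$ at roughly the root-$n$ rate.

```python
import numpy as np

rng = np.random.default_rng(0)

# s(x) = sin(2*pi*x) has ||s||^2 = E s^2(X) = 1/2 when X ~ Uniform(0, 1).
s = lambda x: np.sin(2 * np.pi * x)
true_sq_norm = 0.5

def mean_abs_error(n, reps=200):
    """Average deviation | ||s||_n^2 - ||s||^2 | over Monte Carlo samples."""
    errs = [abs(np.mean(s(rng.uniform(size=n)) ** 2) - true_sq_norm)
            for _ in range(reps)]
    return float(np.mean(errs))

# Increasing the sample size by a factor of 100 shrinks the typical
# deviation by roughly a factor of 10.
errors = [mean_abs_error(n) for n in (100, 10000)]
```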

Lemma 5. There exists a constant $C > 0$ such that, except on an event whose probability tends to zero as $n \to \infty$,
$$\|s\|_n^2 \geq C\Big(s_0^2 + \sum_{l=1}^{d} \|s_l\|_n^2\Big), \qquad \forall\, s = s_0 + \sum_{l=1}^{d} s_l \in \mathcal{M}_n.$$

Proof: The result follows from Lemmas 2 and 4.

Lemma 6. There exist constants $C \geq c > 0$ such that, except on an event whose probability tends to zero as $n \to \infty$, $c\,\|\beta\|_2^2 \leq \|\beta^T \mathbf{B}\|_n^2 \leq C\,\|\beta\|_2^2$ for any vector $\beta$ of length $1 + dJ_n$.

Proof: The result follows from Lemmas 3 and 4.

Lemma 7. Let $\alpha = \alpha_0 + \sum_{l=1}^{d} \alpha_l \in \mathcal{M}$ with each $\alpha_l$ satisfying condition (C2), $l = 1, \ldots, d$. Then there exist an additive spline function $s^0 = s_0^0 + \sum_{l=1}^{d} s_l^0 = s_0^0 + \sum_{l=1}^{d} \beta_l^{0T} \mathbf{B}_l \in \mathcal{M}_n$ and a constant $C > 0$ such that $\|\alpha - s^0\|_\infty \leq C h^r$.

Proof: According to de Boor (2001, p. 149), for each $l = 1, \ldots, d$ there exist a constant $c > 0$ and a spline function $s_l \in \varphi_l$ such that $\|\alpha_l - s_l\|_\infty \leq c h^r$. Let $s_l^0 = s_l - \langle s_l, 1 \rangle \in \varphi_l^0$. Since $\langle \alpha_l, 1 \rangle = 0$, note that $|\langle s_l, 1 \rangle| \leq |\langle s_l - \alpha_l, 1 \rangle| + |\langle \alpha_l, 1 \rangle| \leq \|\alpha_l - s_l\|_\infty \leq c h^r$. Thus $\|\alpha_l - s_l^0\|_\infty \leq \|\alpha_l - s_l\|_\infty + |\langle s_l, 1 \rangle| \leq 2 c h^r$. Let $s^0 = \alpha_0 + \sum_{l=1}^{d} s_l^0 \in \mathcal{M}_n$; then $\|\alpha - s^0\|_\infty \leq \sum_{l=1}^{d} \|\alpha_l - s_l^0\|_\infty \leq 2 d c h^r$. This completes the proof.
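Lemma 7 is the standard spline approximation bound $\|\alpha - s^0\|_\infty \leq C h^r$. For linear splines ($r = 2$ here) the $O(h^2)$ sup-norm rate is easy to see numerically. The sketch below is our own illustration, using simple interpolation rather than the paper's centered least-squares spline: halving $h$ should reduce the sup error by roughly a factor of $2^r = 4$.

```python
import numpy as np

def sup_error_linear_spline(f, num_knots, grid_size=10001):
    """Sup-norm error of the linear spline interpolating f at uniform knots
    on [0, 1], measured on a fine evaluation grid."""
    knots = np.linspace(0.0, 1.0, num_knots)
    grid = np.linspace(0.0, 1.0, grid_size)
    spline_vals = np.interp(grid, knots, f(knots))  # piecewise-linear fit
    return np.max(np.abs(f(grid) - spline_vals))

f = lambda x: np.sin(2 * np.pi * x)        # a smooth component function
err_h = sup_error_linear_spline(f, 33)     # mesh size h = 1/32
err_h2 = sup_error_linear_spline(f, 65)    # mesh size h/2 = 1/64
ratio = err_h / err_h2                     # roughly 2^r = 4 for r = 2
```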

Let $\mathbf{B} = (\mathbf{B}_1^T, \ldots, \mathbf{B}_n^T)^T$ with $\mathbf{B}_i = (\mathbf{B}^T(x_{i1}), \ldots, \mathbf{B}^T(x_{im_i}))^T$ for $i = 1, \ldots, n$. Write $M_0 = I$. Then, for $k, k' = 0, \ldots, K$, define
$$G_{n,k}(\beta) = \frac{1}{n}\sum_{i=1}^{n} g_{ik}(\beta), \qquad C_{kk'}(\beta) = \frac{1}{n}\sum_{i=1}^{n} g_{ik}(\beta)\, g_{ik'}^T(\beta),$$
where
$$g_{ik}(\beta) = g_{ik}\{\mu(\beta)\} = \mathbf{B}_i^T \Delta_i^T(\mu_i)\, A_i^{-1/2}(\mu_i)\, M_k\, A_i^{-1/2}(\mu_i)\, \{Y_i - \mu_i(\beta)\}.$$
Then $G_n$ and $C_n$ in (2) of the main text can be written as $G_n(\beta) = (G_{n,0}^T(\beta), \ldots, G_{n,K}^T(\beta))^T$ and $C_n(\beta) = (C_{kk'}(\beta))_{k,k'=0}^{K}$.

Similarly define
$$g_{ik}^0(\beta) = \mathbf{B}_i^T \Delta_{0,i}^T A_{0,i}^{-1/2} M_k A_{0,i}^{-1/2}\{Y_i - \mu_i^0 + \Delta_{0,i}\zeta_{i1} + \Delta_{0,i}\zeta_{i2}(\beta)\},$$
where $\zeta_{i1} = \mathbf{B}_i \beta^0 - \eta_i^0$, $\zeta_{i2}(\beta) = \mathbf{B}_i(\beta - \beta^0)$, $\beta^0 = (\beta_0^0, \beta_1^{0T}, \ldots, \beta_d^{0T})^T$ collects the spline coefficients defined in Lemma 7, and $\Delta_{0,i}$ and $A_{0,i}$ are evaluated at $\mu_i = \mu_i^0$. Let
$$g_{ik1}^0 = \mathbf{B}_i^T \Delta_{0,i}^T A_{0,i}^{-1/2} M_k A_{0,i}^{-1/2}(Y_i - \mu_i^0),$$
$$g_{ik2}^0 = \mathbf{B}_i^T \Delta_{0,i}^T A_{0,i}^{-1/2} M_k A_{0,i}^{-1/2} \Delta_{0,i}\, \zeta_{i1},$$
$$g_{ik3}^0(\beta) = \mathbf{B}_i^T \Delta_{0,i}^T A_{0,i}^{-1/2} M_k A_{0,i}^{-1/2} \Delta_{0,i}\, \zeta_{i2}(\beta),$$
and let $g_{k1}^0 = \frac{1}{n}\sum_{i=1}^{n} g_{ik1}^0$; define $g_{k2}^0$ and $g_{k3}^0(\beta)$ similarly. Furthermore, for $k, k' = 0, \ldots, K$, define
$$C_{kk'}^0(\beta) = \frac{1}{n}\sum_{i=1}^{n} g_{ik1}^0\, g_{ik'1}^{0T}.$$
Then define $G_{n,k}^0$, $G_n^0$, $C_n^0$ and $Q_n^0$ in the same way as $G_{n,k}$, $G_n$, $C_n$ and $Q_n$, but with $g_{ik}$ replaced by $g_{ik}^0$ and $C_{kk'}(\beta)$ replaced by $C_{kk'}^0(\beta)$.
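To make the construction of $G_n$, $C_n$ and the quadratic inference function concrete, the following simplified sketch (ours, not the paper's code) implements the Gaussian identity-link special case, in which $\Delta_i$ and $A_i$ reduce to identity matrices: the extended score stacks $g_{ik} = \mathbf{B}_i^T M_k (y_i - \mathbf{B}_i \beta)$ over basis matrices $M_0 = I$ and an exchangeable-type $M_1$ (zeros on the diagonal, ones off it), and $Q_n(\beta) = n\, G_n^T C_n^{-1} G_n$.

```python
import numpy as np

def qif(beta, B_list, y_list, M_list):
    """Quadratic inference function Q_n = n * G_n^T C_n^{-1} G_n for the
    Gaussian identity-link case (Delta_i = A_i = I)."""
    n, p, K = len(B_list), len(beta), len(M_list)
    G = np.zeros(K * p)
    C = np.zeros((K * p, K * p))
    for B_i, y_i in zip(B_list, y_list):
        resid = y_i - B_i @ beta
        g_i = np.concatenate([B_i.T @ M @ resid for M in M_list])
        G += g_i / n                      # extended score G_n
        C += np.outer(g_i, g_i) / n       # weighting matrix C_n
    return n * G @ np.linalg.solve(C, G)

# Simulated clustered data (hypothetical design, exchangeable errors).
rng = np.random.default_rng(1)
n, m, p = 200, 3, 3
beta_true = np.array([1.0, -0.5, 0.25])
M_list = [np.eye(m), np.ones((m, m)) - np.eye(m)]   # M_0 = I, exchangeable M_1
cov = 0.5 * np.eye(m) + 0.5 * np.ones((m, m))       # correlation 0.5 within cluster
B_list = [rng.normal(size=(m, p)) for _ in range(n)]
y_list = [B @ beta_true + rng.multivariate_normal(np.zeros(m), cov)
          for B in B_list]

q_true = qif(beta_true, B_list, y_list, M_list)
q_far = qif(beta_true + 1.0, B_list, y_list, M_list)
```

Since $C_n$ is a sum of outer products, $Q_n$ is nonnegative, and it grows quickly as $\beta$ moves away from the truth, which is what Lemmas 8-11 quantify for $Q_n^0$.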

Lemma 8. Under the conditions of Theorem 1, one has
$$\|g_{k1}^0\| = O_p(1/\sqrt{n}), \qquad \|g_{k2}^0\| = O_p(h^r).$$
Furthermore, let $\Theta_n(C) = \{\beta : \|\mathbf{B}^T(\beta - \beta^0)\|_n = C/\sqrt{nh}\}$ for a constant $C$. Then there exist constants $0 < c_6 \leq c_5$ such that, for any $\beta \in \Theta_n(C)$,
$$c_6\, C/\sqrt{nh} \leq \|g_{k3}^0(\beta)\| \leq c_5\, C/\sqrt{nh},$$
except on an event whose probability goes to 0 as $n \to \infty$.

Proof: For any $k = 0, \ldots, K$ and any $a \in \mathbb{R}^{1+dJ_n}$ with $a^T a = 1$, $a^T g_{k1}^0$ has mean $0$, and by conditions (C5), (C6) and Lemma 3 there exists a constant $c > 0$ such that
$$\mathrm{var}\big(a^T g_{k1}^0\big) = \frac{1}{n}\, E\, a^T g_{ik1}^0 g_{ik1}^{0T} a \leq \frac{c}{n}\, a^T E\big(\mathbf{B}_i^T \mathbf{B}_i\big) a \leq \frac{c}{n}\, a^T a = O\Big(\frac{1}{n}\Big).$$
Since the above holds for any $a$ with $a^T a = 1$, one has $\|g_{k1}^0\| = \sup_{a^T a = 1} a^T g_{k1}^0 = O_p(1/\sqrt{n})$.

Let $\Sigma_{ki}^0 = \Delta_{0,i}^T A_{0,i}^{-1/2} M_k A_{0,i}^{-1/2} \Delta_{0,i}$ for $i = 1, \ldots, n$. Then, for any $a \in \mathbb{R}^{1+dJ_n}$ with $a^T a = 1$, the Cauchy-Schwarz inequality together with Lemmas 6 and 7 entails that there exists a constant $c > 0$ such that
$$\big|a^T g_{k2}^0\big| = \Big|\frac{1}{n}\sum_{i=1}^{n} a^T \mathbf{B}_i^T \Delta_{0,i}^T A_{0,i}^{-1/2} M_k A_{0,i}^{-1/2} \Delta_{0,i}\, \zeta_{i1}\Big| \leq \Big(\frac{1}{n}\sum_{i=1}^{n} a^T \mathbf{B}_i^T \Sigma_{ki}^0 \mathbf{B}_i a\Big)^{1/2} \Big(\frac{1}{n}\sum_{i=1}^{n} \zeta_{i1}^T \Sigma_{ki}^0 \zeta_{i1}\Big)^{1/2} \leq c\,(a^T a)^{1/2}\, \big\|\mathbf{B}^T \beta^0 - \eta^0\big\|_n \leq c\, h^r.$$
Therefore $\|g_{k2}^0\| = O_p(h^r)$. Similarly, there exists a constant $c_5 > 0$ such that, with probability approaching one,
$$\big|a^T g_{k3}^0(\beta)\big| = \Big|\frac{1}{n}\sum_{i=1}^{n} a^T \mathbf{B}_i^T \Delta_{0,i}^T A_{0,i}^{-1/2} M_k A_{0,i}^{-1/2} \Delta_{0,i}\, \zeta_{i2}(\beta)\Big| \leq \Big(\frac{1}{n}\sum_{i=1}^{n} a^T \mathbf{B}_i^T \Sigma_{ki}^0 \mathbf{B}_i a\Big)^{1/2} \Big(\frac{1}{n}\sum_{i=1}^{n} \zeta_{i2}^T(\beta)\, \Sigma_{ki}^0\, \zeta_{i2}(\beta)\Big)^{1/2} \leq c_5\,(a^T a)^{1/2}\, \big\|\mathbf{B}^T(\beta - \beta^0)\big\|_n = c_5\, C/\sqrt{nh}.$$
Therefore $\|g_{k3}^0(\beta)\| \leq c_5 C/\sqrt{nh}$.

On the other hand, taking $a = (\beta - \beta^0)/\|\beta - \beta^0\|_2$, conditions (C5), (C6) and Lemma 3 entail that there exist positive constants $c$ and $c_6$ such that
$$a^T g_{k3}^0(\beta) = \frac{1}{n\,\|\beta - \beta^0\|_2}\sum_{i=1}^{n} (\beta - \beta^0)^T \mathbf{B}_i^T \Delta_{0,i}^T A_{0,i}^{-1/2} M_k A_{0,i}^{-1/2} \Delta_{0,i} \mathbf{B}_i (\beta - \beta^0) \geq \frac{c}{n\,\|\beta - \beta^0\|_2}\sum_{i=1}^{n} (\beta - \beta^0)^T \mathbf{B}_i^T \mathbf{B}_i (\beta - \beta^0) = \frac{c}{\|\beta - \beta^0\|_2}\, \big\|\mathbf{B}^T(\beta - \beta^0)\big\|_n^2 \geq c_6\, \big\|\mathbf{B}^T(\beta - \beta^0)\big\|_n = c_6\, C/\sqrt{nh}.$$
Therefore $\|g_{k3}^0(\beta)\| = \sup_{a^T a = 1} a^T g_{k3}^0(\beta) \geq c_6 C/\sqrt{nh}$. This completes the proof.


Lemma 9. Under the conditions of Theorem 1, the eigenvalues of $C_n^0$ are bounded away from $0$ and $\infty$ when $n$ is large enough. Furthermore, let $\Theta_n(C) = \{\beta : \|\mathbf{B}^T(\beta - \beta^0)\|_n = C/\sqrt{nh}\}$ for some $C$ sufficiently large. Then there exist constants $0 < c_8 \leq c_7$ and $0 < c_{10} \leq c_9$ such that, for any $\beta \in \Theta_n(C)$,
$$c_8\, C/\sqrt{nh} \leq \|G_n^0(\beta)\| \leq c_7\, C/\sqrt{nh}, \qquad \text{(B.1)}$$
and
$$c_{10}\, C^2/h \leq Q_n^0(\beta) \leq c_9\, C^2/h, \qquad \text{(B.2)}$$
except on an event whose probability goes to 0 as $n \to \infty$.

Proof: Let $\widetilde{\mathbf{B}}_i = I_{K+1} \otimes \mathbf{B}_i$, where $I_{K+1}$ is the $(K+1)\times(K+1)$ identity matrix and $\otimes$ denotes the Kronecker product. Furthermore, let $M = (M_0^T, \ldots, M_K^T)^T$. By conditions (C5) and (C6), to show that $C_n^0$ has bounded eigenvalues it suffices to show that there exist constants $c^*$ and $C^*$ such that, when $n$ is large enough,
$$c^* \leq \frac{1}{n}\sum_{i=1}^{n} a^T \widetilde{\mathbf{B}}_i^T M M^T \widetilde{\mathbf{B}}_i a \leq C^* \qquad \text{(B.3)}$$
for any $a = (a_0^T, \ldots, a_K^T)^T$ with each $a_k \in \mathbb{R}^{1+dJ_n}$ and $a^T a = 1$. The eigenvalues of $M M^T$ are the squares of the singular values of $M$, which are bounded away from $0$ and $+\infty$ by condition (C7). Then $\frac{1}{n}\sum_{i=1}^{n} a^T \widetilde{\mathbf{B}}_i^T M M^T \widetilde{\mathbf{B}}_i a \asymp I_n$, where, by Lemma 6,
$$I_n = \frac{1}{n}\sum_{i=1}^{n} a^T \widetilde{\mathbf{B}}_i^T \widetilde{\mathbf{B}}_i a = \sum_{k=0}^{K} \frac{1}{n}\sum_{i=1}^{n} a_k^T \mathbf{B}_i^T \mathbf{B}_i a_k \asymp \sum_{k=0}^{K} a_k^T a_k = a^T a = 1.$$
Here $a \asymp b$ denotes that there exist constants $c \geq c^* > 0$ such that $c^* b \leq a \leq c b$. Therefore (B.3) holds. Finally, (B.1) follows directly from Lemma 8, and (B.2) follows from (B.1) and the result that $C_n^0$ has bounded eigenvalues.
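The step relating the eigenvalues of $M M^T$ to the singular values of the stacked matrix $M = (M_0^T, \ldots, M_K^T)^T$ is elementary linear algebra, and easy to confirm numerically. The check below is ours and uses arbitrary random basis matrices rather than the paper's $M_k$: the nonzero eigenvalues of $M M^T$ are exactly the squared singular values of $M$, and the rest are zero.

```python
import numpy as np

rng = np.random.default_rng(2)
m, K = 4, 2

# Stack K + 1 basis matrices M_k (each m x m) into a ((K+1)m) x m matrix.
M = np.vstack([rng.normal(size=(m, m)) for _ in range(K + 1)])

# Eigenvalues of the symmetric matrix M M^T, largest first.
eigvals = np.sort(np.linalg.eigvalsh(M @ M.T))[::-1]

# Squared singular values of M, largest first.
sq_singvals = np.sort(np.linalg.svd(M, compute_uv=False) ** 2)[::-1]
```

The top $m$ eigenvalues match the squared singular values; the remaining $(K+1)m - m$ eigenvalues vanish because $M M^T$ has rank at most $m$.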

Lemma 10. Under the conditions of Theorem 1, for some $C$ sufficiently large, one has
$$\sup_{\beta \in \Theta_n(C)} \|G_n(\beta) - G_n^0(\beta)\| = o_p(1/\sqrt{nh}), \qquad \text{(B.4)}$$
$$\sup_{\beta \in \Theta_n(C)} |Q_n(\beta) - Q_n^0(\beta)| = o_p(1/h). \qquad \text{(B.5)}$$

Proof: It is sufficient to show that (B.4) holds for each of its components, that is, $\|G_{n,k}(\beta) - G_{n,k}^0(\beta)\| = o_p(1/\sqrt{nh})$ for $k = 0, \ldots, K$. Note that the regression function can be decomposed as
$$\eta_i(\beta) = \mathbf{B}_i \beta = \eta_i^0 + \zeta_i(\beta), \qquad \mu_i(\beta) = \mu\{\eta_i^0 + \zeta_i(\beta)\},$$
where $\zeta_i(\beta) = \zeta_{i1} + \zeta_{i2}(\beta)$, $\zeta_{i1} = \mathbf{B}_i \beta^0 - \eta_i^0$ and $\zeta_{i2}(\beta) = \mathbf{B}_i(\beta - \beta^0)$.

For any $a \in \mathbb{R}^{1+dJ_n}$ with $a^T a = 1$, a Taylor expansion of $a^T G_{n,k}(\beta)$ gives
$$a^T G_{n,k}(\beta) = a^T G_{n,k}\{\mu(\beta)\} = a^T G_{n,k}[\mu\{\eta^0 + \zeta\}]$$
$$= \frac{1}{n}\sum_{i=1}^{n} a^T \mathbf{B}_i^T \Delta_i^T\{\mu_i(\eta_i^0 + \zeta_i)\}\, A_i^{-1/2}\{\mu_i(\eta_i^0 + \zeta_i)\}\, M_k\, A_i^{-1/2}\{\mu_i(\eta_i^0 + \zeta_i)\}\,\{y_i - \mu_i(\eta_i^0 + \zeta_i)\}$$
$$= \frac{1}{n}\sum_{i=1}^{n} a^T \mathbf{B}_i^T \Delta_{0,i}^T A_{0,i}^{-1/2} M_k A_{0,i}^{-1/2}\{y_i - \mu_i^0\} + \frac{1}{n}\sum_{i=1}^{n} a^T \mathbf{B}_i^T \Delta_{0,i}^T A_{0,i}^{-1/2} M_k A_{0,i}^{-1/2} \Delta_{0,i}\, \zeta_i$$
$$\quad + \frac{1}{n}\sum_{i=1}^{n} \zeta_i^T \Delta_{0,i}^T\, \frac{\partial\, a^T \mathbf{B}_i^T \Delta_{0,i}^T A_{0,i}^{-1/2} M_k A_{0,i}^{-1/2}}{\partial \mu_i}\,\{y_i - \mu_i^0\} + R_n^*(\mu^*),$$
where $R_n^*(\mu^*) = \frac{1}{n}\sum_{i=1}^{n} R_{ni}^*(\mu_i^*)$, with
$$R_{ni}^*(\mu_i^*) = \frac{1}{2}\, \zeta_i^T \Delta_i^T\, \frac{\partial^2\big[a^T \mathbf{B}_i^T \Delta_i^T(\mu_i)\, A_i^{-1/2}(\mu_i)\, M_k\, A_i^{-1/2}(\mu_i)\,\{y_i - \mu_i\}\big]}{\partial \mu_i\, \partial \mu_i^T}\, \Delta_i\, \zeta_i$$
evaluated at $\mu_i^* = g^{-1}(\eta_i^0 + \tau_i \zeta_i)$ with $0 < \tau_i < 1$. Then one has
$$a^T\{G_{n,k}(\beta) - G_{n,k}^0(\beta)\} = \frac{1}{n}\sum_{i=1}^{n} \zeta_i^T \Delta_{0,i}^T\, \frac{\partial\, a^T \mathbf{B}_i^T \Delta_{0,i}^T A_{0,i}^{-1/2} M_k A_{0,i}^{-1/2}}{\partial \mu_i}\,\{y_i - \mu_i^0\} + R_n^*(\mu^*) = I_k(\beta) + R_n^*(\mu^*). \qquad \text{(B.6)}$$

For $I_k(\beta)$, one can write
$$I_k(\beta) = \frac{1}{n}\sum_{i=1}^{n} \zeta_{i1}^T\, \Gamma_{0,i}\,(y_i - \mu_i^0) + \frac{1}{n}\sum_{i=1}^{n} \zeta_{i2}^T(\beta)\, \Gamma_{0,i}\,\{y_i - \mu_i^0\} = I_{k1} + I_{k2}(\beta),$$
where $\Gamma_{0,i} = \Delta_{0,i}^T\, \partial\big\{a^T \mathbf{B}_i^T \Delta_{0,i}^T A_{0,i}^{-1/2} M_k A_{0,i}^{-1/2}\big\}/\partial \mu_i$. Conditions (C5), (C6) and Lemma 6.2 of He and Shi (1996) give that $\sup_{1 \leq i \leq n,\, a^T a = 1} \|\Gamma_{0,i}\| = O_p(h^{-1/2})$. Furthermore, $I_k(\beta)$ has mean $0$, and condition (C3) and Lemma 4 ensure that, for any $\beta \in \Theta_n(C)$,
$$\mathrm{var}(I_{k1}) \leq \frac{c}{nh}\, E\big\{\zeta_{i1}^T (y_i - \mu_i^0)(y_i - \mu_i^0)^T \zeta_{i1}\big\} \leq \frac{c}{nh}\, E\big(\zeta_{i1}^T \zeta_{i1}\big) = o(h^{2r}).$$
Therefore $\sup_{a^T a = 1,\, \beta \in \Theta_n(C)} |I_{k1}| = o_p(h^r)$. Similarly, $\sup_{a^T a = 1,\, \beta \in \Theta_n(C)} |I_{k2}(\beta)| = o_p(1/\sqrt{nh})$.

Let
$$F_i^* = \Delta_i^T\, \left[\frac{\partial^2\, a^T \mathbf{B}_i^T \Delta_i^T(\mu_i)\, A_i^{-1/2}(\mu_i)\, M_k\, A_i^{-1/2}(\mu_i)\,\{y_i - \mu_i\}}{\partial \mu_i\, \partial \mu_i^T}\right]\Bigg|_{\mu_i = \mu_i^*}\, \Delta_i.$$
Then
$$R_n^*(\mu^*) = \frac{1}{2n}\sum_{i=1}^{n} \zeta_{i1}^T F_i^* \zeta_{i1} + \frac{1}{n}\sum_{i=1}^{n} \zeta_{i2}^T(\beta)\, F_i^* \zeta_{i1} + \frac{1}{2n}\sum_{i=1}^{n} \zeta_{i2}^T(\beta)\, F_i^*\, \zeta_{i2}(\beta) = I_{n3}^{(1)} + I_{n3}^{(2)}(\beta) + I_{n3}^{(3)}(\beta).$$
Conditions (C5), (C6) and Lemma 6.2 of He and Shi (1996) entail that $\sup_{1 \leq i \leq n,\, a^T a = 1} \|F_i^*\| = O_p(h^{-1/2})$. Therefore
$$\sup_{a^T a = 1,\, \beta \in \Theta_n(C)} |I_{n3}^{(1)}| = O_p(h^{2r - 1/2}), \qquad \sup_{a^T a = 1,\, \beta \in \Theta_n(C)} |I_{n3}^{(2)}(\beta)| = O_p(n^{-1/2} h^{r-1}), \qquad \sup_{a^T a = 1,\, \beta \in \Theta_n(C)} |I_{n3}^{(3)}(\beta)| = O_p(n^{-1} h^{-3/2}).$$
Combining all of the approximations above in (B.6) and noting that $h = O(n^{-1/(2r+1)})$, one has
$$\sup_{\beta \in \Theta_n(C)} \|G_{n,k}(\beta) - G_{n,k}^0(\beta)\| = o_p(1/\sqrt{nh}).$$
Finally, (B.5) follows from Lemma 8 and equation (B.4).

Lemma 11. Under the conditions of Theorem 1, for any $\varepsilon > 0$ there exists a $C = C(\varepsilon)$ such that, as $n \to \infty$,
$$P\Big\{\inf_{\beta \in \Theta_n(C)} Q_n^0(\beta) > Q_n^0(\beta^0)\Big\} > 1 - \varepsilon.$$


Proof: Let $\Delta(\beta) = G_n^0(\beta) - G_n^0(\beta^0)$. Then
$$\frac{1}{n}\big\{Q_n^0(\beta) - Q_n^0(\beta^0)\big\} = \Delta^T(\beta)\big(C_n^0\big)^{-1}\Delta(\beta) + 2\,\Delta^T(\beta)\big(C_n^0\big)^{-1} G_n^0(\beta^0).$$
By Lemma 9, there exists a constant $c_{11} > 0$ such that, for any $\beta \in \Theta_n(C)$,
$$\Big|\Delta^T(\beta)\big(C_n^0\big)^{-1} G_n^0(\beta^0)\Big| \leq c_{11} \sum_{k=0}^{K} \Big|\big\{g_{k3}^0(\beta)\big\}^T \big(g_{k1}^0 + g_{k2}^0\big)\Big| \leq c_{11} \sum_{k=0}^{K} \big\|g_{k3}^0(\beta)\big\| \Big(\big\|g_{k1}^0\big\| + \big\|g_{k2}^0\big\|\Big).$$
Lemma 8 then entails that there exists a constant $C_1 > 0$ such that $\big|\Delta^T(\beta)(C_n^0)^{-1} G_n^0(\beta^0)\big| \leq C C_1 h^r/\sqrt{nh}$ when $n$ is large enough. On the other hand, condition (C7) entails that there exists a $c_{12} > 0$ such that
$$\Delta^T(\beta)\big(C_n^0\big)^{-1}\Delta(\beta) \geq c_{12} \sum_{k=0}^{K} \big\{g_{k3}^0(\beta)\big\}^T \big\{g_{k3}^0(\beta)\big\} = c_{12} \sum_{k=0}^{K} \big\|g_{k3}^0(\beta)\big\|_2^2,$$
and Lemma 8 gives that there exists a constant $C_2 > 0$ such that $\Delta^T(\beta)(C_n^0)^{-1}\Delta(\beta) \geq C_2 C^2/(nh)$. Since $h = n^{-1/(2r+1)}$ implies $h^r/\sqrt{nh} = 1/(nh)$, the two bounds are of the same order, and Lemma 11 follows by choosing $C$ sufficiently large, for example $C > 2C_1/C_2$.