
7 Semiparametric Estimation of Additive Models

Additive models are very useful for approximating high-dimensional regression mean functions. They and their extensions have become one of the most widely used nonparametric techniques since the excellent monograph by Hastie and Tibshirani (1990) and the companion software described in Chambers and Hastie (1991). For a recent survey on additive models, see Horowitz (2014).

Much applied research in economics and statistics is concerned with the estimation of a conditional mean or quantile function. Specifically, let $(X, Y)$ be a random pair where $Y$ is a scalar random variable and $X$ is a $d \times 1$ random vector that is continuously distributed. We are interested in the estimation of either $m(x) \equiv E(Y \mid X = x)$ or $q_\tau(x) = \arg\min_{q(\cdot)} E[\rho_\tau(Y - q(X)) \mid X = x]$, the $\tau$th conditional quantile function of $Y$ given $X = x$: $P(Y \le q_\tau(x) \mid X = x) = \tau$. In a classical nonparametric additive model, $m$ or $q_\tau$ is assumed to have the form

$$m(x) = m_0 + \sum_{j=1}^{d} m_j(x_j) \qquad (7.1)$$

or

$$q_\tau(x) = q_{\tau,0} + \sum_{j=1}^{d} q_{\tau,j}(x_j), \qquad (7.2)$$

where $m_0$ (resp. $q_{\tau,0}$) is a constant, $x_j$ is the $j$th element of $x$, and $m_1, \ldots, m_d$ (resp. $q_{\tau,1}, \ldots, q_{\tau,d}$) are one-dimensional smooth functions that are unknown and estimated nonparametrically. Model (7.1) or (7.2) can be extended to

$$m(x) = G\Big(m_0 + \sum_{j=1}^{d} m_j(x_j)\Big) \qquad (7.3)$$

or

$$q_\tau(x) = G\Big(q_{\tau,0} + \sum_{j=1}^{d} q_{\tau,j}(x_j)\Big), \qquad (7.4)$$

where $G$ is a strictly increasing function that may be known or unknown.

Below we focus on the estimation of model (7.1) and its extensions. Then we briefly touch upon the models in (7.2)-(7.4).

7.1 The Additive Model and the Backfitting Algorithm

7.1.1 The Basic Additive Model

In the regression framework, a simple additive model is defined by

$$Y = m_0 + \sum_{j=1}^{d} m_j(X_j) + \varepsilon, \qquad (7.5)$$

where $E(\varepsilon \mid X_1, \ldots, X_d) = 0$, $E(\varepsilon^2 \mid X_1, \ldots, X_d) = \sigma^2(X_1, \ldots, X_d)$, and the $m_j$'s are arbitrary univariate functions that are assumed to be smooth and unknown. Note that we can add a constant to a component $m_j$ or to $m_0$ and subtract the same constant from another component in (7.5). Thus $m_0, m_1, \ldots, m_d$ are not identified without further restrictions. To prevent ambiguity, various identification conditions can be assumed. For example, one can assume that either

$$E[m_j(X_j)] = 0, \quad j = 1, \ldots, d, \qquad (7.6)$$

or

$$m_j(0) = 0, \quad j = 1, \ldots, d, \qquad (7.7)$$

or

$$\int m_j(x_j)\, dx_j = 0, \quad j = 1, \ldots, d, \qquad (7.8)$$

whichever is convenient for the estimation method at hand. We also assume that the $m_j$'s are smooth functions so that they can be estimated as well as in the one-dimensional nonparametric regression problem (Stone, 1985, 1986). Hence, the curse of dimensionality is avoided.

Frequently, we will write

$$m(x) = E(Y \mid X = x) = m_0 + \sum_{j=1}^{d} m_j(x_j), \qquad (7.9)$$

where $x = (x_1, \ldots, x_d)'$ and $X = (X_1, \ldots, X_d)'$.

Model (7.5) allows us to examine the extent of the nonlinear contribution of each explanatory variable to the dependent variable $Y$. Under the identification conditions that $E[m_j(X_j)] = 0$ for each $j = 1, \ldots, d$ and $E(\varepsilon \mid X_1, \ldots, X_d) = 0$, we have $m_0 = E(Y)$, so that the single finite-dimensional parameter $m_0$ can be estimated by the sample mean $\bar Y = n^{-1}\sum_{i=1}^{n} Y_i$. Since $\bar Y$ converges to $m_0$ at the parametric $\sqrt{n}$-rate, which is faster than any nonparametric convergence rate, below we will simply work with the model without $m_0$ in (7.5) by assuming $E(Y) = 0$.

Additive models of the form (7.5) have been shown to be useful in practice. They naturally generalize linear regression models and allow interpretation of marginal changes, i.e., the effect of one variable, say $X_j$, on the conditional mean function $m(x)$, holding everything else constant. They are also interesting from a theoretical perspective since they combine flexible nonparametric modelling of many variables with statistical precision that is typical for just one explanatory variable.

Example 7.1 (Additive AR(p) models) In the time series literature, a useful class of nonlinear autoregressive models is the class of additive models

$$Y_t = \sum_{j=1}^{p} m_j(Y_{t-j}) + \varepsilon_t. \qquad (7.10)$$

In this case, the model is also called an additive autoregressive model of order $p$ and is simply denoted AAR($p$). In particular, it includes the AR($p$) model as a special case and allows us to test whether an AR($p$) model holds reasonably well for a given time series.

Restricting ourselves to the class of additive models (7.5), the prediction error can be written as

$$E\Big[Y - m_0 - \sum_{j=1}^{d} m_j(X_j)\Big]^2 = E\big[Y - m(X_1, \ldots, X_d)\big]^2 + E\Big[m(X_1, \ldots, X_d) - m_0 - \sum_{j=1}^{d} m_j(X_j)\Big]^2, \qquad (7.11)$$

where $m(X_1, \ldots, X_d) = E(Y \mid X_1, \ldots, X_d)$. Thus finding the best additive model to minimize the least squares prediction error is equivalent to finding the one that best approximates the conditional mean function, in the sense that $m_0$ and $m_j(X_j)$, $j = 1, \ldots, d$, minimize the second term in (7.11). In the case where the additive model is not correctly specified (i.e., $\Pr\big(m(X_1, \ldots, X_d) = m_0 + \sum_{j=1}^{d} m_j(X_j)\big) < 1$), we can interpret it as an approximation of the conditional mean function.

7.1.2 The Backfitting Algorithm

The estimation of $m_1, \ldots, m_d$ can easily be done by using the backfitting algorithm from the nonparametric literature. To do this we first introduce some background knowledge on global spline approximation.

Local linear modelling cannot be applied directly to fit the additive model (7.5) with $m_0 = 0$. To approximate the unknown functions $m_1, \ldots, m_d$ locally at a point $(x_1, \ldots, x_d)$, we would need to localize simultaneously in the variables $X_1, \ldots, X_d$. This yields a $d$-dimensional hypercube, which contains hardly any data points for small to moderate sample sizes, unless the neighborhood is very large. However, when the neighborhood is made large enough to contain sufficiently many data points, the approximation error becomes large. This is the key problem underlying the curse of dimensionality.

To attenuate the problem, we can approximate the nonlinear functions $m_1, \ldots, m_d$ by polynomial splines, or Hermite polynomials, among others. For example, we can approximate $m_j(x_j)$ by

$$g_j(x_j; \beta_j) = \sum_{l=1}^{L_j} \beta_{jl}\, p_l(x_j), \qquad (7.12)$$

where $\beta_j = (\beta_{j1}, \ldots, \beta_{jL_j})'$ and the functions $\{p_l(\cdot)\}_{l=1}^{L_j}$ can be chosen from some basis of functions, which includes the trigonometric series $\{\sin(lx), \cos(lx)\}_{l=1}^{\infty}$, the polynomial series $\{1, x, x^2, x^3, \ldots\}$, and Gallant's (1982) flexible Fourier form $\{x, x^2, \sin(x), \cos(x), \sin(2x), \cos(2x), \ldots\}$. Below we introduce two popular choices of approximating functions, namely, polynomial splines and Hermite polynomials.

Spline methods are very useful for nonparametric modelling. They are based on global approximation and are useful extensions of polynomial regression techniques. Let $t_1, \ldots, t_k$ be a sequence of given knots such that $-\infty < t_1 < \cdots < t_k < \infty$. These knots can be chosen either by the researcher or by the data themselves. A spline function of order $r$ is a piecewise polynomial of degree $r-1$ whose restriction to each of the intervals $(-\infty, t_1], [t_1, t_2), \ldots, [t_k, \infty)$ is a polynomial of degree $r-1$ and which is smoothly joined at the knots. The following formal definition is adapted from Eubank (1999, p. 281).

Definition 7.1 A polynomial spline function $s(x)$ of order $r$ with knots $t_1, \ldots, t_k$ is a function of the form

$$s(x) = \sum_{l=1}^{r+k} \beta_l\, p_l(x) \qquad (7.13)$$

for some set of coefficients $\beta_l$, $l = 1, \ldots, r+k$, where

$$p_l(x) = x^{l-1}, \quad l = 1, \ldots, r, \qquad p_{r+l}(x) = (x - t_l)_+^{r-1}, \quad l = 1, \ldots, k, \qquad (7.14)$$

and $(x - t_l)_+^{r-1} = \max\{(x - t_l)^{r-1},\, 0\}$.

The above definition is equivalent to saying that

(i) $s$ is a piecewise polynomial of degree $r-1$ on any subinterval $[t_l, t_{l+1})$;
(ii) $s$ has $r-2$ continuous derivatives; and
(iii) $s$ has a discontinuous $(r-1)$st derivative with jumps at $t_l$, $l = 1, \ldots, k$.

Thus, a spline is a piecewise polynomial whose different polynomial segments are joined together at the knots $t_l$, $l = 1, \ldots, k$, in a fashion that ensures the continuity properties. Let $S(t_1, \ldots, t_k)$ denote the set of all functions of the form (7.13). Then $S(t_1, \ldots, t_k)$ is a vector space in the sense that sums of functions in $S(t_1, \ldots, t_k)$ remain in the set, etc. Since the functions $1, x, \ldots, x^{r-1}, (x - t_1)_+^{r-1}, \ldots, (x - t_k)_+^{r-1}$ are linearly independent, it follows that $S(t_1, \ldots, t_k)$ has dimension $r + k$.

Example 7.2 (Polynomial splines) Returning to our case, we can approximate $m_j$ ($j = 1, \ldots, d$) by a polynomial spline of order $r$ with knots $\{t_{j1}, \ldots, t_{jk_j}\}$:

$$m_j(x_j) \simeq \sum_{l=0}^{r-1}\beta_{jl}\, x_j^{l} + \sum_{l=1}^{k_j}\beta_{j,r+l-1}\,(x_j - t_{jl})_+^{r-1} \equiv g_j(x_j; \beta_j). \qquad (7.15)$$

In real applications, for any given number of knots $k_j$, the knots $\{t_{j1}, \ldots, t_{jk_j}\}$ can simply be chosen as the empirical quantiles of $\{X_{ij}\}_{i=1}^{n}$, i.e., $t_{jl}$ is the $l/(k_j+1)$-th quantile of $\{X_{ij}\}_{i=1}^{n}$ for $l = 1, \ldots, k_j$. When the knots are fine enough on the support of $X_j$, which is usually assumed to be compact, the resulting spline function $g_j(x_j; \beta_j)$ can approximate the smooth function $m_j$ quite well. Two popular choices of $r$ are $r = 2$ and $r = 4$. When $r = 2$, $g_j(x_j; \beta_j)$ is simply the piecewise linear function

$$g_j(x_j; \beta_j) = \beta_{j0} + \beta_{j1}x_j + \beta_{j2}(x_j - t_{j1})_+ + \cdots + \beta_{j,k_j+1}(x_j - t_{jk_j})_+. \qquad (7.16)$$

One can easily verify that $g_j(x_j; \beta_j)$ is piecewise linear and continuous, and has kinks at the knots $t_{j1}, \ldots, t_{jk_j}$.
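To make the construction in (7.15)-(7.16) concrete, the following minimal Python sketch (not part of the original notes; function names, the choice of five knots, and the simulated data are illustrative assumptions) builds the order-2 truncated power basis with knots at empirical quantiles and fits a univariate spline regression by least squares.

```python
import numpy as np

def linear_spline_basis(x, knots):
    """Order-2 (piecewise linear) truncated power basis: [1, x, (x-t_1)_+, ..., (x-t_k)_+]."""
    cols = [np.ones_like(x), x]
    cols += [np.maximum(x - t, 0.0) for t in knots]
    return np.column_stack(cols)

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-1, 1, n)
y = np.sin(np.pi * x) + 0.2 * rng.standard_normal(n)      # true m(x) = sin(pi * x)

k = 5                                                      # number of interior knots
knots = np.quantile(x, [(l + 1) / (k + 1) for l in range(k)])
B = linear_spline_basis(x, knots)
beta_hat, *_ = np.linalg.lstsq(B, y, rcond=None)           # least squares fit of (7.16)
m_hat = B @ beta_hat                                       # fitted values g(x; beta_hat)
print("in-sample RMSE:", np.sqrt(np.mean((m_hat - np.sin(np.pi * x)) ** 2)))
```

Adding more knots reduces approximation bias at the cost of a larger parametric problem, which is exactly the trade-off exploited by the series approach below.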

Example 7.3 (Hermite polynomials of order $L$) We can also approximate the unknown function $m_j$ ($j = 1, \ldots, d$) by Hermite polynomials of order $L$:

$$m_j(x_j) \simeq \sum_{l=0}^{L}\beta_{jl}\,(x_j - \mu_j)^{l}\exp\Big\{-\frac{(x_j - \mu_j)^2}{2\sigma_j^2}\Big\} \equiv g_j(x_j; \beta_j), \qquad (7.17)$$

where $\mu_j$ and $\sigma_j^2$ can be chosen as the sample mean and sample variance of the data $\{X_{ij}\}_{i=1}^{n}$. Hermite polynomials are often chosen when the underlying variables have infinite support.

After the approximation, we can estimate the unknown parameters in the approximation by the least squares method. That is, we choose $\beta_1, \ldots, \beta_d$ to minimize the criterion function

$$n^{-1}\sum_{i=1}^{n}\Big[Y_i - g_1(X_{i1}; \beta_1) - \cdots - g_d(X_{id}; \beta_d)\Big]^2. \qquad (7.18)$$

Let the solution be $\hat\beta_j$, $j = 1, \ldots, d$. Then the estimated functions are simply

$$\hat m_j(x_j) = g_j(x_j; \hat\beta_j), \quad j = 1, \ldots, d. \qquad (7.19)$$

The above least squares problem can be solved directly, resulting in a large parametric problem that requires the inversion of a matrix of high order. Alternatively, the optimization problem can be solved using the backfitting algorithm. Conditional expectations provide a simple intuitive motivation for the backfitting algorithm. If the additive model (7.5) is correct with $m_0 = 0$ (otherwise replace $Y$ by $Y$ minus its sample mean), then for any $j = 1, \ldots, d$,

$$E\Big[Y - \sum_{k \ne j} m_k(X_k) \,\Big|\, X_j\Big] = m_j(X_j). \qquad (7.20)$$

This immediately suggests an iterative algorithm for computing $\hat m_j$:

Step 1. Given initial values of $\hat\beta_2, \ldots, \hat\beta_d$ (say, from the direct least squares solution), minimize (7.18) with respect to $\beta_1$. This is a much smaller "parametric problem" and can be solved relatively easily.

Step 2. With the estimated value $\hat\beta_1$ and the values of $\hat\beta_3, \ldots, \hat\beta_d$, minimize (7.18) with respect to $\beta_2$. This results in an updated estimate $\hat\beta_2$. Repeat this exercise until $\hat\beta_d$ is updated.

Step 3. Repeat Steps 1-2 until a convergence criterion is met.

This is the basic idea of the backfitting algorithm (Ezekiel, 1924; Buja et al., 1989). Let $\hat Y_i^{(j)} = Y_i - \sum_{k \ne j} g_k(X_{ik}; \hat\beta_k)$ be the partial residual that does not use the regressor $X_j$. Then the backfitting algorithm finds $\hat\beta_j$ by minimizing

$$n^{-1}\sum_{i=1}^{n}\Big[\hat Y_i^{(j)} - g_j(X_{ij}; \beta_j)\Big]^2. \qquad (7.21)$$

This is a nonparametric regression problem of $\hat Y_i^{(j)}$ on the variable $X_{ij}$. The resulting estimate is linear in the partial residuals $\hat Y_i^{(j)}$ and can be written as

$$\begin{pmatrix} g_j(X_{1j}; \hat\beta_j) \\ g_j(X_{2j}; \hat\beta_j) \\ \vdots \\ g_j(X_{nj}; \hat\beta_j) \end{pmatrix} = S_j\begin{pmatrix} \hat Y_1^{(j)} \\ \hat Y_2^{(j)} \\ \vdots \\ \hat Y_n^{(j)} \end{pmatrix}, \qquad (7.22)$$

where $S_j$ is the smoothing matrix. For ease of presentation, denote the left hand side of (7.22) as $\hat{\mathbf g}_j$ and write $\mathbf Y = (Y_1, \ldots, Y_n)'$. Then (7.22) can be written as

$$\hat{\mathbf g}_j = S_j\Big(\mathbf Y - \sum_{k \ne j}\hat{\mathbf g}_k\Big). \qquad (7.23)$$

The above example uses polynomial splines or Hermite polynomials as the nonparametric smoother, but the idea applies to any nonparametric smoother. Let $S_j$ be the smoothing matrix obtained by regressing the partial residuals $\hat Y_i^{(j)}$ nonparametrically on $X_{ij}$. The general backfitting algorithm can be outlined as follows.

Step 1. Initialize the functions $\hat{\mathbf g}_1, \ldots, \hat{\mathbf g}_d$.

Step 2. For $j = 1, \ldots, d$, compute $\hat{\mathbf g}_j^{*} = S_j\big(\mathbf Y - \sum_{k \ne j}\hat{\mathbf g}_k\big)$ and center the estimator to obtain

$$\hat g_j(\cdot) = \hat g_j^{*}(\cdot) - n^{-1}\sum_{i=1}^{n}\hat g_j^{*}(X_{ij}), \qquad (7.24)$$

where $\hat g_j^{*}(X_{ij})$ denotes the $i$th element of $\hat{\mathbf g}_j^{*}$.

Step 3. Repeat Step 2 until convergence.

See Hastie and Tibshirani (1990, p. 91) for a discussion of the above algorithm. The re-centering in Step 2 is to comply with the constraint in (7.6). The convergence of the algorithm is delicate and has been addressed via the concept of concurvity by Buja et al. (1989). Concurvity is the analogue of collinearity in linear regression models. Assuming that concurvity is not present, it is shown there

that the backfitting algorithm converges and solves the following system of equations:

$$\begin{pmatrix}\hat{\mathbf g}_1 \\ \hat{\mathbf g}_2 \\ \vdots \\ \hat{\mathbf g}_d\end{pmatrix} = \begin{pmatrix} I & S_1 & \cdots & S_1 \\ S_2 & I & \cdots & S_2 \\ \vdots & \vdots & \ddots & \vdots \\ S_d & S_d & \cdots & I\end{pmatrix}^{-1}\begin{pmatrix} S_1 \\ S_2 \\ \vdots \\ S_d\end{pmatrix}\mathbf Y. \qquad (7.25)$$

Direct calculation of the right-hand side of (7.25) involves inverting an $nd \times nd$ square matrix and can hardly be implemented on an average computer for moderate to large sample sizes. In contrast, backfitting does not share this drawback and is frequently used in practical implementations.
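As an illustration of the iteration (7.23)-(7.24), here is a minimal Python sketch (not from the original notes; the Gaussian Nadaraya-Watson smoother, the bandwidth, and the simulated data are illustrative assumptions) that backfits two additive components and re-centers them at each pass.

```python
import numpy as np

def nw_smoother_matrix(x, h):
    """Nadaraya-Watson smoother matrix S with a Gaussian kernel: (S y)_i = m_hat(x_i)."""
    u = (x[:, None] - x[None, :]) / h
    w = np.exp(-0.5 * u ** 2)
    return w / w.sum(axis=1, keepdims=True)

def backfit(X, y, h=0.2, n_iter=50, tol=1e-6):
    """Backfitting for Y = m_0 + sum_j m_j(X_j) + eps with E[m_j(X_j)] = 0 (cf. (7.23)-(7.24))."""
    n, d = X.shape
    m0 = y.mean()
    y_c = y - m0
    S = [nw_smoother_matrix(X[:, j], h) for j in range(d)]
    g = np.zeros((n, d))                        # fitted component values at the sample points
    for _ in range(n_iter):
        g_old = g.copy()
        for j in range(d):
            partial = y_c - g.sum(axis=1) + g[:, j]    # Y - sum_{k != j} g_k
            gj = S[j] @ partial
            g[:, j] = gj - gj.mean()            # re-center to impose E[m_j(X_j)] = 0
        if np.max(np.abs(g - g_old)) < tol:
            break
    return m0, g

rng = np.random.default_rng(1)
n = 300
X = rng.uniform(-1, 1, (n, 2))
y = np.sin(np.pi * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.standard_normal(n)
m0_hat, g_hat = backfit(X, y)
print("estimated intercept:", round(m0_hat, 3))
```

The loop only ever smooths one-dimensional partial residuals, which is why the $nd \times nd$ inversion in (7.25) is never needed in practice.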

To add: Mammen, Linton, and Nielsen (1999, AoS, "The existence and asymptotic properties of a backfitting projection algorithm under weak conditions").

7.1.3 Generalized Additive Models: Logistic Regression

As Hastie and Tibshirani (1990) remark, the linear model is used for regression in a wide variety of contexts other than ordinary regression, including log-linear models, logistic regression, the proportional-hazards model for survival data, models for ordinal categorical responses, and transformation models. It is a convenient but crude first-order approximation to the regression function, and in many cases it is adequate. The additive model can be used to generalize all these models in an obvious way. For clarity, we focus on the logistic regression model.

In this setting the response variable $Y$ is dichotomous, such as yes/no, survived/died, or increase/decrease, and the data analysis is aimed at relating this outcome to the predictors. One quantity of interest is the proportion of outcomes as a function of the predictors (explanatory variables).

In linear modelling of binary data, the most popular approach is logistic regression, which models the logit of the response probability with a linear form

$$\mathrm{logit}(p(x)) \equiv \log\Big\{\frac{p(x)}{1 - p(x)}\Big\} = x'\beta, \qquad (7.26)$$

where $p(x) = P(Y = 1 \mid X = x)$. Alternatively, we can write

$$p(x) = \frac{\exp(x'\beta)}{1 + \exp(x'\beta)}. \qquad (7.27)$$

There are several reasons for its popularity, but the most compelling is that the logit model ensures that the proportions $p(x)$ lie in $(0, 1)$ (see (7.27)) without any constraints on the linear predictor $x'\beta$. We can generalize the model in (7.26) by replacing the linear predictor with an additive one,

$$\log\Big\{\frac{p(x)}{1 - p(x)}\Big\} = \alpha + \sum_{j=1}^{d} m_j(x_j), \qquad (7.28)$$

or that in (7.27) to

$$p(x) = L\Big(\alpha + \sum_{j=1}^{d} m_j(x_j)\Big), \qquad (7.29)$$

where $L(v) = \exp(v)/(1 + \exp(v))$ denotes the CDF of the standard logistic distribution. Thus (7.29) is a special case of the generalized additive model in (7.4) where $G$ is a strictly increasing function with a known functional form.

Insight from the Linear Logistic Regression Model To estimate the model (7.28), we can gain some insight from the linear logistic regression methodology. Maximum likelihood is the most popular method for estimating the linear logistic model. For the present problem the log-likelihood has the form

$$\ell(\beta) = \sum_{i=1}^{n}\Big[Y_i\log\big(p(X_i)\big) + (1 - Y_i)\log\big(1 - p(X_i)\big)\Big], \qquad (7.30)$$

where $p(X_i) = \exp(X_i'\beta)/(1 + \exp(X_i'\beta))$. The score equations

$$\frac{\partial\ell(\beta)}{\partial\beta} = \sum_{i=1}^{n} X_i\big[Y_i - p(X_i)\big] = 0 \qquad (7.31)$$

are nonlinear in the parameters and consequently one has to find the solution iteratively.

The Newton-Raphson iterative method can be expressed in an appealing form. Given the current estimate $\hat\beta$, we can estimate the probabilities $p(X_i)$ by $\hat p_i = \exp(X_i'\hat\beta)/\big(1 + \exp(X_i'\hat\beta)\big)$. We form the linearized response

$$Z_i = X_i'\hat\beta + \frac{Y_i - \hat p_i}{\hat p_i(1 - \hat p_i)}, \qquad (7.32)$$

where the quantity $Z_i$ represents the first-order Taylor series approximation to $\mathrm{logit}(Y_i)$ about the current estimate $\hat p_i$ (see footnote 3). Denote $e_i = (Y_i - \hat p_i)/[\hat p_i(1 - \hat p_i)]$. If $\hat\beta$, and hence $\hat p_i$, is held fixed, the variance of $e_i$ is $1/[\hat p_i(1 - \hat p_i)]$, and hence we choose the weights $w_i \equiv \hat p_i(1 - \hat p_i)$. Alternatively, we can verify that

$$E(e_i \mid X_i) = 0 \quad\text{and}\quad E(e_i^2 \mid X_i) = \frac{1}{p(X_i)\big(1 - p(X_i)\big)} \qquad (7.33)$$

in the extreme case where $\hat p_i = p(X_i)$. So when $\hat p_i$ approximates $p(X_i)$ well, we expect that

$$E(e_i \mid X_i) \approx 0 \quad\text{and}\quad E(e_i^2 \mid X_i) \approx \frac{1}{p(X_i)\big(1 - p(X_i)\big)} \approx \frac{1}{\hat p_i(1 - \hat p_i)}. \qquad (7.34)$$

Consequently, a new $\hat\beta$ can be obtained by a weighted linear regression of $Z_i$ on $X_i$ with weights $w_i \equiv \hat p_i(1 - \hat p_i)$. This is repeated until $\hat\beta$ converges.
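To make (7.30)-(7.32) concrete, the following minimal Python sketch (not from the original notes; the function name, stopping rule, and simulated data are illustrative assumptions) implements the Newton-Raphson/IRLS iteration for the linear logistic model by repeatedly regressing the linearized response on the regressors with weights $\hat p_i(1-\hat p_i)$.

```python
import numpy as np

def logistic_irls(X, y, n_iter=25, tol=1e-8):
    """Fit logit(p) = X beta by iteratively reweighted least squares (cf. (7.30)-(7.32))."""
    n, k = X.shape
    beta = np.zeros(k)
    for _ in range(n_iter):
        eta = X @ beta
        p = 1.0 / (1.0 + np.exp(-eta))          # current probability estimates
        w = p * (1.0 - p)                        # weights w_i = p_i (1 - p_i)
        z = eta + (y - p) / w                    # linearized response Z_i in (7.32)
        WX = X * w[:, None]                      # weighted least squares of z on X
        beta_new = np.linalg.solve(X.T @ WX, X.T @ (w * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

rng = np.random.default_rng(2)
n = 500
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
beta_true = np.array([-0.5, 1.0])
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-(X @ beta_true)))).astype(float)
print("IRLS estimate:", logistic_irls(X, y).round(3))
```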

Algorithm The above iterative algorithm lends itself ideally to the generalized additive model in (7.28). Define

$$Z_i = \tilde\alpha + \sum_{j=1}^{d}\tilde m_j(X_{ij}) + \frac{Y_i - \hat p_i}{\hat p_i(1 - \hat p_i)}, \qquad (7.35)$$

where $(\tilde\alpha, \tilde m_1, \ldots, \tilde m_d)$ are the current estimates of the additive model components and

$$\hat p_i = \frac{\exp\big(\tilde\alpha + \sum_{j=1}^{d}\tilde m_j(X_{ij})\big)}{1 + \exp\big(\tilde\alpha + \sum_{j=1}^{d}\tilde m_j(X_{ij})\big)}. \qquad (7.36)$$

Define the weights

$$w_i = \hat p_i(1 - \hat p_i). \qquad (7.37)$$

The new estimates of $\alpha$ and $m_j$ ($j = 1, \ldots, d$) are computed by fitting a weighted additive model to $Z_i$. Of course, this additive model fitting procedure is iterative as well. Fortunately, the functions from the previous step are good starting values for the next step. This procedure is called the local-scoring algorithm in the literature. The new estimates from each local-scoring step are monitored, and the iterations are stopped when their relative change is negligible.
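A compact Python sketch of the local-scoring idea follows (not from the original notes; the weighted Nadaraya-Watson smoother, bandwidth, clipping of the weights, and simulated data are illustrative assumptions): an outer IRLS loop builds the adjusted response (7.35) and weights (7.37), and an inner weighted backfitting loop fits the additive components.

```python
import numpy as np

def nw_weighted_fit(x, z, w, h):
    """Weighted Nadaraya-Watson fit of z on a scalar regressor x with observation weights w."""
    u = (x[:, None] - x[None, :]) / h
    kern = np.exp(-0.5 * u ** 2) * w[None, :]
    return (kern * z[None, :]).sum(axis=1) / kern.sum(axis=1)

def local_scoring(X, y, h=0.3, n_outer=20, n_inner=10):
    """Local scoring for the additive logistic model (7.28): IRLS outer loop, weighted backfitting inner loop."""
    n, d = X.shape
    alpha = np.log(y.mean() / (1 - y.mean()))     # start from the marginal log-odds
    g = np.zeros((n, d))
    for _ in range(n_outer):
        eta = alpha + g.sum(axis=1)
        p = 1.0 / (1.0 + np.exp(-eta))
        w = np.clip(p * (1.0 - p), 1e-4, None)    # weights (7.37), clipped for numerical stability
        z = eta + (y - p) / w                     # adjusted response (7.35)
        alpha = np.average(z, weights=w)
        for _ in range(n_inner):                  # weighted backfitting of z on X
            for j in range(d):
                partial = z - alpha - g.sum(axis=1) + g[:, j]
                gj = nw_weighted_fit(X[:, j], partial, w, h)
                g[:, j] = gj - np.average(gj, weights=w)
    return alpha, g

rng = np.random.default_rng(7)
n = 400
X = rng.uniform(-1, 1, (n, 2))
p_true = 1.0 / (1.0 + np.exp(-(np.sin(np.pi * X[:, 0]) + X[:, 1])))
y = (rng.uniform(size=n) < p_true).astype(float)
alpha_hat, g_hat = local_scoring(X, y)
print("estimated intercept:", round(alpha_hat, 3))
```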

3. Pretending that $\hat p_i$ is bounded away from 0 and 1 and that $Y_i$ is close to $\hat p_i$, we have, by a first-order Taylor expansion, that
$$\mathrm{logit}(Y_i) = \log\frac{Y_i}{1 - Y_i} \approx \log\frac{\hat p_i}{1 - \hat p_i} + \frac{1}{\hat p_i(1 - \hat p_i)}\,(Y_i - \hat p_i).$$

7.2 The Marginal Integration Method

7.2.1 The Marginal Integration Estimator

Let $\{(X_{i1}, \ldots, X_{id}, Y_i)\}_{i=1}^{n}$ be a random sample from the additive model

$$Y_i = m_0 + \sum_{j=1}^{d} m_j(X_{ij}) + \varepsilon_i, \qquad (7.38)$$

where $E(\varepsilon_i \mid X_{i1}, \ldots, X_{id}) = 0$, $E(\varepsilon_i^2 \mid X_{i1}, \ldots, X_{id}) = \sigma^2(X_{i1}, \ldots, X_{id})$, and $\{m_j(\cdot)\}_{j=1}^{d}$ is a set of unknown functions satisfying $E[m_j(X_{ij})] = 0$, $j = 1, \ldots, d$. We follow Chen et al. (1995) to define the marginal integration estimator.

Let $f_j(\cdot)$ be the marginal density of $X_{ij}$, $j = 1, \ldots, d$. Let $m(x_1, \ldots, x_d) = m_0 + \sum_{j=1}^{d} m_j(x_j)$ be the conditional mean function. For any $\alpha \in \{1, \ldots, d\}$ we define

$$X_{i\underline\alpha} = (X_{i1}, \ldots, X_{i,\alpha-1}, X_{i,\alpha+1}, \ldots, X_{id})'.$$

The joint density of $X_{i\underline\alpha}$ is denoted $f_{\underline\alpha}(x_{\underline\alpha})$. Then, for a fixed $x = (x_1, \ldots, x_d)'$, the functional

$$\int m(x)\, f_{\underline\alpha}(x_{\underline\alpha})\prod_{j \ne \alpha} dx_j \qquad (7.39)$$

is $m_0 + m_\alpha(x_\alpha)$.

Let $K(\cdot)$ and $L(\cdot)$ be kernel functions with compact support. Let

$$K_h(\cdot) = h^{-1}K(\cdot/h) \quad\text{and}\quad L_g(\cdot) = g^{-(d-1)}L(\cdot/g). \qquad (7.40)$$

Using the Nadaraya-Watson (NW) kernel method to estimate the mean function $m(\cdot)$, we average over the observations to obtain the following estimator. For $1 \le \alpha \le d$ and any $x_\alpha$ in the domain of $m_\alpha(\cdot)$, define

$$\hat m_\alpha(x_\alpha) = \frac{1}{n}\sum_{i=1}^{n}\tilde m(X_{i1}, \ldots, X_{i,\alpha-1}, x_\alpha, X_{i,\alpha+1}, \ldots, X_{id})$$
$$= \frac{1}{n}\sum_{i=1}^{n}\left[\frac{\sum_{l=1}^{n} K_h(x_\alpha - X_{l\alpha})\, L_g(X_{i\underline\alpha} - X_{l\underline\alpha})\, Y_l}{\sum_{l=1}^{n} K_h(x_\alpha - X_{l\alpha})\, L_g(X_{i\underline\alpha} - X_{l\underline\alpha})}\right] \qquad (7.41)$$
$$= \sum_{l=1}^{n}\left[\frac{1}{n}\sum_{i=1}^{n}\frac{K_h(x_\alpha - X_{l\alpha})\, L_g(X_{i\underline\alpha} - X_{l\underline\alpha})}{\sum_{s=1}^{n} K_h(x_\alpha - X_{s\alpha})\, L_g(X_{i\underline\alpha} - X_{s\underline\alpha})}\right] Y_l.$$

If the regressors were independent, we might use $\sum_{i=1}^{n} K_h(x_\alpha - X_{i\alpha})\, Y_i\big/\sum_{i=1}^{n} K_h(x_\alpha - X_{i\alpha})$ to estimate $m_0 + m_\alpha(x_\alpha)$. This is a one-dimensional NW estimator. Nevertheless, this estimator has a larger variance than the above estimator even in this restricted situation; see Härdle and Tsybakov (1995).
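A minimal Python sketch of the marginal integration estimator (7.41) follows (not from the original notes; the product Gaussian kernels, bandwidths, and simulated data are illustrative assumptions — the bandwidth normalizing constants cancel in the ratio, so they are omitted).

```python
import numpy as np

def gauss(u):
    return np.exp(-0.5 * u ** 2)

def marginal_integration(X, y, alpha, x_grid, h=0.2, g=0.3):
    """Marginal integration estimate of m_0 + m_alpha(x) as in (7.41)."""
    n, d = X.shape
    rest = [j for j in range(d) if j != alpha]
    # L_g applied to the 'rest' coordinates, evaluated between all observation pairs (i, l)
    Lw = np.ones((n, n))
    for j in rest:
        Lw *= gauss((X[:, j][:, None] - X[:, j][None, :]) / g)
    out = np.empty(len(x_grid))
    for k, x0 in enumerate(x_grid):
        Kw = gauss((x0 - X[:, alpha]) / h)          # K_h(x0 - X_{l,alpha}), indexed by l
        num = (Kw[None, :] * Lw) @ y                 # for each i: sum_l K L Y_l
        den = (Kw[None, :] * Lw).sum(axis=1)         # for each i: sum_l K L
        out[k] = np.mean(num / den)                  # average the NW fits over i
    return out

rng = np.random.default_rng(3)
n = 400
X = rng.uniform(-1, 1, (n, 2))
y = np.sin(np.pi * X[:, 0]) + X[:, 1] ** 3 + 0.1 * rng.standard_normal(n)
grid = np.linspace(-0.8, 0.8, 5)
print(marginal_integration(X, y, alpha=0, x_grid=grid).round(2))
```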

Let $f(\cdot)$ denote the joint density of $X_i = (X_{i1}, \ldots, X_{id})'$. The following assumptions are modified from Chen et al. (1995).

Assumptions

A1. $\{(X_i, Y_i)\}$ is IID with $E(Y_i^4) < \infty$.

A2. The densities $f$ and $f_{\underline\alpha}$ are bounded, Lipschitz continuous, and bounded away from zero. The function $m$ has Lipschitz continuous derivatives.

A3. The conditional variance function $\sigma^2(x) = E(\varepsilon_i^2 \mid X_i = x)$ is Lipschitz continuous.

A4. The kernel function $K(\cdot)$ is a bounded nonnegative second-order kernel that is compactly supported and Lipschitz continuous, with $\mu_{02} = \int K^2(u)\, du < \infty$ and $\mu_{21} = \int u^2 K(u)\, du < \infty$.

A5. The kernel function $L$ is bounded, compactly supported, and Lipschitz continuous. $L$ is a $q$th order kernel with $\|L\|_2^2 = \int L^2(u)\, du < \infty$.

A6. As $n \to \infty$, $g \to 0$, $h = c_0 n^{-1/5}$, and $ng^{d-1}/\log n \to \infty$.

Theorem 7.2 Suppose that Assumptions A1-A6 hold and the order $q$ of $L$ satisfies $q > (d-1)/2$. Then

$$n^{2/5}\big\{\hat m_\alpha(x_\alpha) - m_\alpha(x_\alpha) - m_0\big\} \stackrel{d}{\to} N\big(B_\alpha(x_\alpha),\ v_\alpha(x_\alpha)\big), \qquad (7.42)$$

where

$$B_\alpha(x_\alpha) = c_0^{2}\,\mu_{21}\left\{\frac{1}{2}m_\alpha''(x_\alpha) + m_\alpha'(x_\alpha)\int\frac{\partial f(x)/\partial x_\alpha}{f(x)}\, f_{\underline\alpha}(x_{\underline\alpha})\, dx_{\underline\alpha}\right\},$$

$$v_\alpha(x_\alpha) = c_0^{-1}\,\mu_{02}\int\frac{\sigma^2(x)\, f_{\underline\alpha}^{2}(x_{\underline\alpha})}{f(x)}\, dx_{\underline\alpha}.$$

Proof. See Chen et al. (1995).

Theorem 7.2 says that the rate of convergence to the asymptotic normal limit distribution does not suffer from the "curse of dimensionality". Nevertheless, to achieve this rate of convergence, we must impose some restrictions on the bandwidth sequences and the choices of kernels. Note that the above bandwidth conditions do not exclude the optimal one-dimensional smoothing bandwidth rate $n^{-1/5}$ for both $h$ and $g$ when $d \le 4$. More importantly, we can take $g = O(n^{-1/5})$. When $d \ge 5$, we can no longer use $g$ at the rate $n^{-1/5}$, and we need to use a higher-order kernel $L$ to reduce the bias associated with the use of $g$.

Estimation of the regression surface Define

$$\hat m(x) = \sum_{\alpha=1}^{d}\hat m_\alpha(x_\alpha) - (d-1)\,\bar Y, \qquad (7.43)$$

where $\bar Y = n^{-1}\sum_{i=1}^{n} Y_i$. The following theorem gives the asymptotic distribution of $\hat m(x)$.

Theorem 7.3 Under the conditions of Theorem 7.2,

$$n^{2/5}\big\{\hat m(x) - m(x)\big\} \stackrel{d}{\to} N\big(B(x),\ v(x)\big), \qquad (7.44)$$

where $B(x) = \sum_{\alpha=1}^{d} B_\alpha(x_\alpha)$ and $v(x) = \sum_{\alpha=1}^{d} v_\alpha(x_\alpha)$.

Proof. See Chen et al. (1995).

Theorem 7.3 says that the covariance between $\hat m_\alpha(x_\alpha)$ and $\hat m_\beta(x_\beta)$ for $\alpha \ne \beta$ is asymptotically negligible, i.e., it is of smaller order than the variances of the individual component function estimators.

7.2.2 Marginal Integration Estimation of Additive Models with Known Links

For many situations, especially binary and survival time data, the model (7.38) may not be appropriate. In the parametric case, a more appropriate modelling framework is provided by the generalized linear models of McCullagh and Nelder (1989). Hastie and Tibshirani (1991) extend these ideas to nonparametric modelling. In the nonparametric case, the model can be fully or partially specified. In the fully specified case, the conditional distribution of $Y$ given $X$ is assumed to belong to an exponential family with known link function $G$ and mean function $m$ such that

$$G\big(m(x)\big) = m_0 + \sum_{\alpha=1}^{d} m_\alpha(x_\alpha), \qquad (7.45)$$

where $E[m_\alpha(X_\alpha)] = 0$, $\alpha = 1, \ldots, d$. This model is usually called a generalized additive model in the literature. It implies, for example, that the variance is functionally related to the mean. In the partially specified case, we keep the form (7.45) but do not restrict ourselves to the exponential family. In this latter case, the variance function is unrestricted.

Example 7.4 Clearly, when $G$ is the identity function, we have the additive regression model examined above. Other examples include the logit and probit link functions for binary data, the logarithm transform for Poisson count data (McCullagh and Nelder, 1989, p. 30), and the Box-Cox transformation (e.g., $y \to (y^{\lambda} - 1)/\lambda$). It also includes cases where the regression function is multiplicative.

The backfitting procedure in conjunction with Fisher scoring is widely used to estimate (7.45) (Hastie and Tibshirani, 1991, p. 141). It exploits the likelihood structure. Nevertheless, it is even less tractable from a statistical perspective when $G$ is not the identity, because the estimate is not linear in $Y$.

Estimation of the Additive Components Linton and Härdle (1996) propose a marginal integration-based method for estimating the components in (7.45). The main advantage of their method is that one can derive its asymptotic properties. They also suggest how to take into account the additional information provided by the exponential family structure.

Note that under the additive structure (7.45),

$$\psi_\alpha(x_\alpha) \equiv \int G\big\{m(x_\alpha, x_{\underline\alpha})\big\}\, f_{\underline\alpha}(x_{\underline\alpha})\, dx_{\underline\alpha} = m_0 + m_\alpha(x_\alpha). \qquad (7.46)$$

The general strategy is to replace both $m(x_\alpha, x_{\underline\alpha})$ and $f_{\underline\alpha}(x_{\underline\alpha})$ in (7.46) by their estimates. We estimate $m(x_\alpha, X_{i\underline\alpha})$ by

$$\tilde m(x_\alpha, X_{i\underline\alpha}) = \sum_{l=1}^{n} W_{li}(x_\alpha)\, Y_l, \qquad (7.47)$$

where

$$W_{li}(x_\alpha) = \frac{K_h(x_\alpha - X_{l\alpha})\, L_g(X_{i\underline\alpha} - X_{l\underline\alpha})}{\sum_{s=1}^{n} K_h(x_\alpha - X_{s\alpha})\, L_g(X_{i\underline\alpha} - X_{s\underline\alpha})}. \qquad (7.48)$$

$\psi_\alpha(x_\alpha)$ can then be estimated by its sample analogue:

$$\hat\psi_\alpha(x_\alpha) = n^{-1}\sum_{i=1}^{n} G\big\{\tilde m(x_\alpha, X_{i\underline\alpha})\big\}. \qquad (7.49)$$

When $G$ is the identity function, $\hat\psi_\alpha(x_\alpha)$ is linear in $Y_l$, i.e., $\hat\psi_\alpha(x_\alpha) = \sum_{l=1}^{n} W_l(x_\alpha)\, Y_l$ (cf. equation (7.41)), where

$$W_l(x_\alpha) = n^{-1}\sum_{i=1}^{n} W_{li}(x_\alpha).$$

In general, $\hat\psi_\alpha(x_\alpha)$ is a nonlinear function of $Y_1, \ldots, Y_n$.

The above procedure is carried out for each $\alpha = 1, \ldots, d$, so we can obtain estimates of each $\psi_\alpha(\cdot)$ evaluated at each sample point. Let

$$\hat m_0 = (nd)^{-1}\sum_{\alpha=1}^{d}\sum_{i=1}^{n}\hat\psi_\alpha(X_{i\alpha}), \qquad \hat m_\alpha(x_\alpha) = \hat\psi_\alpha(x_\alpha) - \hat m_0.$$

We re-estimate $m(x)$ by

$$\hat m(x) = G^{-1}\Big(\hat m_0 + \sum_{\alpha=1}^{d}\hat m_\alpha(x_\alpha)\Big),$$

where $G^{-1}$ is the inverse function of $G$. Let $\hat\varepsilon_i = Y_i - \hat m(X_i)$ be the additive regression residuals, which estimate the errors $\varepsilon_i = Y_i - m(X_i)$. These residuals can be used to test the additive structure, i.e., to look for possible interactions ignored in the simple additive structure. When (7.45) is true, $\hat\varepsilon_i$ should be approximately uncorrelated with any function of $X_i$.
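A minimal Python sketch of (7.47)-(7.49) for a known link follows (not from the original notes; the Gaussian kernels, bandwidths, the clipping inside the logit link, and the simulated binary-response data are illustrative assumptions).

```python
import numpy as np

def gauss(u):
    return np.exp(-0.5 * u ** 2)

def psi_alpha(X, y, alpha, x_grid, G, h=0.2, g=0.3):
    """Marginal integration with a known link: psi_alpha(x) = n^{-1} sum_i G(m_tilde(x, X_i,rest)), cf. (7.49)."""
    n, d = X.shape
    rest = [j for j in range(d) if j != alpha]
    Lw = np.ones((n, n))
    for j in rest:
        Lw *= gauss((X[:, j][:, None] - X[:, j][None, :]) / g)
    out = np.empty(len(x_grid))
    for k, x0 in enumerate(x_grid):
        Kw = gauss((x0 - X[:, alpha]) / h)
        m_tilde = ((Kw[None, :] * Lw) @ y) / (Kw[None, :] * Lw).sum(axis=1)  # NW fit at (x0, X_i,rest)
        out[k] = np.mean(G(m_tilde))                                          # average the linked fits over i
    return out

# example with the logit link G(m) = log(m / (1 - m)) for a binary response
rng = np.random.default_rng(4)
n = 500
X = rng.uniform(-1, 1, (n, 2))
p = 1.0 / (1.0 + np.exp(-(np.sin(np.pi * X[:, 0]) + X[:, 1])))
y = (rng.uniform(size=n) < p).astype(float)
G = lambda m: np.log(np.clip(m, 1e-3, 1 - 1e-3) / (1 - np.clip(m, 1e-3, 1 - 1e-3)))
print(psi_alpha(X, y, alpha=0, x_grid=np.linspace(-0.5, 0.5, 3), G=G).round(2))
```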

We modify Assumption A6 as follows.

A6*. As $n \to \infty$, $g \to 0$, $h = c_0 n^{-1/5}$, $n^{2/5} g^{q} \to 0$, and $n^{2/5} g^{d-1} \to \infty$.

Theorem 7.4 Suppose that Assumptions A1-A5 and A6* hold and the order $q$ of $L$ satisfies $q \ge d-1$. Assume that $G$ is twice continuously differentiable. Then

$$n^{2/5}\big\{\hat m_\alpha(x_\alpha) - m_\alpha(x_\alpha) - m_0\big\} \stackrel{d}{\to} N\big(B_\alpha(x_\alpha),\ v_\alpha(x_\alpha)\big), \qquad (7.50)$$

where

$$B_\alpha(x_\alpha) = c_0^{2}\,\mu_{21}\left\{\frac{1}{2}m_\alpha''(x_\alpha)\int G'\big(m(x)\big)\, f_{\underline\alpha}(x_{\underline\alpha})\, dx_{\underline\alpha} + m_\alpha'(x_\alpha)\int G'\big(m(x)\big)\,\frac{\partial\ln f(x)}{\partial x_\alpha}\, f_{\underline\alpha}(x_{\underline\alpha})\, dx_{\underline\alpha}\right\},$$

$$v_\alpha(x_\alpha) = c_0^{-1}\,\mu_{02}\int\big[G'\big(m(x)\big)\big]^{2}\,\frac{\sigma^2(x)\, f_{\underline\alpha}^{2}(x_{\underline\alpha})}{f(x)}\, dx_{\underline\alpha}.$$

Proof. See Linton and Härdle (1996).

As Linton and Härdle (1996) remark, we can use the local linear smoother as a pilot in place of the NW estimator. In this case, the asymptotic variance of $\hat m_\alpha(x_\alpha)$ will be the same, but the bias will take the simpler form

$$B_\alpha(x_\alpha) = c_0^{2}\,\mu_{21}\left\{\frac{1}{2}m_\alpha''(x_\alpha)\int G'\big(m(x)\big)\, f_{\underline\alpha}(x_{\underline\alpha})\, dx_{\underline\alpha}\right\}.$$

To construct an asymptotic confidence interval for $\hat m_\alpha(x_\alpha)$, we need to estimate $B_\alpha(x_\alpha)$ and $v_\alpha(x_\alpha)$. It is easy to show that $v_\alpha(x_\alpha)$ can be consistently estimated by

$$\hat v_\alpha(x_\alpha) = \sum_{l=1}^{n}\hat W_l^{2}(x_\alpha)\,\hat\varepsilon_l^{2}, \quad\text{where}\quad \hat W_l(x_\alpha) = \frac{1}{n}\sum_{i=1}^{n} G'\big\{\tilde m(x_\alpha, X_{i\underline\alpha})\big\}\, W_{li}(x_\alpha).$$

The formula for the estimate of $B_\alpha(x_\alpha)$ is quite complicated and we thus omit it. Nevertheless, in the case of undersmoothing, i.e., $h = o(n^{-1/5})$, it suffices to estimate $v_\alpha(x_\alpha)$. Alternatively, we can use bootstrap methods (e.g., the wild bootstrap) to approximate the desired asymptotic confidence interval.

Estimation of the Regression Surface After we obtain estimates of the additive components, we can obtain an estimate of the regression function. Note that we do not limit ourselves to the exponential family structure discussed earlier.

Theorem 7.5 Suppose the conditions of Theorem 7.4 hold and $F \equiv G^{-1}$ is twice continuously differentiable. Then

$$n^{2/5}\big\{\hat m(x) - m(x)\big\} \stackrel{d}{\to} N\big(B(x),\ v(x)\big), \qquad (7.51)$$

where $B(x) = F'\big(m_0 + \sum_{\alpha=1}^{d} m_\alpha(x_\alpha)\big)\sum_{\alpha=1}^{d} B_\alpha(x_\alpha)$ and $v(x) = \big[F'\big(m_0 + \sum_{\alpha=1}^{d} m_\alpha(x_\alpha)\big)\big]^{2}\sum_{\alpha=1}^{d} v_\alpha(x_\alpha)$.

Proof. The result follows from Theorem 7.4, the delta method, and the fact that $\hat m_\alpha(x_\alpha)$ and $\hat m_\beta(x_\beta)$ are asymptotically independent for $\alpha \ne \beta$.

Theorem 7.5 says that the rate of convergence of $\hat m(x)$ is free from the "curse of dimensionality", as desired.

Remark. Yang, Sperlich, and Härdle (2003, JSPI) study the derivative estimation of generalized additive models via the kernel method. In addition, they study hypothesis testing on the derivatives.

7.2.3 Efficient Estimation of Generalized Additive Models

Recently, Linton (2000, ET) defines new procedures for estimating generalized additive nonparametric regression models that are more efficient than those of Linton and Härdle (1996). He considers criterion functions based on the linear exponential family. When the linear exponential family specification is correct, the new estimator achieves certain oracle bounds. For brevity, we refer the reader to Linton (2000).

7.3 Additive Partially Linear Models

In this section we introduce two methods for estimating additive partially linear models. One is the series method of Li (2000, International Economic Review) and the other is the kernel method of Fan and Li (2003, Statistica Sinica).

7.3.1 Series Estimation

A typical additive partially linear model is of the form

$$Y_i = X_i'\beta_0 + g_1(Z_{i1}) + \cdots + g_d(Z_{id}) + u_i, \qquad (7.52)$$

where $E(u_i \mid X_i, Z_{i1}, \ldots, Z_{id}) = 0$, $X_i$ is a $p \times 1$ vector of random variables that does not contain a constant term, $\beta_0$ is a $p \times 1$ vector of unknown parameters, $Z_{i\alpha}$ is of dimension $d_\alpha$ ($d_\alpha \ge 1$, $\alpha = 1, \ldots, d$), and $g_\alpha(\cdot)$, $\alpha = 1, \ldots, d$, are unknown smooth functions. Denote by $Z_i$ the non-overlapping variables obtained from $(Z_{i1}, \ldots, Z_{id})$; $Z_i$ is of dimension $q$ with $\max_\alpha d_\alpha \le q \le \sum_{\alpha=1}^{d} d_\alpha$. In practice, the most widely used case is where $d_\alpha = 1$ for all $\alpha$, so that $Z_{i\alpha}$ is a scalar random variable and $q = d$.

Clearly, the individual functions $g_\alpha(\cdot)$ ($\alpha = 1, \ldots, d$) are not identified without some identification conditions. In the kernel estimation literature, it is convenient to impose $E[g_\alpha(Z_{i\alpha})] = 0$ for all $\alpha = 2, \ldots, d$. Nevertheless, such conditions are not easily imposed in series estimation. Instead, here it is more convenient to impose

$$g_\alpha(0) = 0, \quad \alpha = 2, \ldots, d. \qquad (7.53)$$

To construct a series estimator for the unknown parameters in the model, we use Li's (2000) definition of the class of additive functions.

Definition. A function $g(z)$ is said to belong to an additive class of functions $\mathcal G$ ($g \in \mathcal G$) if
(i) $g(z) = \sum_{\alpha=1}^{d} g_\alpha(z_\alpha)$, where each $g_\alpha(z_\alpha)$ is continuous on its support $\mathcal Z_\alpha$, which is a compact subset of $\mathbb R^{d_\alpha}$ ($\alpha = 1, \ldots, d$);
(ii) $\sum_{\alpha=1}^{d} E\big[g_\alpha(Z_{i\alpha})^2\big] < \infty$;
(iii) $g_\alpha(0) = 0$ for $\alpha = 2, \ldots, d$.
When $g(z)$ is a vector-valued function, we say that $g \in \mathcal G$ if each of its components belongs to $\mathcal G$.

In vector notation, we can write (7.52) as

$$Y = X\beta_0 + G_1 + \cdots + G_d + u = X\beta_0 + G + u, \qquad (7.54)$$

where $Y = (Y_1, \ldots, Y_n)'$, $u = (u_1, \ldots, u_n)'$, $G_\alpha = \big(g_\alpha(Z_{1\alpha}), \ldots, g_\alpha(Z_{n\alpha})\big)'$, $X = (X_1, \ldots, X_n)'$, and $G = G_1 + \cdots + G_d$.

For $\alpha = 1, \ldots, d$, we shall use a linear combination of $L_\alpha$ functions, $p^{L_\alpha}(z_\alpha) = [p_1(z_\alpha), \ldots, p_{L_\alpha}(z_\alpha)]'$, to approximate $g_\alpha(z_\alpha)$. Let $p^{L}(z) = [p^{L_1}(z_1)', \ldots, p^{L_d}(z_d)']'$, where $L = \sum_{\alpha=1}^{d} L_\alpha$. A linear combination of $p^{L}(z)$ forms an approximating function for $g(z) \equiv g(z_1, \ldots, z_d) = \sum_{\alpha=1}^{d} g_\alpha(z_\alpha)$. The approximating function has the following properties: (1) it belongs to $\mathcal G$; and (2) as $L_\alpha$ grows for all $\alpha = 1, \ldots, d$, there is a linear combination of $p^{L}(z)$ that can approximate any $g \in \mathcal G$ arbitrarily well in the mean squared error sense.

Define

$$p_i = \big[p^{L_1}(Z_{i1})', \ldots, p^{L_d}(Z_{id})'\big]' \quad (i = 1, \ldots, n), \qquad P = (p_1, \ldots, p_n)'. \qquad (7.55)$$

Note that $p_i$ is of dimension $L \times 1$ and $P$ is of dimension $n \times L$. Define $M = P(P'P)^{-}P'$, where $(P'P)^{-}$ is the symmetric generalized inverse of $P'P$. For a matrix $A$ with $n$ rows, let $\tilde A = MA$. If we premultiply both sides of (7.54) by $M$, then we have

$$\tilde Y = \tilde X\beta_0 + \tilde G + \tilde u. \qquad (7.56)$$

Subtracting (7.56) from (7.54) gives

$$Y - \tilde Y = (X - \tilde X)\beta_0 + (G - \tilde G) + (u - \tilde u). \qquad (7.57)$$

So we can estimate $\beta_0$ by regressing $Y - \tilde Y$ on $X - \tilde X$ to obtain

$$\hat\beta = \Big[(X - \tilde X)'(X - \tilde X)\Big]^{-}(X - \tilde X)'(Y - \tilde Y). \qquad (7.58)$$

After obtaining $\hat\beta$, we can estimate $g(z) = \sum_{\alpha=1}^{d} g_\alpha(z_\alpha)$ by $\hat g(z) = p^{L}(z)'\hat\gamma$, where

$$\hat\gamma = (P'P)^{-}P'\big(Y - X\hat\beta\big). \qquad (7.59)$$

$g_\alpha(z_\alpha)$ can also be estimated easily based upon $\hat\gamma$. For example, $\hat g_1(z_1) = p^{L_1}(z_1)'\hat\gamma_1$, where $\hat\gamma_1$ is the collection of the first $L_1$ elements of $\hat\gamma$.

Define $\theta(z) \equiv E(X_i \mid Z_i = z)$ and $\sigma^2(x, z) \equiv \mathrm{Var}(u_i \mid X_i = x, Z_i = z)$. We use $h(z)$ to denote the projection of $\theta(z)$ onto $\mathcal G$. That is, $h(z) = E_{\mathcal G}[\theta(z)]$, where by the definition of $h(\cdot)$ we know that $h(z)$ is an additive function, i.e., $h(z) = \sum_{\alpha=1}^{d} h_\alpha(z_\alpha) \in \mathcal G$, and $h(\cdot)$ solves the minimization problem

$$E\big\{[\theta(Z_i) - h(Z_i)][\theta(Z_i) - h(Z_i)]'\big\} = \inf_{f = \sum_{\alpha=1}^{d} f_\alpha \in \mathcal G} E\Big\{\Big[\theta(Z_i) - \sum_{\alpha=1}^{d} f_\alpha(Z_{i\alpha})\Big]\Big[\theta(Z_i) - \sum_{\alpha=1}^{d} f_\alpha(Z_{i\alpha})\Big]'\Big\}.$$

Noting that $\theta(\cdot)$ is of dimension $p \times 1$, we can write $h(z) = \big(h^{(1)}(z), \ldots, h^{(p)}(z)\big)'$. To state the main result, we need to make some assumptions.

Assumptions.

A1. (i) $(Y_i, X_i, Z_i)$, $i = 1, \ldots, n$, are IID, and the support of $(X_i, Z_i)$ is a compact subset of $\mathbb R^{p+q}$; (ii) both $\theta(z) \equiv E(X_i \mid Z_i = z)$ and $\sigma^2(x, z) \equiv \mathrm{Var}(u_i \mid X_i = x, Z_i = z)$ are bounded functions on the support of $(X_i, Z_i)$.

A2. (i) For every $L$ there is a nonsingular matrix $B$ such that, for $P^{L}(z) = Bp^{L}(z)$, the smallest eigenvalue of $E[P^{L}(Z_i)P^{L}(Z_i)']$ is bounded away from zero uniformly in $L$; (ii) there is a sequence of constants $\zeta_0(L)$ satisfying $\sup_{z \in \mathcal Z}\|P^{L}(z)\| \le \zeta_0(L)$ and $L = L(n)$ such that $\zeta_0(L)^2 L/n \to 0$ as $n \to \infty$, where $\mathcal Z$ is the support of $Z_i$.

A3. (i) For each $f = \sum_{\alpha=1}^{d} f_\alpha \in \{g, h^{(1)}, \ldots, h^{(p)}\}$, there exist some $\delta_\alpha > 0$ and $\gamma_f = \gamma_f(L) = (\gamma_{f1}', \ldots, \gamma_{fd}')'$, $\alpha = 1, \ldots, d$, such that $\sup_{z \in \mathcal Z}\big|f(z) - p^{L}(z)'\gamma_f\big| = O\big(\sum_{\alpha=1}^{d} L_\alpha^{-\delta_\alpha}\big)$ as $\min_{1 \le \alpha \le d} L_\alpha \to \infty$; (ii) $\sqrt{n}\sum_{\alpha=1}^{d} L_\alpha^{-\delta_\alpha} \to 0$ as $n \to \infty$.

A4. $\Phi = E\big\{[X_i - h(Z_i)][X_i - h(Z_i)]'\big\}$ is positive definite.

Assumption A1 is quite standard in the literature on estimating additive models. Assumption A2 ensures that $P'P/n$ is asymptotically nonsingular. While Assumptions A2-A3 are not primitive conditions, it is known that many series functions satisfy them. Newey (1997) gives primitive conditions for power series and splines such that Assumptions A2-A3 hold. For power series, $\zeta_0(L) = O(L)$; for B-splines, $\zeta_0(L) = O(\sqrt{L})$. For $\alpha = 1, \ldots, d$, if $g_\alpha(\cdot)$ is continuously differentiable of order $c_\alpha$ on its support, then $\delta_\alpha = c_\alpha/d_\alpha$.

The following theorem states the asymptotic property of $\hat\beta$.

Theorem 7.6 Under Assumptions A1-A4, we have

$$\sqrt{n}\,\big(\hat\beta - \beta_0\big) \stackrel{d}{\to} N\big(0,\ \Phi^{-1}\Psi\Phi^{-1}\big),$$

where $\Psi = E\big[\sigma^2(X_i, Z_i)\,\eta_i\eta_i'\big]$ and $\eta_i = X_i - h(Z_i)$.

For statistical inference, we need consistent estimates of $\Phi$ and $\Psi$. Li (2000) shows that we can estimate them consistently by

$$\hat\Phi = n^{-1}\sum_{i=1}^{n}\big(X_i - \tilde X_i\big)\big(X_i - \tilde X_i\big)' \quad\text{and}\quad \hat\Psi = n^{-1}\sum_{i=1}^{n}\hat u_i^{2}\,\big(X_i - \tilde X_i\big)\big(X_i - \tilde X_i\big)',$$

where $\tilde X_i'$ is the $i$th row of $\tilde X = MX$ and $\hat u_i = Y_i - X_i'\hat\beta - \hat g(Z_i)$.

Li (2000) also gives the rate of convergence of $\hat g(z) = p^{L}(z)'\hat\gamma$ to $g(z) = \sum_{\alpha=1}^{d} g_\alpha(z_\alpha)$.

Theorem 7.7 Under Assumptions A1-A4, we have

(i) $\sup_{z \in \mathcal Z}\big|\hat g(z) - g(z)\big| = O_p\Big(\zeta_0(L)\big[\sqrt{L/n} + \sum_{\alpha=1}^{d} L_\alpha^{-\delta_\alpha}\big]\Big)$;

(ii) $n^{-1}\sum_{i=1}^{n}\big[\hat g(Z_i) - g(Z_i)\big]^2 = O_p\Big(L/n + \sum_{\alpha=1}^{d} L_\alpha^{-2\delta_\alpha}\Big)$;

(iii) $\int\big[\hat g(z) - g(z)\big]^2\, dF(z) = O_p\Big(L/n + \sum_{\alpha=1}^{d} L_\alpha^{-2\delta_\alpha}\Big)$, where $F(\cdot)$ is the CDF of $Z_i$.

The properties of $\hat g_\alpha(z_\alpha)$ are similar and thus omitted for brevity.

7.3.2 Kernel Method

The property of b () is similar and thus omitted for brevity.7.3.2 Kernel Method

We now study the kernel method of estimating the additive partially linear models. Like Fan and Li

(2003), we consider the additive partially linear model

= 0 + 0 + 1 (1) + · · ·+ () + (7.60)

where (| 1 ) = 0 is a ×1 vector of random variables that does not contain a constantterm, 0 is a scalar parameter, =

¡1

¢0is a × 1 vector of unknown parameters; the 0 are

univariate continuous random variables, and (·) = 1 are unknown smooth functions.Let

= (1 −1 +1 )

where is removed from =(1 ). Define

¡¢= 1 (1) + + −1 (−1) + +1 (+1) + + ()

Then we can rewrite (7.60) as

= 0 + 0 + () +

¡

¢+ (7.61)

Fan, Härdle, and Mammen (1998) consider the case where is a × 1 vector of discrete variables andsuggest two ways of estimating model (7.61). In neither method did they make full use of the information

that enters the regression function linearly. Motivated by this observation, Fan and Li (2003) consider

a two-stage estimation procedure which applies to the case where contains both discrete and continuous

elements and makes full use of the information that enters the regression function linearly.

For $\alpha = 1, \ldots, d$, define

$$m_\alpha(z_\alpha, z_{\underline\alpha}) = E\big[Y_i \mid Z_{i\alpha} = z_\alpha, Z_{i\underline\alpha} = z_{\underline\alpha}\big], \qquad m_\alpha(z_\alpha) = E_{Z_{\underline\alpha}}\big[m_\alpha(z_\alpha, Z_{i\underline\alpha})\big],$$
$$\theta_\alpha(z_\alpha, z_{\underline\alpha}) = E\big[X_i \mid Z_{i\alpha} = z_\alpha, Z_{i\underline\alpha} = z_{\underline\alpha}\big], \qquad \theta_\alpha(z_\alpha) = E_{Z_{\underline\alpha}}\big[\theta_\alpha(z_\alpha, Z_{i\underline\alpha})\big].$$

Denote $m_{i\alpha} = m_\alpha(Z_{i\alpha})$ and $\theta_{i\alpha} = \theta_\alpha(Z_{i\alpha})$. Then taking conditional expectations on both sides of (7.61) gives

$$m_\alpha(z_\alpha, z_{\underline\alpha}) = \beta_0 + \theta_\alpha(z_\alpha, z_{\underline\alpha})'\beta + g_\alpha(z_\alpha) + g_{\underline\alpha}(z_{\underline\alpha}). \qquad (7.62)$$

Integrating both sides of (7.62) over $z_{\underline\alpha}$ leads to

$$m_\alpha(z_\alpha) = \beta_0 + \theta_\alpha(z_\alpha)'\beta + g_\alpha(z_\alpha), \qquad (7.63)$$

where we have used the identification condition that $E[g_{\underline\alpha}(Z_{i\underline\alpha})] = 0$. Replacing $z_\alpha$ in (7.63) by $Z_{i\alpha}$ and then summing both sides of (7.63) over $\alpha$ gives

$$\sum_{\alpha=1}^{d} m_{i\alpha} = d\,\beta_0 + \sum_{\alpha=1}^{d}\theta_{i\alpha}'\beta + \sum_{\alpha=1}^{d} g_\alpha(Z_{i\alpha}). \qquad (7.64)$$

Subtracting (7.64) from (7.60), we can eliminate $\sum_{\alpha=1}^{d} g_\alpha(Z_{i\alpha})$ and get

$$Y_i - \sum_{\alpha=1}^{d} m_{i\alpha} = (1 - d)\,\beta_0 + \Big(X_i - \sum_{\alpha=1}^{d}\theta_{i\alpha}\Big)'\beta + u_i. \qquad (7.65)$$

Let $\mathcal Y_i = Y_i - \sum_{\alpha=1}^{d} m_{i\alpha}$ and $\mathcal X_i = \big(1, (X_i - \sum_{\alpha=1}^{d}\theta_{i\alpha})'\big)$. Then in vector notation we can write (7.65) as

$$\mathcal Y = \mathcal X\gamma + u, \qquad (7.66)$$

where $\mathcal Y$ and $\mathcal X$ are $n \times 1$ and $n \times (p+1)$ matrices with $i$th rows given by $\mathcal Y_i$ and $\mathcal X_i$, respectively, $u = (u_1, \ldots, u_n)'$, and $\gamma = (\gamma_0, \beta')'$ with $\gamma_0 = (1-d)\beta_0$. We can apply OLS regression to (7.66) to obtain

$$\bar\gamma = \begin{pmatrix}\bar\gamma_0 \\ \bar\beta\end{pmatrix} = (\mathcal X'\mathcal X)^{-1}\mathcal X'\mathcal Y = \gamma + (\mathcal X'\mathcal X)^{-1}\mathcal X'u. \qquad (7.67)$$

Under standard conditions, we can show that $\bar\gamma$ converges to $\gamma$ at the parametric $\sqrt{n}$-rate.

Nevertheless, $\bar\gamma$ is an infeasible estimator because it depends on the unknown quantities $\sum_{\alpha=1}^{d} m_{i\alpha}$ and $\sum_{\alpha=1}^{d}\theta_{i\alpha}$. To obtain a feasible estimator of $\gamma$, we need to replace these unknown quantities by consistent estimates. A consistent estimator of $m_{i\alpha} = m_\alpha(Z_{i\alpha})$ is given by

$$\hat m_{i\alpha} = \frac{1}{n}\sum_{j=1}^{n}\left\{\frac{\sum_{l=1}^{n} K_h(Z_{i\alpha} - Z_{l\alpha})\,\bar L_g(Z_{j\underline\alpha} - Z_{l\underline\alpha})\, Y_l}{\sum_{l=1}^{n} K_h(Z_{i\alpha} - Z_{l\alpha})\,\bar L_g(Z_{j\underline\alpha} - Z_{l\underline\alpha})}\right\}$$
$$= \frac{1}{n}\sum_{j=1}^{n}\sum_{l=1}^{n}\frac{K_h(Z_{i\alpha} - Z_{l\alpha})\,\bar L_g(Z_{j\underline\alpha} - Z_{l\underline\alpha})}{\sum_{s=1}^{n} K_h(Z_{i\alpha} - Z_{s\alpha})\,\bar L_g(Z_{j\underline\alpha} - Z_{s\underline\alpha})}\, Y_l \equiv \sum_{l=1}^{n} w_{\alpha,il}\, Y_l, \qquad (7.68)$$

where the definition of $w_{\alpha,il}$ is clear, $K_h(\cdot) = h^{-1}K(\cdot/h)$, $\bar L_g(\cdot) = g^{-(d-1)}\bar L(\cdot/g)$, $K$ and $\bar L$ are kernel functions, and $h$ and $g$ are smoothing parameters. Fan and Li (2003) use the leave-one-out method to obtain $\hat m_{i\alpha}$, which only simplifies the proofs but does not change the asymptotic results. Note that $\sum_{l=1}^{n} w_{\alpha,il} = 1$, so $\hat m_{i\alpha}$ is a weighted average of $\{Y_l\}$. Similarly, a consistent estimator of $\theta_{i\alpha} = \theta_\alpha(Z_{i\alpha})$ is given by

$$\hat\theta_{i\alpha} = \sum_{l=1}^{n} w_{\alpha,il}\, X_l. \qquad (7.69)$$

Let $\hat{\mathcal Y}_i = Y_i - \sum_{\alpha=1}^{d}\hat m_{i\alpha}$ and $\hat{\mathcal X}_i = \big(1, (X_i - \sum_{\alpha=1}^{d}\hat\theta_{i\alpha})'\big)$. We can obtain a feasible estimator of $\gamma$ by replacing $\mathcal Y$ and $\mathcal X$ in (7.67) by $\hat{\mathcal Y}$ and $\hat{\mathcal X}$, respectively. Nevertheless, note that near the boundary the density $f(z) = f(z_1, \ldots, z_d)$ of $Z_i$ cannot be estimated well, so Fan and Li (2003) consider trimming out observations near the boundary. Assume that $Z_{i\alpha} \in [c_\alpha, d_\alpha]$, where $c_\alpha < d_\alpha$ are finite constants, $\alpha = 1, \ldots, d$. Define the trimming set $\mathcal S = \prod_{\alpha=1}^{d}[c_\alpha + b_\alpha, d_\alpha - b_\alpha]$, where $b_\alpha = c\,\lambda^{\epsilon}$ for some $c > 0$ and $0 < \epsilon < 1$, with $\lambda = \max\{h, g\}$. Let $\mathbb 1_i = \mathbf 1(Z_i \in \mathcal S)$. Fan and Li (2003) estimate $\gamma$ by

$$\hat\gamma = \begin{pmatrix}\hat\gamma_0 \\ \hat\beta\end{pmatrix} = \big(\hat{\mathcal X}'\hat{\mathcal X}\big)^{-1}\hat{\mathcal X}'\hat{\mathcal Y} = \Big(\sum_{i=1}^{n}\hat{\mathcal X}_i'\hat{\mathcal X}_i\,\mathbb 1_i\Big)^{-1}\sum_{i=1}^{n}\hat{\mathcal X}_i'\hat{\mathcal Y}_i\,\mathbb 1_i, \qquad (7.70)$$

where $\hat{\mathcal Y}$ and $\hat{\mathcal X}$ are $n \times 1$ and $n \times (p+1)$ matrices with $i$th rows given by $\hat{\mathcal Y}_i\mathbb 1_i$ and $\hat{\mathcal X}_i\mathbb 1_i$, respectively. To state the main result, we make the following assumptions.

Assumptions.

A1. $\{(X_i, Z_i, Y_i)\}_{i=1}^{n}$ are IID; $(X_i, Z_i)$ has bounded support, with the support of $Z_i$ being the product set $\prod_{\alpha=1}^{d}[c_\alpha, d_\alpha]$; the density function of $Z_i$ is bounded from below by a positive constant on its support; and $E(u_i^4) < \infty$.

A2. $m_\alpha(\cdot)$, $\theta_\alpha(\cdot)$, $g_\alpha(\cdot)$, and $f(z_1, \ldots, z_d)$ all belong to $\mathcal G_4^{\nu}$, where $\nu \ge 2$ is an integer. Let $X_i = (X_i^{c}, X_i^{d})$, where $X_i^{c}$ and $X_i^{d}$ denote the continuous and discrete components of $X_i$, respectively; then for all values $x^{d}$ of $X_i^{d}$, $\sigma^2(x^{c}, x^{d}, z_1, \ldots, z_d) = E(u_i^2 \mid X_i^{c} = x^{c}, X_i^{d} = x^{d}, Z_{i1} = z_1, \ldots, Z_{id} = z_d) \in \mathcal G_4^{1}$. The class $\mathcal G_4^{\nu}$ is defined in Section 4.1.1.

A3. $\Phi = E\big[(X_i - \sum_{\alpha=1}^{d}\theta_{i\alpha})(X_i - \sum_{\alpha=1}^{d}\theta_{i\alpha})'\big]$ is positive definite.

A4. The kernel functions $K$ and $\bar L$ are bounded, symmetric, and both of order $\nu$.

A5. As $n \to \infty$, $nh^{2} \to \infty$, $n^{3/2}hg^{d-1} \to \infty$, and $\sqrt{n}\,(h^{2\nu} + g^{2\nu}) \to 0$.

Note that when $h = g$, Assumption A5 allows the use of a second-order kernel if $d \le 5$. It also implies that the data need to be undersmoothed. The following theorem gives the asymptotic property of $\hat\gamma$.

Theorem 7.8 Under Assumptions A1-A5, we have

$$\sqrt{n}\,\big(\hat\gamma - \gamma\big) \stackrel{d}{\to} N\big(0,\ \Phi^{-1}\Psi\Phi^{-1}\big),$$

where $\Psi = E\big[\epsilon_i^2\,\eta_i\eta_i'\big]$, in which $\eta_i = \mathcal X_i' - E(\mathcal X_i' \mid Z_i)$ and $\epsilon_i$ is a composite error that equals $u_i$ plus a correction term involving $1 - \sum_{\alpha=1}^{d} p_{i\alpha}$, with $p_{i\alpha} = f(Z_i)\big/\big[f_\alpha(Z_{i\alpha})\, f_{\underline\alpha}(Z_{i\underline\alpha})\big]$; see Fan and Li (2003) for the exact expressions. Here $f(\cdot)$, $f_\alpha(\cdot)$, and $f_{\underline\alpha}(\cdot)$ are the density functions of $Z_i$, $Z_{i\alpha}$, and $Z_{i\underline\alpha}$, respectively.

Given the $\sqrt{n}$-consistent estimator $\hat\beta$, we can rewrite (7.60) as

$$Y_i - X_i'\hat\beta = \beta_0 + \sum_{\alpha=1}^{d} g_\alpha(Z_{i\alpha}) + u_i + X_i'(\beta - \hat\beta) \qquad (7.71)$$
$$= \beta_0 + \sum_{\alpha=1}^{d} g_\alpha(Z_{i\alpha}) + \text{error term}.$$

The intercept term $\beta_0$ can be $\sqrt{n}$-consistently estimated by

$$\hat\beta_0 = \bar Y - \bar X'\hat\beta,$$

where $\bar Y = n^{-1}\sum_{i=1}^{n} Y_i$ and $\bar X = n^{-1}\sum_{i=1}^{n} X_i$. Note that (7.71) is essentially an additive regression model with $Y_i - X_i'\hat\beta$ as the new dependent variable and $u_i + X_i'(\beta - \hat\beta)$ as the new error term. Since $\hat\beta - \beta = O_p(n^{-1/2})$, a rate faster than any nonparametric convergence rate, the asymptotic distribution of any nonparametric estimator of $g_\alpha(\cdot)$ based on (7.71) remains the same as if $\hat\beta$ were replaced by $\beta$.

To make statistical inference on $\gamma$, we need to estimate $\Phi$ and $\Psi$ consistently. Let $\hat f(\cdot)$, $\hat f_\alpha(\cdot)$, and $\hat f_{\underline\alpha}(\cdot)$ denote the kernel estimators of $f(\cdot)$, $f_\alpha(\cdot)$, and $f_{\underline\alpha}(\cdot)$, respectively. That is,

$$\hat f_\alpha(z_\alpha) = n^{-1}\sum_{i=1}^{n} K_h(z_\alpha - Z_{i\alpha}), \qquad \hat f_{\underline\alpha}(z_{\underline\alpha}) = n^{-1}\sum_{i=1}^{n}\bar L_g(z_{\underline\alpha} - Z_{i\underline\alpha}),$$
$$\hat f(z) = n^{-1}\sum_{i=1}^{n} K_h(z_\alpha - Z_{i\alpha})\,\bar L_g(z_{\underline\alpha} - Z_{i\underline\alpha}).$$

Let $\hat E(X_i \mid Z_i)$ denote a kernel estimator of $E(X_i \mid Z_i)$. Define

$$\hat p_{i\alpha} = \frac{\hat f(Z_i)}{\hat f_\alpha(Z_{i\alpha})\,\hat f_{\underline\alpha}(Z_{i\underline\alpha})}, \qquad \hat\eta_i = \hat{\mathcal X}_i' - \hat E(\hat{\mathcal X}_i' \mid Z_i), \qquad \hat u_i = Y_i - \hat\beta_0 - X_i'\hat\beta - \sum_{\alpha=1}^{d}\hat g_\alpha(Z_{i\alpha}),$$

and let $\hat\epsilon_i$ be the sample analogue of $\epsilon_i$ obtained by replacing the unknown quantities by these estimates (in particular, replacing $1 - \sum_{\alpha=1}^{d} p_{i\alpha}$ by $1 - \sum_{\alpha=1}^{d}\hat p_{i\alpha}$). Then we can estimate $\Phi$ and $\Psi$ consistently by

$$\hat\Phi = \frac{1}{n}\sum_{i=1}^{n}\Big(X_i - \sum_{\alpha=1}^{d}\hat\theta_{i\alpha}\Big)\Big(X_i - \sum_{\alpha=1}^{d}\hat\theta_{i\alpha}\Big)' \quad\text{and}\quad \hat\Psi = \frac{1}{n}\sum_{i=1}^{n}\hat\epsilon_i^{2}\,\hat\eta_i\hat\eta_i'.$$

7.4 Specification Test for Additive Models

b2 b b07.4 Specification Test for Additive Models

7.4.1 Test for Additive Partially Linear Models via Series Method

Li, Hsiao, and Zinn’s (2003, JoE) consider consistent specification tests for semiparametric/nonparametric

models where the null models all contain some nonparametric components based on series estimation

methods. A leading case is to test for an additive partially linear model. The null hypothesis is

0 : (| ) = 0 +

X=1

() a.s. for some ∈ BX=1

(·) ∈ G, (7.72)

where is a × 1 vector of regressors, is of dimension ≥ 1 is an × 1 unknown parameter, is a × 1 non-overlapping variables of = 1 B is a compact set of R and G is the class ofadditive functions defined in the last section. In particular, the identification condition is (0) = 0 for

= 2 The alternative hypothesis is

1 : (| ) 6= 0 +

X=1

() (7.73)

on a set with positive measure for any ∈ B and any P=1 (·) ∈ G.

Let $\varepsilon_i = Y_i - X_i'\beta - \sum_{\alpha=1}^{d} g_\alpha(Z_{i\alpha})$ and $W_i = (X_i', Z_i')'$. The null hypothesis $H_0$ is equivalent to

$$H_0:\; E(\varepsilon_i \mid W_i) = 0 \;\text{ a.s.} \qquad (7.74)$$

Noting that $E(\varepsilon_i \mid W_i) = 0$ a.s. if and only if $E[\varepsilon_i\, a(W_i)] = 0$ for all $a(\cdot) \in \mathcal A$, the class of bounded $\sigma(W_i)$-measurable functions, Li, Hsiao, and Zinn (2003) follow Bierens and Ploberger (1997), Stute (1997), and Stinchcombe and White (1998) and consider the unconditional moment test

$$E[\varepsilon_i\, H(W_i, w)] = 0 \;\text{ for almost all } w \in \mathcal W \subset \mathbb R^{p+q}, \qquad (7.75)$$

where $H(\cdot, \cdot)$ is a proper choice of weight function so that (7.75) is equivalent to (7.74). Stinchcombe and White (1998) show that there exists a wide class of weight functions $H(\cdot, \cdot)$ that makes (7.75) equivalent to (7.74). Choices of weight functions include the exponential function $H(W_i, w) = \exp(W_i'w)$, the logistic function $H(W_i, w) = 1/(1 + \exp(c - W_i'w))$ with $c \ne 0$, the trigonometric function $H(W_i, w) = \cos(W_i'w) + \sin(W_i'w)$, and the usual indicator function $H(W_i, w) = \mathbf 1(W_i \le w)$. See Stinchcombe and White (1998) and Bierens and Ploberger (1997) for more discussion. The advantage of switching from the conditional moment condition (7.74) to the unconditional moment condition (7.75) is that it avoids estimating the alternative model nonparametrically, as in Chen and Fan (1999) and Delgado and González Manteiga (2001).

If $\varepsilon_i$ were observable, one could construct a test based upon the sample analogue of $E[\varepsilon_i H(W_i, w)] = 0$:

$$J_{0n}(w) = \frac{1}{\sqrt n}\sum_{i=1}^{n}\varepsilon_i\, H(W_i, w). \qquad (7.76)$$

$J_{0n}(\cdot)$ can be viewed as a random element taking values in the separable space $\mathcal L^2(\mathcal W, \mu)$ of all real, Borel measurable functions on $\mathcal W$ such that $\int_{\mathcal W} a(w)^2\, d\mu(w) < \infty$ (see footnote 4), which is endowed with the $L_2$ norm

$$\|a\| = \Big\{\int_{\mathcal W} a(w)^2\, d\mu(w)\Big\}^{1/2}.$$

Chen and White (1997) show that, for a sequence of IID $\mathcal L^2(\mathcal W, \mu)$-valued elements $\{V_i(\cdot)\}_{i=1}^{n}$, $n^{-1/2}\sum_{i=1}^{n} V_i(\cdot)$ converges weakly to $Z(\cdot)$ in the topology of $(\mathcal L^2(\mathcal W, \mu), \|\cdot\|)$ if and only if $\int_{\mathcal W} E[V_i(w)^2]\, d\mu(w) < \infty$, where $Z$ is a Gaussian element with the same covariance function $\Omega(w, w') = E[V_i(w)V_i(w')]$.

Let $\sigma^2(W_i) = E(\varepsilon_i^2 \mid W_i)$. It is easy to check that for $V_i(\cdot) = \varepsilon_i H(W_i, \cdot)$ we have

$$E\big[\|V_i(\cdot)\|^2\big] = E\Big\{\int_{\mathcal W}\varepsilon_i^2\, H(W_i, w)^2\, d\mu(w)\Big\} = E\Big\{\sigma^2(W_i)\int_{\mathcal W} H(W_i, w)^2\, d\mu(w)\Big\} \le E\big[\sigma^2(W_i)\big]\, C^2\int_{\mathcal W} d\mu(w) < \infty,$$

provided the weight function $H(\cdot, \cdot)$ is bounded above by $C$ on $\mathcal W \times \mathcal W$. This implies that

$$J_{0n}(\cdot) \text{ converges weakly to } J_{0\infty}(\cdot) \text{ in } \mathcal L^2(\mathcal W, \|\cdot\|),$$

where $J_{0\infty}(\cdot)$ is a Gaussian process centered at zero with covariance function

$$\Omega(w, w') = E[V_i(w)V_i(w')] = E\big[\sigma^2(W_i)\, H(W_i, w)\, H(W_i, w')\big]. \qquad (7.77)$$

Since $\varepsilon_i$ is unobservable, we replace it by a consistent estimate $\hat\varepsilon_i$ and construct a feasible test statistic as

$$\hat J_n(w) = \frac{1}{\sqrt n}\sum_{i=1}^{n}\hat\varepsilon_i\, H(W_i, w). \qquad (7.78)$$

There are several ways to obtain $\hat\varepsilon_i$. One is to apply the series method of Li (2000) to obtain consistent estimates of the parameters in the additive partially linear model first. Another is to apply the kernel method of Fan and Li (2003). In either case, let $\hat\beta$ be the consistent estimator of $\beta$ and $\hat g_\alpha(\cdot)$ be the consistent estimator of $g_\alpha(\cdot)$, $\alpha = 1, \ldots, d$. Then we estimate $\varepsilon_i$ consistently by

$$\hat\varepsilon_i = Y_i - X_i'\hat\beta - \sum_{\alpha=1}^{d}\hat g_\alpha(Z_{i\alpha}).$$

4. A separable space is a topological space that has a countable dense subset. An example is the Euclidean space $\mathbb R^{k}$.

Now, one can construct a Cramér-von Mises statistic for testing $H_0$:

$$CM_n = \int\hat J_n(w)^2\, dF_n(w) = \frac{1}{n}\sum_{i=1}^{n}\hat J_n(W_i)^2,$$

where $F_n(\cdot)$ is the empirical distribution of $\{W_i\}_{i=1}^{n}$. Alternatively, one can construct the Kolmogorov-Smirnov statistic:

$$KS_n = \sup_{w \in \mathcal W}\big|\hat J_n(w)\big| \quad\text{or}\quad KS_n = \max_{1 \le i \le n}\big|\hat J_n(W_i)\big|.$$

The following theorem states the asymptotic distributions of the test statistics.

Theorem 7.9 Under some conditions and under $H_0$,

(i) $\hat J_n(\cdot)$ converges weakly to $J_\infty(\cdot)$ in $\mathcal L^2(\mathcal W, \|\cdot\|)$, where $J_\infty(\cdot)$ is a Gaussian process with zero mean and covariance function

$$\Omega_1(w, w') = E\big[\sigma^2(W_i)\,\nu_i(w)\nu_i(w')\big],$$

with $\nu_i(w) = H(W_i, w) - G_w(Z_i) - B(w)'\eta_i$, where $G_w(Z_i) = E_{\mathcal G}[H(W_i, w)]$, $B(w) = E[H(W_i, w)\,\eta_i']\,[E(\eta_i\eta_i')]^{-1}$, $\eta_i = X_i - E_{\mathcal G}[X_i]$, and $E_{\mathcal G}[\cdot]$ denotes the projection onto the space $\mathcal G$ of additive functions.

(ii) $CM_n \stackrel{d}{\to} \int[J_\infty(w)]^2\, dF(w)$ and $KS_n \stackrel{d}{\to} \sup_{w \in \mathcal W}|J_\infty(w)|$, where $F(\cdot)$ is the CDF of $W_i$.

Li, Hsiao, and Zinn (2003) consider a special case of the additive partially linear model where $X_i = x(Z_i)$ is a $p \times 1$ vector of known functions of $Z_i$. In their case, $\sigma^2(W_i) = E(\varepsilon_i^2 \mid W_i) = E(\varepsilon_i^2 \mid Z_i) = \sigma^2(Z_i)$ and

$$\Omega_1(w, w') = E\big[\sigma^2(Z_i)\,\nu_i(w)\nu_i(w')\big],$$

where $\nu_i(w) = H(W_i, w) - G_w(Z_i) - B(w)'\eta_i$ with $G_w(Z_i) = E_{\mathcal G}[H(W_i, w)]$, $B(w) = E[H(W_i, w)\,\eta_i']\,[E(\eta_i\eta_i')]^{-1}$, and $\eta_i = x(Z_i) - E_{\mathcal G}[x(Z_i)]$.

The asymptotic null distribution is not pivotal. Li, Hsiao, and Zinn (2003) suggest using a residual-based wild bootstrap method to approximate the critical values of the null limiting distributions of $CM_n$ and $KS_n$. The procedure is standard and we only discuss it briefly here, based on the method of sieves. Let $\varepsilon_i^{*}$ denote the wild bootstrap error, generated via the two-point distribution: $\varepsilon_i^{*} = [(1 - \sqrt 5)/2]\,\hat\varepsilon_i$ with probability $(1 + \sqrt 5)/(2\sqrt 5)$, and $\varepsilon_i^{*} = [(1 + \sqrt 5)/2]\,\hat\varepsilon_i$ with probability $(\sqrt 5 - 1)/(2\sqrt 5)$. We generate $Y_i^{*}$ according to the null model: $Y_i^{*} = X_i'\hat\beta + p^{L}(Z_i)'\hat\gamma + \varepsilon_i^{*}$.

Based on the wild bootstrap sample $\{(X_i, Z_i, Y_i^{*})\}_{i=1}^{n}$, we re-estimate the model under the null to obtain estimates $\hat\beta^{*}$ and $\hat\gamma^{*}$. Let $\hat\varepsilon_i^{*} = Y_i^{*} - X_i'\hat\beta^{*} - p^{L}(Z_i)'\hat\gamma^{*}$ be the bootstrap residuals. Then the bootstrap test statistic is given by

$$\hat J_n^{*}(w) = \frac{1}{\sqrt n}\sum_{i=1}^{n}\hat\varepsilon_i^{*}\, H(W_i, w).$$

Using $\hat J_n^{*}(w)$, we can compute a bootstrap version of the $CM_n$ statistic, i.e., $CM_n^{*} = \frac{1}{n}\sum_{i=1}^{n}\big[\hat J_n^{*}(W_i)\big]^2$. The bootstrap version of the $KS_n$ statistic is analogously defined.
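A minimal Python sketch of the $CM_n$ statistic and the residual-based wild bootstrap follows (not from the original notes; the exponential weight function, the user-supplied `refit` routine for the null model, the number of bootstrap replications, and the simulated data are illustrative assumptions, and $W$ is assumed to be standardized or bounded so that $\exp(W_i'W_j)$ does not overflow).

```python
import numpy as np

def cvm_stat(resid, W):
    """Cramer-von Mises statistic CM_n = n^{-1} sum_j J_n(W_j)^2 with H(W_i, w) = exp(W_i'w)."""
    n = len(resid)
    Hmat = np.exp(W @ W.T)                                  # H(W_i, W_j)
    J = (resid @ Hmat) / np.sqrt(n)                         # J_n(W_j) = n^{-1/2} sum_i resid_i H(W_i, W_j)
    return np.mean(J ** 2)

def wild_bootstrap_pvalue(y, fitted, refit, W, B=199, rng=None):
    """Wild bootstrap p-value for CM_n; `refit(y_star)` returns fitted values of the null model."""
    rng = np.random.default_rng(rng)
    resid = y - fitted
    stat = cvm_stat(resid, W)
    a, b = (1 - np.sqrt(5)) / 2, (1 + np.sqrt(5)) / 2       # two-point multipliers
    p_a = (1 + np.sqrt(5)) / (2 * np.sqrt(5))
    count = 0
    for _ in range(B):
        mult = np.where(rng.uniform(size=len(y)) < p_a, a, b)
        y_star = fitted + mult * resid                      # data generated under the null
        fitted_star = refit(y_star)                         # re-estimate the null model
        if cvm_stat(y_star - fitted_star, W) >= stat:
            count += 1
    return (1 + count) / (1 + B)

# illustrative use with a simple linear null model fitted by least squares
rng = np.random.default_rng(8)
n = 200
W = rng.uniform(-1, 1, (n, 2))
y = W[:, 0] + 0.5 * W[:, 1] ** 2 + 0.2 * rng.standard_normal(n)
Xd = np.column_stack([np.ones(n), W])
refit = lambda ys: Xd @ np.linalg.lstsq(Xd, ys, rcond=None)[0]
print("bootstrap p-value:", wild_bootstrap_pvalue(y, refit(y), refit, W, B=99))
```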


7.5 Generalized Additive Partially Linear Models

In this section, we introduce Gozalo and Linton's (2001) estimation and testing of generalized additive partially linear models.

Let $(X, Z, Y)$ be a random vector with $X$ of dimension $p$, $Z$ of dimension $q$, and $Y$ a scalar. Let $m(x, z) = E(Y \mid X = x, Z = z)$, and suppose $m(x, z)$ has a generalized additive partially linear structure:

$$G_{\theta_0}\big(m(x, z)\big) = c_0 + z'\beta_0 + \sum_{\alpha=1}^{p} m_\alpha(x_\alpha), \qquad (7.79)$$

where $\{G_\theta(\cdot),\ \theta \in \Theta \subset \mathbb R^{k}\}$ is a parametric family of transformations (a link function that is known up to the finite-dimensional parameter $\theta$), $\beta_0$ is a $q \times 1$ vector of unknown parameters, $X = (X_1, \ldots, X_p)$ are the $p$-dimensional random variables, and $m_\alpha(\cdot)$, $\alpha = 1, \ldots, p$, are one-dimensional unknown smooth functions. For identification purposes, it is convenient to assume that $E[m_\alpha(X_\alpha)] = 0$. For future use, let $\varphi_0 = (\theta_0, c_0, \beta_0')'$.

There are many cases where the specification in (7.79) can arise. It allows us to nest the well-known standard logit and probit models in more flexible structures. It also nests the Box-Cox type of transformation as a special case, where $G_\theta(y) = (y^{\theta} - 1)/\theta$, and includes cases where the regression function is multiplicative.

Example 7.5 (Misclassification of a binary dependent variable) The specification in (7.79) can arise from misclassification of a binary dependent variable, as in Copas (1988) and Hausman, Abrevaya, and Scott-Morton (1998). Suppose that

$$P(Y^{*} = 1 \mid X = x, Z = z) = m^{*}(x, z) = F^{-1}\Big(c + z'\beta + \sum_{\alpha=1}^{p} m_\alpha(x_\alpha)\Big) \qquad (7.80)$$

for some known link function $F$, but that when $Y^{*} = 1$ we erroneously observe $Y = 0$ with probability $\pi_1$, and when $Y^{*} = 0$ we erroneously observe $Y = 1$ with probability $\pi_2$. Then

$$P(Y = 1 \mid X = x, Z = z) = P(Y^{*} = 1 \mid X = x, Z = z)(1 - \pi_1) + P(Y^{*} = 0 \mid X = x, Z = z)\,\pi_2$$
$$= \pi_2 + (1 - \pi_1 - \pi_2)\, F^{-1}\Big(c + z'\beta + \sum_{\alpha=1}^{p} m_\alpha(x_\alpha)\Big),$$

which is of the form (7.79) with $G^{-1}(\cdot) = \pi_2 + (1 - \pi_1 - \pi_2)\, F^{-1}(\cdot)$.

7.5.1 Estimation

To simplify the presentation, we shall follow Gozalo and Linton (2001) and assume that $X$ is continuous while $Z$ is discretely valued. Specifically, $Z$ takes finitely many values. Let $W = (X, Z)$ and $W_i = (X_i, Z_i)$; the values $W$ takes will be written as $w = (x, z)$. As before, for any $\alpha = 1, \ldots, p$ we partition $x = (x_\alpha, x_{\underline\alpha})$ and $X = (X_\alpha, X_{\underline\alpha})$. Let $f_{X|z}(\cdot)$, $f_{X_\alpha|z}(\cdot)$, and $f_{X_{\underline\alpha}|z}(\cdot)$ denote the conditional density functions of $X$, $X_\alpha$, and $X_{\underline\alpha}$ given $Z = z$, respectively. Define

$$\psi_\alpha(x, z) = \frac{f_{X_\alpha|z}(x_\alpha \mid z)\, f_{X_{\underline\alpha}|z}(x_{\underline\alpha} \mid z)}{f_{X|z}(x \mid z)}, \qquad (7.81)$$

which is a measure of the dependence between $X_\alpha$ and $X_{\underline\alpha}$ in the conditional distribution. Obviously, $\psi_\alpha(x, z)$ lies between zero and infinity, and when $X_\alpha$ and $X_{\underline\alpha}$ are independent given $Z = z$, it equals one.

Let $\mathcal X$ be the support of $X$ and $\mathcal X_0$ be a rectangular strict subset of $\mathcal X$, i.e., $\mathcal X_0 = \prod_{\alpha=1}^{p}\mathcal X_{0\alpha}$, where $\mathcal X_{0\alpha}$, $\alpha = 1, \ldots, p$, are intervals. Let $\mathcal X_{0\underline\alpha} = \prod_{j \ne \alpha}\mathcal X_{0j}$. For any $w = (x, z)$, let

$$m_\alpha^{0}(w; \varphi) = \int G_\theta\big(m(x_\alpha, x_{\underline\alpha}, z)\big)\, f_{X_{\underline\alpha}|z}(x_{\underline\alpha} \mid z)\, dx_{\underline\alpha} - c - z'\beta, \qquad (7.82)$$

$$m^{0}(w; \varphi) = F_\theta\Big(c + z'\beta + \sum_{\alpha=1}^{p} m_\alpha^{0}(w; \varphi)\Big), \qquad (7.83)$$

where $F_\theta = G_\theta^{-1}$. When the additive structure in (7.79) is true, $m_\alpha^{0}(w; \varphi_0) = m_\alpha(x_\alpha)$ and $m^{0}(w; \varphi_0) = m(x, z)$ for all $w$. Equations (7.82) and (7.83) form the basis of the so-called marginal integration method for estimating additive nonparametric models.

As we shall see, the estimation of model (7.79) is similar in spirit to profile least squares. That is, one first assumes that $\varphi = (\theta, c, \beta')'$ is known and estimates the additive index using the integration method. Then one estimates $\varphi$ by the generalized method of moments.

We can estimate () in two ways. One is consistent when (7.79) holds, and the other is consistent

more generally. First, we estimate () by the NW method

b () = P=1 (−) 1 ( = )P=1 (−) 1 ( = )

(7.84)

where (−) = Π=1 ( −) (·) = −1 (·) (·) is a one-dimensional kernel of order

and = () is a bandwidth sequence. The property of b () is standard. At any interior point ,√

µb ()− ()−

!1 ()

¶→ (0 02 ()) (7.85)

where =R () and () = 2 () (|) where 2 () =Var( | = ) is the conditional

variance function. The bias function () is the probability limit of () where

() =1

(|)X=1

(−) 1 ( = ) ( )− ( )

When (·) satisfies the generalized additive model structure (7.79), we can estimate () with a betterrate of convergence by imposing the additive restrictions. The empirical versions of 0

(; ) and

0 (; ) are given by

e (; ) =1

X=1

0

¡ b ¡ ¢¢1¡ ∈ X0

¢− 0 − (7.86)

e ( ) =

Ã+ 0 +

X=1

e (; )

! (7.87)

where a bandwidth 0 = 0 () is used throughout this estimation.

The asymptotic properties of these estimators can be studied using the similar methods to those of

Linton and Härdle (1996) who derive the pointwise asymptotic properties of e (; ) and e ( )in the absence of discrete variables. Specifically,p

0

µe ( )− ( )− 0!10 (;)

¶→ (0 020 (;)) (7.88)

163

where =R () 0 (;) = 0 ( ( ()))

P=1 0 (;) 0 (;) = 0 ( ( ()))

2 P=1

0 (;) with 0 (;) =R0 ( ()) (;)

¡|

¢ and 0 (;) =

R0 ( ())

2 (;)

¡|

¢2

Estimation of $\varphi$

Gozalo and Linton (2001) suggest estimating the parameters in $\varphi$ by the generalized method of moments. That is, they choose $\varphi \in \Psi$ to minimize the criterion function

$$Q_n(\varphi) = \Big\|\frac{1}{n}\sum_{i=1}^{n}\rho\big(Y_i, \tilde{\mathbf m}(W_i; \varphi), \varphi\big)\Big\|_{A_n}^{2}, \qquad (7.89)$$

where $\Psi$ is a compact parameter space in $\mathbb R^{k+q+1}$, $\rho(\cdot)$ is a given moment function, $A_n$ is a symmetric and positive definite weighting matrix, $\tilde{\mathbf m}(W_i; \varphi) = \big(\tilde m(W_i; \varphi), \tilde m_1(W_i; \varphi), \ldots, \tilde m_p(W_i; \varphi)\big)$, and $\|v\|_A^{2} = v'Av$ for any positive definite matrix $A$ and conformable vector $v$. To study the asymptotic properties of the estimator $\hat\varphi$ obtained from minimizing the above criterion function, we suppose that the $J \times 1$ vector ($J \ge k+q+1$) of moment functions satisfies

$$E\big[\rho\big(Y, \mathbf m^{0}(W; \varphi), \varphi\big)\big] = 0, \qquad (7.90)$$

where $\mathbf m^{0}(W; \varphi) = \big(m^{0}(W; \varphi), m_1^{0}(W; \varphi), \ldots, m_p^{0}(W; \varphi)\big)$, if and only if $\varphi = \varphi_0$. For example, from a Gaussian likelihood for homoskedastic regression we obtain

$$\rho\big(Y, \mathbf m^{0}(W; \varphi), \varphi\big) = \big(Y - m^{0}(W; \varphi)\big)\,\frac{\partial m^{0}(W; \varphi)}{\partial\varphi}, \qquad (7.91)$$

while from a binary choice likelihood we obtain

$$\rho\big(Y, \mathbf m^{0}(W; \varphi), \varphi\big) = \frac{Y - m^{0}(W; \varphi)}{m^{0}(W; \varphi)\big(1 - m^{0}(W; \varphi)\big)}\,\frac{\partial m^{0}(W; \varphi)}{\partial\varphi}. \qquad (7.92)$$

Given the estimate $\hat\varphi = \arg\min_{\varphi} Q_n(\varphi)$, we can finally estimate $m(w)$ by $\tilde m(w; \hat\varphi)$.

Let $\rho_i(\mathbf m, \varphi) = \rho\big(Y_i, \mathbf m(W_i; \varphi), \varphi\big)$ and $\rho_i(\varphi) = \rho_i\big(\mathbf m^{0}(\cdot;\varphi), \varphi\big)$. Let $\rho(\varphi) = E[\rho_i(\varphi)]$ and $\Gamma_0 = \partial\rho(\varphi_0)/\partial\varphi'$. Define

$$\mathcal R_i\big(\mathbf m(W_i; \varphi), \varphi\big) = \frac{\partial\rho_i(\mathbf m, \varphi)}{\partial\mathbf m}\bigg|_{\mathbf m = \mathbf m(W_i; \varphi)} \quad\text{and}\quad \mathcal R_i(\varphi) = \mathcal R_i\big(\mathbf m^{0}(W_i; \varphi), \varphi\big).$$

Under weak conditions, we have

$$\frac{1}{\sqrt n}\sum_{i=1}^{n}\rho_i(\varphi_0) + \frac{1}{\sqrt n}\sum_{i=1}^{n}\mathcal R_i(\varphi_0)\,\big(\tilde{\mathbf m}(W_i; \varphi_0) - \mathbf m^{0}(W_i; \varphi_0)\big) \stackrel{d}{\to} N(0, V)$$

for some finite positive definite matrix $V$. The following theorem states the asymptotic property of the GMM estimator $\hat\varphi$.

GMM estimator bTheorem 7.10 Under certain conditions,

√³b − 0

´→

µ0³000

´−1 ³00

00´³

000

´−1¶The above theorem implies that the optimal choice of the weighting matrix is given by = b −1

where b → Now the asymptotic variance becomes³000

´−1

164

7.5.2 Specification Test

We now test the validity of the additive specification (7.79) of the regression function $m(w)$ over a subset of interest $\mathcal J_0 \subset \mathbb R^{p+q}$ of the support of $W$. The null hypothesis is

$$\mathcal H_0:\; m(w) = m^{0}(w; \varphi_0) \;\text{ for some } \varphi_0 \in \Psi \text{ and all } w \in \mathcal J_0. \qquad (7.93)$$

The alternative hypothesis $\mathcal H_1$ is the negation of $\mathcal H_0$. Gozalo and Linton (2001) consider replacing $\varphi_0$ by the estimator $\hat\varphi$, which is $\sqrt{n}$-consistent under the null hypothesis. Let $\Lambda$ be a family of monotonic transformations. They consider the following test statistics:

$$\hat T_{0n} = \frac{1}{n}\sum_{i=1}^{n}\Big[\Lambda\big(\hat m(W_i)\big) - \Lambda\big(\tilde m(W_i; \hat\varphi)\big)\Big]^{2}\,\pi(W_i), \qquad (7.94)$$

$$\hat T_{1n} = \frac{1}{n}\sum_{i=1}^{n}\tilde u_i\,\Big[\hat m(W_i) - \tilde m(W_i; \hat\varphi)\Big]\,\pi(W_i), \qquad (7.95)$$

$$\hat T_{2n} = \frac{1}{n^{2}}\sum_{i=1}^{n}\sum_{j \ne i}\tilde u_i\tilde u_j\, K_{h,ij}\,\pi(W_i)\,\pi(W_j), \qquad (7.96)$$

$$\hat T_{3n} = \frac{1}{n}\sum_{i=1}^{n}\big(\tilde u_i - \hat u_i\big)^{2}\,\pi(W_i), \qquad (7.97)$$

where $K_{h,ij} = K_h(X_i - X_j)\,\mathbf 1(Z_i = Z_j)$, $\hat u_i = Y_i - \hat m(W_i)$ and $\tilde u_i = Y_i - \tilde m(W_i; \hat\varphi)$ are the unrestricted and restricted (additive) residuals, respectively, and $\pi(\cdot)$ is a pre-specified nonnegative weighting function that is continuous and strictly positive on $\mathcal J_0$. In (7.94), likely candidates for the function $\Lambda$ are the identity and $\Lambda = G_{\hat\theta}$. One rejects the null for large values of $\hat T_{ln}$, $l = 0, 1, 2, 3$.

Under certain conditions, the test statistics $\hat T_{ln}$, $l = 0, 1, 2, 3$, are, after location and scale adjustments, asymptotically standard normal under the null hypothesis. That is, for each test statistic there exists a random sequence $\{C_{ln}\}$ such that, under $\mathcal H_0$,

$$T_{ln} = \frac{nh^{p/2}\,\hat T_{ln} - C_{ln}}{\sqrt{V_{ln}}} \stackrel{d}{\to} N(0, 1), \quad l = 0, \ldots, 3. \qquad (7.98)$$

The expressions for $C_{ln}$ and $V_{ln}$ are given in Gozalo and Linton (2001).

As usual, the asymptotic normal approximations here can work poorly. To implement the tests, one needs to rely on the bootstrap to compute bootstrap p-values or critical values. The procedure is standard; for more details, see Gozalo and Linton (2001).

7.6 Exercises

1. Generate $n = 100$ binary observations according to the model

$$\log\Big\{\frac{p(x)}{1 - p(x)}\Big\} = x + x^{2},$$

where $p(x) = P(Y = 1 \mid X = x)$ and $X$ is uniformly distributed on $[-1, 1]$. Fit the model $\mathrm{logit}(p(x)) = m_0 + m_1(x)$ as described in this chapter. Plot both the estimated logits and the estimated probabilities together with their true values.

2. Derive approximate formulae for the standard errors of the estimates in the previous exercise and add these to the plots above.