7 Semiparametric Estimation of Additive Models
Additive models are very useful for approximating high-dimensional regression mean functions. They and their extensions have become one of the most widely used nonparametric techniques since the excellent monograph by Hastie and Tibshirani (1990) and the companion software described in Chambers and Hastie (1991). For a recent survey on additive models, see Horowitz (2014).
Much applied research in economics and statistics is concerned with the estimation of a conditional mean or quantile function. Specifically, let $(X, Y)$ be a random pair, where $Y$ is a scalar random variable and $X$ is a $d \times 1$ random vector that is continuously distributed. We are interested in the estimation of either $m(x) \equiv E(Y \mid X = x)$ or $q_\tau(x) \equiv \arg\min_{q(\cdot)} E[\rho_\tau(Y - q(X)) \mid X = x]$, the $\tau$th conditional quantile function of $Y$ given $X = x$: $P(Y \le q_\tau(x) \mid X = x) = \tau$. In a classical nonparametric additive model, $m$ or $q_\tau$ is assumed to have the form
$$m(x) = c_0 + \sum_{j=1}^{d} m_j(x_j) \tag{7.1}$$
or
$$q_\tau(x) = c_0 + \sum_{j=1}^{d} q_j(x_j), \tag{7.2}$$
where $c_0$ is a constant, $x_j$ is the $j$th element of $x$, and $m_1, \ldots, m_d$ (respectively $q_1, \ldots, q_d$) are one-dimensional smooth functions that are unknown and estimated nonparametrically. Model (7.1) or (7.2) can be extended to
$$m(x) = G\Big(c_0 + \sum_{j=1}^{d} m_j(x_j)\Big) \tag{7.3}$$
or
$$q_\tau(x) = G\Big(c_0 + \sum_{j=1}^{d} q_j(x_j)\Big), \tag{7.4}$$
where $G$ is a strictly increasing function that may be known or unknown.

Below we focus on the estimation of model (7.1) and its extension. Then we briefly touch upon the models in (7.2)-(7.4).
7.1 The Additive Model and the Backfitting Algorithm
7.1.1 The Basic Additive Model
In the regression framework, a simple additive model is defined by
$$Y_i = c_0 + \sum_{j=1}^{d} m_j(X_{ij}) + \varepsilon_i, \tag{7.5}$$
where $E(\varepsilon_i \mid X_{i1}, \ldots, X_{id}) = 0$, $E(\varepsilon_i^2 \mid X_{i1}, \ldots, X_{id}) = \sigma^2(X_{i1}, \ldots, X_{id})$, and the $m_j$'s are arbitrary univariate functions that are assumed to be smooth and unknown. Note that we can add a constant to a component $m_j$ or to $c_0$ and subtract the constant from another component in (7.5). Thus $c_0, m_1, \ldots, m_d$ are not identified without further restrictions. To prevent ambiguity, various identification conditions can be assumed. For example, one can assume that either
$$E[m_j(X_{ij})] = 0, \quad j = 1, \ldots, d, \tag{7.6}$$
or
$$m_j(0) = 0, \quad j = 1, \ldots, d, \tag{7.7}$$
or
$$\int m_j(x)\, dx = 0, \quad j = 1, \ldots, d, \tag{7.8}$$
whichever is convenient for the estimation method at hand. We also assume that the $m_j$'s are smooth functions so that they can be estimated as well as in a one-dimensional nonparametric regression problem (Stone, 1985, 1986). Hence, the curse of dimensionality is avoided.
Frequently, we will write below
$$m(x) = E(Y \mid X = x) = c_0 + \sum_{j=1}^{d} m_j(x_j), \tag{7.9}$$
where $x = (x_1, \ldots, x_d)'$ and $X = (X_1, \ldots, X_d)'$.

Model (7.5) allows us to examine the extent of the nonlinear contribution of each explanatory variable to the dependent variable. Under the identification conditions that $E[m_j(X_{ij})] = 0$ holds for each $j = 1, \ldots, d$ and $E(\varepsilon_i \mid X_{i1}, \ldots, X_{id}) = 0$, we have $c_0 = E(Y_i)$, so that the single finite-dimensional parameter $c_0$ can be estimated by the sample mean $\bar{Y} = n^{-1}\sum_{i=1}^{n} Y_i$. Since $\bar{Y}$ converges to $c_0$ at the parametric $\sqrt{n}$-rate, which is faster than any nonparametric convergence rate, we will simply work below on the model without $c_0$ in (7.5) by assuming $E(Y_i) = 0$.
Additive models of the form (7.5) have been shown to be useful in practice. They naturally generalize linear regression models and allow interpretation of marginal changes, i.e., the effect of one variable, say $x_j$, on the conditional mean function $m(x)$, holding everything else constant. They are also interesting from a theoretical perspective since they combine flexible nonparametric modeling of many variables with statistical precision that is typical of just one explanatory variable.
Example 7.1 (Additive AR($p$) models) In the time series literature, a useful class of nonlinear autoregressive models is the additive model
$$Y_t = \sum_{j=1}^{p} m_j(Y_{t-j}) + \varepsilon_t. \tag{7.10}$$
In this case, the model is also called an additive autoregressive model of order $p$ and is denoted AAR($p$). In particular, it includes the AR($p$) model as a special case and allows us to test whether an AR($p$) model holds reasonably for a given time series.
Restricting ourselves to the class of additive models (7.5), the prediction error can be written as
$$E\Big[Y - c_0 - \sum_{j=1}^{d} m_j(X_j)\Big]^2 = E[Y - m^*(X_1, \ldots, X_d)]^2 + E\Big[m^*(X_1, \ldots, X_d) - c_0 - \sum_{j=1}^{d} m_j(X_j)\Big]^2, \tag{7.11}$$
where $m^*(X_1, \ldots, X_d) = E(Y \mid X_1, \ldots, X_d)$. Thus finding the best additive model to minimize the least squares prediction error is equivalent to finding the one that best approximates the conditional mean function, in the sense that $c_0$ and the $m_j$'s minimize the second term in (7.11). In the case where the additive model is not correctly specified (i.e., $\Pr\big(m^*(X_1, \ldots, X_d) = c_0 + \sum_{j=1}^{d} m_j(X_j)\big) < 1$), we can interpret it as the best additive approximation of the conditional mean function.
7.1.2 The Backfitting Algorithm
The estimation of $m_1, \ldots, m_d$ can easily be done by using the backfitting algorithm in the nonparametric literature. To do this, we first introduce some background knowledge on global spline approximation.
Local linear modelling cannot be directly applied to fit the additive model (7.5) with $c_0 = 0$. To approximate the unknown functions $m_1, \ldots, m_d$ locally at the point $(x_1, \ldots, x_d)$, we would need to localize simultaneously in the $d$ variables $x_1, \ldots, x_d$. This yields a $d$-dimensional hypercube, which contains hardly any data points for small to moderate sample sizes unless the neighborhood is very large. But when the neighborhood is large enough to contain enough data points, the approximation error will be large. This is the key problem underlying the curse of dimensionality.
To attenuate the problem, we can approximate the nonlinear functions $m_1, \ldots, m_d$ by polynomial splines or Hermite polynomials, among others. For example, we can approximate $m_j(x_j)$ by
$$g_j(x_j, \beta_j) = \sum_{l=1}^{L_j} \beta_{jl}\, \varphi_l(x_j), \tag{7.12}$$
where $\beta_j = (\beta_{j1}, \ldots, \beta_{jL_j})'$ and $\{\varphi_l(\cdot)\}_{l \ge 1}$ can be chosen from some basis of functions, which includes the trigonometric series $\{\sin(lx), \cos(lx)\}_{l=1}^{\infty}$, the polynomial series $\{1, x, x^2, x^3, \ldots\}$, and Gallant's (1982) flexible Fourier form $\{x, x^2, \sin(x), \cos(x), \sin(2x), \cos(2x), \ldots\}$. Below we introduce two popular choices of approximating functions, namely, polynomial splines and Hermite polynomials.
Spline methods are very useful for nonparametric modelling. They are based on global approximation and are useful extensions of polynomial regression techniques. Let $t_1, \ldots, t_k$ be a sequence of given knots such that $-\infty < t_1 < \cdots < t_k < \infty$. These knots can be chosen either by the researcher or by the data themselves. A spline function of order $m$ is an $(m-2)$-times continuously differentiable function whose restriction to each of the intervals $(-\infty, t_1], [t_1, t_2), \ldots, [t_k, \infty)$ is a polynomial of degree $m-1$. The following formal definition is adapted from Eubank (1999, p. 281).

Definition 7.1 A polynomial spline function $s(x)$ of order $m$ with knots $t_1, \ldots, t_k$ is a function of the form
$$s(x) = \sum_{j=1}^{m+k} \theta_j s_j(x) \tag{7.13}$$
for some set of coefficients $\theta_j$, $j = 1, \ldots, m+k$, where
$$s_j(x) = x^{j-1},\ j = 1, \ldots, m; \qquad s_{m+j}(x) = (x - t_j)_+^{m-1},\ j = 1, \ldots, k, \tag{7.14}$$
and $(x - t_j)_+^{m-1} = \max\big((x - t_j)^{m-1}, 0\big)$.

The above definition is equivalent to saying that
(i) $s$ is a piecewise polynomial of degree $m-1$ on any subinterval $[t_j, t_{j+1})$;
(ii) $s$ has $m-2$ continuous derivatives; and
(iii) $s$ has a discontinuous $(m-1)$st derivative with jumps at $t_j$, $j = 1, \ldots, k$.

Thus, a spline is a piecewise polynomial whose different polynomial segments are joined together at the knots $t_j$, $j = 1, \ldots, k$, in a fashion that ensures these continuity properties. Let $S(t_1, \ldots, t_k)$ denote the set of all functions of the form (7.13). Then $S(t_1, \ldots, t_k)$ is a vector space in the sense that sums of functions in $S(t_1, \ldots, t_k)$ remain in the set, etc. Since the functions $1, x, \ldots, x^{m-1}, (x - t_1)_+^{m-1}, \ldots, (x - t_k)_+^{m-1}$ are linearly independent, it follows that $S(t_1, \ldots, t_k)$ has dimension $m + k$.
Example 7.2 (Polynomial splines) Returning to our case, we can approximate $m_j$ ($j = 1, \ldots, d$) by a polynomial spline of order $m$ with knots $\{t_{j1}, \ldots, t_{jk_j}\}$:
$$m_j(x_j) \simeq \sum_{l=0}^{m-1} \beta_{jl}\, x_j^l + \sum_{s=1}^{k_j} \beta_{j,s+m-1} (x_j - t_{js})_+^{m-1} \equiv g_j(x_j, \beta_j). \tag{7.15}$$
In real applications, for any given number of knots $k_j$, the knots $\{t_{j1}, \ldots, t_{jk_j}\}$ can simply be chosen as the empirical quantiles of $\{X_{ij}\}$, i.e., $t_{js}$ is the $s/(k_j+1)$-th sample quantile of $\{X_{ij}\}$ for $s = 1, \ldots, k_j$. When the knots are fine enough on the support of $X_j$, which is usually assumed to be compact, the resulting spline function $g_j(x_j, \beta_j)$ can approximate the smooth function $m_j$ quite well. Two popular choices of $m$ are $m = 2$ and $4$. When $m = 2$, $g_j(x_j, \beta_j)$ is simply the piecewise linear function
$$g_j(x_j, \beta_j) = \beta_{j0} + \beta_{j1} x_j + \beta_{j2}(x_j - t_{j1})_+ + \cdots + \beta_{j,k_j+1}(x_j - t_{jk_j})_+. \tag{7.16}$$
One can easily verify that $g_j(x_j, \beta_j)$ is piecewise linear and continuous, and has kinks at the knots $t_{j1}, \ldots, t_{jk_j}$.
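The truncated-power construction in (7.15)-(7.16) is easy to implement directly. The following sketch builds the piecewise-linear ($m = 2$) basis with knots at empirical quantiles and fits a single smooth component by least squares; the simulated data and function names are illustrative assumptions, not part of the text.

```python
import numpy as np

def linear_spline_basis(x, knots):
    """Truncated-power basis of order m = 2 (eq. (7.16)):
    columns 1, x, (x - t_1)_+, ..., (x - t_k)_+."""
    cols = [np.ones_like(x), x]
    for t in knots:
        cols.append(np.maximum(x - t, 0.0))
    return np.column_stack(cols)

rng = np.random.default_rng(0)
n = 500
x = rng.uniform(-2, 2, n)
y = np.sin(x) + 0.1 * rng.standard_normal(n)     # smooth m(x) = sin(x) plus noise

# knots at the s/(k+1)-th empirical quantiles, s = 1, ..., k
k = 5
knots = np.quantile(x, [(s + 1) / (k + 1) for s in range(k)])
B = linear_spline_basis(x, knots)
beta, *_ = np.linalg.lstsq(B, y, rcond=None)     # least squares coefficients
fitted = B @ beta                                # piecewise-linear spline fit
```

The fit is linear in the $m + k = 7$ basis coefficients, so the whole problem reduces to one ordinary least squares regression.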
Example 7.3 (Hermite polynomials of order $L$) We can also approximate the unknown function $m_j$ ($j = 1, \ldots, d$) by Hermite polynomials of order $L$:
$$m_j(x_j) \simeq \sum_{l=0}^{L} \beta_{jl} (x_j - c_{j1})^l \exp\Big\{-\frac{(x_j - c_{j1})^2}{2 c_{j2}^2}\Big\} \equiv g_j(x_j, \beta_j), \tag{7.17}$$
where $c_{j1}$ and $c_{j2}^2$ can be chosen as the sample mean and sample variance of the data $\{X_{ij}\}_{i=1}^{n}$. Hermite polynomials are often chosen when the underlying variables have infinite support.
After the approximation, we can estimate the unknown parameters by the least squares method. That is, we choose $\beta_1, \ldots, \beta_d$ to minimize the following criterion function:
$$n^{-1} \sum_{i=1}^{n} \big[Y_i - g_1(X_{i1}, \beta_1) - \cdots - g_d(X_{id}, \beta_d)\big]^2. \tag{7.18}$$
Let the solution be $\hat\beta_j$, $j = 1, \ldots, d$. Then the estimated functions are simply
$$\hat{m}_j(x_j) = g_j(x_j, \hat\beta_j), \quad j = 1, \ldots, d. \tag{7.19}$$
The above least squares problem can be solved directly, resulting in a large parametric problem requiring the inversion of a matrix of high order. Alternatively, the optimization problem can be solved using the backfitting algorithm. Conditional expectations provide a simple intuitive motivation for the backfitting algorithm. If the additive model (7.5) is correct with $c_0 = 0$ (otherwise replace $Y_i$ by $Y_i$ minus its sample mean), then for any $j = 1, \ldots, d$,
$$E\Big[Y - \sum_{l \ne j} m_l(X_l) \,\Big|\, X_j\Big] = m_j(X_j). \tag{7.20}$$
This immediately suggests an iterative algorithm for computing $\hat{m}_j$:
Step 1. Given initial values of $\hat\beta_2, \ldots, \hat\beta_d$ (say, from the direct least squares solution), minimize (7.18) with respect to $\beta_1$. This is a much smaller "parametric problem" and can be solved relatively easily.
Step 2. With the estimated value $\hat\beta_1$ and the values of $\hat\beta_3, \ldots, \hat\beta_d$, we now minimize (7.18) with respect to $\beta_2$. This results in an updated estimate $\hat\beta_2$. Repeat this exercise until $\hat\beta_d$ is updated.
Step 3. Repeat Steps 1-2 until a certain convergence criterion is met.
This is the basic idea of the backfitting algorithm (Ezekiel, 1924; Buja et al., 1989). Let $\hat{Y}_{ij} = Y_i - \sum_{l \ne j} g_l(X_{il}, \hat\beta_l)$ be the partial residuals without using the regressor $X_{ij}$. Then the backfitting algorithm finds $\hat\beta_j$ by minimizing
$$n^{-1} \sum_{i=1}^{n} \big[\hat{Y}_{ij} - g_j(X_{ij}, \beta_j)\big]^2. \tag{7.21}$$
This is a nonparametric regression problem of $\hat{Y}_{ij}$ on the variable $X_{ij}$. The resulting estimate is linear in the partial residuals $\hat{Y}_{ij}$ and can be written as
$$\begin{pmatrix} g_j(X_{1j}, \hat\beta_j) \\ g_j(X_{2j}, \hat\beta_j) \\ \vdots \\ g_j(X_{nj}, \hat\beta_j) \end{pmatrix} = S_j \begin{pmatrix} \hat{Y}_{1j} \\ \hat{Y}_{2j} \\ \vdots \\ \hat{Y}_{nj} \end{pmatrix}, \tag{7.22}$$
where $S_j$ is the smoothing matrix. For ease of presentation, denote the left-hand side of (7.22) as $\hat{g}_j$ and write $Y = (Y_1, \ldots, Y_n)'$. Then (7.22) can be written as
$$\hat{g}_j = S_j \Big(Y - \sum_{l \ne j} \hat{g}_l\Big). \tag{7.23}$$
The above example uses polynomial splines or Hermite polynomials as the nonparametric smoother, but the idea applies to any nonparametric smoother. Let $S_j$ be a smoothing matrix that is obtained by regressing the partial residuals $\{\hat{Y}_{ij}\}$ nonparametrically on $\{X_{ij}\}$. The general backfitting algorithm can be outlined as follows.

Step 1. Initialize the functions $\hat{g}_1, \ldots, \hat{g}_d$.
Step 2. For $j = 1, \ldots, d$, compute $\hat{g}_j^* = S_j\big(Y - \sum_{l \ne j} \hat{g}_l\big)$ and center the estimator to obtain
$$\hat{g}_j(\cdot) = \hat{g}_j^*(\cdot) - n^{-1} \sum_{i=1}^{n} \hat{g}_j^*(X_{ij}), \tag{7.24}$$
where $\hat{g}_j^*(X_{ij})$ denotes the $i$th element of $\hat{g}_j^*$.
Step 3. Repeat Step 2 until convergence.
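The general algorithm above can be sketched with a Nadaraya-Watson smoother playing the role of $S_j$. This is a minimal illustration, not a production implementation: the Gaussian kernel, bandwidth, and simulated design are assumptions of the sketch.

```python
import numpy as np

def nw_smooth(x, y, h):
    """Nadaraya-Watson smoother: fitted values at the sample points x."""
    u = (x[:, None] - x[None, :]) / h          # n x n pairwise differences
    w = np.exp(-0.5 * u ** 2)                  # Gaussian kernel weights
    return (w * y[None, :]).sum(axis=1) / w.sum(axis=1)

def backfit(X, y, h=0.3, max_iter=50, tol=1e-6):
    """Backfitting for y = sum_j g_j(X[:, j]) + eps, with E(Y) = 0 imposed."""
    n, d = X.shape
    y = y - y.mean()                           # work with the centered response
    g = np.zeros((n, d))                       # initialize g_1, ..., g_d at 0
    for _ in range(max_iter):
        g_old = g.copy()
        for j in range(d):
            partial = y - g.sum(axis=1) + g[:, j]   # partial residuals
            gj = nw_smooth(X[:, j], partial, h)     # smooth against X_j
            g[:, j] = gj - gj.mean()                # re-center, eq. (7.24)
        if np.max(np.abs(g - g_old)) < tol:
            break
    return g

rng = np.random.default_rng(1)
n = 400
X = rng.uniform(-2, 2, (n, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.1 * rng.standard_normal(n)
g = backfit(X, y)
```

Each inner pass is one cycle of Step 2; the re-centering keeps the components consistent with the identification constraint (7.6).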
See Hastie and Tibshirani (1990, p. 91) for a discussion of the above algorithm. The re-centering in Step 2 is to comply with the constraint in (7.6). The convergence of the algorithm is a delicate issue and has been addressed via the concept of concurvity by Buja et al. (1989). Concurvity is the analogue of collinearity in linear regression models. Assuming that concurvity is not present, it is shown there that the backfitting algorithm converges and solves the following equation:
$$\begin{pmatrix} \hat{g}_1 \\ \hat{g}_2 \\ \vdots \\ \hat{g}_d \end{pmatrix} = \begin{pmatrix} I & S_1 & \cdots & S_1 \\ S_2 & I & \cdots & S_2 \\ \vdots & \vdots & \ddots & \vdots \\ S_d & S_d & \cdots & I \end{pmatrix}^{-1} \begin{pmatrix} S_1 \\ S_2 \\ \vdots \\ S_d \end{pmatrix} \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}. \tag{7.25}$$
Direct calculation of the right-hand side of (7.25) involves inverting an $nd \times nd$ square matrix and can hardly be implemented on an average computer for moderate to large sample sizes. In contrast, backfitting does not share this drawback and is frequently used in practical implementations.
See also Mammen, Linton, and Nielsen (1999, Annals of Statistics), who establish the existence and asymptotic properties of a backfitting projection algorithm under weak conditions.
7.1.3 Generalized Additive Models: Logistic Regression
As Hastie and Tibshirani (1990) remark, the linear model is used for regression in a wide variety of contexts
other than the ordinary regression, including log-linear models, logistic regression, the proportional-
hazards model for survival data, models for ordinal categorical response, and transformation models. It
is a convenient but crude first-order approximation to the regression function, and in many cases it is
adequate. The additive model can be used to generalize all these models in an obvious way. For clarity,
we focus on the logistic regression model.
In this setting the response variable is dichotomous, such as yes/no, survived/died, or increase/decrease, and the data analysis is aimed at relating this outcome to the predictors. One quantity of interest is the proportion of positive outcomes as a function of the predictors (explanatory variables).
In linear modelling of binary data, the most popular approach is logistic regression, which models the logit of the response probability with a linear form:
$$\operatorname{logit}(p(x)) \equiv \log\Big\{\frac{p(x)}{1 - p(x)}\Big\} = x'\beta, \tag{7.26}$$
where $p(x) = P(Y = 1 \mid X = x)$. Alternatively, we can write
$$p(x) = \frac{\exp(x'\beta)}{1 + \exp(x'\beta)}. \tag{7.27}$$
There are several reasons for its popularity, but the most compelling is that the logit model ensures that the proportions $p(x)$ lie in $(0, 1)$ (see (7.27)) without any constraints on the linear predictor $x'\beta$. We can generalize the model in (7.26) by replacing the linear predictor with an additive one,
$$\log\Big\{\frac{p(x)}{1 - p(x)}\Big\} = c_0 + \sum_{j=1}^{d} m_j(x_j), \tag{7.28}$$
or that in (7.27) to
$$p(x) = G\Big(c_0 + \sum_{j=1}^{d} m_j(x_j)\Big), \tag{7.29}$$
where $G(v) = \frac{\exp(v)}{1 + \exp(v)}$ denotes the CDF of the standard logistic distribution. Thus (7.29) is a special case of the generalized additive model in (7.4), where $G$ is a strictly increasing function that has a known functional form.
Insight from the Linear Logistic Regression Model To estimate the model (7.28), we can gain some insight from linear logistic regression methodology. Maximum likelihood is the most popular method for estimating the linear logistic model. For the present problem the log-likelihood has the form
$$\ell(\beta) = \sum_{i=1}^{n} \big[Y_i \log(p(X_i)) + (1 - Y_i) \log(1 - p(X_i))\big], \tag{7.30}$$
where $p(X_i) = \exp(X_i'\beta)/(1 + \exp(X_i'\beta))$. The score equations
$$\frac{\partial \ell(\beta)}{\partial \beta} = \sum_{i=1}^{n} X_i \big[Y_i - p(X_i)\big] = 0 \tag{7.31}$$
are nonlinear in the parameters, and consequently one has to find the solution iteratively.

The Newton-Raphson iterative method can be expressed in an appealing form. Given the current estimate $\hat\beta$, we estimate the probabilities $p(X_i)$ by $\hat{p}_i = \exp(X_i'\hat\beta)/(1 + \exp(X_i'\hat\beta))$. We form the linearized response
$$Z_i = X_i'\hat\beta + \frac{Y_i - \hat{p}_i}{\hat{p}_i(1 - \hat{p}_i)}, \tag{7.32}$$
where $Z_i$ represents the first-order Taylor series approximation to $\operatorname{logit}(Y_i)$ about the current estimate $\hat{p}_i$.³ Denote $\varepsilon_i = (Y_i - \hat{p}_i)/(\hat{p}_i(1 - \hat{p}_i))$. If $\hat\beta$ and hence $\hat{p}_i$ are fixed, the variance of $\varepsilon_i$ is $1/(\hat{p}_i(1 - \hat{p}_i))$, and hence we choose the weights $w_i \equiv \hat{p}_i(1 - \hat{p}_i)$. Alternatively, we can verify that
$$E(\varepsilon_i \mid X_i) = 0 \quad \text{and} \quad E(\varepsilon_i^2 \mid X_i) = \frac{1}{p(X_i)(1 - p(X_i))} \tag{7.33}$$
in the extreme case where $\hat{p}_i = p(X_i)$. So when $\hat{p}_i$ approximates $p(X_i)$, we expect that
$$E(\varepsilon_i \mid X_i) \approx 0 \quad \text{and} \quad E(\varepsilon_i^2 \mid X_i) \approx \frac{1}{p(X_i)(1 - p(X_i))} \approx \frac{1}{\hat{p}_i(1 - \hat{p}_i)}. \tag{7.34}$$
Consequently, a new $\hat\beta$ can be obtained by weighted linear regression of $Z_i$ on $X_i$ with weights $w_i \equiv \hat{p}_i(1 - \hat{p}_i)$. This is repeated until $\hat\beta$ converges.

Algorithm The above iterative algorithm lends itself ideally to the generalized additive model in (7.28).
Define
$$Z_i = \tilde{c}_0 + \sum_{j=1}^{d} \tilde{m}_j(X_{ij}) + \frac{Y_i - \hat{p}_i}{\hat{p}_i(1 - \hat{p}_i)}, \tag{7.35}$$
where $(\tilde{c}_0, \tilde{m}_1, \ldots, \tilde{m}_d)$ are the current estimates of the additive model components and
$$\hat{p}_i = \frac{\exp\big(\tilde{c}_0 + \sum_{j=1}^{d} \tilde{m}_j(X_{ij})\big)}{1 + \exp\big(\tilde{c}_0 + \sum_{j=1}^{d} \tilde{m}_j(X_{ij})\big)}. \tag{7.36}$$
Define the weights
$$w_i = \hat{p}_i(1 - \hat{p}_i). \tag{7.37}$$
The new estimates of $c_0$ and $m_j$ ($j = 1, \ldots, d$) are computed by fitting a weighted additive model to $Z_i$. Of course, this additive model fitting procedure is iterative as well. Fortunately, the functions from the previous step are good starting values for the next step. This procedure is called the local-scoring algorithm in the literature. The new estimates from each local-scoring step are monitored, and the iterations are stopped when their relative change is negligible.
³Pretending that $Y$ is bounded away from 0 and 1 and $\hat{p}$ is close to $Y$, we have, by a first-order Taylor expansion, that
$$\operatorname{logit}(Y) = \log\frac{Y}{1 - Y} \approx \log\frac{\hat{p}}{1 - \hat{p}} + \frac{1}{\hat{p}(1 - \hat{p})}(Y - \hat{p}).$$
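The local-scoring steps (7.35)-(7.37) can be sketched as follows, with a weighted Nadaraya-Watson smoother doing the inner weighted additive fit. The kernel, bandwidths, probability clipping, and simulated data are illustrative assumptions of this sketch, not part of the original algorithm description.

```python
import numpy as np

def wnw(x, y, w, h):
    """Weighted Nadaraya-Watson fit of y on x, evaluated at the sample points."""
    k = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
    kw = k * w[None, :]
    return (kw * y[None, :]).sum(axis=1) / kw.sum(axis=1)

def local_scoring(X, y, h=0.4, outer=15, inner=10):
    """Local scoring for the additive logistic model (7.28)."""
    n, d = X.shape
    c0, g = 0.0, np.zeros((n, d))
    for _ in range(outer):
        eta = c0 + g.sum(axis=1)
        p = np.clip(1.0 / (1.0 + np.exp(-eta)), 1e-3, 1 - 1e-3)  # guard p(1-p)
        z = eta + (y - p) / (p * (1 - p))     # linearized response, eq. (7.35)
        w = p * (1 - p)                       # weights, eq. (7.37)
        for _ in range(inner):                # weighted backfitting of z on X
            c0 = np.average(z - g.sum(axis=1), weights=w)
            for j in range(d):
                r = z - c0 - g.sum(axis=1) + g[:, j]   # partial residuals
                gj = wnw(X[:, j], r, w, h)
                g[:, j] = gj - np.average(gj, weights=w)  # re-center
    return c0, g

rng = np.random.default_rng(2)
n = 600
X = rng.uniform(-2, 2, (n, 2))
eta_true = np.sin(X[:, 0]) + X[:, 1]
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-eta_true))).astype(float)
c0, g = local_scoring(X, y)
```

The outer loop is the scoring step; the inner loop is an ordinary weighted backfit of the working response, so the previous components serve as warm starts for each new scoring step.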
7.2 The Marginal Integration Method
7.2.1 The Marginal Integration Estimator
Let $\{(X_i, Y_i)\}_{i=1}^{n}$ be a random sample from the following additive model:
$$Y_i = c_0 + \sum_{j=1}^{d} m_j(X_{ij}) + \varepsilon_i, \tag{7.38}$$
where $E(\varepsilon_i \mid X_{i1}, \ldots, X_{id}) = 0$, $E(\varepsilon_i^2 \mid X_{i1}, \ldots, X_{id}) = \sigma^2(X_{i1}, \ldots, X_{id})$, and $\{m_j(\cdot)\}_{j=1}^{d}$ is a set of unknown functions satisfying $E[m_j(X_{ij})] = 0$, $j = 1, \ldots, d$. We follow Chen et al. (1995) to define the marginal integration estimator.

Let $f_j(\cdot)$ be the marginal density of $X_j$, $j = 1, \ldots, d$. Let $m(x_1, \ldots, x_d) = c_0 + \sum_{j=1}^{d} m_j(x_j)$ be the conditional mean function. For any $j \in \{1, \ldots, d\}$, we define
$$x_{\bar{j}} = (x_1, \ldots, x_{j-1}, x_{j+1}, \ldots, x_d)'.$$
The joint density of $X_{\bar{j}} = (X_1, \ldots, X_{j-1}, X_{j+1}, \ldots, X_d)'$ is denoted as $f_{\bar{j}}(x_{\bar{j}})$. Then for a fixed $x = (x_1, \ldots, x_d)'$, the functional
$$\int m(x)\, f_{\bar{j}}(x_{\bar{j}}) \prod_{l \ne j} dx_l \tag{7.39}$$
is $c_0 + m_j(x_j)$.
Let $K(\cdot)$ and $\bar{K}(\cdot)$ be kernel functions with compact support. Let
$$K_h(\cdot) = h^{-1} K(\cdot/h) \quad \text{and} \quad \bar{K}_H(\cdot) = H^{-(d-1)} \bar{K}(\cdot/H). \tag{7.40}$$
Using the Nadaraya-Watson (NW) kernel method to estimate the mean function $m(\cdot)$, we average over the observations to obtain the following estimator. For $1 \le j \le d$ and any $x_j$ in the domain of $m_j(\cdot)$, define
$$\hat{m}_j(x_j) = \frac{1}{n} \sum_{l=1}^{n} \tilde{m}(X_{l1}, \ldots, X_{l,j-1}, x_j, X_{l,j+1}, \ldots, X_{ld}) = \frac{1}{n} \sum_{l=1}^{n} \left[ \frac{\sum_{i=1}^{n} K_h(X_{ij} - x_j) \bar{K}_H(X_{i\bar{j}} - X_{l\bar{j}}) Y_i}{\sum_{i=1}^{n} K_h(X_{ij} - x_j) \bar{K}_H(X_{i\bar{j}} - X_{l\bar{j}})} \right] \tag{7.41}$$
$$= \sum_{i=1}^{n} \left[ \frac{1}{n} \sum_{l=1}^{n} \frac{K_h(X_{ij} - x_j) \bar{K}_H(X_{i\bar{j}} - X_{l\bar{j}})}{\sum_{k=1}^{n} K_h(X_{kj} - x_j) \bar{K}_H(X_{k\bar{j}} - X_{l\bar{j}})} \right] Y_i.$$
If the regressors were independent, we might use $\sum_{i=1}^{n} K_h(X_{ij} - x_j) Y_i \big/ \sum_{i=1}^{n} K_h(X_{ij} - x_j)$ to estimate $c_0 + m_j(x_j)$. This is a one-dimensional NW estimator. Nevertheless, this estimator has larger variance in comparison to the estimator above even in this restricted situation; see Härdle and Tsybakov (1995).
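The estimator (7.41) can be sketched directly. The following minimal implementation assumes Gaussian kernels standing in for $K$ and $\bar{K}$ and a simulated two-regressor design; the function and variable names are illustrative.

```python
import numpy as np

def marginal_integration(X, y, j, grid, h, H):
    """Marginal integration estimate of c0 + m_j at the points in `grid`
    (eq. (7.41)): NW fit in direction j, averaged over the other regressors."""
    n, d = X.shape
    Xbar = np.delete(X, j, axis=1)                      # X_{i, -j}
    # Kbar_H(X_{i,-j} - X_{l,-j}): n x n product kernel over the d-1 directions
    diffs = (Xbar[:, None, :] - Xbar[None, :, :]) / H
    Kbar = np.exp(-0.5 * (diffs ** 2).sum(-1))
    out = np.empty(len(grid))
    for t, xj in enumerate(grid):
        Kx = np.exp(-0.5 * ((X[:, j] - xj) / h) ** 2)   # K_h(X_{ij} - x_j)
        W = Kx[:, None] * Kbar                          # (i, l) weights
        out[t] = np.mean((W * y[:, None]).sum(0) / W.sum(0))  # average over l
    return out

rng = np.random.default_rng(3)
n = 500
X = rng.uniform(-1, 1, (n, 2))
y = X[:, 0] ** 2 + np.sin(np.pi * X[:, 1]) + 0.1 * rng.standard_normal(n)
grid = np.linspace(-0.8, 0.8, 9)
m1_hat = marginal_integration(X, y, 0, grid, h=0.15, H=0.3)
# here the integral (7.39) equals x^2 + E[sin(pi X_2)] = x^2
```

Note that the outer average over $l$ replaces the integral against $f_{\bar{j}}$ in (7.39) by its empirical counterpart.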
Let $f(\cdot)$ denote the joint density of $X = (X_1, \ldots, X_d)'$. The following assumptions are modified from Chen et al. (1995).

Assumptions
A1. $\{(X_i, Y_i)\}$ is IID and $E(Y^4) < \infty$.
A2. The densities $f$ and $f_{\bar{j}}$ are bounded, Lipschitz continuous, and bounded away from zero. The function $m$ has Lipschitz continuous derivatives.
A3. The conditional variance function $\sigma^2(x) = E(\varepsilon^2 \mid X = x)$ is Lipschitz continuous.
A4. The kernel function $K(\cdot)$ is a bounded nonnegative second-order kernel that is compactly supported and Lipschitz continuous, with $\nu_{02} = \int K^2(u)\, du < \infty$ and $\mu_{21} = \int u^2 K(u)\, du < \infty$.
A5. The kernel function $\bar{K}$ is bounded, compactly supported, and Lipschitz continuous. $\bar{K}$ is a $q$th-order kernel with $\|\bar{K}\|_2^2 = \int \bar{K}^2(u)\, du < \infty$.
A6. As $n \to \infty$, $H \to 0$, $h = c_h n^{-1/5}$, and $n h H^{d-1}/\log n \to \infty$.
Theorem 7.2 Suppose that Assumptions A1-A6 hold and the order $q$ of $\bar{K}$ satisfies $q > (d-1)/2$. Then
$$n^{2/5} \big[\hat{m}_j(x_j) - m_j(x_j) - c_0\big] \to_d N\big(B_j(x_j), V_j(x_j)\big), \tag{7.42}$$
where
$$B_j(x_j) = c_h^2 \mu_{21} \left\{ \frac{1}{2} m_j''(x_j) + m_j'(x_j) \int \frac{\partial f(x_j, x_{\bar{j}})/\partial x_j}{f(x_j, x_{\bar{j}})}\, f_{\bar{j}}(x_{\bar{j}})\, dx_{\bar{j}} \right\},$$
$$V_j(x_j) = c_h^{-1} \nu_{02} \int \frac{\sigma^2(x_j, x_{\bar{j}})\, f_{\bar{j}}^2(x_{\bar{j}})}{f(x_j, x_{\bar{j}})}\, dx_{\bar{j}}.$$
Proof. See Chen et al. (1995).

Theorem 7.2 says that the rate of convergence to the asymptotic normal limit distribution does not suffer from the "curse of dimensionality." Nevertheless, to achieve this rate of convergence, we must impose some restrictions on the bandwidth sequences and choices of kernels. Note that the above bandwidth conditions do not exclude the optimal one-dimensional smoothing rate $n^{-1/5}$ for both $h$ and $H$ when $d \le 4$; more precisely, we can then take $H = O(n^{-1/5})$. When $d \ge 5$, we can no longer use $H$ at the rate $n^{-1/5}$, and we need to use a higher-order kernel $\bar{K}$ to reduce the bias that is associated with the use of $H$.
Estimation of the regression surface Define
$$\hat{m}(x) = \sum_{j=1}^{d} \hat{m}_j(x_j) - (d - 1)\bar{Y}, \tag{7.43}$$
where $\bar{Y} = n^{-1} \sum_{i=1}^{n} Y_i$. The following theorem gives the asymptotic distribution of $\hat{m}(x)$.

Theorem 7.3 Under the conditions of Theorem 7.2,
$$n^{2/5} \big[\hat{m}(x) - m(x)\big] \to_d N\big(B(x), V(x)\big), \tag{7.44}$$
where $B(x) = \sum_{j=1}^{d} B_j(x_j)$ and $V(x) = \sum_{j=1}^{d} V_j(x_j)$.
Proof. See Chen et al. (1995).

Theorem 7.3 says that the covariance between $\hat{m}_j(x_j)$ and $\hat{m}_l(x_l)$ for $j \ne l$ is asymptotically negligible, i.e., it is of smaller order than the variances of the component function estimators.
7.2.2 Marginal Integration Estimation of Additive Models with Known Links
For many situations, especially binary and survival time data, the model (7.38) may not be appropriate. In the parametric case, a more appropriate modelling framework is provided by the generalized linear models of McCullagh and Nelder (1989). Hastie and Tibshirani (1991) extend these ideas to nonparametric modelling. In the nonparametric case, the model can be fully or partially specified. In the fully specified case, the conditional distribution of $Y$ given $X$ is assumed to belong to an exponential family with known link function $G$ and mean function $m$ such that
$$G(m(x)) = c_0 + \sum_{j=1}^{d} m_j(x_j), \tag{7.45}$$
where $E[m_j(X_j)] = 0$, $j = 1, \ldots, d$. This model is usually called a generalized additive model in the literature. It implies, for example, that the variance is functionally related to the mean. In the partially specified case, we keep the form (7.45) but do not restrict ourselves to the exponential family; in this latter case, the variance function is unrestricted.
Example 7.4 Clearly, when $G$ is the identity function, we have the additive regression model examined above. Other examples include the logit and probit link functions for binary data, the logarithm transform for Poisson count data (McCullagh and Nelder 1989, p. 30), and the Box-Cox transformation (e.g., $y \to (y^\lambda - 1)/\lambda$). It also includes cases where the regression function is multiplicative.

The backfitting procedure in conjunction with Fisher scoring is widely used to estimate (7.45) (Hastie and Tibshirani 1991, p. 141). It exploits the likelihood structure. Nevertheless, it is even less tractable from a statistical perspective when $G$ is not the identity, because the estimate is then not linear in $Y$.
Estimation of the Additive Components Linton and Härdle (1996) propose a marginal integration-based method for estimating the components in (7.45). The main advantage of their method is that its asymptotic properties can be derived. They also suggest how to take into account the additional information provided by the exponential family structure.
Notice that under the additive structure (7.45),
$$\eta_j(x_j) \equiv \int G\{m(x_j, x_{\bar{j}})\}\, f_{\bar{j}}(x_{\bar{j}})\, dx_{\bar{j}} = c_0 + m_j(x_j). \tag{7.46}$$
The general strategy is to replace both $m(x_j, x_{\bar{j}})$ and $f_{\bar{j}}(x_{\bar{j}})$ in (7.46) by their estimates. We estimate $m(x_j, X_{l\bar{j}})$ by
$$\tilde{m}(x_j, X_{l\bar{j}}) = \sum_{i=1}^{n} W_{li}(x_j)\, Y_i, \tag{7.47}$$
where
$$W_{li}(x_j) = \frac{K_h(X_{ij} - x_j)\, \bar{K}_H(X_{i\bar{j}} - X_{l\bar{j}})}{\sum_{k=1}^{n} K_h(X_{kj} - x_j)\, \bar{K}_H(X_{k\bar{j}} - X_{l\bar{j}})}. \tag{7.48}$$
Then $\eta_j(x_j)$ can be estimated by its sample analogue:
$$\hat{\eta}_j(x_j) = n^{-1} \sum_{l=1}^{n} G\{\tilde{m}(x_j, X_{l\bar{j}})\}. \tag{7.49}$$
When $G$ is the identity function, $\hat{\eta}_j(x_j)$ is linear in $Y$, i.e., $\hat{\eta}_j(x_j) = \sum_{i=1}^{n} \bar{W}_i(x_j) Y_i$ (cf. equation (7.41)), where
$$\bar{W}_i(x_j) = n^{-1} \sum_{l=1}^{n} W_{li}(x_j).$$
In general, $\hat{\eta}_j(x_j)$ is a nonlinear function of $Y$.
The above procedure is carried out for each $j = 1, \ldots, d$, so we can obtain estimates of each $\eta_j(\cdot)$ evaluated at each sample point. Let
$$\hat{c}_0 = (nd)^{-1} \sum_{j=1}^{d} \sum_{i=1}^{n} \hat{\eta}_j(X_{ij}), \qquad \hat{m}_j(x_j) = \hat{\eta}_j(x_j) - \hat{c}_0.$$
We re-estimate $m(x)$ by
$$\hat{m}(x) = G^{-1}\Big(\hat{c}_0 + \sum_{j=1}^{d} \hat{m}_j(x_j)\Big),$$
where $G^{-1}$ is the inverse function of $G$. Let $\hat\varepsilon_i = Y_i - \hat{m}(X_i)$ be the additive regression residuals, which estimate the errors $\varepsilon_i = Y_i - m(X_i)$. These residuals can be used to test the additive structure, i.e., to look for possible interactions ignored in the simple additive structure. When (7.45) is true, $\hat\varepsilon_i$ should be approximately uncorrelated with any function of $X_i$.
We modify Assumption A6 as follows.

A6*. As $n \to \infty$, $H \to 0$, $h = c_h n^{-1/5}$, $n^{2/5} H^q \to 0$, and $n^{2/5} H^{d-1} \to \infty$.

Theorem 7.4 Suppose that Assumptions A1-A6 and A6* hold, the order $q$ of $\bar{K}$ satisfies $q > d - 1$, and $G$ is twice continuously differentiable. Then
$$n^{2/5} \big[\hat{\eta}_j(x_j) - m_j(x_j) - c_0\big] \to_d N\big(B_j(x_j), V_j(x_j)\big), \tag{7.50}$$
where
$$B_j(x_j) = c_h^2 \mu_{21} \left\{ \frac{1}{2} m_j''(x_j) \int G'(m(x))\, f_{\bar{j}}(x_{\bar{j}})\, dx_{\bar{j}} + m_j'(x_j) \int G'(m(x))\, \frac{\partial \ln f(x)}{\partial x_j}\, f_{\bar{j}}(x_{\bar{j}})\, dx_{\bar{j}} \right\},$$
$$V_j(x_j) = c_h^{-1} \nu_{02} \int \big[G'(m(x))\big]^2\, \frac{\sigma^2(x)\, f_{\bar{j}}^2(x_{\bar{j}})}{f(x)}\, dx_{\bar{j}}.$$
Proof. See Linton and Härdle (1996).
As Linton and Härdle (1996) remark, we can use the local linear smoother as a pilot in place of the NW estimator. In this case, the asymptotic variance of $\hat{\eta}_j(x_j)$ will be the same, but the bias takes the simpler form
$$B_j(x_j) = c_h^2 \mu_{21} \left\{ \frac{1}{2} m_j''(x_j) \int G'(m(x))\, f_{\bar{j}}(x_{\bar{j}})\, dx_{\bar{j}} \right\}.$$
To construct the asymptotic confidence interval for $\hat{\eta}_j(x_j)$, we need to estimate $B_j(x_j)$ and $V_j(x_j)$. It is easy to show that $V_j(x_j)$ can be consistently estimated by
$$\hat{V}_j(x_j) = \sum_{i=1}^{n} \hat{U}_i^2\, \hat\varepsilon_i^2, \quad \text{where} \quad \hat{U}_i = \frac{1}{n} \sum_{l=1}^{n} G'\{\tilde{m}(x_j, X_{l\bar{j}})\}\, W_{li}(x_j).$$
The formula for the estimate of $B_j(x_j)$ is quite complicated, and we thus omit it. Nevertheless, in the case of undersmoothing, i.e., $h = o(n^{-1/5})$, it suffices to estimate $V_j(x_j)$. Alternatively, we can use a bootstrap method (e.g., the wild bootstrap) to approximate the desired asymptotic confidence interval.
Estimation of the Regression Surface After we obtain estimates of the additive components, we can obtain estimates of the regression function. Note that we do not limit ourselves to the exponential family structure discussed earlier.

Theorem 7.5 Suppose the conditions of Theorem 7.4 hold and $F \equiv G^{-1}$ is twice continuously differentiable. Then
$$n^{2/5} \big[\hat{m}(x) - m(x)\big] \to_d N\big(B(x), V(x)\big), \tag{7.51}$$
where $B(x) = F'\big(c_0 + \sum_{j=1}^{d} m_j(x_j)\big) \sum_{j=1}^{d} B_j(x_j)$ and $V(x) = \big[F'\big(c_0 + \sum_{j=1}^{d} m_j(x_j)\big)\big]^2 \sum_{j=1}^{d} V_j(x_j)$.
Proof. The result follows from Theorem 7.4, the delta method, and the fact that $\hat{\eta}_j(x_j)$ and $\hat{\eta}_l(x_l)$ are asymptotically independent for $j \ne l$.

Theorem 7.5 says that the rate of convergence of $\hat{m}(x)$ is free from the "curse of dimensionality," as desired.
Remark. Yang, Sperlich, and Härdle (2003, JSPI) study the estimation of derivatives in generalized additive models via the kernel method. In addition, they study hypothesis testing on the derivatives.
7.2.3 Efficient Estimation of Generalized Additive Models
Recently, Linton (2000, ET) has defined new procedures for estimating generalized additive nonparametric regression models that are more efficient than that of Linton and Härdle (1996). He considers criterion functions based on the linear exponential family. When the linear exponential family specification is correct, the new estimator achieves certain oracle bounds. For brevity, we refer the reader to Linton (2000).
7.3 Additive Partially Linear Models
In this section we introduce two methods for estimating additive partially linear models. One is the series method of Li (2000, International Economic Review) and the other is the kernel method of Fan and Li (2003, Statistica Sinica).
7.3.1 Series Estimation
A typical additive partially linear model is of the form
$$Y_i = X_i'\beta_0 + m_1(Z_{i1}) + \cdots + m_d(Z_{id}) + \varepsilon_i, \tag{7.52}$$
where $E(\varepsilon_i \mid X_i, Z_{i1}, \ldots, Z_{id}) = 0$, $X_i$ is a $p \times 1$ vector of random variables that does not contain a constant term, $\beta_0$ is a $p \times 1$ vector of unknown parameters, $Z_{ij}$ is of dimension $q_j$ ($q_j \ge 1$, $j = 1, \ldots, d$), and $m_j(\cdot)$, $j = 1, \ldots, d$, are unknown smooth functions. Denote by $Z_i$ the non-overlapping variables obtained from $(Z_{i1}, \ldots, Z_{id})$; $Z_i$ is of dimension $q$ with $\max_j q_j \le q \le \sum_{j=1}^{d} q_j$. In practice, the most widely used case is where $q_j = 1$ for all $j$, so that each $Z_{ij}$ is a scalar random variable and $q = d$.

Clearly, the individual functions $m_j(\cdot)$ ($j = 1, \ldots, d$) are not identified without some identification conditions. In the literature on kernel estimation, it is convenient to impose $E[m_j(Z_{ij})] = 0$ for all $j = 2, \ldots, d$. Nevertheless, such conditions are not easily imposed in series estimation. Instead, here it is more convenient to impose
$$m_j(0) = 0, \quad j = 2, \ldots, d. \tag{7.53}$$
To construct a series estimator for the unknown parameters in the model, we use Li's (2000) definition of the class of additive functions.

Definition. A function $g(z)$ is said to belong to an additive class of functions $\mathcal{G}$ ($g \in \mathcal{G}$) if
(i) $g(z) = \sum_{j=1}^{d} g_j(z_j)$, where each $g_j$ is continuous on its support $\mathcal{Z}_j$, which is a compact subset of $\mathbb{R}^{q_j}$ ($j = 1, \ldots, d$);
(ii) $\sum_{j=1}^{d} E\big[g_j(Z_{ij})^2\big] < \infty$;
(iii) $g_j(0) = 0$ for $j = 2, \ldots, d$.
When $g(z)$ is a vector-valued function, we say that $g \in \mathcal{G}$ if each of its components belongs to $\mathcal{G}$.

In vector notation, we can write (7.52) as
$$Y = X\beta_0 + m_1 + \cdots + m_d + \varepsilon = X\beta_0 + m + \varepsilon, \tag{7.54}$$
where $Y = (Y_1, \ldots, Y_n)'$, $X = (X_1, \ldots, X_n)'$, $m_j = (m_j(Z_{1j}), \ldots, m_j(Z_{nj}))'$, $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)'$, and $m = m_1 + \cdots + m_d$.

For $j = 1, \ldots, d$, we shall use a linear combination of $k_j$ functions, $p^{k_j}(z_j) = [p_1(z_j), \ldots, p_{k_j}(z_j)]'$, to approximate $m_j(z_j)$. Let $p^K(z) = [p^{k_1}(z_1)', \ldots, p^{k_d}(z_d)']'$, where $K = \sum_{j=1}^{d} k_j$. A linear combination of $p^K(z)$ forms an approximating function for $m(z) \equiv m(z_1, \ldots, z_d) = \sum_{j=1}^{d} m_j(z_j)$. The approximating function has the following properties: (1) it belongs to $\mathcal{G}$, and (2) as $k_j$ grows for all $j = 1, \ldots, d$, there is a linear combination of $p^K(z)$ that can approximate any $g \in \mathcal{G}$ arbitrarily well in the mean squared error sense.
Define
$$p_i = \big[p^{k_1}(Z_{i1})', \ldots, p^{k_d}(Z_{id})'\big]' \ (i = 1, \ldots, n), \qquad P = (p_1, \ldots, p_n)'. \tag{7.55}$$
Note that $p_i$ is of dimension $K \times 1$ and $P$ is of dimension $n \times K$. For a matrix $A$ with $n$ rows, let $\tilde{A} = P(P'P)^- P' A$, where $(P'P)^-$ is the symmetric generalized inverse of $P'P$. If we premultiply both sides of (7.54) by $P(P'P)^- P'$, then we have
$$\tilde{Y} = \tilde{X}\beta_0 + \tilde{m} + \tilde\varepsilon. \tag{7.56}$$
Subtracting (7.56) from (7.54) gives
$$Y - \tilde{Y} = (X - \tilde{X})\beta_0 + (m - \tilde{m}) + (\varepsilon - \tilde\varepsilon). \tag{7.57}$$
So we can estimate $\beta_0$ by regressing $Y - \tilde{Y}$ on $X - \tilde{X}$ to obtain
$$\hat\beta = \Big[(X - \tilde{X})'(X - \tilde{X})\Big]^- (X - \tilde{X})'(Y - \tilde{Y}). \tag{7.58}$$
After obtaining $\hat\beta$, we can estimate $m(z) = \sum_{j=1}^{d} m_j(z_j)$ by $\hat{m}(z) = p^K(z)'\hat\theta$, where
$$\hat\theta = (P'P)^- P'(Y - X\hat\beta). \tag{7.59}$$
$m_j(\cdot)$ can also be estimated easily based upon $\hat\theta$. For example, $\hat{m}_1(z_1) = p^{k_1}(z_1)'\hat\theta_1$, where $\hat\theta_1$ is the collection of the first $k_1$ elements of $\hat\theta$.

Define $h(z) \equiv E(X_i \mid Z_i = z)$ and $\sigma^2(x, z) \equiv \operatorname{Var}(\varepsilon_i \mid X_i = x, Z_i = z)$. We use $\theta(z)$ to denote the projection of $h(z)$ onto $\mathcal{G}$. That is, by the definition of $\theta(\cdot)$ we know $\theta(z)$ is an additive function, i.e., $\theta(z) = \sum_{j=1}^{d} \theta_j(z_j) \in \mathcal{G}$, and $\theta(\cdot)$ is the solution of the following minimization problem:
$$E\big\{[h(Z_i) - \theta(Z_i)][h(Z_i) - \theta(Z_i)]'\big\} = \inf_{g = \sum_{j=1}^{d} g_j \in \mathcal{G}} E\Big\{\Big[h(Z_i) - \sum_{j=1}^{d} g_j(Z_{ij})\Big]\Big[h(Z_i) - \sum_{j=1}^{d} g_j(Z_{ij})\Big]'\Big\}.$$
Noting that $\theta(\cdot)$ is of dimension $p \times 1$, we can write $\theta(z) = \big(\theta_{(1)}(z), \ldots, \theta_{(p)}(z)\big)'$. To state the main result, we need to make some assumptions.
Assumptions.
A1. (i) $(X_i, Z_i, Y_i)$, $i = 1, \ldots, n$, are IID, and the support of $(X_i, Z_i)$ is a compact subset of $\mathbb{R}^{p+q}$; (ii) both $h(z) \equiv E(X_i \mid Z_i = z)$ and $\sigma^2(x, z) \equiv \operatorname{Var}(\varepsilon_i \mid X_i = x, Z_i = z)$ are bounded functions on the support of $(X_i, Z_i)$.
A2. (i) For every $K$ there is a nonsingular matrix $B$ such that, for $P^K(z) = B p^K(z)$, the smallest eigenvalue of $E\big[P^K(Z_i) P^K(Z_i)'\big]$ is bounded away from zero uniformly in $K$; (ii) there is a sequence of constants $\zeta_0(K)$ satisfying $\sup_{z \in \mathcal{Z}} \|P^K(z)\| \le \zeta_0(K)$ and $K = K_n$ such that $\zeta_0(K)^2 K/n \to 0$ as $n \to \infty$, where $\mathcal{Z}$ is the support of $Z_i$.
A3. (i) For $m = \sum_{j=1}^{d} m_j$, there exist some $\alpha > 0$ and $\gamma_j = \gamma_j(k_j) = (\gamma_{j1}, \ldots, \gamma_{jk_j})'$, $j = 1, \ldots, d$, such that $\sup_{z \in \mathcal{Z}} \big|m(z) - p^K(z)'\gamma\big| = O\big(\sum_{j=1}^{d} k_j^{-\alpha}\big)$ as $\min\{k_1, \ldots, k_d\} \to \infty$, where $\gamma = (\gamma_1', \ldots, \gamma_d')'$; (ii) $\sqrt{n} \sum_{j=1}^{d} k_j^{-\alpha} \to 0$ as $n \to \infty$.
A4. $\Phi = E\big\{[X_i - \theta(Z_i)][X_i - \theta(Z_i)]'\big\}$ is positive definite.
Assumption A1 is quite standard in the literature on estimating additive models. Assumption A2 ensures that $P'P$ is asymptotically nonsingular. While Assumptions A2-A3 are not primitive conditions, it is known that many series functions satisfy them. Newey (1997) gives primitive conditions for power series and splines such that Assumptions A2-A3 hold. For power series, $\zeta_0(K) = O(K)$; for B-splines, $\zeta_0(K) = O(\sqrt{K})$. For $q_j = 1$, if the function being approximated is continuously differentiable of order $\delta$, then $\alpha = \delta$.
The following theorem states the asymptotic property of $\hat\beta$.

Theorem 7.6 Under Assumptions A1-A4, we have
$$\sqrt{n}\,\big(\hat\beta - \beta_0\big) \to_d N\big(0, \Phi^{-1} \Psi \Phi^{-1}\big),$$
where $\Psi = E\big[\sigma^2(X_i, Z_i)\, \eta_i \eta_i'\big]$ and $\eta_i = X_i - \theta(Z_i)$.

For statistical inference, we need consistent estimates of $\Phi$ and $\Psi$. Li (2000) shows that we can estimate them consistently by
$$\hat\Phi = n^{-1} \sum_{i=1}^{n} (X_i - \tilde{X}_i)(X_i - \tilde{X}_i)' \quad \text{and} \quad \hat\Psi = n^{-1} \sum_{i=1}^{n} \hat\varepsilon_i^2\, (X_i - \tilde{X}_i)(X_i - \tilde{X}_i)',$$
where $\tilde{X}_i$ is the $i$th row of $\tilde{X} = P(P'P)^- P' X$ and $\hat\varepsilon_i = Y_i - X_i'\hat\beta - \hat{m}(Z_i)$.

Li (2000) also gives the convergence rate of $\hat{m}(z) = p^K(z)'\hat\theta$ to $m(z) = \sum_{j=1}^{d} m_j(z_j)$.
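The two-step series procedure in (7.55)-(7.58) can be sketched numerically as follows. The piecewise-linear basis, knot placement, and simulated design are illustrative assumptions of this sketch, not part of Li's (2000) treatment.

```python
import numpy as np

def spline_terms(z, knots):
    """Piecewise-linear series terms for one component, normalized so that
    each basis function vanishes at z = 0 (cf. the condition m_j(0) = 0)."""
    cols = [z] + [np.maximum(z - t, 0.0) for t in knots]
    B = np.column_stack(cols)
    B0 = np.concatenate([[0.0], np.maximum(-knots, 0.0)])  # basis values at 0
    return B - B0

rng = np.random.default_rng(4)
n, p = 800, 2
X = rng.standard_normal((n, p))
Z = rng.uniform(-1, 1, (n, 2))
beta0 = np.array([1.0, -0.5])
y = X @ beta0 + np.sin(np.pi * Z[:, 0]) + Z[:, 1] ** 2 + 0.2 * rng.standard_normal(n)

# P: series terms for both additive components plus a constant (eq. (7.55))
knots = np.linspace(-0.8, 0.8, 5)
P = np.column_stack([np.ones(n)] + [spline_terms(Z[:, j], knots) for j in (0, 1)])
M = P @ np.linalg.pinv(P.T @ P) @ P.T          # projection onto the series space
Xt, Yt = M @ X, M @ y                          # X-tilde and Y-tilde (eq. (7.56))
bhat = np.linalg.lstsq(X - Xt, y - Yt, rcond=None)[0]   # eq. (7.58)
```

Regressing the series-projection residuals of $Y$ on those of $X$ removes the additive nuisance part, so `bhat` estimates $\beta_0$ at the parametric rate.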
Theorem 7.7 Under Assumptions A1-A4, we have
(i) $\sup_{z \in \mathcal{Z}} |\hat{m}(z) - m(z)| = O_p\big(\zeta_0(K)\big(\sqrt{K/n} + \sum_{j=1}^{d} k_j^{-\alpha}\big)\big)$;
(ii) $n^{-1} \sum_{i=1}^{n} \big[\hat{m}(Z_i) - m(Z_i)\big]^2 = O_p\big(K/n + \sum_{j=1}^{d} k_j^{-2\alpha}\big)$;
(iii) $\int \big[\hat{m}(z) - m(z)\big]^2\, dF(z) = O_p\big(K/n + \sum_{j=1}^{d} k_j^{-2\alpha}\big)$, where $F(\cdot)$ is the CDF of $Z_i$.

The properties of $\hat{m}_j(z_j)$ are similar and thus omitted for brevity.

7.3.2 Kernel Method
We now study the kernel method for estimating additive partially linear models. Like Fan and Li (2003), we consider the additive partially linear model
$$Y_i = \mu + X_i'\beta + m_1(Z_{i1}) + \cdots + m_d(Z_{id}) + \varepsilon_i, \tag{7.60}$$
where $E(\varepsilon_i \mid X_i, Z_{i1}, \ldots, Z_{id}) = 0$, $X_i$ is a $p \times 1$ vector of random variables that does not contain a constant term, $\mu$ is a scalar parameter, $\beta = (\beta_1, \ldots, \beta_p)'$ is a $p \times 1$ vector of unknown parameters, the $Z_{ij}$'s are univariate continuous random variables, and $m_j(\cdot)$, $j = 1, \ldots, d$, are unknown smooth functions.
Let
$$Z_{i\bar{j}} = (Z_{i1}, \ldots, Z_{i,j-1}, Z_{i,j+1}, \ldots, Z_{id}),$$
where $Z_{ij}$ is removed from $Z_i = (Z_{i1}, \ldots, Z_{id})$. Define
$$m_{\bar{j}}(Z_{i\bar{j}}) = m_1(Z_{i1}) + \cdots + m_{j-1}(Z_{i,j-1}) + m_{j+1}(Z_{i,j+1}) + \cdots + m_d(Z_{id}).$$
Then we can rewrite (7.60) as
$$Y_i = \mu + X_i'\beta + m_j(Z_{ij}) + m_{\bar{j}}(Z_{i\bar{j}}) + \varepsilon_i. \tag{7.61}$$
Fan, Härdle, and Mammen (1998) consider the case where $X_i$ is a $p \times 1$ vector of discrete variables and suggest two ways of estimating model (7.61). In neither method did they make full use of the information that $X_i$ enters the regression function linearly. Motivated by this observation, Fan and Li (2003) consider a two-stage estimation procedure that applies to the case where $X_i$ contains both discrete and continuous elements and makes full use of the information that $X_i$ enters the regression function linearly.
For $j = 1, \ldots, d$, define
$$\mu_Y(z_j, z_{\bar{j}}) = E\big[Y_i \mid Z_{ij} = z_j, Z_{i\bar{j}} = z_{\bar{j}}\big], \qquad m_Y(z_j) = E\big[\mu_Y(z_j, Z_{i\bar{j}})\big],$$
$$\mu_X(z_j, z_{\bar{j}}) = E\big[X_i \mid Z_{ij} = z_j, Z_{i\bar{j}} = z_{\bar{j}}\big], \qquad m_X(z_j) = E\big[\mu_X(z_j, Z_{i\bar{j}})\big].$$
Denote $Y_{ij} = m_Y(Z_{ij})$ and $X_{ij} = m_X(Z_{ij})$. Then taking conditional expectations on both sides of (7.61) gives
$$\mu_Y(z_j, z_{\bar{j}}) = \mu + \mu_X(z_j, z_{\bar{j}})'\beta + m_j(z_j) + m_{\bar{j}}(z_{\bar{j}}). \tag{7.62}$$
Integrating both sides of (7.62) over $z_{\bar{j}}$ (with respect to the distribution of $Z_{i\bar{j}}$) leads to
$$m_Y(z_j) = \mu + m_X(z_j)'\beta + m_j(z_j), \tag{7.63}$$
where we have used the identification condition that $E\big[m_{\bar{j}}(Z_{i\bar{j}})\big] = 0$. Replacing $z_j$ in (7.63) by $Z_{ij}$ and then summing both sides of (7.63) over $j$ gives
$$\sum_{j=1}^{d} Y_{ij} = d\mu + \sum_{j=1}^{d} X_{ij}'\beta + \sum_{j=1}^{d} m_j(Z_{ij}). \tag{7.64}$$
Subtracting (7.64) from (7.60), we can eliminate $\sum_{j=1}^{d} m_j(Z_{ij})$ and get
$$Y_i - \sum_{j=1}^{d} Y_{ij} = (1 - d)\mu + \Big(X_i - \sum_{j=1}^{d} X_{ij}\Big)'\beta + \varepsilon_i. \tag{7.65}$$
Let $\mathcal{Y}_i = Y_i - \sum_{j=1}^{d} Y_{ij}$ and $\mathcal{X}_i = \big(1, (X_i - \sum_{j=1}^{d} X_{ij})'\big)$. Then in vector notation we can write (7.65) as
$$\mathcal{Y} = \mathcal{X}\xi + \varepsilon, \tag{7.66}$$
where $\mathcal{Y}$ and $\mathcal{X}$ are $n \times 1$ and $n \times (p+1)$ matrices with $i$th rows given by $\mathcal{Y}_i$ and $\mathcal{X}_i$, respectively, $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)'$, and $\xi = (\xi_0, \beta')'$ with $\xi_0 = (1 - d)\mu$.
We can apply OLS regression to (7.66) to obtain
$$\bar\xi = \begin{pmatrix} \bar\xi_0 \\ \bar\beta \end{pmatrix} = (\mathcal{X}'\mathcal{X})^{-1}\mathcal{X}'\mathcal{Y} = \xi + (\mathcal{X}'\mathcal{X})^{-1}\mathcal{X}'\varepsilon. \tag{7.67}$$
Under standard conditions, we can show that $\bar\xi$ converges to $\xi$ at the parametric $\sqrt{n}$-rate.
Nevertheless, $\bar\xi$ is an infeasible estimator because it depends on the unknown quantities $\sum_{j=1}^{d} Y_{ij}$ and $\sum_{j=1}^{d} X_{ij}$. To obtain a feasible estimator of $\xi$, we need to replace these unknown quantities by consistent estimates. A consistent estimator of $Y_{ij} = m_Y(Z_{ij})$ is given by
$$\hat{Y}_{ij} = \frac{1}{n} \sum_{l=1}^{n} \left\{ \frac{\sum_{k=1}^{n} K_h(Z_{kj} - Z_{ij})\, \bar{K}_H(Z_{k\bar{j}} - Z_{l\bar{j}})\, Y_k}{\sum_{k=1}^{n} K_h(Z_{kj} - Z_{ij})\, \bar{K}_H(Z_{k\bar{j}} - Z_{l\bar{j}})} \right\} = \frac{1}{n} \sum_{l=1}^{n} \sum_{k=1}^{n} \frac{K_h(Z_{kj} - Z_{ij})\, \bar{K}_H(Z_{k\bar{j}} - Z_{l\bar{j}})\, Y_k}{\sum_{s=1}^{n} K_h(Z_{sj} - Z_{ij})\, \bar{K}_H(Z_{s\bar{j}} - Z_{l\bar{j}})} \equiv \sum_{k=1}^{n} w_{ik,j}\, Y_k, \tag{7.68}$$
where the definition of $w_{ik,j}$ is clear, $K_h(\cdot) = h^{-1} K(\cdot/h)$, $\bar{K}_H(\cdot) = H^{-(d-1)} \bar{K}(\cdot/H)$, $K$ and $\bar{K}$ are kernel functions, and $h$ and $H$ are smoothing parameters. Fan and Li (2003) use the leave-one-out method to obtain $\hat{Y}_{ij}$, which can only simplify the proofs but does not change the asymptotic results. Note that $\sum_{k=1}^{n} w_{ik,j} = 1$, so $\hat{Y}_{ij}$ is a weighted average of the $Y_k$'s. Similarly, a consistent estimator of $X_{ij} = m_X(Z_{ij})$ is given by
$$\hat{X}_{ij} = \sum_{k=1}^{n} w_{ik,j}\, X_k. \tag{7.69}$$
Let $\hat{\mathcal{Y}}_i = Y_i - \sum_{j=1}^{d} \hat{Y}_{ij}$ and $\hat{\mathcal{X}}_i = \big(1, (X_i - \sum_{j=1}^{d} \hat{X}_{ij})'\big)$. We can obtain a feasible estimator of $\xi$ by replacing $\mathcal{Y}$ and $\mathcal{X}$ in (7.67) by their estimates. Nevertheless, near the boundary the density $f(z) = f(z_1, \ldots, z_d)$ of $Z_i$ cannot be estimated well, so Fan and Li (2003) trim observations near the boundary. Assume that $Z_{ij} \in [a_j, b_j]$, where $a_j < b_j$ are both finite constants, $j = 1, \ldots, d$. Define the trimming set $\mathcal{S} = \prod_{j=1}^{d} [a_j + \delta_n, b_j - \delta_n]$, where $\delta_n = C h_0^{\rho}$ for some $C > 0$, $0 < \rho < 1$, and $h_0 = \max\{h, H\}$. Let $\mathbf{1}_i = \mathbf{1}(Z_i \in \mathcal{S})$. Fan and Li (2003) estimate $\xi$ by
$$\hat\xi = \begin{pmatrix} \hat\xi_0 \\ \hat\beta \end{pmatrix} = (\hat{\mathcal{X}}'\hat{\mathcal{X}})^{-1}\hat{\mathcal{X}}'\hat{\mathcal{Y}} = \Big( \sum_{i=1}^{n} \hat{\mathcal{X}}_i' \hat{\mathcal{X}}_i \mathbf{1}_i \Big)^{-1} \sum_{i=1}^{n} \hat{\mathcal{X}}_i' \hat{\mathcal{Y}}_i \mathbf{1}_i, \tag{7.70}$$
where $\hat{\mathcal{Y}}$ and $\hat{\mathcal{X}}$ are $n \times 1$ and $n \times (p+1)$ matrices with $i$th rows given by $\hat{\mathcal{Y}}_i \mathbf{1}_i$ and $\hat{\mathcal{X}}_i \mathbf{1}_i$, respectively.
To state the main result, we make the following assumptions.
Assumptions.
A1. (Y_i, X_i, Z_i), i = 1, ..., n, are IID; (X_i, Z_i) has a finite support, with the support of Z_i being a product set Π_{α=1}^d [c_α, d_α]; the density function of Z_i is bounded from below by a positive constant on its support; E(ε^4) < ∞.
A2. g_α(·), θ_{Y,α}(·), θ_{X,α}(·), and p(z_1, ..., z_d) all belong to G^ν_4, where ν ≥ 2 is an integer. Let X = (X^c, X^d), where X^c and X^d denote the continuous and discrete components of X, respectively; then for all values x^d of X^d, σ²(x^c, x^d, z) = E(ε² | X^c = x^c, X^d = x^d, Z = z) ∈ G^1_4. The class G^ν_4 is defined in Section 4.1.1.
A3. Φ = E[(X − Σ_{α=1}^d θ_{X,α}(Z_α))(X − Σ_{α=1}^d θ_{X,α}(Z_α))'] is positive definite.
A4. The kernel functions K and M are bounded, symmetric, and both are of order ν.
A5. As n → ∞, nh² → ∞, n^{3/2} h b^{d−1} → ∞, and n(h^{2ν} + b^{2ν}) → 0.
Note that when h = b, Assumption A5 implies that n^{3/2}h^d → ∞ and nh^{2ν} → 0, which allows the use of a second order kernel (ν = 2) if d ≤ 5. It also implies that the data need to be undersmoothed. The following theorem shows the asymptotic property of β̂.
Theorem 7.8 Under Assumptions A1-A5, we have

√n (β̂ − β) → N(0, Φ^{−1} Ψ Φ^{−1}),

where Ψ = E[ε² η η'] with η = v + w(1 − Σ_{α=1}^d φ_α), v = X − E(X|Z), w = E(X|Z) − Σ_{α=1}^d θ_{X,α}(Z_α), and φ_α = p_α(Z_α) p_{−α}(Z_{−α}) / p(Z). Here p(·), p_α(·), and p_{−α}(·) are the density functions of Z, Z_α, and Z_{−α}, respectively.
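To fix ideas, here is a minimal simulation sketch of the two-stage procedure behind Theorem 7.8 (not Fan and Li's actual implementation): the marginal-integration estimates θ̂_{Y,iα} and θ̂_{X,iα} are formed as in (7.68)-(7.69) with Gaussian kernels, observations near the boundary are trimmed crudely, and β is recovered by trimmed OLS as in (7.70). The DGP, bandwidths, and trimming choices below are all ad hoc illustrations.

```python
import numpy as np

def kernel_weights(u, h):
    """Gaussian kernel matrix K_h(u_i - u_j) over all pairs of a 1-d sample."""
    return np.exp(-0.5 * ((u[:, None] - u[None, :]) / h) ** 2) / (h * np.sqrt(2 * np.pi))

def marginal_integration(V, Z, h, b):
    """Sample analogue of (7.68): for each direction alpha, smooth V in z_alpha
    and average the fit over the empirical distribution of the other directions."""
    n, d = Z.shape
    theta = np.zeros((n, d))
    for a in range(d):
        Kh = kernel_weights(Z[:, a], h)              # K_h(Z_{j,a} - Z_{i,a})
        Mb = np.ones((n, n))
        for c in range(d):
            if c != a:
                Mb *= kernel_weights(Z[:, c], b)     # product kernel M_b over Z_{-a}
        num = Kh[:, None, :] * Mb[None, :, :]        # numerator weight, indexed [i, l, j]
        w = num / num.sum(axis=2, keepdims=True)     # normalize over j; rows sum to one
        theta[:, a] = w.mean(axis=1) @ V             # average over the l index
    return theta

# Toy DGP: Y = beta0 + 2*X + g1(Z1) + g2(Z2) + eps with E[g_alpha(Z_alpha)] = 0.
rng = np.random.default_rng(0)
n, d = 150, 2
Z = rng.uniform(-1, 1, size=(n, d))
X = Z.sum(axis=1) + rng.normal(size=n)               # X correlated with Z
Y = 1.0 + 2.0 * X + np.sin(np.pi * Z[:, 0]) + (Z[:, 1] ** 2 - 1 / 3) + 0.3 * rng.normal(size=n)

h = b = 0.25
thY = marginal_integration(Y, Z, h, b).sum(axis=1)   # sum over alpha of theta-hat_{Y,i alpha}
thX = marginal_integration(X, Z, h, b).sum(axis=1)
keep = np.all(np.abs(Z) < 0.9, axis=1)               # crude boundary trimming
D = np.column_stack([np.ones(keep.sum()), (X - thX)[keep]])
gamma_hat = np.linalg.lstsq(D, (Y - thY)[keep], rcond=None)[0]
beta_hat = gamma_hat[1]                              # slope on X; intercept estimates (1-d)*beta0
```

The slope recovers β = 2 well even with crude tuning, because the same weights w_{iα,j} are applied to Y and X, so the linear part cancels exactly in Y* − X*'β.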
Given the √n-consistent estimator β̂, we can rewrite (7.60) as

Y_i − X_i'β̂ = β_0 + Σ_{α=1}^d g_α(Z_{iα}) + ε_i + X_i'(β − β̂) (7.71)
            = β_0 + Σ_{α=1}^d g_α(Z_{iα}) + error term.

The intercept term β_0 can be √n-consistently estimated by

β̂_0 = Ȳ − X̄'β̂,

where Ȳ = n^{−1} Σ_{i=1}^n Y_i and X̄ = n^{−1} Σ_{i=1}^n X_i. Note that (7.71) is essentially an additive regression model with Y_i − X_i'β̂ as the new dependent variable and ε_i + X_i'(β − β̂) as the new error term. Since β̂ − β = O_P(n^{−1/2}), a rate faster than any nonparametric convergence rate, the asymptotic distribution of any nonparametric estimator of g_α(·) based on (7.71) remains the same as if one replaced β̂ by β.
To make statistical inference on β, we need to estimate Φ and Ψ consistently. Let p̂(·), p̂_α(·), and p̂_{−α}(·) denote the kernel estimators of p(·), p_α(·), and p_{−α}(·), respectively. That is,

p̂_α(z_α) = n^{−1} Σ_{j=1}^n K_h(Z_{jα} − z_α),  p̂_{−α}(z_{−α}) = n^{−1} Σ_{j=1}^n M_b(Z_{j,−α} − z_{−α}),
p̂(z) = n^{−1} Σ_{j=1}^n K_h(Z_{jα} − z_α) M_b(Z_{j,−α} − z_{−α}).
Let Ê(X_i|Z_i) denote a kernel estimator of E(X|Z) at Z_i. Define

φ̂_{iα} = p̂_α(Z_{iα}) p̂_{−α}(Z_{i,−α}) / p̂(Z_i),  v̂_i = X_i − Ê(X_i|Z_i),  ŵ_i = Ê(X_i|Z_i) − Σ_{α=1}^d θ̂_{X,iα},
η̂_i = v̂_i + ŵ_i (1 − Σ_{α=1}^d φ̂_{iα}),
ε̂_i = Y_i − β̂_0 − X_i'β̂ − Σ_{α=1}^d ĝ_α(Z_{iα}).

Then we can estimate Φ and Ψ consistently by

Φ̂ = (1/n) Σ_{i=1}^n (X_i − Σ_{α=1}^d θ̂_{X,iα})(X_i − Σ_{α=1}^d θ̂_{X,iα})'  and  Ψ̂ = (1/n) Σ_{i=1}^n ε̂_i² η̂_i η̂_i'.

7.4 Specification Test for Additive Models
7.4.1 Test for Additive Partially Linear Models via Series Method
Li, Hsiao, and Zinn (2003, Journal of Econometrics) consider consistent specification tests for semiparametric/nonparametric models whose null models all contain some nonparametric components, based on series estimation methods. A leading case is to test for an additive partially linear model. The null hypothesis is

H_0: E(Y | X, Z) = X'β_0 + Σ_{α=1}^d g_α(Z_α) a.s. for some β_0 ∈ B and Σ_{α=1}^d g_α(·) ∈ G, (7.72)

where X is a p×1 vector of regressors, Z is of dimension d ≥ 1, β_0 is a p×1 unknown parameter, Z_α is a subvector of non-overlapping variables of Z, α = 1, ..., d, B is a compact subset of R^p, and G is the class of additive functions defined in the last section. In particular, the identification condition is g_α(0) = 0 for α = 2, ..., d. The alternative hypothesis is

H_1: E(Y | X, Z) ≠ X'β + Σ_{α=1}^d g_α(Z_α) (7.73)

on a set with positive measure for any β ∈ B and any Σ_{α=1}^d g_α(·) ∈ G.
Let ε = Y − X'β_0 − Σ_{α=1}^d g_α(Z_α) and W = (X', Z')'. The null hypothesis H_0 is equivalent to

H_0: E(ε | W) = 0 a.s. (7.74)

Noting that E(ε|W) = 0 a.s. if and only if E(ε ℓ(W)) = 0 for all ℓ(·) ∈ A, the class of bounded σ(W)-measurable functions, Li, Hsiao, and Zinn (2003) follow Bierens and Ploberger (1997), Stute (1997), and Stinchcombe and White (1998) and consider the following unconditional moment test:

E[ε H(W, x)] = 0 for almost all x ∈ Ξ ⊂ R^{p+d}, (7.75)

where H(·, ·) is a proper choice of weight function so that (7.75) is equivalent to (7.74). Stinchcombe and White (1998) show that there exists a wide class of weight functions H(·, ·) that makes (7.75) equivalent to (7.74). Choices of weight functions include the exponential function H(w, x) = exp(w'x), the logistic function H(w, x) = 1/(1 + exp(c − w'x)) with c ≠ 0, the trigonometric function H(w, x) = cos(w'x) + sin(w'x), and the usual indicator function H(w, x) = 1(w ≤ x). See Stinchcombe and White (1998) and Bierens and Ploberger (1997) for more discussion. The advantage of switching from the conditional moment test of (7.74) to the unconditional moment test of (7.75) is that one avoids estimating the alternative model nonparametrically, as in Chen and Fan (1999) and Delgado and González Manteiga (2001).
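For concreteness, the four weight-function choices can be coded directly; the evaluation points and the logistic constant c below are arbitrary illustrations.

```python
import numpy as np

# Weight functions H(w, x) that make (7.75) equivalent to (7.74).
def H_exp(w, x):                   # exponential (Bierens-type) weight
    return np.exp(w @ x)

def H_logistic(w, x, c=1.0):       # logistic weight, requires c != 0
    return 1.0 / (1.0 + np.exp(c - w @ x))

def H_trig(w, x):                  # trigonometric weight
    return np.cos(w @ x) + np.sin(w @ x)

def H_ind(w, x):                   # indicator weight (componentwise w <= x)
    return float(np.all(w <= x))

w = np.array([0.5, -0.2])          # a point in the support of W
x = np.array([1.0, 1.0])           # an index point in Xi
vals = [H_exp(w, x), H_logistic(w, x), H_trig(w, x), H_ind(w, x)]
```

All four are bounded on compact sets, which is what the weak-convergence argument below requires of H.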
If ε_i were observable, one could construct a test based upon the sample analogue of E[ε H(W, x)] = 0:

J_n^0(x) = n^{−1/2} Σ_{i=1}^n ε_i H(W_i, x). (7.76)

J_n^0(·) can be viewed as a random element taking values in the separable space L_2(Ξ, μ) of all real, Borel measurable functions f on Ξ such that ∫_Ξ f(x)² dμ(x) < ∞,⁴ which is endowed with the L_2 norm

‖f‖ = {∫_Ξ f(x)² dμ(x)}^{1/2}.

Chen and White (1997) show that for a sequence of IID L_2(Ξ, μ)-valued elements {v_i(·)}_{i=1}^n, n^{−1/2} Σ_{i=1}^n v_i(·) converges weakly to Z(·) in the topology of (L_2(Ξ, μ), ‖·‖) if and only if ∫_Ξ E[v_i(x)²] dμ(x) < ∞, where Z is a Gaussian element with the same covariance function Ω(x, x') = E[v_i(x) v_i(x')].
Let σ²(w) = E(ε² | W = w). It is easy to check that for v_i(·) = ε_i H(W_i, ·) we have

E[‖v_i(·)‖²] = E{ ∫_Ξ ε_i² H(W_i, x)² dμ(x) } = E{ σ²(W_i) ∫_Ξ H(W_i, x)² dμ(x) } ≤ C² E[σ²(W_i)] ∫_Ξ dμ(x) < ∞,

provided the weight function H(·, ·) is bounded above by C. This implies that

J_n^0(·) converges weakly to J_∞^0(·) in L_2(Ξ, ‖·‖),

where J_∞^0(·) is a Gaussian process centered at zero and with covariance function

Ω(x, x') = E[v_i(x) v_i(x')] = E[σ²(W_i) H(W_i, x) H(W_i, x')]. (7.77)
Since ε_i is unobservable, we can replace it by a consistent estimate, say ε̂_i, and construct a feasible test statistic as

J_n(x) = n^{−1/2} Σ_{i=1}^n ε̂_i H(W_i, x). (7.78)

There are several ways to obtain ε̂_i. One is to apply the series method of Li (2000) to obtain consistent estimates of the parameters in the additive partially linear model first. Another is to apply the kernel method of Fan and Li (2003). In either case, let β̂ be the consistent estimator of β_0 and ĝ_α(·) be the consistent estimator of g_α(·), α = 1, ..., d. Then we estimate ε_i consistently by

ε̂_i = Y_i − X_i'β̂ − Σ_{α=1}^d ĝ_α(Z_{iα}).

⁴A separable space is a topological space that has a countable dense subset. An example is the Euclidean space R^k.
Now one can construct a Cramér-von Mises statistic for testing H_0:

CM_n = ∫ J_n(x)² dF_n(x) = n^{−1} Σ_{i=1}^n J_n(W_i)²,

where F_n(·) is the empirical distribution of W_1, ..., W_n. Alternatively, one can construct the Kolmogorov-Smirnov statistic:

KS_n = sup_{x∈Ξ} |J_n(x)|  or  KS_n = max_{1≤i≤n} |J_n(W_i)|.
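Given residuals, both statistics are easy to compute. The sketch below uses the exponential weight and evaluates J_n at the sample points W_1, ..., W_n, so the CM statistic and the max-form KS statistic come out of a single n×n matrix; the data are synthetic stand-ins.

```python
import numpy as np

def cm_ks(e, W):
    """CM and KS statistics built from J_n(x) = n^{-1/2} sum_i e_i H(W_i, x),
    with the exponential weight H(w, x) = exp(w'x) and x ranging over the sample."""
    n = len(e)
    Hmat = np.exp(W @ W.T)                 # Hmat[i, j] = H(W_i, W_j)
    J = (e @ Hmat) / np.sqrt(n)            # J_n evaluated at each W_j
    return np.mean(J ** 2), np.max(np.abs(J))

rng = np.random.default_rng(1)
n = 200
W = 0.5 * rng.normal(size=(n, 2))          # stand-in for W_i = (X_i', Z_i')'
e = rng.normal(size=n)                     # residuals as they would look under H_0
CM, KS = cm_ks(e, W)
```

By construction CM_n ≤ KS_n² when both are evaluated over the same sample points, a quick sanity check on any implementation.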
The following theorem states the asymptotic distributions of the test statistics.
Theorem 7.9 Under some conditions and under H_0,
(i) J_n(·) converges weakly to J_∞(·) in L_2(Ξ, ‖·‖), where J_∞(·) is a Gaussian process with zero mean and covariance function Ω_1(x, x'), where

Ω_1(x, x') = E[σ²(W_i) ν(W_i, x) ν(W_i, x')],

ν(W_i, x) = H(W_i, x) − P[H(·, x)](W_i) − η_i'Γ(x), with Γ(x) = E[H(W_i, x) η_i] {E[η_i η_i']}^{−1}, η_i = X_i − P[X_i], and P[·] denotes the projection onto the space G of additive functions.
(ii) CM_n → ∫ [J_∞(x)]² dF(x) and KS_n → sup_{x∈Ξ} |J_∞(x)|, where F(·) is the cdf of W_i.
Li, Hsiao and Zinn (2003) consider a special case of the additive partially linear model where the additive part is parameterized as Σ_{α=1}^d g_α(z_α) = q(z)'γ, with q(·) a k×1 vector of known functions. In their case, σ²(W) = E(ε² | W) = E(ε² | X, Z), and

Ω_1(x, x') = E[σ²(W_i) ν̄(W_i, x) ν̄(W_i, x')],

where ν̄(W_i, x) = H(W_i, x) − h̄(x) − ξ_i'Γ̄(x), with h̄(x) = E[H(W_i, x)], Γ̄(x) = E[H(W_i, x) ξ_i] {E[ξ_i ξ_i']}^{−1}, and ξ_i = q(W_i) − E[q(W_i)].
The asymptotic null distribution is not pivotal. Li, Hsiao and Zinn (2003) suggest using a residual-based wild bootstrap method to approximate the critical values of the null limiting distributions of CM_n and KS_n. The procedure is standard, and we only discuss it briefly here based on the method of sieves.
Let ε_i* denote the wild bootstrap error that is generated via a two-point distribution: ε_i* = [(1 − √5)/2] ε̂_i with probability (1 + √5)/(2√5), and ε_i* = [(1 + √5)/2] ε̂_i with probability (√5 − 1)/(2√5). We generate Y_i* according to the null model: Y_i* = X_i'β̂ + q(Z_i)'γ̂ + ε_i*.
Based on the wild bootstrap sample {(Y_i*, X_i, Z_i)}_{i=1}^n, we can re-estimate the model under the null to obtain estimates β̂* and γ̂*. Let ε̂_i* = Y_i* − X_i'β̂* − q(Z_i)'γ̂* denote the bootstrap residual. Then the bootstrap test statistic is given by

J_n*(x) = n^{−1/2} Σ_{i=1}^n ε̂_i* H(W_i, x).

Using J_n*(·), we can compute a bootstrap version of the CM_n statistic, i.e., CM_n* = n^{−1} Σ_{i=1}^n [J_n*(W_i)]². The bootstrap version of the KS_n statistic is analogously defined.
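The two-point distribution above is the golden-ratio wild bootstrap: conditionally on the residuals, the draws ε_i* have mean zero and match the second and third moments of ε̂_i. A sketch of the error-generation step follows (the re-estimation step is model specific and omitted).

```python
import numpy as np

def two_point_wild(e_hat, rng):
    """Golden-ratio two-point wild bootstrap: E*(e*) = 0, E*(e*^2) = e_hat^2,
    E*(e*^3) = e_hat^3, conditionally on the residuals e_hat."""
    s5 = np.sqrt(5.0)
    a, b = (1 - s5) / 2, (1 + s5) / 2              # the two support points (times e_hat)
    p = (1 + s5) / (2 * s5)                        # P(e* = a * e_hat)
    pick = rng.random(len(e_hat)) < p
    return np.where(pick, a * e_hat, b * e_hat)

rng = np.random.default_rng(2)
e_hat = rng.normal(size=100_000)                   # stand-in residuals
e_star = two_point_wild(e_hat, rng)
# Next one would form Y*_i = X_i' beta_hat + q(Z_i)' gamma_hat + e*_i and re-estimate.
m1 = e_star.mean()
m2_gap = abs((e_star ** 2).mean() - (e_hat ** 2).mean())
```

The moment matching is what lets the bootstrap mimic heteroskedasticity of unknown form: the conditional variance of ε_i* given the data is exactly ε̂_i².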
7.5 Generalized Additive Partially Linear Models
In this section, we introduce Gozalo and Linton's (2001) estimation and testing of generalized additive partially linear models.
Let (Y, X, W) be a random vector with X of dimension d, W of dimension q, and Y a scalar. Let m(x, w) = E(Y | X = x, W = w). Suppose m(x, w) has a generalized additive partially linear structure:

G_{λ_0}(m(x, w)) = β_0 + w'γ_0 + Σ_{α=1}^d m_α(x_α), (7.79)

where G_λ(·), λ ∈ Θ ⊂ R, is a parametric family of transformations (a link function which is known up to the finite dimensional parameter λ), γ_0 is a q×1 vector of unknown parameters, X = (X_1, ..., X_d) is a d-dimensional random vector, and m_α(·), α = 1, ..., d, are one-dimensional unknown smooth functions. For identification purposes, it is convenient to assume that E[m_α(X_α)] = 0. For future use, let θ_0 = (λ_0, β_0, γ_0')'.
There are many cases where the specification in (7.79) can arise. It allows us to nest the renowned standard logit and probit models in more flexible structures. It also nests the Box-Cox type of transformation as a special case, where G_λ(y) = (y^λ − 1)/λ.
Example 7.5 (Misclassification of a binary dependent variable). The specification in (7.79) can arise from the misclassification of a binary dependent variable, as in Copas (1988) and Hausman, Abrevaya and Scott-Morton (1998). Suppose that

P(Y* = 1 | X = x, W = w) = m*(x, w) = G^{−1}( β_0 + w'γ_0 + Σ_{α=1}^d m_α(x_α) ) (7.80)

for some known link function G, but that when Y* = 1 we erroneously observe Y = 0 with probability p_1, and when Y* = 0 we erroneously observe Y = 1 with probability p_2. Then

P(Y = 1 | X = x, W = w)
= P(Y* = 1 | X = x, W = w)(1 − p_1) + P(Y* = 0 | X = x, W = w) p_2
= p_2 + (1 − p_1 − p_2) G^{−1}( β_0 + w'γ_0 + Σ_{α=1}^d m_α(x_α) ),

which is of the form (7.79) with G̃^{−1}(·) = p_2 + (1 − p_1 − p_2) G^{−1}(·).
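With a logistic base link G^{−1}, the example amounts to a simple composition; the misclassification rates below are arbitrary illustrations.

```python
import numpy as np

def logistic(t):
    return 1.0 / (1.0 + np.exp(-t))

def misclassified_prob(t, p1, p2):
    """P(Y = 1 | index = t) when Y* = 1 is flipped with probability p1 and
    Y* = 0 is flipped with probability p2, for the logistic base link."""
    return p2 + (1.0 - p1 - p2) * logistic(t)

t = np.linspace(-6.0, 6.0, 25)
p = misclassified_prob(t, p1=0.05, p2=0.10)        # squashed into (0.10, 0.95)
```

Misclassification compresses the response probability into (p_2, 1 − p_1) while preserving monotonicity in the index, which is why the composite link G̃ remains a valid (strictly increasing) link.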
7.5.1 Estimation
To simplify the presentation, we shall follow Gozalo and Linton (2001) and assume that X is continuous while W is discretely valued. Specifically, W ∈ {w_1, ..., w_J}. Let U = (X, W) and U_i = (X_i, W_i). The values that X and U take will be written as x = (x_1, ..., x_d) and u = (x, w). As before, for any α = 1, ..., d, we shall partition x = (x_α, x_{−α}) and X = (X_α, X_{−α}). Let f(·|w), f_α(·|w), and f_{−α}(·|w) denote the conditional density functions of X, X_α, and X_{−α} given W = w, respectively. Define

ψ_α(x, w) = f_α(x_α|w) f_{−α}(x_{−α}|w) / f(x|w), (7.81)

which is a measure of the dependence between X_α and X_{−α} in the conditional distribution. Obviously, ψ_α(x, w) lies between zero and infinity, and when X_α and X_{−α} are independent given W = w, it is one.
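As a check on this interpretation, ψ_α can be computed in closed form for a standard bivariate normal pair (X_1, X_2) with correlation ρ (dropping the conditioning variable W for simplicity): under ρ = 0 it is identically one.

```python
import numpy as np

def psi_bvn(x1, x2, rho):
    """psi(x) = f_1(x1) f_2(x2) / f(x1, x2) for a standard bivariate normal
    with correlation rho; it equals 1 everywhere iff X_1 and X_2 are independent."""
    log_f1 = -0.5 * x1 ** 2 - 0.5 * np.log(2 * np.pi)
    log_f2 = -0.5 * x2 ** 2 - 0.5 * np.log(2 * np.pi)
    quad = (x1 ** 2 - 2 * rho * x1 * x2 + x2 ** 2) / (1 - rho ** 2)
    log_f = -0.5 * quad - np.log(2 * np.pi) - 0.5 * np.log(1 - rho ** 2)
    return np.exp(log_f1 + log_f2 - log_f)

psi_indep = psi_bvn(0.5, -1.0, rho=0.0)   # = 1 under independence
psi_dep = psi_bvn(0.5, -1.0, rho=0.7)     # > 1: this (x1, x2) pair is unlikely under rho = 0.7
```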
Let X be the support of X and X_0 be a rectangular strict subset of X, i.e., X_0 = Π_{α=1}^d X_{0α}, where the X_{0α}, α = 1, ..., d, are intervals. Let X_{0,−α} = Π_{s≠α} X_{0s}. For any θ = (λ, β, γ')', let

m_α^0(x_α; θ) = ∫ G_λ(m(x_α, x_{−α}, w)) dF_{−α,W}(x_{−α}, w) − β − E(W)'γ, (7.82)
m^0(x, w; θ) = F_λ( β + w'γ + Σ_{α=1}^d m_α^0(x_α; θ) ), (7.83)

where F_λ = G_λ^{−1} and F_{−α,W} denotes the joint distribution of (X_{−α}, W). When the additive structure in (7.79) is true, m_α^0(x_α; θ_0) = m_α(x_α) and m^0(x, w; θ_0) = m(x, w) for all (x, w). (7.82) and (7.83) form the basis of the so-called marginal integration method of estimating additive nonparametric models.
As we shall see, the estimation of the model (7.79) is similar in spirit to profile least squares. That is, one can first act as if θ = (λ, β, γ')' were known and estimate the additive components using the integration method. Then one can estimate θ by the generalized method of moments.
Estimation of m for given θ
We can estimate m(x, w) in two ways. One is consistent only when (7.79) holds, and the other is consistent more generally. First, we estimate m(x, w) by the NW method:

m̂(x, w) = Σ_{i=1}^n K_h(x − X_i) 1(W_i = w) Y_i / Σ_{i=1}^n K_h(x − X_i) 1(W_i = w), (7.84)

where K_h(x − X_i) = Π_{α=1}^d k_h(x_α − X_{iα}), k_h(·) = h^{−1}k(·/h), k(·) is a one-dimensional kernel of order s, and h = h(n) is a bandwidth sequence. The asymptotic property of m̂(x, w) is standard. At any interior point u = (x, w),

√(nh^d) ( m̂(x, w) − m(x, w) − h^s b_1(x, w) ) → N(0, ‖K‖² v(x, w)), (7.85)

where ‖K‖² = ∫ K(t)² dt and v(x, w) = σ²(x, w) / [f(x|w) P(W = w)], where σ²(x, w) = Var(Y | X = x, W = w) is the conditional variance function. The bias function b_1(x, w) is the probability limit of b̂_1(x, w), where

b̂_1(x, w) = [1 / (nh^s f(x|w) P(W = w))] Σ_{i=1}^n K_h(x − X_i) 1(W_i = w) [m(X_i, w) − m(x, w)].

When m(·) satisfies the generalized additive model structure (7.79), we can estimate m(x, w) with a better rate of convergence by imposing the additive restrictions. The empirical versions of m_α^0(x_α; θ) and m^0(x, w; θ) are given by

m̃_α(x_α; θ) = (1/n) Σ_{i=1}^n G_λ(m̂(x_α, X_{i,−α}, W_i)) 1(X_{i,−α} ∈ X_{0,−α}) − β − W̄'γ, (7.86)
m̃(x, w; θ) = F_λ( β + w'γ + Σ_{α=1}^d m̃_α(x_α; θ) ), (7.87)

where a bandwidth h_0 = h_0(n) is used throughout this estimation.
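A stripped-down illustration of (7.84) and (7.86) (identity link G_λ(t) = t, a scalar binary W, the full support as trimming set, and a hand-picked bandwidth; the DGP and all tuning choices are hypothetical):

```python
import numpy as np

def nw(x, w, X, W, Y, h):
    """NW estimate (7.84) of m(x, w): Gaussian product kernel in X,
    exact matching in the discrete variable W."""
    K = np.exp(-0.5 * ((X - x) / h) ** 2).prod(axis=1) * (W == w)
    return (K @ Y) / K.sum()

def m_tilde_alpha(xa, a, beta, gamma, X, W, Y, h, G):
    """Empirical marginal integration (7.86): average G(m_hat) over the sample
    values of (X_{-a}, W), then remove the parametric part beta + gamma * E(W)."""
    vals = np.empty(len(Y))
    for i in range(len(Y)):
        xi = X[i].copy()
        xi[a] = xa                         # pin direction a at xa, keep X_{i,-a} and W_i
        vals[i] = G(nw(xi, W[i], X, W, Y, h))
    return vals.mean() - beta - gamma * W.mean()

# Toy DGP: Y = 0.5 + 0.8*W + m1(X1) + m2(X2) + eps with centered m1, m2.
rng = np.random.default_rng(3)
n = 300
X = rng.uniform(-1, 1, size=(n, 2))
W = rng.integers(0, 2, size=n).astype(float)
Y = 0.5 + 0.8 * W + X[:, 0] + (X[:, 1] ** 2 - 1 / 3) + 0.2 * rng.normal(size=n)

m1_at_half = m_tilde_alpha(0.5, 0, beta=0.5, gamma=0.8, X=X, W=W, Y=Y,
                           h=0.25, G=lambda t: t)   # should be close to m1(0.5) = 0.5
```

Evaluated at the true (β, γ), the integration estimate recovers m_1(0.5) = 0.5 up to smoothing bias and noise; in practice θ is unknown and is profiled out by the GMM step described next.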
The asymptotic properties of these estimators can be studied using methods similar to those of Linton and Härdle (1996), who derive the pointwise asymptotic properties of m̃_α(x_α; θ) and m̃(x, w; θ) in the absence of discrete variables. Specifically,

√(nh_0) ( m̃(x, w; θ) − m(x, w) − h_0^s b(x, w; θ) ) → N(0, κ v(x, w; θ)), (7.88)

where κ = ∫ k(t)² dt,

b(x, w; θ) = F_λ'(G_λ(m(x, w))) Σ_{α=1}^d b_α(x_α; θ),  v(x, w; θ) = F_λ'(G_λ(m(x, w)))² Σ_{α=1}^d v_α(x_α; θ),

with

b_α(x_α; θ) = ∫ G_λ'(m(x_α, x_{−α}, w)) b_1(x_α, x_{−α}, w) ψ_α(x, w) f_{−α}(x_{−α}|w) dx_{−α} (summed over the support of W),
v_α(x_α; θ) = ∫ G_λ'(m(x_α, x_{−α}, w))² σ²(x_α, x_{−α}, w) ψ_α(x, w)² f_{−α}(x_{−α}|w) dx_{−α} (summed over the support of W).
Estimation of θ
Gozalo and Linton (2001) suggest estimating the parameters in θ by the generalized method of moments. That is, they choose θ ∈ Ψ to minimize the criterion function

Q_n(θ) = ‖ (1/n) Σ_{i=1}^n ρ(Z_i, M̃(U_i; θ), θ) ‖²_A, (7.89)

where Ψ is a compact parameter space in R^{q+2}, Z_i = (Y_i, U_i), ρ(·) is a given moment function, A is a symmetric and positive definite weighting matrix, M̃(U_i; θ) = (m̃(U_i; θ), m̃_1(X_{i1}; θ), ..., m̃_d(X_{id}; θ)), and ‖a‖²_A = a'Aa for any positive definite matrix A and conformable vector a. To study the asymptotic properties of the estimator θ̂ obtained from minimizing the above criterion function, we suppose that the r×1 vector (r ≥ q + 2) of moment functions satisfies

E[ρ(Z_i, M^0(U_i; θ), θ)] = 0 (7.90)

if and only if θ = θ_0, where M^0(U_i; θ) = (m^0(U_i; θ), m_1^0(X_{i1}; θ), ..., m_d^0(X_{id}; θ)). For example, from a Gaussian likelihood for a homoskedastic regression we obtain

ρ(Z_i, M^0(U_i; θ), θ) = (Y_i − m^0(U_i; θ)) ∂m^0(U_i; θ)/∂θ, (7.91)

while from a binary choice likelihood we obtain

ρ(Z_i, M^0(U_i; θ), θ) = [ (Y_i − m^0(U_i; θ)) / ( m^0(U_i; θ)(1 − m^0(U_i; θ)) ) ] ∂m^0(U_i; θ)/∂θ. (7.92)
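The moment conditions (7.91)-(7.92) are easiest to see in a fully parametric toy case. Here m(u; θ) = logistic(θu) with a scalar θ, so ∂m/∂θ is available in closed form, and the criterion (7.89) with identity weighting is minimized by a crude grid search; all of this is an illustration of the moment structure, not Gozalo and Linton's procedure.

```python
import numpy as np

def logistic(t):
    return 1.0 / (1.0 + np.exp(-t))

def Q(theta, U, Y, binary):
    """GMM criterion (7.89) with identity weighting for the toy model
    m(u; theta) = logistic(theta * u), using moments (7.91) or (7.92)."""
    m = logistic(theta * U)
    dm = m * (1 - m) * U                    # dm/dtheta in closed form
    if binary:
        rho = (Y - m) / (m * (1 - m)) * dm  # binary-choice moments (7.92)
    else:
        rho = (Y - m) * dm                  # Gaussian-likelihood moments (7.91)
    g = rho.mean()
    return g * g

rng = np.random.default_rng(4)
n, theta0 = 500, 1.5
U = rng.normal(size=n)
Y = (rng.random(n) < logistic(theta0 * U)).astype(float)

grid = np.linspace(0.1, 3.0, 59)
theta_hat = grid[np.argmin([Q(t, U, Y, binary=True) for t in grid])]
```

With the binary-choice moments, ρ reduces to the logistic score (Y − m)U, so the criterion is minimized near the maximum likelihood estimate of θ_0 = 1.5.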
Given the estimate θ̂ = θ̂(M̃, A), we can finally estimate m(x, w) by m̃(x, w; θ̂).
Let ρ_i(M, θ) = ρ(Z_i, M(U_i; θ), θ) and ρ_i(θ) = ρ_i(M^0, θ). Let ρ_i = ρ_i(θ_0) and G = E[∂ρ_i(θ_0)/∂θ']. Define

R_i(M(·; θ), θ) = ∂ρ_i(M, θ)/∂M |_{M = M(U_i; θ)}  and  R_i(θ) = R_i(M^0(·; θ), θ).

Under weak conditions, we have

n^{−1/2} Σ_{i=1}^n ρ_i(θ_0) + n^{−1/2} Σ_{i=1}^n R_i(θ_0) ( M̃(U_i; θ_0) − M^0(U_i; θ_0) ) → N(0, V)

for some finite positive definite matrix V. The following theorem states the asymptotic property of the GMM estimator θ̂.
Theorem 7.10 Under certain conditions,

√n (θ̂ − θ_0) → N(0, (G'AG)^{−1} G'AVAG (G'AG)^{−1}).

The above theorem implies that the optimal choice of the weighting matrix is A = V̂^{−1}, where V̂ → V in probability. With this choice, the asymptotic variance becomes (G'V^{−1}G)^{−1}.
7.5.2 Specification Test
We now test the validity of the additive specification (7.79) of the regression function m(u) over a subset of interest J_0 ⊂ R^{d+q} of the support of U. The null hypothesis is

H_0: m(u) = m^0(u; θ_0) for some θ_0 ∈ Ψ and all u ∈ J_0. (7.93)

The alternative hypothesis H_1 is the negation of H_0. Gozalo and Linton (2001) consider replacing θ_0 by the estimator θ̂, which is √n-consistent under the null hypothesis. Let τ be a member of a family of monotonic transformations. They consider the following test statistics:

T̂_0n = (1/n) Σ_{i=1}^n [ τ(m̂(U_i)) − τ(m̃(U_i; θ̂)) ]² π(U_i), (7.94)
T̂_1n = (1/n) Σ_{i=1}^n ε̃_i [ m̂(U_i) − m̃(U_i; θ̂) ] π(U_i), (7.95)
T̂_2n = (1/n²) Σ_{i=1}^n Σ_{j≠i} ε̃_i ε̃_j K̄_{ij} π(U_i) π(U_j), (7.96)
T̂_3n = (1/n) Σ_{i=1}^n ( ε̃_i − ε̂_i )² π(U_i), (7.97)

where K̄_{ij} = K((X_i − X_j)/h) 1(W_i = W_j), ε̂_i = Y_i − m̂(U_i) and ε̃_i = Y_i − m̃(U_i; θ̂) are the unrestricted and restricted (additive) residuals, respectively, and π(·) is a pre-specified nonnegative weighting function, which is continuous and strictly positive on J_0. In (7.94), likely candidates for the function τ are the identity map τ(t) = t and the link τ = G_λ̂. One rejects the null for large values of T̂_jn, j = 0, 1, 2, 3.
Under certain conditions, the test statistics T̂_jn, j = 0, 1, 2, 3, are, after location and scale adjustments, asymptotically standard normal under the null hypothesis. That is, for each test statistic there exist random sequences {B̂_jn} and {V̂_jn} such that under H_0,

T_jn = ( nh^{d/2} T̂_jn − B̂_jn ) / V̂_jn^{1/2} → N(0, 1), j = 0, 1, ..., 3. (7.98)

The expressions for B̂_jn and V̂_jn are given in Gozalo and Linton (2001).
As usual, the asymptotic normal approximations here can work poorly. To implement the tests, one may rely on the bootstrap to compute bootstrap p-values or critical values. The procedure is standard. For more details, see Gozalo and Linton (2001).
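Given the two sets of fitted values, the four statistics are direct sample averages. The sketch below takes the fits, the kernel matrix K̄_{ij}, and the weights π(U_i) as given inputs; the synthetic values only exercise the formulas, they are not a data-generating design from the paper.

```python
import numpy as np

def gl_statistics(Y, m_unres, m_res, Kbar, pi, tau=lambda t: t):
    """Sample versions of (7.94)-(7.97) given unrestricted fits m_hat(U_i),
    restricted fits m_tilde(U_i; theta_hat), kernel matrix Kbar, weights pi."""
    n = len(Y)
    e_u = Y - m_unres                      # unrestricted residuals
    e_r = Y - m_res                        # restricted (additive) residuals
    T0 = np.mean((tau(m_unres) - tau(m_res)) ** 2 * pi)
    T1 = np.mean(e_r * (m_unres - m_res) * pi)
    M = np.outer(e_r * pi, e_r * pi) * Kbar
    T2 = (M.sum() - np.trace(M)) / n ** 2  # double sum over i != j
    T3 = np.mean((e_r - e_u) ** 2 * pi)
    return T0, T1, T2, T3

rng = np.random.default_rng(5)
n = 100
m_res = rng.normal(size=n)
m_unres = m_res + 0.1 * rng.normal(size=n)     # fits nearly agree, as under H_0
Y = m_res + rng.normal(size=n)
Kbar = np.exp(-0.5 * rng.normal(size=(n, n)) ** 2)
Kbar = (Kbar + Kbar.T) / 2                     # symmetrize the stand-in kernel matrix
pi = np.ones(n)
T0, T1, T2, T3 = gl_statistics(Y, m_unres, m_res, Kbar, pi)
```

Note that with the identity τ and π ≡ 1, T̂_0n and T̂_3n coincide, since ε̃_i − ε̂_i = m̂(U_i) − m̃(U_i; θ̂).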
7.6 Exercises
1. Generate n = 100 binary observations according to the model

log{ p(x) / (1 − p(x)) } = x + 2,

where p(x) = P(Y = 1 | X = x) and X is uniformly distributed on [−1, 1]. Fit the model logit(p(x)) = β_0 + g(x) as described in the chapter. Plot both the estimated logits and probabilities along with their true values.
2. Derive approximate formulae for the standard errors of the estimates in the previous exercise and add these to the plots above.