Statistica Sinica (2009): manuscript 1
Nonparametric Regression Function Estimation for
Errors-in-Variables Models with Validation Data
Lilun Du, Changliang Zou and Zhaojun Wang∗
Nankai University
Abstract: This paper develops an estimation approach for nonparametric regression analysis with measurement error in the covariable, assuming the availability of independent validation data on the covariable in addition to primary data on the response variable and surrogate covariable. Without specifying any error model structure between the surrogate and true covariables, we propose an estimator that integrates local linear regression and a Fourier transformation method. An estimator combining two local linear kernel estimators is first used to calibrate the conditional expectation of the unknown objective regression function given the surrogate covariates, and the final estimator is then derived via the trigonometric series approach suggested by Delaigle et al. (2006). Under mild conditions, the consistency of the proposed estimator is established and its convergence rate is obtained. Numerical examples show that it performs well in applications.
Key words and phrases: Asymptotic normality, local linear regression, measurement error, trigonometric series.
1. Introduction
In the last two decades, the errors-in-variables (EV) model has drawn much attention from statisticians. An increasing number of applications of linear and non-linear EV models have appeared in recent years owing to their simple forms and wide applicability. Comprehensive reviews of the research and development of the EV model can be found in Fuller (1987), Carroll et al. (2006) and the references therein. In practice, the relationship between the measured (surrogate) variables and the true variables can be rather complicated compared with the classical additive error structure usually assumed. In such cases, obtaining correct statistical analyses becomes challenging. One solution is to use the
help of validation data to capture the underlying relationship between the true variables and surrogate variables. For instance, in the measurement of heart muscle damage caused by a myocardial infarction, the peak cardiac enzyme level in the bloodstream is a variable obtained easily, but it cannot accurately assess the damage to the heart muscle. Instead, arterioscintography, an invasive and expensive procedure, can be employed to produce a more accurate measure of the heart muscle damage in a small subset of subjects enrolled in the study (cf. Wittes et al. 1989). Here, the diagnostic data on heart damage given by the peak cardiac enzyme level in the bloodstream are used as the surrogate variable, and the corresponding exact data measured by arterioscintography for a small subset of subjects are used as the validation variable.

∗Corresponding author. Email: zjwang@nankai.edu.cn
Inference based on surrogate data and a validation sample has been the object of much attention. Carroll and Stefanski (1990), Carroll and Wand (1991), Pepe and Fleming (1991), Pepe (1992), Carroll et al. (1995), Lee and Sepanski (1995), Sepanski and Lee (1995), Wang (1999), Wang and Rao (2002) and Stute et al. (2007), among others, developed suitable methods for different models. Although there have been many collective efforts on statistical inference with validation data, they mostly focus on specifying some parametric relationship between covariates and responses. While parametric methods are useful in certain applications, questions always arise about the adequacy of these parametric model assumptions and about the potential impact of model misspecification on statistical analyses. In comparison, it is well known that the flexibility of nonparametric regression provides a useful tool, especially when the relationship is too complicated to be specified parametrically. This motivates us to consider estimating the regression function nonparametrically when the covariate is measured erroneously and some validation data are available to relate the surrogate and true variables, without specifying any error model structure.
To be specific, we assume that an independent validation dataset $V = \{(\tilde t_j, t_j)\}_{j=N+1}^{N+n}$ is available, in addition to the primary (surrogate) dataset $S = \{(Y_i, \tilde t_i)\}_{i=1}^{N}$, which is generated by the following nonparametric model
$$ Y = m(t) + \varepsilon, \qquad (1.1) $$
where $Y$ is a scalar response variable, $t$ is a univariate explanatory variable, $\tilde t$ is the surrogate variable for $t$, and $\varepsilon$ is a random error with $E[\varepsilon \mid t] = 0$ and $E[\varepsilon^2] < \infty$. Given the $t_i$'s, the errors $\varepsilon_i$'s are assumed to be independent and identically distributed. Our objective is to estimate the unknown regression function
$m(t)$ in (1.1) with the datasets $V$ and $S$. Obviously, standard methodologies based on regression calibration developed for parametric inference, such as those of Carroll and Wand (1991), Sepanski and Lee (1995) and Stute et al. (2007), are not applicable in this situation. Recently, Wang (2006) developed an estimation approach for nonparametric regression analysis with surrogate data and validation sampling, but it cannot be applied to our problem. Besides the fundamental difference between measurement error in the response variable considered in his paper and the erroneously measured covariable considered here, the validation dataset assumed in Wang (2006) is of the form $\{(t_j, \tilde Y_j, Y_j)\}_{j=N+1}^{N+n}$, which contains the exactly matched $\{(t_j, Y_j)\}_{j=N+1}^{N+n}$ so that the regression calibration method can be employed to estimate the function $E[Y \mid \tilde Y, t]$, whereas in our problem the validation dataset provides no such auxiliary information beyond $\{(\tilde t_j, t_j)\}_{j=N+1}^{N+n}$.
In this paper, we propose a nonparametric estimator $\hat m(z)$ that integrates local linear regression and a Fourier transformation method. This estimation approach consists of two major steps. An estimator combining two local linear kernel estimators (LLKE), based on $V$ and $S$ respectively, is first proposed to calibrate the function $E[Y \mid U(\tilde t) = z]$, where $U(\tilde t) = E[t \mid \tilde t]$. This estimator is somewhat special and unique because the two LLKEs are nested: one kernel function is involved in the other. Although $E[Y \mid U(\tilde t) = z]$ is not our objective function, the relationship between $m(z) = E[Y \mid t = z]$ and $E[Y \mid U(\tilde t) = z]$ can then be derived through the trigonometric series approach suggested by Delaigle et al. (2006). Under mild conditions, the consistency of $\hat m(z)$ is established and its convergence rate is derived.
In the next section, we start by describing our estimation approach, and then
state the main asymptotic results. In Section 3, we investigate the finite sample
properties of our proposed approach. A real-data example is used to demonstrate
the method in Section 4. Section 5 concludes this paper by suggesting some future
research issues. The proofs are given in the Appendix.
2. Methodology
Recall model (1.1) and the assumptions below it. We consider rewriting (1.1) in the following form:
$$ Y = m(U(\tilde t) + \eta) + \varepsilon, \qquad (2.1) $$
$$ t = U(\tilde t) + \eta, \qquad (2.2) $$
where $U(\tilde t)$, $\varepsilon$ and $\eta$ are independent. Taking $U(\tilde t)$ as a new variable, this can be regarded as a Berkson EV model (Berkson 1950; Carroll et al. 2006). This enables us to apply a recent nonparametric technique for solving the Berkson EV problem developed by Delaigle et al. (2006), provided we can estimate the distribution of $\eta$ and $E[Y \mid U(\tilde t) = z]$, on which we elaborate next.
2.1. Estimating $E[Y \mid U(\tilde t) = z]$
Represent (2.1) as
$$ Y = M(U(\tilde t)) + \xi, \qquad (2.3) $$
where $M(z) \equiv E[Y \mid U(\tilde t) = z]$, $E[\xi \mid \tilde t] = 0$ and $E[\xi^2] < \infty$. Generally, $M$ is not equal to $m$ and $\xi \neq \varepsilon + \eta$. Based on the validation set $V$, we can first estimate $U(\tilde t)$ by means of a local linear fit, that is,
$$ \hat U(\tilde t) = \frac{\sum_{j=N+1}^{N+n} t_j L\left(\frac{\tilde t_j - \tilde t}{b_n}\right)\left[S_{n2} - (\tilde t_j - \tilde t) S_{n1}\right]}{\sum_{j=N+1}^{N+n} L\left(\frac{\tilde t_j - \tilde t}{b_n}\right)\left[S_{n2} - (\tilde t_j - \tilde t) S_{n1}\right]}, $$
where
$$ S_{n\gamma} = \frac{1}{n b_n} \sum_{j=N+1}^{N+n} L\left(\frac{\tilde t_j - \tilde t}{b_n}\right)(\tilde t_j - \tilde t)^{\gamma}, \qquad \gamma = 0, 1, 2, $$
$L(\cdot)$ is a symmetric density kernel function and $b_n$ is a bandwidth. Here we choose the LLKE rather than the Nadaraya-Watson estimator because the former possesses superior boundary behavior (cf. Fan 1992). As illustrated in what follows, this estimator is nested into another LLKE, so unsatisfactory boundary performance at the stage of the first local linear fit could deteriorate the next LLKE, because values of $\hat U(\tilde t)$ at the boundary may lie in the interior of the range of $U(\tilde t)$.
So far, with the datasets $V$ and $S$, we can estimate the function $M(\cdot)$ by plugging $\hat U(\tilde t)$ into the LLKE of $\{(Y_i, \hat U(\tilde t_i))\}_{i=1}^{N}$, that is,
$$ \hat M(z) = \frac{\sum_{i=1}^{N} Y_i K\left(\frac{\hat U(\tilde t_i) - z}{h_N}\right)\left[S_{N2} - (\hat U(\tilde t_i) - z) S_{N1}\right]}{\sum_{i=1}^{N} K\left(\frac{\hat U(\tilde t_i) - z}{h_N}\right)\left[S_{N2} - (\hat U(\tilde t_i) - z) S_{N1}\right]}, \qquad (2.4) $$
where
$$ S_{N\gamma} = \frac{1}{N h_N} \sum_{i=1}^{N} K\left(\frac{\hat U(\tilde t_i) - z}{h_N}\right)(\hat U(\tilde t_i) - z)^{\gamma}, \qquad \gamma = 0, 1, 2, 3, $$
$K(\cdot)$ is a symmetric density kernel function and $h_N$ is a bandwidth.
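To make the two nested fits concrete, here is a minimal numerical sketch in Python. The toy data, bandwidths, and the helper names `local_linear` and `epanechnikov` are our own illustrative choices, not from the paper:

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel: a symmetric density supported on [-1, 1]."""
    return 0.75 * (1 - u**2) * (np.abs(u) <= 1)

def local_linear(x_train, y_train, x_eval, h):
    """Local linear kernel estimator (LLKE) of E[y | x] at the points x_eval."""
    est = np.empty(len(x_eval))
    for k, z in enumerate(x_eval):
        d = x_train - z
        w = epanechnikov(d / h)
        s1, s2 = np.sum(w * d), np.sum(w * d**2)
        ww = w * (s2 - d * s1)              # local linear weights
        est[k] = np.sum(ww * y_train) / np.sum(ww)
    return est

rng = np.random.default_rng(0)
n, N = 60, 200
# validation data: true t with its surrogate t_tilde (toy additive error)
t_v = rng.uniform(-1, 1, n)
tt_v = t_v + rng.uniform(-0.3, 0.3, n)
# primary data: only the surrogate and the response are observed
t_s = rng.uniform(-1, 1, N)
tt_s = t_s + rng.uniform(-0.3, 0.3, N)
y_s = (1 - t_s**2)**2 + rng.normal(0, 0.25, N)

# restrict primary surrogates to the validation range (cf. Remark 1)
keep = (tt_s >= tt_v.min()) & (tt_s <= tt_v.max())
tt_s, y_s = tt_s[keep], y_s[keep]

# step 1: U_hat = LLKE of t on t_tilde, fitted on the validation sample
U_hat = local_linear(tt_v, t_v, tt_s, h=0.3)
# step 2: M_hat(z) = LLKE of Y on U_hat(t_tilde), as in Eq. (2.4)
z_grid = np.linspace(-0.5, 0.5, 21)
M_hat = local_linear(U_hat, y_s, z_grid, h=0.2)
```

Note how the first fit's output `U_hat` becomes the design points of the second fit, which is exactly the nesting described above.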
Remark 1 The preceding algorithm requires the LLKE of the function $U(\tilde t)$, obtained by regressing $t$ on $\tilde t$ in the validation data. As discussed by Carroll and Wand (1991), Sepanski and Carroll (1993), Sepanski et al. (1994) and others, the range of $\tilde t$ in the validation data is usually smaller than that observed in the primary data, which could affect the LLKE $\hat U(\tilde t)$ studied above. If the LLKE were used blindly, this would lead to extrapolation, which is a dangerous business. Following the method for dealing with this edge-effects problem developed by these authors, the LLKE $\hat U(\tilde t)$ used in this paper is calculated on a compact set $\Theta = [\tilde t_{V\min}, \tilde t_{V\max}]$ interior to the support of $\tilde t$, where $\tilde t_{V\min} = \min\{\tilde t_j\}_{j=N+1}^{N+n}$ and $\tilde t_{V\max} = \max\{\tilde t_j\}_{j=N+1}^{N+n}$. Sums in the primary data are taken only for those $\tilde t \in \Theta$. While this truncation causes a certain loss of efficiency, it is counterbalanced by a gain in robustness.
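The truncation to $\Theta$ amounts to a simple mask on the primary-data surrogates; a tiny sketch with made-up numbers:

```python
import numpy as np

# hypothetical surrogate values: validation sample and primary sample
tt_validation = np.array([-0.8, -0.2, 0.1, 0.6, 0.9])
tt_primary = np.array([-1.2, -0.5, 0.0, 0.7, 1.1, 0.3])

# Theta = [min, max] of the surrogates observed in the validation data
theta_lo, theta_hi = tt_validation.min(), tt_validation.max()

# primary-data sums are restricted to surrogates inside Theta,
# which avoids extrapolating U_hat beyond the validation range
inside = (tt_primary >= theta_lo) & (tt_primary <= theta_hi)
tt_used = tt_primary[inside]
```

Here the two extreme primary points `-1.2` and `1.1` fall outside $\Theta = [-0.8, 0.9]$ and are dropped.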
2.2. Estimating m(z)
In what follows, we shall assume that the densities of $t$, $\tilde t$ and $\eta$, denoted by $f_t$, $f_{\tilde t}$ and $f_{\eta}$ respectively, are all compactly supported and bounded away from zero. Without loss of generality, we suppose their support intervals have been rescaled so that they are contained within $\Omega = [-\pi, \pi]$. In addition, to ensure that $m$ is identifiable, we suppose that the support of $f_t$ is contained within the range of $U(\tilde t)$. After obtaining $\hat M(z)$ from Eq. (2.4), the Fourier transformation method introduced by Delaigle et al. (2006) can be accommodated to find the relationship between $m(z)$ and $M(z)$.

On $\Omega$ we may write the trigonometric series for $m(z)$ as
$$ m(z) = m_0 + \sum_{l=1}^{\infty} \left[ m_{1l} \cos(lz) + m_{2l} \sin(lz) \right], \qquad (2.5) $$
where
$$ m_0 = \frac{1}{2\pi} \int_{\Omega} m(t)\,dt, \qquad m_{1l} = \frac{1}{\pi} \int_{\Omega} m(t)\cos(lt)\,dt, \qquad m_{2l} = \frac{1}{\pi} \int_{\Omega} m(t)\sin(lt)\,dt. $$
Analogously, $M(U)$ can be represented as
$$ M(U) = M_0 + \sum_{l=1}^{\infty} \left[ M_{1l} \cos(lU) + M_{2l} \sin(lU) \right], \qquad (2.6) $$
where the constants $M_0, M_{1l}, M_{2l}$ for $l = 1, 2, \ldots$ are the Fourier coefficients determined by the function $M$. Furthermore, the coefficients $m_{1l}, m_{2l}$ are uniquely determined by $M_{1l}, M_{2l}$ if $U$ and $\eta$ are independent, which is implied by the Berkson model (2.1)-(2.2). Hence, simple calculations yield
$$ \begin{pmatrix} m_{1l} \\ m_{2l} \end{pmatrix} = \frac{1}{\alpha_{1l}^2 + \alpha_{2l}^2} \begin{pmatrix} \alpha_{1l} & -\alpha_{2l} \\ \alpha_{2l} & \alpha_{1l} \end{pmatrix} \begin{pmatrix} M_{1l} \\ M_{2l} \end{pmatrix}, \qquad (2.7) $$
where $\alpha_{1l} = E\cos(l\eta)$ and $\alpha_{2l} = E\sin(l\eta)$, provided that $\alpha_{1l}^2 + \alpha_{2l}^2 \neq 0$ for $l \geq 1$.
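Relation (2.7) can be checked directly by expanding the series for $M$; a brief derivation under the independence of $U$ and $\eta$:

```latex
M(U) = E[m(U+\eta)\mid U]
     = m_0 + \sum_{l=1}^{\infty}\Big( m_{1l}\,E\big[\cos(l(U+\eta))\mid U\big]
       + m_{2l}\,E\big[\sin(l(U+\eta))\mid U\big]\Big).
% Angle-sum identities plus \alpha_{1l}=E\cos(l\eta),\ \alpha_{2l}=E\sin(l\eta) give
E\big[\cos(l(U+\eta))\mid U\big] = \alpha_{1l}\cos(lU) - \alpha_{2l}\sin(lU), \qquad
E\big[\sin(l(U+\eta))\mid U\big] = \alpha_{1l}\sin(lU) + \alpha_{2l}\cos(lU).
% Matching coefficients with the series (2.6) for M yields
M_{1l} = \alpha_{1l} m_{1l} + \alpha_{2l} m_{2l}, \qquad
M_{2l} = -\alpha_{2l} m_{1l} + \alpha_{1l} m_{2l};
% inverting this 2x2 system, whose determinant is \alpha_{1l}^2+\alpha_{2l}^2,
% recovers (2.7).
```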
In the present problem, the distribution of $\eta$ is not explicitly available. However, with the help of the validation data, the empirical distribution of the $\hat\eta_j$ can be used instead, where
$$ \hat\eta_j = t_j - \hat U(\tilde t_j), \qquad j = N+1, \ldots, N+n. $$
This type of estimation has been proposed and studied by Akritas and Van Keilegom (2001). Correspondingly, $\alpha_{1l}$ and $\alpha_{2l}$ for $l = 1, 2, \ldots$ can be estimated by
$$ \hat\alpha_{1l} = \frac{1}{n} \sum_{j=N+1}^{N+n} \cos(l\hat\eta_j), \qquad \hat\alpha_{2l} = \frac{1}{n} \sum_{j=N+1}^{N+n} \sin(l\hat\eta_j). \qquad (2.8) $$
Thus, the estimated coefficients $\hat m_{1l}, \hat m_{2l}$ in (2.7) can be represented as
$$ \begin{pmatrix} \hat m_{1l} \\ \hat m_{2l} \end{pmatrix} = \frac{1}{\hat\alpha_{1l}^2 + \hat\alpha_{2l}^2} \begin{pmatrix} \hat\alpha_{1l} & -\hat\alpha_{2l} \\ \hat\alpha_{2l} & \hat\alpha_{1l} \end{pmatrix} \begin{pmatrix} \hat M_{1l} \\ \hat M_{2l} \end{pmatrix}, \qquad (2.9) $$
where
$$ \hat M_{1l} = \frac{1}{\pi} \int_{H} \hat M(t)\cos(lt)\,dt, \qquad \hat M_{2l} = \frac{1}{\pi} \int_{H} \hat M(t)\sin(lt)\,dt, $$
$$ \hat m_0 = \hat M_0 = \frac{1}{2\pi} \int_{H} \hat M(t)\,dt, $$
$\hat M(\cdot)$ is defined in (2.4), and $H \subseteq \Omega$ contains the support of $M$. Combining Eqs. (2.5) and (2.9), our final estimator $\hat m(z)$ is given by
$$ \hat m(z) = \hat m_0 + \sum_{l=1}^{q} \left[ \hat m_{1l} \cos(lz) + \hat m_{2l} \sin(lz) \right], \qquad (2.10) $$
where $q$ denotes the number of Fourier coefficients included in the estimator; it can be regarded as another smoothing (regularization) parameter besides $b_n$ and $h_N$.
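The inversion steps (2.8)-(2.10) are straightforward to implement numerically. A sketch in which `fourier_recover` and its rectangle-rule integration are our own illustrative choices, with $H$ taken equal to $\Omega$:

```python
import numpy as np

def fourier_recover(M_hat_fn, eta_hat, q, grid_size=512):
    """Build m_hat (2.10) from an estimated M on H = [-pi, pi] and
    residuals eta_hat_j from the validation data."""
    dt = 2 * np.pi / grid_size
    t = -np.pi + dt * np.arange(grid_size)       # periodic rectangle rule
    Mt = M_hat_fn(t)
    m0 = np.sum(Mt) * dt / (2 * np.pi)           # m0_hat = M0_hat
    coefs = []
    for l in range(1, q + 1):
        M1 = np.sum(Mt * np.cos(l * t)) * dt / np.pi
        M2 = np.sum(Mt * np.sin(l * t)) * dt / np.pi
        a1 = np.mean(np.cos(l * eta_hat))        # alpha_hat_{1l}, Eq. (2.8)
        a2 = np.mean(np.sin(l * eta_hat))        # alpha_hat_{2l}
        d = a1**2 + a2**2
        coefs.append(((a1 * M1 - a2 * M2) / d,   # m_hat_{1l}, Eq. (2.9)
                      (a2 * M1 + a1 * M2) / d))  # m_hat_{2l}
    def m_hat(z):
        out = m0 * np.ones_like(np.asarray(z, dtype=float))
        for l, (m1, m2) in enumerate(coefs, start=1):
            out = out + m1 * np.cos(l * z) + m2 * np.sin(l * z)
        return out
    return m_hat

# sanity check: with eta identically 0, (2.9) is the identity map,
# so m_hat reproduces a band-limited M exactly
m_hat = fourier_recover(lambda t: 0.3 + 0.5 * np.cos(t), np.zeros(50), q=2)
```

With $\eta \equiv 0$ the recovered function satisfies $\hat m(0) = 0.3 + 0.5 = 0.8$ up to floating-point error, since the rectangle rule over a full period is exact for trigonometric polynomials.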
Remark 2 Based on Eqs. (2.1)-(2.2), we can see that inference on the relationship between $m(z)$ and $M(z)$ essentially reduces to a nonparametric regression problem with an additive EV structure. Nonparametric methods for inference under the classical EV model, that is $\tilde t = t + \delta$, include kernel approaches (e.g., Fan and Truong 1993; Delaigle et al. 2009) and techniques based on simulation and extrapolation (see Cook and Stefanski 1994; Carroll et al. 1999; Staudenmayer and Ruppert 2004). These approaches may also be used in the present circumstance after certain modifications, although the Berkson model is particularly appropriate according to (2.2). A more thorough investigation of this problem is beyond the scope of this paper but should be a subject of future research.
2.3. Main Results
In this subsection, we study the asymptotic behavior of the proposed estimator. To be clear, we first present a set of conditions for the results stated later. Let $F_{t|\tilde t}$ denote the conditional distribution function of $t$ given $\tilde t$.

Conditions

(C1) $m(\cdot)$ has an $s$-order Lipschitz continuous second derivative for some real $s > 0$.

(C2) The functions $U(\cdot)$ and $M(\cdot)$ are bounded and twice continuously differentiable.

(C3) The density function of $U(\tilde t)$, denoted by $f_U$, is Lipschitz continuous and bounded away from 0.

(C4) The kernel functions $K(w)$ and $L(w)$ are bounded and symmetric probability density functions and satisfy $\int K(w) w^2\,dw < \infty$ and $\int L(w) w^2\,dw < \infty$. In addition, $K(w)$ is twice differentiable and satisfies $\int K'(w) w\,dw < \infty$, $\int K'^2(w)\,dw < \infty$, and $\int |K''(w)|\,dw < \infty$.

(C5) $N$, $n$, $b_n$, $h_N$ satisfy $\gamma_n^2 / h_N^2 \to 0$, $n b_n \to \infty$ and $N h_N^3 \to \infty$, where $\gamma_n = \left( \frac{1}{\sqrt{n b_n}} + b_n^2 \right) \log^{1/2}(1/b_n)$.

(C6) $\lim N/n = \lambda \in [0, \infty)$.

(C7) The density function $f_{\tilde t}$ is three times continuously differentiable.

(C8) $F'_{t|\tilde t}$ is continuous in $(t, \tilde t)$ and $\sup_{t,\tilde t} |t^2 F'_{t|\tilde t}| < \infty$, and the same holds for all other partial derivatives of $F_{t|\tilde t}$ with respect to $t$ and $\tilde t$ up to order two.
Remark 3 Conditions (C1)-(C4) are standard in nonparametric regression, except that some assumptions on the derivatives of the function $K$ are required; these are satisfied by most commonly used kernel functions. (C5) is the bandwidth condition used in the theorems. (C6) is a standard condition in the study of validation sampling and is reasonable in practice. (C7) and (C8) are mild conditions used in the proof of Theorem 2, which guarantee the consistency of the moment estimators in Eq. (2.8). Obviously, the conditions imposed here are mild.
Due to the relationship between $m(z)$ and $M(z)$, the performance of the estimator $\hat M(z)$ is important: its properties play a central role in the convergence rate of $\hat m(z)$. Moreover, the double local linear regression $\hat M(z)$ is rather complicated, and its theoretical results cannot be derived using the standard techniques for the LLKE, since it involves two error terms, namely $\xi$ and $\eta$. Hence, we begin by studying the asymptotic properties of $\hat M(z)$.
Theorem 1 Suppose Conditions (C1)-(C6) hold. Then
$$ \sqrt{N h_N}\left( \hat M(z) - M(z) - B(z) \right) \xrightarrow{L} N(0, V(z)), \qquad (2.11) $$
where
$$ V(z) = \frac{\left[ \mathrm{Var}(\xi) + \lambda \mathrm{Var}(\eta) M'^2(z) \right] \int K^2(w)\,dw}{f_U(z)}, $$
$$ B(z) = \frac{1}{2} M''(z) h_N^2 \int K(w) w^2\,dw + O(b_n^2). \qquad (2.12) $$
Furthermore, if $U$ is invertible,
$$ B(z) = \frac{1}{2} M''(z) h_N^2 \int K(w) w^2\,dw - \frac{1}{2} M'(z) U''(U^{-1}(z)) b_n^2 \int L(w) w^2\,dw, $$
where $U^{-1}$ denotes the inverse function of $U$.
Remark 4 If $\lambda = 0$, which means $n \gg N$, the asymptotic behavior of $\hat M(z)$ is the same as that of the standard LLKE (cf. Fan 1992). In reality, $\lambda$ is usually a constant larger than 1, because the exactly measured covariates $\{t_j\}_{j=N+1}^{N+n}$ are expensive to obtain. Even in this situation, $\hat M(z)$ still achieves the optimal convergence rate of the standard LLKE for appropriately chosen $h_N$ and $b_n$.
The next theorem establishes the convergence rate of the proposed estimator $\hat m(z)$.

Theorem 2 Suppose Conditions (C1)-(C8) hold. If $c_1 j^{-a_1} \leq |\alpha_{kj}| \leq c_2 j^{-a_1}$ and $|M_{kj}| \leq c_3 j^{-a_2}$ for some positive constants $c_1, c_2, c_3, a_1$ and $a_2$, then
$$ \int_{\Omega} E\left[ \hat m(t) - m(t) \right]^2 dt = O\left( N^{-1} q^{2b+1} + h_N^4 q^{2b+1} + b_n^4 q^{2b+1} + q^{-2-s} \right), $$
where $b = \max(a_1, 2a_1 - a_2)$ and $s > 0$ is defined in Condition (C1).
Remark 5 The convergence rate of $\hat M(z)$ presented in Theorem 1 implies that the mean square error of $\hat M_{kl} - M_{kl}$ is of order $O_p(N^{-1} + h_N^4 + b_n^4)$. Similarly, the mean square error of $\hat\alpha_{kl} - \alpha_{kl}$ is of order $O_p(n^{-1} + b_n^4)$, as shown in the proof of Theorem 2. Combining the convergence rates of $\hat M_{kl} - M_{kl}$ and $\hat\alpha_{kl} - \alpha_{kl}$ yields the first three terms of the mean integrated squared error of $\hat m(z)$ presented in Theorem 2. Condition (C5) and Theorem 2 motivate us to choose $h_N$ and $b_n$ in the region $[N^{-1/3}, N^{-1/4}]$, as then the mean integrated squared error of $\hat m(t)$ will mainly depend on $N$ and $q$. In particular, undersmoothing at the stage of estimating $M$ decreases the estimation bias at the expense of some variance, but further smoothing at the stage of the Fourier transformation can compensate for the increase in variance.
2.4. Smoothing Parameter Selection
As with any nonparametric regression procedure, an important choice to be made is the amount of local averaging performed to obtain the regression estimate. For the local polynomial regression estimator, bandwidth selection rules were considered in Ruppert et al. (1995) and Fan and Gijbels (1996), among others. Delaigle et al. (2006) proposed an automatic way of choosing the smoothing parameters $q$ and $h_N$, which combines an existing plug-in bandwidth selector for the LLKE with a cross-validation (CV) rule for the trigonometric series. Since our estimator $\hat m(z)$ involves three regularization parameters $h_N$, $b_n$ and $q$, we present the following modification of the leave-one-out CV selection criterion.
First, we use $b_n = b_c n^{-1/20}$, where $b_c$ is the delete-one CV bandwidth selector for the validation sample, that is, the minimizer of
$$ \mathrm{CV}(b_n) = \frac{1}{n} \sum_{j=N+1}^{N+n} \left( t_j - \hat U^{(-j)}(\tilde t_j; b_n) \right)^2, $$
where $\hat U^{(-j)}(\cdot; b_n)$ denotes the leave-one-out version of $\hat U$ using $b_n$. The multiplier $n^{-1/20}$ is recommended partly because of Condition (C5) and partly because the cross-validation selector attains the optimal $O_p(n^{-1/5})$ rate (Hardle et al. 1988). This bandwidth yields a certain degree of undersmoothing of $\hat U$, which is appropriate in view of the intuitive explanation provided in Remark 5.
After obtaining $b_n$ and defining the corresponding $\hat U$ with the selected bandwidth $b_n$, a cross-validation criterion for selecting $h_N$ and $q$ chooses
$$ (\hat h_N, \hat q) = \arg\min_{h_N, q} \frac{1}{N} \sum_{i=1}^{N} \left[ Y_i - \hat m^{(-i)}(\hat U(\tilde t_i); h_N, q) \right]^2, $$
where $\hat m^{(-i)}(\cdot; h_N, q)$ denotes the version of $\hat m$ constructed on omitting $(\tilde t_i, Y_i)$ from the surrogate data, using $h_N$ and $q$ terms in the trigonometric series. In this criterion, $\hat U(\tilde t_i)$ is used as an empirical approximation of $t_i$, in the sense of ignoring the error term $\eta_i$, because in practical applications the true regressor $t_i$ is not observable. Note that since $q$ is an integer, this two-dimensional CV criterion does not require extensive computational effort. Furthermore, based on our numerical results, it suffices to choose $q$ from 1 to 5, which is in accordance with the empirical findings in Delaigle et al. (2006). For instance, when $(n, N) = (60, 120)$, it requires less than 3 seconds to complete the whole curve fit on 201 grid points using a Pentium-M 2.4 GHz CPU.
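A direct, unoptimized implementation of the delete-one CV step for $b_c$ over a bandwidth grid (toy data; `loo_cv_bandwidth` is our illustrative helper, not the authors' code):

```python
import numpy as np

def epanechnikov(u):
    return 0.75 * (1 - u**2) * (np.abs(u) <= 1)

def loo_cv_bandwidth(tt, t, grid):
    """Choose b_c minimizing CV(b) = (1/n) sum_j (t_j - U_hat^(-j)(tt_j; b))^2
    over the grid, then return b_n = b_c * n**(-1/20)."""
    n = len(t)
    best_b, best_score = None, np.inf
    for b in grid:
        sq = 0.0
        for j in range(n):
            keep = np.arange(n) != j              # leave observation j out
            d = tt[keep] - tt[j]
            w = epanechnikov(d / b)
            s1, s2 = np.sum(w * d), np.sum(w * d**2)
            ww = w * (s2 - d * s1)                # local linear weights
            denom = np.sum(ww)
            pred = np.sum(ww * t[keep]) / denom if denom > 0 else np.mean(t[keep])
            sq += (t[j] - pred) ** 2
        if sq / n < best_score:                   # keep the minimizer b_c
            best_b, best_score = b, sq / n
    return best_b * n ** (-1 / 20)                # undersmoothing multiplier

rng = np.random.default_rng(1)
t = rng.uniform(-1, 1, 40)
tt = t + rng.normal(0, 0.2, 40)
b_n = loo_cv_bandwidth(tt, t, np.linspace(0.1, 0.8, 8))
```

The subsequent two-dimensional search over $(h_N, q)$ follows the same leave-one-out pattern, with $q$ restricted to a handful of integer values.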
3. Numerical Performance Assessment
We conduct a simulation study to evaluate the numerical properties of the proposed estimator, using the smoothing parameter selection method described in Section 2.4. The variety of regression functions and combinations of parameters is too large to allow a comprehensive, all-encompassing comparison. Our goal is to show the effectiveness and robustness of the proposed estimator $\hat m(z)$, and thus we only choose certain representative examples for illustration.
[Figure 3.1 about here]

Figure 3.1: The estimated curves of $\hat M(z)$ for the regression function (I) through 1,000 repetitions when (a): $(n, N) = (30, 120)$, $U$ given by (i); (b): $(n, N) = (30, 120)$, $U$ given by (iii). The solid, dashed, dotted, and two dashed-dotted curves represent $M(z)$, the median of $\hat m_2(z)$, the median of $\hat M(z)$, and the quartiles of $\hat M(z)$, respectively.
We consider two regression functions $m(\cdot)$ taken from the examples of Delaigle et al. (2006):

(I) $m(z) = (1 - z^2)^2 I_{\{z \in [-1,1]\}}$;

(II) $m(z) = (1 - z^2)^2 \exp(2z) I_{\{z \in [-1,1]\}}$;

where $I_{\{\cdot\}}$ is the indicator function. Three cases for $U$ are considered: (i) $t = \tilde t + \delta$, which is the classical Berkson model; (ii) $t = \tilde t^2 - \frac{1}{4} + \delta$; (iii) $t = \cos(2\pi \tilde t) + \delta$, where $\delta$ is independent of $\tilde t$ and follows the uniform distribution $U[-1, 1]$. The $\tilde t$'s were generated from two distributions: (1) $\tilde t \sim N(0, 0.5^2)$; (2) $\tilde t \sim \exp(1) - 1$, where the latter denotes the centered standard exponential distribution. Throughout this section the $\varepsilon$'s are assumed to follow the normal distribution $N(0, 0.25^2)$. For each of several choices of $m$, $U$ and the distribution of $\tilde t$, 1,000 simulated datasets were generated for each sample size combination $(n, N) = (30, 120), (60, 120)$. The kernel functions $K$ and $L$ used in (2.4) are both chosen to be the Epanechnikov kernel $0.75(1 - x^2) I(-1 \leq x \leq 1)$, which has certain optimal properties (cf. Fan and Gijbels 1996).
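For reference, one replication of the data-generating mechanism above can be sketched as follows. The seed, the helper names, and the Berkson-style generation order (drawing the surrogate first) are our own implementation choices:

```python
import numpy as np

rng = np.random.default_rng(2024)
n, N = 30, 120

def m_I(z):
    """Regression function (I): (1 - z^2)^2 on [-1, 1], 0 elsewhere."""
    return (1 - z**2) ** 2 * (np.abs(z) <= 1)

def true_from_surrogate(tt, case, rng):
    """Error models (i)-(iii): generate the true t from the surrogate t_tilde."""
    delta = rng.uniform(-1, 1, size=np.shape(tt))
    if case == "i":
        return tt + delta                       # classical Berkson model
    if case == "ii":
        return tt**2 - 0.25 + delta             # nonlinear Berkson model
    return np.cos(2 * np.pi * tt) + delta       # case (iii)

# surrogate distribution (1): t_tilde ~ N(0, 0.5^2)
tt_val = rng.normal(0, 0.5, n)                  # validation surrogates
t_val = true_from_surrogate(tt_val, "i", rng)   # matched true values
tt_pri = rng.normal(0, 0.5, N)                  # primary surrogates
t_pri = true_from_surrogate(tt_pri, "i", rng)   # latent, never observed
y_pri = m_I(t_pri) + rng.normal(0, 0.25, N)     # responses, eps ~ N(0, 0.25^2)
```

The estimator then sees only the pairs `(tt_val, t_val)` and `(tt_pri, y_pri)`; the latent `t_pri` is used solely to generate the responses.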
[Figure 3.2 about here]

Figure 3.2: Median curves of 1,000 estimators of regression function (I) with $(n, N) = (60, 120)$ when (a): $U$ given by (i), $f_{\tilde t}$ given by (1); (b): $U$ given by (ii), $f_{\tilde t}$ given by (1). The solid, dashed, dotted, and dashed-dotted curves represent $m(z)$, $\hat m(z)$, $\hat m_1(z)$ and $\hat m_2(z)$, respectively.
It is challenging to compare the proposed method with alternatives, since there is no obvious comparable method in the literature. Here, we consider Delaigle et al.'s (2006) method, in which a simple Berkson error structure is assumed (denoted by $\hat m_1$). Another possible alternative is the naive LLKE based on the dataset $\{(Y_i, \tilde t_i)\}_{i=1}^{N}$ (denoted by $\hat m_2$), although it is not consistent because it ignores the measurement error. For $\hat m_1$, the method proposed in Delaigle et al. (2006) for determining the parameters is used. For $\hat m_2$, the leave-one-out CV approach is used for choosing the bandwidth $h_N$.

Firstly, as the performance of our final estimator $\hat m(z)$ relies heavily on the estimator $\hat M(z)$, it is interesting to see how well $\hat M(z)$ works. Figure 3.1 shows the regression function $M(z)$, the curves of the median and quartiles of 1,000 estimates $\hat M(z)$, and the median of $\hat m_2(z)$, under two different settings for the regression function (I). In both settings, the first $f_{\tilde t}$ is considered. Since the closed form of the true function $M(z)$ is not available, we approximate it by simulation with 100,000 repetitions. In the first setting, where $U$ corresponds to the additive Berkson error model, $\hat m_2(z)$ is certainly a consistent estimator, as we can see from Figure 3.1(a). However, this is not the case for the second setting, in
[Figure 3.3 about here]

Figure 3.3: Median curves of 1,000 estimators of regression function (II) with $(n, N) = (60, 120)$ when (a): $U$ given by (i), $f_{\tilde t}$ given by (2); (b): $U$ given by (ii), $f_{\tilde t}$ given by (2). The solid, dashed, dotted, and dashed-dotted curves represent $m(z)$, $\hat m(z)$, $\hat m_1(z)$ and $\hat m_2(z)$, respectively.
which the nonlinear error structure (iii) is considered. In this case, $\hat m_2(z)$ is far from the true $M(z)$, as we would expect, but $\hat M(z)$ still performs reasonably well, as guaranteed by Theorem 1.
Table 3.1: The estimated MISE comparison for the estimators $\hat m(z)$, $\hat m_1(z)$ and $\hat m_2(z)$.

                            m-hat(z)           m1-hat(z)          m2-hat(z)
(n, N)     U     f_t~     (I)      (II)      (I)      (II)      (I)      (II)
(30, 120)  (i)   (1)    1.73e-2  7.32e-2   1.05e-2  2.54e-2   1.14e-1  2.65e-1
           (i)   (2)    1.53e-2  6.95e-2   1.10e-2  2.35e-2   1.11e-1  2.74e-1
           (ii)  (1)    3.69e-2  8.95e-2   9.80e-2  1.47e-1   1.33e-1  3.67e-1
           (ii)  (2)    3.40e-2  8.19e-2   9.66e-2  1.39e-1   1.28e-1  3.68e-1
(60, 120)  (i)   (1)    1.25e-2  6.22e-2   7.49e-3  1.96e-2   1.01e-1  2.60e-1
           (i)   (2)    1.07e-2  5.80e-2   7.18e-3  1.75e-2   9.85e-2  2.48e-1
           (ii)  (1)    2.55e-2  7.77e-2   9.64e-2  1.39e-1   1.23e-1  3.45e-1
           (ii)  (2)    2.22e-2  7.49e-2   1.03e-1  1.58e-1   1.16e-1  3.31e-1
Figures 3.2 and 3.3 show the regression function curve $m(z)$ and the curves of the medians of 1,000 estimates $\hat m(z)$, $\hat m_1(z)$ and $\hat m_2(z)$ under different settings of $U$ and $f_{\tilde t}$ for $(n, N) = (60, 120)$, in the two examples (I) and (II) respectively. From these two figures, we can observe that $\hat m(z)$ captures the patterns of the true curves as well as $\hat m_1(z)$ does, although $\hat m(z)$ tends to have larger bias at boundaries and peaks due to its explicit dependence on the size of the validation dataset. Taking the sample sizes and noise levels into account, the proposed estimator $\hat m(z)$ and the smoothing parameter selection method appear to perform very well for the test functions considered in this study. In comparison, $\hat m_2(z)$ clearly fails to produce correct function curves, as expected.
[Figure 3.4 about here]

Figure 3.4: Median curves of 1,000 estimators with $(n, N) = (60, 120)$ when (a): $m$ given by (I), $U$ given by (iii), $f_{\tilde t}$ given by (2); (b): $m$ given by (II), $U$ given by (iii), $f_{\tilde t}$ given by (2). The solid, dashed, dotted, and dashed-dotted curves represent $m(z)$, $\hat m(z)$, $\hat m_1(z)$ and $\hat m_2(z)$, respectively.
Table 3.1 summarizes the results shown in Figures 3.2-3.3 numerically. The estimated mean integrated squared errors (MISE), evaluated on a grid of 201 equidistant values of $z$ in $[-1, 1]$, are presented. We can see that $\hat m_1(z)$ has a certain advantage over $\hat m(z)$ when the error model (i) is considered, because its assumption on the error model is correct in that case. $\hat m(z)$ has much smaller MISE than $\hat m_1(z)$ when the nonlinear Berkson model (ii) is used. In these cases, the performance of $\hat m(z)$ improves considerably in terms of MISE as the sample sizes increase, whereas $\hat m_1(z)$ is hardly affected by those changes. It is worth pointing out that the superiority of $\hat m(z)$ to $\hat m_1(z)$ may become more pronounced as the difference between $U$ and the simple linear model gets more prominent. Figure 3.4 shows the median curve comparison when $U$ is chosen as (iii) with $(n, N) = (60, 120)$. In both examples of this figure, $\hat m_1(z)$ completely fails to capture the main profile of the underlying regression function. Note that in all the cases listed in Table 3.1, $\hat m_2(z)$ is totally incorrect, which indicates that the proposed method is necessary.
4. A Real-Data Example
In this section, we apply the proposed approach to a dataset of enzyme reaction speeds collected in 1974. The reaction speed ($Y$) is calculated from the particle count of radioactive matter produced by the reaction per minute at a given basal density ($t$). The objective of this analysis is to relate $Y$ to the basal density $t$. There are two ways to measure the basal density: a simple chemical method can be used, but it cannot assess the basal density accurately because of measurement errors; alternatively, a precision machine tool and an expensive procedure can be used to produce a more accurate measure $t$ of the basal density for a small subset of subjects enrolled in the study. The basal density obtained by the chemical method is used as the surrogate variable $\tilde t$, and the corresponding exact measure for a small subset of subjects is used as the validation variable $t$.
This dataset has been analyzed by Stute et al. (2007), in which a nonparametric fit is applied to relate $t$ and $\tilde t$ and the following non-linear model is considered as the underlying regression function $m$:
$$ m(t, \beta) = \frac{\beta_1 t}{\beta_2 + t}. \qquad (4.1) $$
The dataset comprises $n = 10$ validation observations and $N = 30$ surrogate observations. Here, we apply the proposed methodology to this dataset. Using the smoothing parameter selection method given in Section 2.4 yields the three parameters $b_n = 0.453$, $h_N = 0.144$ and $q = 3$. Figure 4.1 displays the corresponding estimated curve ($\hat m$) along with the curve produced by the parametric model (4.1) with $\hat\beta_1 = 212.7$ and $\hat\beta_2 = 0.06484$ obtained from Stute et al. (2007). It is readily seen that the two curves have similar patterns, although the estimator we propose looks coarser, because the nonparametric smoother absorbs considerably more degrees of freedom than do parametric approaches. In addition, $\hat m(z)$ differs considerably from the parametric fit at the right boundary. This is not surprising, given the lack of data in the interval $\tilde t \in [0.8, 1.0]$. Thus, again, it should be emphasized that, compared with parametric methods, the proposed method requires a relatively large sample size, especially at the boundary. We think this is becoming a less significant limitation with advances in various technologies: new instruments can capture more information, and large amounts of data have become available in modern statistical analysis.
[Figure 4.1 about here]

Figure 4.1: Our estimator $\hat m(z)$ and the parametric model (4.1) fit of the regression function $m$ (reaction speed versus basal density), represented by the solid and dotted curves respectively.
5. Discussion
In the foregoing investigation, we have assumed without comment that in the validation dataset only the true and surrogate covariables are observed. In fact, in certain applications (e.g., Chen 2002), observations on the response variable are also available. In such situations, several naive nonparametric function estimators are obtainable. One is to directly use the matched $\{(Y_j, t_j)\}_{j=N+1}^{N+n}$, which is apparently inefficient because the size of the validation dataset is usually relatively small. An alternative is to use the proposed approach of Section 2 with the surrogate dataset replaced by the extended sample combining $\{(Y_j, \tilde t_j)\}_{j=N+1}^{N+n}$ and $\{(Y_i, \tilde t_i)\}_{i=1}^{N}$. However, this estimator ignores the information provided by $\{(Y_j, t_j)\}_{j=N+1}^{N+n}$ and hence may not be efficient either. How to construct an estimator that fully incorporates the information in the data in this case warrants further research.
Another extension is to estimate a nonparametric curve in a high-dimensional space when multiple covariates are contaminated with non-additive errors, using validation data. As we know, the Fourier transformation is not directly applicable in this situation. To avoid the "curse of dimensionality", several dimension reduction models, such as the additive model, the single-index model, the partially linear model and the varying coefficient model, have been thoroughly studied and widely applied in practice. It is of interest to incorporate these dimension reduction methods into our proposed methodology to solve high-dimensional problems with validation data. In particular, the case in which only one covariate is measured with error in a multi-covariate model, say $y = m(t, \mathbf{x}) + \varepsilon$ where $t$ is measured with error but $\mathbf{x}$ is measured exactly, often occurs with real data. How to deal with this kind of problem is an important and interesting topic for future study. For instance, consider the commonly used partially linear model $y = m(t) + \mathbf{x}^{\top}\beta + \varepsilon$ and the error model $\tilde{t} = U(t) + \eta$. To solve this EV problem, following the idea in Section 2, we may first rewrite the model as $y - \mathbf{x}^{\top}\beta = M(U(t)) + \xi$ and regard $y - \mathbf{x}^{\top}\beta$ as a working response $Y(\beta)$, which is a convention in partially linear model inference. The next step is to estimate the coefficient $\beta$ and the function $M(z)$ simultaneously. Finally, the Fourier transformation method suggested in Delaigle et al. (2006) can again be employed to recover the nonparametric function $m(z)$.
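The two-stage calibration that these extensions build on — estimate $U$ by a local linear smoother on the validation data, then regress the response on the calibrated covariate — can be sketched numerically as follows. This is an illustrative sketch only, not the authors' implementation: the error model $U$, the noise levels, the sample sizes and the ad hoc bandwidths are all hypothetical, and the final Fourier-inversion step of Delaigle et al. (2006) is omitted.

```python
import numpy as np

def local_linear(x, y, grid, h):
    """Local linear regression of y on x with a Gaussian kernel,
    evaluated at each point of `grid`."""
    est = np.empty(len(grid))
    for k, z in enumerate(grid):
        w = np.exp(-0.5 * ((x - z) / h) ** 2)          # kernel weights
        X = np.column_stack([np.ones_like(x), x - z])  # local design matrix
        WX = w[:, None] * X
        beta = np.linalg.solve(X.T @ WX, WX.T @ y)     # weighted least squares
        est[k] = beta[0]                               # intercept = fit at z
    return est

rng = np.random.default_rng(0)
n, N = 200, 1000

# Hypothetical error model: true covariate = U(surrogate) + eta.
U = lambda t: t + 0.2 * np.sin(t)

# Validation data: surrogate t_j paired with the true covariate.
t_val = rng.uniform(-np.pi, np.pi, n)
true_val = U(t_val) + 0.1 * rng.standard_normal(n)

# Primary data: response Y_i observed together with the surrogate t_i only.
t_pri = rng.uniform(-np.pi, np.pi, N)
M = np.cos                                   # regression function of U(t)
Y = M(U(t_pri)) + 0.1 * rng.standard_normal(N)

# Step 1: calibrate U on the validation sample (bandwidth plays the role of b_n).
U_hat = local_linear(t_val, true_val, t_pri, h=0.3)

# Step 2: local linear regression of Y on the calibrated covariate U_hat
# (bandwidth plays the role of h_N).
grid = np.linspace(-2.5, 2.5, 50)
M_hat = local_linear(U_hat, Y, grid, h=0.3)

print(float(np.max(np.abs(M_hat - M(grid)))))  # small uniform error on the grid
```

In the actual method a trigonometric-series step would then recover $m$ from the estimate of $M$; the sketch stops at the calibrated regression estimate.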
Acknowledgment
The authors thank the editor, the associate editor, and two referees for many
constructive comments and suggestions which greatly improved the quality of
the paper. This research was supported by the NSF of Tianjin Grant 07JCYBJC04300 and the NNSF of China Grants 10771107 and 10711120448.
Appendix: Proofs of Theorems
Unless otherwise stated, subscript $i$ runs from 1 to $N$ and subscript $j$ from $N+1$ to $N+n$, indexing the surrogate data and the validation data, respectively. Let $C_1, C_2, \ldots$ denote generic positive constants that do not depend on $M$, $h_N$, $b_n$, $n$ or $N$.
As Theorem 1 and its proof play an important role in establishing the other results, we detail the steps of its proof. First, we state the following necessary lemma. In what follows, we assume that the function $U$ is invertible; otherwise, the
terms involving $U^{-1}$ can be expressed as $O(1)$ and the corresponding convergence rate can be achieved.
Lemma 1. Suppose Conditions (C1)–(C6) hold. Then we have

(i) $S_{N0} \stackrel{P}{\longrightarrow} f_U(z)$;

(ii) $S_{N1}^2 = o_p(S_{N0}S_{N2})$ and $S_{N1}S_{N3} = o_p(S_{N2}^2)$;

(iii) $S_{N1}\phi_{N1} = o_p(S_{N2}\phi_{N0})$, where
$$
\phi_{Nk} = \frac{1}{Nh_N}\sum_i K\left(\frac{\hat{U}(t_i)-z}{h_N}\right)\big(\hat{U}(t_i)-z\big)^k\,\xi_i,\qquad k=0,1.
$$
Proof. (i) Using a second-order Taylor expansion, we have
$$
\begin{aligned}
S_{N0} := \hat{f}_U(z) &= \frac{1}{Nh_N}\sum_i K\left(\frac{U(t_i)-z}{h_N}\right) + \frac{1}{Nh_N}\sum_i K'\left(\frac{U(t_i)-z}{h_N}\right)\frac{\hat{U}(t_i)-U(t_i)}{h_N}\\
&\quad + \frac{1}{2Nh_N}\sum_i K''\left(\frac{U(t_i)-z}{h_N}\right)\frac{\big(\hat{U}(t_i)-U(t_i)\big)^2}{h_N^2} + D_4\\
&:= D_1 + D_2 + D_3 + D_4.
\end{aligned}
$$
Standard density estimation theory (Parzen 1962) leads to $D_1 = f_U(z)(1+o_p(1))$. Next, we deal with $D_2$ and $D_3$, respectively.
First, according to the asymptotic properties of the LLKE (Fan and Gijbels 1996), we have
$$
\begin{aligned}
\hat{U}(t_i)-U(t_i) &= \Bigg[\frac{1}{nb_nf_t(t_i)}\sum_j\big(\tilde{t}_j-U(t_j)\big)L\left(\frac{t_j-t_i}{b_n}\right)\\
&\qquad + \frac{1}{2nb_nf_t(t_i)}\sum_j U''(t_i)\big(t_j-t_i\big)^2 L\left(\frac{t_j-t_i}{b_n}\right)\Bigg](1+o_p(1))\\
&= (U_D+U_B)(1+o_p(1)). \qquad (A.1)
\end{aligned}
$$
Following Condition (C5), the uniform convergence rate of the local linear smoother (e.g., Carroll et al. 1997) reveals that
$$
\sup_{t\in\Omega}\big|\hat{U}(t)-U(t)\big| = O_p(\gamma_n). \qquad (A.2)
$$
Then $D_2$ can be decomposed into the following two terms,
$$
D_2 = \left(\frac{1}{Nh_N^2}\sum_i K'\left(\frac{U(t_i)-z}{h_N}\right)U_D + \frac{1}{Nh_N^2}\sum_i K'\left(\frac{U(t_i)-z}{h_N}\right)U_B\right)(1+o_p(1)) := [D_{2V}+D_{2B}](1+o_p(1)).
$$
Simple calculations yield that
$$
\begin{aligned}
D_{2B} &= \frac{b_n^2}{2Nh_N^2}\sum_i U''(t_i)K'\left(\frac{U(t_i)-z}{h_N}\right)\int w^2L(w)\,dw\\
&= U''\big(U^{-1}(z)\big)f_U(z)\int w^2L(w)\,dw\int K'(w)\,dw\,\frac{b_n^2}{2h_N}\,(1+o_p(1)), \qquad (A.3)
\end{aligned}
$$
$$
\begin{aligned}
D_{2V} &= \frac{1}{nb_nh_N^2}\sum_j\big(\tilde{t}_j-U(t_j)\big)\,E\!\left[\frac{K'\left(\frac{U(t)-z}{h_N}\right)L\left(\frac{t_j-t}{b_n}\right)}{f_t(t)}\,\Bigg|\;t_j\right]\\
&\quad + \frac{1}{nb_nh_N^2}\sum_j\big(\tilde{t}_j-U(t_j)\big)\,\frac{1}{N}\sum_i\left\{\frac{K'\left(\frac{U(t_i)-z}{h_N}\right)L\left(\frac{t_j-t_i}{b_n}\right)}{f_t(t_i)} - E\!\left[\frac{K'\left(\frac{U(t)-z}{h_N}\right)L\left(\frac{t_j-t}{b_n}\right)}{f_t(t)}\,\Bigg|\;t_j\right]\right\}\\
&:= D_{2V1} + \frac{1}{nb_nh_N^2}\sum_j\eta_j\,\frac{1}{N}\sum_i\zeta_{ij} := D_{2V1}+D_{2V2}.
\end{aligned}
$$
Note that, given $i$, $\{\zeta_{ij},\ j=N+1,\ldots,N+n\}$ are independent, and likewise, given $j$, $\{\zeta_{ij},\ i=1,\ldots,N\}$ are independent. Thus, we have
$$
E\big(D_{2V1}^2\big) = \frac{1}{n^2h_N^4}\sum_j E\left[\big(\tilde{t}_j-U(t_j)\big)^2K'^2\left(\frac{U(t_j)-z}{h_N}\right)\right] = \lambda\,\frac{\operatorname{Var}(\eta)f_U(z)}{Nh_N^3}\int K'^2(w)\,dw, \qquad (A.4)
$$
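For completeness, the first equality leading to (A.4) rests on a standard kernel calculation; a sketch, under the smoothness conditions on $L$, $f_t$ and $U$ and with $L$ integrating to one:
$$
E\!\left[\frac{K'\left(\frac{U(t)-z}{h_N}\right)L\left(\frac{t_j-t}{b_n}\right)}{f_t(t)}\,\Bigg|\;t_j\right] = \int K'\left(\frac{U(t)-z}{h_N}\right)L\left(\frac{t_j-t}{b_n}\right)dt = b_nK'\left(\frac{U(t_j)-z}{h_N}\right)(1+o(1)),
$$
so that $D_{2V1} = \frac{1}{nh_N^2}\sum_j\eta_jK'\left(\frac{U(t_j)-z}{h_N}\right)(1+o(1))$, whose second moment gives (A.4).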
and
$$
\begin{aligned}
E\big[D_{2V2}^2\big] &= \frac{1}{N^2n^2b_n^2h_N^4}\sum_j E\left[\eta_j^2\Big(\sum_i\zeta_{ij}\Big)^2\right]\\
&= \frac{1}{N^2n^2b_n^2h_N^4}\sum_j E\left[E\big[\eta_j^2\,\big|\,t_j\big]\sum_i E\big[\zeta_{ij}^2\,\big|\,t_j\big]\right]\\
&= \frac{1}{N^2n^2b_n^2h_N^4}\sum_j E\left[E\big[\eta_j^2\,\big|\,t_j\big]\sum_i E\!\left[\left(\frac{K'\left(\frac{U(t_i)-z}{h_N}\right)L\left(\frac{t_j-t_i}{b_n}\right)}{f_t(t_i)}\right)^2\Bigg|\;t_j\right]\right](1+o(1))\\
&= \frac{\operatorname{Var}(\eta)}{Nnb_nh_N^3}\,\frac{f_U(z)}{f_t\big(U^{-1}(z)\big)}\int K'^2(w)\,dw\int L^2(w)\,dw. \qquad (A.5)
\end{aligned}
$$
Finally, using (A.2) we have
$$
\begin{aligned}
E\big[|D_3|\big] &\le \frac{1}{2Nh_N^3}E\left[\sum_i\left|K''\left(\frac{U(t_i)-z}{h_N}\right)\right|\big(\hat{U}(t_i)-U(t_i)\big)^2\right]\\
&\le \sup_{t\in\Omega}\big(\hat{U}(t)-U(t)\big)^2\,\frac{1}{2Nh_N^3}\sum_i E\left[\left|K''\left(\frac{U(t_i)-z}{h_N}\right)\right|\right] = O_p\left(\frac{\gamma_n^2}{h_N^2}\right),
\end{aligned}
$$
where we use the independence between $\eta_j$ and $t_j$. As a consequence, Condition (C5) and the Markov inequality lead to $D_3\to 0$. Using Condition (C5) and similar but more cumbersome arguments, we can derive $D_4 = o_p(D_3)$. Combining these with Eqs. (A.1)–(A.5), we complete the proof of (i).

(ii)
By arguments similar to those in (i) but with more tedious algebra, together with Conditions (C2) and (C5), we can obtain
$$
S_{N1} = O_p\big(h_N^2+b_n^2+\gamma_n^2/h_N\big),\qquad
S_{N2} = O_p\big(h_N^2+\gamma_nh_N\big),\qquad
S_{N3} = O_p\big(h_N^4+\gamma_nh_N^2\big),
$$
which directly lead to result (ii).
(iii) The proof follows from similar but tedious calculations, and hence is omitted. $\Box$
Proof of Theorem 1:
By Lemma 1, collecting the leading terms, $\hat{M}(z)-M(z)$ can be represented in the following form,
$$
\hat{M}(z)-M(z) = \frac{\frac{1}{Nh_N}\sum_iK\left(\frac{\hat{U}(t_i)-z}{h_N}\right)\big[Y_i-M(\hat{U}(t_i))\big] + \frac{M''(z)}{2Nh_N}\sum_iK\left(\frac{\hat{U}(t_i)-z}{h_N}\right)\big[\hat{U}(t_i)-z\big]^2}{f_U(z)}\,(1+o_p(1)) := \frac{M_V+M_B}{f_U(z)}(1+o_p(1)),
$$
where $M_B$ and $M_V$ can be rewritten as
$$
\begin{aligned}
M_B &= \frac{1}{2Nh_N}\sum_iK\left(\frac{\hat{U}(t_i)-z}{h_N}\right)M''(z)\big[U(t_i)-z\big]^2\\
&\quad + \frac{1}{2Nh_N}\sum_iK\left(\frac{\hat{U}(t_i)-z}{h_N}\right)M''(z)\big[\hat{U}(t_i)-U(t_i)\big]^2\,(1+o_p(1))\\
&:= M_{B1}+M_{B2}(1+o_p(1)),\\[4pt]
M_V &= \frac{1}{Nh_N}\sum_iK\left(\frac{\hat{U}(t_i)-z}{h_N}\right)\big[Y_i-M(U(t_i))\big] - \frac{1}{Nh_N}\sum_iK\left(\frac{\hat{U}(t_i)-z}{h_N}\right)\big[M(\hat{U}(t_i))-M(U(t_i))\big]\\
&:= M_{V1}-M_{V2}.
\end{aligned}
$$
For $M_{B1}$, a Taylor expansion of the kernel function yields
$$
\begin{aligned}
M_{B1} &= \frac{M''(z)}{2Nh_N}\sum_iK\left(\frac{U(t_i)-z}{h_N}\right)\big(U(t_i)-z\big)^2\\
&\quad + \frac{M''(z)}{2Nh_N^2}\sum_iK'\left(\frac{U(t_i)-z}{h_N}\right)\big(\hat{U}(t_i)-U(t_i)\big)\big(U(t_i)-z\big)^2\,(1+o_p(1))\\
&:= M_{B11}+M_{B12}(1+o_p(1)).
\end{aligned}
$$
Simple calculations yield that
$$
M_{B11} = \frac{1}{2h_N}E\left[K\left(\frac{U(t)-z}{h_N}\right)M''(z)\big(U(t)-z\big)^2\right](1+o_p(1)) = \frac{1}{2}\int K(w)w^2\,dw\,M''(z)f_U(z)\,h_N^2 + o_p(h_N^2),
$$
which is the same as the bias term of the LLKE. As for $M_{B12}$, using (A.2),
$$
M_{B12} \le \sup_{t\in\Omega}\big(\hat{U}(t)-U(t)\big)\,\big|M''(z)\big|\,\frac{1}{2Nh_N^2}\sum_iK'\left(\frac{U(t_i)-z}{h_N}\right)\big(U(t_i)-z\big)^2 = O_p(\gamma_nh_N).
$$
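The evaluation of $M_{B11}$ above uses the standard change of variables $w=(U(t)-z)/h_N$; a sketch, assuming $f_U$ is continuous at $z$:
$$
\frac{1}{2h_N}E\left[K\left(\frac{U(t)-z}{h_N}\right)M''(z)\big(U(t)-z\big)^2\right] = \frac{M''(z)}{2h_N}\int K\left(\frac{u-z}{h_N}\right)(u-z)^2f_U(u)\,du = \frac{M''(z)}{2}\,h_N^2f_U(z)\int K(w)w^2\,dw\,(1+o(1)).
$$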
Hence, we conclude that
$$
M_{B1} = \frac{1}{2}\int K(w)w^2\,dw\,M''(z)f_U(z)\,h_N^2 + o_p(h_N^2). \qquad (A.6)
$$
Similarly, we can also obtain
$$
M_{B2} = O_p(\gamma_n^2). \qquad (A.7)
$$
Eqs. (A.6)–(A.7) and Condition (C5) together lead to
$$
M_B = \frac{1}{2}\int K(w)w^2\,dw\,M''(z)f_U(z)\,h_N^2 + o_p(h_N^2). \qquad (A.8)
$$
Now we turn to analyzing the term $M_V$. First, by using a second-order Taylor expansion and (A.1), we can prove that
$$
M_{V1} = \Delta_1 + O_p\big(b_n^2(Nh_N^3)^{-1/2} + (nb_nNh_N)^{-1/2}\big), \qquad (A.9)
$$
where $\Delta_1 = \frac{1}{Nh_N}\sum_iK\left(\frac{U(t_i)-z}{h_N}\right)\xi_i$. A simple calculation yields that
$$
\operatorname{Var}(M_{V1}) = \frac{\operatorname{Var}(\xi)f_U(z)}{Nh_N}\int K^2(w)\,dw\,(1+o(1)). \qquad (A.10)
$$
Finally, let us focus on the term $M_{V2}$, which constitutes the main part of the variance:
$$
\begin{aligned}
M_{V2} &= \Bigg[\frac{1}{Nh_N}\sum_iK\left(\frac{U(t_i)-z}{h_N}\right)M'(U(t_i))\big(\hat{U}(t_i)-U(t_i)\big)\\
&\quad + \frac{1}{Nh_N^2}\sum_iK'\left(\frac{U(t_i)-z}{h_N}\right)M'(U(t_i))\big(\hat{U}(t_i)-U(t_i)\big)^2\Bigg](1+o_p(1))\\
&:= \big[M_{V21}+M_{V22}\big](1+o_p(1)),
\end{aligned}
$$
where we can show that $M_{V22} = O_p\big(\gamma_n^2/h_N\big)$, while
$$
\begin{aligned}
M_{V21} &= \left[\frac{1}{Nh_N}\sum_iK\left(\frac{U(t_i)-z}{h_N}\right)M'(U(t_i))(U_D+U_B)\right](1+o_p(1))\\
&= \Bigg[\frac{1}{nh_N}\sum_j\eta_jK\left(\frac{U(t_j)-z}{h_N}\right)M'(U(t_j))\\
&\quad + \frac{1}{2Nh_N}\sum_iK\left(\frac{U(t_i)-z}{h_N}\right)M'(U(t_i))U''(t_i)\,b_n^2\int L(w)w^2\,dw\Bigg](1+o_p(1))\\
&=: \left[\Delta_2+\frac{1}{2}M'(z)U''\big(U^{-1}(z)\big)f_U(z)\,b_n^2\int L(w)w^2\,dw\right](1+o_p(1)). \qquad (A.11)
\end{aligned}
$$
Note that
$$
E[\Delta_2] = 0,\qquad E[\Delta_2^2] = \frac{\operatorname{Var}(\eta)}{nh_N}\,M'^2(z)f_U(z)\int K^2(w)\,dw. \qquad (A.12)
$$
Recalling the independence between the validation data and the surrogate data and combining Eqs. (A.8)–(A.12), the central limit theorem leads to the result. $\Box$
Proof of Theorem 2:
Based on the proof of Theorem 1, we can see that
$$
\hat{M}(z)-M(z) = \big(\Delta_1(z)+\Delta_2(z)+B(z)\big)(1+o_p(1)), \qquad (A.13)
$$
where $\Delta_1$, $\Delta_2$ and $B$ are defined in (A.9), (A.11) and (2.12), respectively. By using (A.13) it can be seen that
$$
\begin{aligned}
\big|\operatorname{Cov}\big[\hat{M}(z_1),\hat{M}(z_2)\big]\big| &\le \left|\frac{\operatorname{Var}(\eta)}{n^2h_N^2}\sum_jK\left(\frac{z_1-U(t_j)}{h_N}\right)K\left(\frac{z_2-U(t_j)}{h_N}\right)\big[M'(U(t_j))\big]^2\right|\\
&\quad + \left|\frac{\operatorname{Var}(\xi)}{N^2h_N^2}\sum_iK\left(\frac{z_1-U(t_i)}{h_N}\right)K\left(\frac{z_2-U(t_i)}{h_N}\right)\right| + o\big(h_N^4+b_n^4+(Nh_N)^{-1}\big)\\
&\le \frac{C_1}{Nh_N}\,K*K\left(\frac{z_2-z_1}{h_N}\right) + o\left(h_N^4+b_n^4+\frac{1}{Nh_N}\right). \qquad (A.14)
\end{aligned}
$$
Eqs. (A.13), (A.14) and the definitions of $\hat{M}_0$ and $\hat{M}_{kl}$ imply that
$$
E\big(\hat{M}_0-M_0\big)^2 \le C_2N^{-1}+C_3h_N^4+C_4b_n^4,\qquad
E\big(\hat{M}_{kl}-M_{kl}\big)^2 \le C_2N^{-1}+C_3h_N^4+C_4b_n^4, \qquad (A.15)
$$
uniformly in $k$ and $l$. Hence, we have
$$
\begin{aligned}
\int_\Omega E\big(\hat{m}(t)-m(t)\big)^2\,dt &= \int_\Omega E\Big(\big(\hat{m}_0-m_0\big)+\sum_{l=1}^{q}\big[(\hat{m}_{1l}-m_{1l})\cos(lt)+(\hat{m}_{2l}-m_{2l})\sin(lt)\big]\Big)^2dt\\
&\quad + \int_\Omega\Big(\sum_{l=q+1}^{\infty}\big[m_{1l}\cos(lt)+m_{2l}\sin(lt)\big]\Big)^2dt\\
&= 2\pi E\big(\hat{m}_0-m_0\big)^2+\pi\sum_{l=1}^{q}\sum_{k=1}^{2}E\big(\hat{m}_{kl}-m_{kl}\big)^2+R_q.
\end{aligned}
$$
Under Conditions (C7) and (C8), by some modifications of the proof of Theorem 1 in Akritas and Van Keilegom (2001), we can show that
$$
E\big(\hat{\alpha}_{kj}-\alpha_{kj}\big)^2 \le C_5n^{-1}+C_6b_n^4,
$$
uniformly in $k$ and $j$. This property, the conditions given in Theorem 2 and Eq. (A.15) yield that
$$
\sum_{l=1}^{q}\sum_{k=1}^{2}E\big(\hat{m}_{kl}-m_{kl}\big)^2 \le \big(C_7N^{-1}+C_8h_N^4+C_9b_n^4\big)\sum_{l=1}^{q}l^{2b}.
$$
Note that by the general theory of trigonometric series we have
$$
\sum_{l=q+1}^{\infty}\big[m_{1l}\cos(lt)+m_{2l}\sin(lt)\big] \le C_{10}\,q^{-2-s}
$$
uniformly in $t$. Therefore, $R_q$ is of the order $q^{-2-s}$. Taking all the above results
together, we can complete the proof. $\Box$

References

Akritas, M. G. and Van Keilegom, I. (2001), Non-parametric Estimation of the Residual Distribution. Scand. J. Stat., 28, 549–568.
Berkson, J. (1950), Are There Two Regression Problems?. J. Am. Stat. Assoc., 45, 164–180.

Carroll, R. J., Fan, J., Gijbels, I. and Wand, M. P. (1997), Generalized Partially Linear Single-Index Models. J. Am. Stat. Assoc., 92, 477–489.

Carroll, R. J., Knickerbocker, R. K. and Wang, C. (1995), Dimension Reduction in a Semiparametric Regression Model with Errors in Covariates. Ann. Stat., 23, 161–181.

Carroll, R. J., Maca, J. D. and Ruppert, D. (1999), Nonparametric Regression in the Presence of Measurement Error. Biometrika, 86, 541–554.

Carroll, R. J., Ruppert, D., Stefanski, L. A. and Crainiceanu, C. (2006), Measurement Error in Nonlinear Models: A Modern Perspective, Second Edition. Chapman and Hall, London.

Carroll, R. J. and Stefanski, L. A. (1990), Approximate Quasi-likelihood Estimation in Models with Surrogate Predictors. J. Am. Stat. Assoc., 85, 652–663.

Carroll, R. J. and Wand, M. P. (1991), Semiparametric Estimation in Logistic Measurement Error Models. J. R. Stat. Soc. Ser. B, 53, 573–585.

Chen, Y. (2002), Cox Regression in Cohort Studies With Validation Sampling. J. R. Stat. Soc. Ser. B, 64, 51–62.

Cook, J. R. and Stefanski, L. A. (1994), Simulation-Extrapolation Estimation in Parametric Measurement Error Models. J. Am. Stat. Assoc., 89, 1314–1328.

Delaigle, A., Fan, J. and Carroll, R. J. (2009), A Design-Adaptive Local Polynomial Estimator for the Errors-in-Variables Problem. J. Am. Stat. Assoc., 104, 348–359.

Delaigle, A., Hall, P. and Qiu, P. (2006), Nonparametric Methods for Solving the Berkson Errors-in-Variables Problem. J. R. Stat. Soc. Ser. B, 68, 201–220.

Fan, J. (1992), Design-Adaptive Nonparametric Regression. J. Am. Stat. Assoc., 87, 998–1004.

Fan, J. and Gijbels, I. (1996), Local Polynomial Modelling and Its Applications. Chapman and Hall, London.

Fan, J. and Truong, Y. K. (1993), Nonparametric Regression with Errors in Variables. Ann. Stat., 21, 1900–1925.
Fuller, W. A. (1987), Measurement Error Models. John Wiley, New York.

Hardle, W., Hall, P. and Marron, J. S. (1988), How Far Are Automatically Chosen Regression Smoothing Parameters from Their Optimum? (with discussion). J. Am. Stat. Assoc., 83, 86–99.

Lee, L. F. and Sepanski, J. (1995), Estimation of Linear and Nonlinear Errors-in-Variables Models Using Validation Data. J. Am. Stat. Assoc., 90, 130–140.

Parzen, E. (1962), On Estimation of a Probability Density Function and Mode. Ann. Math. Stat., 33, 1065–1076.

Pepe, M. S. (1992), Inference Using Surrogate Outcome Data and a Validation Sample. Biometrika, 79, 355–365.

Pepe, M. S. and Fleming, T. R. (1991), A General Nonparametric Method for Dealing with Errors in Missing or Surrogate Covariate Data. J. Am. Stat. Assoc., 86, 108–113.

Ruppert, D., Sheather, S. J. and Wand, M. P. (1995), An Effective Bandwidth Selector for Local Least Squares Regression. J. Am. Stat. Assoc., 90, 1257–1270.

Sepanski, J. H. and Carroll, R. J. (1993), Semiparametric Quasilikelihood and Variance Function Estimation in Measurement Error Models. J. Econometrics, 58, 223–256.

Sepanski, J. H., Knickerbocker, R. K. and Carroll, R. J. (1994), A Semiparametric Correction for Attenuation. J. Am. Stat. Assoc., 89, 1366–1373.

Sepanski, J. and Lee, L. F. (1995), Semiparametric Estimation of Nonlinear Errors-in-Variables Models with Validation Study. J. Nonparametr. Stat., 4, 365–394.

Staudenmayer, J. and Ruppert, D. (2004), Local Polynomial Regression and Simulation-Extrapolation. J. R. Stat. Soc. Ser. B, 66, 17–30.

Stute, W., Xue, L. and Zhu, L.-X. (2007), Empirical Likelihood Inference in Nonlinear Errors-in-Covariables Models with Validation Data. J. Am. Stat. Assoc., 102, 332–346.

Wang, Q. (1999), Estimation of Partial Linear Errors-in-Variables Models with Validation Data. J. Multivariate Anal., 69, 30–64.

Wang, Q. (2006), Nonparametric Regression Function Estimation with Surrogate Data and Validation Sampling. J. Multivariate Anal., 97, 1142–1161.

Wang, Q. and Rao, J. N. K. (2002), Empirical Likelihood-Based Inference in Linear Errors-in-Covariables Models with Validation Data. Biometrika, 89, 345–358.

Wittes, J., Lakatos, E. and Probstfield, J. (1989), Surrogate Endpoints in Clinical Trials: Cardiovascular Diseases. Stat. Med., 8, 415–425.
LPMC and Department of Statistics, School of Mathematical Sciences, Nankai University, Tianjin 300071, China

E-mail: feilen45@yahoo.com.cn; chlzou@yahoo.com.cn; zjwang@nankai.edu.cn