Accepted Manuscript
Nonparametric tilted density function estimation: A cross-validation criterion
Hassan Doosti, Peter Hall, Jorge Mateu
PII: S0378-3758(17)30215-X
DOI: https://doi.org/10.1016/j.jspi.2017.12.003
Reference: JSPI 5626
To appear in: Journal of Statistical Planning and Inference
Received date: 18 July 2017
Revised date: 18 December 2017
Accepted date: 21 December 2017
Please cite this article as: Doosti H., Hall P., Mateu J., Nonparametric tilted density function estimation: A cross-validation criterion. J. Statist. Plann. Inference (2018), https://doi.org/10.1016/j.jspi.2017.12.003
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Nonparametric Tilted Density Function Estimation: A Cross-validation Criterion
Hassan Doosti1
Macquarie University, Sydney, Australia.
Peter Hall 2
The University of Melbourne, Melbourne, Australia.
Jorge Mateu 3
Universitat Jaume I, Castellon, Spain.
Abstract
In this paper, we propose a tilted estimator for nonparametric estimation of a
density function. We use a cross-validation criterion to choose both the band-
width and the tilted estimator parameters. We demonstrate theoretically that
our proposed estimator provides a convergence rate which is strictly faster than
the usual rate attained using a conventional kernel estimator with a positive
kernel. We investigate the performance through both theoretical and numerical
studies.
Keywords: Cross validation function, Non-parametric density function
estimation, Rate of convergence, Tilted estimators
2010 MSC: 62G07, 62G20
1. Introduction
Motivation. Doosti and Hall (2016) introduced new high-order, non-parametric density estimators based on data perturbation, e.g. by tilting or data sharpening. They proposed choosing the parameters to minimise the integrated squared distance between the new estimators and an adaptive estimator, based for example on the sinc kernel or a flat-top kernel. The new estimators produce more accurate estimates than the high-order methods because they remove the negative parts of the estimator, which always penalise performance. On the other hand, the new estimators are slow to compute. In the current paper we introduce a cross-validation function which is suited to tilted estimators, and show that minimising the corresponding criterion yields estimators with a fast rate of convergence. Theoretically, we prove that their rate of convergence is faster than the usual rate. In practice, the new approach substantially reduces the computational labour.
The first draft of this work was initiated by Peter Hall before he fell ill, and he did not have a chance to see more than that first draft. We dedicate this manuscript to Peter Hall.
1 hassan.doosti@mq.edu.au (Corresponding Author)
2 Deceased 9 January 2016.
3 mateu@mat.uji.es
Preprint submitted to Journal of Multivariate Analysis, December 18, 2017
On density function estimation by tilting. The first version of a tilting-based approach was suggested by Grenander (1956), for enforcing constraints. Chen (1997), Zhang (1998), Müller et al. (2005) and Schick and Wefelmeyer (2009) employed empirical likelihood, introduced by Owen (1988, 1990, 2001), to find probability weights by profiling a multinomial likelihood under a set of constraints. Instead of empirical likelihood-based methods, some authors used a variety of distance measures in this context. In the latter approach the tilted estimator is constructed by minimising these distances, subject to the constraints. Hall and Presnell (1999), Hall and Huang (2001, 2002), Carroll et al. (2011) and Doosti and Hall (2016) used this methodology to find their tilted estimators.
The novelty of our approach compared to the existing approaches is that we use a cross-validation criterion. Furthermore, we propose a new algorithm that helps to find the amount of tilt.
Assume we are given a random sample $\mathcal X = \{X_1, \ldots, X_n\}$, and wish to estimate the density, $f$, of the distribution from which the data were drawn. To define a tilted kernel estimator, let $p = (p_1, \ldots, p_n)$ denote a probability distribution on $n$ points, so that each $p_i \ge 0$ and
\[
\sum_{i=1}^{n} p_i = 1 . \tag{1.1}
\]
Given a random sample $\{X_1, \ldots, X_n\}$, drawn from a distribution with density $f$, a bandwidth $h$ and kernel $K$, we define the tilted kernel density estimator as
\[
\hat f(x \mid h, p) = \frac{1}{h} \sum_{i=1}^{n} p_i\, K\Big(\frac{x - X_i}{h}\Big) . \tag{1.2}
\]
The conventional kernel estimator, $\hat f(\cdot \mid h)$ say, is recovered by taking $p = p_0 \equiv (n^{-1}, \ldots, n^{-1})$, the uniform multinomial distribution:
\[
\hat f(x \mid h) = \hat f(x \mid h, p_0) = \frac{1}{nh} \sum_{i=1}^{n} K\Big(\frac{x - X_i}{h}\Big) . \tag{1.3}
\]
It is common to take $K$ to be a bounded, symmetric probability density, and in that case the estimators at (1.2) and (1.3) are both nonnegative and integrate to 1, and so are themselves proper probability densities. However, the convergence rate of the estimator at (1.3) is restricted severely; it is $O_p(n^{-4/5})$ if $f$ has two bounded derivatives, and generally cannot be improved beyond that point, even if $f$ is infinitely differentiable, without using a kernel $K$ that takes negative values in some parts of its support.
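As a concrete illustration of (1.2) and (1.3), the following minimal sketch evaluates the tilted estimator at a point. It assumes a standard normal kernel; the function names `gaussian_kernel` and `tilted_kde`, the toy data and the weights are illustrative choices, not taken from the paper.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K (an assumption)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def tilted_kde(x, data, h, p=None):
    """Evaluate the tilted estimator (1.2) at x.

    With p=None the uniform weights p0 = (1/n, ..., 1/n) are used,
    which gives the conventional estimator (1.3).
    """
    n = len(data)
    if p is None:
        p = [1.0 / n] * n
    # p must satisfy the constraints around (1.1): nonnegative, summing to 1.
    if len(p) != n or min(p) < 0 or abs(sum(p) - 1.0) > 1e-9:
        raise ValueError("p must be a probability vector of length n")
    return sum(pi * gaussian_kernel((x - xi) / h) for pi, xi in zip(p, data)) / h

# Hypothetical toy data and weights, for illustration only.
data = [-1.2, -0.4, 0.1, 0.5, 1.3]
uniform = tilted_kde(0.0, data, h=0.5)
tilted = tilted_kde(0.0, data, h=0.5, p=[0.1, 0.2, 0.4, 0.2, 0.1])
```

Because the weights are nonnegative and sum to 1 and the kernel is a density, the tilted estimate remains a proper density for any admissible `p`.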
In principle we would like to choose $p$ to minimise the integrated squared error,
\[
\mathrm{ISE}(h, p) = \int \big\{\hat f(x \mid h, p) - f(x)\big\}^2\, w(x)\, dx , \tag{1.4}
\]
where $w$ denotes a nonnegative weight function, and minimisation is undertaken subject to the constraint that $p$ is a proper probability distribution. Since ISE depends on the unknown density $f$, it cannot be minimised directly; in this paper we instead use cross-validation to choose $h$ and $p$.
Paper organisation. The rest of this paper is organised as follows. In Section 2, we present our cross-validation function. The main theoretical results are described in Section 3, and Section 4 is devoted to the numerical performance of our estimator. The proofs of the technical results appear in an Appendix at the end of the paper.
2. Cross-validation function
The cross-validation criterion that we use is given by
\[
CV(h, p) = \int \hat f(x \mid h, p)^2\, dx - \frac{2}{n} \sum_{i=1}^{n} \hat f_{-i}(X_i \mid h, p) , \tag{2.1}
\]
where
\[
\hat f_{-i}(x \mid h, p) = \frac{1}{h} \sum_{j \,:\, j \ne i} p_j\, K\Big(\frac{x - X_j}{h}\Big) \tag{2.2}
\]
and, as before, $p = (p_1, \ldots, p_n)$ is a probability measure on $n$ points.
The density estimator $\hat f_{-i}(x \mid h, p)$, defined at (2.2), is a leave-one-out version of $\hat f(x \mid h, p)$, computed using the dataset $\mathcal X \setminus \{X_i\}$. However, the vector $p$ used in (2.2) is not re-standardised so that $\sum_{j \,:\, j \ne i} p_j = 1$, since the latter step turns out to be unnecessary. This reduces computational labour.
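The criterion (2.1) can be computed exactly when $K$ is the standard normal density, because the integral of the product of two $N(\cdot, h^2)$ densities centred at $X_i$ and $X_j$ equals the $N(0, 2h^2)$ density at $X_i - X_j$. The sketch below relies on that assumption (a Gaussian kernel); `normal_pdf` and `cv_criterion` are hypothetical names, and the double loop is written for clarity rather than speed.

```python
import math

def normal_pdf(u, s):
    """Density of N(0, s^2) at u; equals (1/s) K(u/s) for the standard normal K."""
    return math.exp(-0.5 * (u / s) ** 2) / (s * math.sqrt(2.0 * math.pi))

def cv_criterion(data, h, p):
    """CV(h, p) of (2.1) with the leave-one-out estimator (2.2), Gaussian kernel.

    The weights p are used as they are in the leave-one-out sums,
    i.e. they are not re-standardised after dropping p[i].
    """
    n = len(data)
    integral_sq = 0.0  # closed form of the integral of the squared estimator
    loo_sum = 0.0      # sum over i of the leave-one-out estimate at data[i]
    for i in range(n):
        for j in range(n):
            d = data[i] - data[j]
            integral_sq += p[i] * p[j] * normal_pdf(d, h * math.sqrt(2.0))
            if j != i:
                loo_sum += p[j] * normal_pdf(d, h)
    return integral_sq - 2.0 * loo_sum / n

# Hypothetical toy input, for illustration only.
cv_value = cv_criterion([-1.0, 0.2, 0.9], h=0.6, p=[0.2, 0.5, 0.3])
```

For kernels without such a closed form, the first term would have to be obtained by numerical integration instead.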
Recall the definition of the integrated squared error, ISE, at (1.4), and that $p_0$ denotes the uniform probability measure. For simplicity, take the weight function in (1.4) equal to 1. It is well known that if $f$ and $f''$ are bounded and square integrable, then, under mild assumptions on $h$ and $K$, the following property of integrated squared error holds:
\[
\mathrm{ISE}(h, p_0) \asymp_p (nh)^{-1} + h^4 , \tag{2.3}
\]
where the notation $A_n(h) \asymp_p B_n(h)$ means that, whenever $h = h(n) \to 0$ such that $nh \to \infty$, the ratio $R_n = A_n(h)/B_n(h)$ satisfies
\[
\lim_{C \to \infty} \liminf_{n \to \infty} P\big(C^{-1} \le R_n \le C\big) = 1 .
\]
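The familiar $n^{-4/5}$ rate follows from balancing the two terms on the right-hand side of (2.3). As a short worked calculation (the function $g$ and the minimiser $h^*$ are notation introduced here for illustration only):

```latex
\[
g(h) = (nh)^{-1} + h^4 , \qquad
g'(h) = -\frac{1}{n h^2} + 4 h^3 = 0
\;\Longrightarrow\; h^* = (4n)^{-1/5} ,
\]
\[
g(h^*) = \frac{(4n)^{1/5}}{n} + (4n)^{-4/5}
       = \big(4^{1/5} + 4^{-4/5}\big)\, n^{-4/5}
       = 5 \cdot 4^{-4/5}\, n^{-4/5} .
\]
```

So the best attainable order of $(nh)^{-1} + h^4$ is $n^{-4/5}$, achieved with a bandwidth of order $n^{-1/5}$.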
In Section 3 we shall state a theorem which asserts that, for a quantity $Q_n$ not depending on $h$ or $p$, and for a class $\mathcal H$ of bandwidths $h$ and a class $\mathcal P$ of probability measures $p$,
\[
\sup_{p \in \mathcal P} \bigg| \frac{1}{n} \sum_{i=1}^{n} \hat f_{-i}(X_i \mid h, p) - \int \hat f(x \mid h, p)\, f(x)\, dx - Q_n \bigg| = o_p(1) . \tag{2.4}
\]
It follows from (2.3) and (2.4) that, since
\[
\mathrm{ISE}(h, p) = \int \hat f(x \mid h, p)^2\, dx - 2 \int \hat f(x \mid h, p)\, f(x)\, dx + \int f(x)^2\, dx , \tag{2.5}
\]
then by minimising the cross-validation criterion over $h$ and $p$ together, rather than over $h$ alone as in conventional applications of cross-validation, we can find the bandwidth $h$ and multinomial probability distribution $p$ that allow us to reduce ISE to a quantity of strictly smaller order than $\min_h \{(nh)^{-1} + h^4\} \asymp n^{-4/5}$. That is, $\mathrm{ISE}(h, p) = o_p(n^{-4/5})$ holds. In view of (2.3), $n^{-4/5}$ is the order of ISE for the conventional estimator $\hat f(\cdot \mid h)$, and so by using cross-validation to choose $p$ we are able to strictly improve the performance of $\hat f(\cdot \mid h)$ as measured by ISE.
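The link between the cross-validation criterion and ISE can be written out in one line. In the sketch below, $R_n(h, p)$ is shorthand introduced here (not in the original) for the quantity inside the absolute value in (2.4):

```latex
\[
R_n(h, p) = \frac{1}{n} \sum_{i=1}^{n} \hat f_{-i}(X_i \mid h, p)
            - \int \hat f(x \mid h, p)\, f(x)\, dx - Q_n ,
\]
\[
CV(h, p)
 = \int \hat f(x \mid h, p)^2\, dx - 2 \int \hat f(x \mid h, p)\, f(x)\, dx
   - 2 Q_n - 2 R_n(h, p)
 = \mathrm{ISE}(h, p) - \int f(x)^2\, dx - 2 Q_n - 2 R_n(h, p) .
\]
```

Neither $\int f^2$ nor $Q_n$ depends on $(h, p)$, so minimising $CV(h, p)$ amounts to minimising $\mathrm{ISE}(h, p)$ up to the remainder $2 R_n(h, p)$, which is the term controlled by (2.4).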
In practice, we choose $(h, p)$ sequentially as follows. First we take $p = p_0$, choose $h = \hat h$ to minimise $CV(h, p_0)$, and put $(h^{(1)}, p^{(1)}) = (\hat h, p_0)$. This is the conventional cross-validation approach to bandwidth choice. At the $r$th step in the algorithm, if $(h^{(r)}, p^{(r)})$ has been selected already, with $p^{(r)}$ constructed using $m = r$ in a sparse interpolation algorithm (see Algorithm A below), choose $(h^{(r+1)}, p^{(r+1)})$ to minimise $CV(h, p)$, subject to
\[
h^{(r+1)} \ge \rho\, h^{(r)} , \tag{2.6}
\]
to $p$ having $m = r + 1$ in Algorithm A below, and to $p^{(r+1)}$ being a proper probability measure. Here, $\rho \in (0, 1)$ is a predetermined constant, and its range reflects the fact that, in asymptotic terms, the optimal bandwidth increases as the role of bias decreases.
Algorithm A is a piecewise-constant sparse interpolation algorithm.
Algorithm A (AA): Let $X_{(1)} \le \ldots \le X_{(n)}$ denote an ordering of the data $X_1, \ldots, X_n$, and, given an integer $m$ in the range $2 \le m < n$, let $1 < i_1 < \ldots < i_m = n$ be approximately regularly spaced between 0 and $n$, in the sense that, for $1 \le j \le m$, we have $C_1\, n/m \le i_j - i_{j-1} \le C_2\, n/m$, where $i_0 = 0$ and $X_{(i_0)} = -\infty$. Take $p_i$ to be constant, in particular to have the value $q_j$, say, for $i_{j-1} < i \le i_j$, where $1 \le j \le m$.
Here and below we write $C_1, C_2, \ldots$ for positive constants. Of course, modifications of AA are possible, including those where the weights are distributed approximately symmetrically in the two tails. It follows from AA that the constraints $\sum_i p_i = 1$ and $p_i \ge 0$ are equivalent to
\[
\sum_{j=1}^{m} (i_j - i_{j-1})\, q_j = 1 , \qquad q_1, \ldots, q_m \ge 0 . \tag{2.7}
\]
In practice it is often desirable to choose $i_1, \ldots, i_m$ so that some $X_{(i_j)}$ are close to each mode of $f$. This can be determined approximately by constructing a pilot estimator of $f$, for example $\hat f(\cdot \mid h)$.
Algorithm A can be replaced by its obvious linear interpolation counterpart, with almost no impact on numerical results and no impact on the theoretical properties that we shall discuss in Section 3.
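The sequential choice of $(h, p)$ described in this section can be sketched for its first two steps. Everything specific here is an assumption made for illustration: a Gaussian kernel (so that the first term of the criterion has a closed form), finite search grids, two equal-sized blocks for $m = 2$, $\rho = 0.8$, and simulated standard normal data. The function names are hypothetical.

```python
import math
import random

def normal_pdf(u, s):
    """Density of N(0, s^2) at u; (1/s) K(u/s) for the standard normal K."""
    return math.exp(-0.5 * (u / s) ** 2) / (s * math.sqrt(2.0 * math.pi))

def cv_criterion(x, h, p):
    """CV(h, p) of (2.1) for a Gaussian kernel; p[i] is the weight of x[i]."""
    n = len(x)
    integral_sq, loo_sum = 0.0, 0.0
    for i in range(n):
        for j in range(n):
            d = x[i] - x[j]
            integral_sq += p[i] * p[j] * normal_pdf(d, h * math.sqrt(2.0))
            if j != i:
                loo_sum += p[j] * normal_pdf(d, h)
    return integral_sq - 2.0 * loo_sum / n

def two_block_weights(n, q1):
    """m = 2 blocks of the ordered data; q2 is fixed by the constraint (2.7)."""
    n1 = n // 2
    q2 = (1.0 - n1 * q1) / (n - n1)
    return [q1] * n1 + [q2] * (n - n1)

def grid_minimise(x, h_grid, candidates):
    """Return (cv, h, p) minimising CV over the supplied finite grids."""
    return min(((cv_criterion(x, h, p), h, p)
                for h in h_grid for p in candidates), key=lambda t: t[0])

random.seed(0)
x = sorted(random.gauss(0.0, 1.0) for _ in range(30))  # ordered sample
n = len(x)
h_grid = [0.15 + 0.05 * k for k in range(20)]

# Step 1: uniform weights p0, i.e. conventional cross-validation for h.
cv1, h1, p1 = grid_minimise(x, h_grid, [[1.0 / n] * n])

# Step 2: m = 2, with the bandwidth restricted by (2.6), h >= rho * h1.
rho = 0.8
h_grid2 = [h for h in h_grid if h >= rho * h1]
candidates = [two_block_weights(n, (0.5 + 0.1 * k) / n) for k in range(11)]
cv2, h2, p2 = grid_minimise(x, h_grid2, candidates)
```

Because the step-2 grid contains the uniform weights and the step-1 bandwidth, the second step cannot increase the criterion. A continuous optimiser could replace the grid search; the grids are used here only to keep the sketch dependency-free.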
3. Theoretical properties
In this section we state assumptions under which (2.4) holds. Given constants $c_1, c_2 > 0$ we define
\[
\mathcal H = \big[\, n^{-(1/5) - c_1} ,\; n^{-(1/5) + c_1} \,\big] , \tag{3.1}
\]
and we take $\mathcal P$ to be the class of all probability measures $p = (p_1, \ldots, p_n)$ that satisfy
\[
\max_{1 \le i \le n} \big| p_i - n^{-1} \big| \le h^2\, n^{c_2} \tag{3.2}
\]
and are constructed by sparse interpolation (see AA) from values of $p_i$, for $i$ in a sequence of integers $i_1, \ldots, i_m$. The manner in which the definition of $\mathcal H$ at (3.1) involves $n^{-1/5}$ reflects the fact that we permit $\mathcal H$ to contain bandwidths that are of both smaller and larger magnitude than the optimal size, $n^{-1/5}$, which minimises the order of magnitude of the integrated squared error of $\hat f(\cdot \mid h)$ (see (2.3)). Note that the presence of the factor $n^{c_2}$ in (3.2) acknowledges that, for some densities $f$, we can choose $p_i$ with more flexibility than in the formulae proposed by Doosti and Hall (2016).
With $p = (p_1, \ldots, p_n)$ defined in terms of $q = (q_1, \ldots, q_m)$ as in AA, we choose $h$ and $q$ together to minimise $CV(h, p)$, subject to (2.7).
In addition to AA, we impose the following condition:
Condition B (CB): (a) $f$, $|f'|$ and $|f''|$ are uniformly bounded; (b) $K$ is a bounded, symmetric, compactly supported and Hölder continuous probability density; (c) $\max_{1 \le i \le n} p_i \le C_3\, n^{-1}$; and (d) $m \le n^{c_3}$ where $c_3 < 1/2$.
Assumptions (a) and (b) are conventional in kernel density estimation, assumption (c) simply ensures that the constant $c_2$ in (3.2) is not so large that the $p_i$ can be of larger order than $n^{-1}$, and assumption (d) bounds the rate at which the number of distinct $p_i$ can grow.
Theorem 1. Assume that $\mathcal H$ is as in (3.1), and that $\mathcal P$ is the class of probability measures $p$ satisfying AA and CB. Then the constants $c_1, c_2, c_3 > 0$, in (3.1), (3.2) and CB, respectively, can be chosen so that (2.4) holds, and in fact so that the right-hand side of (2.4) equals $O_p[\{(nh)^{-1} + h^4\}\, n^{-c_4}]$, uniformly in $h \in \mathcal H$ and $p \in \mathcal P$, where $c_4 > 0$.
A proof of Theorem 1 is given in the Appendix at the end of the paper.
4. Simulation Study
4.1 Summary of methods used. A simulation study is carried out to compare our estimators to those employed by Doosti and Hall (2016). In this section we follow the first three examples in Section 4 of Doosti and Hall (2016) to compare the performance of the proposed estimator with that of the competitors already employed there. The same setup is used here for the selection of parameters and datasets. We have three different approaches to tilting, referred to below as (I)–(III). Techniques (I) and (II), proposed by Doosti and Hall (2016), choose p to minimise the L2 distance to estimators computed using the sinc kernel or the trapezoidal kernel, respectively, where the bandwidth for the latter technique is chosen by cross-validation. Method (III) chooses p and the bandwidth directly, and corresponds to our newly proposed estimator. The case m = 2 was also explored.
The following competitors also contribute to the simulation study: (i) a conventional second-order kernel estimator, (ii) the sinc kernel estimator, (iii) the trapezoidal kernel estimator, and (iv) the diffusion estimator. For the definitions of the competitor estimators and the setup for choosing their parameters, see Section 4 of Doosti and Hall (2016). Our proposed estimator, i.e. Method (III), chooses p and the bandwidth directly, using the cross-validation function at (2.1). In each case the standard normal kernel was used to construct the tilted estimator. The performance depends to some extent on the way the values of the $p_i$ are distributed. For example, when m = 2 and the density being estimated is unimodal, it is desirable to have one of the two groups of equal $p_i$ approximately in the middle nine-tenths of the distribution. Analogous arrangements are appropriate for multimodal densities, or for unimodal densities when m > 2, and in each case the appropriate distribution of the groups of equal $p_i$ values can be determined empirically by cross-validation.
4.2. Simulations. For the first example, we employed the same datasets as in Figure 1 in Doosti and Hall (2016). Figure 1, from the top panel to the bottom panel, shows estimators of Beta (3, 3), Beta (5, 3), Beta (3, 6) and Beta (6, 6) densities, respectively. As we pass through the sequence of Beta densities, from top to bottom, the densities become successively smoother at the boundaries. Each panel contains eight curves, representing the true density, the conventional kernel estimator, the sinc and trapezoidal kernel estimators, the diffusion estimator, and tilted estimators (I)–(III).
The following properties can be deduced from the plots in Figure 1, which are given in the case m = n: (a) the conventional and diffusion estimators perform similarly, with the former typically higher than the latter in the vicinity of the mode, except in the Beta (3, 6) case where the diffusion estimator performs poorly due to its bimodality; (b) the sinc and trapezoidal kernel estimators suffer from negative side lobes; (c) apart from the previous problem, as expected the trapezoidal kernel estimator performs similarly to the tilting method (II); and (d) overall, the tilting method (III) arguably is the most satisfactory, since it best captures the peak of the true density, it does not suffer from negative side lobes, and more generally, it is not misled into producing a spurious mode. Results when m = 2 or 3, for methods (I)–(III), are similar.
Figure 1 Beta density function estimation. See text for distribution types. In
each panel the true density is represented by a bold curve, the conventional
kernel estimator by a blue dashed curve, the diffusion estimator by a black line
with triangles, the sinc kernel estimator by a red dotted curve, the trapezoidal
estimator by a green dot-dashed curve, and the tilted estimators (I), (II) and
(III) by red lines with squares, green lines with crosses and blue lines with
circles, respectively. Sample size is n = 50.
Figure 2 is based on the datasets in the second example in Doosti and Hall (2016). In Figure 2 the distributions are all infinitely supported. From top to bottom, the panels show the estimators of Student's t3, lognormal, separated bimodal and skewed bimodal densities, respectively. In the latter two cases the densities were #7 and #8 of Marron and Wand (1992).
Table 1: Approximation of MISE. The sample size is 100 and the number of replications is 500.

          Tilted   Tilted   Tilted   Tilted   Tilted   Tilted   Conventional  Sinc       Trapezoidal  Diffusion
Density   I, m=3   I, m=n   II, m=3  II, m=n  III, m=2 III, m=3 estimator     estimator  estimator    estimator
Gau       0.0538   0.0487   0.0520   0.0661   0.0296   0.0283   0.0476        0.0579     0.0674       0.0511
SkU       0.6309   0.6181   0.6583   0.6550   0.5429   0.5203   0.5329        0.6249     0.6573       0.5769
StS       5.0323   5.0375   5.0527   5.0741   4.9725   4.5596   4.5732        5.0947     5.0997       4.6107
KtU       0.3751   0.5133   0.3501   0.4440   0.5042   0.5064   1.0277        0.5174     0.4460       0.4429
Out       0.4287   0.5007   0.4580   0.5972   0.5349   0.5247   0.5369        0.5171     0.5662       0.7243
Bim       0.0777   0.0858   0.0746   0.0982   0.0660   0.0772   0.0885        0.0905     0.0997       0.0731
SeB       0.0928   0.0997   0.0918   0.1207   0.1207   0.1348   0.5456        0.1092     0.1249       0.0974
SkB       0.1076   0.1251   0.1051   0.1270   0.0837   0.1018   0.1164        0.1319     0.1301       0.0972
When describing the results in Figure 2 it is convenient to treat by itself the lognormal density $f_{\rm lgn}$, for which the results are shown in the upper right-hand panel. Density estimation there is challenged seriously by the sharp decrease to zero of $f_{\rm lgn}(x)$ as $x \downarrow 0$. Although $f_{\rm lgn}$ is infinitely differentiable on the real line, in several respects estimators of $f_{\rm lgn}$ behave as though the density had a jump discontinuity at the origin. This is particularly true for the conventional and diffusion estimators, which perform almost identically and very poorly. The
Figure 2 Estimation of densities with infinite support. See text for distribu-
tion types. Line types are as in Figure 1, and sample size is n = 100.
tilted estimators (I) and (II) are higher in the vicinity of the mode than the tilted estimator (III), but the latter is smoother than (I) and (II) and so is arguably more attractive. The tilted sinc kernel estimator (I) tracks its untilted form (ii) very closely, and both have several spurious modes in the right-hand tail, although of course only the sinc kernel estimator takes negative values there. More generally, the closeness of (I) and (ii) can be seen in all the panels in Figure 2.
Next we discuss the other three panels of Figure 2. In the top panel, tilting methods (II) and (III) do particularly well and about equally, while in the third panel from above the tilted estimator (II) arguably performs best. This estimator is very close to its untilted form (iii), except in the right tail, where the untilted form is unattractive because it takes negative values. In the bottom panel, our tilted estimators succeed in estimating the right peak but do not perform so well for the left peak. All in all, tilted estimator (III) performs best out of the seven methods.
Next we assess the performance of all seven estimators in terms of mean integrated squared error, $\mathrm{MISE} = \int E\{\hat f(x) - f(x)\}^2\, dx$. We treated the following eight densities, for which names and formulae are as in Table 1 of Marron and Wand (1992): Gaussian (Gau), skewed unimodal (SkU), strongly skewed (StS), kurtotic unimodal (KtU), outlier (Out), bimodal (Bim), separated bimodal (SeB) and skewed bimodal (SkB). Each of these densities has either one or two modes. Our Table 1 gives values of MISE, computed from 500 simulations in each instance for these cases, when sample size is n = 100. The lowest value of MISE in each row is printed in bold. In the cases of tilted estimators (I) and (II) we give results both for m = n, meaning that the probabilities $p_i$ are permitted to be all different from one another, and for m = 3, where the $p_i$ take only three distinct values, for neighbouring values of i. For the tilted estimator (III) we consider only m = 2 and m = 3. Note that, since we use the same datasets, some columns are identical to the corresponding columns in Table 1 in Doosti and Hall (2016).
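To make the Monte Carlo approximation of MISE concrete, the sketch below averages the integrated squared error of a conventional Gaussian-kernel estimator over repeated samples from a standard normal density. It is an illustration only: the fixed bandwidth, the integration grid, the small number of replications and the function names are assumptions, and the code does not reproduce the tilted estimators, the bandwidth selection or the figures reported in Table 1.

```python
import math
import random

def normal_pdf(u, s=1.0):
    """Density of N(0, s^2) at u."""
    return math.exp(-0.5 * (u / s) ** 2) / (s * math.sqrt(2.0 * math.pi))

def kde(x, data, h):
    """Conventional Gaussian-kernel estimator (uniform weights) at x."""
    return sum(normal_pdf(x - xi, h) for xi in data) / len(data)

def approx_mise(n, h, reps, seed=0):
    """Average over `reps` samples of a Riemann-sum approximation to ISE.

    The true density is standard normal and the bandwidth h is held fixed;
    both are simplifying assumptions made for this sketch.
    """
    rng = random.Random(seed)
    dx = 0.1
    grid = [-5.0 + dx * k for k in range(101)]
    total = 0.0
    for _ in range(reps):
        data = [rng.gauss(0.0, 1.0) for _ in range(n)]
        total += sum((kde(g, data, h) - normal_pdf(g)) ** 2 for g in grid) * dx
    return total / reps

mise_small_n = approx_mise(n=25, h=0.5, reps=20)
mise_large_n = approx_mise(n=200, h=0.5, reps=20)
```

With the bandwidth held fixed, increasing the sample size reduces only the variance part of the error, which is why the approximation for the larger sample is expected to be the smaller of the two.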
Perhaps the first thing to note from Table 1 is that, in mean squared error terms, our tilted estimators perform similarly to their conventional counterparts, for example kernel methods, when the true density is relatively simple, but they perform much better than conventional estimators, and similarly to the sinc estimator, in the case of complex densities, for example KtU and SeB.
Note too that, when we allow all the $p_i$ to be distinct (i.e. when m = n), the tilted versions of the sinc and trapezoidal kernel estimators always have lower MISE than the native forms of those estimators. Moreover, excepting the case of the distribution SkU and the tilted estimator (III), even taking m = 3 produces a reduction in MISE.
Overall the tilted estimator based on minimisation of the cross-validation function (2.1) gives best performance, having least MISE in five out of the eight cases. In two cases the tilted trapezoidal kernel estimator (II), and in one case the tilted estimator (I), offer a slight improvement over method (III).
4.3. Real Data. Finally we illustrate, in Figure 3, the performance of competing estimators in the case of the Old Faithful geyser dataset (see e.g. Doosti and Hall (2016)), where the $X_i$ represent the duration, in minutes, of 107 eruptions of Old Faithful geyser in Yellowstone National Park.
The upper panel of Figure 3 shows estimators (i)–(iv) introduced in the current section, using the same respective line types as in Figures 1 and 2. The lower panel of Figure 3 shows the tilted estimators of types (I)–(III), and a histogram of the data. As can be seen, all three tilted estimators are bimodal, although, reflecting our experience in Figure 2, type (II) captures bimodality more sharply. The tilted estimators are of course always positive. The conventional kernel estimator, in the upper panel, shows the least amount of “enthusiasm” for bimodality. As indicated in Figure 2 and Table 1, this estimator performs poorly when used to estimate separated bimodal densities.
Figure 11 in Sain and Scott (1996) and Figure 7 in Reynaud-Bouret et al.
(2011) argue that the true density should be bimodal too, but the left peak at-
tained in Sain and Scott (1996) is sharper than those found in our study as well
as that obtained by Reynaud-Bouret et al. (2011). However, in the problem
Figure 3 Density estimators computed from Old Faithful geyser dataset. Up-
per panel shows estimators (i)–(iv), and lower panel shows estimators (I)–(III)
and a histogram. Line types are as in Figure 1.
of estimating the right peak all the estimators perform similarly. The proposed estimator in Reynaud-Bouret et al. (2011) is zero in an interval located between the two peaks, and in this case it follows from their algorithm that their estimator is not as smooth as ours.
5. Discussion. We have proposed a new cross-validation function that can be used to choose the bandwidth and the tilting parameters. We have shown that the proposed density function estimator performs better than conventional kernel-based estimators. It outperforms the previous estimators of Doosti and Hall (2016) in both accuracy and computation time. Globally, the rate of convergence is faster than the usual one. Our new estimator can readily be employed in a wide range of applied fields. In particular the authors are now exploring tilted estimators for presence-absence of spatially correlated data. In addition, one can also consider a generalised version of the cross-validation function to impose some constraints on the density function, e.g. bimodality, making it more realistic in real scenarios.
References
[1] Carroll, R. J., Delaigle, A. and Hall, P. (2011) Testing and estimating shape-constrained nonparametric density and regression in the presence of measurement error. J. Am. Statist. Ass., 106, 191-202.
[2] Chen, S. X. (1997) Empirical likelihood-based kernel density estimation. Aust. J. Statist., 39, 47-56.
[3] Doosti, H. and Hall, P. (2016) Making a non-parametric density estimator more attractive, and more accurate, by data perturbation. J. R. Statist. Soc. B, 78(2), 445-462.
[4] Grenander, U. (1956) On the theory of mortality measurement II. Skand. Akt., 39, 125-153.
[5] Hall, P. and Huang, L. S. (2001) Nonparametric kernel regression subject to monotonicity constraints. Ann. Statist., 29, 624-647.
[6] Hall, P. and Huang, L. S. (2002) Unimodal density estimation using kernel methods. Statist. Sin., 12, 965-990.
[7] Hall, P. and Presnell, B. (1999) Intentionally biased bootstrap methods. J. R. Statist. Soc. B, 61, 143-158.
[8] Marron, J. S. and Wand, M. P. (1992) Exact mean integrated squared error. Ann. Statist., 20, 712-736.
[9] Müller, U. U., Schick, A. and Wefelmeyer, W. G. (2005) Weighted residual-based density estimators for nonlinear autoregressive models. Statist. Sin., 15, 177-195.
[10] Owen, A. B. (1988) Empirical likelihood ratio confidence intervals for a single functional. Biometrika, 75, 237-249.
[11] Reynaud-Bouret, P., Rivoirard, V. and Tuleau-Malot, C. (2011) Adaptive density estimation: a curse of support? J. Statist. Planng Inf., 141, 115-139.
[12] Sain, S. R. and Scott, D. W. (1996) On locally adaptive density estimation. J. Am. Statist. Ass., 91, 1525-1534.
[13] Schick, A. and Wefelmeyer, W. G. (2009) Improved density estimators for invertible linear processes. Communs Statist. Theor. Meth., 38, 3123-3147.
[14] Weisberg, S. (1985) Applied Linear Regression. New York: Wiley.
[15] Zhang, B. (1998) A note on kernel density estimation with auxiliary information. Communs Statist. Theor. Meth., 27, 1-11.
APPENDIX: Proof of Theorem 1
Step 1: Initial approximation to $n^{-1} \sum_i \hat f_{-i}(X_i \mid h, p)$. The approximation is given at (B.3). Throughout our proof, in this step and subsequent ones, we assume that $c_1$ in (3.1) satisfies $0 < c_1 < 1/5$. This is appropriate since the theorem is asserted only for sufficiently small $c_1$.
To derive the approximation, let $p_{(i)}$ denote the concomitant of $X_{(i)}$ in the pair $(p_i, X_i)$, and observe that
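\[
\begin{aligned}
h \sum_{i=1}^{n} \hat f_{-i}(X_i \mid h, p)
 &= \sum_{i=1}^{n} \sum_{j \,:\, j \ne i} p_j\, K\Big(\frac{X_i - X_j}{h}\Big) \\
 &= \sum_{i=1}^{n} \sum_{j \,:\, j \ne i} p_{(j)}\, K\Big(\frac{X_{(i)} - X_{(j)}}{h}\Big)
  = T_1 + \ldots + T_4 ,
\end{aligned}
\tag{B.1}
\]
where, defining $I(\mathcal E)$ to be the indicator function of an event $\mathcal E$,
\[
\begin{aligned}
T_1 &= \sum_{k_1=1}^{m} \sum_{k_2 \,:\, k_2 \ne k_1} p_{(i_{k_2})}\, K\Big(\frac{X_{(i_{k_1})} - X_{(i_{k_2})}}{h}\Big) , \\
T_2 &= \sum_{k_1=1}^{m} \sum_{k_2=1}^{m} \sum_{j=i_{k_2-1}+1}^{i_{k_2}-1} I(i_{k_1} \ne j)\, p_{(j)}\, K\Big(\frac{X_{(i_{k_1})} - X_{(j)}}{h}\Big) , \\
T_3 &= \sum_{k_1=1}^{m} \sum_{i=i_{k_1-1}+1}^{i_{k_1}-1} \sum_{k_2=1}^{m} p_{(i_{k_2})}\, I(i_{k_2} \ne i)\, K\Big(\frac{X_{(i)} - X_{(i_{k_2})}}{h}\Big) , \\
T_4 &= \sum_{k_1=1}^{m} \sum_{i=i_{k_1-1}+1}^{i_{k_1}-1} \sum_{k_2=1}^{m} \sum_{j=i_{k_2-1}+1}^{i_{k_2}-1} I(i \ne j)\, p_{(j)}\, K\Big(\frac{X_{(i)} - X_{(j)}}{h}\Big) .
\end{aligned}
\tag{B.2}
\]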
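Since $p_i \le C_3\, n^{-1}$, uniformly in $i$ (see CB(c)), then
\[
\begin{aligned}
T_1 &\le \frac{C_3}{n} \sum_{k_1=1}^{m} \sum_{k_2 \,:\, k_2 \ne k_1} K\Big(\frac{X_{(i_{k_1})} - X_{(i_{k_2})}}{h}\Big) \le C_3\, n^{-1} m^2 \sup K , \\
T_2 + T_3 &\le \frac{2 C_3}{n} \sum_{k_1=1}^{m} \sum_{k_2=1}^{m} \sum_{j=i_{k_2-1}+1}^{i_{k_2}-1} K\Big(\frac{X_{(i_{k_1})} - X_{(j)}}{h}\Big) \\
 &\le 2 C_3\, m h \sup_{-\infty < x < \infty} f(x) = O_p(mh) ,
\end{aligned}
\]
uniformly in $h \in \mathcal H$. (Here we have used the assumption that $0 < c_1 < 1/5$.) Combining these bounds with (B.1) we deduce that, uniformly in $h \in \mathcal H$ and $p \in \mathcal P$,
\[
\frac{1}{n} \sum_{i=1}^{n} \hat f_{-i}(X_i \mid h, p) = \frac{T_4}{nh} + O_p\Big(\frac{m^2}{n^2 h} + \frac{m}{n}\Big) . \tag{B.3}
\]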
Step 2: Hoeffding decomposition of $T_4$. The quantity $T_4$ is not unlike a U-statistic, and in this section we represent $T_4$ as a Hoeffding-like linear projection of $T_4$, which we denote by $T_8$, plus a remainder term, $T_9$. See (B.7) and (B.8). All expectations involved are conditional on $\mathcal F$, which we define to be the sigma-field generated by $X_{(i_j)}$ for $1 \le j \le m$.
Recall from AA that $X_{(1)} \le \ldots \le X_{(n)}$ are the order statistics of the data $X_1, \ldots, X_n$, and that $p_i = q_j$ (the latter depending only on $j$) for $i_{j-1} < i \le i_j$, where $1 \le j \le m$ and $i_0 = 0$. Conditional on $\mathcal F$, the random variables $X_{(i)}$, for $i_{j-1} < i \le i_j$, are independent and identically distributed with density $f / \{F(X_{(i_j)}) - F(X_{(i_{j-1})})\}$, supported on the interval $[X_{(i_{j-1})}, X_{(i_j)}]$. Moreover, the $p_i$ are $\mathcal F$-measurable functions of $q_1, \ldots, q_m$.
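With these properties in mind, write $T_5$ and $T_6$ for the versions of $T_4$ that arise if, in the quadruple series defining $T_4$ at (B.2), we take the expected value of the summands conditional on both $X_{(j)}$ and $\mathcal F$, and the expected value conditional on both $X_{(i)}$ and $\mathcal F$, respectively. Let $T_7$, an $\mathcal F$-measurable random variable, be the quantity obtained if the conditional expectation is on $\mathcal F$ alone:
\[
T_5 = \sum_{k_1=1}^{m} \sum_{i=i_{k_1-1}+1}^{i_{k_1}-1} \sum_{k_2=1}^{m} \sum_{j=i_{k_2-1}+1}^{i_{k_2}-1} I(i \ne j)\, p_{(j)}\, E\Big\{ K\Big(\frac{X_{(i)} - X_{(j)}}{h}\Big) \,\Big|\, X_{(j)}, \mathcal F \Big\} , \tag{B.4}
\]
\[
T_6 = \sum_{k_1=1}^{m} \sum_{i=i_{k_1-1}+1}^{i_{k_1}-1} \sum_{k_2=1}^{m} \sum_{j=i_{k_2-1}+1}^{i_{k_2}-1} I(i \ne j)\, p_{(j)}\, E\Big\{ K\Big(\frac{X_{(i)} - X_{(j)}}{h}\Big) \,\Big|\, X_{(i)}, \mathcal F \Big\} , \tag{B.5}
\]
\[
T_7 = \sum_{k_1=1}^{m} \sum_{i=i_{k_1-1}+1}^{i_{k_1}-1} \sum_{k_2=1}^{m} \sum_{j=i_{k_2-1}+1}^{i_{k_2}-1} I(i \ne j)\, p_{(j)}\, E\Big\{ K\Big(\frac{X_{(i)} - X_{(j)}}{h}\Big) \,\Big|\, \mathcal F \Big\} . \tag{B.6}
\]
In this notation we write
\[
T_4 = T_8 + T_9 , \tag{B.7}
\]
where
\[
T_8 = T_4 - (T_5 + T_6) + T_7 , \qquad T_9 = T_5 + T_6 - T_7 . \tag{B.8}
\]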
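Step 3: Approximation to $T_5$. The approximation is given at (B.23). Observe that, starting from (B.4),
\[
\begin{aligned}
T_5 &= \mathop{\sum\sum}_{1 \le k_1 \ne k_2 \le m} \sum_{i=i_{k_1-1}+1}^{i_{k_1}-1} \sum_{j=i_{k_2-1}+1}^{i_{k_2}-1} p_{(j)}\, E\Big\{ K\Big(\frac{X_{(i)} - X_{(j)}}{h}\Big) \,\Big|\, X_{(j)}, \mathcal F \Big\} \\
 &\quad + \sum_{k=1}^{m} \mathop{\sum\sum}_{i_{k-1}+1 \le i \ne j \le i_k - 1} p_{(j)}\, E\Big\{ K\Big(\frac{X_{(i)} - X_{(j)}}{h}\Big) \,\Big|\, X_{(j)}, \mathcal F \Big\} \\
 &= \mathop{\sum\sum}_{1 \le k_1 \ne k_2 \le m} \sum_{j=i_{k_2-1}+1}^{i_{k_2}-1} \frac{(i_{k_1} - i_{k_1-1} - 1)\, q_{k_2}}{F(X_{(i_{k_1})}) - F(X_{(i_{k_1-1})})} \int_{X_{(i_{k_1-1})}}^{X_{(i_{k_1})}} K\Big(\frac{X_{(j)} - x}{h}\Big) f(x)\, dx \\
 &\quad + \sum_{k=1}^{m} \sum_{j=i_{k-1}+1}^{i_k-1} \frac{(i_k - i_{k-1} - 2)\, q_k}{F(X_{(i_k)}) - F(X_{(i_{k-1})})} \int_{X_{(i_{k-1})}}^{X_{(i_k)}} K\Big(\frac{X_{(j)} - x}{h}\Big) f(x)\, dx \\
 &= \sum_{k_1=1}^{m} \sum_{k_2=1}^{m} \sum_{j=i_{k_2-1}+1}^{i_{k_2}-1} \frac{(i_{k_1} - i_{k_1-1} - 1)\, q_{k_2}}{F(X_{(i_{k_1})}) - F(X_{(i_{k_1-1})})} \int_{X_{(i_{k_1-1})}}^{X_{(i_{k_1})}} K\Big(\frac{X_{(j)} - x}{h}\Big) f(x)\, dx \\
 &\quad - \sum_{k=1}^{m} \sum_{j=i_{k-1}+1}^{i_k-1} \frac{q_k}{F(X_{(i_k)}) - F(X_{(i_{k-1})})} \int_{X_{(i_{k-1})}}^{X_{(i_k)}} K\Big(\frac{X_{(j)} - x}{h}\Big) f(x)\, dx \\
 &= T_{51} - T_{52} ,
\end{aligned}
\tag{B.9}
\]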
say. Since $q_k = q_k(n)$ is bounded above by $C_3\, n^{-1}$, uniformly in $k$ and $n$ (see CB(c)), and since $\int K = 1$, then
\[
0 \le T_{52} \le C_3\, n^{-1} h \sum_{k=1}^{m} \sum_{j=i_{k-1}+1}^{i_k-1} \frac{\sup f}{F(X_{(i_k)}) - F(X_{(i_{k-1})})} . \tag{B.10}
\]
Writing $\hat F$ for the empirical distribution function for the sample $X_1, \ldots, X_n$, we have $\sup |\hat F - F| = O_p(n^{-1/2})$. Therefore, since $i_k = n\, \hat F(X_{(i_k)})$ for each $k$, then $F(X_{(i_k)}) - F(X_{(i_{k-1})}) = n^{-1} (i_k - i_{k-1}) + O_p(n^{-1/2})$, uniformly in $1 \le k \le m$ for choices of $m \le n$. Moreover, by AA, $C_1\, n/m \le i_k - i_{k-1} \le C_2\, n/m$. Hence, since $m = o(n^{1/2})$ (see CB(d)),
\[
\max_{1 \le k \le m} \big\{ F(X_{(i_k)}) - F(X_{(i_{k-1})}) \big\}^{-1} = O_p(m) . \tag{B.11}
\]
Combining (B.10) and (B.11) we deduce that, uniformly in $h \in \mathcal H$ and $p \in \mathcal P$,
\[
T_{52} = O_p\big(n^{-1} h\, m\, n\big) = O_p(mh) . \tag{B.12}
\]
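The property that the empirical distribution function is within $O_p(n^{-1/2})$ of $F$ in the supremum norm, used above, can be checked numerically. The sketch below does this for uniform samples, where $F(u) = u$ on $[0, 1]$; the sample sizes, the seed and the function name `sup_ecdf_distance` are illustrative assumptions.

```python
import math
import random

def sup_ecdf_distance(sample, cdf):
    """Supremum over x of |F_hat(x) - F(x)| for a continuous cdf F.

    The supremum is attained at a data point, just before or just after
    the jump of the empirical distribution function.
    """
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for k, x in enumerate(xs, start=1):
        fx = cdf(x)
        d = max(d, k / n - fx, fx - (k - 1) / n)
    return d

rng = random.Random(1)
scaled = {}
for n in (100, 1000, 10000):
    sample = [rng.random() for _ in range(n)]
    # If the distance is O_p(n^{-1/2}), this rescaled value stays bounded.
    scaled[n] = math.sqrt(n) * sup_ecdf_distance(sample, lambda u: u)
```

The rescaled distances remain of order one as $n$ grows, which is the behaviour that the bound at (B.11) relies on.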
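Next, replacing $(k_1, k_2)$ by $(k, k_1)$, we approximate $T_{51}$, indicated in (B.9):
\[
\begin{aligned}
T_{51} &= \sum_{k=1}^{m} \frac{i_k - i_{k-1} - 1}{F(X_{(i_k)}) - F(X_{(i_{k-1})})} \int_{X_{(i_{k-1})}}^{X_{(i_k)}} \Big\{ \sum_{k_1=1}^{m} q_{k_1} \sum_{j=i_{k_1-1}+1}^{i_{k_1}-1} K\Big(\frac{X_{(j)} - x}{h}\Big) \Big\} f(x)\, dx \\
 &= \sum_{k=1}^{m} \frac{i_k - i_{k-1} - 1}{F(X_{(i_k)}) - F(X_{(i_{k-1})})} \int_{X_{(i_{k-1})}}^{X_{(i_k)}} \Big\{ \sum_{i=1}^{n} p_i\, K\Big(\frac{X_i - x}{h}\Big) \Big\} f(x)\, dx \\
 &\quad - \sum_{k=1}^{m} \frac{i_k - i_{k-1} - 1}{F(X_{(i_k)}) - F(X_{(i_{k-1})})} \int_{X_{(i_{k-1})}}^{X_{(i_k)}} \Big\{ \sum_{k_1=1}^{m} q_{k_1} K\Big(\frac{X_{(i_{k_1})} - x}{h}\Big) \Big\} f(x)\, dx \\
 &= T_{511} - T_{512} ,
\end{aligned}
\tag{B.13}
\]
say. Now,
\[
T_{511} = h \sum_{k=1}^{m} \frac{i_k - i_{k-1} - 1}{F(X_{(i_k)}) - F(X_{(i_{k-1})})} \int_{X_{(i_{k-1})}}^{X_{(i_k)}} \hat f(x \mid h, p)\, f(x)\, dx , \tag{B.14}
\]
and, noting (B.11) and taking $C_2$ to be as in (2.24),
\[
\begin{aligned}
T_{512} &\le C_2\, n^{-1} \sum_{k=1}^{m} \frac{C_2\, n/m}{F(X_{(i_k)}) - F(X_{(i_{k-1})})} \int \Big\{ \sum_{k_1=1}^{m} K\Big(\frac{X_{(i_{k_1})} - x}{h}\Big) \Big\} f(x)\, dx \\
 &= O_p\big\{ n^{-1} m^2\, (n/m)\, m h \big\} = O_p\big(m^2 h\big) .
\end{aligned}
\tag{B.15}
\]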
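To simplify the formula for $T_{511}$ at (B.14), note first that, since $\hat F(X_{(i_k)}) = i_k / n$ for each $k$,
\[
\begin{aligned}
\frac{i_k - i_{k-1} - 1}{F(X_{(i_k)}) - F(X_{(i_{k-1})})}
 &= \frac{i_k - i_{k-1} - 1}{\hat F(X_{(i_k)}) - \hat F(X_{(i_{k-1})}) + O_p(n^{-1/2})} \\
 &= n\, \frac{i_k - i_{k-1} - 1}{i_k - i_{k-1} + O_p(n^{1/2})} = n + O_p\big(m\, n^{1/2}\big) ,
\end{aligned}
\tag{B.16}
\]
uniformly in $1 \le k \le m$. The $O_p(m\, n^{1/2})$ remainder in (B.16) does not depend on $h$ or $p$, and will be denoted below by $Q_{1nk}$. Define
\[
D_s(h, p) = \int \big| \hat f(x \mid h, p) - f(x) \big|^s\, dx . \tag{B.17}
\]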
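Moment methods can be used to prove that, for all $r, s \ge 1$,
\[
\sup_{h \in \mathcal{H}} \sup_{p \in \mathcal{P}} E\bigl\{ D_s(h, p)^r \bigr\} = O\bigl[ \bigl\{ (nh)^{-1} + h^4 \bigr\}^{rs/2} \bigr].
\]
This property, Markov's inequality, and a lattice argument which we shall introduce in step 5 and develop further in step 8, can be used to show that, for each $\epsilon > 0$,
\[
\sup_{h \in \mathcal{H}} \sup_{p \in \mathcal{P}} D_s(h, p) = O_p\bigl[ \bigl\{ (nh)^{-1} + h^4 \bigr\}^{s/2} n^{2\epsilon} \bigr]. \tag{B.18}
\]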
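Hence,
\[
\max_k \Bigl| \int_{X_{(i_{k-1})}}^{X_{(i_k)}} \bigl\{ \hat{f}(x \mid h, p) - f(x) \bigr\} f(x)\, dx \Bigr|^2 \le D_2(h, p) \int f^2 = O_p\bigl[ \bigl\{ (nh)^{-1} + h^4 \bigr\} n^{2\epsilon} \bigr]
\]
uniformly in $h \in \mathcal{H}$ and $p \in \mathcal{P}$. Therefore, defining
\[
Q_{1n} = \frac{1}{n} \sum_{k=1}^{m} Q_{1nk} \int_{X_{(i_{k-1})}}^{X_{(i_k)}} f(x)^2\, dx,
\]
which does not depend on $h$ or $p$, we have:
\[
\begin{aligned}
\sum_{k=1}^{m} Q_{1nk} \int_{X_{(i_{k-1})}}^{X_{(i_k)}} \hat{f}(x \mid h, p)\, f(x)\, dx
&= n\, Q_{1n} + O_p\Bigl[ \sum_{k=1}^{m} |Q_{1nk}| \bigl\{ (nh)^{-1/2} + h^2 \bigr\} n^{\epsilon} \Bigr] \\
&= n\, Q_{1n} + O_p\bigl[ m^2 n^{1/2} \bigl\{ (nh)^{-1/2} + h^2 \bigr\} n^{\epsilon} \bigr],
\end{aligned}
\tag{B.19}
\]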
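uniformly in $h \in \mathcal{H}$ and $p \in \mathcal{P}$. Combining (B.14), (B.16) and (B.19) we deduce that, uniformly in $h \in \mathcal{H}$ and $p \in \mathcal{P}$, and for each $\epsilon > 0$,
\[
\begin{aligned}
(nh)^{-1}\, T_{511}
&= \int_{-\infty}^{X_{(n)}} \hat{f}(x \mid h, p)\, f(x)\, dx + Q_{1n} + O_p\bigl[ m^2 n^{-1/2} \bigl\{ (nh)^{-1/2} + h^2 \bigr\} n^{\epsilon} \bigr] \\
&= \int \hat{f}(x \mid h, p)\, f(x)\, dx + Q_{1n} + O_p\bigl[ m^2 n^{-1/2} \bigl\{ (nh)^{-1/2} + h^2 \bigr\} n^{\epsilon} \bigr].
\end{aligned}
\tag{B.20}
\]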
The last identity above follows from the fact that, for each $\epsilon > 0$,
\[
\int_{X_{(n)}}^{\infty} \hat{f}(x \mid h, p)\, f(x)\, dx = O_p\bigl(n^{\epsilon - 1}\bigr). \tag{B.21}
\]
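To derive this result, let $g_\delta(x) = \{1 - F(x)\}^{(1/2) - \delta}$ and note that, since $|\hat{F}(x) - F(x)| / g_\delta(x) = O_p(n^{-1/2})$ uniformly in $x$, for each $\delta \in (0, \tfrac12)$, then, taking $x_n = X_{(n)}$, we have
\[
1 - F(x_n) = 1 - \hat{F}(x_n) + O_p\bigl\{ n^{-1/2} g_\delta(x_n) \bigr\}.
\]
Hence, since $1 - \hat{F}(x_n) = n^{-1}$,
\[
\begin{aligned}
1 - F(x_n) &= 1 - \hat{F}(x_n) + O_p\bigl[ n^{-1/2} \{1 - F(x_n)\}^{(1/2) - \delta} \bigr] \\
&= n^{-1} + O_p\bigl(n^{\delta - 1}\bigr) = O_p\bigl(n^{\delta - 1}\bigr).
\end{aligned}
\]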
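Therefore, defining $D_s(h, p)$ as at (B.17), we deduce that if $s, t > 0$ satisfy $s^{-1} + t^{-1} = 1$, then for all $\delta, \epsilon > 0$,
\[
\begin{aligned}
\int_{X_{(n)}}^{\infty} \hat{f}(x \mid h, p)\, f(x)\, dx
&\le \int_{x_n}^{\infty} f(x)^2\, dx + \int_{x_n}^{\infty} \bigl| \hat{f}(x \mid h, p) - f(x) \bigr|\, f(x)\, dx \\
&\le (\sup f)\, \{1 - F(x_n)\} + D_s(h, p)^{1/s} \Bigl\{ \int_{x_n}^{\infty} f(x)^t\, dx \Bigr\}^{1/t} \\
&= O_p\Bigl[ n^{\delta - 1} + \bigl\{ (nh)^{-1} + h^4 \bigr\}^{1/2} n^{2\epsilon/s} \bigl( n^{\delta - 1} \bigr)^{1/t} \Bigr],
\end{aligned}
\tag{B.22}
\]
where we have used (B.18) to derive the last identity. Choosing $s$ arbitrarily large (i.e. $t > 1$ arbitrarily close to 1) in (B.22), and $\epsilon$, $\delta$ arbitrarily small, we deduce that (B.21) holds for all $\epsilon > 0$.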
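Combining (B.9), (B.12), (B.13) and (B.15) we deduce that, uniformly in $h \in \mathcal{H}$ and $p \in \mathcal{P}$,
\[
T_5 = T_{51} - T_{52} = T_{511} - T_{512} - T_{52} = T_{511} + O_p\bigl(m^2 h\bigr).
\]
Hence, by (B.20), uniformly in $h \in \mathcal{H}$ and $p \in \mathcal{P}$, and for each $\epsilon > 0$,
\[
(nh)^{-1}\, T_5 = \int \hat{f}(x \mid h, p)\, f(x)\, dx + Q_{1n} + O_p\bigl[ m^2 n^{-1/2} \bigl\{ (nh)^{-1/2} + h^2 \bigr\} n^{\epsilon} \bigr]. \tag{B.23}
\]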
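Step 4: Approximation to $T_6$. The approximation is given at (B.33). The first step in deriving it, starting from (B.5), is to approximate $T_6$:
\[
\begin{aligned}
T_6 &= \mathop{\sum\sum}_{1 \le k_1 \ne k_2 \le m} \; \sum_{i=i_{k_1-1}+1}^{i_{k_1}-1} \; \sum_{j=i_{k_2-1}+1}^{i_{k_2}-1} p_{(j)}\, E\Bigl\{ K\Bigl(\frac{X_{(i)} - X_{(j)}}{h}\Bigr) \,\Big|\, X_{(i)}, \mathcal{F} \Bigr\} \\
&\quad + \sum_{k=1}^{m} \; \mathop{\sum\sum}_{i_{k-1}+1 \le i \ne j \le i_k - 1} p_{(j)}\, E\Bigl\{ K\Bigl(\frac{X_{(i)} - X_{(j)}}{h}\Bigr) \,\Big|\, X_{(i)}, \mathcal{F} \Bigr\} \\
&= \mathop{\sum\sum}_{1 \le k_1 \ne k_2 \le m} \; \sum_{i=i_{k_1-1}+1}^{i_{k_1}-1} \frac{(i_{k_2} - i_{k_2-1} - 1)\, q_{k_2}}{F(X_{(i_{k_2})}) - F(X_{(i_{k_2-1})})} \int_{X_{(i_{k_2-1})}}^{X_{(i_{k_2})}} K\Bigl(\frac{X_{(i)} - x}{h}\Bigr) f(x)\, dx \\
&\quad + \sum_{k=1}^{m} \sum_{i=i_{k-1}+1}^{i_k-1} \frac{(i_k - i_{k-1} - 2)\, q_k}{F(X_{(i_k)}) - F(X_{(i_{k-1})})} \int_{X_{(i_{k-1})}}^{X_{(i_k)}} K\Bigl(\frac{X_{(i)} - x}{h}\Bigr) f(x)\, dx \\
&= \sum_{k_1=1}^{m} \sum_{k_2=1}^{m} \frac{(i_{k_2} - i_{k_2-1} - 1)\, q_{k_2}}{F(X_{(i_{k_2})}) - F(X_{(i_{k_2-1})})} \sum_{i=i_{k_1-1}+1}^{i_{k_1}-1} \int_{X_{(i_{k_2-1})}}^{X_{(i_{k_2})}} K\Bigl(\frac{X_{(i)} - x}{h}\Bigr) f(x)\, dx \\
&\quad - \sum_{k=1}^{m} \sum_{i=i_{k-1}+1}^{i_k-1} \frac{q_k}{F(X_{(i_k)}) - F(X_{(i_{k-1})})} \int_{X_{(i_{k-1})}}^{X_{(i_k)}} K\Bigl(\frac{X_{(i)} - x}{h}\Bigr) f(x)\, dx \\
&= T_{61} - T_{62},
\end{aligned}
\tag{B.24}
\]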
say. Since $T_{52} = T_{62}$ then, by (B.12),
\[
\sup_{h \in \mathcal{H}} \sup_{p \in \mathcal{P}} T_{62} = O_p(mh). \tag{B.25}
\]
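Next we develop an approximation to $T_{61}$, which is given equivalently by
\[
T_{61} = \sum_{k=1}^{m} \sum_{k_1=1}^{m} \frac{(i_k - i_{k-1} - 1)\, q_k}{F(X_{(i_k)}) - F(X_{(i_{k-1})})} \int_{X_{(i_{k-1})}}^{X_{(i_k)}} \Bigl\{ \sum_{i=i_{k_1-1}+1}^{i_{k_1}-1} K\Bigl(\frac{X_{(i)} - x}{h}\Bigr) \Bigr\} f(x)\, dx.
\]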
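Therefore,
\[
\begin{aligned}
T_{10} &\equiv T_{61} - E(T_{61} \mid \mathcal{F}) \\
&= \sum_{k=1}^{m} \frac{(i_k - i_{k-1} - 1)\, q_k}{F(X_{(i_k)}) - F(X_{(i_{k-1})})} \sum_{k_1=1}^{m} \Biggl\{ \sum_{i=i_{k_1-1}+1}^{i_{k_1}-1} \int_{X_{(i_{k-1})}}^{X_{(i_k)}} K\Bigl(\frac{X_{(i)} - x}{h}\Bigr) f(x)\, dx \\
&\qquad - \frac{i_{k_1} - i_{k_1-1} - 1}{F(X_{(i_{k_1})}) - F(X_{(i_{k_1-1})})} \int_{X_{(i_{k-1})}}^{X_{(i_k)}} f(x)\, dx \int_{X_{(i_{k_1-1})}}^{X_{(i_{k_1})}} K\Bigl(\frac{y - x}{h}\Bigr) f(y)\, dy \Biggr\} \\
&= h \sum_{k=1}^{m} (i_k - i_{k-1} - 1)\, q_k \sum_{k_1=1}^{m} (i_{k_1} - i_{k_1-1} - 1) \int_{X_{(i_{k-1})}}^{X_{(i_k)}} \bigl[ \hat{f}_{k_1}(x) - E\bigl\{ \hat{f}_{k_1}(x) \mid \mathcal{F} \bigr\} \bigr] f_k(x)\, dx,
\end{aligned}
\tag{B.26}
\]
where $f_k(x) = 0$ if $x \notin I_k \equiv [X_{(i_{k-1})}, X_{(i_k)}]$,
\[
f_k(x) = \frac{f(x)}{F(X_{(i_k)}) - F(X_{(i_{k-1})})} \ \text{ if } x \in I_k, \qquad
\hat{f}_k(x) = \frac{1}{(i_k - i_{k-1} - 1)\, h} \sum_{i=i_{k-1}+1}^{i_k-1} K\Bigl(\frac{X_{(i)} - x}{h}\Bigr).
\]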
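In what follows it is convenient to indicate, in notation for $T_{10}$, $T_{101}$ and $T_{102}$, the extent to which these quantities depend on $h$ or $p$. Therefore we write them as $T_{10}(h, p)$, $T_{101}(h)$ and $T_{102}(h, p)$, respectively. Defining $r_k$ by $q_k = n^{-1} (1 + h^2 r_k)$, we decompose $T_{10}$ into two parts:
\[
T_{10}(h, p) = T_{101}(h) + T_{102}(h, p), \tag{B.27}
\]
where
\[
\begin{aligned}
T_{101}(h) &= \frac{h}{n} \sum_{k=1}^{m} (i_k - i_{k-1} - 1) \sum_{k_1=1}^{m} (i_{k_1} - i_{k_1-1} - 1) \int_{X_{(i_{k-1})}}^{X_{(i_k)}} \bigl[ \hat{f}_{k_1}(x) - E\bigl\{ \hat{f}_{k_1}(x) \mid \mathcal{F} \bigr\} \bigr] f_k(x)\, dx, \\
T_{102}(h, p) &= \frac{h^3}{n} \sum_{k=1}^{m} (i_k - i_{k-1} - 1)\, r_k \sum_{k_1=1}^{m} (i_{k_1} - i_{k_1-1} - 1) \int_{X_{(i_{k-1})}}^{X_{(i_k)}} \bigl[ \hat{f}_{k_1}(x) - E\bigl\{ \hat{f}_{k_1}(x) \mid \mathcal{F} \bigr\} \bigr] f_k(x)\, dx.
\end{aligned}
\]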
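Given a random variable $V$, write $E_{\mathcal{F}}(V)$ for $E(V \mid \mathcal{F})$, let $S = \{1, \ldots, n\} \setminus \{i_1, \ldots, i_m\}$, and note that
\[
h \sum_{k_1=1}^{m} (i_{k_1} - i_{k_1-1} - 1)\, \hat{f}_{k_1}(x) = \sum_{k=1}^{m} \sum_{i=i_{k-1}+1}^{i_k-1} K\Bigl(\frac{X_{(i)} - x}{h}\Bigr) = \sum_{i \in S} K\Bigl(\frac{X_i - x}{h}\Bigr). \tag{B.28}
\]
Therefore,
\[
h \sum_{k_1=1}^{m} (i_{k_1} - i_{k_1-1} - 1) \bigl[ \hat{f}_{k_1}(x) - E\bigl\{ \hat{f}_{k_1}(x) \mid \mathcal{F} \bigr\} \bigr] = \sum_{i \in S} (1 - E_{\mathcal{F}})\, K\Bigl(\frac{X_i - x}{h}\Bigr),
\]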
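and so
\[
\begin{aligned}
T_{101}(h) &= \frac{1}{n} \sum_{i \in S} (1 - E_{\mathcal{F}}) \sum_{k=1}^{m} (i_k - i_{k-1} - 1) \int_{X_{(i_{k-1})}}^{X_{(i_k)}} K\Bigl(\frac{X_i - x}{h}\Bigr) f_k(x)\, dx \\
&= \frac{1}{n} \sum_{i \in S} (1 - E_{\mathcal{F}}) \sum_{k=1}^{m} \frac{i_k - i_{k-1} - 1}{F(X_{(i_k)}) - F(X_{(i_{k-1})})} \int_{X_{(i_{k-1})}}^{X_{(i_k)}} K\Bigl(\frac{X_i - x}{h}\Bigr) f(x)\, dx \\
&= \sum_{i \in S} (1 - E_{\mathcal{F}}) \int K\Bigl(\frac{X_i - x}{h}\Bigr) f(x)\, dx + O_p\bigl( m^2 n^{\epsilon - (1/2)} h \bigr),
\end{aligned}
\tag{B.29}
\]
\[
\begin{aligned}
T_{102}(h, p) &= \frac{h^2}{n} \sum_{i \in S} (1 - E_{\mathcal{F}}) \sum_{k=1}^{m} (i_k - i_{k-1} - 1)\, r_k \int_{X_{(i_{k-1})}}^{X_{(i_k)}} K\Bigl(\frac{X_i - x}{h}\Bigr) f_k(x)\, dx \\
&= \frac{h^2}{n} \sum_{i \in S} (1 - E_{\mathcal{F}}) \sum_{k=1}^{m} \frac{i_k - i_{k-1} - 1}{F(X_{(i_k)}) - F(X_{(i_{k-1})})}\, r_k \int_{X_{(i_{k-1})}}^{X_{(i_k)}} K\Bigl(\frac{X_i - x}{h}\Bigr) f(x)\, dx \\
&= O_p\bigl( m\, n^{(1/2) + \epsilon} h^3 \sup_k |r_k| \bigr),
\end{aligned}
\tag{B.30}
\]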
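uniformly in $h \in \mathcal{H}$ and $p \in \mathcal{P}$. (The arguments leading to the remainder terms in each of (B.29)–(B.31), uniformly in $h \in \mathcal{H}$ and $p \in \mathcal{P}$, will be given in detail in step 5.) Similarly it can be proved that, for each $\epsilon > 0$,
\[
\frac{1}{nh} \sum_{i \in S} (1 - E_{\mathcal{F}}) \int K\Bigl(\frac{X_i - x}{h}\Bigr) f(x)\, dx = Q_{2n} + O_p\bigl( n^{\epsilon - (1/2)} h^2 \bigr), \tag{B.31}
\]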
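uniformly in $h \in \mathcal{H}$, where $Q_{2n}$ does not depend on $h$. Hence, combining (B.27) and (B.29)–(B.31), we deduce that
\[
\begin{aligned}
(nh)^{-1} \bigl\{ T_{61} - E(T_{61} \mid \mathcal{F}) \bigr\} &= (nh)^{-1}\, T_{10}(h, p) \\
&= Q_{2n} + O_p\bigl( m^2 n^{\epsilon - (3/2)} + m\, n^{\epsilon - (1/2)} h^2 \sup_k |r_k| \bigr).
\end{aligned}
\tag{B.32}
\]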
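In view of (B.24) and (B.25), $T_{61} = T_6 + O_p(mh)$, uniformly in $h \in \mathcal{H}$ and $p \in \mathcal{P}$, and so by (B.32),
\[
(nh)^{-1} \bigl\{ T_6 - E(T_{61} \mid \mathcal{F}) \bigr\} = Q_{2n} + O_p\bigl( m^2 n^{\epsilon - (3/2)} + m\, n^{\epsilon - (1/2)} h^2 \sup_k |r_k| + m\, n^{-1} \bigr), \tag{B.33}
\]
uniformly in the same sense.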
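Step 5: Proofs of (B.29)–(B.31). First we develop a bound for
\[
s(h) = \sum_{k=1}^{m} |s_k(h)|,
\]
where
\[
s_k(h) = \int_{X_{(i_{k-1})}}^{X_{(i_k)}} \Bigl\{ (1 - E_{\mathcal{F}}) \sum_{i \in S} K\Bigl(\frac{X_i - x}{h}\Bigr) \Bigr\} f(x)\, dx.
\]
Now,
\[
\int_{X_{(i_{k-1})}}^{X_{(i_k)}} \Bigl\{ \sum_{i \in S} K\Bigl(\frac{X_i - x}{h}\Bigr) \Bigr\} f(x)\, dx = h \sum_{i \in S} \int_{(X_i - X_{(i_k)})/h}^{(X_i - X_{(i_{k-1})})/h} K(u)\, f(X_i - hu)\, du,
\]
and so by Rosenthal's inequality, for each $r \ge 1$,
\[
\begin{aligned}
h^{-2r}\, E\bigl\{ s_k(h)^{2r} \mid \mathcal{F} \bigr\}
&\le C(r) \Biggl( \Biggl[ \sum_{i \in S} \mathrm{var}\Bigl\{ \int_{(X_i - X_{(i_k)})/h}^{(X_i - X_{(i_{k-1})})/h} K(u)\, f(X_i - hu)\, du \,\Big|\, \mathcal{F} \Bigr\} \Biggr]^r \\
&\qquad + \sum_{i \in S} E\Bigl\{ \Bigl| (1 - E_{\mathcal{F}}) \int_{(X_i - X_{(i_k)})/h}^{(X_i - X_{(i_{k-1})})/h} K(u)\, f(X_i - hu)\, du \Bigr|^{2r} \,\Big|\, \mathcal{F} \Bigr\} \Biggr) \\
&\le 2^{2r+1}\, C(r)\, (1 + \sup f)^{2r}\, n^r,
\end{aligned}
\]
where the constant $C(r)$ depends only on $r$. Therefore,
\[
E\bigl\{ s(h)^{2r} \bigr\} \le m^{2r} \sum_{k=1}^{m} E\bigl[ E\bigl\{ s_k(h)^{2r} \mid \mathcal{F} \bigr\} \bigr] \le 2^{2r+1}\, C(r)\, (1 + \sup f)^{2r} \bigl( m^2 n h^2 \bigr)^r.
\]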
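Hence, by Markov's inequality, for each $\epsilon > 0$ and each $B_1 > 0$,
\[
\sup_{h \in \mathcal{H}} P\bigl\{ s(h) > m\, n^{(1/2) + \epsilon} h \bigr\} = O\bigl( n^{-B_1} \bigr),
\]
and so, if $\mathcal{H}_n \subseteq \mathcal{H}$ contains no more than $O(n^{B_2})$ elements for some $B_2 > 0$,
\[
P\Bigl\{ \sup_{h \in \mathcal{H}_n} s(h) > m\, n^{(1/2) + \epsilon} h \Bigr\} = O\bigl( n^{-B_3} \bigr),
\]
for all $B_3 > 0$. Taking $\mathcal{H}_n$ to be a regular lattice in $\mathcal{H}$, and noting the Hölder continuity assumed of $K$, we can extend the bound above from $\mathcal{H}_n$ to all of $\mathcal{H}$:
\[
P\Bigl\{ \sup_{h \in \mathcal{H}} s(h) > m\, n^{(1/2) + \epsilon} h \Bigr\} = O\bigl( n^{-B_3} \bigr), \tag{B.34}
\]
for all $B_3 > 0$.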
To derive (B.29), note that, uniformly in $1 \le k \le m$,
\[
\frac{i_k - i_{k-1} - 1}{F(X_{(i_k)}) - F(X_{(i_{k-1})})} = \frac{(i_k - i_{k-1})\, \{1 + O(m/n)\}}{n^{-1} (i_k - i_{k-1})\, \{1 + O_p(m/n^{1/2})\}} = n + O_p\bigl( m\, n^{1/2} \bigr).
\]
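Therefore the third identity in (B.29) can be written as
\[
\begin{aligned}
&\frac{1}{n} \sum_{i \in S} (1 - E_{\mathcal{F}}) \sum_{k=1}^{m} \frac{i_k - i_{k-1} - 1}{F(X_{(i_k)}) - F(X_{(i_{k-1})})} \int_{X_{(i_{k-1})}}^{X_{(i_k)}} K\Bigl(\frac{X_i - x}{h}\Bigr) f(x)\, dx \\
&\quad = \frac{1}{n} \sum_{k=1}^{m} \frac{i_k - i_{k-1} - 1}{F(X_{(i_k)}) - F(X_{(i_{k-1})})} \int_{X_{(i_{k-1})}}^{X_{(i_k)}} \Bigl\{ (1 - E_{\mathcal{F}}) \sum_{i \in S} K\Bigl(\frac{X_i - x}{h}\Bigr) \Bigr\} f(x)\, dx \\
&\quad = \frac{1}{n} \sum_{k=1}^{m} \bigl\{ n + O_p\bigl( m\, n^{1/2} \bigr) \bigr\}\, s_k(h) \\
&\quad = \sum_{i \in S} (1 - E_{\mathcal{F}}) \int K\Bigl(\frac{X_i - x}{h}\Bigr) f(x)\, dx + O_p\bigl( m^2 n^{\epsilon - (1/2)} h \bigr),
\end{aligned}
\]
uniformly in $h \in \mathcal{H}$ for each $\epsilon > 0$, where we derived the last identity using (B.34). This establishes (B.29).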
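To derive (B.30) we replace the series of steps at (B.34) by
\[
\begin{aligned}
&\frac{h^2}{n} \sum_{i \in S} (1 - E_{\mathcal{F}}) \sum_{k=1}^{m} \frac{i_k - i_{k-1} - 1}{F(X_{(i_k)}) - F(X_{(i_{k-1})})}\, r_k \int_{X_{(i_{k-1})}}^{X_{(i_k)}} K\Bigl(\frac{X_i - x}{h}\Bigr) f(x)\, dx \\
&\quad = \frac{h^2}{n} \sum_{k=1}^{m} \frac{(i_k - i_{k-1} - 1)\, r_k}{F(X_{(i_k)}) - F(X_{(i_{k-1})})} \int_{X_{(i_{k-1})}}^{X_{(i_k)}} \Bigl\{ (1 - E_{\mathcal{F}}) \sum_{i \in S} K\Bigl(\frac{X_i - x}{h}\Bigr) \Bigr\} f(x)\, dx \\
&\quad = \frac{h^2}{n} \sum_{k=1}^{m} \bigl\{ n + O_p\bigl( m\, n^{1/2} \bigr) \bigr\}\, r_k\, s_k(h) \\
&\quad = O_p\bigl\{ h^2 n^{-1} \bigl( n + m\, n^{1/2} \bigr) \bigl( \sup_k |r_k| \bigr)\, s(h) \bigr\} = O_p\bigl( m\, n^{(1/2) + \epsilon} h^3 \sup_k |r_k| \bigr),
\end{aligned}
\]
where again the bound applies uniformly in $h \in \mathcal{H}$ for each $\epsilon > 0$, and we obtained the last identity by using (B.34). This implies (B.30).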
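Finally in this step, to establish (B.31) note that, using the exact formula for the remainder in a Taylor expansion,
\[
\begin{aligned}
\frac{1}{h} \int K\Bigl(\frac{X_i - x}{h}\Bigr) f(x)\, dx &= \int K(u)\, f(X_i - hu)\, du \\
&= \int K(u) \Bigl\{ f(X_i) - hu\, f'(X_i) + \int_0^{hu} (hu - v)\, f''(X_i - v)\, dv \Bigr\}\, du \\
&= f(X_i) + h^2 g(X_i, h),
\end{aligned}
\tag{B.35}
\]
where
\[
g(x, h) = \frac{1}{h^2} \int K(u)\, du \int_0^{hu} (hu - v)\, f''(x - v)\, dv
\]
and $g$ has the properties (i) $|g| \le B_1$ where $B_1 = \tfrac12 (\sup |f''|) \int u^2 K(u)\, du$, and (ii)
\[
\sup_x \bigl| g(x, h_1) - g(x, h_2) \bigr| \le B_2 \max\Bigl[ \frac{|h_1 - h_2|}{\min(h_1, h_2)},\ \frac{|h_1 - h_2|^2}{\{\min(h_1, h_2)\}^2} \Bigr].
\]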
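Property (i) of $g$ uses the assumption, in CB(a), that $|f''|$ is bounded, and permits it to be proved that, for each $r \ge 1$,
\[
\sup_{h \in \mathcal{H}} E\Bigl[ \Bigl\{ \sum_{i \in S} (1 - E_{\mathcal{F}})\, g(X_i, h) \Bigr\}^{2r} \Bigr] = O\bigl( n^r \bigr).
\]
Therefore, using Markov's inequality, it can be shown that for each $\epsilon, C > 0$,
\[
\sup_{h \in \mathcal{H}} P\Bigl\{ \Bigl| \sum_{i \in S} (1 - E_{\mathcal{F}})\, g(X_i, h) \Bigr| > n^{(1/2) + \epsilon} \Bigr\} = O\bigl( n^{-C} \bigr).
\]
Hence, if $\mathcal{H}_n$ is any subset of $\mathcal{H}$ containing only polynomially many points (as a function of $n$), then for each $\epsilon, C > 0$,
\[
P\Bigl\{ \sup_{h \in \mathcal{H}_n} \Bigl| \sum_{i \in S} (1 - E_{\mathcal{F}})\, g(X_i, h) \Bigr| > n^{(1/2) + \epsilon} \Bigr\} = O\bigl( n^{-C} \bigr). \tag{B.36}
\]
Property (ii) of $g$ permits us to extend (B.36) from $\mathcal{H}_n$ to $\mathcal{H}$. Combining this result with (B.35) we deduce that (B.31) holds with
\[
Q_{2n} = \frac{1}{n} \sum_{i \in S} (1 - E_{\mathcal{F}})\, f(X_i).
\]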
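Step 6: Bound to $E(T_{61} \mid \mathcal{F}) - T_7$. The bound is given at (B.39). Observe from (B.28), and the argument leading to (B.26), that
\[
\begin{aligned}
E(T_{61} \mid \mathcal{F}) &= h \sum_{k=1}^{m} (i_k - i_{k-1} - 1)\, q_k \sum_{k_1=1}^{m} (i_{k_1} - i_{k_1-1} - 1) \int_{X_{(i_{k-1})}}^{X_{(i_k)}} E\bigl\{ \hat{f}_{k_1}(x) \mid \mathcal{F} \bigr\}\, f_k(x)\, dx \\
&= \sum_{k=1}^{m} (i_k - i_{k-1} - 1)\, q_k \int_{X_{(i_{k-1})}}^{X_{(i_k)}} f_k(x) \Bigl[ \sum_{k_1=1}^{m} \sum_{i=i_{k_1-1}+1}^{i_{k_1}-1} E\Bigl\{ K\Bigl(\frac{X_{(i)} - x}{h}\Bigr) \,\Big|\, \mathcal{F} \Bigr\} \Bigr]\, dx.
\end{aligned}
\]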
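Note too, from (B.6), that
\[
\begin{aligned}
T_7 &= \sum_{k_1=1}^{m} \sum_{i=i_{k_1-1}+1}^{i_{k_1}-1} \sum_{k_2=1}^{m} \sum_{j=i_{k_2-1}+1}^{i_{k_2}-1} I(i \ne j)\, p_{(j)}\, E\Bigl\{ K\Bigl(\frac{X_{(i)} - X_{(j)}}{h}\Bigr) \,\Big|\, \mathcal{F} \Bigr\} \\
&= \sum_{k=1}^{m} (i_k - i_{k-1} - 1)\, (i_k - i_{k-1} - 2)\, q_k \int_{X_{(i_{k-1})}}^{X_{(i_k)}} f_k(x)\, dx \int_{X_{(i_{k-1})}}^{X_{(i_k)}} f_k(y)\, K\Bigl(\frac{y - x}{h}\Bigr)\, dy \\
&\quad + \sum_{k=1}^{m} (i_k - i_{k-1} - 1)\, q_k \int_{X_{(i_{k-1})}}^{X_{(i_k)}} f_k(x) \Bigl[ \sum_{k_1 \,:\, k_1 \ne k} \; \sum_{i=i_{k_1-1}+1}^{i_{k_1}-1} E\Bigl\{ K\Bigl(\frac{X_{(i)} - x}{h}\Bigr) \,\Big|\, \mathcal{F} \Bigr\} \Bigr]\, dx \\
&= E(T_{61} \mid \mathcal{F}) - \sum_{k=1}^{m} (i_k - i_{k-1} - 1)\, q_k \int_{X_{(i_{k-1})}}^{X_{(i_k)}} f_k(x)\, dx \int_{X_{(i_{k-1})}}^{X_{(i_k)}} f_k(y)\, K\Bigl(\frac{y - x}{h}\Bigr)\, dy.
\end{aligned}
\tag{B.37}
\]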
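Since $q_k \le C_3\, n^{-1}$ (see CB(c)) then the subtracted term on the far right-hand side is bounded above by
\[
\frac{C_3}{n} \sum_{k=1}^{m} \frac{i_k - i_{k-1} - 1}{\{F(X_{(i_k)}) - F(X_{(i_{k-1})})\}^2} \int_{X_{(i_{k-1})}}^{X_{(i_k)}} f(x)\, dx \int_{X_{(i_{k-1})}}^{X_{(i_k)}} f(y)\, K\Bigl(\frac{y - x}{h}\Bigr)\, dy = O_p\bigl( m^3 h \bigr), \tag{B.38}
\]
uniformly in $h \in \mathcal{H}$ and $p \in \mathcal{P}$. (Here we have used (B.11).) Therefore, combining (B.37) and (B.38),
\[
E(T_{61} \mid \mathcal{F}) - T_7 = O_p\bigl( m^3 h \bigr), \tag{B.39}
\]
uniformly in $h \in \mathcal{H}$ and $p \in \mathcal{P}$.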
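Step 7: Approximation to $T_9$. Recall that $T_9$ is defined at (B.8). Combining (B.23), (B.33) and (B.39) we deduce that
\[
(nh)^{-1}\, T_9 = \int \hat{f}(x \mid h, p)\, f(x)\, dx + Q_{1n} + Q_{2n} + O_p\bigl( m^2 n^{\epsilon - (3/2)} + m\, n^{\epsilon - (1/2)} h^2 \sup_k |r_k| + m\, n^{-1} \bigr), \tag{B.40}
\]
uniformly in $h \in \mathcal{H}$ and $p \in \mathcal{P}$.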
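Step 8: Bound to $T_8$. The bound is given at (B.48). Recall that $T_8$ was defined at (B.8). In all the arguments in this step, including the randomisation in the next paragraph, we condition on $\mathcal{F}$.
Randomise the order statistics $X_{(i)}$ for $i_{k-1} + 1 \le i \le i_k - 1$, obtaining $X_{k1}, \ldots, X_{k\nu_k}$, say, where $\nu_k = i_k - i_{k-1} - 1$; and do this independently for $k = 1, \ldots, m$. Write $(k_1, j_1) \prec (k_2, j_2)$ if either $k_1 < k_2$, or $k_1 = k_2$ and $j_1 < j_2$. In this way we can order each of the $\nu = \sum_{k \le m} \nu_k$ pairs $(k, j)$. Let $\pi(\ell)$ denote the $\ell$th pair, satisfying $\pi(\ell - 1) \prec \pi(\ell) \prec \pi(\ell + 1)$. Define
\[
\begin{aligned}
Z_{\pi(\ell_1), \pi(\ell_2)} = (q_{k_1} + q_{k_2}) \Bigl[ & K\Bigl(\frac{X_{k_1 j_1} - X_{k_2 j_2}}{h}\Bigr) - E\Bigl\{ K\Bigl(\frac{X_{k_1 j_1} - X_{k_2 j_2}}{h}\Bigr) \,\Big|\, \mathcal{F}, X_{k_1 j_1} \Bigr\} \\
& - E\Bigl\{ K\Bigl(\frac{X_{k_1 j_1} - X_{k_2 j_2}}{h}\Bigr) \,\Big|\, \mathcal{F}, X_{k_2 j_2} \Bigr\} + E\Bigl\{ K\Bigl(\frac{X_{k_1 j_1} - X_{k_2 j_2}}{h}\Bigr) \,\Big|\, \mathcal{F} \Bigr\} \Bigr]
\end{aligned}
\]
when $(k_1, j_1)$ and $(k_2, j_2)$ are the $\ell_1$th and $\ell_2$th pairs, respectively, and write
\[
Z_\ell = \sum_{\ell_1 = 1}^{\ell - 1} Z_{\pi(\ell), \pi(\ell_1)}.
\]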
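In this notation,
\[
\begin{aligned}
T_8 &= T_4 - (T_5 + T_6) + T_7 \\
&= \mathop{\sum\sum}_{(k_1, j_1) \prec (k_2, j_2)} (q_{k_1} + q_{k_2}) \Bigl[ K\Bigl(\frac{X_{k_1 j_1} - X_{k_2 j_2}}{h}\Bigr) - E\Bigl\{ K\Bigl(\frac{X_{k_1 j_1} - X_{k_2 j_2}}{h}\Bigr) \,\Big|\, \mathcal{F}, X_{k_1 j_1} \Bigr\} \\
&\qquad - E\Bigl\{ K\Bigl(\frac{X_{k_1 j_1} - X_{k_2 j_2}}{h}\Bigr) \,\Big|\, \mathcal{F}, X_{k_2 j_2} \Bigr\} + E\Bigl\{ K\Bigl(\frac{X_{k_1 j_1} - X_{k_2 j_2}}{h}\Bigr) \,\Big|\, \mathcal{F} \Bigr\} \Bigr] \\
&= \sum_{\ell_2 = 2}^{\nu} \sum_{\ell_1 = 1}^{\ell_2 - 1} Z_{\pi(\ell_1), \pi(\ell_2)} = \sum_{\ell = 2}^{\nu} Z_\ell.
\end{aligned}
\]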
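We write $\pi(\ell_1) \preceq \pi(\ell_2)$ if either $\pi(\ell_1) \prec \pi(\ell_2)$ or $\pi(\ell_1) = \pi(\ell_2)$. Suppose $\pi(\ell) = (k_0, j_0)$; here, $1 \le k_0 \le m$ and $1 \le j_0 \le \nu_{k_0}$. Define $\mathcal{F}_\ell$ to be the intersection of $\mathcal{F}$ with the smallest $\sigma$-field generated by $X_{kj}$ for all pairs $(k, j)$ satisfying $(k, j) \preceq (k_0, j_0)$. Then, $Z_\ell$ is measurable in $\mathcal{F}_\ell$, and $E(Z_\ell \mid \mathcal{F}_{\ell - 1}) = 0$. Therefore, $Z_1, \ldots, Z_\nu$ is a sequence of zero-mean martingale differences, adapted to the $\sigma$-fields $\mathcal{F}_1, \ldots, \mathcal{F}_\nu$. Hence, using Rosenthal's inequality, we have for each $r \ge 1$
\[
E_{\mathcal{F}}\bigl( T_8^{2r} \bigr) \le C(r) \Bigl[ \Bigl\{ \sum_{\ell = 2}^{\nu} E_{\mathcal{F}}\bigl( Z_\ell^2 \bigr) \Bigr\}^r + \sum_{\ell = 2}^{\nu} E_{\mathcal{F}}\bigl( Z_\ell^{2r} \bigr) \Bigr], \tag{B.41}
\]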
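where the constant $C(r) \ge 1$ depends only on $r$, and all the bounds in this paragraph hold uniformly in $h \in \mathcal{H}$ and $p \in \mathcal{P}$. If $\pi(\ell) = (k, j)$, write simply $X_{\pi(\ell)}$ for $X_{kj}$. Conditional on $\mathcal{F}$ and $X_{\pi(\ell)}$, the random variables $Z_{\pi(\ell_1), \pi(\ell)}$, for $1 \le \ell_1 \le \ell - 1$, are independent and have zero mean. Therefore,
\[
\begin{aligned}
E_{\mathcal{F}}\bigl( Z_\ell^2 \bigr) &= E\Bigl\{ \Bigl( \sum_{\ell_1 = 1}^{\ell - 1} Z_{\pi(\ell_1), \pi(\ell)} \Bigr)^2 \,\Big|\, \mathcal{F} \Bigr\}
= E\Bigl[ E\Bigl\{ \Bigl( \sum_{\ell_1 = 1}^{\ell - 1} Z_{\pi(\ell_1), \pi(\ell)} \Bigr)^2 \,\Big|\, \mathcal{F}, X_{\pi(\ell)} \Bigr\} \,\Big|\, \mathcal{F} \Bigr] \\
&= E\Bigl\{ \sum_{\ell_1 = 1}^{\ell - 1} E\bigl( Z_{\pi(\ell_1), \pi(\ell)}^2 \mid \mathcal{F}, X_{\pi(\ell)} \bigr) \,\Big|\, \mathcal{F} \Bigr\}
= \sum_{\ell_1 = 1}^{\ell - 1} E\bigl( Z_{\pi(\ell_1), \pi(\ell)}^2 \mid \mathcal{F} \bigr) \\
&\le B_1\, n^{-2} \sum_{\ell_1 = 1}^{\ell - 1} E\Bigl\{ K\Bigl(\frac{X_{\pi(\ell_1)} - X_{\pi(\ell)}}{h}\Bigr)^2 \,\Big|\, \mathcal{F} \Bigr\}
\le B_2\, n^{-2}\, n h = B_2\, n^{-1} h.
\end{aligned}
\tag{B.42}
\]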
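A similar bound for $E_{\mathcal{F}}(Z_\ell^{2r} \mid \mathcal{F})$ can be derived using Rosenthal's inequality again:
\[
\begin{aligned}
E_{\mathcal{F}}\bigl( Z_\ell^{2r} \mid \mathcal{F} \bigr) &\le C(r) \Bigl[ \Bigl\{ \sum_{\ell_1 = 1}^{\ell - 1} E\bigl( Z_{\pi(\ell_1), \pi(\ell)}^2 \mid \mathcal{F} \bigr) \Bigr\}^r + \sum_{\ell_1 = 1}^{\ell - 1} E\bigl( |Z_{\pi(\ell_1), \pi(\ell)}|^{2r} \mid \mathcal{F} \bigr) \Bigr] \\
&\le C(r) \bigl\{ \bigl( B_2\, n^{-1} h \bigr)^r + B_3^r\, n^{-2r}\, n h \bigr\} \le C(r) \bigl( B_4\, n^{-1} h \bigr)^r,
\end{aligned}
\tag{B.43}
\]
where, here and below, $B_j$ does not depend on $h$, $m$, $n$ or $r$. Combining (B.41)–(B.43), and recalling from AA that $i_k - i_{k-1} \le C_2\, n/m$, we deduce that
\[
E_{\mathcal{F}}\bigl( T_8^{2r} \bigr) \le C(r)^2 \bigl( B_5\, \nu\, n^{-1} h \bigr)^r \le C(r)^2 \bigl( B_5\, C_2\, m^{-1} h \bigr)^r. \tag{B.44}
\]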
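It follows from Hitczenko (1990) that $C(r)^2 (B_5 C_2)^r \le (B_6\, r / \log r)^{B_7 r}$. Therefore, provided that
\[
m^{-1} h\, n^{2\delta} \le n^{-B_8}, \tag{B.45}
\]
where $B_8 > 0$, we have: $E\{(n^\delta T_8)^{2r}\} \le (B_6\, r / n^{B_8})^{B_7 r}$. The right-hand side here is minimised by taking $r = (B_6 e)^{-1} n^{B_8}$, in which case it equals $\exp(-B_9\, n^{B_8})$, where $B_9 = B_7/(B_6 e)$. Therefore, by Markov's inequality,
\[
\sup_{h \in \mathcal{H}} \sup_{p \in \mathcal{P}} P\bigl\{ \bigl| (nh)^{-1} T_8(h, p) \bigr| > (nh)^{-1} n^{-\delta} \bigr\} \le \sup_{h \in \mathcal{H}} \sup_{p \in \mathcal{P}} E\bigl\{ \bigl( n^\delta T_8 \bigr)^{2r} \bigr\} \le \exp\bigl( -B_9\, n^{B_8} \bigr). \tag{B.46}
\]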
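The choice of $r$ used to reach (B.46) can be verified directly: writing $a = B_6 / n^{B_8}$, the logarithm of $(a r)^{B_7 r}$ is $B_7\, r \log(a r)$, whose derivative $B_7 \{\log(a r) + 1\}$ vanishes at $r = 1/(a e)$, where the logarithm equals $-B_7 / (a e) = -B_9\, n^{B_8}$. The following sketch confirms this numerically; the values assigned to $n$, $B_6$, $B_7$ and $B_8$ are arbitrary illustrative choices of ours, $r$ is treated as a continuous variable, and the function names are hypothetical.

```python
import math

def log_moment_bound(r, n, B6, B7, B8):
    """Logarithm of (B6 * r / n**B8) ** (B7 * r)."""
    return B7 * r * math.log(B6 * r / n ** B8)

def optimal_r(n, B6, B8):
    """Stationary point of the log bound: r = n**B8 / (B6 * e)."""
    return n ** B8 / (B6 * math.e)

n, B6, B7, B8 = 10_000, 2.0, 1.5, 0.3  # illustrative constants only
r_star = optimal_r(n, B6, B8)
B9 = B7 / (B6 * math.e)
value_at_optimum = log_moment_bound(r_star, n, B6, B7, B8)
print(r_star, value_at_optimum, -B9 * n ** B8)
```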
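Let $B_{10}, B_{11} > 0$ be fixed but arbitrarily large. If $c_3$, in CB(d), satisfies $c_3 < B_8$, if $\mathcal{H}_n$ consists of a lattice of $n^{B_{10}}$ distinct values of $h \in \mathcal{H}$, and if $\mathcal{P}_n$ consists of probability distributions $p \in \mathcal{P}$, defined as at AA, when there are just $m \le n^{c_3}$ distinct $p_i$s and each of these takes at most $n^{B_{11}}$ possible values, all satisfying (3.2), then the number of pairs $(h, p) \in \mathcal{H}_n \times \mathcal{P}_n$ is at most $n^{B_{10} + m B_{11}}$, and so (B.46) implies that
\[
\begin{aligned}
P\Bigl\{ \sup_{h \in \mathcal{H}_n} \sup_{p \in \mathcal{P}_n} \bigl| (nh)^{-1} T_8(h, p) \bigr| > (nh)^{-1} n^{-\delta} \Bigr\} &= O\bigl\{ n^{B_{10} + m B_{11}} \exp\bigl( -B_9\, n^{B_8} \bigr) \bigr\} \\
&= O\bigl\{ \exp\bigl( -n^{B_8 - \eta} \bigr) \bigr\},
\end{aligned}
\tag{B.47}
\]
for all $\eta > 0$, provided that $c_3 < B_8$. In this case, choosing $B_{10}$ and $B_{11}$ arbitrarily large, (B.47) can be extended from $(h, p) \in \mathcal{H}_n \times \mathcal{P}_n$, inside the probability statement on the left-hand side, to all $(h, p) \in \mathcal{H} \times \mathcal{P}$. It therefore follows that, provided $\delta > 0$ is chosen so small that (B.45) holds, which in turn requires only that $c_1 - (1/5) + 2\delta < -B_8$, we have
\[
P\Bigl\{ \sup_{h \in \mathcal{H}} \sup_{p \in \mathcal{P}} \bigl| (nh)^{-1} T_8(h, p) \bigr| > (nh)^{-1} n^{-\delta} \Bigr\} \to 0. \tag{B.48}
\]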
Step 9: Completion. Write $Q_n = Q_{1n} + Q_{2n}$. In (3.1), choose $c_1 \in (0, \tfrac{1}{30})$; let $c_2$, in (3.2), be in the interval $(0, c_1)$; choose $B_8 \in (0, \tfrac{1}{10} - 2c_1 - c_2)$; let $c_3$, in CB(d), lie in the interval $(0, \tfrac{1}{10} - 2c_1 - c_2 - B_8)$; and in (B.45)–(B.48), choose $\delta \in (0, (\tfrac15 - c_1 - B_8)/2)$. These choices are appropriate for the argument in the last paragraph of the previous step. Combining (B.3), (B.7), (B.40) and (B.48), we deduce that, for some $\eta > 0$,
\[
\begin{aligned}
\frac{1}{n} \sum_{i=1}^{n} \hat{f}_{-i}(X_i \mid h, p) &= \int \hat{f}(x \mid h, p)\, f(x)\, dx + Q_n + O_p\bigl( m^2 n^{\epsilon - (3/2)} + m^2 n^{-2} h^{-1} \\
&\qquad + (nh)^{-1} n^{-\delta} + m\, n^{\epsilon - (1/2)} h^2 \sup_k |r_k| + m\, n^{-1} \bigr) \\
&= \int \hat{f}(x \mid h, p)\, f(x)\, dx + Q_n + O_p\bigl[ \bigl\{ (nh)^{-1} + h^4 \bigr\} n^{-\eta} \bigr],
\end{aligned}
\]
uniformly in $h \in \mathcal{H}$ and $p \in \mathcal{P}$, where the first identity holds for all $\epsilon > 0$, and the last holds if $\epsilon$ is sufficiently small. The theorem follows directly from this property.
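The intervals named in Step 9 are all non-empty, because $c_2 < c_1 < \tfrac{1}{30}$ gives $\tfrac{1}{10} - 2c_1 - c_2 > \tfrac{1}{10} - 3c_1 > 0$. As a small sanity check, the sketch below picks one admissible set of constants using exact rational arithmetic. Taking $c_2 = c_1/2$, the midpoints of the intervals for $B_8$ and $c_3$, and half the upper limit for $\delta$ is a hypothetical rule of ours, used only for illustration; with this rule the requirement $c_3 < B_8$ from Step 8 is also met.

```python
from fractions import Fraction

def choose_constants(c1):
    """Pick one admissible point in each interval named in Step 9 (illustrative rule)."""
    if not 0 < c1 < Fraction(1, 30):
        raise ValueError("c1 must lie in (0, 1/30)")
    c2 = c1 / 2                                     # c2 in (0, c1)
    B8 = (Fraction(1, 10) - 2 * c1 - c2) / 2        # B8 in (0, 1/10 - 2 c1 - c2)
    c3 = (Fraction(1, 10) - 2 * c1 - c2 - B8) / 2   # c3 in (0, 1/10 - 2 c1 - c2 - B8)
    delta = (Fraction(1, 5) - c1 - B8) / 4          # delta in (0, (1/5 - c1 - B8) / 2)
    return {"c1": c1, "c2": c2, "B8": B8, "c3": c3, "delta": delta}

consts = choose_constants(Fraction(1, 40))
print(consts)
```

The condition $c_1 - (1/5) + 2\delta < -B_8$ quoted before (B.48) is equivalent to $\delta < (\tfrac15 - c_1 - B_8)/2$, so any $\delta$ in the stated interval satisfies it.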
REFERENCES
HITCZENKO, P. (1990). Best constants in martingale version of Rosenthal's inequality. Ann. Probab. 18, 1656–1668.
Highlights:
A cross-validation criterion is proposed for choosing both the bandwidth and the tilted estimator parameters.
It is demonstrated theoretically that the proposed estimator attains a convergence rate strictly faster than the usual rate attained by a conventional kernel estimator.
The performance of the proposed tilted estimator is investigated through both theoretical and numerical studies.