Methods for Sparse Functional Data
by
Edwin Kam Fai Lei
A thesis submitted in conformity with the requirements
for the degree of Doctor of Philosophy
Graduate Department of Statistical Sciences
University of Toronto

© Copyright 2014 by Edwin Kam Fai Lei
Abstract
Methods for Sparse Functional Data
Edwin Kam Fai Lei
Doctor of Philosophy
Graduate Department of Statistical Sciences
University of Toronto
2014
The primary aim of this thesis is to study methods for the analysis of sparse functional
data. Since this type of data is observed infrequently and irregularly for each subject, even
simple descriptive statistics such as the mean and covariance must be reformulated. In
the first part of this thesis, we study a related but more challenging problem of recovering
the underlying functional trajectories when the subjects are genetically correlated. The
key idea is to reconstruct the trajectories by using the Karhunen-Loeve expansion of a
random function with a data-driven eigenbasis. In the second part of this thesis, we study
effective dimension reduction for regression of a scalar response on a sparse functional
predictor. Our proposal estimates the effective dimension reduction space under the
presence of sparse functional data, which has the important property that the projection
of the functional predictor onto it contains as much information on the response as the
functional predictor itself. We derive our estimator’s asymptotic properties and study
its finite sample performance. Lastly, we consider extensions of our effective dimension
reduction procedure for the classification of sparse functional data.
Acknowledgements
First and foremost I would like to thank my supervisor Fang Yao for his patience and
support during the four years of my doctoral studies. Without his timely insights, I
would not have been able to complete this thesis. Secondly, I would like to thank my family
for their unwavering support of my education. Thirdly, I would like to thank the faculty
and staff of the Department of Statistical Sciences for their dedication to the program.
Last but not least I would like to thank Andriy, Angel, Avideh, Darren, David, Eric D.,
Eric Y., Eugene, Jason, Lily, Natalie, and Steve for being great friends.
Contents
1 Introduction
 1.1 Notation, Definitions, and Basic Results
  1.1.1 Theory on Bounded Linear Operators
  1.1.2 Linear Processes in Function Spaces
  1.1.3 Local Polynomial Regression
  1.1.4 Data Model for Independent Subjects
 1.2 Outline of Thesis
2 Data Model for Genetically Correlated Subjects
 2.1 Introduction
  2.1.1 Motivating Application
  2.1.2 Overview
 2.2 Genetic Relationship and Proposed Functional Model
  2.2.1 Background on the Quantitative Genetic Model
  2.2.2 Functional Data Model for Genetically Related Individuals
 2.3 Model Estimation and FPC Representation
  2.3.1 Estimation of Model Components
  2.3.2 FPC Representation for Genetically Related Individuals
 2.4 Application to Weights of Beef Cattle
 2.5 Simulated Examples
 2.6 Conclusion
3 Cumulative Slicing Estimation for Dimension Reduction
 3.1 Introduction
 3.2 Methodology
  3.2.1 Validity of Functional Cumulative Slicing
  3.2.2 Functional Cumulative Slicing for Sparse Functional Data
 3.3 Asymptotic Properties
 3.4 Simulations
 3.5 Data Applications
  3.5.1 Ebay auction data
  3.5.2 Spectrometric data
 3.6 Concluding Remarks
 3.A Regularity Conditions
 3.B Proof of Theorem 3.1
 3.C Proof of Theorem 3.2
 3.D Proof of Theorem 3.3
4 Cumulative Variance Estimation for Classification
 4.1 Introduction
 4.2 Methodology
  4.2.1 Validity of Functional Cumulative Variance
  4.2.2 Functional Cumulative Variance for Sparse Functional Data
 4.3 Simulations
 4.4 Data Applications
 4.A Appendix: Proof of Theorem 4.1
Bibliography
List of Tables
1.1 Commonly used kernel functions in local polynomial regression.
2.1 ISE improvement (%) of the proposed FACE method upon PACE, where Simulation I uses data-based models with different values of (Kg, Ke) and Simulation II examines half-sibling (α = 0.25) and full-sibling (α = 0.5) family relationships.
3.1 Shown are the model error in the form of the operator norm $\|\hat{P}_{K,s_n} - P\|$ with its standard error (in parentheses), and the optimal K and sn that minimize the average model error over 100 Monte Carlo repetitions.
3.2 Shown are the average MSPE with its standard error (in parentheses), and the optimal K and sn that minimize the average MSPE over 100 Monte Carlo repetitions.
3.3 Average 5-fold cross-validated prediction error over 20 Monte Carlo runs with selected K and sn, for dense spectrometric data.
4.1 Shown are the combinations of θkj and µkj we use in our simulation study.
4.2 Shown are the average misclassification error (×100%) with its standard error (in parentheses), and the optimal K and sn that minimize the average misclassification error over 100 Monte Carlo repetitions for sparse functional data.
4.3 Shown are the average misclassification error (×100%) with its standard error (in parentheses), and the optimal K and sn that minimize the average 5-fold cross-validated classification error for the temporal gene expression data.
List of Figures
1.1 Growth information of 8 girls measured between 1 and 18 years of age.
1.2 Proportion of CD4 cells of 6 HIV-positive males at each visit (years).
1.3 Commonly used kernel functions in local polynomial regression.
2.1 Beef cattle data: frequency distributions.
2.2 Estimated mean function (dark) with observed trajectories (light) for the beef cattle data.
2.3 Non-negative definite estimates of the genetic and environmental covariance functions for the beef cattle data.
2.4 Shown are the first (solid), second (dashed), third (dash-dot), and fourth (dotted) eigenfunctions. Left: first three eigenfunctions of the genetic process, accounting for 98% of the genetic variance. Right: first four eigenfunctions of the environmental process, explaining 98.3% of the environmental variance.
2.5 Estimated trajectories by leave-one-family-out cross-validation (CV) for two families of cows obtained using the FACE method (solid) and the PACE method (dashed), where the first row presents two half-siblings from one family and the bottom three rows present six half-siblings from another family. The legend shows the relative CV error of each cow, $\sum_{k=1}^{N_{ij}}\{U_{ijk} - \hat X^{-i}_{ij}(T_{ijk})\}^2/U_{ijk}^2$, obtained from the two methods, where $\hat X^{-i}_{ij}$ is as described in Section 2.4.
3.1 Irregularly and sparsely observed log bid price trajectories of 9 randomly selected auctions over the 7-day duration.
3.2 Average 5-fold cross-validated prediction errors over 20 random partitions across various time domains [0, T], for sparse Ebay auction data.
3.3 Estimated model components for sparse Ebay auction data using FCS with K = 2 and sn = 2. The first and second rows of plots show the estimated index functions, i.e., the EDR directions, and the additive link functions, respectively.
3.4 Absorbance trajectories of 215 meat samples measured over 100 equally spaced wavelengths between 850 nm and 1050 nm.
3.5 Estimated model components for spectrometric data using FCS for (K, sn) = (2, 5). The first and second rows of plots show the estimated EDR directions and additive link functions, respectively.
4.1 Temporal gene expressions.
Chapter 1
Introduction
Functional data analysis (FDA) is concerned with the study of infinite-dimensional data,
such as curves, shapes, and images. Müller (2005) writes,
[Functional data] are affected by time-neighborhood and smoothness rela-
tions; time-order is crucial. The analysis changes in a basic way whenever
the time order of observations is changed. In contrast, in multivariate statisti-
cal analysis, the order of the components of observed random vectors is quite
irrelevant, and any changes in this order leads to the same results. This fact
and the continuous flow of time, which serves as argument, lead to differences
in perspective.
Figure 1.1a provides an example; it shows the heights (cm) of 8 girls measured between
1 and 18 years of age from the Berkeley Growth Study (Tuddenham and Snyder, 1954).
Even though each of the measurements of height involves only discrete values, as indicated
by the circles on each curve, it is not unreasonable to expect that had measurements been
made at every age the data would be a smooth curve, as indicated by the linearly con-
nected trajectories between each observation. Ramsay and Silverman (2005) elaborates
further on the crucial nature of time-order for this dataset,
The ages themselves must also play an explicit role in our analysis... Although
it might be mildly interesting to correlate heights at ages 9, 10 and 10.5, this
would not take account of the fact that we expect the correlation for two ages
separated by only half a year to be higher than that for a separation of one
year.
Under the assumption within FDA that stochastic processes are ultimately smooth
curves, Ramsay et al. (1995) estimated the acceleration curve of the girls’ growth, shown
in Figure 1.1b.
Figure 1.1: Growth information of 8 girls measured between 1 and 18 years of age. (a) Observed height (cm) versus age (years); (b) estimated growth acceleration (cm/year²) versus age (years).
Functional data can be further categorized by the observed spacing between measure-
ments. If a stochastic process is observed in its entirety, we call this completely observed
functional data, a type rarely encountered in practice. If it is observed on a fine grid,
we call this dense functional data. The data shown in Figure 1.1a is an example, even
though the measurements are not equally spaced. Finally, if each sample of a stochastic
process contains very few observations, we call this sparse functional data. Figure 1.2
provides an example; it shows the proportion of CD4 cells (number of CD4 cells divided
by total number of lymphocytes) of 6 out of a total of 283 HIV-positive homosexual
males during each of their visits to the clinic (Kaslow et al., 1987). Longitudinal data is
very similar in this regard, although longitudinal data analysis typically places a greater
emphasis on inferential procedures (Rice, 2004).

Figure 1.2: Proportion of CD4 cells of 6 HIV-positive males at each visit (years).

Müller (2008) elaborates further on the
practical differences between the three types of functional data designs,
If one was given a sample of entirely observed trajectories Yi(t), i = 1, . . . , N ,
for $N$ subjects, a mean trajectory could be defined as the sample average, $\bar{Y}(t) = N^{-1}\sum_{i=1}^{N} Y_i(t)$. However, this relatively straightforward situation is rather
the exception than the norm, as we face the following difficulties: The tra-
jectories may be sampled at sparsely distributed times, with timings varying
from subject to subject; the measurements may be corrupted by noise and
are dependent within the same subject.
This thesis’ primary focus is on modeling and analyzing sparsely observed functional data.
In extreme situations where only a few observations are available for some, or even all,
of the subjects, one must adopt the strategy of “pooling” together data across subjects
with the aim that the entire sample is dense for consistent estimation. Variations on
this strategy will permeate throughout this thesis. A second common theme within this
thesis, and within FDA in general, is the use of dimension reduction to achieve tractable
solutions. Owing to the rich history of dimension reduction in multivariate data anal-
ysis, many of the methods in this thesis are the functional counterparts to established
multivariate techniques, such as principal components analysis and effective dimension
reduction. Finally, the critical assumption within FDA that underlying stochastic pro-
cesses are smooth leads to an extensive use of smoothing methods such as local polynomial
kernel regression.
1.1 Notation, Definitions, and Basic Results
In this section we introduce the notation and present some definitions and basic theorems
(proofs omitted) we will be using throughout this thesis. The following material on op-
erator theory, linear processes in function spaces, and local polynomial kernel regression
is primarily adapted from Kato (1995), Bosq (2000), and Fan and Gijbels (1996) respec-
tively. In Chapter 1.1.4, we will introduce a sparse functional data model for independent
subjects, adapted from Yao et al. (2005a).
1.1.1 Theory on Bounded Linear Operators
Let $H$ be a separable Hilbert space endowed with inner product $\langle\cdot,\cdot\rangle$ and the norm induced by its inner product, $\|\cdot\| = \sqrt{\langle\cdot,\cdot\rangle}$. Recall that an operator $T$ acting on $H$ is bounded if there exists $M < \infty$ such that $\|Tf\| \le M\|f\|$ for all $f \in H$. Let $\mathcal{B}$ be the space of bounded linear operators from $H$ to itself. $\mathcal{B}$ is a Banach space equipped with the uniform, or operator, norm
\[ \|T\|_{\mathcal{B}} = \sup_{\|f\| \le 1} \|Tf\|. \]
Definition 1.1. The adjoint operator of $T \in \mathcal{B}$, namely $T^*$, satisfies $\langle Tf, g\rangle = \langle f, T^*g\rangle$ for all $f, g \in H$.
Definition 1.2. An operator T is said to be self-adjoint if it is its own adjoint, i.e.,
T = T ∗.
Definition 1.3. A bounded operator $T$ is compact if it can be expressed as
\[ Tf = \sum_{j=1}^{\infty} t_j \langle f, v_j\rangle u_j, \quad \forall f \in H, \tag{1.1} \]
where $\{t_j\}_{j\in\mathbb{N}}$ is a decreasing sequence of positive numbers with limit zero, and $\{u_j\}_{j\in\mathbb{N}}$ and $\{v_j\}_{j\in\mathbb{N}}$ are two orthonormal but not necessarily complete sets.
Note that the operator $T$ can be written succinctly as
\[ T = \sum_{j=1}^{\infty} t_j\, v_j \otimes u_j, \]
where the tensor product $f \otimes g$ denotes the rank-one operator on $H$ that maps $h$ to $(f \otimes g)h = \langle h, f\rangle g$. If $T$ is self-adjoint, then
\[ T = \sum_{j=1}^{\infty} t_j\, v_j \otimes v_j, \tag{1.2} \]
where $\{v_j\}_{j\in\mathbb{N}}$ forms a complete and orthonormal basis of $H$. Observe that (1.2) implies $Tv_j = t_j v_j$, and thus $\{(t_j, v_j)\}_{j\in\mathbb{N}}$ are the eigenelements of $T$.
Definition 1.4. If there exists $K < \infty$ such that
\[ T = \sum_{j=1}^{K} t_j\, v_j \otimes u_j, \]
then $T$ is said to be a finite-rank operator with rank $K$.
Definition 1.5. A compact operator $T$ with the expansion in (1.2) is said to be a Hilbert-Schmidt operator if $\sum_{j=1}^{\infty} t_j^2 < \infty$.
Denote the set of Hilbert-Schmidt operators on $H$ by $\mathcal{S}$, which itself is a Hilbert space equipped with inner product
\[ \langle S, T\rangle_{\mathcal{S}} = \sum_{j=1}^{\infty} \langle Sv_j, Tv_j\rangle, \quad S, T \in \mathcal{S}, \]
and norm
\[ \|T\|_{\mathcal{S}} = \sqrt{\langle T, T\rangle_{\mathcal{S}}} = \Big(\sum_{j=1}^{\infty} t_j^2\Big)^{1/2}, \]
where $\{(t_j, v_j)\}_{j\in\mathbb{N}}$ are the eigenelements of $T$. It is easy to check that $\|\cdot\|_{\mathcal{S}} \ge \|\cdot\|_{\mathcal{B}}$. The following theorem connects the difference of two compact operators with their respective eigenelements.
Theorem 1.1. Let $S$ and $T$ be two linear, self-adjoint, and compact operators on $H$ whose respective spectral expansions are given by
\[ S = \sum_{j=1}^{\infty} s_j\, v_j \otimes v_j, \qquad T = \sum_{j=1}^{\infty} t_j\, u_j \otimes u_j. \]
Then, for any $j \in \mathbb{N}$,
\[ |t_j - s_j| \le \|T - S\|_{\mathcal{B}}, \]
and
\[ \|u_j - v_j\| \le \frac{2\sqrt{2}}{a_j}\, \|T - S\|_{\mathcal{B}}, \]
where $a_1 = s_1 - s_2$ and $a_j = \min(s_{j-1} - s_j, s_j - s_{j+1})$ for $j \ge 2$.
1.1.2 Linear Processes in Function Spaces
Hereafter, let $H$ denote the real and separable Hilbert space $L^2(\mathcal{T})$ for a compact interval $\mathcal{T}$. $H$ is equipped with inner product $\langle f, g\rangle = \int_{\mathcal{T}} f(t)g(t)\,dt$ and norm $\|f\| = \sqrt{\langle f, f\rangle}$.
We assume our stochastic process X is H-valued with continuous sample paths. The
expectation of an H-valued random function X is defined as µ(t) := E(X)(t) = E(X(t))
for any t ∈ T . The covariance function of X is defined by Σ(s, t) := cov(X(s), X(t)) =
E{(X(s) − µ(s))(X(t) − µ(t))} for any s, t ∈ T . Recall the covariance function is sym-
metric and positive-definite. We now turn to our first major theorem from functional
analysis.
Theorem 1.2 (Mercer’s Theorem). Let K(s, t) be a continuous, symmetric and positive-
definite function on L2(T × T ). Then there exists an orthonormal basis of H, namely
{φj}j∈N, and a sequence of decreasing positive numbers {λj}j∈N such that
\[ K(s,t) = \sum_{j=1}^{\infty} \lambda_j \phi_j(s)\phi_j(t), \tag{1.3} \]
where the convergence is uniform on $\mathcal{T} \times \mathcal{T}$.
Corollary 1.1 (Spectral Expansion). If X has a finite second moment, i.e., E‖X‖2 <∞,
then $\Sigma(s,t)$ admits the decomposition
\[ \Sigma(s,t) = \sum_{j=1}^{\infty} \lambda_j \phi_j(s)\phi_j(t). \]
Moreover,
\[ \int_{\mathcal{T}} \Sigma(s,t)\phi_j(s)\,ds = \lambda_j \phi_j(t), \quad j \in \mathbb{N}, \]
and thus $\phi_j(t)$ is the eigenfunction of $\Sigma(s,t)$ associated with eigenvalue $\lambda_j$. We also have the identity
\[ \int_{\mathcal{T}} \Sigma(t,t)\,dt = \sum_{j=1}^{\infty} \lambda_j < \infty. \]
The next result will appear in many instances throughout the thesis and serves as the
backbone to functional principal components analysis.
Theorem 1.3 (Karhunen-Loeve Expansion). Let $X$ be zero-mean with a finite second moment. Let $\{(\lambda_j, \phi_j)\}_{j\in\mathbb{N}}$ be the eigenelements of $\Sigma(s,t)$. Then $X$ admits the expansion
\[ X(t) = \sum_{j=1}^{\infty} \xi_j \phi_j(t), \tag{1.4} \]
where $\{\xi_j\}_{j\in\mathbb{N}}$ are pairwise uncorrelated zero-mean real-valued random variables with $\lambda_j = E\xi_j^2$, and the convergence is uniform with respect to the $H$-norm.
Corollary 1.2. Let $X$ be a zero-mean Gaussian process with covariance $\Sigma(s,t)$. Let $\{(\lambda_j, \phi_j)\}_{j\in\mathbb{N}}$ be the eigenelements of $\Sigma(s,t)$. Then $X$ admits the expansion
\[ X(t) = \sum_{j=1}^{\infty} \xi_j \phi_j(t), \]
where the $\xi_j$ are mutually independent and distributed as $N(0, \lambda_j)$ for $j \in \mathbb{N}$.
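For intuition, one can simulate such a Gaussian process by truncating the expansion. Below is a minimal sketch, with an illustrative Fourier eigenbasis on $[0,1]$ and eigenvalues $\lambda_j = 2^{-j}$; neither choice is prescribed by the theory.

```python
import numpy as np

def simulate_kl_gaussian(n_curves, grid, n_terms=20, seed=0):
    """Simulate zero-mean Gaussian trajectories via a truncated
    Karhunen-Loeve expansion X(t) = sum_j xi_j phi_j(t)."""
    rng = np.random.default_rng(seed)
    j = np.arange(1, n_terms + 1)
    lam = 2.0 ** (-j)                                    # illustrative eigenvalues
    # Orthonormal Fourier eigenfunctions on [0, 1] (illustrative choice)
    phi = np.sqrt(2) * np.sin(np.outer(grid, j) * np.pi) # (len(grid), n_terms)
    # Independent scores xi_j ~ N(0, lambda_j), as in Corollary 1.2
    xi = rng.normal(size=(n_curves, n_terms)) * np.sqrt(lam)
    return xi @ phi.T                                    # (n_curves, len(grid))

grid = np.linspace(0, 1, 101)
X = simulate_kl_gaussian(5, grid)
```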
We are now ready to connect stochastic processes in function spaces to our previous discussion on operator theory. For specificity, we are still working in $H = L^2(\mathcal{T})$. If $X$ has a finite second moment, i.e., $E\|X\|^2 < \infty$, then the kernel operator
\[ (\Sigma f)(t) = \int_{\mathcal{T}} \Sigma(s,t) f(s)\,ds, \quad \forall t \in \mathcal{T},\ f \in H, \]
associated with the kernel function $\Sigma(s,t)$ is a bounded linear operator on $H$, i.e., $\Sigma \in \mathcal{B}$. From this definition, it is easy to show that $\Sigma f = E(\langle X, f\rangle X)$ and that the symmetry of $\Sigma(s,t)$ implies $\Sigma$ is self-adjoint. Further,
\[ \int_{\mathcal{T}}\int_{\mathcal{T}} \Sigma^2(s,t)\,ds\,dt = \int_{\mathcal{T}}\int_{\mathcal{T}} \Big(\sum_{j=1}^{\infty} \lambda_j\phi_j(s)\phi_j(t)\Big)^2 ds\,dt = \sum_{j=1}^{\infty}\sum_{k=1}^{\infty} \lambda_j\lambda_k \langle\phi_j, \phi_k\rangle^2 = \sum_{j=1}^{\infty} \lambda_j^2 < \infty, \]
where the first equality follows from applying Mercer’s Theorem, the second from the
uniform convergence in Mercer’s Theorem, and the third from the orthonormal nature
of {φj}j∈N. Thus, Σ is a self-adjoint Hilbert-Schmidt operator whose spectral expansion
is given by
\[ \Sigma = \sum_{j=1}^{\infty} \lambda_j\, \phi_j \otimes \phi_j. \]
In fact, the last identity in Corollary 1.1 implies that Σ belongs to the class of nuclear
operators, a subset of the Hilbert-Schmidt class, but we will not need this result in
the thesis. Note we have incidentally shown that the H-norm of Σ(s, t) is equal to the
Hilbert-Schmidt norm of the operator Σ.
1.1.3 Local Polynomial Regression
Local polynomial regression provides a flexible approach to studying the relationship
between dependent and independent variables without imposing strong functional as-
sumptions on the nature of this relationship. To be precise, given the population pair
(X, Y ), our primary interest is to study the regression function m(x) = E(Y |X = x).
From a statistical perspective, we typically assume observed data pairs {(Xi, Yi)}i∈N are
independent and identically distributed (i.i.d.) according to the model
\[ Y = m(X) + \varepsilon, \tag{1.5} \]
where the regression error ε has zero mean, finite variance, and is independent of X.
If we assume that the (p + 1)th derivative of the conditional mean m(x) exists at a
point x0, then we can approximate m(x) by a polynomial of order p. Taylor’s expansion
in a neighborhood around x0 gives
\[ m(x) \approx m(x_0) + \sum_{r=1}^{p} \frac{m^{(r)}(x_0)}{r!}\,(x - x_0)^r, \]
where m(r)(x0) is the rth derivative of m evaluated at the point x0. Let m(x0) = β0 and
m(r)(x0)/r! = βr. These are fitted by solving the weighted least squares problem
\[ (\hat\beta_0, \ldots, \hat\beta_p)^\top = \operatorname*{argmin}_{\beta_0, \ldots, \beta_p} \sum_{i=1}^{n} K\Big(\frac{X_i - x_0}{h}\Big)\Big\{Y_i - \sum_{r=0}^{p} \beta_r (X_i - x_0)^r\Big\}^2, \tag{1.6} \]
where K is a kernel function that assigns larger weights to points closer to x0, and,
conversely, smaller weights to points farther away. The bandwidth h controls the size of
the neighborhood around x0. To estimate the entire function m, we solve (1.6) for all
points x0 in the domain of interest.
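For concreteness, here is a minimal NumPy sketch of the local linear case ($p = 1$) of (1.6), using the Epanechnikov kernel from Table 1.1 below; the bandwidth must be large enough that each window contains at least two points.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel from Table 1.1; zero outside |u| <= 1."""
    return 0.75 * (1 - u**2) * (np.abs(u) <= 1)

def local_linear(x, y, x0_grid, h):
    """Local linear (p = 1) fit of (1.6): solve a weighted least
    squares problem at each evaluation point x0."""
    fitted = np.empty(len(x0_grid))
    for i, x0 in enumerate(x0_grid):
        w = epanechnikov((x - x0) / h)
        X = np.column_stack([np.ones_like(x), x - x0])  # columns for beta0, beta1
        XtW = X.T * w
        beta = np.linalg.solve(XtW @ X, XtW @ y)
        fitted[i] = beta[0]                             # m_hat(x0) = beta0_hat
    return fitted

# Example: recover a noisy sine curve
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 200)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=200)
m_hat = local_linear(x, y, np.linspace(0, 1, 50), h=0.1)
```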
Let 1(A) be the indicator function on the set A. Table 1.1 lists several commonly
used kernel functions in local polynomial regression with the corresponding curves in
Figure 1.3. It is well known that the choice of kernel is secondary to the choice of
bandwidth h.
Uniform: $K(u) = \frac{1}{2}\,1(|u| \le 1)$
Triangular: $K(u) = (1 - |u|)\,1(|u| \le 1)$
Epanechnikov: $K(u) = \frac{3}{4}(1 - u^2)\,1(|u| \le 1)$
Gaussian: $K(u) = \frac{1}{\sqrt{2\pi}}\, e^{-u^2/2}$
Table 1.1: Commonly used kernel functions in local polynomial regression.
The large sample performance of local polynomial estimators is almost always assessed by the (integrated) mean squared error (MSE) under the scenario $h \to 0$ as the sample size $n \to \infty$. The intuition behind “small-$h$” asymptotics is that one typically requires a smaller neighborhood with larger sample sizes. The MSE has the familiar decomposition into its bias and variance components,
\[ \mathrm{MSE}(\hat m) = \int E\big[\{\hat m(x) - m(x)\}^2\big]\,dx = \int \mathrm{bias}^2\{\hat m(x)\}\,dx + \int \mathrm{var}\{\hat m(x)\}\,dx. \]
Figure 1.3: Commonly used kernel functions in local polynomial regression.
1.1.4 Data Model for Independent Subjects
Functional data analysis (FDA) has attracted substantial research interest and has pro-
vided powerful tools to study data arising from a collection of curves rather than from
scalars or vectors. Ramsay and Silverman (2005) offer a comprehensive introduction to
FDA. A key issue in modeling functional data is the representation of the underlying
process X, which is often of a complex nature and requires regularization. A common
approach is to utilize functional principal component (FPC) analysis (FPCA), exploiting
a data-driven eigenbasis to represent X. When the design of the functional data is dense,
FPCA has been studied extensively by Rice and Silverman (1991), James et al. (2000),
Yao et al. (2005a), Hall and Hosseini-Nasab (2006), Hall et al. (2006), and references
therein. The eigenbasis is the unique canonical basis leading to a generalized Fourier
series, i.e., the Karhunen-Loeve expansion (Theorem 1.3). The advantage of this expan-
sion is that it gives the most rapidly convergent representation of X in the L2 sense (Ash
and Gardner, 1975). In addition, the connection between the Karhunen-Loeve expansion
and Mercer’s Theorem (Theorem 1.2) implies that FPCA also characterizes the domi-
nant modes of variation of a sample of functional data. These theoretical and practical
considerations have led FPCA to be one of the standard procedures in FDA.
However, when the functional data is sparse, for example when there is only one
or two observations per subject, the standard approach of estimating the FPC scores,
i.e., generalized Fourier coefficients, by numerical integration does not work well. Using a
reduced rank mixed effects approach, Rice and Wu (2001), James et al. (2000), and James
and Sugar (2003) overcame this issue by modeling each individual trajectory as B-splines
with random coefficients. However, as noted by Yao et al. (2005a), James et al. (2000)
did not study the asymptotic properties of their estimators owing to the complexity of the
mixed effects approach, deciding instead to construct pointwise confidence intervals using
the bootstrap. In contrast, we review in this section the method of Principal components
Analysis through Conditional Expectation (PACE) by Yao et al. (2005a). It recovers the
Chapter 1. Introduction 13
individual trajectories directly through the Karhunen-Loeve expansion and thus allows
for the derivation of the relevant asymptotic properties.
Methodology
As in Chapter 1.1 we assume X is a random function defined on H = L2(T ) for a com-
pact interval T . Additionally, X has mean function µ(t) = EX(t), finite second moment,
i.e., E‖X‖2 <∞, and covariance function Σ(s, t) = cov(X(s), X(t)). Let X1, . . . , Xn be
independently and identically distributed (i.i.d.) as X. Mercer’s Theorem implies that
there exists a spectral expansion of Σ(s, t) whose eigenelements are {(λk, φk)}k∈N. Ad-
ditionally, the Karhunen-Loeve expansion implies that there exists a generalized Fourier
expansion of $X_i(t)$ for $i = 1, \ldots, n$ given by $X_i(t) = \mu(t) + \sum_{k=1}^{\infty} \xi_{ik}\phi_k(t)$, where $\xi_{ik}$ has zero mean and $E[\xi_{ik}\xi_{i'k'}] = \lambda_k$ if $i = i'$, $k = k'$, and $0$ otherwise.
In reality, sparse functional data is often observed with additive measurement error ε,
whose mean is zero and whose variance is σ2. To accurately reflect the nature of sparse functional
data, we assume both the number of observations per subject and the observation times
to be random. To be precise, let the number of observations per subject Ni be i.i.d. N ,
where N is a bounded positive discrete random variable, and Tij be a random variable
on T that denotes the jth observation of Xi. Then, the data model for noisy sparse
functional data is
\[ U_{ij} = X_i(T_{ij}) + \varepsilon_{ij} = \mu(T_{ij}) + \sum_{k=1}^{\infty} \xi_{ik}\phi_k(T_{ij}) + \varepsilon_{ij}, \quad T_{ij} \in \mathcal{T},\ 1 \le j \le N_i,\ 1 \le i \le n, \tag{1.7} \]
where εij is i.i.d. ε. This eigenfunction approach differs from a random regression model
with spline basis functions, as the eigenfunction basis is completely data-driven, while
the spline function basis is pre-specified without knowledge of the data.
We use local linear smoothing over the pooled noisy sparse observations to estimate
the mean function $\mu(t)$. To be specific, $\hat\mu(t) = \hat a_0$, where
\[ (\hat a_0, \hat a_1)^\top = \operatorname*{argmin}_{a_0, a_1} \sum_{i=1}^{n} \sum_{j=1}^{N_i} K_1\Big(\frac{T_{ij} - t}{h_1}\Big)\{U_{ij} - a_0 - a_1(T_{ij} - t)\}^2, \tag{1.8} \]
where
K1 is a non-negative and symmetric univariate kernel density function and h1 = h1(n) is
the bandwidth that controls the amount of smoothing. Note that h1 depends only on the sample size n, so the smoother ignores the dependency between measurements made on the same subject, a strategy that Lin and Carroll (2000) showed to be asymptotically most efficient.
We use leave-one-curve-out cross-validation to select h1, although a subjective choice is
often sufficient in practice.
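In code, this pooling amounts to concatenating all subjects' observations before smoothing; a minimal sketch, reusing the `local_linear` smoother sketched in Chapter 1.1.3, is:

```python
import numpy as np

def estimate_mean(T_list, U_list, t_grid, h1):
    """Estimate mu(t) as in (1.8): pool the sparse observations from all
    subjects, then apply an ordinary local linear smoother (local_linear
    is the sketch from Chapter 1.1.3)."""
    T_pool = np.concatenate(T_list)   # all observation times, all subjects
    U_pool = np.concatenate(U_list)   # all noisy measurements
    return local_linear(T_pool, U_pool, t_grid, h1)
```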
For $1 \le i \le n$, $1 \le j \le N_i$, let $G_i(T_{ij}, T_{il}) = \{U_{ij} - \hat\mu(T_{ij})\}\{U_{il} - \hat\mu(T_{il})\}$ denote the observed raw covariance. Observe that
\[ E[G_i(T_{ij}, T_{il}) \mid T_{ij}, T_{il}] \approx E[U_{ij}U_{il} \mid T_{ij}, T_{il}] - \mu(T_{ij})\mu(T_{il}) = \mathrm{cov}[X_i(T_{ij}), X_i(T_{il}) \mid T_{ij}, T_{il}] + \sigma^2\delta_{jl}, \]
where $\delta_{jl} = 1$ if $j = l$ and $0$ otherwise. Since the measurement error inflates only the diagonal, this suggests that only $\{G_i(T_{ij}, T_{il}) : 1 \le i \le n,\ 1 \le j \ne l \le N_i\}$ should be included as input data for estimation of the covariance surface $\Sigma(s,t)$. Thus, $\hat\Sigma(s,t) = \hat b_0$, where
\[ (\hat b_0, \hat b_1, \hat b_2)^\top = \operatorname*{argmin}_{b_0, b_1, b_2} \sum_{i=1}^{n} \sum_{1 \le j \ne l \le N_i} K_2\Big(\frac{T_{ij} - s}{h_2}, \frac{T_{il} - t}{h_2}\Big)\{G_i(T_{ij}, T_{il}) - b_0 - b_1(T_{ij} - s) - b_2(T_{il} - t)\}^2, \tag{1.9} \]
where
K2 is a non-negative and symmetric bivariate kernel density function and h2 = h2(n) is
the bandwidth that controls the amount of smoothing. We again use leave-one-curve-out cross-validation to select h2.
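A minimal sketch of assembling these off-diagonal inputs follows; the two-dimensional smoothing step itself is omitted, and `mu_hat` is a hypothetical helper that evaluates the fitted mean at given times.

```python
import numpy as np

def raw_covariances(T_list, U_list, mu_hat):
    """Collect {(T_ij, T_il, G_i(T_ij, T_il)) : j != l} as input for the
    two-dimensional smoother in (1.9)."""
    s_in, t_in, g_in = [], [], []
    for T_i, U_i in zip(T_list, U_list):
        resid = U_i - mu_hat(T_i)            # centered observations
        for j in range(len(T_i)):
            for l in range(len(T_i)):
                if j != l:                   # drop diagonal: it carries sigma^2
                    s_in.append(T_i[j])
                    t_in.append(T_i[l])
                    g_in.append(resid[j] * resid[l])
    return np.array(s_in), np.array(t_in), np.array(g_in)
```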
The smoothing step in (1.9) also hints at the estimation of $\sigma^2$ by
\[ \hat\sigma^2 = |\mathcal{T}_1|^{-1} \int_{\mathcal{T}_1} \{\tilde\Sigma(t) - \hat\Sigma(t,t)\}\,dt, \tag{1.10} \]
where the diagonal estimate $\tilde\Sigma$ is obtained by smoothing $G_i(T_{ij}, T_{ij})$ over all individuals. The region of integration, $\mathcal{T}_1$, of length $|\mathcal{T}_1|$, is taken as the middle half of the whole interval $\mathcal{T}$ to reduce boundary effects introduced by smoothing. To better estimate $\Sigma(s,t)$ along the “height ridge” when $s \approx t$, we adjust the estimate $\tilde\Sigma(t)$ using a local quadratic smoother; see Yao et al. (2003) for details.
The estimated eigenelements $\{(\hat\lambda_k, \hat\phi_k)\}_{k\in\mathbb{N}}$ thus solve the eigenvalue problem
\[ \int_{\mathcal{T}} \hat\Sigma(s,t)\hat\phi_k(s)\,ds = \hat\lambda_k\hat\phi_k(t), \]
subject to the orthonormality constraint 〈φk, φm〉 = δkm. This can be solved numerically
by discretizing Σ(s, t) into a fine grid of equally spaced time points and carrying out
multivariate principal components analysis (Ramsay and Silverman, 2005, Chapter 8.4).
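Concretely, if $\hat\Sigma$ has been evaluated on an equally spaced grid with spacing $\Delta$, a minimal sketch of this discretization is given below; the $\Delta$ factors convert between matrix eigenelements and their functional counterparts.

```python
import numpy as np

def fpca_from_covariance(Sigma_grid, delta):
    """Solve the eigenvalue problem by discretizing Sigma(s, t):
    integrals become Riemann sums with grid spacing delta."""
    evals, evecs = np.linalg.eigh(Sigma_grid)
    order = np.argsort(evals)[::-1]            # decreasing eigenvalues
    lam = evals[order] * delta                 # quadrature scaling of eigenvalues
    phi = evecs[:, order] / np.sqrt(delta)     # unit L2-norm eigenfunctions
    return lam, phi
```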
Principal Components Analysis through Conditional Expectation
It is obvious that when functional data is observed sparsely, the standard approach to estimating the FPC scores $\xi_{ik} = \int_{\mathcal{T}} \{X_i(t) - \mu(t)\}\phi_k(t)\,dt$ via numerical integration does not work. Further, since under model (1.7) the trajectories are observed with noise, substituting $X_i(T_{ij})$ with $U_{ij}$ leads to biased estimates of $\xi_{ik}$. These two
observations are the primary motivations for Principal components Analysis through
Conditional Expectation (PACE, Yao et al., 2005a).
If we assume model (1.7) can be well-approximated by the first K functional principal
components, then we can write it as
\[ \boldsymbol{U}_i = \boldsymbol{\mu}_i + \Phi_i \boldsymbol{\xi}_i + \boldsymbol{\varepsilon}_i, \]
where $\boldsymbol{U}_i = (U_{i1}, \ldots, U_{iN_i})^\top$, $\boldsymbol{\mu}_i = (\mu(T_{i1}), \ldots, \mu(T_{iN_i}))^\top$, $\boldsymbol{\phi}_{ik} = (\phi_k(T_{i1}), \ldots, \phi_k(T_{iN_i}))^\top$, $\boldsymbol{\varepsilon}_i = (\varepsilon_{i1}, \ldots, \varepsilon_{iN_i})^\top$, and $\boldsymbol{\xi}_i = (\xi_{i1}, \ldots, \xi_{iK})^\top$ are vectors, and $\Phi_i = (\boldsymbol{\phi}_{i1}, \ldots, \boldsymbol{\phi}_{iK})$ is an $N_i \times K$ matrix.
The best linear unbiased predictor (BLUP; Henderson, 1950) of $\boldsymbol{\xi}_i$ is given by
\[ \tilde{\boldsymbol{\xi}}_i = \Lambda\, \Phi_i^\top \Sigma_i^{-1} (\boldsymbol{U}_i - \boldsymbol{\mu}_i), \tag{1.11} \]
where $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_K)$ and $\Sigma_i$ is the $N_i \times N_i$ matrix whose $(j,l)$th element is given by $\mathrm{cov}[U_{ij}, U_{il} \mid T_{ij}, T_{il}] = \Sigma(T_{ij}, T_{il}) + \sigma^2\delta_{jl}$. Let $\boldsymbol{T}_i = (T_{i1}, \ldots, T_{iN_i})^\top$. It is well known that if $\boldsymbol{\xi}_i$ and $\boldsymbol{\varepsilon}_i$ are additionally jointly Gaussian, then $\tilde{\boldsymbol{\xi}}_i = E[\boldsymbol{\xi}_i \mid \boldsymbol{U}_i, \boldsymbol{T}_i]$ and is optimal in mean squared error. The PACE estimate of $\boldsymbol{\xi}_i$ is thus given by
\[ \hat{\boldsymbol{\xi}}_i = \hat\Lambda\, \hat\Phi_i^\top \hat\Sigma_i^{-1} (\boldsymbol{U}_i - \hat{\boldsymbol{\mu}}_i), \tag{1.12} \]
where the $(j,l)$th element of $\hat\Sigma_i$ is given by $\hat\Sigma(T_{ij}, T_{il}) + \hat\sigma^2\delta_{jl}$. The prediction for $X_i(t)$ with dimension reduction is thus
\[ \hat X_i(t) = \hat\mu(t) + \sum_{k=1}^{K} \hat\xi_{ik}\hat\phi_k(t). \tag{1.13} \]
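For illustration, a minimal sketch of this conditioning step for one subject follows; `mu_hat` and `phi_hat` are hypothetical helpers returning the fitted mean and the $K$ leading eigenfunctions at given times, and $\hat\Sigma$ is approximated by its leading $K$ terms.

```python
import numpy as np

def pace_scores(T_i, U_i, mu_hat, phi_hat, lam_hat, sigma2_hat):
    """PACE estimate (1.12) of the K leading FPC scores for one subject.
    phi_hat(t) returns a (len(t), K) matrix of eigenfunction values."""
    Phi = phi_hat(T_i)                                   # N_i x K
    # Rank-K approximation of Sigma_i = [Sigma(T_ij, T_il)] + sigma^2 I
    Sigma_i = Phi @ np.diag(lam_hat) @ Phi.T + sigma2_hat * np.eye(len(T_i))
    resid = U_i - mu_hat(T_i)
    # Lambda Phi' Sigma_i^{-1} (U_i - mu_i), as in (1.11)-(1.12)
    return lam_hat * (Phi.T @ np.linalg.solve(Sigma_i, resid))

def pace_predict(t_grid, xi_hat, mu_hat, phi_hat):
    """Trajectory prediction (1.13): mu(t) + sum_k xi_k phi_k(t)."""
    return mu_hat(t_grid) + phi_hat(t_grid) @ xi_hat
```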
Selecting the Number of Functional Principal Components
Let $\hat\mu^{-(i)}$ and $\{\hat\phi_k^{-(i)}\}_{k\in\mathbb{N}}$ denote the mean and eigenfunctions estimated from the data excluding subject $i$, respectively. We use leave-one-curve-out cross-validation to select the number of principal components $K$ in the prediction of $X$ in (1.13). To be precise, we select $K$ as
\[ \hat K = \operatorname*{argmin}_{K} \sum_{i=1}^{n} \sum_{j=1}^{N_i} \{U_{ij} - \hat X_i^{-(i)}(T_{ij})\}^2, \]
where $\hat X_i^{-(i)}(T_{ij}) = \hat\mu^{-(i)}(T_{ij}) + \sum_{k=1}^{K} \hat\xi_{ik}\hat\phi_k^{-(i)}(T_{ij})$ represents the predicted trajectory for subject $i$. However, in practice a subjective choice such as the fraction of variance explained is often sufficient. More specifically, for a user-defined threshold $0 < \alpha < 1$, we select $K$ as
\[ \hat K = \min\Big\{K : \frac{\sum_{k=1}^{K} \hat\lambda_k}{\sum_{k=1}^{\infty} \hat\lambda_k} \ge \alpha\Big\}. \]
For an AIC-type criterion, we refer the reader to equation (11) in Yao et al. (2005a).
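Given the estimated eigenvalues, the fraction-of-variance rule is a one-liner; a minimal sketch follows, with the infinite sum replaced, as in practice, by the sum over all positive estimated eigenvalues.

```python
import numpy as np

def choose_K_fve(lam_hat, alpha=0.95):
    """Smallest K whose leading eigenvalues (sorted decreasingly)
    explain a fraction alpha of the total estimated variance."""
    lam = np.asarray(lam_hat)
    fve = np.cumsum(lam) / lam.sum()
    return int(np.searchsorted(fve, alpha) + 1)
```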
1.2 Outline of Thesis
In Chapter 2, we extend the Principal components Analysis through Conditional Ex-
pectation procedure of Chapter 1.1.4 to the case of genetically correlated subjects. The
motivating example concerns sparse measurements of mass of sibling cows from several
independent families. In Chapter 3, we consider the problem of dimension reduction in
functional regression under the framework of effective dimension reduction. Our pro-
posal draws inspiration from multivariate cumulative slicing estimation; it provides an
innovative solution to the challenging problem of characterizing the effective dimension
reduction space in the presence of sparse functional data. In Chapter 4, we apply our
effective dimension reduction proposal to study functional classification.
Chapter 2
Data Model for Genetically
Correlated Subjects
2.1 Introduction
The aforementioned works on FPC approaches in Chapter 1.1.4 deal exclusively with
independent subjects. Very little work has appeared involving the analysis of correlated
subjects or of clusters. Due to the difficulty in appropriate modelling of complex de-
pendence structures, existing work on feasible models for correlated functional data has
usually been motivated in the context of specific applications. For instance, Peng and
Paul (2011) adopted a separable covariance structure for weakly correlated functional
data, e.g., for growth profiles from different locations in agricultural land, while Zhou
et al. (2010) considered spatially correlated FPC analysis by coupling linear mixed ef-
fects (LME) models with penalized splines. In this chapter, we propose a functional data
model for family-wise related individuals. Our proposal models the genetic and environ-
mental processes both at subject level, and allows for genetic dependencies introduced by
varied familial associations. This is distinct from hierarchical or multilevel FPCA (Morris
et al., 2003, Di et al., 2011), where the assumptions on the within-family covariance do
not allow for a variety of familial relationships.
2.1.1 Motivating Application
Our motivating example concerns the growth (in kilograms) as a function of age (in
days) of half-sibling cows in fifteen independent families. A key issue in the analysis is
the incorporation of genetic information that helps researchers understand how selective
breeding can change the physical traits passed down to future generations. This under-
standing has economic consequences, as accurate estimation of the genetic component
of an individual’s trait can lead to better breeding decisions. Even small improvements
in breeding practices can greatly increase food production. However, the estimation of
the genetic component is complicated by the fact that it is unobservable and must be
inferred from the observed physical trait. The physical trait depends not only on the
genotype but also on the environmental effect, which includes factors such as habitat or
food availability. Fortunately, genetic theory makes inference possible when data include
information from related individuals.
This data set was first analyzed using a multivariate approach in Meyer (1985) and
later, with a random regression approach for individual growth in Meyer and Hill (1997).
The random regression approach uses a basis expansion with an individual’s coefficients
modeled as random effects. Statistical analysis is implemented with an LME model,
see Demidenko (2004) and references therein for a general treatment of the random
regression model using LME. However, in random regression, the choice of pre-specified
basis functions is not straightforward. Although splines (in particular B-splines) have
been a popular option, simulation studies in Griswold et al. (2008) indicated that B-
splines do not necessarily perform well in many realistic settings. This might be caused
by the “one-size-fits-all” character of B-splines, which may result in needing a fairly large
number of B-spline functions. A natural approach to constructing a parsimonious model
is to exploit the FPCA technique to find a data-adaptive eigenbasis, which often requires
only a few leading eigenfunctions to adequately reconstruct trajectories.
2.1.2 Overview
The main contribution of this chapter is to develop a new FPCA framework that effec-
tively takes into account genetic information and can be used in a variety of biological
applications. The key is to generalize the canonical eigenbasis model to genetically re-
lated subjects within independently sampled families. As the individual phenotype is
irregularly and sparsely observed with noise, a common occurrence in many settings, it
is desirable to borrow strength from the whole sample. Yao et al. (2005a) proposed a
version of FPC analysis, called Principal components Analysis through Conditional Ex-
pectation (PACE), that is particularly useful for such sparse functional data. Compared
to spline-based FPC methods that implicitly treat truncated models as the target (James
et al., 2000), PACE emphasizes genuine nonparametric modeling of the covariance and
finds data-driven eigenfunctions to be used as basis functions. Thus PACE allows for
theoretical investigation of the underlying process itself. Given these advantages of the
PACE approach, we couple the PACE principle with the genetic information to develop
a novel FPCA framework, called Familial principal components Analysis through Con-
ditional Expectation (FACE). Our approach naturally decomposes the total covariance
into genetic and environmental components, both of which are estimated by smoothing
techniques. Data-adaptive eigen-components associated with both covariance structures
are obtained and used in the proposed FACE estimation of the genetically related indi-
viduals.
The remainder of this chapter is organized as follows. In Section 2.2, we introduce bi-
ological modeling of the genetic component of a physical trait, and motivate the proposed
FPC model for related individuals. Section 2.3 describes the methodology for estimation
of the model components, including the genetic and environmental covariances and their
respective eigen-components. The known familial genetic relationship is utilized and
leads to the proposed FACE estimation for subject-level signal extraction. We analyze
the growth of beef cattle in Section 2.4, while Section 2.5 contains simulation examples.
Concluding remarks are offered in Section 2.6.
2.2 Genetic Relationship and Proposed Functional
Model
2.2.1 Background on the Quantitative Genetic Model
To describe the standard quantitative genetic model for physical traits, let Xj denote
the phenotype of individual j, Uj the phenotype observed with error εj, gj the genetic
component, and ej the environmental factor. Suppose for now that these quantities
are either all scalar, p-vectors, or functions. The simplest genetic model is an additive
structure with gj, ej, and εj uncorrelated with expected values equal to 0,
\[ U_j = X_j + \varepsilon_j = \mu + g_j + e_j + \varepsilon_j. \tag{2.1} \]
Individuals raised in different environments have uncorrelated ej’s, while related indi-
viduals from the same family have correlated underlying genotypes, the gj’s, with the
amount of correlation depending on the individuals’ relationship. For instance, suppose
that gj is a p-vector with p× p covariance matrix G. The p× p cross-covariance matrix
defined as $E[g_j g_{j'}^\top]$, $j \ne j'$, is equal to $\alpha_{jj'}G$, where $\alpha_{jj'} \in [0,1]$ is called a relationship
coefficient and depends on the relationship between individuals j and j′. If the individ-
uals are full siblings, i.e., they have the same mother and father, then αjj′ = 1/2. If
the individuals are half-siblings, that is, if they have only one parent in common, then
αjj′ = 1/4. If the individuals are unrelated then αjj′ = 0, and if they are clones or the
same individual then αjj′ = 1. The intuition behind the value of αjj′ is that αjj′ equals
the expected proportion of genes that individuals j and j′ share via inheritance.
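For illustration, the phenotypic covariance of one family implied by (2.1) can be assembled directly from these coefficients; below is a minimal sketch for scalar traits with illustrative variance values (the functional model in Section 2.2.2 replaces the scalars G and E by covariance surfaces).

```python
import numpy as np

def family_covariance(alpha, G, E):
    """Covariance of the phenotype vector (X_1, ..., X_m) of one family of
    scalar traits under model (2.1): cov(X_j, X_j') = alpha_jj' G + delta_jj' E."""
    m = alpha.shape[0]
    return alpha * G + np.eye(m) * E

# Half-sibling family of size 3: alpha_jj' = 1/4 off-diagonal, 1 on diagonal
alpha = np.full((3, 3), 0.25) + 0.75 * np.eye(3)
cov_X = family_covariance(alpha, G=2.0, E=5.0)
```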
This model for genetic correlation and the use of these values of αjj′ are well-supported
by both theoretical calculations and empirical studies. Their use is standard in animal
breeding and in laboratory experiments in evolutionary biology. The model was first
introduced, with values of αjj′ calculated, in Fisher (1918). Also see Lynch and Walsh
(1998, Chapter 7) for a modern treatment and Heckman (2003) for a statistician-friendly
derivation of $E[g_j g_{j'}^\top] = G/2$ for a mother-child relationship. Analysis of (2.1) is straight-
forward when the traits are scalar or vector-valued, the relationships are all the same and
the design is balanced – for instance, for data from N independent families, with k full
siblings in each family. In this case, variance/covariance parameters are easily estimated
in closed form by analysis of variance and method of moments. For more general de-
signs and combinations of relationships, numerical estimation is possible via (restricted)
maximum likelihood (Lynch and Walsh, 1998, Chapter 27), and is implemented in soft-
ware such as ASReml (http://www.vsni.co.uk/software/asreml) and WOMBAT (Meyer,
2007).
2.2.2 Functional Data Model for Genetically Related Individuals
Data such as weights of cows can be viewed as arising from smooth functions, even if the
weights are sampled at irregular and, possibly, sparse discrete times across subjects. We
consider the situation where there are n independent families with Ni members in family
i. Let αi,jj′ denote the known relationship coefficient for individuals j and j′ of family
i and assume that the within-family relationship coefficients are non-zero. While our
methodology holds for general $\alpha_{i,jj'}$'s, in the data we analyze in Section 2.4, all family members are half-siblings, i.e., $\alpha_{i,jj'} = 1/4$ for $j \ne j'$ and $\alpha_{i,jj} = 1$ otherwise.
The functional version of (2.1) for the phenotype of the jth individual in the ith
family is
\[ X_{ij}(t) = \mu(t) + g_{ij}(t) + e_{ij}(t), \tag{2.2} \]
where µ is the population mean curve, gij is what is called the random genetic effect,
and eij models any other random effects (mainly environmental) giving rise to within
individual covariances that are not due to gij. As is common (see, e.g., Lynch and
Walsh, 1998), we will refer to eij as the environmental effect and gij simply as the
genetic effect. In this model, $g_{ij}$ and $e_{ij}$ are (i) mean zero with the variance of $g_{ij}(t)$ and $e_{ij}(t)$ finite for all $t$; (ii) uncorrelated; (iii) $\mathrm{cov}(g_{ij}(s), g_{ij}(t)) = G(s,t)$; and (iv) $\mathrm{cov}(e_{ij}(s), e_{ij}(t)) = E(s,t)$. These four properties imply that the total covariance is $\mathrm{cov}(X_{ij}(s), X_{ij}(t)) = \Sigma(s,t) = G(s,t) + E(s,t)$. The within-family genetic correlation between two individuals depends on $G$ and the individuals' relationship coefficient:
\[ \mathrm{cov}(g_{ij}(s), g_{ij'}(t)) = \alpha_{i,jj'} G(s,t). \tag{2.3} \]
The processes $e_{ij}(\cdot)$ and $e_{i'j'}(\cdot)$ are independent when $(i,j) \ne (i',j')$. Assume that the
measurements are taken on a closed and bounded interval T , i.e., t ∈ T . Note that model
(2.2) is not the classical functional model that assumes that data come from independent
realizations of Xij(t) = µ(t) +vij(t). In (2.2), we have decomposed the random deviation
vij(t) as gij(t) + eij(t), where the genetic effect gij(t) induces a within-family correlation.
A stochastic process with finite covariance admits a Karhunen-Loeve expansion and
its covariance function admits a spectral basis expansion (Loeve, 1978, Adler and Taylor,
2007). The key proposal is to exploit such expansions for both genetic and environmental
processes, whilst maintaining the dependence structure of related individuals. For the
genetic process gij, we have for s, t ∈ T ,
gij(t) =∞∑l=1
ξijlφl(t), G(s, t) =∞∑l=1
λlφl(s)φl(t), (2.4)
where the φl’s are orthonormal eigenfunctions, ξij1, ξij2, . . . are the FPC scores, which are
uncorrelated random variables with zero mean and variances λ1 > λ2 > . . ., satisfying∑∞l=1 λl < ∞. Based on the underlying genetic model in equation (2.3), we can deduce
that the correlation between ξijl and ξi′j′l′ is λl αi,jj′ for i = i′ and l = l′, and zero
otherwise. This genetic association is the key to consistent parameter estimation, as
it enables us to borrow information across related individuals. This model and basis
expansion in the context of selection and genetics was first described in Kirkpatrick
and Heckman (1989). Similar expansions hold for the environmental process eij with
orthonormal eigenfunctions {ψm}m≥1 and nonincreasing eigenvalues {ρm}m≥1, i.e., for
Chapter 2. Data Model for Genetically Correlated Subjects 26
s, t ∈ T
eij(t) =∞∑m=1
ζijmψm(t), E(s, t) =∞∑m=1
ρmψm(s)ψm(t), (2.5)
where ζijm are uncorrelated FPC scores of eij with zero mean and finite variance ρm. It
is obvious that the correlation between ζijm and ζi′j′m′ is always zero given independent
environmental processes, unless (i, j,m) = (i′, j′,m′).
Therefore the proposed FPC model for Xij(t) based on these Karhunen-Loeve expan-
sions is given by
Xij(t) = µ(t) +∞∑l=1
ξijlφl(t) +∞∑m=1
ζijmψm(t), t ∈ T . (2.6)
The deviation of each curve Xij from the overall trend µ is a sum of curves φl and ψm with
random amplitudes ξijl and ζijm, respectively. Although the underlying model (2.6) is
infinite-dimensional, the typically rapid decay of eigenvalues often allows us to use a small
number of leading eigenfunctions to recover Xij. In practice, the infinite sums in (2.6) can
be truncated and the φl’s and ψm’s estimated, yielding a data-adaptive low-dimensional
model for Xij. The practical choice of the level of truncations is discussed in Section 2.3.
This eigenfunction approach differs from a random regression model with spline basis
functions, as the eigenfunction basis is completely data-driven, while the spline function
basis is pre-specified without knowledge of the data. A principal components approach to
model (2.2) appears in Di et al. (2011), but with a more restricted covariance structure,
which in our context would require that αi,jj′ ≡ α for all i and for all j 6= j′.
We let the data observed for individual j from family i consist of Nij repeated mea-
surements of $X_{ij}$ taken at discrete time points $\{T_{ijk} \in \mathcal{T} : k = 1, \ldots, N_{ij}\}$. Denoting the $k$th noisy observation of $X_{ij}$ at $T_{ijk}$ by $U_{ijk}$, the data model is
\[ U_{ijk} = X_{ij}(T_{ijk}) + \varepsilon_{ijk} = \mu(T_{ijk}) + \sum_{l=1}^{\infty} \xi_{ijl}\phi_l(T_{ijk}) + \sum_{m=1}^{\infty} \zeta_{ijm}\psi_m(T_{ijk}) + \varepsilon_{ijk}, \tag{2.7} \]
where the εijk’s are independent and identically distributed errors with zero mean, finite
variance σ2, and are independent of both the ξijl and the ζijm.
2.3 Model Estimation and FPC Representation
The quantities in model (2.7) are composed of two types: the population components,
such as the mean, covariances and eigenvalues/functions; and the subject-level signals,
i.e., the random amplitudes or FPC scores for the underlying genetic and environmental
processes. The main challenge in estimating these quantities is due to the irregularly and
sparsely observed functional data. More specifically, there may be only a few observations
available for some or even all of the individuals. In this case, borrowing strength across the
entire collection of data is important for obtaining consistent estimation of the population
quantities. As mentioned in the introduction, Yao et al. (2005a) provided a thorough
treatment for such sparse functional data in the case of the classical functional model with
independent realizations and proposed the PACE method. We shall generalize
the key idea of PACE and take advantage of the genetic relationship (2.3) in model (2.7).
2.3.1 Estimation of Model Components
The mean and covariance functions are assumed to be smooth, so we can estimate them
by nonparametric regression methods, which borrow information from neighboring data
values. We use local linear smoothers (Fan and Gijbels, 1996) for function and surface
estimation. The key to estimating parameters from sparse functional data is to pool
together information from all individuals, requiring the “pooled” data to be sufficiently
dense. For these local smoothing steps, for a given level of smoothing we adopt the
strategy of ignoring the dependency among the data from the same individual/family.
However we do not ignore correlation when choosing the amount of smoothing. See Lin
and Carroll (2000) for a discussion of smoothing correlated data. Automatic bandwidth
choices for the amount of smoothing of functional data are available [see Rice and Sil-
verman (1991) for leave-one-curve-out cross-validation and Müller and Prewitt (1993)
for surface smoothing], even though subjective choices are often adequate in practice.
Following Chapter 1.1.4, the mean function $\mu$ evaluated at $t$ is estimated by $\hat\mu(t) = \hat a_0$, where
\[ (\hat a_0, \hat a_1)^\top = \operatorname*{argmin}_{a_0, a_1} \sum_{i=1}^{n} \sum_{j=1}^{N_i} \sum_{k=1}^{N_{ij}} K_1\Big(\frac{T_{ijk} - t}{h_1}\Big)\{U_{ijk} - a_0 - a_1(T_{ijk} - t)\}^2. \tag{2.8} \]
The kernel function K1 is a positive density symmetric about 0, and h1 is the bandwidth.
Due to the genetic correlation within family, we choose h1 by minimizing the “leave-one-
family-out” cross-validation (CV) criterion,
\[ \mathrm{CV}(h_1) = \sum_{i=1}^{n} \sum_{j=1}^{N_i} \sum_{k=1}^{N_{ij}} \big\{U_{ijk} - \hat\mu^{-(i)}(T_{ijk}; h_1)\big\}^2, \tag{2.9} \]
where $\hat\mu^{-(i)}(\cdot\,; h_1)$ is the estimate of $\mu$ obtained by removing all of the $i$th family's data.
The estimation of the covariance functions combines smoothing and the method of
moments and relies upon the following key facts. Recalling that the total covariance
$\Sigma(s,t) = G(s,t) + E(s,t)$, we have
\[ \mathrm{cov}[U_{ijk}, U_{ijk'} \mid T_{ijk}, T_{ijk'}] = \Sigma(T_{ijk}, T_{ijk'}) + \delta_{kk'}\sigma^2, \qquad \alpha_{i,jj'}^{-1}\,\mathrm{cov}[U_{ijk}, U_{ij'k'} \mid T_{ijk}, T_{ij'k'}] = G(T_{ijk}, T_{ij'k'}),\quad j \ne j', \tag{2.10} \]
where $\delta_{kk'} = 1$ for $k = k'$ and $0$ otherwise. We define the centered observations $U_{ijk}^c = U_{ijk} - \hat\mu(T_{ijk})$ and the raw covariance observations $C_{ijkk'} = U_{ijk}^c U_{ijk'}^c$. Then we use a two-dimensional local linear smoother as in (1.9) to estimate the overall covariance function
$\Sigma$, with $\hat\Sigma(s,t) = \hat b_0$, where
\[ (\hat b_0, \hat b_1, \hat b_2)^\top = \operatorname*{argmin}_{b_0, b_1, b_2} \sum_{i=1}^{n} \sum_{j=1}^{N_i} \sum_{1 \le k \ne k' \le N_{ij}} K_2\Big(\frac{T_{ijk} - s}{h_2}, \frac{T_{ijk'} - t}{h_2}\Big)\{C_{ijkk'} - b_0 - b_1(T_{ijk} - s) - b_2(T_{ijk'} - t)\}^2. \tag{2.11} \]
K2 is a positive bivariate density symmetric about 0, and h2 is the bandwidth. As in
equation (1.10) we can estimate the noise variance σ2 by
\[ \hat\sigma^2 = |\mathcal{T}_1|^{-1} \int_{\mathcal{T}_1} \{\tilde\Sigma(t) - \hat\Sigma(t,t)\}\,dt, \]
where the diagonal estimate $\tilde\Sigma$ is obtained by smoothing $(T_{ijk}, C_{ijkk})$ over all individuals. The bandwidths that control the smoothness of $\tilde\Sigma$ and $\hat\Sigma$, respectively, are also chosen by the leave-one-family-out CV in the spirit of (2.9).
To estimate the genetic covariance function G, the key relationship in (2.10) sug-
gests borrowing data across the entire family by constructing raw cross-covariances ob-
tained from individuals of the same family. Define such raw cross-covariance obser-
vations adjusted for relationship coefficients αi,jj′ by Gijj′kk′ = α−1i,jj′U
cijkY
cij′k′ . There-
fore we estimate G using a two-dimensional local linear smoother of the pooled input
{(Tijk, Tij′k′ , Gijj′kk′) : k, k′ = 1, . . . , Nij, 1 ≤ j 6= j′ ≤ Ni, i = 1, . . . , n}, yielding the
estimate G. As a consequence, the environmental covariance E is easily obtained by
E = Σ− G.
We suggest an optional step for updating the estimates of G and E. Note that the
genetic covariance G appears in the within-individual covariance and also appears in the
covariance between related individuals, coupled with the relationship coefficient, as given
in (2.3). In our initial estimate of $\hat G$, we have only used the latter type of information, the information among related individuals; that is, we have only smoothed the adjusted cross-covariances $G_{ijj'kk'} = \alpha_{i,jj'}^{-1} U_{ijk}^c U_{ij'k'}^c$, $j \ne j'$. In our update, we add the information on $G$ contained within an individual. Specifically, we use our initial estimate $\hat E$ and note that for $k \ne k'$, $E[C_{ijkk'} - E(T_{ijk}, T_{ijk'})] \approx G(T_{ijk}, T_{ijk'})$. Thus we can construct $\hat G^*$, a new estimate of $G$, by smoothing the combined “data” $\{C_{ijkk'} - \hat E(T_{ijk}, T_{ijk'}),\ k \ne k'\}$ and $\{G_{ijj'kk'},\ j \ne j'\}$. The estimate of the environmental covariance is also updated by $\hat E^* = \hat\Sigma - \hat G^*$ accordingly. In practice, when the number of observations per individual is
small and/or when we have a large number of individuals per family, this updating step
can often be omitted as the changes in estimates are negligible.
Estimates of the eigenfunctions and eigenvalues of G and E are obtained as solutions
to the eigen-equations
\[ \int_{\mathcal{T}} \hat G^*(s,t)\hat\phi_l(s)\,ds = \hat\lambda_l\hat\phi_l(t), \qquad \int_{\mathcal{T}} \hat E^*(s,t)\hat\psi_m(s)\,ds = \hat\rho_m\hat\psi_m(t), \tag{2.12} \]
subject to the orthonormal constraints 〈φl, φl′〉 = δll′ and 〈ψm, ψm′〉 = δmm′ . This can
be implemented by discretizing the smooth covariances G∗ and E∗ and carrying out
matrix eigen-decomposition, as described in Rice and Silverman (1991). However, the
smoothed covariance functions G∗ and E∗ are not necessarily non-negative definite. A
simple modification is to set negative estimated eigenvalues to zero, and reconstruct G
and E based on (2.4) and (2.5), i.e.,
\[ \tilde G(s,t) = \sum_{l:\hat\lambda_l > 0} \hat\lambda_l\hat\phi_l(s)\hat\phi_l(t), \qquad \tilde E(s,t) = \sum_{m:\hat\rho_m > 0} \hat\rho_m\hat\psi_m(s)\hat\psi_m(t), \tag{2.13} \]
which has been shown to improve the covariance estimation in terms of mean squared
error (Hall et al., 2008, Theorem 1).
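On a discretization grid this truncation takes only a few lines; below is a minimal sketch, assuming the smoothed covariance has been evaluated on an equally spaced grid with spacing delta, as in Chapter 1.1.4.

```python
import numpy as np

def nonneg_covariance(C_grid, delta):
    """Project a smoothed covariance estimate onto the non-negative
    definite cone as in (2.13): zero out negative eigenvalues."""
    evals, evecs = np.linalg.eigh(C_grid)
    lam = np.maximum(evals * delta, 0.0)       # keep only positive eigenvalues
    phi = evecs / np.sqrt(delta)               # unit L2-norm eigenfunctions
    # Reconstruct C(s, t) = sum_l lam_l phi_l(s) phi_l(t) on the grid
    return (phi * lam) @ phi.T
```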
2.3.2 FPC Representation for Genetically Related Individuals
We proceed to reconstruct the individual trajectory Xij in (2.6), which requires the es-
timation of the genetic and environmental FPC scores given by ξijl = 〈Xij − µ, φl〉 and
ζijm = 〈Xij−µ, ψm〉, respectively. It is well-known that the classical integral approxima-
tion fails for sparsely observed functional data. The PACE method by Yao et al. (2005a)
overcomes this problem by employing the idea of the best linear unbiased prediction
(BLUP) in the context of FPCA. Here we generalize the PACE method for estimat-
ing the FPC scores ξijl and ζijm to the case where individuals are genetically related
within family. We call this generalization Familial principal components Analysis through
Conditional Expectation (FACE).
In the sequel, all expectations are understood to be taken conditional on the times $T_{ijk}$. To calculate $\tilde\xi_{ijl}$, the BLUP of $\xi_{ijl}$, let $\boldsymbol{U}_{ij} = (U_{ij1}, \ldots, U_{ijN_{ij}})^\top$, $\boldsymbol{U}_i = (\boldsymbol{U}_{i1}^\top, \ldots, \boldsymbol{U}_{iN_i}^\top)^\top$, and $M_i = \sum_{j=1}^{N_i} N_{ij}$. Recall the covariance structures in (2.10). Due to the genetic correlation among all individuals in family $i$, we infer the $l$th FPC score $\xi_{ijl}$ of the genetic process $g_{ij}$ from the observed data for all subjects in the $i$th family. Write the $N_{ij} \times N_{ij}$ auto-covariance matrix of $\boldsymbol{U}_{ij}$ as $\Sigma_{i,jj} = \mathrm{cov}(\boldsymbol{U}_{ij}, \boldsymbol{U}_{ij}) = [\Sigma(T_{ijk}, T_{ijk'}) + \delta_{kk'}\sigma^2]_{1 \le k, k' \le N_{ij}}$, and the $N_{ij} \times N_{ij'}$ cross-covariance matrix between $\boldsymbol{U}_{ij}$ and $\boldsymbol{U}_{ij'}$ as $\Sigma_{i,jj'} = \mathrm{cov}(\boldsymbol{U}_{ij}, \boldsymbol{U}_{ij'}) = [\alpha_{i,jj'} G(T_{ijk}, T_{ij'k'})]_{1 \le k \le N_{ij},\ 1 \le k' \le N_{ij'}}$, where $1 \le j \ne j' \le N_i$. Then we have the $M_i \times M_i$ covariance matrix of $\boldsymbol{U}_i$, $\Sigma_{\boldsymbol{U}_i} = \mathrm{cov}(\boldsymbol{U}_i, \boldsymbol{U}_i) = (\Sigma_{i,jj'})_{1 \le j, j' \le N_i}$. Let $\boldsymbol{\phi}_{ijl} = (\phi_l(T_{ij1}), \ldots, \phi_l(T_{ijN_{ij}}))^\top$; noting that $\alpha_{i,jj} = 1$, one has $\mathrm{cov}(\xi_{ijl}, \boldsymbol{U}_i) = \lambda_l(\alpha_{i,j1}\boldsymbol{\phi}_{i1l}^\top, \ldots, \alpha_{i,jN_i}\boldsymbol{\phi}_{iN_il}^\top)$. Finally, denote $\boldsymbol{\mu}_{ij} = (\mu(T_{ij1}), \ldots, \mu(T_{ijN_{ij}}))^\top$ and $\boldsymbol{\mu}_i = (\boldsymbol{\mu}_{i1}^\top, \ldots, \boldsymbol{\mu}_{iN_i}^\top)^\top$. By the BLUP principle, we obtain the FACE formula for $\tilde\xi_{ijl}$,
\[ \tilde\xi_{ijl} = \mathrm{cov}(\xi_{ijl}, \boldsymbol{U}_i)\,\mathrm{cov}(\boldsymbol{U}_i, \boldsymbol{U}_i)^{-1}(\boldsymbol{U}_i - \boldsymbol{\mu}_i) = \lambda_l(\alpha_{i,j1}\boldsymbol{\phi}_{i1l}^\top, \ldots, \alpha_{i,jN_i}\boldsymbol{\phi}_{iN_il}^\top)\{(\Sigma_{i,jj'})_{1 \le j, j' \le N_i}\}^{-1}(\boldsymbol{U}_i - \boldsymbol{\mu}_i), \tag{2.14} \]
which is equal to $E[\xi_{ijl} \mid \boldsymbol{U}_i]$ when all quantities are Gaussian. Substituting the estimates of the model components, using the generic notation “$\hat{\ }$”, the FACE estimates are
\[ \hat\xi_{ijl} = \hat\lambda_l(\alpha_{i,j1}\hat{\boldsymbol{\phi}}_{i1l}^\top, \ldots, \alpha_{i,jN_i}\hat{\boldsymbol{\phi}}_{iN_il}^\top)\{(\hat\Sigma_{i,jj'})_{1 \le j, j' \le N_i}\}^{-1}(\boldsymbol{U}_i - \hat{\boldsymbol{\mu}}_i). \tag{2.15} \]
Chapter 2. Data Model for Genetically Correlated Subjects 32
Since the environmental processes $e_{ij}$ are independent across individuals, the estimation of the FPC scores $\zeta_{ijm}$ is as in PACE, i.e., it uses only the observed data for that subject. Denoting $\psi_{ijm} = (\psi_m(T_{ij1}),\ldots,\psi_m(T_{ijN_{ij}}))^\top$, a simple calculation by the BLUP principle yields the FACE formula $\tilde\zeta_{ijm}$ and its plug-in estimate $\hat\zeta_{ijm}$,
$$\tilde\zeta_{ijm} = \rho_m\psi_{ijm}^\top\Sigma_{i,jj}^{-1}(U_{ij}-\mu_{ij}), \qquad \hat\zeta_{ijm} = \hat\rho_m\hat\psi_{ijm}^\top\hat\Sigma_{i,jj}^{-1}(U_{ij}-\hat\mu_{ij}). \qquad (2.16)$$
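To make the plug-in formula concrete, a minimal sketch of (2.15) follows, assuming the estimated model components have already been evaluated at each subject's observation times; all input names are illustrative rather than part of the method's specification.

```python
import numpy as np

def face_genetic_scores(U_i, mu_i, phi_i, lam, alpha_i, Sigma_Ui, j):
    """Plug-in FACE estimates (2.15) of the genetic FPC scores for
    subject j in family i. Assumed inputs:
      U_i      : list over siblings of observation vectors U_ij
      mu_i     : list of mean vectors mu(T_ijk), matching U_i
      phi_i    : list of (N_ij x L) matrices with entries phi_l(T_ijk)
      lam      : length-L array of genetic eigenvalue estimates
      alpha_i  : (N_i x N_i) matrix of genetic correlations alpha_{i,jj'}
      Sigma_Ui : (M_i x M_i) covariance matrix of the stacked U_i
    """
    resid = np.concatenate([u - m for u, m in zip(U_i, mu_i)])
    r = np.linalg.solve(Sigma_Ui, resid)       # Sigma_{U_i}^{-1} (U_i - mu_i)
    scores = np.empty(len(lam))
    for l in range(len(lam)):
        # cov(xi_ijl, U_i) = lam_l (alpha_{i,j1} phi_{i1l}', ..., alpha_{i,jN_i} phi_{iN_il}')
        c = np.concatenate([alpha_i[j, jp] * phi_i[jp][:, l]
                            for jp in range(len(U_i))])
        scores[l] = lam[l] * (c @ r)
    return scores
```

The environmental scores in (2.16) follow the same pattern, but use only subject $j$'s own block $\hat\Sigma_{i,jj}$ and data $U_{ij}$.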
The reconstruction of the individual trajectories is straightforward once we obtain the estimates of the FPC scores. It is customary to assume that the $X_{ij}$'s are well approximated by a low-dimensional expansion. Suppose we include the $K_g$ and $K_e$ leading eigenfunctions of $g_{ij}$ and $e_{ij}$ in (2.6), respectively, so that
$$\hat{X}_{ij}(t) = \hat\mu(t) + \sum_{l=1}^{K_g}\hat\xi_{ijl}\hat\phi_l(t) + \sum_{m=1}^{K_e}\hat\zeta_{ijm}\hat\psi_m(t). \qquad (2.17)$$
The values of $K_g$ and $K_e$ can be chosen by objective criteria, such as leave-one-family-out cross-validation, or an AIC based on a pseudo-likelihood under Gaussian assumptions, in a spirit similar to that of Yao et al. (2005a). In practice, using the proportion of functional variation explained (FVE) with a suitable threshold is often satisfactory.
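A sketch of the FVE rule, assuming the estimated eigenvalues are supplied in decreasing order:

```python
import numpy as np

def choose_fve(eigenvalues, threshold=0.98):
    """Smallest number of leading components whose cumulative fraction of
    variance explained (FVE) reaches the threshold."""
    lam = np.asarray(eigenvalues, dtype=float)
    fve = np.cumsum(lam) / lam.sum()
    return int(np.searchsorted(fve, threshold) + 1)
```

Applied to the full eigenvalue sequences of each process, this reproduces the choices $K_g = 3$ and $K_e = 4$ used in the application of Section 2.4 at the 98% threshold.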
2.4 Application to Weights of Beef Cattle
The dataset we analyze here is a subset of a larger dataset used in Meyer et al. (1993) and Meyer (1999). Our data set contains weights in kilograms of 55 beef cattle from a total of 15 independent families. The cows within a family were half-siblings, having the same sire but different mothers. Thus the genetic correlation parameter $\alpha_{i,jj'} \equiv 1/4$ is known a priori, based on the half-sibling relationships. The phenotypic trajectories are notably irregularly and sparsely observed. The number $N_i$ of half-siblings per family ranges from one to eight; see Figure 2.1a for the distribution of the $N_i$'s. Weighings occurred at ages ranging from 548 to 2553 days, i.e., $\mathcal{T} = [548, 2553]$. The number $N_{ij}$ of weighings per individual varied from 1 to 62, and a histogram of the $N_{ij}$'s is shown in Figure 2.1b. The data were affected by some additional environmental factors, but for simplicity we have not included them in our model. Including such fixed effects is, in general, straightforward, and would allow the user to model variability that is not entirely due to individual effects.
Figure 2.1: Beef cattle data: frequency distributions. (a) Siblings per sire; (b) Observations per cow.
The estimated mean function is shown in Figure 2.2 and exhibits an approximately yearly cyclical pattern that reflects the seasonal weight changes of beef cattle. The non-negative definite covariance estimates (2.13) for the genetic and environmental processes are shown in Figures 2.3a and 2.3b. We see that the genetic covariance is not as strong as the environmental covariance; indeed, the environmental process explains about five and a half times as much variability as the genetic process. However, the two covariances do exhibit similar patterns, with relatively high variation at late times. Another observation is that the environmental covariance seems to increase over time, which is not surprising, as environmental influences may accumulate as the cows age. We used a threshold of 98% to select the numbers of principal components for the genetic and environmental processes.
Thus, the $K_g = 3$ genetic principal components, with eigenvalues $\lambda_1 = 4.4\times10^5$, $\lambda_2 = 2.1\times10^5$, and $\lambda_3 = 3.9\times10^4$, explained 62.5%, 29.9%, and 5.6% of the genetic variation, respectively. The $K_e = 4$ environmental principal components, with eigenvalues $\rho_1 = 3.1\times10^6$, $\rho_2 = 3.1\times10^5$, $\rho_3 = 2.0\times10^5$, and $\rho_4 = 1.3\times10^5$, explained 81.6%, 8.1%, 5.2%, and 3.4% of the environmental variation, respectively. The estimated genetic and environmental eigenfunctions are shown in Figures 2.4a and 2.4b, respectively. From the first two eigenfunctions in each panel, one can see that the dominant variation in the genetic process concentrates around 2000 days and includes a contrast between weights at 1200 days and at 2300 days. The environmental effect shows a more constant influence over time, with an early slow increase followed by a sharp drop after 2000 days (or vice versa). The updating step of the genetic and environmental covariances did not noticeably alter the estimates and was not needed for this analysis.
Figure 2.2: Estimated mean function (dark) with observed trajectories (light) for the beef cattle data; weight (kg) versus age (days).
We are primarily interested in predicting the growth of beef cattle from sparsely observed measurements. It is thus informative to assess the proposed method by comparing it with the PACE method that treats all individuals independently, i.e., that does not take the familial genetic correlation into account. We calculate the leave-one-family-out cross-validation error $\sum_i\sum_j\sum_k\{U_{ijk} - \hat{X}^{-i}_{ij}(T_{ijk})\}^2$, where $\hat{X}^{-i}_{ij}$ is the predicted phenotype of the $j$th cow in the $i$th family.
Figure 2.3: Non-negative definite estimates of the genetic and environmental covariance functions for the beef cattle data. (a) Genetic; (b) Environmental.
Figure 2.4: Shown are the first (solid), second (dashed), third (dash-dot), and fourth (dotted) eigenfunctions. Left: first three eigenfunctions of the genetic process, accounting for 98% of the genetic variance. Right: first four eigenfunctions of the environmental process, explaining 98.3% of the environmental variance.
Specifically, the model components are estimated based on data excluding family $i$, using the method described in Section 2.3.1. The FPC scores $\hat\xi^{-i}_{ijl}$ and $\hat\zeta^{-i}_{ijm}$ are then obtained by substituting these leave-one-family-out estimates, $\hat\mu^{-i}, \hat\lambda^{-i}_l, \hat\rho^{-i}_m, \hat\phi^{-i}_l, \hat\psi^{-i}_m, \hat\Sigma^{-i}_{i,jj'}$, into (2.15) and (2.16), leading to $\hat{X}^{-i}_{ij}$. We use $K^{-i}_g$ and $K^{-i}_e$ leading eigenfunctions, chosen to explain 98% of the genetic and environmental functional variation in the data, respectively. The reconstruction using the PACE method is obtained in a similar manner; see Yao et al. (2005a) for details. Not surprisingly, the proposed FACE method improves considerably upon the PACE method, by around 18%. Shown in Figure 2.5 are the cross-validated trajectory estimates for the offspring of two of the fifteen families using the FACE and PACE methods. We observe that FACE offers improved predictions for these eight cows.
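A compact sketch of the leave-one-family-out criterion follows; here `predict_without_family` is a hypothetical refitting routine, assumed to return predicted trajectories fitted on all data excluding the held-out family.

```python
import numpy as np

def loo_family_cv_error(families, predict_without_family):
    """Leave-one-family-out CV error of Section 2.4. `families` maps a
    family index i to its observations [(j, T_ij, U_ij), ...];
    `predict_without_family(i)` returns a callable X_hat(j, t) fitted
    on all data excluding family i (an assumed interface)."""
    err = 0.0
    for i, obs in families.items():
        X_hat = predict_without_family(i)    # refit excluding family i
        for j, T_ij, U_ij in obs:
            err += np.sum((U_ij - X_hat(j, T_ij)) ** 2)
    return err
```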
Figure 2.5: Estimated trajectories by leave-one-family-out cross-validation (CV) for two families of cows, obtained using the FACE method (solid) and the PACE method (dashed), where the first row presents two half-siblings from one family and the bottom three rows present six half-siblings from another family. The legend shows the relative CV error of each cow, $\sum_{k=1}^{N_{ij}}\{U_{ijk} - \hat{X}^{-i}_{ij}(T_{ijk})\}^2/U_{ijk}^2$, obtained from the two methods, where $\hat{X}^{-i}_{ij}$ is as described in Section 2.4.
2.5 Simulated Examples
To further illustrate the performance of the proposed method, we carry out two simulation studies. In Simulation I, we closely mimic the cow data, using the same design, e.g., the same family sizes and times of weighings. The underlying model is (2.7) with $K_g$ terms for the genetic component and $K_e$ terms for the environmental component. The environmental covariance is derived from the first four estimated eigenfunctions, i.e., $K_e = 4$. In view of the importance of the genetic component, we examine three values of $K_g$, namely $K_g = 1, 2, 3$, and we use the corresponding genetic eigenfunctions estimated from the data. We use the half-sibling relationship coefficient $\alpha_{i,jj'} = 1/4$ for all $i$, $j$ and $j' \neq j$. The genetic and environmental FPC scores $\xi_{ijl}$ and $\zeta_{ijm}$ and the measurement errors $\varepsilon_{ijk}$ are independently generated from normal distributions, using the estimated eigenvalues and error variance from the data. To focus attention on the covariances and FPCs, we set the mean function $\mu$ to 0 in the data generation but still treat it as unknown in the analysis. For each underlying model, we generate 100 Monte Carlo samples and produce two versions of $\hat{X}_{ij}$: the FACE estimate that respects the familial genetic relationship, and the PACE estimate that ignores familial dependence. To select $K_g$ and $K_e$, we again use a 98% threshold for the fraction of variance explained. Within each sample and for each estimation method, we calculate the integrated squared error (ISE) for the $j$th individual in the $i$th family, $\mathrm{ISE}_{ij} = \int_{\mathcal{T}}\{\hat{X}_{ij}(t) - X_{ij}(t)\}^2\,dt$, and the overall ISE is defined as $\mathrm{ISE} = \sum_{i,j}\mathrm{ISE}_{ij}$. The improvements of the proposed FACE method upon the PACE method are summarized in Table 2.1, which indicates a substantial improvement of 21% to 25%.
In Simulation II, we again follow model (2.7), but with $\mu(t) = t + \sin(2\pi t)$, $\phi_1(t) = \psi_1(t) = -\cos(2\pi t/10)/\sqrt{5}$ and $\phi_2(t) = \psi_2(t) = \sin(2\pi t/10)/\sqrt{5}$, with corresponding eigenvalues $\lambda_1 = 10$, $\lambda_2 = 5$ and $\rho_1 = 100$, $\rho_2 = 10$. The genetic and environmental FPC scores are generated from normal distributions, and the measurement errors $\varepsilon_{ijk}$ are from $N(0, 0.01)$. We still generate data for 15 families, but the number of siblings within a family is chosen uniformly from $\{2,\ldots,6\}$ and the number of observations per subject is chosen uniformly from $\{5,\ldots,20\}$. The observation times are uniformly distributed on $[0, 10]$. Over 100 Monte Carlo samples, the ISE based on the FACE method incorporating genetic correlation outperformed the PACE method by 30% for the case of half-sibling families with $\alpha_{i,jj'} = 1/4$ for $j \neq j'$, and by 25% for the case of full-sibling families with $\alpha_{i,jj'} = 1/2$ for $j \neq j'$; see Table 2.1.
Table 2.1: ISE improvement (%) of the proposed FACE method upon PACE, where Simulation I uses data-based models with different values of (Kg, Ke) and Simulation II examines half-sibling (α = 0.25) and full-sibling (α = 0.5) family relationships.

Simulation I
  (Kg, Ke)   Mean (SE)    1st Quartile   Median   3rd Quartile
  (1, 4)     21.4 (1.5)   15.1           23.5     28.7
  (2, 4)     25.1 (1.6)   12.9           28.9     36.3
  (3, 4)     21.9 (1.6)   10.9           24.7     32.6

Simulation II
  α          Mean (SE)    1st Quartile   Median   3rd Quartile
  0.25       30.4 (3.1)   13.4           39.0     52.8
  0.50       25.4 (3.0)   11.7           30.4     45.4
2.6 Conclusion
In this chapter, we propose a version of functional data analysis for trajectories of genetically related individuals from independent families. We are able to estimate various levels of variation: the genetic covariance, the environmental covariance induced by external factors, and the measurement error variance. A new method, named FACE, is proposed to take the familial correlation into account when estimating the genetic random effects. By making use of the auto-covariance function of each individual, we also develop a simple step to update the estimates of the genetic and environmental covariance functions. We apply our method to study the growth over time of families of half-sibling cows; the data analysis and simulation studies show that, for predicting the underlying trajectories, our proposal improves considerably upon the existing PACE method designed for a sample of independent subjects.
While our method does well on its own, it can also be part of a hybrid approach. Our
proposed methodology can be used for dimension reduction, specifically to determine a
handful of eigenfunctions that can then be used as basis functions in further analysis.
For instance, the basis functions might be used in a parsimonious mixed effects random
regression analysis, a method that is computationally burdensome with even a moderate
number of basis functions.
Chapter 3
Cumulative Slicing Estimation for
Dimension Reduction
3.1 Introduction
In functional data analysis (FDA), one is often interested in how a scalar response $Y \in \mathbb{R}$ varies with a smooth trajectory $X(t)$, where $t$ is an index variable defined on a closed interval $\mathcal{T}$ (see Ramsay and Silverman, 2005, for a comprehensive overview). To be specific, one seeks to model the relationship $Y = M(X;\varepsilon)$, where $M$ is a smooth functional and the error process $\varepsilon$ has zero mean, finite variance $\sigma^2$, and is independent of $X$. While modeling $M$ parametrically can be restrictive in many applications, modeling $M$ nonparametrically is practically infeasible due to the slow convergence rates associated with the "curse of dimensionality". Therefore a class of semiparametric index models has been proposed to approximate $M(X;\varepsilon)$ with an unknown link function $g: \mathbb{R}^{K+1} \to \mathbb{R}$,
$$Y = g\big(\langle\beta_1, X\rangle, \ldots, \langle\beta_K, X\rangle;\, \varepsilon\big), \qquad (3.1)$$
where $K$ is the reduced dimension of the model, $\beta_1, \ldots, \beta_K$ are linearly independent index functions, and $\langle u, v\rangle = \int u(t)v(t)\,dt$ is the usual $L^2$ inner product. The functional linear model (FLM) $Y = \beta_0 + \int \beta_1(t)X(t)\,dt + \varepsilon$ is a special case and has been extensively studied (Cardot et al., 1999, Muller and Stadtmuller, 2005, Yao et al., 2005b, Cai and Hall, 2006, Hall and Horowitz, 2007, Yuan and Cai, 2010, among others).
In this chapter, we tackle the index model (3.1) from the perspective of effective dimension reduction (EDR), in the sense that the $K$ linear projections $\langle\beta_1,X\rangle,\ldots,\langle\beta_K,X\rangle$ form a sufficient statistic. This is particularly useful when the process $X$ is infinite-dimensional. Our primary goal is to offer a novel treatment of dimension reduction for functional data, especially when the trajectories are corrupted with noise and sparsely observed, with only a few observations for some, or even all, of the subjects. Pioneered by Li (1991) for multivariate data, EDR methods are typically "link-free", requiring neither specification nor estimation of the link function (Duan and Li, 1991), and the objective is to characterize the $K$-dimensional EDR space $\mathcal{S}_{Y|X} = \mathrm{span}(\beta_1,\ldots,\beta_K)$ onto which to project $X$. Such index functions $\beta_k$ are referred to as EDR directions, $K$ is called the structural dimension of the EDR space, and $\mathcal{S}_{Y|X}$ is also known as the central subspace (Cook, 1998). Li (1991) characterized $\mathcal{S}_{Y|X}$ via the inverse mean $E[X|Y]$, namely the sliced inverse regression (SIR), which has since motivated a large body of related work for multivariate data: Cook and Weisberg (1991) estimated $\mathrm{var}(X|Y)$, Li (1992) dealt with the Hessian matrix of the regression curve, Xia et al. (2002) proposed minimum average variance estimation as an adaptive approach based on kernel methods, Chiaromonte et al. (2002) used partial SIR for categorical predictors, Li and Wang (2007) worked with empirical directions, and Zhu et al. (2010) proposed cumulative slicing estimation (CUME) to improve upon SIR, among others.
The literature on EDR methods for functional data is relatively scarce. Notably, Ferre and Yao (2003) extended SIR to completely observed functional data (FSIR), and Li and Hsing (2010) developed sequential $\chi^2$ testing procedures to decide the structural dimension of the EDR space obtained using FSIR. Besides EDR approaches, James and Silverman (2005) estimated the index and link functions jointly for an additive form $g(\langle\beta_1,X\rangle,\ldots,\langle\beta_K,X\rangle;\varepsilon) = \beta_0 + \sum_{k=1}^K g_k(\langle\beta_k,X\rangle) + \varepsilon$, assuming that the trajectories were densely or completely observed and that the index and link functions were elements of a finite-dimensional spline space. Chen et al. (2011) estimated the index and additive link functions nonparametrically and relaxed the finite-dimensional assumption for the theoretical analysis, but retained the dense design as a crucial condition.
To the best of our knowledge, none of the existing work addresses dimension reduction for sparse functional data in the context of multiple-index models of the type (3.1). Similar to suggestions from James and Silverman (2005) and Chen et al. (2011), Ferre and Yao (2003) remarked that, in practice, the functional trajectories could first be recovered, but this cumbersome two-step procedure deviates from the spirit of EDR analysis. In contrast, we aim to estimate the EDR space directly, drawing our inspiration from cumulative slicing for multivariate data (Zhu et al., 2010). When adapted to the functional setting, cumulative slicing offers a novel way of borrowing strength across subjects to handle sparsely observed trajectories. This key advantage has not been leveraged elsewhere. As we will demonstrate later, though the extension of cumulative slicing to completely observed functional data is straightforward, a materially different strategy is required for sparse designs, one that maximizes the use of the available data. We also provide a rigorous theoretical analysis of the proposed method, namely Functional Cumulative Slicing (FCS), for sparse functional data, which reveals the bias-variance tradeoff associated with the regularizing truncation and the decaying structures of the predictor process and the EDR space.
The rest of the chapter is organized as follows. We present the proposed FCS methodology and its sample estimation procedure in Chapter 3.2. Chapter 3.3 details the asymptotic properties of the relevant estimates obtained from FCS. Chapter 3.4 provides numerical studies of simulated examples, and Chapter 3.5 offers two data applications, one on sparsely observed functional data and the other on densely observed functional data. Concluding remarks are given in Chapter 3.6, while technical proofs are relegated to the Appendix.
3.2 Methodology
Let $\mathcal{T}$ be a compact interval, and let $X$ be a random variable defined on the real and separable Hilbert space $\mathcal{H} \equiv L^2(\mathcal{T})$ endowed with inner product $\langle f, g\rangle = \int_{\mathcal{T}} f(t)g(t)\,dt$ and norm $\|f\|_{\mathcal{H}} = \sqrt{\langle f, f\rangle}$. We assume for simplicity that

Assumption 3.1. $X$ is centered and has a finite fourth moment, $\int_{\mathcal{T}} E[X^4(t)]\,dt < \infty$.

Under Assumption 3.1, the covariance surface of $X$ is given by $\Sigma(s,t) = E[X(s)X(t)]$, which generates a Hilbert-Schmidt operator $\Sigma$ on $\mathcal{H}$ that maps $f$ to $(\Sigma f)(s) = \int_{\mathcal{T}} \Sigma(s,t)f(t)\,dt$. This operator can be written succinctly as $\Sigma = E[X \otimes X]$, where the tensor product $u \otimes v$ denotes the rank-one operator on $\mathcal{H}$ that maps $w$ to $(u \otimes v)w = \langle u, w\rangle v$.
By Mercer's Theorem, $\Sigma$ admits the spectral decomposition $\Sigma = \sum_{j=1}^{\infty} \alpha_j \phi_j \otimes \phi_j$, where the eigenfunctions $\{\phi_j\}_{j=1,2,\ldots}$ form a complete orthonormal system in $\mathcal{H}$ and the eigenvalues $\{\alpha_j\}_{j=1,2,\ldots}$ are assumed to be strictly decreasing and positive, with $\sum_{j=1}^{\infty} \alpha_j < \infty$. Finally, recall that the EDR directions $\beta_1, \ldots, \beta_K$ in model (3.1) are linearly independent functions in $\mathcal{H}$, and the response $Y \in \mathbb{R}$ is assumed to be conditionally independent of $X$ given the $K$ projections $\langle\beta_1, X\rangle, \ldots, \langle\beta_K, X\rangle$.
As a close comparison, we briefly review functional sliced inverse regression, which targets the EDR space through $\Lambda_{\mathrm{SIR}} = \mathrm{var}(E[X|Y])$, the operator associated with the covariance of the inverse mean. The range of $Y$ is partitioned into $S$ user-specified slices $I_1, \ldots, I_S$, where $I_s$ denotes the interval $(y_{s-1}, y_s]$ with $-\infty = y_0 < y_1 < \ldots < y_S = +\infty$. Observe that $E[X|Y \in I_s] = E[X\mathbf{1}(Y \in I_s)]/P(Y \in I_s) \equiv m_s/p_s$. Then FSIR approximates $\Lambda_{\mathrm{SIR}}$ by its sliced version $\Lambda_0 = \sum_{s=1}^{S} p_s^{-1} m_s \otimes m_s$. From multivariate SIR, it is well known that the number of slices is associated with a bias-variance tradeoff: the number of slices must be larger than the structural dimension in order to fully characterize $\mathcal{S}_{Y|X}$, but if it is too large, the variance will increase as the $p_s$ become close to zero. It is easy to see that applying FSIR to sparsely observed functional data is practically infeasible, since the combination of the sparsely observed $X$ and the delicate need to choose a sufficiently large number of slices would inevitably result in too few observations in each slice with which to estimate $\Lambda_0$.
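For fully observed curves, $\Lambda_0$ has a simple sample analogue; the sketch below builds it from grid-evaluated trajectories, with slices formed from empirical quantiles of $Y$ (one convenient choice; the partition is user-specified in general).

```python
import numpy as np

def fsir_kernel(X, Y, S=5):
    """Sliced kernel Lambda_0 = sum_s p_s^{-1} m_s (x) m_s of FSIR, for
    fully observed curves: rows of X hold X_i evaluated on a grid."""
    n, g = X.shape
    edges = np.quantile(Y, np.linspace(0, 1, S + 1))
    Lambda0 = np.zeros((g, g))
    for s in range(S):
        # right-closed slices (y_{s-1}, y_s]; the first slice keeps the minimum
        in_s = (Y <= edges[1]) if s == 0 else (Y > edges[s]) & (Y <= edges[s + 1])
        p_s = in_s.mean()
        if p_s == 0:
            continue
        m_s = (X * in_s[:, None]).mean(axis=0)   # sample version of E[X 1(Y in I_s)]
        Lambda0 += np.outer(m_s, m_s) / p_s
    return Lambda0
```

With only two slices per threshold, as in the cumulative slicing idea developed next, each slice retains a substantial fraction of the data, which is precisely the property exploited for sparse designs.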
3.2.1 Validity of Functional Cumulative Slicing
To avoid the nontrivial selection of the number of slices in SIR, Zhu et al. (2010) noted that for a fixed $y$, using the two slices $I_1 = (-\infty, y]$ and $I_2 = (y, +\infty)$ maximizes the use of the data and minimizes the variability within each slice. In light of the foregoing discussion on the limitations of FSIR for sparse functional data, the choice of two slices is thus critical to ensure that each slice has a sufficient number of observations. The kernel of the operator $\Lambda_0$ then reduces to $\Lambda_0(s,t;y) \propto m(s,y)m(t,y)$, where $m(\cdot,y) = E[X(\cdot)\mathbf{1}(Y \le y)]$ is an unconditional expectation, in contrast to the conditional expectation $E[X(\cdot)|Y \in I_s]$ of FSIR. However, such a kernel $\Lambda_0$ can recover at most one direction of $\mathcal{S}_{Y|X}$ for a fixed $y \in \mathbb{R}$. It is necessary to combine all possible estimates of $m(\cdot,y)$ by letting $y$ run across the support of $\tilde{Y}$, an independent copy of $Y$. Therefore the kernel of the proposed functional cumulative slicing (FCS) is given by
$$\Lambda(s,t) = E\big[m(s,\tilde{Y})m(t,\tilde{Y})w(\tilde{Y})\big], \qquad (3.2)$$
where $w(y)$ is a known nonnegative weight function, included for generality. Denote the corresponding integral operator of $\Lambda(s,t)$ also by $\Lambda$. The following theorem establishes the validity of FCS. Analogous to the multivariate case, a linearity assumption is needed.

Assumption 3.2 (Linearity). For any function $b \in \mathcal{H}$, there exist constants $c_0, \ldots, c_K \in \mathbb{R}$ such that
$$E\big[\langle b, X\rangle \,\big|\, \langle\beta_1, X\rangle, \ldots, \langle\beta_K, X\rangle\big] = c_0 + \sum_{k=1}^{K} c_k\langle\beta_k, X\rangle.$$

Assumption 3.2 is satisfied when $X$ has an elliptically contoured distribution, which is more general than, but bears a close connection to, a Gaussian process (Cambanis et al., 1981, Li and Hsing, 2010).
Theorem 3.1. If Assumptions 3.1 and 3.2 hold for model (3.1), then the linear space spanned by $m(\cdot,y)$, $y \in \mathbb{R}$, is contained in the linear space spanned by $\{\Sigma\beta_1, \ldots, \Sigma\beta_K\}$, i.e., $\mathrm{span}(\{m(\cdot,y) : y \in \mathbb{R}\}) \subseteq \mathrm{span}(\Sigma\beta_1, \ldots, \Sigma\beta_K)$.
An important observation from Theorem 3.1 is that for any $b \in \mathcal{H}$ orthogonal to the space spanned by $\{\Sigma\beta_1, \ldots, \Sigma\beta_K\}$, we have $\langle b, \Lambda x\rangle = 0$ for all $x \in \mathcal{H}$, implying $\mathrm{range}(\Lambda) \subseteq \mathrm{span}(\Sigma\beta_1, \ldots, \Sigma\beta_K)$. If $\Lambda$ has $K$ non-zero eigenvalues, the space spanned by its eigenfunctions is precisely $\mathrm{span}(\Sigma\beta_1, \ldots, \Sigma\beta_K)$. In principle, we can thus deduce $\mathcal{S}_{Y|X} = \mathrm{span}(\beta_1, \ldots, \beta_K)$ from $\Sigma$ and $\Lambda$. Recall that our target is the central space $\mathcal{S}_{Y|X}$ itself, even though the EDR directions are not individually identifiable. For specificity, we regard the eigenfunctions of $\Sigma^{-1}\Lambda$ associated with the $K$ largest non-zero eigenvalues as the index functions $\beta_1, \ldots, \beta_K$ themselves unless stated otherwise.
It is worth mentioning that since the covariance operator $\Sigma$ is Hilbert-Schmidt, its inverse $\Sigma^{-1}$ is not well-defined, so the EDR directions may not even exist in $\mathcal{H}$. Following He et al. (2003) for functional canonical correlation, let $R_\Sigma$ denote the range of $\Sigma$ and $R_\Sigma^{-1} = \{b \in \mathcal{H} : \sum_{j=1}^{\infty}\alpha_j^{-1}\langle b,\phi_j\rangle^2 < \infty,\ b \in R_\Sigma\}$. Restricted to $R_\Sigma^{-1}$, $\Sigma$ is a one-to-one operator from $R_\Sigma^{-1} \subset \mathcal{H}$ onto $R_\Sigma$ whose inverse is defined by $\Sigma^{-1} = \sum_{j=1}^{\infty}\alpha_j^{-1}\phi_j \otimes \phi_j$. Let $\xi_j = \langle X, \phi_j\rangle$ denote the $j$th principal component (or generalized Fourier coefficient) of $X$, and assume that
Assumption 3.3. $\sum_{j=1}^{\infty}\sum_{l=1}^{\infty}\alpha_j^{-2}\alpha_l^{-1}E^2\big\{E[\xi_j\mathbf{1}(Y \le \tilde{Y})|\tilde{Y}]\,E[\xi_l\mathbf{1}(Y \le \tilde{Y})|\tilde{Y}]\big\} < \infty$.
Proposition 3.1. Under Assumptions 3.1-3.3, the eigenspace associated with the $K$ non-null eigenvalues of $\Sigma^{-1}\Lambda$ is well defined in $\mathcal{H}$.

This is a direct analogue of Theorem 4.8 in He et al. (2003) and Theorem 2.1 in Ferre and Yao (2005), so the proof is omitted for conciseness.
3.2.2 Functional Cumulative Slicing for Sparse Functional Data
For data $\{(X_i, Y_i) : 1 \le i \le n\}$ independently and identically distributed (i.i.d.) as $(X, Y)$, the predictor trajectories $X_i$ are observed intermittently, contaminated with noise, and collected in the form of repeated measurements $\{(T_{ij}, U_{ij}) : 1 \le i \le n,\ 1 \le j \le N_i\}$, where $U_{ij} = X_i(T_{ij}) + \varepsilon_{ij}$ with i.i.d. measurement errors $\varepsilon_{ij}$ that have zero mean and constant variance $\sigma_x^2$ and are independent of all other random variables. When only a few observations are available for some or even all subjects, individual smoothing to recover $X_i$ is infeasible, and one must adopt the strategy of pooling together data from across subjects for consistent estimation.
To estimate the FCS kernel $\Lambda$ defined in (3.2), the key quantity is the unconditional mean $m(t,y) = E[X(t)\mathbf{1}(Y \le y)]$. For sparsely and irregularly observed $X_i$, the cross-sectional estimation used in multivariate cumulative slicing is no longer applicable. To maximize the use of the available data, we propose to pool the repeated measurements across subjects via a scatterplot smoother, which works seamlessly in conjunction with the strategy of cumulative slicing. For specificity, we use a local linear estimator $\hat{m}(t,y) = \hat{a}_0$ (Fan and Gijbels, 1996), minimizing
$$\min_{(a_0, a_1)}\sum_{i=1}^{n}\sum_{j=1}^{N_i}\big\{U_{ij}\mathbf{1}(Y_i \le y) - a_0 - a_1(T_{ij} - t)\big\}^2 K_1\Big(\frac{T_{ij} - t}{h_1}\Big), \qquad (3.3)$$
where $K_1$ is a non-negative and symmetric univariate kernel density and $h_1 = h_1(n)$ is the bandwidth controlling the amount of smoothing. Here we follow the suggestion of ignoring the dependency among data from the same individual (Lin and Carroll, 2000, for smoothing correlated data), and use leave-one-curve-out cross-validation to select $h_1$ (Rice and Silverman, 1991). An estimator of the FCS kernel function $\Lambda(s,t)$ is then given by its sample moment,
$$\hat\Lambda(s,t) = \frac{1}{n}\sum_{i=1}^{n}\hat{m}(s, Y_i)\hat{m}(t, Y_i)w(Y_i). \qquad (3.4)$$
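To fix ideas, the following Python sketch implements the pooled estimator (3.3) and the sample moment (3.4) under simplifying assumptions introduced here for illustration: an Epanechnikov kernel, a single pre-chosen bandwidth `h`, and pooled arrays that stack all measurements, each carrying its subject's response.

```python
import numpy as np

def local_linear_m(t, y, T_pool, U_pool, Y_pool, h):
    """Pooled local linear estimate m_hat(t, y) from (3.3). T_pool and
    U_pool stack all (T_ij, U_ij); Y_pool repeats each subject's Y_i so
    that 1(Y_i <= y) applies measurement-wise."""
    z = (T_pool - t) / h
    w = np.where(np.abs(z) <= 1, 0.75 * (1 - z**2), 0.0)   # Epanechnikov
    resp = U_pool * (Y_pool <= y)                          # U_ij 1(Y_i <= y)
    X = np.column_stack([np.ones_like(T_pool), T_pool - t])
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(X * sw[:, None], resp * sw, rcond=None)
    return coef[0]                                         # a0_hat

def fcs_kernel_hat(grid, Y, T_pool, U_pool, Y_pool, h, w_fun=lambda y: 1.0):
    """Sample-moment estimate (3.4) of the FCS kernel on a grid."""
    M = np.array([[local_linear_m(t, yi, T_pool, U_pool, Y_pool, h)
                   for t in grid] for yi in Y])            # row i: m_hat(., Y_i)
    wts = np.array([w_fun(yi) for yi in Y])
    return (M.T * wts) @ M / len(Y)
```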
For the covariance operator $\Sigma$, following Yao et al. (2005a), denote the observed raw covariances by $G_i(T_{ij}, T_{il}) = U_{ij}U_{il}$ and note that $E[G_i(T_{ij}, T_{il})|T_{ij}, T_{il}] = \mathrm{cov}(X(T_{ij}), X(T_{il})) + \sigma^2\delta_{jl}$, where $\delta_{jl}$ is 1 if $j = l$ and 0 otherwise. This suggests that the diagonal of the raw covariances should be removed, and minimizing
$$\min_{(b_0, b_1, b_2)}\sum_{i=1}^{n}\sum_{1\le j\ne l\le N_i}\big\{G_i(T_{ij}, T_{il}) - b_0 - b_1(T_{ij}-s) - b_2(T_{il}-t)\big\}^2 K_2\Big(\frac{T_{ij}-s}{h_2}, \frac{T_{il}-t}{h_2}\Big) \qquad (3.5)$$
yields $\hat\Sigma(s,t) = \hat{b}_0$, where $K_2$ is a non-negative bivariate kernel density and $h_2 = h_2(n)$ is the bandwidth chosen by leave-one-curve-out cross-validation; see Yao et al. (2005a) for details on the implementation. Since the inverse operator $\Sigma^{-1}$ is unbounded, we regularize it by projection onto a truncated subspace.
To be precise, let $s_n$ be a possibly divergent sequence, and let $\Pi_{s_n} = \sum_{j=1}^{s_n}\phi_j \otimes \phi_j$ (resp. $\hat\Pi_{s_n} = \sum_{j=1}^{s_n}\hat\phi_j \otimes \hat\phi_j$) denote the orthogonal projector onto the eigensubspace associated with the $s_n$ largest eigenvalues of $\Sigma$ (resp. $\hat\Sigma$). Then $\Sigma_{s_n} = \Pi_{s_n}\Sigma\Pi_{s_n}$ (resp. $\hat\Sigma_{s_n} = \hat\Pi_{s_n}\hat\Sigma\hat\Pi_{s_n}$) is a sequence of finite-rank operators converging to $\Sigma$ (resp. $\hat\Sigma$) as $n \to \infty$, with bounded inverses
$$\Sigma_{s_n}^{-1} = \sum_{j=1}^{s_n}\alpha_j^{-1}\phi_j \otimes \phi_j, \qquad \hat\Sigma_{s_n}^{-1} = \sum_{j=1}^{s_n}\hat\alpha_j^{-1}\hat\phi_j \otimes \hat\phi_j, \qquad (3.6)$$
respectively. Finally, we obtain the eigenfunctions associated with the $K$ largest nonzero eigenvalues of $\hat\Sigma_{s_n}^{-1}\hat\Lambda$ as the estimates of the EDR directions, $\{\hat\beta_{k,s_n}\}_{k=1,\ldots,K}$.
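A discretized sketch of this final step follows, assuming the smoothed surfaces $\hat\Sigma$ and $\hat\Lambda$ have already been evaluated on an equally spaced grid. Since $\hat\Sigma_{s_n}^{-1}\hat\Lambda$ is not symmetric, a general eigen-solver is used and the leading eigenvectors are taken as the estimated directions; a symmetrized generalized eigenproblem would be an equally valid implementation choice.

```python
import numpy as np

def fcs_edr_directions(Sigma_hat, Lambda_hat, grid, s_n, K):
    """Truncated inverse (3.6) followed by the eigen-analysis of
    Sigma_{s_n}^{-1} Lambda_hat, discretized on a grid."""
    dt = grid[1] - grid[0]
    # Eigen-decomposition of the discretized covariance operator.
    a, v = np.linalg.eigh((Sigma_hat + Sigma_hat.T) / 2 * dt)
    order = np.argsort(a)[::-1][:s_n]
    a, phi = a[order], v[:, order] / np.sqrt(dt)   # L2-orthonormal phi_j
    # Kernel of the truncated inverse: sum_j a_j^{-1} phi_j(s) phi_j(t).
    Sig_inv = (phi / a) @ phi.T
    # Matrix of the composed operator acting on grid values.
    T_mat = Sig_inv @ Lambda_hat * dt**2
    evals, evecs = np.linalg.eig(T_mat)            # non-symmetric, complex output
    top = np.argsort(-np.abs(evals))[:K]
    beta = np.real(evecs[:, top])                  # real parts suffice as a sketch
    beta /= np.sqrt((beta**2).sum(axis=0) * dt)    # L2-normalize each direction
    return beta
```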
The situation for completely observed $X_i$ is similar to the multivariate case and considerably simpler. The quantities $m(t,y)$ and $\Sigma(s,t)$ are easily estimated by their respective sample moments, $\hat{m}(t,y) = n^{-1}\sum_{i=1}^{n}X_i(t)\mathbf{1}(Y_i \le y)$ and $\hat\Sigma(s,t) = n^{-1}\sum_{i=1}^{n}X_i(s)X_i(t)$, while the estimate of $\Lambda$ remains the same as (3.4). For densely observed $X_i$, individual smoothing can be used as a preprocessing step to recover smooth trajectories, and the estimation error introduced in this step can be shown to be asymptotically negligible under certain design conditions, i.e., this case is equivalent to the ideal situation of completely observed $X_i$ (Hall et al., 2006).
Remarks. (i) For small values of $Y_i$, the estimate $\hat{m}(\cdot, Y_i)$ obtained from (3.3) may be unstable due to the smaller number of pooled observations in the slice. A suitable weight function $w$ may be used to refine the estimator $\hat\Lambda(s,t)$. In our numerical studies, the naive choice of $w \equiv 1$ performed fairly well compared to other methods; analogous to the multivariate case, choosing an optimal $w$ remains an open question. (ii) Ferre and Yao (2005) avoided inverting $\Sigma$ with the claim that for a finite-rank operator $\Lambda$, $\mathrm{range}(\Lambda^{-1}\Sigma) = \mathrm{range}(\Sigma^{-1}\Lambda)$; however, Cook et al. (2010) showed that this requires more stringent conditions that are not easily fulfilled. (iii) Regularization can also be tackled with a ridge penalty $(\Sigma + \rho I)^{-1}$, where $\rho > 0$ and $I$ is the identity operator. However, numerical results from this regularization scheme were observed to be inferior to those from spectral truncation, and it is thus not pursued further. (iv) For selecting the structural dimension $K$, the only relevant work to date is Li and Hsing (2010), where sequential $\chi^2$ tests are developed to determine $K$ in the context of FSIR for completely observed functional data. How to extend such tests (if feasible at all) to sparse functional data is a substantive problem that deserves further exploration. Nevertheless, since prediction is the primary concern in many applications, both $K$ and $s_n$ can easily be chosen by minimizing the prediction error when a sensible model is in place. In the simulated and real examples we adopt this principle, which empirically performs well.
3.3 Asymptotic Properties
In this section we present asymptotic properties of the FCS kernel operator and the EDR directions for sparsely observed functional data. Here the number of measurements $N_i$ and the observation times $T_{ij}$ are considered random, to reflect a sparse and irregular design. Specifically, we assume that

Assumption 3.4. $N_i$ are random variables with $N_i \overset{\text{i.i.d.}}{\sim} N$, where $N$ is a bounded positive discrete random variable with $P\{N \ge 2\} > 0$, and $(\{T_{ij}, j \in J_i\}, \{U_{ij}, j \in J_i\})$ are independent of $N_i$ for $J_i \subseteq \{1, \ldots, N_i\}$.

Writing $T_i = (T_{i1}, \ldots, T_{iN_i})^\top$ and $U_i = (U_{i1}, \ldots, U_{iN_i})^\top$, the data quadruplets $Z_i = \{T_i, U_i, Y_i, N_i\}$ are thus i.i.d. Note that extremely sparse designs are covered, with only a few measurements for each subject. The other regularity conditions are standard and are listed in the Appendix, including assumptions on the smoothness of the mean and covariance functions of $X$, the distributions of the observation times, and the bandwidths and kernel functions used in the smoothing steps. Denote $\|A\|_{\mathcal{H}}^2 = \int_{\mathcal{T}}\int_{\mathcal{T}}A^2(s,t)\,ds\,dt$ for $A \in L^2(\mathcal{T}\times\mathcal{T})$.
Theorem 3.2. Under Assumptions 3.1, 3.4 and 3.7-3.10 in the Appendix, we have
$$\big\|\hat\Lambda - \Lambda\big\|_{\mathcal{H}} = O_p\Big(\frac{1}{\sqrt{n}\,h_1}\Big), \qquad \big\|\hat\Sigma - \Sigma\big\|_{\mathcal{H}} = O_p\Big(\frac{1}{\sqrt{n}\,h_2}\Big).$$
The key result here is the $L^2$ convergence of the estimated FCS operator $\hat\Lambda$, for which we exploit the projections of nonparametric $U$-statistics, coupled with an important decomposition of $\hat{m}(\cdot,y)$, to overcome the difficulty caused by the dependence among irregularly spaced measurements. Note that $\hat\Lambda$ is obtained by averaging the smoothers $\hat{m}(\cdot, Y_i)$ over $Y_i$, which is crucial for achieving the univariate convergence rate for this bivariate estimator. The convergence of the covariance operator $\hat\Sigma$, given in Theorem 2 of Yao and Muller (2010), is presented for completeness.
We are now ready to characterize the estimation of the central subspace $\mathcal{S}_{Y|X} = \mathrm{span}(\beta_1, \ldots, \beta_K)$. Unlike the multivariate or finite-dimensional case, where the convergence of $\hat{\mathcal{S}}_{Y|X}$ follows immediately from the boundedness of $\Sigma^{-1}$, we have to approximate $\Sigma^{-1}$ with the sequence of truncated estimates $\hat\Sigma_{s_n}^{-1}$ in (3.6). Recall that we specifically regarded the index functions $\{\beta_1, \ldots, \beta_K\}$ as the eigenfunctions associated with the $K$ largest eigenvalues of $\Sigma^{-1}\Lambda$ to suppress the identifiability concern. It is thus equivalent to consider $\{\hat\beta_{1,s_n}, \ldots, \hat\beta_{K,s_n}\}$ in place of $\hat{\mathcal{S}}_{Y|X}$. For an arbitrary constant $C > 0$, we require the eigenvalues of $\Sigma$ to satisfy

Assumption 3.5. $\alpha_1 > \alpha_2 > \ldots > 0$, $E\xi_j^4 \le C\alpha_j^2$ for $j \ge 1$, and $\alpha_j - \alpha_{j+1} \ge C^{-1}j^{-a-1}$ for $j \ge 1$.
This condition on the decay of the eigenvalues $\alpha_j$ prevents the spacings between consecutive eigenvalues from being too small; it also implies $\alpha_j \ge Cj^{-a}$ and, together with the boundedness of $\Sigma$, that $a > 1$. Expressing the index functions as $\beta_k = \sum_{j=1}^{\infty}b_{kj}\phi_j$, $k = 1, \ldots, K$, we impose a decaying structure on the generalized Fourier coefficients $b_{kj} = \langle\beta_k, \phi_j\rangle$,

Assumption 3.6. $|b_{kj}| \le Cj^{-b}$ for $j \ge 1$ and $1 \le k \le K$, where $a + \frac{1}{2} < b$.
This implies that $\{\beta_k\}_{k=1,\ldots,K}$ is smoother than $\Sigma$. Here we require a stronger condition than the $a/2 + 1 < b$ assumed by Hall and Horowitz (2007) for the functional linear model with completely observed $X_i$. This is not unexpected, as the index model (3.1) is more flexible and we are dealing with sparse functional data.
Theorem 3.3. Under Assumptions 3.1-3.6 and 3.7-3.10 in the Appendix, for all $k = 1, \ldots, K$, we have
$$\big\|\hat\beta_{k,s_n} - \beta_k\big\|_{\mathcal{H}} = O_p\bigg(\frac{s_n^{\frac{3}{2}a+1}}{\sqrt{n}\,h_1} + \frac{s_n^{(2a-b+2)^+}}{\sqrt{n}\,h_2} + \frac{1}{s_n^{\,b-a-\frac{1}{2}}}\bigg),$$
where $(2a-b+2)^+ = \max(0,\ 2a-b+2)$.
This result explicitly links the convergence of $\hat\beta_{k,s_n}$ to the regularizing truncation size $s_n$ and the decay rates of $\alpha_j$ and $b_{kj}$. Specifically, the first two terms are attributed to the variability of estimating $\Sigma_{s_n}^{-1}\Lambda$ by $\hat\Sigma_{s_n}^{-1}\hat\Lambda$, and the last to the approximation bias of $\Sigma_{s_n}^{-1}\Lambda$. This indicates a bias-variance tradeoff associated with the truncation size $s_n$; one may view $s_n$ as a tuning parameter that controls the resolution or smoothness of the covariance estimation. Furthermore, the first term of the variance is due to $\|\hat\Sigma_{s_n}^{-1}\hat\Lambda\Sigma_{s_n}^{-1/2} - \Sigma_{s_n}^{-1}\Lambda\Sigma_{s_n}^{-1/2}\|$ (details in the Appendix) and becomes increasingly unstable with a larger truncation. The bias and the second part of the variance, contributed by $\|(\hat\Sigma_{s_n}^{-1} - \Sigma_{s_n}^{-1})\Lambda\Sigma_{s_n}^{-1/2}\|$, are to some extent determined by the relative smoothness of $\Sigma$ and $\beta_k$, i.e., a smoother $\beta_k$ with a larger $b$ leads to less discrepancy.
3.4 Simulations
In this section we illustrate the performance of the proposed FCS method in terms of estimation and prediction. We compare FCS to (i) FSIR with 5 slices (FSIR5), (ii) FSIR with 10 slices (FSIR10), (iii) the functional index model with nonparametric link (FIND) proposed by Chen et al. (2011), and (iv) the functional linear model (FLM), as a misspecified baseline for assessing prediction. Although FCS and FSIR are "link-free" for estimating the index functions $\beta_k$, a general index model (3.1) may lead to model predictions with high variability, especially given the relatively small sample sizes frequently encountered in functional data analysis. Thus we follow Chen et al. (2011) in assuming an additive structure on the link function $g$ in (3.1), i.e., $Y = \beta_0 + \sum_{k=1}^{K}g_k(\langle\beta_k, X\rangle) + \varepsilon$.
In each Monte Carlo run, a sample of $n = 200$ functional trajectories is generated from the process $X_i(t) = \sum_{j=1}^{50}\xi_{ij}\phi_j(t)$, $t \in [0, 10]$, where $\phi_j(t) = \sin(\pi t j/5)/\sqrt{5}$ for $j$ even and $\phi_j(t) = \cos(\pi t j/5)/\sqrt{5}$ for $j$ odd, and the FPC scores $\xi_{ij}$ are i.i.d. $N(0, j^{-1.5})$. For the setting of sparsely observed functional data, the number of observations per subject $N_i$ is chosen uniformly from $\{15, \ldots, 20\}$, the observation times $T_{ij}$ are i.i.d. $U[0, 10]$, and the measurement errors $\varepsilon_{ij}$ are i.i.d. $N(0, 0.1)$. For densely observed functional data, we let $T_{ij} = 0.1(j-1)$ for $j = 1, \ldots, 101$. The EDR directions are generated by $\beta_1(t) = \sum_{j=1}^{50}b_j\phi_j(t)$, where $b_j = 1$ for $j = 1, 2, 3$ and $b_j = 4(j-2)^{-3}$ for $4 \le j \le 50$, and $\beta_2(t) = \sqrt{3/10}\,(t/5 - 1)$, which is not representable with finitely many Fourier terms.
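For concreteness, a sketch of the sparse data-generating design under Model I below; the seed and function names are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(t, j):
    """Eigenbasis of the simulation: sin(pi t j / 5)/sqrt(5) for even j,
    cos(pi t j / 5)/sqrt(5) for odd j (1-based indexing)."""
    f = np.sin if j % 2 == 0 else np.cos
    return f(np.pi * t * j / 5) / np.sqrt(5)

def generate_sparse_model_I(n=200, J=50):
    """One Monte Carlo sample under the sparse design and Model I."""
    j_idx = np.arange(1, J + 1, dtype=float)
    b = np.ones(J)                                  # beta_1 coefficients
    b[3:] = 4.0 * (j_idx[3:] - 2.0) ** -3
    sample = []
    for _ in range(n):
        Ni = rng.integers(15, 21)                   # Ni ~ Uniform{15,...,20}
        Tij = rng.uniform(0, 10, size=Ni)
        xi = rng.normal(0, np.sqrt(j_idx ** -1.5))  # scores N(0, j^{-1.5})
        Xi = sum(xi[j - 1] * phi(Tij, j) for j in range(1, J + 1))
        Uij = Xi + rng.normal(0, np.sqrt(0.1), size=Ni)
        # <beta_1, X_i> = sum_j b_j xi_ij by orthonormality of the basis
        Yi = np.sin(np.pi * (b @ xi) / 4) + rng.normal()
        sample.append((Tij, Uij, Yi))
    return sample
```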
Since neither FSIR nor FIND is directly applicable to sparse functional data for estimating $\beta_k$, we adopt a two-stage method as suggested by Ferre and Yao (2003) and Chen et al. (2011): first we use the PACE method (Yao et al., 2005a), a functional principal component approach specifically designed for sparse functional data and publicly available at http://www.stat.ucdavis.edu/PACE, to recover $X_i$ with very little dimension reduction (using a fraction of variance explained of 99%), denoted by $\hat{X}_i$; then we apply FSIR or FIND to obtain $\hat\beta_{k,s_n}$. The following single- and multiple-index models are considered:

Model I: $Y = \sin(\pi\langle\beta_1, X\rangle/4) + \varepsilon$,
Model II: $Y = \arctan(\pi\langle\beta_1, X\rangle/2) + \varepsilon$,
Model III: $Y = \sin(\pi\langle\beta_1, X\rangle/3) + \exp(\langle\beta_2, X\rangle/3) + \varepsilon$,
Model IV: $Y = \arctan(\pi\langle\beta_1, X\rangle) + \sin(\pi\langle\beta_2, X\rangle/6)/2 + \varepsilon$,

where the regression error $\varepsilon$ is i.i.d. $N(0, 1)$ for all models.
of βk’s, we examine the projection operator of the the EDR space, i.e., P =∑K0
k=1 βk⊗βk
with K0 denoting the true structural dimension. To assess the estimation of the EDR
space, we calculate the average of the singular values of (PK,sn − P ) as the model error,
i.e., its operator norm ‖PK,sn − P‖ normalized by the number of singular values with
PK,sn =∑K
k=1 βK,sn ⊗ βK,sn . We compute the average model error and its standard error
over 100 Monte Carlo repetitions, shown in Table 3.1. The structure dimension K and
the truncation parameter sn are chosen by minimizing the average model error. One can
see that, for sparse functional data, the proposed FCS outperforms the other methods for
all models, while FSIR and FIND may suffer from the two-stage approach for estimating
index functions. As expected, the gains in the setting of dense functional data are less
noticeable.
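A discretized sketch of this model-error criterion, assuming directions are stored as grid-evaluated, L2-orthonormal columns; the tolerance used to identify nonzero singular values is an arbitrary illustrative choice.

```python
import numpy as np

def model_error(beta_hat, beta_true, grid):
    """Average of the nonzero singular values of P_hat_{K,s_n} - P, where
    P = sum_k beta_k (x) beta_k; rank(P_hat - P) <= K + K_0."""
    dt = grid[1] - grid[0]
    P_hat = beta_hat @ beta_hat.T * dt       # matrix of the projection operator
    P = beta_true @ beta_true.T * dt
    sv = np.linalg.svd(P_hat - P, compute_uv=False)
    nz = sv[sv > 1e-10]
    return nz.mean() if nz.size else 0.0
```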
To assess model prediction, we use a backfitting algorithm (Hastie and Tibshirani, 1990) to nonparametrically estimate the link functions $g_k$ by fitting $Y_i = \beta_0 + \sum_{k=1}^{K}g_k(Z_{ik}) + \varepsilon_i$, where $Z_{ik} = \langle\hat\beta_{k,s_n}, X_i\rangle$. For dense functional data, $Z_{ik} = \langle\hat\beta_{k,s_n}, X_i\rangle$ is given by an integral approximation. When the $X_i$ are sparse, we substitute $X_i$ with its PACE estimate $\hat{X}_i$. Unlike FSIR and FCS, FIND estimates the index and link functions jointly. To calculate the prediction error, we additionally generate a validation sample of size 500 in each run, and calculate the Monte Carlo average of the mean squared prediction error $\mathrm{MSPE} = 500^{-1}\sum_{i=1}^{500}(Y_i^* - \hat{Y}_i^*)^2$ over different values of $K$ and $s_n$, where $\hat{Y}_i^* = \hat\beta_0 + \sum_{k=1}^{K}\hat{g}_k(Z_{ik}^*)$ and $Z_{ik}^* = \langle\hat\beta_{k,s_n}, X_i^*\rangle$, with $X_i^*$ being the underlying trajectories in the testing sample.
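A minimal backfitting sketch follows, substituting a Gaussian-kernel Nadaraya-Watson smoother with a fixed bandwidth for whichever scatterplot smoother one prefers; this is an illustrative simplification, not the exact smoother used in the studies.

```python
import numpy as np

def backfit(Z, Y, h=0.5, n_iter=20):
    """Backfitting (Hastie and Tibshirani, 1990) for the additive model
    Y = beta0 + sum_k g_k(Z_k) + eps. Z is (n x K); returns the intercept
    and fitted component values G[:, k] = g_k(Z[:, k])."""
    n, K = Z.shape
    beta0 = Y.mean()
    G = np.zeros((n, K))
    for _ in range(n_iter):
        for k in range(K):
            r = Y - beta0 - G.sum(axis=1) + G[:, k]   # partial residuals
            D = (Z[:, k][:, None] - Z[:, k][None, :]) / h
            W = np.exp(-0.5 * D**2)                   # Gaussian kernel weights
            G[:, k] = W @ r / W.sum(axis=1)           # Nadaraya-Watson smooth
            G[:, k] -= G[:, k].mean()                 # center for identifiability
    return beta0, G
```

Predictions at new projections $Z^*_{ik}$ then follow by smoothing the final partial residuals at the new points.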
We report the minimized average MSPE and its standard error, with the corresponding choices of $\{K, s_n\}$, in Table 3.2. We see that FCS substantially improves prediction for sparse functional data in all models. In the dense-data setting, the predictions from FCS and FSIR are virtually indistinguishable, while FIND appears suboptimal and the misspecified FLM fails as expected.
Table 3.1: Shown are the model errors in the form of the operator norm ‖P̂_{K,s_n} − P‖ with standard errors (in parentheses), and the optimal K and s_n that minimize the average model error over 100 Monte Carlo repetitions.

Design   Model   FCS           FSIR5         FSIR10        FIND
Sparse   I       .476 (.016)   .540 (.016)   .555 (.018)   .492 (.014)
         II      .415 (.013)   .508 (.014)   .511 (.016)   .424 (.010)
         III     .640 (.009)   .667 (.009)   .692 (.010)   .654 (.008)
         IV      .610 (.006)   .625 (.007)   .637 (.007)   .620 (.007)
Dense    I       .305 (.008)   .302 (.009)   .309 (.007)   .310 (.009)
         II      .248 (.006)   .254 (.007)   .257 (.007)   .290 (.007)
         III     .584 (.007)   .590 (.006)   .589 (.007)   .581 (.009)
         IV      .539 (.005)   .535 (.005)   .543 (.005)   .537 (.008)

For every method, the minimizing choices were K = 1, s_n = 3 for Models I-II and K = 2, s_n = 3 for Models III-IV.
The structural dimension $K$ is an inherent parameter of the underlying model, while the truncation $s_n$ plays the role of a tuning parameter that may vary with the purpose, whether estimation or prediction. In our simulations, the structural dimension $K$ is correctly specified by both criteria, the average MSPE and the model error, in all cases. Since the model error is not obtainable in practice, we suggest approximating the prediction error with a suitable cross-validation procedure for choosing $K$ together with $s_n$.
3.5 Data Applications
3.5.1 eBay auction data
In this application, we study the relationship between the winning bid prices of $n = 156$ Palm M515 PDA devices auctioned on eBay between March and May 2003 and the bidding histories over the 7-day duration of each auction.
Table 3.2: Shown are the average MSPE with its standard error (in parentheses), and the optimal K and s_n that minimize the average MSPE over 100 Monte Carlo repetitions.

Design   Model   FCS            FSIR5          FSIR10         FIND           FLM
Sparse   I       .129 (.005)    .149 (.005)    .155 (.006)    .135 (.005)    .225 (.003)
                 K=1, s_n=3     K=1, s_n=3     K=1, s_n=3     K=1, s_n=3     s_n=2
         II      .117 (.005)    .148 (.006)    .156 (.006)    .125 (.007)    .180 (.003)
                 K=1, s_n=3     K=1, s_n=3     K=1, s_n=3     K=1, s_n=3     s_n=4
         III     .168 (.005)    .182 (.006)    .191 (.004)    .190 (.008)    .227 (.004)
                 K=2, s_n=3     K=2, s_n=3     K=2, s_n=3     K=2, s_n=3     s_n=3
         IV      .231 (.007)    .279 (.009)    .301 (.009)    .298 (.010)    .427 (.007)
                 K=2, s_n=3     K=2, s_n=3     K=2, s_n=3     K=2, s_n=3     s_n=3
Dense    I       .075 (.003)    .078 (.003)    .081 (.004)    .084 (.005)    .193 (.003)
                 K=1, s_n=3     K=1, s_n=3     K=1, s_n=3     K=1, s_n=4     s_n=3
         II      .058 (.002)    .062 (.002)    .066 (.003)    .079 (.004)    .108 (.001)
                 K=1, s_n=3     K=1, s_n=3     K=1, s_n=3     K=1, s_n=3     s_n=3
         III     .127 (.004)    .135 (.004)    .139 (.005)    .132 (.006)    .195 (.003)
                 K=2, s_n=3     K=2, s_n=3     K=2, s_n=3     K=2, s_n=3     s_n=3
         IV      .141 (.005)    .147 (.005)    .150 (.005)    .157 (.006)    .285 (.002)
                 K=2, s_n=3     K=2, s_n=3     K=2, s_n=3     K=2, s_n=3     s_n=3
An observation from a bidding history represents a "live bid", the actual price a winning bidder would pay for the device, known as the "willingness-to-pay" price. Further details on the bidding mechanism can be found in Liu and Muller (2009). We adopt the view that the bidding histories are i.i.d. realizations of a smooth underlying price process. Due to the nature of online auctions, the $j$th bid of the $i$th auction usually arrives irregularly at a time $T_{ij}$, and the numbers of bids $N_i$ vary widely, from 9 to 52 in this dataset. As is common when modeling prices, we take a log-transform of the bid prices. Figure 3.1 shows a sample of 9 randomly selected log bid histories over the 7-day duration of the auction. Typically, the bid histories are very sparse until the final hours of each auction, when "bid sniping" occurs: "snipers" place their bids at the last possible moments in an attempt to deny competing bidders the chance of placing a higher bid.
Since the main interest is the predictive power of the price histories up to time $T$ for the winning bid prices, we consider the regression of the winning price on the history trajectory $X(t)$, $t \in [0, T]$, setting $T = 4.5, 4.6, 4.7, \ldots, 6.8$ (in days).
Figure 3.1: Irregularly and sparsely observed log bid price trajectories of 9 randomly selected auctions over the 7-day duration.
For each analysis on the domain $[0, T]$, we select the optimal structural dimension $K$ and truncation parameter $s_n$ by minimizing the average 5-fold cross-validated prediction error over 20 random partitions. Shown in Figure 3.2 are the minimized average cross-validated prediction errors, compared with FSIR and FLM, where FSIR is obtained using 5 slices (superior to FSIR using 10 slices); we omit the results for FIND, which had considerably larger errors. The results are not surprising: the prediction error decreases as the bidding histories encompass more data and approach the end of the auction. The proposed FCS clearly outperforms the other methods, and FLM yields the least favorable predictions, until the last moments of the auction, when any sensible method achieves high predictive power.
As an illustration, we present the analysis for the case $T = 6$. The estimated model components using FCS are shown in Figure 3.3, with the parameters chosen as $K = 2$ and $s_n = 2$. The first index function assigns contrasting weights to bids made before and after the first day, indicating that some bidders tend to underbid at the beginning, only to quickly overbid relative to the mean.
Figure 3.2: Average 5-fold cross-validated prediction errors over 20 random partitions across various time domains [0, T], for the sparse eBay auction data.
The second index represents a cautious type of bidding behavior: entering at a lower price and slowly increasing towards the average level. These features contribute the most towards the prediction of the winning bid prices. Slightly nonlinear patterns are also visible in the estimated additive link functions.
3.5.2 Spectrometric data
In this example, we study the spectrometric data consisting of $n = 215$ samples of finely chopped meat, publicly available at http://lib.stat.cmu.edu/datasets/tecator. For each meat sample, the moisture content and the absorbance spectrum, measured at 100 equally spaced wavelengths between 850 nm and 1050 nm, were recorded using a Tecator Infratec Food and Feed Analyzer. Each absorbance spectrum is treated as an i.i.d. realization of the absorbance process. Thus the 215 absorbance trajectories, shown in Figure 3.4, can be regarded as densely observed functional data.
In Table 3.3, we present the minimized average 5-fold cross-validated prediction errors over 20 random partitions for the different methods, together with the selected structural dimensions and truncation sizes.
Figure 3.3: Estimated model components for the sparse eBay auction data using FCS with K = 2 and s_n = 2. The first and second rows of plots show the estimated index functions, i.e., the EDR directions, and the additive link functions, respectively.
Figure 3.4: Absorbance trajectories of 215 meat samples measured at 100 equally spaced wavelengths between 850 nm and 1050 nm.
Table 3.3: Average 5-fold cross-validated prediction error over 20 Monte Carlo runs with selected K and s_n, for the dense spectrometric data.

FCS             FSIR5           FSIR10          FIND            FLM
.0093 (.0001)   .0096 (.0001)   .0095 (.0001)   .0222 (.0016)   .0128 (.0002)
K=2, s_n=5      K=2, s_n=5      K=2, s_n=5      K=2, s_n=5      s_n=8
Similar to our simulation study for dense functional data, the results for FCS and FSIR are virtually indistinguishable, and both improve significantly upon FIND and FLM. The estimated EDR directions and additive link functions are displayed in Figure 3.5, with $K = 2$ and $s_n = 5$, where the link functions appear to be nearly linear. The first index function emphasizes the rising trend above the mean at wavelengths around 930 nm, and the second index picks up the contrast between wavelengths 930 nm and 950 nm. Such EDR directions suggest that the rise and fall around wavelengths 930 nm and 950 nm in the spectrometric trajectories, seen in Figure 3.4, are important features for predicting moisture content.
Figure 3.5: Estimated model components for the spectrometric data using FCS with (K, s_n) = (2, 5). The first and second rows of plots show the estimated EDR directions and additive link functions, respectively.
3.6 Concluding Remarks
In this chapter we introduce a new method of effective dimension reduction for sparse functional data, where one observes only a few noisy and irregular measurements for some or all of the subjects. The proposed FCS estimation is link-free and targets the EDR space directly by borrowing information across the entire sample. Theoretical analysis reveals the bias-variance tradeoff associated with the truncation parameter, and the impact of the decaying structures of the predictor process and the EDR directions. Numerical results from simulated and real examples show superiority over existing methods for sparse functional data. It is worth mentioning that the proposed method in fact opens the door to more sophisticated dimension reduction approaches for sparse functional data. Following the strategy of "pooling information together", we may further extend the idea of functional cumulative slicing to variance estimation or directional regression, by analogy with the multivariate case (Zhu et al., 2010). The usefulness and justification of these extensions deserve further study and will be explored in future investigations.
3.A Regularity Conditions
Without loss of generality, we assume that the known weight function $w(\cdot) \equiv 1$. Denote $\mathcal{T} = [a, b]$ and $\mathcal{T}^\delta = [a-\delta, b+\delta]$ for some $\delta > 0$; denote a single observation time by $T$ and a pair by $(T_1, T_2)^\top$, whose densities are $f_1(t)$ and $f_2(s,t)$, respectively. Recall the unconditional mean function $m(t,y) = E[X(t)\mathbf{1}(Y \le y)]$. The regularity conditions for the underlying moment functions and design densities are as follows, where $\ell_1, \ell_2$ are non-negative integers.

Assumption 3.7. $\frac{\partial^2}{\partial s^{\ell_1}\partial t^{\ell_2}}\Sigma$ is continuous on $\mathcal{T}^\delta\times\mathcal{T}^\delta$ for $\ell_1+\ell_2 = 2$; $\partial^2 m/\partial t^2$ is bounded and continuous with respect to $t \in \mathcal{T}$ for all $y \in \mathbb{R}$.

Assumption 3.8. $f_1^{(1)}(t)$ is continuous on $\mathcal{T}^\delta$ with $f_1(t) > 0$; $\frac{\partial}{\partial s^{\ell_1}\partial t^{\ell_2}}f_2$ is continuous on $\mathcal{T}^\delta\times\mathcal{T}^\delta$ for $\ell_1+\ell_2 = 1$, with $f_2(s,t) > 0$.
Assumption 3.7 is guaranteed by a twice-differentiable process, and Assumption 3.8 is standard and also implies the boundedness and Lipschitz continuity of $f_1$. Recall the bandwidths $h_1$ and $h_2$ used in the smoothing steps for $m$ in (3.3) and for $\Sigma$ in (3.5), respectively. We assume that

Assumption 3.9. $h_1 \to 0$, $h_2 \to 0$, $nh_1^3/\log n \to \infty$, $nh_1^5 < \infty$, $nh_2^2 \to \infty$ and $nh_2^6 < \infty$.

We say that a bivariate kernel function $K_2$ is of order $(\nu, \ell)$, where $\nu = (\nu_1, \nu_2)^\top$ is a multi-index, if
$$\int\!\!\int u^{\ell_1}v^{\ell_2}K_2(u,v)\,du\,dv = \begin{cases} 0, & 0 \le \ell_1+\ell_2 < \ell,\ \ell_1 \ne \nu_1,\ \ell_2 \ne \nu_2,\\ (-1)^{|\nu|}\nu_1!\,\nu_2!, & \ell_1 = \nu_1,\ \ell_2 = \nu_2,\\ \ne 0, & \ell_1+\ell_2 = \ell, \end{cases} \qquad (3.7)$$
where $|\nu| = \nu_1+\nu_2 < \ell$. The univariate kernel $K_1$ is said to be of order $(\nu, \ell)$ for a univariate $\nu = \nu_1$ if (3.7) holds with $\ell_2 = 0$ on the right-hand side, integrating only over the argument $u$ on the left-hand side. The following standard conditions on the kernel densities are required.

Assumption 3.10. The kernel functions $K_1$ and $K_2$ are non-negative with compact supports, bounded, and of order $(0, 2)$ and $((0, 0), 2)$, respectively.
3.B Proof of Theorem 3.1
It is equivalent to show that if $b \perp \mathrm{span}(\Sigma\beta_1, \ldots, \Sigma\beta_K)$, i.e., $\langle b, \Sigma\beta_k\rangle = 0$ for $k = 1, \ldots, K$, then $\langle b, m(\cdot,y)\rangle = 0$. Observe that
$$\langle b, m(\cdot,y)\rangle = E\big[E\{\langle b, X\mathbf{1}(Y \le y)\rangle \mid Y\}\big] = E\big\{E(\langle b, X\rangle \mid Y)\,\mathbf{1}(Y \le y)\big\} = E\big\{E(\langle b, X\rangle \mid \langle\beta_1, X\rangle, \ldots, \langle\beta_K, X\rangle)\,\mathbf{1}(Y \le y)\big\},$$
where the last equality follows from the assumption of model (3.1). It suffices to show that the inner expectation $E(\langle b, X\rangle \mid \langle\beta_1, X\rangle, \ldots, \langle\beta_K, X\rangle) = 0$, which is implied by $E\{E^2(\langle b, X\rangle \mid \langle\beta_1, X\rangle, \ldots, \langle\beta_K, X\rangle)\} = 0$. Invoking Assumptions 3.1-3.2 and the fact that $E(\langle\beta_k, X\rangle\langle b, X\rangle) = \langle b, \Sigma\beta_k\rangle$,
$$\begin{aligned} E\big\{E^2(\langle b, X\rangle \mid \langle\beta_1, X\rangle, \ldots, \langle\beta_K, X\rangle)\big\} &= E\Big\{E(\langle b, X\rangle \mid \langle\beta_1, X\rangle, \ldots, \langle\beta_K, X\rangle)\Big(c_0 + \sum_{k=1}^{K}c_k\langle\beta_k, X\rangle\Big)\Big\}\\ &= E\Big\{E\Big(c_0\langle b, X\rangle + \sum_{k=1}^{K}c_k\langle\beta_k, X\rangle\langle b, X\rangle \,\Big|\, \langle\beta_1, X\rangle, \ldots, \langle\beta_K, X\rangle\Big)\Big\}\\ &= c_0E(\langle b, X\rangle) + \sum_{k=1}^{K}c_k\langle b, \Sigma\beta_k\rangle = 0, \end{aligned}$$
as desired.
3.C Proof of Theorem 3.2
Let $M$ denote the upper bound of the random variable $N$, and define
$$S_n(t) = \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{M}\frac{\mathbf{1}(N_i \ge j)}{h_1E(N)}K_1\Big(\frac{T_{ij}-t}{h_1}\Big)\begin{pmatrix} 1 & \frac{T_{ij}-t}{h_1}\\ \frac{T_{ij}-t}{h_1} & \big(\frac{T_{ij}-t}{h_1}\big)^2 \end{pmatrix} \quad\text{and}\quad S(t) = \begin{pmatrix} f_T(t) & 0\\ 0 & f_T(t)\sigma_K^2 \end{pmatrix}.$$
The local linear estimator of $m(t,y)$ with kernel $K_1$ is
$$\hat{m}(t,y) = (1,0)\,S_n^{-1}(t)\begin{pmatrix} \sum_i\sum_j\frac{\mathbf{1}(N_i\ge j)}{nh_1E(N)}K_1\big(\frac{T_{ij}-t}{h_1}\big)U_{ij}\mathbf{1}(Y_i \le y)\\ \sum_i\sum_j\frac{\mathbf{1}(N_i\ge j)}{nh_1E(N)}K_1\big(\frac{T_{ij}-t}{h_1}\big)\big(\frac{T_{ij}-t}{h_1}\big)U_{ij}\mathbf{1}(Y_i \le y) \end{pmatrix}.$$
Let $U_{ij}^*(t,y) = U_{ij}\mathbf{1}(Y_i \le y) - m(t,y) - m^{(1)}(t,y)(T_{ij}-t)$ and $W_n(z,t) = (1,0)S_n^{-1}(t)(1,z)^\top K_1(z)$. Then
$$\hat{m}(t,y) - m(t,y) = \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{M}\frac{\mathbf{1}(N_i \ge j)}{h_1E(N)}W_n\Big(\frac{T_{ij}-t}{h_1}, t\Big)U_{ij}^*(t,y).$$
If we denote a point between $T_{ij}$ and $t$ by $t_{ij}^*$, then by Taylor's Theorem, $U_{ij}^*(t,y) = U_{ij}\mathbf{1}(Y_i \le y) - m(T_{ij},y) + \frac{1}{2}m^{(2)}(t_{ij}^*,y)(T_{ij}-t)^2$. Finally, letting $e_{ij}(y)$ denote the "error" term, i.e., $e_{ij}(y) = U_{ij}\mathbf{1}(Y_i \le y) - m(T_{ij},y)$, we then have
$$\hat{m}(t,y) - m(t,y) = \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{M}\frac{\mathbf{1}(N_i \ge j)}{h_1E(N)f_T(t)}K_1\Big(\frac{T_{ij}-t}{h_1}\Big)e_{ij}(y) + \frac{1}{2n}\sum_{i=1}^{n}\sum_{j=1}^{M}\frac{h_1\mathbf{1}(N_i \ge j)}{E(N)f_T(t)}K_1\Big(\frac{T_{ij}-t}{h_1}\Big)\Big(\frac{T_{ij}-t}{h_1}\Big)^2 m^{(2)}(t_{ij}^*,y) + A_n(t,y),$$
where $A_n(t,y) = \hat{m}(t,y) - m(t,y) - \{nh_1E(N)f_T(t)\}^{-1}\sum_i\sum_j\mathbf{1}(N_i \ge j)K_1((T_{ij}-t)/h_1)U_{ij}^*(t,y)$.
This allows us to write $\hat\Lambda(s,t) - \Lambda(s,t) = I_{1n}(s,t) + I_{2n}(s,t) + I_{3n}(s,t)$, where
$$\begin{aligned} I_{1n}(s,t) &= \frac{1}{n}\sum_{k=1}^{n}\Big\{m(s,Y_k)\big[\hat{m}(t,Y_k) - m(t,Y_k)\big] + m(t,Y_k)\big[\hat{m}(s,Y_k) - m(s,Y_k)\big]\Big\},\\ I_{2n}(s,t) &= \frac{1}{n}\sum_{k=1}^{n}\big\{\hat{m}(s,Y_k) - m(s,Y_k)\big\}\big\{\hat{m}(t,Y_k) - m(t,Y_k)\big\},\\ I_{3n}(s,t) &= \frac{1}{n}\sum_{k=1}^{n}m(s,Y_k)m(t,Y_k) - \Lambda(s,t), \end{aligned}$$
which implies, by the Cauchy-Schwarz inequality, that $\|\hat\Lambda - \Lambda\|_{\mathcal{H}}^2 = O_p(\|I_{1n}\|_{\mathcal{H}}^2 + \|I_{2n}\|_{\mathcal{H}}^2 + \|I_{3n}\|_{\mathcal{H}}^2)$. We drop the subscript $\mathcal{H}$ for brevity in the sequel. Recall that we defined $Z_i$ as the underlying data quadruplet $(T_i, U_i, Y_i, N_i)$. Further, let $\sum_{(p)}h_{i_1,\ldots,i_p}$ denote the sum of $h_{i_1,\ldots,i_p}$ over the permutations of $i_1, \ldots, i_p$. We will repeatedly make use of the dominated convergence theorem (DCT) and its variant given in Prakasa-Rao (1983, p. 35), which we call the PR proposition. We refer to Corollary 1 of Martins-Filho and Yao (2006) as the MFY corollary. Unless otherwise stated, we drop the dummy variables in all integrals for the sake of brevity. Finally, let $0 < \underline{B}_T \le f_T(t) \le \overline{B}_T < \infty$ denote the lower and upper bounds of the density function of $T$, let $|K_1(x)| \le B_K < \infty$ denote the bound on the kernel function $K_1$, and let $|\partial^2 m/\partial t^2| \le B_{2m} < \infty$ denote the bound on the second partial derivative of $m(t,y)$ with respect to $t$.
(a) We further decompose $I_{1n}(s,t)$ as $I_{1n}(s,t) = I_{11n}(s,t) + I_{12n}(s,t) + I_{13n}(s,t)$, where
$$\begin{aligned} I_{11n}(s,t) &= \frac{1}{n^2}\sum_{k=1}^{n}\sum_{i=1}^{n}\sum_{j=1}^{M}\bigg\{\frac{\mathbf{1}(N_i \ge j)}{h_1E(N)f_T(t)}K_1\Big(\frac{T_{ij}-t}{h_1}\Big)e_{ij}(Y_k)m(s,Y_k) + \frac{\mathbf{1}(N_i \ge j)}{h_1E(N)f_T(s)}K_1\Big(\frac{T_{ij}-s}{h_1}\Big)e_{ij}(Y_k)m(t,Y_k)\bigg\},\\ I_{12n}(s,t) &= \frac{1}{2n^2}\sum_{k=1}^{n}\sum_{i=1}^{n}\sum_{j=1}^{M}\bigg\{\frac{h_1\mathbf{1}(N_i \ge j)}{E(N)f_T(t)}K_1\Big(\frac{T_{ij}-t}{h_1}\Big)\Big(\frac{T_{ij}-t}{h_1}\Big)^2 m^{(2)}(t_{ij}^*,Y_k)m(s,Y_k) + \frac{h_1\mathbf{1}(N_i \ge j)}{E(N)f_T(s)}K_1\Big(\frac{T_{ij}-s}{h_1}\Big)\Big(\frac{T_{ij}-s}{h_1}\Big)^2 m^{(2)}(t_{ij}^*,Y_k)m(t,Y_k)\bigg\},\\ I_{13n}(s,t) &= \frac{1}{n}\sum_{k=1}^{n}\big\{m(s,Y_k)A_n(t,Y_k) + m(t,Y_k)A_n(s,Y_k)\big\}, \end{aligned}$$
which we analyze individually below.
(a-i) We first show that $E\|I_{11n}\|^2 = O(\{nh_1\}^{-1})$. We write $I_{11n}(s,t)$ as
$$I_{11n}(s,t) = \frac{1}{2n^2}\sum_{k=1}^{n}\sum_{i=1}^{n}\sum_{(2)}\big\{h_{ik}(s,t) + h_{ik}(t,s)\big\} = \frac{1}{2n^2}\sum_{k=1}^{n}\sum_{i=1}^{n}\psi_n(Z_i,Z_k;s,t) = \frac{1}{2}v_n(s,t),$$
where $v_n(s,t)$ is a $V$-statistic with symmetric kernel $\psi_n(Z_i,Z_k;s,t)$ and
$$h_{ik}(s,t) = \sum_{j=1}^{M}\frac{\mathbf{1}(N_i \ge j)}{h_1E(N)f_T(t)}K_1\Big(\frac{T_{ij}-t}{h_1}\Big)e_{ij}(Y_k)m(s,Y_k).$$
Since $E[e_{ij}(Y_k)|T_{ij},Y_k] = 0$, it is easy to show that $E[h_{ik}(s,t)] = E[h_{ik}(t,s)] = E[h_{ki}(s,t)] = E[h_{ki}(t,s)] = 0$. Thus $\theta_n(s,t) = E[\psi_n(Z_i,Z_k;s,t)] = 0$. Additionally,
$$\psi_{1n}(Z_i;s,t) = E[\psi_n(Z_i,Z_k;s,t)|Z_i] = \sum_{j=1}^{M}\frac{\mathbf{1}(N_i \ge j)}{h_1E(N)f_T(t)}K_1\Big(\frac{T_{ij}-t}{h_1}\Big)E[e_{ij}(Y_k)m(s,Y_k)|Z_i] + \sum_{j=1}^{M}\frac{\mathbf{1}(N_i \ge j)}{h_1E(N)f_T(s)}K_1\Big(\frac{T_{ij}-s}{h_1}\Big)E[e_{ij}(Y_k)m(t,Y_k)|Z_i].$$
Provided $E[\psi_n^2(Z_i,Z_k;s,t)] = o(n)$, the MFY corollary gives $nE[v_n(s,t) - u_n(s,t)]^2 = o(1)$, where $u_n(s,t) = 2n^{-1}\sum_{i=1}^{n}\psi_{1n}(Z_i;s,t)$ is the projection of the corresponding $U$-statistic. Recall that the projection of a $U$-statistic is a sum of i.i.d. random variables $\psi_{1n}(Z_i;s,t)$. Thus $E\|I_{11n}\|^2 \le 2n^{-1}\int\!\int\mathrm{var}(E[h_{ik}(s,t)|Z_i]) + 2n^{-1}\int\!\int\mathrm{var}(E[h_{ik}(t,s)|Z_i]) + o(n^{-1})$, and
$$\begin{aligned} \frac{2}{n}\int\!\!\int\mathrm{var}\big(E[h_{ik}(s,t)|Z_i]\big) &\le \sum_{j=1}^{M}\frac{2P(N_i \ge j)}{nh_1^2E(N)}\int\!\!\int f_T^{-2}(t)\,E\Big\{K_1^2\Big(\frac{T_{ij}-t}{h_1}\Big)E^2\big[e_{ij}(Y_k)m(s,Y_k)\,|\,Z_i\big]\Big\}\\ &= \sum_{j=1}^{M}\frac{2P(N_i \ge j)}{nh_1E(N)}\int\!\!\int\!\!\int f_T^{-2}(t)K_1^2(u)\,E_{X_i,Y_i,\varepsilon_i}\Big\{E_{Y_k}^2\big[e_{ij}(Y_k)m(s,Y_k)\,\big|\,T_{ij} = t+uh_1\big]\Big\}f_T(t+uh_1)\,du\,ds\,dt\\ &\to \sum_{j=1}^{M}\frac{2\|K_1\|^2P(N_i \ge j)}{nh_1E(N)}\int\!\!\int f_T^{-1}(t)\,E_{X_i,Y_i,\varepsilon_i}\Big\{E_{Y_k}^2\big[e_{ij}(Y_k)m(s,Y_k)\,\big|\,T_{ij} = t\big]\Big\}\\ &\le \frac{8\|K_1\|^2}{nh_1\underline{B}_T}E\|X\|^4 + \frac{4\|K_1\|^2\sigma^2}{nh_1\underline{B}_T}E\|X\|^2 = O\Big(\frac{1}{nh_1}\Big), \end{aligned}$$
where the first line follows from the Cauchy-Schwarz inequality, the second by substituting $u = h_1^{-1}(T_{ij}-t)$ and observing that $T_{ij}$ is independent of $X_i, Y_i, \varepsilon_i$, and the third by the DCT, since the integrand is bounded by $4\underline{B}_T^{-2}\overline{B}_TB_K^2E\|X\|^4 + 2\underline{B}_T^{-2}\overline{B}_TB_K^2\sigma^2E\|X\|^2$.
Thus $E\|I_{11n}\|^2 = O(\{nh_1\}^{-1})$, provided that $E[\psi_n^2(Z_i,Z_k;s,t)] = o(n)$ for all $i,k$, which we show below. For $i\ne k$,
\[
E[\psi_n^2(Z_i,Z_k;s,t)] = 2E[h_{ik}^2(s,t)] + 2E[h_{ik}^2(t,s)] + 4E[h_{ik}(s,t)h_{ik}(t,s)] + 4E[h_{ik}(s,t)h_{ki}(s,t)] + 4E[h_{ik}(s,t)h_{ki}(t,s)].
\]
Observe that
\[
n^{-1}E[h_{ik}^2(s,t)] = \sum_{j=1}^{M}\sum_{l=1}^{M}\frac{P(N_i\ge \max(j,l))}{E^2(N) f_T^2(t)}\, E\Big\{(nh_1^2)^{-1} K_1\Big(\frac{T_{ij}-t}{h_1}\Big) K_1\Big(\frac{T_{il}-t}{h_1}\Big) e_{ij}(Y_k)\, e_{il}(Y_k)\, m^2(s,Y_k)\Big\}.
\]
For $j=l$, the PR proposition applied to the expectation on the right hand side gives $n^{-1}h_1^{-1}\|K_1\|^2 f_T(t)\, E[e_{ij}^2(Y_k)\, m^2(s,Y_k)\mid T_{ij}=t] = o(1)$ provided $nh_1\to\infty$. For $j\ne l$, a similar application gives $n^{-1}f_T^2(t)\, E[e_{ij}(Y_k)\, e_{il}(Y_k)\, m^2(s,Y_k)\mid T_{ij}=T_{il}=t] = o(1)$. The next two terms, $E[h_{ik}^2(t,s)]$ and $E[h_{ik}(s,t)h_{ik}(t,s)]$, can be handled similarly. For the remaining two terms, we apply the PR proposition twice to derive
\begin{align*}
n^{-1}E[h_{ik}(s,t)h_{ki}(s,t)] ={}& \sum_{j=1}^{M}\sum_{l=1}^{M}\frac{P(N_i\ge j)P(N_k\ge l)}{n E^2(N) f_T^2(t)}\int\!\!\int K_1(u) K_1(v)\, f_T(t+uh_1)\, f_T(t+vh_1)\\
&\times E[e_{ij}(Y_k)\, e_{kl}(Y_i)\, m(s,Y_k)\, m(s,Y_i)\mid T_{ij}=t+uh_1,\, T_{kl}=t+vh_1] = o(1).
\end{align*}
The calculations above can be used in the same manner to derive similar results for the case $i=k$. Thus we have $E[\psi_n^2(Z_i,Z_k;s,t)] = o(n)$.
(a-ii) We will now show $E\|I_{12n}\|^2 = O(h_1^4) + o(n^{-1})$, writing $I_{12n}(s,t)$ as
\[
I_{12n}(s,t) = \frac{1}{4n^2}\sum_{i=1}^{n}\sum_{k=1}^{n}\sum_{(2)}\big[h_{ik}(s,t)+h_{ik}(t,s)\big] = \frac{1}{4n^2}\sum_{i=1}^{n}\sum_{k=1}^{n}\psi_n(Z_i,Z_k;s,t) = \frac{1}{4}v_n(s,t),
\]
where $v_n(s,t)$ is a $V$-statistic with
\[
h_{ik}(s,t) = \sum_{j=1}^{M}\frac{h_1\mathbf{1}(N_i\ge j)}{E(N) f_T(t)} K_1\Big(\frac{T_{ij}-t}{h_1}\Big)\Big(\frac{T_{ij}-t}{h_1}\Big)^2 m^{(2)}(t_{ij}^*,Y_k)\, m(s,Y_k).
\]
By the MFY corollary, $nE[v_n(s,t)-u_n(s,t)]^2 = o(1)$ provided $E[\psi_n^2(Z_i,Z_k;s,t)] = o(n)$ for all $i,k$. Hence
\[
E\|I_{12n}\|^2 = \frac{1}{16}\int\!\!\int\big\{E^2[u_n(s,t)] + \mathrm{var}(u_n(s,t))\big\} + o(n^{-1}),
\]
where the projection of the $U$-statistic is $u_n(s,t) = 2n^{-1}\sum_{i=1}^{n}\psi_{1n}(Z_i;s,t) - \theta_n(s,t)$, with $\psi_{1n}(Z_i;s,t) = \sum_{(2)}\{E[h_{ik}(s,t)\mid Z_i] + E[h_{ik}(t,s)\mid Z_i]\}$ and mean $\theta_n(s,t) = E[u_n(s,t)] = 2E[h_{ik}(s,t)] + 2E[h_{ik}(t,s)]$. Observe that $E^2[u_n(s,t)] \le 4E^2[h_{ik}(s,t)] + 4E^2[h_{ik}(t,s)]$. Treating $E^2[h_{ik}(t,s)]$ similarly, we use the DCT to derive
\[
4h_1^{-4}\int\!\!\int E^2[h_{ik}(s,t)] \to \sum_{j=1}^{M}\frac{4B_K^2 P^2(N_i\ge j)}{5E(N)}\int E[(m^{(2)}(t,Y_k))^2]\int E[m^2(s,Y_k)] \le 4C_1 B_K^2 B_{2m}^2 E\|X\|^2 = O(1),
\]
where $C_1 = \int_{\mathcal{T}} u^4\, du$. This leads to $\int\!\!\int E^2[u_n(s,t)] = O(h_1^4)$. Next,
\begin{align*}
\mathrm{var}\big(h_1^{-2}u_n(s,t)\big) = 4(nh_1^4)^{-1}\big\{& E[E^2(h_{ik}(s,t)\mid Z_i)] + E[E^2(h_{ik}(t,s)\mid Z_i)] + E[E^2(h_{ki}(s,t)\mid Z_i)] + E[E^2(h_{ki}(t,s)\mid Z_i)]\\
&+ 2E[E(h_{ik}(s,t)\mid Z_i)E(h_{ki}(s,t)\mid Z_i)] + 2E[E(h_{ik}(s,t)\mid Z_i)E(h_{ki}(t,s)\mid Z_i)]\\
&+ 2E[E(h_{ik}(t,s)\mid Z_i)E(h_{ki}(s,t)\mid Z_i)] + 2E[E(h_{ik}(t,s)\mid Z_i)E(h_{ki}(t,s)\mid Z_i)]\\
&- 4E^2[h_{ik}(s,t)] - 4E^2[h_{ik}(t,s)] - 4E[h_{ik}(s,t)]E[h_{ki}(t,s)]\big\}.
\end{align*}
Firstly, for $j\ne l$, using the DCT, it can be shown that $4(nh_1^4)^{-1}\int\!\!\int E[E^2(h_{ik}(s,t)\mid Z_i)]$ is bounded by $4n^{-1}B_{2m}^2\sigma_K^4 E\|X\|^2 = O(n^{-1})$. For $j=l$, it can be shown to be bounded by $n^{-1}h_1^{-1}B_K^2 B_{2m}\underline{B}_T^{-1} E\|X\| = O(\{nh_1\}^{-1})$. Combining the previous two results shows that $4(nh_1^4)^{-1}\int\!\!\int E[E^2(h_{ik}(s,t)\mid Z_i)] = o(1)$, provided $nh_1\to\infty$. All of the remaining terms can be handled similarly using the DCT, so $\int\!\!\int\mathrm{var}(u_n(s,t)) = o(h_1^4)$. Thus we have $E\|I_{12n}\|^2 = O(h_1^4) + o(n^{-1})$ provided $E[\psi_n^2(Z_i,Z_k;s,t)] = o(n)$, which can be shown similarly using the PR proposition as before.
(a-iii) We now show $\|I_{13n}\|^2 = O_p(n^{-1}h_1 + h_1^6)$. Following Lemma 2 of Martins-Filho and Yao (2007),
\begin{align*}
|A_n(t,Y_k)| &= \bigg|\sum_{j=1}^{M}\sum_{i=1}^{n}\frac{\mathbf{1}(N_i\ge j)}{n h_1 E(N)}\Big\{W_n\Big(\frac{T_{ij}-t}{h_1},\,t\Big) - f_T^{-1}(t)\, K_1\Big(\frac{T_{ij}-t}{h_1}\Big)\Big\}\, U_{ij}^*(t,Y_k)\bigg|\\
&\le h_1^{-1}\Big\{(1,0)\big(S_n^{-1}(t)-S^{-1}(t)\big)^2(1,0)'\Big\}^{1/2}\bigg(\bigg|\sum_j\sum_i\frac{\mathbf{1}(N_i\ge j)}{n E(N)} K_1\Big(\frac{T_{ij}-t}{h_1}\Big) U_{ij}^*(t,Y_k)\bigg|\\
&\qquad + \bigg|\sum_j\sum_i\frac{\mathbf{1}(N_i\ge j)}{n E(N)} K_1\Big(\frac{T_{ij}-t}{h_1}\Big)\Big(\frac{T_{ij}-t}{h_1}\Big) U_{ij}^*(t,Y_k)\bigg|\bigg)\\
&= h_1^{-1}\Big\{(1,0)\big(S_n^{-1}(t)-S^{-1}(t)\big)^2(1,0)'\Big\}^{1/2} R_n(t,Y_k).
\end{align*}
If $nh_1^3/\log n\to\infty$, a direct application of Lemma 1(b) of Martins-Filho and Yao (2007) gives $\sup_{t\in\mathcal{T}} h_1^{-1}\big|\{(1,0)(S_n^{-1}(t)-S^{-1}(t))^2(1,0)'\}^{1/2}\big| = O_p(1)$. Next,
\[
R_n(t,Y_k) \le |R_{n1}(t,Y_k)| + |R_{n2}(t,Y_k)| + |R_{n3}(t,Y_k)| + |R_{n4}(t,Y_k)|,
\]
where
\begin{align*}
R_{n1}(t,Y_k) &= \sum_{j=1}^{M}\sum_{i=1}^{n}\frac{\mathbf{1}(N_i\ge j)}{n E(N)} K_1\Big(\frac{T_{ij}-t}{h_1}\Big) e_{ij}(Y_k),\\
R_{n2}(t,Y_k) &= \sum_{j=1}^{M}\sum_{i=1}^{n}\frac{h_1^2\,\mathbf{1}(N_i\ge j)}{2n E(N)} K_1\Big(\frac{T_{ij}-t}{h_1}\Big)\Big(\frac{T_{ij}-t}{h_1}\Big)^2 m^{(2)}(t_{ij}^*,Y_k),\\
R_{n3}(t,Y_k) &= \sum_{j=1}^{M}\sum_{i=1}^{n}\frac{\mathbf{1}(N_i\ge j)}{n E(N)} K_1\Big(\frac{T_{ij}-t}{h_1}\Big)\Big(\frac{T_{ij}-t}{h_1}\Big) e_{ij}(Y_k),\\
R_{n4}(t,Y_k) &= \sum_{j=1}^{M}\sum_{i=1}^{n}\frac{h_1^2\,\mathbf{1}(N_i\ge j)}{2n E(N)} K_1\Big(\frac{T_{ij}-t}{h_1}\Big)\Big(\frac{T_{ij}-t}{h_1}\Big)^3 m^{(2)}(t_{ij}^*,Y_k).
\end{align*}
Thus $n^{-1}\sum_k m(s,Y_k) R_{n1}(t,Y_k) = h_1 f_T(t) I_{11n}(s,t)$, and from the analysis of $I_{11n}$ this implies $\|h_1 f_T I_{11n}\|^2 = O_p(n^{-1}h_1)$. Secondly, $n^{-1}\sum_k m(s,Y_k) R_{n2}(t,Y_k) = h_1 f_T(t) I_{12n}(s,t)$, and from the analysis of $I_{12n}$ this implies $\|h_1 f_T I_{12n}\|^2 = O_p(h_1^6)$. It follows similarly that the third and fourth remaining terms are $O_p(n^{-1}h_1)$ and $O_p(h_1^6)$, respectively. Hence $\|I_{13n}\|^2 = O_p(n^{-1}h_1 + h_1^6)$. Combining the previous results thus shows that $\|I_{1n}\|^2 = O_p(\{nh_1\}^{-1} + h_1^4)$.
(b) These terms are of higher order and are omitted for brevity.

(c) By the law of large numbers, $\|n^{-1}\sum_{i=1}^{n} m(s,Y_i)\, m(t,Y_i) - \Lambda(s,t)\|^2 = O_p(n^{-1})$. Combining the previous results leads to $\|\hat{\Lambda}-\Lambda\|^2 = O_p\{(nh_1)^{-1}\}$, given $h_1^4 = O\{(nh_1)^{-1}\}$.
3.D Proof of Theorem 3.3
For a bounded linear operator $A$, let $\|A\|$ denote the norm defined on the space of bounded linear operators from $H$ to itself, i.e., $\|A\| = \sup\{\|Af\|_H : \|f\|_H \le 1\}$. To facilitate the theoretical analysis, for each $k=1,\dots,K$, let $\eta_k = \Sigma^{1/2}\beta_k$ (resp. $\hat{\eta}_{k,s_n} = \hat{\Sigma}_{s_n}^{1/2}\hat{\beta}_{k,s_n}$) be the normalized eigenvectors of the eigenvalue problem $\Sigma^{-1}\Lambda\Sigma^{-1/2}\eta_k = \lambda_k\beta_k$ (resp. $\hat{\Sigma}_{s_n}^{-1}\hat{\Lambda}\hat{\Sigma}_{s_n}^{-1/2}\hat{\eta}_{k,s_n} = \hat{\lambda}_{k,s_n}\hat{\beta}_{k,s_n}$). Recall that $\Sigma^{-1}$ and $\Sigma^{-1/2}$ are well-defined by suitably restricting the domain as in Proposition 3.1. This allows us to write
\begin{align*}
\|\hat{\beta}_{k,s_n}-\beta_k\|_H &\le \|\hat{\lambda}_{k,s_n}^{-1}\hat{\Sigma}_{s_n}^{-1}\hat{\Lambda}\hat{\Sigma}_{s_n}^{-1/2} - \lambda_k^{-1}\Sigma^{-1}\Lambda\Sigma^{-1/2}\| + \lambda_k^{-1}\|\Sigma^{-1}\Lambda\Sigma^{-1/2}\|\,\|\hat{\eta}_{k,s_n}-\eta_k\|\\
&\le \hat{\lambda}_{k,s_n}^{-1}\|\hat{\Sigma}_{s_n}^{-1}\hat{\Lambda}\hat{\Sigma}_{s_n}^{-1/2} - \Sigma^{-1}\Lambda\Sigma^{-1/2}\| + \|\Sigma^{-1}\Lambda\Sigma^{-1/2}\|\big(|\hat{\lambda}_{k,s_n}^{-1}-\lambda_k^{-1}| + \lambda_k^{-1}\|\hat{\eta}_{k,s_n}-\eta_k\|\big),
\end{align*}
using the inequality $\hat{\lambda}_{k,s_n}^{-1} \le \lambda_k^{-1} + |\hat{\lambda}_{k,s_n}^{-1}-\lambda_k^{-1}|$. Applying standard theory for self-adjoint compact operators (Bosq, 2000) gives
\[
|\hat{\lambda}_{k,s_n}-\lambda_k| \le \|\hat{\Sigma}_{s_n}^{-1/2}\hat{\Lambda}\hat{\Sigma}_{s_n}^{-1/2} - \Sigma^{-1/2}\Lambda\Sigma^{-1/2}\|, \qquad \|\hat{\eta}_{k,s_n}-\eta_k\|_H \le 2\sqrt{2}\,\delta_k^{-1}\|\hat{\Sigma}_{s_n}^{-1/2}\hat{\Lambda}\hat{\Sigma}_{s_n}^{-1/2} - \Sigma^{-1/2}\Lambda\Sigma^{-1/2}\|,
\]
where $\delta_1 = \lambda_1-\lambda_2$ and $\delta_k = \min(\lambda_{k-1}-\lambda_k,\, \lambda_k-\lambda_{k+1})$ for $k>1$. Thus, for each $k=1,\dots,K$, we have
\[
\|\hat{\beta}_{k,s_n}-\beta_k\|_H^2 = O_p(I_{1n}+I_{2n}),
\]
where $I_{1n} = \|\hat{\Sigma}_{s_n}^{-1}\hat{\Lambda}\hat{\Sigma}_{s_n}^{-1/2} - \Sigma^{-1}\Lambda\Sigma^{-1/2}\|^2$ and $I_{2n} = \|\hat{\Sigma}_{s_n}^{-1/2}\hat{\Lambda}\hat{\Sigma}_{s_n}^{-1/2} - \Sigma^{-1/2}\Lambda\Sigma^{-1/2}\|^2$. Below, we show that $I_{1n} = O_p\big(s_n^{3a+2}/(nh_1) + s_n^{(4a-2b+4)_+}/(nh_2^2) + 1/s_n^{2b-2a-1}\big)$. The calculations for $I_{2n}$ are similar. Observe $I_{1n} \le 3I_{11n} + 3I_{12n} + 3I_{13n}$, where
\begin{align*}
I_{11n} &= \|\Sigma_{s_n}^{-1}\Lambda\Sigma_{s_n}^{-1/2} - \Sigma^{-1}\Lambda\Sigma^{-1/2}\|^2,\\
I_{12n} &= \|\hat{\Sigma}_{s_n}^{-1}\Lambda\hat{\Sigma}_{s_n}^{-1/2} - \Sigma_{s_n}^{-1}\Lambda\Sigma_{s_n}^{-1/2}\|^2,\\
I_{13n} &= \|\hat{\Sigma}_{s_n}^{-1}\hat{\Lambda}\hat{\Sigma}_{s_n}^{-1/2} - \hat{\Sigma}_{s_n}^{-1}\Lambda\hat{\Sigma}_{s_n}^{-1/2}\|^2,
\end{align*}
which we study separately below.
(a) Recall that $\Pi_{s_n} = \sum_{j=1}^{s_n}\phi_j\otimes\phi_j$ is the orthogonal projector onto the eigenspace associated with the $s_n$ largest eigenvalues of $\Sigma$. Let $I$ denote the identity operator and $\Pi_{s_n}^{\perp} = I - \Pi_{s_n}$ denote the operator perpendicular to $\Pi_{s_n}$, i.e., $\Pi_{s_n}^{\perp}$ is the orthogonal projector onto the eigenspace associated with eigenvalues of $\Sigma$ less than $\alpha_{s_n}$. Thus $\Sigma_{s_n}^{-1}\Lambda\Sigma_{s_n}^{-1/2} = \Pi_{s_n}\Sigma^{-1}\Lambda\Sigma^{-1/2}\Pi_{s_n}$, which allows us to write $I_{11n} \le \|\Pi_{s_n}^{\perp}\Sigma^{-1}\Lambda\Sigma^{-1/2}\| + \|\Sigma^{-1}\Lambda\Sigma^{-1/2}\Pi_{s_n}^{\perp}\|$. Note that the range of $\Lambda\Sigma^{-1/2}$ is spanned by $\beta_1,\dots,\beta_K$, and a direct calculation leads to
\[
\|\Pi_{s_n}^{\perp}\Sigma^{-1}\Lambda\Sigma^{-1/2}\|^2 \le \sum_{k=1}^{K}\lambda_k^2\Big\|\sum_{i>s_n}\alpha_i^{-1}\sum_{j=1}^{\infty} b_{kj}\langle\phi_i,\phi_j\rangle\phi_i\Big\|^2 \le \sum_{k=1}^{K}\lambda_k^2\sum_{j>s_n}\frac{b_{kj}^2}{\alpha_j^2} \le C_1\sum_{k=1}^{K}\lambda_k^2\sum_{j>s_n} j^{-2b+2a} = O\Big(\frac{1}{s_n^{2b-2a-1}}\Big),
\]
and similarly for $\|\Sigma^{-1}\Lambda\Sigma^{-1/2}\Pi_{s_n}^{\perp}\|^2$.
(b) We decompose $I_{12n}$ as $I_{12n} \le 3I_{121n} + 3I_{122n} + 3I_{123n}$, where
\begin{align*}
I_{121n} &= \|(\hat{\Sigma}_{s_n}^{-1}-\Sigma_{s_n}^{-1})\Lambda\Sigma_{s_n}^{-1/2}\|^2,\\
I_{122n} &= \|\Sigma_{s_n}^{-1}\Lambda(\hat{\Sigma}_{s_n}^{-1/2}-\Sigma_{s_n}^{-1/2})\|^2,\\
I_{123n} &= \|(\hat{\Sigma}_{s_n}^{-1}-\Sigma_{s_n}^{-1})\Lambda(\hat{\Sigma}_{s_n}^{-1/2}-\Sigma_{s_n}^{-1/2})\|^2.
\end{align*}

(b-i) Note $I_{121n} \le 6\|\Lambda\Sigma^{-1/2}\Pi_{s_n}\|^2 I_{1211n} + 6\|\Lambda\Sigma^{-1/2}\Pi_{s_n}\|^2 I_{1212n}$, where
\[
I_{1211n} = \Big\|\sum_{j=1}^{s_n}(\hat{\alpha}_j^{-1}-\alpha_j^{-1})\hat{\phi}_j\otimes\hat{\phi}_j\Big\|^2, \qquad I_{1212n} = \Big\|\sum_{j=1}^{s_n}\alpha_j^{-1}(\hat{\phi}_j\otimes\hat{\phi}_j - \phi_j\otimes\phi_j)\Big\|^2.
\]
Then $I_{1211n} \le \sum_{j=1}^{s_n}(\hat{\alpha}_j-\alpha_j)^2(\hat{\alpha}_j\alpha_j)^{-2} \le C_2\|\hat{\Sigma}-\Sigma\|^2\sum_{j=1}^{s_n} j^{4a} = O_p\big(s_n^{4a+1}/(nh_2^2)\big)$, where the second inequality follows from $|\hat{\alpha}_j-\alpha_j| \le \|\hat{\Sigma}-\Sigma\|$ for self-adjoint compact operators. Similarly, $I_{1212n} \le 2\sum_{j=1}^{s_n}\alpha_j^{-2}\|\hat{\phi}_j-\phi_j\|^2 \le C_3\|\hat{\Sigma}-\Sigma\|^2\sum_{j=1}^{s_n} j^{4a+2} = O_p\big(s_n^{4a+3}/(nh_2^2)\big)$, where the second inequality follows from $\|\hat{\phi}_j-\phi_j\| \le C\delta_j^{-1}\|\hat{\Sigma}-\Sigma\|$ for self-adjoint compact operators. Similar to the calculation for $I_{11n}$, $\|\Lambda\Sigma^{-1/2}\Pi_{s_n}\|^2 = O(s_n^{-2b+1})$. Thus $I_{121n} = O_p\big(s_n^{(4a-2b+4)_+}/(nh_2^2)\big)$.

(b-ii) Using similar decompositions as for $I_{121n}$, we write $I_{122n} \le 6\|\Pi_{s_n}\Sigma^{-1}\Lambda\|^2 I_{1221n} + 6\|\Pi_{s_n}\Sigma^{-1}\Lambda\|^2 I_{1222n}$, where
\[
I_{1221n} = \Big\|\sum_{j=1}^{s_n}(\hat{\alpha}_j^{-1/2}-\alpha_j^{-1/2})\hat{\phi}_j\otimes\hat{\phi}_j\Big\|^2, \qquad I_{1222n} = \Big\|\sum_{j=1}^{s_n}\alpha_j^{-1/2}(\hat{\phi}_j\otimes\hat{\phi}_j-\phi_j\otimes\phi_j)\Big\|^2.
\]
Then $I_{1221n} \le \sum_{j=1}^{s_n}(\hat{\alpha}_j-\alpha_j)^2(\tilde{\alpha}_j\alpha_j^2)^{-1} \le C_4\|\hat{\Sigma}-\Sigma\|^2\sum_{j=1}^{s_n} j^{3a} = O_p\big(s_n^{3a+1}/(nh_2^2)\big)$, where the first inequality follows from the Mean Value Theorem with $\tilde{\alpha}_j$ between $\hat{\alpha}_j$ and $\alpha_j$. Next, $I_{1222n} \le C_4\|\hat{\Sigma}-\Sigma\|^2\sum_{j=1}^{s_n} j^{3a+1} = O_p\big(s_n^{3a+2}/(nh_2^2)\big)$. Also, $\|\Pi_{s_n}\Sigma^{-1}\Lambda\|^2 = O(s_n^{-2b+1})$ as before, and thus $I_{122n} = o_p\big(s_n^{(4a-2b+2)_+}/(nh_2^2)\big)$.

(b-iii) Using similar calculations, $I_{123n}$ can also be shown to be $o_p\big(s_n^{(4a-2b+2)_+}/(nh_2^2)\big)$. Since $I_{121n}$ dominates, this gives $I_{12n} = O_p\big(s_n^{(4a-2b+4)_+}/(nh_2^2)\big)$ as a result.
(c) Observe $I_{13n} \le \|\hat{\Sigma}_{s_n}^{-1}\|^2\,\|\hat{\Lambda}-\Lambda\|^2\,\|\hat{\Sigma}_{s_n}^{-1/2}\|^2$, where $\|\hat{\Sigma}_{s_n}^{-1}\|^2 \le \sum_{j=1}^{s_n}\hat{\alpha}_j^{-2} \le C_5\sum_{j=1}^{s_n} j^{2a} = O_p(s_n^{2a+1})$ and similarly $\|\hat{\Sigma}_{s_n}^{-1/2}\|^2 = O_p(s_n^{a+1})$. From Theorem 3.2 we have $\|\hat{\Lambda}-\Lambda\|^2 = O_p(\{nh_1\}^{-1})$. Thus $I_{13n} = O_p\big(s_n^{3a+2}/(nh_1)\big)$. Combining the previous results leads to
\[
I_{1n} = O_p\Big(\frac{1}{s_n^{2b-2a-1}} + \frac{s_n^{(4a-2b+4)_+}}{nh_2^2} + \frac{s_n^{3a+2}}{nh_1}\Big).
\]
Chapter 4

Cumulative Variance Estimation for Classification
4.1 Introduction
In a typical classification problem in functional data analysis (FDA), one observes a training set $\{(X_i, Y_i) : 1 \le i \le n\}$, where $X_i$ is a random function and $Y_i \in \{0, 1, \dots, C-1\}$ is a known class label. Analogous to multivariate classification, the goal is to predict to which class a new observation $X_0$ belongs. This problem has been studied extensively in FDA. Pfeiffer et al. (2002) suggested a simple method using summary statistics such as the mode; James and Hastie (2001) and Shin (2008) extended linear discriminant analysis; James and Sugar (2003) developed a clustering method for sparse functional data; Hall et al. (2001) and Song et al. (2008) constructed classifiers based on functional principal components (FPCs); Leng and Muller (2006) proposed functional logistic regression on FPCs; Ferraty and Vieu (2003) estimated posterior probabilities using kernel estimators; Biau et al. (2005) worked with a nearest neighbor-type classifier of FPCs; Ferraty et al. (2007) extended multivariate factorial analysis; Cuevas et al. (2007) and Cuesta-Albertos and Nieto-Reyes (2008) considered classification based on data depth; Wang et al. (2007) studied Bayesian classification using wavelets; Tian and James (2013) projected the functional process onto simple piecewise constant and piecewise linear functions; and Hall and Delaigle (2012) showed that perfect asymptotic classification is possible if the functional process satisfies certain smoothness conditions.
In this chapter, we study functional classification from the perspective of effective dimension reduction (EDR). Recall from Chapter 3 that EDR methods assume a very flexible semiparametric multiple index model
\[
Y = g\big(\langle\beta_1, X\rangle, \dots, \langle\beta_K, X\rangle;\, \varepsilon\big). \tag{4.1}
\]
Dimension reduction is particularly useful when the process $X$ is infinite dimensional, since it is natural to expect that the information relevant to the separation of the $C$ classes is contained in only a small number of projections $\langle\beta_1, X\rangle, \dots, \langle\beta_K, X\rangle$. Despite the sizable literature on EDR methods for multivariate regression, the corresponding literature for classification has been relatively scarce. Cook (1996) and Cook and Critchley (2000) pursued an exploratory graphical approach by studying binary plots of at most three projections of $X$ onto the EDR space. Cook and Lee (1999) showed that SIR (Li, 1991) can only detect differences between the means of the two underlying classes, while sliced average variance estimation (SAVE; Cook and Weisberg, 1991) can detect both mean and covariance differences. Li (2000) and Cook and Yin (2001) further proved that SIR's EDR directions and Fisher's linear discriminant (LDA) coordinates are proportional and thus span the same subspace. However, Velilla (2008, 2010) showed that quadratic discriminant analysis (QDA; Schott, 1993) and SAVE estimate vastly different subspaces since proportionality does not hold.
By analogy to the multivariate case (Zhu et al., 2010), we extend the ideas in Chapter 3 for Functional Cumulative Slicing (FCS) to derive Functional Cumulative Variance (FCV), the cumulative slicing version of SAVE. Our primary motivation for developing FCV is that, although classification poses no conceptual or theoretical challenges to EDR methods in general, first moment methods such as FCS suffer because, as we will demonstrate later, they do not adequately estimate the EDR space in practice. Following the same strategy of "pooling data together across subjects," our proposal is applicable to both densely/completely and sparsely observed functional data. The rest of this chapter is organized as follows. We present the proposed FCV methodology and estimation procedure in Chapter 4.2, Chapter 4.3 provides numerical studies of simulated examples, and Chapter 4.4 applies the method to temporal gene expression data.
4.2 Methodology
Although our method is applicable to $C$-class classification, we will assume $C = 2$ for simplicity. We observe data pairs $\{(X_i, Y_i) : 1 \le i \le n\}$ independent and identically distributed (i.i.d.) as $(X, Y)$, where $X_i$ is a random function defined on the real and separable Hilbert space $H \equiv L^2(\mathcal{T})$ for a compact interval $\mathcal{T}$, and $Y_i$ is its class label, which equals $k$ if $X_i$ is sampled from the subpopulation $\Pi_k$, $k = 0, 1$. Let $\pi_0$ and $\pi_1 = 1 - \pi_0$ denote the probabilities that $X$ is drawn from subpopulations $\Pi_0$ and $\Pi_1$, respectively. Finally, we make the same assumption as in Chapter 3 on the first and fourth moments of $X$.

Assumption 4.1. $X$ is centered and has a finite fourth moment, $\int_{\mathcal{T}} E[X^4(t)]\, dt < \infty$.

Recall that under Assumption 4.1, the covariance surface of $X$ is given by $\Sigma(s,t) = E[X(s)X(t)]$, which generates a Hilbert-Schmidt operator $\Sigma = E[X \otimes X]$ on $H$. By Mercer's Theorem, $\Sigma$ admits a spectral decomposition $\Sigma = \sum_{j=1}^{\infty} \alpha_j \phi_j \otimes \phi_j$, where the eigenfunctions $\{\phi_j\}_{j=1,2,\dots}$ form a complete and orthonormal system in $H$ and the eigenvalues $\{\alpha_j\}_{j=1,2,\dots}$ are assumed to be strictly decreasing and positive such that $\sum_{j=1}^{\infty} \alpha_j < \infty$. Finally, recall that the EDR directions $\beta_1, \dots, \beta_K$ in model (4.1) are linearly independent functions in $H$, and the response $Y$ is assumed to be conditionally independent of $X$ given the $K$ projections $\langle\beta_1, X\rangle, \dots, \langle\beta_K, X\rangle$.
When the response is binary, the FCS operator defined in (3.2) reduces to $\Lambda_{FCS} = \pi_0 w(0)\, E[X\mathbf{1}(Y=0)] \otimes E[X\mathbf{1}(Y=0)]$ and thus can recover only one EDR direction, regardless of the complexity of the underlying EDR space. In general, for $C$-class classification, first moment EDR methods such as FSIR and FCS can recover at most $C-1$ EDR directions. This obvious limitation, combined with their restrictiveness in only being able to detect differences between class means, motivates our development of FCV, a second order EDR method.
4.2.1 Validity of Functional Cumulative Variance
Originally proposed to estimate the EDR space when SIR fails, SAVE captures second moment information on $X|Y$ and targets the EDR space through the operator $\Lambda_{SAVE} = E\{\Sigma - \mathbb{V}[X|Y]\}^2$. By analogy to Zhu et al. (2010), who extended multivariate cumulative slicing to cumulative variance, we replace $\mathbb{V}[X|Y]$ with its cumulative version
\[
\mathbb{V}[X\mathbf{1}(Y\le y)] = E\big\{\big(X\mathbf{1}(Y\le y) - E[X\mathbf{1}(Y\le y)]\big) \otimes \big(X\mathbf{1}(Y\le y) - E[X\mathbf{1}(Y\le y)]\big)\big\}.
\]
This leads to the functional cumulative variance operator
\[
\Lambda_{FCV} = E\big[\Lambda^2(Y)\big], \tag{4.2}
\]
where $\Lambda(y) = \mathbb{V}[X\mathbf{1}(Y\le y)] - F(y)\Sigma$ and $F(y) = P(Y \le y)$. The following theorem establishes the validity of FCV. Analogous to the multivariate case, the linearity and constant variance assumptions are needed. For any function $b \in H$:

Assumption 4.2. The conditional mean $E[\langle b, X\rangle \mid \langle\beta_1, X\rangle, \dots, \langle\beta_K, X\rangle]$ is a linear function of $\langle\beta_1, X\rangle, \dots, \langle\beta_K, X\rangle$.

Assumption 4.3. The conditional variance $\mathbb{V}[\langle b, X\rangle \mid \langle\beta_1, X\rangle, \dots, \langle\beta_K, X\rangle]$ is non-random.

Assumption 4.2 is the same as Assumption 3.2 in Chapter 3, used there to derive the validity of functional cumulative slicing. Recall that a sufficient condition for the linearity assumption is that $X$ has an elliptically contoured distribution, which is more general than, but bears a close connection to, a Gaussian process (Cambanis et al., 1981; Li and Hsing, 2010). Assumption 4.3 is much more restrictive since it is satisfied if $X$ is a Gaussian process, but only holds approximately if $X$ has an elliptically contoured distribution (Shao et al., 2007).
Theorem 4.1. If Assumptions 4.1-4.3 hold for model (4.1), then $\mathrm{span}(\{\Lambda(y) : y \in \mathbb{R}\}) \subseteq \mathrm{span}(\Sigma\beta_1, \dots, \Sigma\beta_K)$.

A corollary to Theorem 4.1 is that $\mathrm{range}(\Lambda_{FCV}) \subseteq \mathrm{span}(\Sigma\beta_1, \dots, \Sigma\beta_K)$. It is also easy to see that $\mathrm{range}(\Lambda_{FCS}) \subseteq \mathrm{range}(\Lambda_{FCV})$, so functional cumulative variance is a more comprehensive method for estimating the EDR space than FCS. If $\Lambda_{FCV}$ has $K$ non-zero eigenvalues, the space spanned by its eigenfunctions is precisely $\mathrm{span}(\Sigma\beta_1, \dots, \Sigma\beta_K)$. Similar to FCS in Chapter 3, recall that our target is the subcentral space $S_{Y|X}$, even though the EDR directions themselves are not identifiable. For specificity, we again regard the eigenfunctions of $\Sigma^{-1}\Lambda_{FCV}$ associated with the $K$ largest non-zero eigenvalues as the index functions $\beta_1, \dots, \beta_K$ themselves unless stated otherwise.

We refer the reader to Chapter 3.2 for dealing with the unboundedness of the operator $\Sigma^{-1}$. An assumption analogous to Assumption 3.3 on the principal components of $X$ is needed to ensure that $\Sigma^{-1}\Lambda_{FCV}$ is well-defined.
4.2.2 Functional Cumulative Variance for Sparse Functional Data
For data $\{(X_i, Y_i) : 1 \le i \le n\}$ independently and identically distributed (i.i.d.) as $(X, Y)$, the predictor trajectories $X_i$ are observed intermittently, contaminated with noise, and collected in the form of repeated measurements $\{(T_{ij}, U_{ij}) : 1 \le i \le n,\ 1 \le j \le N_i\}$, where $U_{ij} = X_i(T_{ij}) + \varepsilon_{ij}$ with i.i.d. measurement errors $\varepsilon_{ij}$ that have zero mean and constant variance $\sigma_x^2$ and are independent of all other random variables. When only a few observations are available for some or even all subjects, individual smoothing to recover $X_i$ is infeasible, and one must adopt the strategy of pooling together data from across subjects for consistent estimation.

As in functional cumulative slicing in Chapter 3, both the unconditional mean $m(t,y) = E[X(t)\mathbf{1}(Y \le y)]$ and the covariance surface $\Sigma(s,t) = E[X(s)X(t)]$ can be estimated by the local linear estimators defined in (3.3) and (3.5), respectively.
We use a local linear estimator similar to that of $\Sigma(s,t)$ to estimate $\mathbb{V}[X\mathbf{1}(Y \le y)]$. Let $G_i(T_{ij}, T_{il}; y) = \{U_{ij}\mathbf{1}(Y_i \le y) - m(T_{ij}, y)\}\{U_{il}\mathbf{1}(Y_i \le y) - m(T_{il}, y)\}$ denote the "raw" covariances of $X\mathbf{1}(Y \le y)$. It is easy to check that $E[G_i(T_{ij}, T_{il}; y) \mid T_{ij}, T_{il}] \approx V(T_{ij}, T_{il}; y) + F(y)\sigma_x^2\delta_{jl}$, where $V(s,t;y) = \mathrm{cov}(X(s)\mathbf{1}(Y \le y),\, X(t)\mathbf{1}(Y \le y))$ and $\delta_{jl}$ is 1 if $j = l$ and 0 otherwise. This suggests that the diagonal of $G_i$ should be removed, and thus minimizing
\[
\sum_{i=1}^{n}\sum_{1 \le j \ne l \le N_i}\big\{G_i(T_{ij}, T_{il}; y) - b_0 - b_1(T_{ij}-s) - b_2(T_{il}-t)\big\}^2\, K_2\Big(\frac{T_{ij}-s}{h_2},\, \frac{T_{il}-t}{h_2}\Big) \tag{4.3}
\]
with respect to $(b_0, b_1, b_2)$ yields $\hat{V}(s,t;y) = \hat{b}_0$, where $K_2$ is a non-negative bivariate kernel density and $h_2 = h_2(n)$ is a bandwidth chosen by leave-one-curve-out cross-validation.
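The minimization in (4.3) is a weighted least squares fit of a plane at each grid point $(s,t)$. The sketch below is a minimal Python illustration under stated assumptions: a product Epanechnikov kernel for $K_2$, a precomputed mean estimate passed in as m_hat, and all names hypothetical.

import numpy as np

def epan(u):
    # One-dimensional Epanechnikov kernel; K2 is taken as its product form (an assumption).
    return 0.75 * np.maximum(1.0 - u**2, 0.0)

def local_linear_V(s, t, y, T, U, Y, m_hat, h2):
    # Solve the weighted least squares problem in (4.3) at (s, t) and return
    # b0, the local linear estimate of V(s, t; y). T, U are lists of per-subject
    # arrays; m_hat(times, y) is a precomputed estimate of m (an assumption).
    rows, rhs, wts = [], [], []
    for Ti, Ui, Yi in zip(T, U, Y):
        resid = Ui * float(Yi <= y) - m_hat(Ti, y)    # U_ij 1(Y_i <= y) - m_hat(T_ij, y)
        for j in range(len(Ti)):
            for l in range(len(Ti)):
                if j == l:                             # remove the diagonal raw covariances
                    continue
                w = epan((Ti[j] - s) / h2) * epan((Ti[l] - t) / h2)
                if w > 0.0:
                    rows.append([1.0, Ti[j] - s, Ti[l] - t])
                    rhs.append(resid[j] * resid[l])    # raw covariance G_i(T_ij, T_il; y)
                    wts.append(w)
    X, g, w = np.asarray(rows), np.asarray(rhs), np.asarray(wts)
    b, *_ = np.linalg.lstsq(X * np.sqrt(w)[:, None], g * np.sqrt(w), rcond=None)
    return b[0]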
Then, the kernel associated with the operator $\Lambda_{FCV}$ in (4.2) can be estimated by its sample moment
\[
\hat{\Lambda}_{FCV}(s,t) = \frac{1}{n}\sum_{i=1}^{n}\big\{\hat{V}(s,t;Y_i) - \hat{F}(Y_i)\hat{\Sigma}(s,t)\big\}^2, \tag{4.4}
\]
which reduces to $\hat{\Lambda}_{FCV}(s,t) = \hat{\pi}_0\{\hat{V}(s,t;0) - \hat{\pi}_0\hat{\Sigma}(s,t)\}^2$ when the response is binary. Finally, the estimated EDR directions $\{\hat{\beta}_{k,s_n}\}_{k=1,\dots,K}$ are the eigenfunctions associated with the $K$ largest nonzero eigenvalues of $\hat{\Sigma}_{s_n}^{-1}\hat{\Lambda}_{FCV}$, where $\hat{\Sigma}_{s_n}^{-1}$ is defined in (3.6).
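Once $\hat{\Sigma}$ and $\hat{\Lambda}_{FCV}$ have been evaluated on a common regular grid, the eigen-step can be carried out numerically. The following sketch is only illustrative, not the thesis's implementation; the grid discretization, the Riemann quadrature rule, and all names are assumptions.

import numpy as np

def edr_directions(Sigma, Lam, grid, K, sn):
    # Sketch: leading eigenfunctions of the truncated operator Sigma_sn^{-1} Lam
    # from kernel matrices Sigma, Lam evaluated on a regular grid.
    dt = grid[1] - grid[0]
    # Spectral decomposition of the covariance operator (Riemann quadrature).
    evals, evecs = np.linalg.eigh(Sigma * dt)
    order = np.argsort(evals)[::-1][:sn]
    alpha = evals[order]                        # alpha_1 >= ... >= alpha_sn
    phi = evecs[:, order] / np.sqrt(dt)         # eigenfunctions with unit L2 norm
    Kinv = (phi / alpha) @ phi.T                # kernel of the truncated inverse Sigma_sn^{-1}
    A = Kinv @ Lam * dt**2                      # matrix of Sigma_sn^{-1} Lam acting on grid values
    w, v = np.linalg.eig(A)
    idx = np.argsort(-np.abs(w))[:K]
    beta = np.real(v[:, idx])                   # estimated EDR directions beta_k
    return beta / np.sqrt(np.sum(beta**2, axis=0) * dt)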
The situation for completely observed $X_i$ is similar to the multivariate case and considerably simpler. The quantity $V(s,t;y)$ is easily estimated by its sample moment $\hat{V}(s,t;y) = n^{-1}\sum_{i=1}^{n}\{X_i(s)\mathbf{1}(Y_i \le y) - \hat{m}(s,y)\}\{X_i(t)\mathbf{1}(Y_i \le y) - \hat{m}(t,y)\}$, where $\hat{m}(t,y) = n^{-1}\sum_{i=1}^{n} X_i(t)\mathbf{1}(Y_i \le y)$, while the estimate of $\Lambda_{FCV}$ remains the same as (4.4). For densely observed $X_i$, individual smoothing can be used as a preprocessing step to recover smooth trajectories, and the estimation error introduced in this step can be shown to be asymptotically negligible under certain design conditions, i.e., it is equivalent to the ideal situation of completely observed $X_i$'s (Hall et al., 2006).
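For fully observed curves on a common grid, the sample moment above amounts to a few lines of linear algebra; a minimal sketch follows (array shapes and names are illustrative assumptions).

import numpy as np

def v_hat_complete(X, Y, y):
    # Sample-moment estimate of V(s, t; y) for fully observed curves:
    # X is an (n, m) array of curves on a common grid, Y the (n,) responses.
    n = X.shape[0]
    XI = X * (Y <= y).astype(float)[:, None]   # X_i(t) 1(Y_i <= y)
    m_hat = XI.mean(axis=0)                    # m_hat(t, y)
    R = XI - m_hat                             # X_i(t) 1(Y_i <= y) - m_hat(t, y)
    return R.T @ R / n                         # V_hat(s, t; y)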
4.3 Simulations
In this section we illustrate the practical performance of the proposed FCV method, using reduced rank quadratic discriminant analysis (see Hastie et al., 2009, chap. 4.3.3) to split the $K$-dimensional EDR space into $C = 2$ regions for class prediction. For $i = 1, \dots, n$, let $Z_i = \big(\langle\hat{\beta}_{1,s_n}, X_i\rangle, \dots, \langle\hat{\beta}_{K,s_n}, X_i\rangle\big)^{\top}$ denote the $K$-variate random variable obtained by projecting $X_i$ onto the EDR space estimated via FCV. For a new observation $Z_0 = \big(\langle\hat{\beta}_{1,s_n}, X_0\rangle, \dots, \langle\hat{\beta}_{K,s_n}, X_0\rangle\big)^{\top}$, we calculate the reduced rank quadratic discriminant function
\[
\delta_k(Z_0) = -\frac{1}{2}\log|\hat{\Sigma}_k| - \frac{1}{2}(Z_0 - \hat{\mu}_k)^{\top}\hat{\Sigma}_k^{-1}(Z_0 - \hat{\mu}_k) + \log\hat{\pi}_k, \tag{4.5}
\]
where $\hat{\mu}_k$ and $\hat{\Sigma}_k$ are, respectively, the mean vector and covariance matrix of subpopulation $\Pi_k$ calculated from the reduced variables $Z_i$, and $\hat{\pi}_k$ is the estimated proportion of subpopulation $\Pi_k$. We classify $X_0$ to subpopulation $\Pi_0$ if $\delta_0(Z_0) > \delta_1(Z_0)$, and to $\Pi_1$ if $\delta_0(Z_0) < \delta_1(Z_0)$. We remind the reader from Chapter 3.4 that $\langle\hat{\beta}_{k,s_n}, X_i\rangle$ is computed by an integral approximation when the functional data are dense, while $X_i$ is replaced by its PACE (Yao et al., 2005a) estimate $\hat{X}_i$ when the functional data are sparse.
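The classification rule in (4.5) is ordinary QDA applied to the $K$-dimensional scores; a minimal sketch under illustrative names:

import numpy as np

def qda_fit(Z, Y):
    # Per-class mean, covariance, and prior from the projected scores Z (n, K).
    params = []
    for k in (0, 1):
        Zk = Z[Y == k]
        mu = Zk.mean(axis=0)
        Sk = np.atleast_2d(np.cov(Zk, rowvar=False))
        params.append((mu, Sk, len(Zk) / len(Z)))
    return params

def qda_classify(z0, params):
    # Evaluate delta_k(z0) of (4.5) for each class and pick the larger one.
    deltas = []
    for mu, Sk, pik in params:
        d = z0 - mu
        deltas.append(-0.5 * np.linalg.slogdet(Sk)[1]
                      - 0.5 * d @ np.linalg.solve(Sk, d)
                      + np.log(pik))
    return int(np.argmax(deltas))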
We compare our proposal to (i) functional SAVE in the same reduced rank QDA framework, (ii) FCS in a reduced rank LDA framework, (iii) QDA on the FPCs (Hall et al., 2001), and (iv) a Naive Bayes (NB) classifier on the FPCs. In all of the following simulations we generate a total of $n = 100$ curves from $\Pi_0$ and $\Pi_1$ with respective sizes $n_0 = n/2$ and $n_1 = n/2$. For $k = 0, 1$, functional processes from $\Pi_k$ are generated as $X_{ki}(t) = \sum_{j=1}^{40}(\theta_{kj} + \mu_{kj})\phi_j(t)$, where $\theta_{kj}$ is i.i.d. $N(-(\lambda_{kj}/2)^{1/2}, \lambda_{kj}/2)$ with probability $1/2$ and $N((\lambda_{kj}/2)^{1/2}, \lambda_{kj}/2)$ with probability $1/2$. The $\lambda_{kj}$ and $\mu_{kj}$ are selected depending on the property of FCV we want to illustrate below. In each case the measurement error on $X_{ki}$ is i.i.d. $N(0, 0.01)$, the domain of observation is $t \in [0, 1]$, and the eigenfunctions are $\phi_j(t) = \sin(\pi t j/2)/\sqrt{2}$ for $j$ even and $\phi_j(t) = \cos(\pi t j/2)/\sqrt{2}$ for $j$ odd. For dense functional data the $T_{ij}$ are 101 equispaced points in $[0, 1]$, while for sparse functional data the number of observations per subject $N_i$ is chosen uniformly from $\{5, \dots, 14\}$ and the observation times $T_{ij}$ are i.i.d. $U(0, 1)$.
Shown in Table 4.1 are the combinations of $\lambda_{kj}$ and $\mu_{kj}$ that are considered. Model A captures the general classification problem where both the inter-class means and covariances differ, model B depicts the scenario where only the inter-class covariances differ, and model C describes the scenario where only the inter-class means differ. We compute the average percentage of misclassification and its standard error over 100 Monte Carlo repetitions, shown in Table 4.2 for the sparse design. The structure dimension $K$ and the truncation parameter $s_n$ are chosen by minimizing the misclassification rate. These results suggest that FCV is optimal when the inter-class covariances are distinct, but that FCS is optimal otherwise. The results for FCV and FSAVE when the inter-class covariances are equal corroborate those of Zhu and Hastie (2003), who showed that multivariate SAVE tends to over-emphasize second-order differences between classes while ignoring first-order differences.
Table 4.1: Shown are the combinations of $\lambda_{kj}$ and $\mu_{kj}$ we use in our simulation study.

Model   $\lambda_{0j}$   $\lambda_{1j}$   $\mu_{0j}$                                  $\mu_{1j}$
A       $j^{-3}$         $4j^{-2}$        $\mu_{01}=\mu_{02}=\mu_{03}=\mu_{04}=1$     0 for all $j$
B       $j^{-3}$         $4j^{-2}$        0 for all $j$                               0 for all $j$
C       $3j^{-2}$        $3j^{-2}$        $\mu_{01}=\mu_{02}=\mu_{03}=\mu_{04}=1$     0 for all $j$

Table 4.2: Shown are the average misclassification error (×100%) with its standard error (in parentheses), and the optimal $K$ and $s_n$ that minimize the average misclassification error over 100 Monte Carlo repetitions for sparse functional data.

Model   FCV            FSAVE          FCS           QDA           NB
A       16.11 (1.36)   19.48 (1.52)   22.94 (.44)   21.93 (.59)   28.83 (.79)
        K=3, sn=3      K=3, sn=3      K=1, sn=3     sn=3          sn=2
B       21.82 (.35)    24.34 (.35)    47.72 (.43)   46.51 (.43)   30.80 (.48)
        K=5, sn=5      K=5, sn=6      K=1, sn=5     sn=2          sn=2
C       33.31 (.81)    37.87 (.86)    25.91 (.48)   27.28 (.47)   27.74 (.63)
        K=2, sn=2      K=3, sn=3      K=1, sn=4     sn=3          sn=3

4.4 Data Applications

In this section we study temporal gene expression data for the yeast cell cycle (Spellman et al., 1998). Each trajectory contains 18 observations of gene expression, measured every 7 minutes between 0 and 119 minutes. 92 genes were identified, of which 43 are known to regulate the G1 ($Y = 1$) phase and the remaining 49 are known to regulate the non-G1 ($Y = 0$) phase. The functional trajectories are shown in Figure 4.1. To artificially create sparse functional trajectories from this dense data, we randomly select 9 observations from each trajectory. In Table 4.3, we present the minimized average 5-fold cross-validated prediction error over 20 random partitions for the different methods, together with the selected structural dimensions and truncation sizes. The two second order EDR methods, FCV and FSAVE, are virtually indistinguishable from each other, but both compare very favorably to the other methods.

Table 4.3: Shown are the average misclassification error (×100%) with its standard error (in parentheses), and the optimal $K$ and $s_n$ that minimize the average 5-fold cross-validated classification error for the temporal gene expression data.

FCV           FSAVE         FCS           QDA           NB
15.11 (.25)   15.12 (.27)   21.76 (.34)   37.47 (.51)   40.01 (.69)
K=2, sn=2     K=2, sn=2     K=1, sn=2     sn=3          sn=4
Figure 4.1: Temporal gene expressions. [Two panels of expression trajectories (values from −3 to 3) plotted against time (0 to 120 minutes): genes regulating the G1 phase and genes regulating the non-G1 phase.]
4.A Appendix: Proof of Theorem 4.1
It suffices to show that for any $b \in H$, $\langle b, \Sigma\beta_k\rangle = 0$ for all $k = 1, \dots, K$ implies $\langle b, \Lambda(y)b\rangle = 0$. First, observe that $\langle b, \Lambda(y)b\rangle = \langle b, \mathbb{V}[X\mathbf{1}(Y \le y)]b\rangle - F(y)\langle b, \Sigma b\rangle$. Then,
\begin{align*}
\langle b, \mathbb{V}[X\mathbf{1}(Y \le y)]b\rangle &= \langle b, E[X \otimes X\mathbf{1}(Y \le y)]b\rangle - \langle b, (E[X\mathbf{1}(Y \le y)] \otimes E[X\mathbf{1}(Y \le y)])b\rangle\\
&= E\big[\langle b, X\rangle^2\mathbf{1}(Y \le y)\big] - E^2\big[\langle b, X\rangle\mathbf{1}(Y \le y)\big]\\
&= E\big\{E[\langle b, X\rangle^2 \mid \langle\beta_1, X\rangle, \dots, \langle\beta_K, X\rangle]\,\mathbf{1}(Y \le y)\big\} - E^2\big\{E[\langle b, X\rangle \mid \langle\beta_1, X\rangle, \dots, \langle\beta_K, X\rangle]\,\mathbf{1}(Y \le y)\big\}\\
&= E\big[\mathbf{1}(Y \le y)\big]\, E\big\{E[\langle b, X\rangle^2 \mid \langle\beta_1, X\rangle, \dots, \langle\beta_K, X\rangle]\big\}\\
&= F(y)\langle b, \Sigma b\rangle,
\end{align*}
where the second-to-last equality follows by invoking the linearity and constant variance assumptions. Thus, $\langle b, \Lambda(y)b\rangle = 0$, as desired.
Bibliography
Adler, R. J. and Taylor, J. E. (2007), Random Fields and Geometry, Springer Monographs in Mathematics, Springer.

Ash, R. B. and Gardner, M. F. (1975), Topics in Stochastic Processes, Probability and Mathematical Statistics, Vol. 27, New York: Academic Press [Harcourt Brace Jovanovich Publishers].

Biau, G., Bunea, F., and Wegkamp, M. H. (2005), "Functional classification in Hilbert spaces," IEEE Transactions on Information Theory, 51, 2163–2172.

Bosq, D. (2000), Linear Processes in Function Spaces: Theory and Applications, vol. 149, New York: Springer-Verlag Inc.

Cai, T. T. and Hall, P. (2006), "Prediction in functional linear regression," The Annals of Statistics, 34, 2159–2179.

Cambanis, S., Huang, S., and Simons, G. (1981), "On the theory of elliptically contoured distributions," Journal of Multivariate Analysis, 11, 368–385.

Cardot, H., Ferraty, F., and Sarda, P. (1999), "Functional linear model," Statistics & Probability Letters, 45, 11–22.

Chen, D., Hall, P., and Muller, H.-G. (2011), "Single and multiple index functional regression models with nonparametric link," The Annals of Statistics, 39, 1720–1747.

Chiaromonte, F., Cook, D. R., and Li, B. (2002), "Sufficient Dimension Reduction in Regressions with Categorical Predictors," The Annals of Statistics, 30, 475–497.

Cook, D. R. (1996), "Graphics for regressions with a binary response," Journal of the American Statistical Association, 91, 983–992.

— (1998), Regression Graphics: Ideas for Studying Regressions through Graphics, vol. 318 of Probability and Statistics, Wiley.

Cook, D. R. and Critchley, F. (2000), "Identifying Regression Outliers and Mixtures Graphically," Journal of the American Statistical Association, 95, 781–794.

Cook, D. R., Forzani, L., and Yao, A.-F. (2010), "Necessary and sufficient conditions for consistency of a method for smoothed functional inverse regression," Statistica Sinica, 20, 235–238.

Cook, D. R. and Lee, H. (1999), "Dimension Reduction in Binary Response Regression," Journal of the American Statistical Association, 94, 1187–1200.

Cook, D. R. and Weisberg, S. (1991), "Comment on 'Sliced Inverse Regression for Dimension Reduction'," Journal of the American Statistical Association, 86, 328–332.

Cook, D. R. and Yin, X. (2001), "Special Invited Paper: Dimension Reduction and Visualization in Discriminant Analysis (with discussion)," Australian and New Zealand Journal of Statistics, 43, 147–199.

Cuesta-Albertos, J. and Nieto-Reyes, A. (2008), "The random Tukey depth," Computational Statistics & Data Analysis, 52, 4979–4988.

Cuevas, A., Febrero, M., and Fraiman, R. (2007), "Robust estimation and classification for functional data via projection-based depth notions," Computational Statistics & Data Analysis, 22, 481–496.

Demidenko, E. (2004), Mixed Models: Theory and Applications, Wiley Series in Probability and Statistics, Wiley.

Di, C.-Z., Crainiceanu, C. M., Caffo, B. S., and Punjabi, N. M. (2011), "Multilevel functional principal component analysis," Annals of Applied Statistics, 3, 458–488.

Duan, N. and Li, K.-C. (1991), "Slicing regression: a link-free regression method," The Annals of Statistics, 19, 505–530.

Fan, J. and Gijbels, I. (1996), Local Polynomial Modelling and Its Applications, vol. 66 of Monographs on Statistics and Applied Probability, London: Chapman & Hall.

Ferraty, F. and Vieu, P. (2003), "Curves Discrimination: a Nonparametric Functional Approach," Computational Statistics & Data Analysis, 44, 161–173.

Ferraty, F., Vieu, P., and Pla-Viguier, S. (2007), "Factor-based comparison of groups of curves," Computational Statistics & Data Analysis, 51, 4903–4910.

Ferre, L. and Yao, A. F. (2003), "Functional sliced inverse regression analysis," Statistics, 37, 475–488.

Ferre, L. and Yao, A.-F. (2005), "Smoothed functional inverse regression," Statistica Sinica, 15, 665–683.

Fisher, R. A. (1918), "The Correlation Between Relatives on the Supposition of Mendelian Inheritance," Transactions of the Royal Society of Edinburgh, 52, 399–433.

Griswold, C., Gomulkiewicz, R., and Heckman, N. (2008), "Hypothesis testing in comparative and experimental studies of function-valued traits," Evolution, 62, 1229–1242.

Hall, P. and Delaigle, A. (2012), "Achieving near perfect classification for functional data," Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74, 267–286.

Hall, P. and Horowitz, J. L. (2007), "Methodology and convergence rates for functional linear regression," The Annals of Statistics, 35, 70–91.

Hall, P. and Hosseini-Nasab, M. (2006), "On properties of functional principal components analysis," Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68, 109–126.

Hall, P., Muller, H.-G., and Wang, J.-L. (2006), "Properties of principal component methods for functional and longitudinal data analysis," The Annals of Statistics, 34, 1493–1517.

Hall, P., Muller, H.-G., and Yao, F. (2008), "Modeling sparse generalized longitudinal observations with latent Gaussian processes," Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70, 703–723.

Hall, P., Poskitt, D. S., and Presnell, B. (2001), "A functional data-analytic approach to signal discrimination," Technometrics, 43, 1–9.

Hastie, T. and Tibshirani, R. (1990), Generalized Additive Models, vol. 43 of Monographs on Statistics and Applied Probability, London: Chapman and Hall Ltd.

Hastie, T., Tibshirani, R., and Friedman, J. (2009), The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer Series in Statistics, New York: Springer-Verlag, 2nd ed.

He, G., Muller, H.-G., and Wang, J.-L. (2003), "Functional canonical analysis for square integrable stochastic processes," Journal of Multivariate Analysis, 85, 54–77.

Heckman, N. (2003), "Functional data analysis in evolutionary biology," in Recent Advances and Trends in Nonparametric Statistics, eds. Akritas, M. G. and Politis, D. N., Elsevier, pp. 49–60.

Henderson, C. R. (1950), "Estimation of genetic parameters (abstract)," Annals of Mathematical Statistics, 21, 309–310.

James, G. M. and Hastie, T. J. (2001), "Functional linear discriminant analysis for irregularly sampled curves," Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63, 533–550.

James, G. M., Hastie, T. J., and Sugar, C. A. (2000), "Principal component models for sparse functional data," Biometrika, 87, 587–602.

James, G. M. and Silverman, B. W. (2005), "Functional adaptive model estimation," Journal of the American Statistical Association, 100, 565–576.

James, G. M. and Sugar, C. A. (2003), "Clustering for sparsely sampled functional data," Journal of the American Statistical Association, 98, 397–408.

Kaslow, R. A., Ostrow, D. G., Detels, R., Phair, J. P., Polk, B. F., and Rinaldo, C. R. (1987), "The Multicenter AIDS Cohort Study: Rationale, Organization and Selected Characteristics of the Participants," American Journal of Epidemiology, 126, 310–318.

Kato, T. (1995), Perturbation Theory for Linear Operators, Berlin: Springer-Verlag.

Kirkpatrick, M. and Heckman, N. (1989), "A quantitative genetic model for growth, shape, reaction norms, and other infinite-dimensional characters," Journal of Mathematical Biology, 27, 429–450.

Leng, X. and Muller, H.-G. (2006), "Classification using functional data analysis for temporal gene expression data," Bioinformatics, 22, 68–76.

Li, B. and Wang, S. (2007), "On directional regression for dimension reduction," Journal of the American Statistical Association, 102, 997–1008.

Li, K.-C. (1991), "Sliced inverse regression for dimension reduction," Journal of the American Statistical Association, 86, 316–342, with discussion and a rejoinder by the author.

— (1992), "On principal hessian directions for data visualization and dimension reduction: another application of Stein's lemma," Journal of the American Statistical Association, 87, 1025–1039.

— (2000), "High Dimensional Data Analysis via the SIR/PHD Approach."

Li, Y. and Hsing, T. (2010), "Deciding the dimension of effective dimension reduction space for functional and high-dimensional data," The Annals of Statistics, 38, 3028–3062.

Lin, X. and Carroll, R. J. (2000), "Nonparametric function estimation for clustered data when the predictor is measured without/with error," Journal of the American Statistical Association, 95, 520–534.

Liu, B. and Muller, H.-G. (2009), "Estimating derivatives for samples of sparsely observed functions, with application to on-line auction dynamics," Journal of the American Statistical Association, 104, 704–714.

Loeve, M. (1978), Probability Theory II, vol. 46 of Graduate Texts in Mathematics, Springer.

Lynch, M. and Walsh, B. (1998), Genetics and Analysis of Quantitative Traits, Sinauer.

Martins-Filho, C. and Yao, F. (2006), "A note on the use of V and U statistics in nonparametric models of regression," Annals of the Institute of Statistical Mathematics, 58, 389–406.

— (2007), "Nonparametric frontier estimation via local linear regression," Journal of Econometrics, 141, 283–319.

Meyer, K. (1985), "Genetic parameters for dairy production of Australian Black and White cows," Livestock Production Science, 12, 205–219.

— (1999), "Estimates of genetic and phenotypic covariance functions for postweaning growth and mature weight of beef cows," Journal of Animal Breeding and Genetics, 116, 181–205.

— (2007), "WOMBAT – A tool for mixed model analyses in quantitative genetics by restricted maximum likelihood (REML)," Journal of Zhejiang University Science, 8, 815–821.

Meyer, K., Carrick, M. J., and Donnelly, B. J. P. (1993), "Genetic parameters for growth traits of Australian beef cattle from a multi-breed selection experiment," Journal of Animal Science, 71, 2614–2622.

Meyer, K. and Hill, W. (1997), "Estimation of genetic and phenotypic covariance functions for longitudinal or repeated records by restricted maximum likelihood," Livestock Production Science, 47, 185–200.

Morris, J. S., Vannucci, M., Brown, P. J., and Carroll, R. J. (2003), "Wavelet-based nonparametric modeling of hierarchical functions in colon carcinogenesis," Journal of the American Statistical Association, 98, 573–597, with comments and a rejoinder by the authors.

Muller, H.-G. (2005), "Functional modelling and classification of longitudinal data," Scandinavian Journal of Statistics, 32, 223–240.

— (2008), "Functional modeling of longitudinal data," in Longitudinal Data Analysis (Handbooks of Modern Statistical Methods), eds. Fitzmaurice, G., Davidian, M., Verbeke, G., and Molenberghs, G., New York: Chapman & Hall/CRC, pp. 223–252.

Muller, H.-G. and Prewitt, K. A. (1993), "Multiparameter bandwidth processes and adaptive surface smoothing," Journal of Multivariate Analysis, 47, 1–21.

Muller, H.-G. and Stadtmuller, U. (2005), "Generalized functional linear models," The Annals of Statistics, 33, 774–805.

Peng, J. and Paul, D. (2011), "Principal components analysis for sparsely observed correlated functional data using a kernel smoothing approach," Electronic Journal of Statistics, 5, 1960–2003.

Pfeiffer, R. M., Bura, E., Smith, A., and Rutter, J. L. (2002), "Two approaches to mutation detection based on functional data," Statistics in Medicine, 21, 3447–3464.

Prakasa-Rao, B. (1983), Nonparametric Functional Estimation, Orlando, FL: Academic Press.

Ramsay, J. O., Bock, D. R., and Gasser, T. (1995), "Comparison of height acceleration curves in the Fels, Zurich, and Berkeley growth data," Annals of Human Biology, 22, 413–426.

Ramsay, J. O. and Silverman, B. W. (2005), Functional Data Analysis, Springer Series in Statistics, New York: Springer, 2nd ed.

Rice, J. A. (2004), "Functional and longitudinal data analysis: Perspectives on smoothing," Statistica Sinica, 14, 631–647.

Rice, J. A. and Silverman, B. W. (1991), "Estimating the mean and covariance structure nonparametrically when the data are curves," Journal of the Royal Statistical Society: Series B (Statistical Methodology), 53, 233–243.

Rice, J. A. and Wu, C. O. (2001), "Nonparametric mixed effects models for unequally sampled noisy curves," Biometrics, 57, 253–259.

Schott, J. R. (1993), "Dimensionality reduction in quadratic discriminant analysis," Computational Statistics & Data Analysis, 16, 161–174.

Shao, Y., Cook, D. R., and Weisberg, S. (2007), "Marginal tests with sliced average variance estimation," Biometrika, 94, 285–296.

Shin, H. (2008), "An extension of Fisher's discriminant analysis for stochastic processes," Journal of Multivariate Analysis, 99, 1191–1216.

Song, J. J., Deng, W., Lee, H.-J., and Kwon, D. (2008), "Optimal classification for time-course gene expression data using functional data analysis," Computational Biology and Chemistry, 32, 426–432.

Spellman, P. T., Sherlock, G., Zhang, M. Q., Iyer, V. R., Anders, K., Eisen, M. B., Brown, P. O., Botstein, D., and Futcher, B. (1998), "Comprehensive Identification of Cell Cycle-regulated Genes of the Yeast Saccharomyces cerevisiae by Microarray Hybridization," Molecular Biology of the Cell, 9, 3273–3297.

Tian, T. S. and James, G. M. (2013), "Interpretable dimension reduction for classifying functional data," Computational Statistics and Data Analysis, 57, 282–296.

Tuddenham, R. and Snyder, M. (1954), "Physical growth of California boys and girls from birth to age 18," Calif. Publ. Child Deve., 1, 183–364.

Velilla, S. (2008), "A method for dimension reduction in quadratic classification problems," Journal of Computational and Graphical Statistics, 17, 572–589.

— (2010), "On the structure of the quadratic subspace in discriminant analysis," Journal of Multivariate Analysis, 101, 1239–1251.

Wang, X., Ray, S., and Mallick, B. K. (2007), "Bayesian curve classification using wavelets," Journal of the American Statistical Association, 102, 962–973.

Xia, Y., Tong, H., Li, W., and Zhu, L.-X. (2002), "An adaptive estimation of dimension reduction space," Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64, 363–410.

Yao, F. and Muller, H.-G. (2010), "Empirical dynamics for longitudinal data," The Annals of Statistics, 38, 3458–3486.

Yao, F., Muller, H.-G., Clifford, A. J., Dueker, S. R., Follett, J., Lin, Y., Buchholz, B. A., and Vogel, J. S. (2003), "Shrinkage estimation for functional principal component scores with application to the population kinetics of plasma folate," Biometrics, 59, 676–685.

Yao, F., Muller, H.-G., and Wang, J.-L. (2005a), "Functional data analysis for sparse longitudinal data," Journal of the American Statistical Association, 100, 577–590.

— (2005b), "Functional Linear Regression Analysis for Longitudinal Data," The Annals of Statistics, 33, 2873–2903.

Yuan, M. and Cai, T. T. (2010), "A reproducing kernel Hilbert space approach to functional linear regression," The Annals of Statistics, 38, 3412–3444.

Zhou, L., Huang, J. Z., Martinez, J. G., Maity, A., Baladandayuthapani, V., and Carroll, R. J. (2010), "Reduced Rank Mixed Effects Models for Spatially Correlated Hierarchical Functional Data," Journal of the American Statistical Association, 105, 390–400.

Zhu, L.-P., Zhu, L.-X., and Feng, Z.-H. (2010), "Dimension reduction in regressions through cumulative slicing estimation," Journal of the American Statistical Association, 105, 1455–1466.

Zhu, M. and Hastie, T. J. (2003), "Feature extraction for nonparametric discriminant analysis," Journal of Computational and Graphical Statistics, 12, 101–120.