
Methods for Sparse Functional Data

by

Edwin Kam Fai Lei

A thesis submitted in conformity with the requirements
for the degree of Doctor of Philosophy

Graduate Department of Statistical Sciences
University of Toronto

© Copyright 2014 by Edwin Kam Fai Lei


Abstract

Methods for Sparse Functional Data

Edwin Kam Fai Lei

Doctor of Philosophy

Graduate Department of Statistical Sciences

University of Toronto

2014

The primary aim of this thesis is to study methods for the analysis of sparse functional

data. Since this type of data is observed infrequently and irregularly for each subject, even

simple descriptive statistics such as the mean and covariance must be reformulated. In

the first part of this thesis, we study a related but more challenging problem of recovering

the underlying functional trajectories when the subjects are genetically correlated. The

key idea is to reconstruct the trajectories by using the Karhunen-Loeve expansion of a

random function with a data-driven eigenbasis. In the second part of this thesis, we study

effective dimension reduction for regression of a scalar response on a sparse functional

predictor. Our proposal estimates the effective dimension reduction space in the

presence of sparse functional data; this space has the important property that the projection

of the functional predictor onto it contains as much information on the response as the

functional predictor itself. We derive our estimator’s asymptotic properties and study

its finite sample performance. Lastly, we consider extensions of our effective dimension

reduction procedure for the classification of sparse functional data.


Acknowledgements

First and foremost I would like to thank my supervisor Fang Yao for his patience and

support during the four years of my doctoral studies. Without his timely insights, I

would not have been able to complete this thesis. Secondly, I would like to thank my family

for their unwavering support of my education. Thirdly, I would like to thank the faculty

and staff of the Department of Statistical Sciences for their dedication to the program.

Last but not least I would like to thank Andriy, Angel, Avideh, Darren, David, Eric D.,

Eric Y., Eugene, Jason, Lily, Natalie, and Steve for being great friends.


Contents

1 Introduction
  1.1 Notation, Definitions, and Basic Results
    1.1.1 Theory on Bounded Linear Operators
    1.1.2 Linear Processes in Function Spaces
    1.1.3 Local Polynomial Regression
    1.1.4 Data Model for Independent Subjects
  1.2 Outline of Thesis

2 Data Model for Genetically Correlated Subjects
  2.1 Introduction
    2.1.1 Motivating Application
    2.1.2 Overview
  2.2 Genetic Relationship and Proposed Functional Model
    2.2.1 Background on the Quantitative Genetic Model
    2.2.2 Functional Data Model for Genetically Related Individuals
  2.3 Model Estimation and FPC Representation
    2.3.1 Estimation of Model Components
    2.3.2 FPC Representation for Genetically Related Individuals
  2.4 Application to Weights of Beef Cattle
  2.5 Simulated Examples
  2.6 Conclusion

3 Cumulative Slicing Estimation for Dimension Reduction
  3.1 Introduction
  3.2 Methodology
    3.2.1 Validity of Functional Cumulative Slicing
    3.2.2 Functional Cumulative Slicing for Sparse Functional Data
  3.3 Asymptotic Properties
  3.4 Simulations
  3.5 Data Applications
    3.5.1 Ebay auction data
    3.5.2 Spectrometric data
  3.6 Concluding Remarks
  3.A Regularity Conditions
  3.B Proof of Theorem 3.1
  3.C Proof of Theorem 3.2
  3.D Proof of Theorem 3.3

4 Cumulative Variance Estimation for Classification
  4.1 Introduction
  4.2 Methodology
    4.2.1 Validity of Functional Cumulative Variance
    4.2.2 Functional Cumulative Variance for Sparse Functional Data
  4.3 Simulations
  4.4 Data Applications
  4.A Appendix: Proof of Theorem 4.1

Bibliography


List of Tables

1.1 Commonly used kernel functions in local polynomial regression.

2.1 ISE improvement (%) of the proposed FACE method upon PACE, where Simulation I uses data-based models with different values of $(K_g, K_e)$ and Simulation II examines half-sibling ($\alpha = 0.25$) and full-sibling ($\alpha = 0.5$) family relationships.

3.1 Shown are the model error in the form of the operator norm $\|P_{K,s_n} - P\|$ with its standard error (in parentheses), and the optimal $K$ and $s_n$ that minimize the average model error over 100 Monte Carlo repetitions.

3.2 Shown are the average MSPE with its standard error (in parentheses), and the optimal $K$ and $s_n$ that minimize the average MSPE over 100 Monte Carlo repetitions.

3.3 Average 5-fold cross-validated prediction error over 20 Monte Carlo runs with selected $K$ and $s_n$, for dense spectrometric data.

4.1 Shown are the combinations of $\theta_{kj}$ and $\mu_{kj}$ we use in our simulation study.

4.2 Shown are the average misclassification error (×100%) with its standard error (in parentheses), and the optimal $K$ and $s_n$ that minimize the average misclassification error over 100 Monte Carlo repetitions for sparse functional data.


4.3 Shown are the average misclassification error (×100%) with its standard error (in parentheses), and the optimal $K$ and $s_n$ that minimize the average 5-fold cross-validated classification error for the temporal gene expression data.


List of Figures

1.1 Growth information of 8 girls measured between 1 and 18 years of age.

1.2 Proportion of CD4 cells of 6 HIV-positive males at each visit (years).

1.3 Commonly used kernel functions in local polynomial regression.

2.1 Beef cattle data: frequency distributions.

2.2 Estimated mean function (dark) with observed trajectories (light) for the beef cattle data.

2.3 Non-negative definite estimates of the genetic and environmental covariance functions for the beef cattle data.

2.4 Shown are the first (solid), second (dashed), third (dash-dot), and fourth (dotted) eigenfunctions. Left: first three eigenfunctions of the genetic process, accounting for 98% of the genetic variance. Right: first four eigenfunctions of the environmental process, explaining 98.3% of the environmental variance.

2.5 Estimated trajectories by leave-one-family-out cross-validation (CV) for two families of cows obtained using the FACE method (solid) and the PACE method (dashed), where the first row presents two half-siblings from one family and the bottom three rows present six half-siblings from another family. The legend shows the relative CV error of each cow, $\sum_{k=1}^{N_{ij}} \{U_{ijk} - \hat{X}^{-i}_{ij}(T_{ijk})\}^2 / U_{ijk}^2$, obtained from the two methods, where $\hat{X}^{-i}_{ij}$ is as described in Section 2.4.

3.1 Irregularly and sparsely observed log bid price trajectories of 9 randomly selected auctions over the 7-day duration.

3.2 Average 5-fold cross-validated prediction errors over 20 random partitions across various time domains [0, T], for sparse Ebay auction data.

3.3 Estimated model components for sparse Ebay auction data using FCS with $K = 2$ and $s_n = 2$. The first and second rows of plots show the estimated index functions, i.e., the EDR directions, and the additive link functions, respectively.

3.4 Absorbance trajectories of 215 meat samples measured over 100 equally spaced wavelengths between 850 nm and 1050 nm.

3.5 Estimated model components for spectrometric data using FCS for $(K, s_n) = (2, 5)$. The first and second rows of plots show the estimated EDR directions and additive link functions, respectively.

4.1 Temporal gene expressions.


Chapter 1

Introduction

Functional data analysis (FDA) is concerned with the study of infinite-dimensional data,

such as curves, shapes, and images. Müller (2005) writes,

[Functional data] are affected by time-neighborhood and smoothness relations; time-order is crucial. The analysis changes in a basic way whenever the time order of observations is changed. In contrast, in multivariate statistical analysis, the order of the components of observed random vectors is quite irrelevant, and any changes in this order leads to the same results. This fact and the continuous flow of time, which serves as argument, lead to differences in perspective.

Figure 1.1a provides an example; it shows the heights (cm) of 8 girls measured between 1 and 18 years of age from the Berkeley Growth Study (Tuddenham and Snyder, 1954). Even though each of the measurements of height involves only discrete values, as indicated by the circles on each curve, it is not unreasonable to expect that, had measurements been made at every age, the data would form a smooth curve, as indicated by the linearly connected trajectories between each observation. Ramsay and Silverman (2005) elaborate further on the crucial nature of time-order for this dataset,

The ages themselves must also play an explicit role in our analysis... Although it might be mildly interesting to correlate heights at ages 9, 10 and 10.5, this would not take account of the fact that we expect the correlation for two ages separated by only half a year to be higher than that for a separation of one year.

Under the assumption within FDA that stochastic processes are ultimately smooth

curves, Ramsay et al. (1995) estimated the acceleration curve of the girls’ growth, shown

in Figure 1.1b.

Figure 1.1: Growth information of 8 girls measured between 1 and 18 years of age. [(a) Observed height: height (cm) against age (years); (b) estimated growth acceleration: acceleration (cm/year^2) against age (years).]

Functional data can be further categorized by the observed spacing between measurements. If a stochastic process is observed in its entirety, we call this completely observed functional data, a type rarely encountered in practice. If it is observed on a fine grid, we call this dense functional data. The data shown in Figure 1.1a is an example, even though the measurements are not equally spaced. Finally, if each sample of a stochastic process contains very few observations, we call this sparse functional data. Figure 1.2 provides an example; it shows the proportion of CD4 cells (number of CD4 cells divided by total number of lymphocytes) of 6 out of a total of 283 HIV-positive homosexual males during each of their visits to the clinic (Kaslow et al., 1987). Longitudinal data is very similar in this regard, although longitudinal data analysis typically places a greater emphasis on inferential procedures (Rice, 2004).

Figure 1.2: Proportion of CD4 cells of 6 HIV-positive males at each visit (years). [Six panels of CD4 (%) against visit (year).]

Müller (2008) elaborates further on the practical differences between the three types of functional data designs,

If one was given a sample of entirely observed trajectories $Y_i(t)$, $i = 1, \ldots, N$, for $N$ subjects, a mean trajectory could be defined as the sample average, $\bar{Y}(t) = N^{-1}\sum_{i=1}^{N} Y_i(t)$. However, this relatively straightforward situation is rather the exception than the norm, as we face the following difficulties: The trajectories may be sampled at sparsely distributed times, with timings varying from subject to subject; the measurements may be corrupted by noise and are dependent within the same subject.

This thesis’ primary focus is on modeling and analyzing sparsely observed functional data.

In extreme situations where only a few observations are available for some, or even all,


of the subjects, one must adopt the strategy of “pooling” together data across subjects with the aim that the entire sample is sufficiently dense for consistent estimation. Variations on this strategy will permeate throughout this thesis. A second common theme within this thesis, and within FDA in general, is the use of dimension reduction to achieve tractable solutions. Owing to the rich history of dimension reduction in multivariate data analysis, many of the methods in this thesis are the functional counterparts to established multivariate techniques, such as principal components analysis and effective dimension reduction. Finally, the critical assumption within FDA that underlying stochastic processes are smooth leads to an extensive use of smoothing methods such as local polynomial kernel regression.

1.1 Notation, Definitions, and Basic Results

In this section we introduce the notation and present some definitions and basic theorems (proofs omitted) we will be using throughout this thesis. The following material on operator theory, linear processes in function spaces, and local polynomial kernel regression is primarily adapted from Kato (1995), Bosq (2000), and Fan and Gijbels (1996), respectively. In Chapter 1.1.4, we will introduce a sparse functional data model for independent subjects, adapted from Yao et al. (2005a).

1.1.1 Theory on Bounded Linear Operators

Let $H$ be a separable Hilbert space endowed with inner product $\langle\cdot,\cdot\rangle$ and the norm induced by its inner product, $\|\cdot\| = \sqrt{\langle\cdot,\cdot\rangle}$. Recall an operator $T$ acting on $H$ is bounded if there exists $M < \infty$ such that $\|Tf\| \le M\|f\|$ for all $f \in H$. Let $\mathcal{B}$ be the space of bounded linear operators from $H$ to itself. $\mathcal{B}$ is a Banach space equipped with the uniform, or operator, norm
$$\|T\|_{\mathcal{B}} = \sup_{\|f\| \le 1} \|Tf\|.$$


Definition 1.1. The adjoint operator of $T \in \mathcal{B}$, namely $T^*$, satisfies
$$\langle Tf, g\rangle = \langle f, T^*g\rangle, \quad \forall f, g \in H.$$

Definition 1.2. An operator $T$ is said to be self-adjoint if it is its own adjoint, i.e., $T = T^*$.

Definition 1.3. A bounded operator $T$ is compact if it can be expressed as
$$Tf = \sum_{j=1}^{\infty} t_j \langle f, v_j\rangle u_j, \quad \forall f \in H, \qquad (1.1)$$
where $\{t_j\}_{j\in\mathbb{N}}$ is a decreasing sequence of positive numbers with limit zero, and $\{u_j\}_{j\in\mathbb{N}}$ and $\{v_j\}_{j\in\mathbb{N}}$ are two orthonormal but not necessarily complete sets.

Note that the operator $T$ can be written succinctly as
$$T = \sum_{j=1}^{\infty} t_j\, v_j \otimes u_j,$$
where the tensor product $f \otimes g$ denotes the rank one operator on $H$ that maps $h$ to $(f \otimes g)h = \langle h, f\rangle g$. If $T$ is self-adjoint, then
$$T = \sum_{j=1}^{\infty} t_j\, v_j \otimes v_j, \qquad (1.2)$$
where $\{v_j\}_{j\in\mathbb{N}}$ forms a complete and orthonormal basis of $H$. Observe that (1.2) implies $Tv_j = t_j v_j$ and thus $\{(t_j, v_j)\}_{j\in\mathbb{N}}$ are the eigenelements of $T$.

Definition 1.4. If there exists $K < \infty$ such that
$$T = \sum_{j=1}^{K} t_j\, v_j \otimes u_j,$$
then $T$ is said to be a finite-rank operator with rank $K$.


Definition 1.5. A compact operator $T$ with the expansion in (1.2) is said to be a Hilbert-Schmidt operator if $\sum_{j=1}^{\infty} t_j^2 < \infty$.

Denote the set of Hilbert-Schmidt operators on $H$ by $\mathcal{S}$, which itself is a Hilbert space equipped with inner product
$$\langle S, T\rangle_{\mathcal{S}} = \sum_{j=1}^{\infty} \langle Sv_j, Tv_j\rangle, \quad S, T \in \mathcal{S},$$
and norm
$$\|T\|_{\mathcal{S}} = \sqrt{\langle T, T\rangle_{\mathcal{S}}} = \Bigl(\sum_{j=1}^{\infty} t_j^2\Bigr)^{1/2},$$
where $\{(t_j, v_j)\}_{j\in\mathbb{N}}$ are the eigenelements of $T$. It is easy to check that $\|\cdot\|_{\mathcal{S}} \ge \|\cdot\|_{\mathcal{B}}$. The following theorem connects the difference of two compact operators with their respective eigenelements.

Theorem 1.1. Let $S$ and $T$ be two linear, self-adjoint, and compact operators on $H$ whose respective spectral expansions are given by
$$S = \sum_{j=1}^{\infty} s_j\, v_j \otimes v_j, \quad T = \sum_{j=1}^{\infty} t_j\, u_j \otimes u_j.$$
Then, for any $j \in \mathbb{N}$,
$$|t_j - s_j| \le \|T - S\|_{\mathcal{B}},$$
and
$$\|u_j - v_j\| \le \frac{2\sqrt{2}}{a_j}\, \|T - S\|_{\mathcal{B}},$$
where $a_1 = s_1 - s_2$ and $a_j = \min(s_{j-1} - s_j, s_j - s_{j+1})$ for $j \ge 2$.
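The eigenvalue bound in Theorem 1.1 can be checked numerically in finite dimensions, where self-adjoint compact operators reduce to symmetric matrices. The following is a minimal sketch, assuming only numpy; the matrix sizes and perturbation scale are illustrative choices of ours.

\begin{verbatim}
import numpy as np

# Check |t_j - s_j| <= ||T - S||_B (Theorem 1.1) on random
# symmetric matrices playing the role of self-adjoint operators.
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
S = (A + A.T) / 2                          # self-adjoint "S"
E = 0.05 * rng.standard_normal((6, 6))
T = S + (E + E.T) / 2                      # perturbed operator "T"

s = np.sort(np.linalg.eigvalsh(S))[::-1]   # eigenvalues, decreasing
t = np.sort(np.linalg.eigvalsh(T))[::-1]
op_norm = np.linalg.norm(T - S, 2)         # operator (spectral) norm

assert np.all(np.abs(t - s) <= op_norm + 1e-12)
\end{verbatim}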

1.1.2 Linear Processes in Function Spaces

Hereafter, let $H$ denote the real and separable Hilbert space $L^2(\mathcal{T})$ for a compact interval $\mathcal{T}$. $H$ is equipped with inner product $\langle f, g\rangle = \int_{\mathcal{T}} f(t)g(t)\,dt$ and norm $\|f\| = \sqrt{\langle f, f\rangle}$.


We assume our stochastic process $X$ is $H$-valued with continuous sample paths. The expectation of an $H$-valued random function $X$ is defined as $\mu(t) := E(X)(t) = E(X(t))$ for any $t \in \mathcal{T}$. The covariance function of $X$ is defined by $\Sigma(s,t) := \mathrm{cov}(X(s), X(t)) = E\{(X(s) - \mu(s))(X(t) - \mu(t))\}$ for any $s, t \in \mathcal{T}$. Recall the covariance function is symmetric and positive-definite. We now turn to our first major theorem from functional analysis.

Theorem 1.2 (Mercer’s Theorem). Let $K(s,t)$ be a continuous, symmetric and positive-definite function on $L^2(\mathcal{T} \times \mathcal{T})$. Then there exists an orthonormal basis of $H$, namely $\{\phi_j\}_{j\in\mathbb{N}}$, and a sequence of decreasing positive numbers $\{\lambda_j\}_{j\in\mathbb{N}}$ such that
$$K(s,t) = \sum_{j=1}^{\infty} \lambda_j \phi_j(s)\phi_j(t), \qquad (1.3)$$
where the convergence is uniform on $\mathcal{T} \times \mathcal{T}$.

Corollary 1.1 (Spectral Expansion). If $X$ has a finite second moment, i.e., $E\|X\|^2 < \infty$, then $\Sigma(s,t)$ admits the decomposition
$$\Sigma(s,t) = \sum_{j=1}^{\infty} \lambda_j \phi_j(s)\phi_j(t).$$
Moreover,
$$\int_{\mathcal{T}} \Sigma(s,t)\phi_j(s)\,ds = \lambda_j \phi_j(t), \quad j \in \mathbb{N},$$
and thus $\phi_j(t)$ is the eigenfunction of $\Sigma(s,t)$ associated with eigenvalue $\lambda_j$. We also have the identity
$$\int_{\mathcal{T}} \Sigma(t,t)\,dt = \sum_{j=1}^{\infty} \lambda_j < \infty.$$

The next result will appear in many instances throughout the thesis and serves as the backbone to functional principal components analysis.

Theorem 1.3 (Karhunen-Loeve Expansion). Let $X$ be zero-mean with a finite second moment. Let $\{(\lambda_j, \phi_j)\}_{j\in\mathbb{N}}$ be the eigenelements of $\Sigma(s,t)$. Then $X$ admits the expansion
$$X(t) = \sum_{j=1}^{\infty} \xi_j \phi_j(t), \qquad (1.4)$$
where $\{\xi_j\}_{j\in\mathbb{N}}$ are pairwise uncorrelated zero-mean real-valued random variables with $\lambda_j = E\xi_j^2$, and the convergence is uniform with respect to the $H$-norm.

Corollary 1.2. Let $X$ be a zero-mean Gaussian process with covariance $\Sigma(s,t)$. Let $\{(\lambda_j, \phi_j)\}_{j\in\mathbb{N}}$ be the eigenelements of $\Sigma(s,t)$. Then $X$ admits the expansion
$$X(t) = \sum_{j=1}^{\infty} \xi_j \phi_j(t),$$
where the $\xi_j$ are mutually independent and distributed as $N(0, \lambda_j)$ for $j \in \mathbb{N}$.
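As a concrete illustration of Corollary 1.2, the following minimal sketch simulates zero-mean Gaussian trajectories by truncating the Karhunen-Loeve expansion; the Fourier-type eigenfunctions and the eigenvalue sequence are illustrative choices of ours, not taken from the thesis.

\begin{verbatim}
import numpy as np

# Simulate zero-mean Gaussian trajectories via a truncated
# Karhunen-Loeve expansion X(t) = sum_j xi_j phi_j(t),
# with xi_j ~ N(0, lambda_j) independent (Corollary 1.2).
t = np.linspace(0, 1, 101)
J = 4
lam = 1.0 / (np.arange(1, J + 1) ** 2)           # illustrative eigenvalues
phi = np.array([np.sqrt(2) * np.sin((j + 1) * np.pi * t)
                for j in range(J)])              # orthonormal on [0, 1]

rng = np.random.default_rng(1)
n = 5
xi = rng.standard_normal((n, J)) * np.sqrt(lam)  # FPC scores
X = xi @ phi                                     # n trajectories on the grid
\end{verbatim}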

We are now ready to connect stochastic processes in function spaces to our previous discussion on operator theory. For specificity, we are still working in $H = L^2(\mathcal{T})$. If $X$ has a finite second moment, i.e., $E\|X\|^2 < \infty$, then the kernel operator
$$(\Sigma f)(t) = \int_{\mathcal{T}} \Sigma(s,t)f(s)\,ds, \quad \forall t \in \mathcal{T},\; f \in H,$$
associated with the kernel function $\Sigma(s,t)$ is a bounded linear operator on $H$, i.e., $\Sigma \in \mathcal{B}$. From this definition, it is easy to show that $\Sigma f = E(\langle X, f\rangle X)$ and that the symmetry of $\Sigma(s,t)$ implies $\Sigma$ is self-adjoint. Further,

$$\int_{\mathcal{T}}\int_{\mathcal{T}} \Sigma^2(s,t)\,ds\,dt = \int_{\mathcal{T}}\int_{\mathcal{T}} \Bigl(\sum_{j=1}^{\infty} \lambda_j \phi_j(s)\phi_j(t)\Bigr)^2 ds\,dt = \sum_{j=1}^{\infty}\sum_{k=1}^{\infty} \lambda_j\lambda_k \langle\phi_j, \phi_k\rangle^2 = \sum_{j=1}^{\infty} \lambda_j^2 < \infty,$$
where the first equality follows from applying Mercer’s Theorem, the second from the uniform convergence in Mercer’s Theorem, and the third from the orthonormal nature of $\{\phi_j\}_{j\in\mathbb{N}}$. Thus, $\Sigma$ is a self-adjoint Hilbert-Schmidt operator whose spectral expansion is given by
$$\Sigma = \sum_{j=1}^{\infty} \lambda_j\, \phi_j \otimes \phi_j.$$
In fact, the last identity in Corollary 1.1 implies that $\Sigma$ belongs to the class of nuclear operators, a subset of the Hilbert-Schmidt class, but we will not need this result in the thesis. Note we have incidentally shown that the $H$-norm of $\Sigma(s,t)$ is equal to the Hilbert-Schmidt norm of the operator $\Sigma$.

1.1.3 Local Polynomial Regression

Local polynomial regression provides a flexible approach to studying the relationship between dependent and independent variables without imposing strong functional assumptions on the nature of this relationship. To be precise, given the population pair $(X, Y)$, our primary interest is to study the regression function $m(x) = E(Y \mid X = x)$. From a statistical perspective, we typically assume observed data pairs $\{(X_i, Y_i)\}_{i\in\mathbb{N}}$ are independent and identically distributed (i.i.d.) according to the model
$$Y = m(X) + \varepsilon, \qquad (1.5)$$
where the regression error $\varepsilon$ has zero mean, finite variance, and is independent of $X$.

If we assume that the $(p+1)$th derivative of the conditional mean $m(x)$ exists at a point $x_0$, then we can approximate $m(x)$ by a polynomial of order $p$. Taylor’s expansion in a neighborhood around $x_0$ gives
$$m(x) \approx m(x_0) + \sum_{r=1}^{p} \frac{m^{(r)}(x_0)}{r!}(x - x_0)^r,$$
where $m^{(r)}(x_0)$ is the $r$th derivative of $m$ evaluated at the point $x_0$. Let $m(x_0) = \beta_0$ and $m^{(r)}(x_0)/r! = \beta_r$. These are fitted by solving the weighted least squares problem
$$(\hat\beta_0, \ldots, \hat\beta_p)^\top = \operatorname*{argmin}_{\beta_0, \ldots, \beta_p} \sum_{i=1}^{n} K\Bigl(\frac{X_i - x_0}{h}\Bigr)\Bigl\{Y_i - \sum_{r=0}^{p} \beta_r (X_i - x_0)^r\Bigr\}^2, \qquad (1.6)$$
where $K$ is a kernel function that assigns larger weights to points closer to $x_0$ and, conversely, smaller weights to points farther away. The bandwidth $h$ controls the size of the neighborhood around $x_0$. To estimate the entire function $m$, we solve (1.6) for all points $x_0$ in the domain of interest.
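To make (1.6) concrete, here is a minimal sketch of a local linear fit ($p = 1$) at a single point, assuming only numpy; the helper name local_linear and the Epanechnikov kernel choice are ours for illustration.

\begin{verbatim}
import numpy as np

def local_linear(x0, X, Y, h):
    """Local linear estimate of m(x0) via the weighted LS problem (1.6), p = 1."""
    u = (X - x0) / h
    w = 0.75 * (1 - u**2) * (np.abs(u) <= 1)        # Epanechnikov weights
    D = np.column_stack([np.ones_like(X), X - x0])  # design: [1, X_i - x0]
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(D * sw[:, None], Y * sw, rcond=None)
    return beta[0]                                  # beta0_hat estimates m(x0)

# Usage: recover a smooth trend from noisy pairs (X_i, Y_i).
rng = np.random.default_rng(2)
X = rng.uniform(0, 1, 200)
Y = np.sin(2 * np.pi * X) + 0.2 * rng.standard_normal(200)
grid = np.linspace(0.05, 0.95, 19)
m_hat = np.array([local_linear(x0, X, Y, h=0.1) for x0 in grid])
\end{verbatim}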

Let 1(A) be the indicator function on the set A. Table 1.1 lists several commonly

used kernel functions in local polynomial regression with the corresponding curves in

Figure 1.3. It is well known that the choice of kernel is secondary to the choice of

bandwidth h.

Uniform: $K(u) = \frac{1}{2}\,\mathbf{1}(|u| \le 1)$
Triangular: $K(u) = (1 - |u|)\,\mathbf{1}(|u| \le 1)$
Epanechnikov: $K(u) = \frac{3}{4}(1 - u^2)\,\mathbf{1}(|u| \le 1)$
Gaussian: $K(u) = \frac{1}{\sqrt{2\pi}}\, e^{-u^2/2}$

Table 1.1: Commonly used kernel functions in local polynomial regression.

The large sample performance of local polynomial estimators is almost always assessed by the (integrated) mean squared error (MSE) under the scenario $h \to 0$ as sample size $n \to \infty$. The intuition behind “small-$h$” asymptotics is that one typically requires a smaller neighborhood with larger sample sizes. The MSE has the familiar decomposition into its bias and variance components
$$\mathrm{MSE}(\hat{m}) = \int E\bigl[(\hat{m}(x) - m(x))^2\bigr]\,dx = \int \mathrm{bias}^2(\hat{m}(x))\,dx + \int \mathrm{variance}(\hat{m}(x))\,dx.$$

Figure 1.3: Commonly used kernel functions in local polynomial regression. [Curves of K(u) for the uniform, triangular, Epanechnikov, and Gaussian kernels over u in [-1.5, 1.5].]


1.1.4 Data Model for Independent Subjects

Functional data analysis (FDA) has attracted substantial research interest and has provided powerful tools to study data arising from a collection of curves rather than from scalars or vectors. Ramsay and Silverman (2005) offer a comprehensive introduction to FDA. A key issue in modeling functional data is the representation of the underlying process $X$, which is often of a complex nature and requires regularization. A common approach is to utilize functional principal component (FPC) analysis (FPCA), exploiting a data-driven eigenbasis to represent $X$. When the design of the functional data is dense, FPCA has been studied extensively by Rice and Silverman (1991), James et al. (2000), Yao et al. (2005a), Hall and Hosseini-Nasab (2006), Hall et al. (2006), and references therein. The eigenbasis is the unique canonical basis leading to a generalized Fourier series, i.e., the Karhunen-Loeve expansion (Theorem 1.3). The advantage of this expansion is that it gives the most rapidly convergent representation of $X$ in the $L^2$ sense (Ash and Gardner, 1975). In addition, the connection between the Karhunen-Loeve expansion and Mercer’s Theorem (Theorem 1.2) implies that FPCA also characterizes the dominant modes of variation of a sample of functional data. These theoretical and practical considerations have led FPCA to be one of the standard procedures in FDA.

However, when the functional data is sparse, for example when there are only one or two observations per subject, the standard approach of estimating the FPC scores, i.e., generalized Fourier coefficients, by numerical integration does not work well. Using a reduced rank mixed effects approach, Rice and Wu (2001), James et al. (2000), and James and Sugar (2003) overcame this issue by modeling each individual trajectory as B-splines with random coefficients. However, as noted by Yao et al. (2005a), James et al. (2000) did not study the asymptotic properties of their estimators owing to the complexity of the mixed effects approach, deciding instead to construct pointwise confidence intervals using the bootstrap. In contrast, we review in this section the method of Principal components Analysis through Conditional Expectation (PACE) by Yao et al. (2005a). It recovers the individual trajectories directly through the Karhunen-Loeve expansion and thus allows for the derivation of the relevant asymptotic properties.

Methodology

As in Chapter 1.1, we assume $X$ is a random function defined on $H = L^2(\mathcal{T})$ for a compact interval $\mathcal{T}$. Additionally, $X$ has mean function $\mu(t) = EX(t)$, finite second moment, i.e., $E\|X\|^2 < \infty$, and covariance function $\Sigma(s,t) = \mathrm{cov}(X(s), X(t))$. Let $X_1, \ldots, X_n$ be independently and identically distributed (i.i.d.) as $X$. Mercer’s Theorem implies that there exists a spectral expansion of $\Sigma(s,t)$ whose eigenelements are $\{(\lambda_k, \phi_k)\}_{k\in\mathbb{N}}$. Additionally, the Karhunen-Loeve expansion implies that there exists a generalized Fourier expansion of $X_i(t)$ for $i = 1, \ldots, n$ given by $X_i(t) = \mu(t) + \sum_{k=1}^{\infty} \xi_{ik}\phi_k(t)$, where $\xi_{ik}$ has zero mean and $E[\xi_{ik}\xi_{i'k'}] = \lambda_k$ if $i = i'$, $k = k'$, and $0$ otherwise.

In reality, sparse functional data is often observed with additive measurement error $\varepsilon$, which has zero mean and variance $\sigma^2$. To accurately reflect the nature of sparse functional data, we assume both the number of observations per subject and the observation times to be random. To be precise, let the number of observations per subject, $N_i$, be i.i.d. as $N$, where $N$ is a bounded positive discrete random variable, and let $T_{ij}$ be a random variable on $\mathcal{T}$ that denotes the $j$th observation time of $X_i$. Then, the data model for noisy sparse functional data is
$$U_{ij} = X_i(T_{ij}) + \varepsilon_{ij} = \mu(T_{ij}) + \sum_{k=1}^{\infty} \xi_{ik}\phi_k(T_{ij}) + \varepsilon_{ij}, \quad T_{ij} \in \mathcal{T},\; 1 \le j \le N_i,\; 1 \le i \le n, \qquad (1.7)$$
where the $\varepsilon_{ij}$ are i.i.d. as $\varepsilon$. This eigenfunction approach differs from a random regression model with spline basis functions, as the eigenfunction basis is completely data-driven, while the spline function basis is pre-specified without knowledge of the data.
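The following minimal sketch generates data from model (1.7) under illustrative choices of ours (two eigenfunctions, uniform observation times, 2-5 observations per subject); none of the specific values are taken from the thesis.

\begin{verbatim}
import numpy as np

# Simulate noisy sparse functional data from model (1.7):
# U_ij = mu(T_ij) + sum_k xi_ik phi_k(T_ij) + eps_ij.
rng = np.random.default_rng(3)
n = 100
lam = np.array([1.0, 0.25])                       # illustrative eigenvalues
mu = lambda t: 1 + 2 * t
phi = lambda t: np.array([np.sqrt(2) * np.sin(np.pi * t),
                          np.sqrt(2) * np.cos(np.pi * t)])
sigma = 0.1

T, U = [], []
for i in range(n):
    Ni = rng.integers(2, 6)                       # sparse: 2-5 obs per subject
    Ti = np.sort(rng.uniform(0, 1, Ni))           # random observation times
    xi = rng.standard_normal(2) * np.sqrt(lam)    # FPC scores
    Ui = mu(Ti) + xi @ phi(Ti) + sigma * rng.standard_normal(Ni)
    T.append(Ti); U.append(Ui)
\end{verbatim}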

We use local linear smoothing over the pooled noisy sparse observations to estimate the mean function $\mu(t)$. To be specific, $\hat\mu(t) = \hat{a}_0$, where
$$(\hat{a}_0, \hat{a}_1)^\top = \operatorname*{argmin}_{a_0, a_1} \sum_{i=1}^{n}\sum_{j=1}^{N_i} K_1\Bigl(\frac{T_{ij} - t}{h_1}\Bigr)\{U_{ij} - a_0 - a_1(T_{ij} - t)\}^2, \qquad (1.8)$$
$K_1$ is a non-negative and symmetric univariate kernel density function, and $h_1 = h_1(n)$ is the bandwidth that controls the amount of smoothing. Note that $h_1$ depends only on the sample size $n$ and thus ignores the dependency between measurements made on the same subject, a strategy Lin and Carroll (2000) showed to be the most asymptotically efficient. We use leave-one-curve-out cross-validation to select $h_1$, although a subjective choice is often sufficient in practice.
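Under the same illustrative setup, estimating $\mu$ as in (1.8) amounts to applying the earlier local_linear sketch to the pooled pairs $(T_{ij}, U_{ij})$, ignoring which subject each pair came from (local_linear, T, and U are the hypothetical objects from the previous sketches; the bandwidth is a subjective choice here):

\begin{verbatim}
# Pool observations across subjects and smooth over the pooled sample.
T_pool = np.concatenate(T)
U_pool = np.concatenate(U)
grid = np.linspace(0.05, 0.95, 19)
mu_hat = np.array([local_linear(t0, T_pool, U_pool, h=0.15) for t0 in grid])
\end{verbatim}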

For $1 \le i \le n$, $1 \le j, l \le N_i$, let $G_i(T_{ij}, T_{il}) = \{U_{ij} - \hat\mu(T_{ij})\}\{U_{il} - \hat\mu(T_{il})\}$ denote the raw observed covariance. Observe that
$$E[G_i(T_{ij}, T_{il}) \mid T_{ij}, T_{il}] \approx E[U_{ij}U_{il} \mid T_{ij}, T_{il}] - \mu(T_{ij})\mu(T_{il}) = \mathrm{cov}[X_i(T_{ij}), X_i(T_{il}) \mid T_{ij}, T_{il}] + \sigma^2\delta_{jl},$$
where the approximation arises from replacing $\hat\mu$ with $\mu$, and $\delta_{jl} = 1$ if $j = l$ and $0$ otherwise. This suggests only $\{G_i(T_{ij}, T_{il}) : 1 \le i \le n,\; 1 \le j \ne l \le N_i\}$ should be included as input data for estimation of the covariance surface $\Sigma(s,t)$. Thus, $\hat\Sigma(s,t) = \hat{b}_0$, where
$$(\hat{b}_0, \hat{b}_1, \hat{b}_2)^\top = \operatorname*{argmin}_{b_0, b_1, b_2} \sum_{i=1}^{n} \sum_{1 \le j \ne l \le N_i} K_2\Bigl(\frac{T_{ij} - s}{h_2}, \frac{T_{il} - t}{h_2}\Bigr)\{G_i(T_{ij}, T_{il}) - b_0 - b_1(T_{ij} - s) - b_2(T_{il} - t)\}^2, \qquad (1.9)$$
$K_2$ is a non-negative and symmetric bivariate kernel density function, and $h_2 = h_2(n)$ is the bandwidth that controls the amount of smoothing. We again use leave-one-curve-out cross-validation to select $h_2$.


The smoothing step in (1.9) also hints at the estimation of $\sigma^2$ by
$$\hat\sigma^2 = |\mathcal{T}_1|^{-1} \int_{\mathcal{T}_1} \{\tilde\Sigma(t) - \hat\Sigma(t,t)\}\,dt, \qquad (1.10)$$
where $\tilde\Sigma$ is obtained by smoothing $G_i(T_{ij}, T_{ij})$ over all individuals. The region of integration, $\mathcal{T}_1$, of length $|\mathcal{T}_1|$, is taken as the middle half of the whole interval $\mathcal{T}$ to reduce boundary effects introduced by smoothing. To better estimate $\Sigma(s,t)$ along the “height ridge” when $s \approx t$, we adjust the estimate $\tilde\Sigma(t)$ using a local quadratic smoother; see Yao et al. (2003) for details.
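A minimal numerical version of (1.10), continuing the earlier sketches and assuming a diagonal smooth sig_diag(t) of the raw variances and a smoothed surface Sigma_hat from (1.9) are already available on the grid (both names hypothetical):

\begin{verbatim}
# sigma2_hat per (1.10): average gap between the smoothed diagonal
# (which carries the noise variance) and the surface's diagonal,
# over the middle half of the domain to limit boundary effects.
grid = np.linspace(0, 1, 51)
mid = (grid >= 0.25) & (grid <= 0.75)                  # middle half T_1
gap = sig_diag[mid] - np.diag(Sigma_hat)[mid]          # hypothetical inputs
sigma2_hat = max(np.trapz(gap, grid[mid]) / 0.5, 0.0)  # truncate at zero
\end{verbatim}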

The estimated eigenelements $\{(\hat\lambda_k, \hat\phi_k)\}_{k\in\mathbb{N}}$ thus solve the eigenvalue problem
$$\int_{\mathcal{T}} \hat\Sigma(s,t)\hat\phi_k(s)\,ds = \hat\lambda_k \hat\phi_k(t),$$
subject to the orthonormality constraint $\langle\hat\phi_k, \hat\phi_m\rangle = \delta_{km}$. This can be solved numerically by discretizing $\hat\Sigma(s,t)$ onto a fine grid of equally spaced time points and carrying out multivariate principal components analysis (Ramsay and Silverman, 2005, Chapter 8.4).
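As a sketch of that discretization step (numpy only; the matrix Sigma_hat of surface values on the equally spaced grid is assumed from the previous steps):

\begin{verbatim}
# Discretize the eigenproblem: on an equally spaced grid with spacing dt,
# the integral equation becomes (Sigma_hat * dt) e_k = lambda_k e_k.
dt = grid[1] - grid[0]
evals, evecs = np.linalg.eigh(Sigma_hat * dt)
order = np.argsort(evals)[::-1]                 # decreasing eigenvalues
lam_hat = evals[order]
phi_hat = evecs[:, order] / np.sqrt(dt)         # rescale so int phi^2 dt = 1
\end{verbatim}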

Principal Components Analysis through Conditional Expectation

It is obvious that when functional data is observed sparsely, the standard approach to estimating the FPC scores $\xi_{ik} = \int_{\mathcal{T}} \{X_i(t) - \mu(t)\}\phi_k(t)\,dt$ via numerical integration does not work. Further, since under model (1.7) the trajectories are observed with noise, substituting $X_i(T_{ij})$ with $U_{ij}$ thus leads to biased estimates of $\xi_{ik}$. These two observations are the primary motivations for Principal components Analysis through Conditional Expectation (PACE, Yao et al., 2005a).

If we assume model (1.7) can be well-approximated by the first $K$ functional principal components, then we can write it as
$$\boldsymbol{U}_i = \boldsymbol{\mu}_i + \boldsymbol{\Phi}_i\boldsymbol{\xi}_i + \boldsymbol{\varepsilon}_i,$$

where $\boldsymbol{U}_i = (U_{i1}, \ldots, U_{iN_i})^\top$, $\boldsymbol{\mu}_i = (\mu(T_{i1}), \ldots, \mu(T_{iN_i}))^\top$, $\boldsymbol{\phi}_{ik} = (\phi_k(T_{i1}), \ldots, \phi_k(T_{iN_i}))^\top$, $\boldsymbol{\varepsilon}_i = (\varepsilon_{i1}, \ldots, \varepsilon_{iN_i})^\top$, and $\boldsymbol{\xi}_i = (\xi_{i1}, \ldots, \xi_{iK})^\top$ are vectors, and $\boldsymbol{\Phi}_i = (\boldsymbol{\phi}_{i1}, \ldots, \boldsymbol{\phi}_{iK})$ is an $N_i \times K$ matrix.

The best linear unbiased predictor (BLUP, Henderson, 1950) of $\boldsymbol{\xi}_i$ is given by
$$\tilde{\boldsymbol{\xi}}_i = \boldsymbol{\lambda}\boldsymbol{\Phi}_i^\top \boldsymbol{\Sigma}_i^{-1}(\boldsymbol{U}_i - \boldsymbol{\mu}_i), \qquad (1.11)$$
where $\boldsymbol{\lambda} = \mathrm{diag}(\lambda_1, \ldots, \lambda_K)$, and $\boldsymbol{\Sigma}_i$ is an $N_i \times N_i$ matrix whose $(j,l)$th element is given by $\mathrm{cov}[U_{ij}, U_{il} \mid T_{ij}, T_{il}] = \Sigma(T_{ij}, T_{il}) + \sigma^2\delta_{jl}$. Let $\boldsymbol{T}_i = (T_{i1}, \ldots, T_{iN_i})^\top$. It is well known that if $\boldsymbol{\xi}_i$ and $\boldsymbol{\varepsilon}_i$ are additionally jointly Gaussian, then $\tilde{\boldsymbol{\xi}}_i = E[\boldsymbol{\xi}_i \mid \boldsymbol{U}_i, \boldsymbol{T}_i]$ and is optimal in mean squared error. The PACE estimate of $\boldsymbol{\xi}_i$ is thus given by
$$\hat{\boldsymbol{\xi}}_i = \hat{\boldsymbol{\lambda}}\hat{\boldsymbol{\Phi}}_i^\top \hat{\boldsymbol{\Sigma}}_i^{-1}(\boldsymbol{U}_i - \hat{\boldsymbol{\mu}}_i), \qquad (1.12)$$
where the $(j,l)$th element of $\hat{\boldsymbol{\Sigma}}_i$ is given by $\hat\Sigma(T_{ij}, T_{il}) + \hat\sigma^2\delta_{jl}$. The prediction for $X_i(t)$ with dimension reduction is thus
$$\hat{X}_i(t) = \hat\mu(t) + \sum_{k=1}^{K} \hat\xi_{ik}\hat\phi_k(t). \qquad (1.13)$$
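A minimal sketch of (1.12)-(1.13) for one subject, continuing the earlier simulation and assuming estimated components mu_hat_fn, phi_hat_fn (vector-valued), lam_hat, and sigma2_hat (all names hypothetical); as a simplification, $\hat{\boldsymbol{\Sigma}}_i$ is built from the truncated expansion rather than the smoothed surface.

\begin{verbatim}
# PACE scores for subject i per (1.12), with Sigma_hat(s, t) evaluated
# from the truncated expansion sum_k lam_k phi_k(s) phi_k(t).
Ti, Ui = T[0], U[0]                          # one sparse subject
K = 2
Phi_i = phi_hat_fn(Ti).T                     # N_i x K matrix of phi_k(T_ij)
Lam = np.diag(lam_hat[:K])
Sig_i = Phi_i @ Lam @ Phi_i.T + sigma2_hat * np.eye(len(Ti))
xi_hat = Lam @ Phi_i.T @ np.linalg.solve(Sig_i, Ui - mu_hat_fn(Ti))

# Predicted trajectory per (1.13) on a grid.
X_hat = mu_hat_fn(grid) + xi_hat @ phi_hat_fn(grid)
\end{verbatim}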

Selecting the Number of Functional Principal Components

Let $\hat\mu^{-(i)}$ and $\{\hat\phi^{-(i)}_k\}_{k\in\mathbb{N}}$ denote the mean and eigenfunctions estimated from the data excluding subject $i$, respectively. We use leave-one-curve-out cross-validation to select the number of principal components $K$ in the prediction of $X$ in (1.13). To be precise, we select $K$ as
$$\hat{K} = \operatorname*{argmin}_{K} \sum_{i=1}^{n}\sum_{j=1}^{N_i} \{U_{ij} - \hat{X}^{-(i)}_i(T_{ij})\}^2,$$
where $\hat{X}^{-(i)}_i(T_{ij}) = \hat\mu^{-(i)}(T_{ij}) + \sum_{k=1}^{K} \hat\xi_{ik}\hat\phi^{-(i)}_k(T_{ij})$ represents the predicted trajectory for subject $i$. However, in practice a subjective choice such as the fraction of variance explained


is often sufficient. More specifically, for a user-defined threshold $0 < \alpha < 1$, we select $K$ as
$$\hat{K} = \min\Bigl\{K : \frac{\sum_{k=1}^{K} \hat\lambda_k}{\sum_{k=1}^{\infty} \hat\lambda_k} \ge \alpha\Bigr\}.$$
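In code, the fraction-of-variance-explained rule is a one-liner; here lam_hat from the discretized eigenproblem sketch above stands in for the estimated eigenvalues, truncated to its positive part as an illustrative guard.

\begin{verbatim}
# Smallest K whose leading eigenvalues explain at least alpha of the variance.
alpha = 0.95
lam_pos = np.clip(lam_hat, 0, None)      # guard against negative estimates
fve = np.cumsum(lam_pos) / np.sum(lam_pos)
K_hat = int(np.argmax(fve >= alpha)) + 1
\end{verbatim}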

For an AIC-type criterion, we refer the reader to equation (11) in Yao et al. (2005a).


1.2 Outline of Thesis

In Chapter 2, we extend the Principal components Analysis through Conditional Expectation procedure of Chapter 1.1.4 to the case of genetically correlated subjects. The motivating example concerns sparse measurements of the mass of sibling cows from several independent families. In Chapter 3, we consider the problem of dimension reduction in functional regression under the framework of effective dimension reduction. Our proposal draws inspiration from multivariate cumulative slicing estimation; it provides an innovative solution to the challenging problem of characterizing the effective dimension reduction space in the presence of sparse functional data. In Chapter 4, we apply our effective dimension reduction proposal to study functional classification.

Chapter 2

Data Model for Genetically Correlated Subjects

2.1 Introduction

The aforementioned works on FPC approaches in Chapter 1.1.4 deal exclusively with independent subjects. Very little work has appeared involving the analysis of correlated subjects or of clusters. Due to the difficulty of appropriately modelling complex dependence structures, existing work on feasible models for correlated functional data has usually been motivated in the context of specific applications. For instance, Peng and Paul (2011) adopted a separable covariance structure for weakly correlated functional data, e.g., for growth profiles from different locations in agricultural land, while Zhou et al. (2010) considered spatially correlated FPC analysis by coupling linear mixed effects (LME) models with penalized splines. In this chapter, we propose a functional data model for family-wise related individuals. Our proposal models the genetic and environmental processes both at the subject level, and allows for genetic dependencies introduced by varied familial associations. This is distinct from hierarchical or multilevel FPCA (Morris et al., 2003, Di et al., 2011), where the assumptions on the within-family covariance do not allow for a variety of familial relationships.

2.1.1 Motivating Application

Our motivating example concerns the growth (in kilograms) as a function of age (in days) of half-sibling cows in fifteen independent families. A key issue in the analysis is the incorporation of genetic information that helps researchers understand how selective breeding can change the physical traits passed down to future generations. This understanding has economic consequences, as accurate estimation of the genetic component of an individual’s trait can lead to better breeding decisions. Even small improvements in breeding practices can greatly increase food production. However, the estimation of the genetic component is complicated by the fact that it is unobservable and must be inferred from the observed physical trait. The physical trait depends not only on the genotype but also on the environmental effect, which includes factors such as habitat or food availability. Fortunately, genetic theory makes inference possible when data include information from related individuals.

This data set was first analyzed using a multivariate approach in Meyer (1985) and later, with a random regression approach for individual growth, in Meyer and Hill (1997). The random regression approach uses a basis expansion with an individual’s coefficients modeled as random effects. Statistical analysis is implemented with an LME model; see Demidenko (2004) and references therein for a general treatment of the random regression model using LME. However, in random regression, the choice of pre-specified basis functions is not straightforward. Although splines (in particular B-splines) have been a popular option, simulation studies in Griswold et al. (2008) indicated that B-splines do not necessarily perform well in many realistic settings. This might be caused by the “one-size-fits-all” character of B-splines, which may result in needing a fairly large number of B-spline functions. A natural approach to constructing a parsimonious model is to exploit the FPCA technique to find a data-adaptive eigenbasis, which often requires only a few leading eigenfunctions to adequately reconstruct trajectories.

2.1.2 Overview

The main contribution of this chapter is to develop a new FPCA framework that effectively takes into account genetic information and can be used in a variety of biological applications. The key is to generalize the canonical eigenbasis model to genetically related subjects within independently sampled families. As the individual phenotype is irregularly and sparsely observed with noise, a common occurrence in many settings, it is desirable to borrow strength from the whole sample. Yao et al. (2005a) proposed a version of FPC analysis, called Principal components Analysis through Conditional Expectation (PACE), that is particularly useful for such sparse functional data. Compared to spline-based FPC methods that implicitly treat truncated models as the target (James et al., 2000), PACE emphasizes genuine nonparametric modeling of the covariance and finds data-driven eigenfunctions to be used as basis functions. Thus PACE allows for theoretical investigation of the underlying process itself. Given these advantages of the PACE approach, we couple the PACE principle with the genetic information to develop a novel FPCA framework, called Familial principal components Analysis through Conditional Expectation (FACE). Our approach naturally decomposes the total covariance into genetic and environmental components, both of which are estimated by smoothing techniques. Data-adaptive eigen-components associated with both covariance structures are obtained and used in the proposed FACE estimation of the genetically related individuals.

The remainder of this chapter is organized as follows. In Section 2.2, we introduce biological modeling of the genetic component of a physical trait, and motivate the proposed FPC model for related individuals. Section 2.3 describes the methodology for estimation of the model components, including the genetic and environmental covariances and their respective eigen-components. The known familial genetic relationship is utilized and leads to the proposed FACE estimation for subject-level signal extraction. We analyze the growth of beef cattle in Section 2.4, while Section 2.5 contains simulation examples. Concluding remarks are offered in Section 2.6.

2.2 Genetic Relationship and Proposed Functional Model

2.2.1 Background on the Quantitative Genetic Model

To describe the standard quantitative genetic model for physical traits, let $X_j$ denote the phenotype of individual $j$, $U_j$ the phenotype observed with error $\varepsilon_j$, $g_j$ the genetic component, and $e_j$ the environmental factor. Suppose for now that these quantities are either all scalars, $p$-vectors, or functions. The simplest genetic model is an additive structure with $g_j$, $e_j$, and $\varepsilon_j$ uncorrelated with expected values equal to 0,
$$U_j = X_j + \varepsilon_j = \mu + g_j + e_j + \varepsilon_j. \qquad (2.1)$$

Individuals raised in different environments have uncorrelated $e_j$’s, while related individuals from the same family have correlated underlying genotypes, the $g_j$’s, with the amount of correlation depending on the individuals’ relationship. For instance, suppose that $g_j$ is a $p$-vector with $p \times p$ covariance matrix $G$. The $p \times p$ cross-covariance matrix defined as $E[g_j g_{j'}^\top]$, $j \ne j'$, is equal to $\alpha_{jj'}G$, where $\alpha_{jj'} \in [0,1]$ is called a relationship coefficient and depends on the relationship between individuals $j$ and $j'$. If the individuals are full siblings, i.e., they have the same mother and father, then $\alpha_{jj'} = 1/2$. If the individuals are half-siblings, that is, if they have only one parent in common, then $\alpha_{jj'} = 1/4$. If the individuals are unrelated then $\alpha_{jj'} = 0$, and if they are clones or the same individual then $\alpha_{jj'} = 1$. The intuition behind the value of $\alpha_{jj'}$ is that $\alpha_{jj'}$ equals the expected proportion of genes that individuals $j$ and $j'$ share via inheritance.
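To fix ideas, a minimal sketch of the within-family covariance of a scalar trait implied by this structure: with genetic variance G, environmental variance E, and relationship matrix $A_i = (\alpha_{i,jj'})$, the family covariance is $A_i G + E I$. A half-sibling family of three, with illustrative variance values, serves as the example.

\begin{verbatim}
import numpy as np

# Within-family covariance of a scalar trait implied by model (2.1):
# cov matrix = A * G + E * I, with A holding relationship coefficients.
G_var, E_var = 2.0, 1.0            # illustrative variances
A = np.full((3, 3), 0.25)          # half-siblings: alpha = 1/4 off-diagonal
np.fill_diagonal(A, 1.0)           # alpha = 1 with oneself
family_cov = A * G_var + E_var * np.eye(3)
\end{verbatim}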

This model for genetic correlation and the use of these values of $\alpha_{jj'}$ are well-supported by both theoretical calculations and empirical studies. Their use is standard in animal breeding and in laboratory experiments in evolutionary biology. The model was first introduced, with values of $\alpha_{jj'}$ calculated, in Fisher (1918). Also see Lynch and Walsh (1998, Chapter 7) for a modern treatment and Heckman (2003) for a statistician-friendly derivation of $E[g_j g_{j'}^\top] = G/2$ for a mother-child relationship. Analysis of (2.1) is straightforward when the traits are scalar or vector-valued, the relationships are all the same, and the design is balanced; for instance, for data from $N$ independent families with $k$ full siblings in each family. In this case, variance/covariance parameters are easily estimated in closed form by analysis of variance and the method of moments. For more general designs and combinations of relationships, numerical estimation is possible via (restricted) maximum likelihood (Lynch and Walsh, 1998, Chapter 27), and is implemented in software such as ASReml (http://www.vsni.co.uk/software/asreml) and WOMBAT (Meyer, 2007).

2.2.2 Functional Data Model for Genetically Related Individuals

Data such as weights of cows can be viewed as arising from smooth functions, even if the weights are sampled at irregular and, possibly, sparse discrete times across subjects. We consider the situation where there are $N$ independent families with $N_i$ members in family $i$. Let $\alpha_{i,jj'}$ denote the known relationship coefficient for individuals $j$ and $j'$ of family $i$, and assume that the within-family relationship coefficients are non-zero. While our methodology holds for general $\alpha_{i,jj'}$’s, in the data we analyze in Section 2.4, all family members are half-siblings, i.e., $\alpha_{i,jj'} = 1/4$ for $j \ne j'$ and $\alpha_{i,jj} = 1$ otherwise.

The functional version of (2.1) for the phenotype of the $j$th individual in the $i$th family is
$$X_{ij}(t) = \mu(t) + g_{ij}(t) + e_{ij}(t), \qquad (2.2)$$
where $\mu$ is the population mean curve, $g_{ij}$ is what is called the random genetic effect, and $e_{ij}$ models any other random effects (mainly environmental) giving rise to within-individual covariances that are not due to $g_{ij}$. As is common (see, e.g., Lynch and Walsh, 1998), we will refer to $e_{ij}$ as the environmental effect and $g_{ij}$ simply as the genetic effect. In this model, $g_{ij}$ and $e_{ij}$ are (i) mean zero with the variances of $g_{ij}(t)$ and $e_{ij}(t)$ finite for all $t$, (ii) uncorrelated, (iii) $\mathrm{cov}(g_{ij}(s), g_{ij}(t)) = G(s,t)$, and (iv) $\mathrm{cov}(e_{ij}(s), e_{ij}(t)) = E(s,t)$. These four properties imply that the total covariance is $\mathrm{cov}(X_{ij}(s), X_{ij}(t)) = \Sigma(s,t) = G(s,t) + E(s,t)$. The within-family genetic correlation between two individuals depends on $G$ and the individuals’ relationship coefficient:
$$\mathrm{cov}(g_{ij}(s), g_{ij'}(t)) = \alpha_{i,jj'} G(s,t). \qquad (2.3)$$

The processes $e_{ij}(\cdot)$ and $e_{i'j'}(\cdot)$ are independent when $(i,j) \ne (i',j')$. Assume that the measurements are taken on a closed and bounded interval $\mathcal{T}$, i.e., $t \in \mathcal{T}$. Note that model (2.2) is not the classical functional model that assumes that data come from independent realizations of $X_{ij}(t) = \mu(t) + v_{ij}(t)$. In (2.2), we have decomposed the random deviation $v_{ij}(t)$ as $g_{ij}(t) + e_{ij}(t)$, where the genetic effect $g_{ij}(t)$ induces a within-family correlation.

A stochastic process with finite covariance admits a Karhunen-Loeve expansion and its covariance function admits a spectral basis expansion (Loeve, 1978, Adler and Taylor, 2007). The key proposal is to exploit such expansions for both the genetic and environmental processes, whilst maintaining the dependence structure of related individuals. For the genetic process $g_{ij}$, we have for $s, t \in \mathcal{T}$,
$$g_{ij}(t) = \sum_{l=1}^{\infty} \xi_{ijl}\phi_l(t), \quad G(s,t) = \sum_{l=1}^{\infty} \lambda_l \phi_l(s)\phi_l(t), \qquad (2.4)$$
where the $\phi_l$’s are orthonormal eigenfunctions and $\xi_{ij1}, \xi_{ij2}, \ldots$ are the FPC scores, which are uncorrelated random variables with zero mean and variances $\lambda_1 > \lambda_2 > \ldots$, satisfying $\sum_{l=1}^{\infty} \lambda_l < \infty$. Based on the underlying genetic model in equation (2.3), we can deduce that the covariance between $\xi_{ijl}$ and $\xi_{i'j'l'}$ is $\lambda_l\,\alpha_{i,jj'}$ for $i = i'$ and $l = l'$, and zero otherwise. This genetic association is the key to consistent parameter estimation, as it enables us to borrow information across related individuals. This model and basis expansion in the context of selection and genetics was first described in Kirkpatrick and Heckman (1989). Similar expansions hold for the environmental process $e_{ij}$, with orthonormal eigenfunctions $\{\psi_m\}_{m\ge 1}$ and nonincreasing eigenvalues $\{\rho_m\}_{m\ge 1}$, i.e., for $s, t \in \mathcal{T}$,
$$e_{ij}(t) = \sum_{m=1}^{\infty} \zeta_{ijm}\psi_m(t), \quad E(s,t) = \sum_{m=1}^{\infty} \rho_m \psi_m(s)\psi_m(t), \qquad (2.5)$$
where the $\zeta_{ijm}$ are uncorrelated FPC scores of $e_{ij}$ with zero mean and finite variance $\rho_m$. It is obvious that the covariance between $\zeta_{ijm}$ and $\zeta_{i'j'm'}$ is always zero given independent environmental processes, unless $(i,j,m) = (i',j',m')$.

Therefore the proposed FPC model for $X_{ij}(t)$ based on these Karhunen-Loeve expansions is given by
$$X_{ij}(t) = \mu(t) + \sum_{l=1}^{\infty} \xi_{ijl}\phi_l(t) + \sum_{m=1}^{\infty} \zeta_{ijm}\psi_m(t), \quad t \in \mathcal{T}. \qquad (2.6)$$
The deviation of each curve $X_{ij}$ from the overall trend $\mu$ is a sum of curves $\phi_l$ and $\psi_m$ with random amplitudes $\xi_{ijl}$ and $\zeta_{ijm}$, respectively. Although the underlying model (2.6) is infinite-dimensional, the typically rapid decay of the eigenvalues often allows us to use a small number of leading eigenfunctions to recover $X_{ij}$. In practice, the infinite sums in (2.6) can be truncated and the $\phi_l$’s and $\psi_m$’s estimated, yielding a data-adaptive low-dimensional model for $X_{ij}$. The practical choice of the level of truncation is discussed in Section 2.3. This eigenfunction approach differs from a random regression model with spline basis functions, as the eigenfunction basis is completely data-driven, while the spline function basis is pre-specified without knowledge of the data. A principal components approach to model (2.2) appears in Di et al. (2011), but with a more restricted covariance structure, which in our context would require that $\alpha_{i,jj'} \equiv \alpha$ for all $i$ and for all $j \ne j'$.

We let the data observed for individual $j$ from family $i$ consist of $N_{ij}$ repeated measurements of $X_{ij}$ taken at discrete time points $\{T_{ijk} \in \mathcal{T} : k = 1, \ldots, N_{ij}\}$. Denoting the $k$th noisy observation of $X_{ij}$ at $T_{ijk}$ by $U_{ijk}$, the data model is
$$U_{ijk} = X_{ij}(T_{ijk}) + \varepsilon_{ijk} = \mu(T_{ijk}) + \sum_{l=1}^{\infty} \xi_{ijl}\phi_l(T_{ijk}) + \sum_{m=1}^{\infty} \zeta_{ijm}\psi_m(T_{ijk}) + \varepsilon_{ijk}, \qquad (2.7)$$
where the $\varepsilon_{ijk}$’s are independent and identically distributed errors with zero mean and finite variance $\sigma^2$, and are independent of both the $\xi_{ijl}$ and the $\zeta_{ijm}$.

2.3 Model Estimation and FPC Representation

The quantities in model (2.7) are composed of two types: the population components,

such as the mean, covariances and eigenvalues/functions; and the subject-level signals,

i.e., the random amplitudes or FPC scores for the underlying genetic and environmental

processes. The main challenge in estimating these quantities is due to the irregularly and

sparsely observed functional data. More specifically, there may be only a few observations

available for some or even all of the individuals. In this case, borrowing strength across the

entire collection of data is important for obtaining consistent estimation of the population

quantities. As mentioned in the introduction, Yao et al. (2005a) provided a thorough

treatment for such sparse functional data in the case of the classical functional model with

independent realizations, and proposed the PACE method. We shall generalize

the key idea of PACE and take advantage of the genetic relationship (2.3) in model (2.7).

2.3.1 Estimation of Model Components

The mean and covariance functions are assumed to be smooth, so we can estimate them

by nonparametric regression methods, which borrow information from neighboring data

values. We use local linear smoothers (Fan and Gijbels, 1996) for function and surface

estimation. The key to estimating parameters from sparse functional data is to pool

together information from all individuals, requiring the “pooled” data to be sufficiently

dense. For these local smoothing steps, for a given level of smoothing we adopt the

strategy of ignoring the dependency among the data from the same individual/family.


However we do not ignore correlation when choosing the amount of smoothing. See Lin

and Carroll (2000) for a discussion of smoothing correlated data. Automatic bandwidth

choices for the amount of smoothing of functional data are available [see Rice and Sil-

verman (1991) for leave-one-curve-out cross-validation and Muller and Prewitt (1993)

for surface smoothing], even though subjective choices are often adequate in practice.

Following Chapter 1.1.4, the mean function $\mu$ evaluated at $t$ is estimated by $\hat{\mu}(t) = \hat{a}_0$, where
$$(\hat{a}_0, \hat{a}_1)^{\top} = \operatorname*{argmin}_{a_0, a_1} \sum_{i=1}^{n} \sum_{j=1}^{N_i} \sum_{k=1}^{N_{ij}} K_1\!\left(\frac{T_{ijk} - t}{h_1}\right) \{U_{ijk} - a_0 - a_1(T_{ijk} - t)\}^2. \tag{2.8}$$

The kernel function K1 is a positive density symmetric about 0, and h1 is the bandwidth.

Due to the genetic correlation within families, we choose $h_1$ by minimizing the "leave-one-family-out" cross-validation (CV) criterion,
$$\mathrm{CV}(h_1) = \sum_{i=1}^{n} \sum_{j=1}^{N_i} \sum_{k=1}^{N_{ij}} \left\{U_{ijk} - \hat{\mu}^{-(i)}(T_{ijk}; h_1)\right\}^2, \tag{2.9}$$
where $\hat{\mu}^{-(i)}(\cdot\,; h_1)$ is the estimate of $\mu$ obtained by removing all of the $i$th family's data.
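As a concrete illustration, a minimal Python sketch of the pooled smoother (2.8) follows. It assumes the pairs (Tijk, Uijk) have been stacked into flat arrays across all families; the Epanechnikov kernel and the tiny ridge term are illustrative choices, and the leave-one-family-out CV (2.9) would simply rerun this with one family's observations held out.

```python
import numpy as np

def local_linear_mean(t_pool, u_pool, t_grid, h1):
    """Pooled local linear smoother (2.8): returns muhat on t_grid."""
    kern = lambda v: 0.75 * np.maximum(1.0 - v ** 2, 0.0)   # Epanechnikov kernel
    mu_hat = np.empty(len(t_grid))
    for g, t0 in enumerate(t_grid):
        d = t_pool - t0
        w = kern(d / h1)
        X = np.column_stack([np.ones_like(d), d])           # local intercept and slope
        WX = X * w[:, None]
        a = np.linalg.solve(X.T @ WX + 1e-10 * np.eye(2), WX.T @ u_pool)
        mu_hat[g] = a[0]                                    # a0 = muhat(t0)
    return mu_hat
```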

The estimation of the covariance functions combines smoothing and the method of

moments and relies upon the following key facts. Recalling that the total covariance

Σ(s, t) = G(s, t) + E(s, t), we have

$$\begin{aligned}
\operatorname{cov}\!\left[U_{ijk}, U_{ijk'} \mid T_{ijk}, T_{ijk'}\right] &= \Sigma(T_{ijk}, T_{ijk'}) + \delta_{kk'}\sigma^2, \\
\alpha_{i,jj'}^{-1}\operatorname{cov}\!\left[U_{ijk}, U_{ij'k'} \mid T_{ijk}, T_{ij'k'}\right] &= G(T_{ijk}, T_{ij'k'}), \qquad j \neq j',
\end{aligned} \tag{2.10}$$

where $\delta_{kk'} = 1$ for $k = k'$ and $0$ otherwise. We define the centered observations $U^c_{ijk} = U_{ijk} - \hat{\mu}(T_{ijk})$ and the raw covariance observations $C_{ijkk'} = U^c_{ijk} U^c_{ijk'}$. Then we use a two-dimensional local linear smoother as in (1.9) to estimate the overall covariance function


$\Sigma$, with $\hat{\Sigma}(s, t) = \hat{b}_0$, where
$$(\hat{b}_0, \hat{b}_1, \hat{b}_2)^{\top} = \operatorname*{argmin}_{b_0, b_1, b_2} \sum_{i=1}^{n} \sum_{j=1}^{N_i} \sum_{1 \le k \neq l \le N_{ij}} K_2\!\left(\frac{T_{ijk} - s}{h_2}, \frac{T_{ijl} - t}{h_2}\right) \{C_{ijkl} - b_0 - b_1(T_{ijk} - s) - b_2(T_{ijl} - t)\}^2. \tag{2.11}$$

K2 is a positive bivariate density symmetric about 0, and h2 is the bandwidth. As in

equation (1.10), we can estimate the noise variance $\sigma^2$ by
$$\hat{\sigma}^2 = |\mathcal{T}_1|^{-1} \int_{\mathcal{T}_1} \{\tilde{\Sigma}(t) - \hat{\Sigma}(t, t)\}\, dt,$$

where $\tilde{\Sigma}$ is obtained by smoothing the diagonal pairs $(T_{ijk}, C_{ijkk})$ over all individuals. The bandwidths that control the smoothness of $\tilde{\Sigma}$ and $\hat{\Sigma}$, respectively, are also chosen by the leave-one-family-out CV in the spirit of (2.9).
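For instance, once both diagonal estimates are evaluated on a common grid, the noise variance step reduces to a short numerical integration; the interior window T1 below (the middle half of the domain) and the truncation at zero are illustrative conveniences, not prescriptions from the thesis.

```python
import numpy as np

def noise_variance(diag_smooth, surf_diag, grid):
    """sigma2hat: average the gap between the smoothed raw-covariance diagonal
    and the diagonal of the smoothed surface over an interior window T1."""
    lo, hi = np.quantile(grid, [0.25, 0.75])      # illustrative interior window
    m = (grid >= lo) & (grid <= hi)
    return max(np.trapz(diag_smooth[m] - surf_diag[m], grid[m]) / (hi - lo), 0.0)
```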

To estimate the genetic covariance function G, the key relationship in (2.10) sug-

gests borrowing data across the entire family by constructing raw cross-covariances ob-

tained from individuals of the same family. Define such raw cross-covariance obser-

vations, adjusted for the relationship coefficients, by $G_{ijj'kk'} = \alpha_{i,jj'}^{-1} U^c_{ijk} U^c_{ij'k'}$. Therefore we estimate $G$ using a two-dimensional local linear smoother of the pooled input $\{(T_{ijk}, T_{ij'k'}, G_{ijj'kk'}) : k = 1, \ldots, N_{ij},\ k' = 1, \ldots, N_{ij'},\ 1 \le j \neq j' \le N_i,\ i = 1, \ldots, n\}$, yielding the

estimate $\hat{G}$. As a consequence, the environmental covariance estimate is easily obtained by $\hat{E} = \hat{\Sigma} - \hat{G}$.
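The pooled input for smoothing can be assembled as sketched below, assuming centered observations per subject and a matrix of relationship coefficients per family (hypothetical containers for this sketch); the returned triples are exactly the points fed to the two-dimensional smoother.

```python
import numpy as np

def raw_cross_covariances(families, alpha):
    """Pooled input {(T_ijk, T_ij'k', G_ijj'kk')} for estimating G, where
    families[i] lists (t, u_centered) per subject and alpha[i] is the
    N_i x N_i matrix of relationship coefficients."""
    S, T, G = [], [], []
    for i, fam in enumerate(families):
        for j, (tj, cj) in enumerate(fam):
            for jp, (tjp, cjp) in enumerate(fam):
                if j == jp:
                    continue                        # only between-sibling pairs here
                S.append(np.repeat(tj, len(tjp)))
                T.append(np.tile(tjp, len(tj)))
                G.append(np.outer(cj, cjp).ravel() / alpha[i][j, jp])
    return np.concatenate(S), np.concatenate(T), np.concatenate(G)
```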

We suggest an optional step for updating the estimates of G and E. Note that the

genetic covariance G appears in the within-individual covariance and also appears in the

covariance between related individuals, coupled with the relationship coefficient, as given

in (2.3). In our initial estimate of G, we have only used the latter type of information,

the information among related individuals, that is, we have only smoothed the adjusted

cross-covariances $G_{ijj'kk'} = \alpha_{i,jj'}^{-1} U^c_{ijk} U^c_{ij'k'}$, $j \neq j'$. In our update, we add the information


on G contained within an individual. Specifically we use our initial estimate of E and

note that for $k \neq k'$, $E\big[C_{ijkk'} - E(T_{ijk}, T_{ijk'})\big] \approx G(T_{ijk}, T_{ijk'})$. Thus we can construct $\hat{G}^{*}$, a new estimate of $G$, by smoothing the combined "data": $\{C_{ijkk'} - \hat{E}(T_{ijk}, T_{ijk'}),\ k \neq k'\}$ and $\{G_{ijj'kk'},\ j \neq j'\}$. The estimate of the environmental covariance is also updated by

$\hat{E}^{*} = \hat{\Sigma} - \hat{G}^{*}$ accordingly. In practice, when the number of observations per individual is

small and/or when we have a large number of individuals per family, this updating step

can often be omitted as the changes in estimates are negligible.

Estimates of the eigenfunctions and eigenvalues of G and E are obtained as solutions

to the eigen-equations

$$\int_{\mathcal{T}} \hat{G}^{*}(s, t)\hat{\phi}_l(s)\, ds = \hat{\lambda}_l\hat{\phi}_l(t), \qquad \int_{\mathcal{T}} \hat{E}^{*}(s, t)\hat{\psi}_m(s)\, ds = \hat{\rho}_m\hat{\psi}_m(t), \tag{2.12}$$

subject to the orthonormal constraints 〈φl, φl′〉 = δll′ and 〈ψm, ψm′〉 = δmm′ . This can

be implemented by discretizing the smooth covariances G∗ and E∗ and carrying out

matrix eigen-decomposition, as described in Rice and Silverman (1991). However, the

smoothed covariance functions G∗ and E∗ are not necessarily non-negative definite. A

simple modification is to set negative estimated eigenvalues to zero, and reconstruct G

and E based on (2.4) and (2.5), i.e.,

$$\hat{G}(s, t) = \sum_{l : \hat{\lambda}_l > 0} \hat{\lambda}_l\hat{\phi}_l(s)\hat{\phi}_l(t), \qquad \hat{E}(s, t) = \sum_{m : \hat{\rho}_m > 0} \hat{\rho}_m\hat{\psi}_m(s)\hat{\psi}_m(t), \tag{2.13}$$

which has been shown to improve the covariance estimation in terms of mean squared

error (Hall et al., 2008, Theorem 1).
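On an equally spaced grid, (2.12)-(2.13) amount to a quadrature-weighted matrix eigendecomposition followed by truncation of negative eigenvalues. A sketch, assuming the smoothed surface has already been evaluated on the grid:

```python
import numpy as np

def eigen_truncate(C_hat, grid):
    """Discretized eigen-decomposition of a smoothed covariance surface, with
    negative eigenvalues set to zero as in (2.13)."""
    dt = grid[1] - grid[0]
    evals, evecs = np.linalg.eigh(C_hat * dt)       # quadrature-weighted problem
    idx = np.argsort(evals)[::-1]
    evals, evecs = evals[idx], evecs[:, idx] / np.sqrt(dt)   # int phi^2 dt = 1
    pos = evals > 0
    C_nnd = (evecs[:, pos] * evals[pos]) @ evecs[:, pos].T   # reconstruction (2.13)
    return evals[pos], evecs[:, pos], C_nnd
```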

2.3.2 FPC Representation for Genetically Related Individuals

We proceed to reconstruct the individual trajectory Xij in (2.6), which requires the es-

timation of the genetic and environmental FPC scores given by ξijl = 〈Xij − µ, φl〉 and


ζijm = 〈Xij−µ, ψm〉, respectively. It is well-known that the classical integral approxima-

tion fails for sparsely observed functional data. The PACE method by Yao et al. (2005a)

overcomes this problem by employing the idea of the best linear unbiased prediction

(BLUP) in the context of FPCA. Here we generalize the PACE method for estimat-

ing the FPC scores ξijl and ζijm to the case where individuals are genetically related

within family. We call this generalization Familial principal component Analysis through

Conditional Expectation (FACE).

In the sequel, all expectations are understood to be taken conditional on the times Tijk.

To calculate $\tilde{\xi}_{ijl}$, the BLUP of $\xi_{ijl}$, let $\boldsymbol{U}_{ij} = (U_{ij1}, \ldots, U_{ijN_{ij}})^{\top}$, $\boldsymbol{U}_i = (\boldsymbol{U}_{i1}^{\top}, \ldots, \boldsymbol{U}_{iN_i}^{\top})^{\top}$, and $M_i = \sum_{j=1}^{N_i} N_{ij}$. Recall the covariance structures in (2.10). Due to the genetic correlation among all individuals in family $i$, we infer the $l$th FPC score $\xi_{ijl}$ of the genetic process $g_{ij}$ from the observed data for all subjects in the $i$th family. Write the $N_{ij} \times N_{ij}$ auto-covariance matrix of $\boldsymbol{U}_{ij}$ as $\Sigma_{i,jj} = \operatorname{cov}(\boldsymbol{U}_{ij}, \boldsymbol{U}_{ij}) = [\Sigma(T_{ijk}, T_{ijk'}) + \delta_{kk'}\sigma^2]_{1 \le k, k' \le N_{ij}}$, and the $N_{ij} \times N_{ij'}$ cross-covariance matrix between $\boldsymbol{U}_{ij}$ and $\boldsymbol{U}_{ij'}$ as $\Sigma_{i,jj'} = \operatorname{cov}(\boldsymbol{U}_{ij}, \boldsymbol{U}_{ij'}) = [\alpha_{i,jj'} G(T_{ijk}, T_{ij'k'})]_{1 \le k \le N_{ij},\, 1 \le k' \le N_{ij'}}$, where $1 \le j \neq j' \le N_i$. Then the $M_i \times M_i$ covariance matrix of $\boldsymbol{U}_i$ is $\Sigma_{\boldsymbol{U}_i} = \operatorname{cov}(\boldsymbol{U}_i, \boldsymbol{U}_i) = (\Sigma_{i,jj'})_{1 \le j, j' \le N_i}$.

Let $\boldsymbol{\phi}_{ijl} = (\phi_l(T_{ij1}), \ldots, \phi_l(T_{ijN_{ij}}))^{\top}$; noting that $\alpha_{i,jj} = 1$, one has $\operatorname{cov}(\xi_{ijl}, \boldsymbol{U}_i) = \lambda_l(\alpha_{i,j1}\boldsymbol{\phi}_{i1l}^{\top}, \ldots, \alpha_{i,jN_i}\boldsymbol{\phi}_{iN_il}^{\top})$. Finally, denote $\boldsymbol{\mu}_{ij} = (\mu(T_{ij1}), \ldots, \mu(T_{ijN_{ij}}))^{\top}$ and $\boldsymbol{\mu}_i = (\boldsymbol{\mu}_{i1}^{\top}, \ldots, \boldsymbol{\mu}_{iN_i}^{\top})^{\top}$. By the BLUP principle, we obtain the FACE formula for $\tilde{\xi}_{ijl}$,

$$\tilde{\xi}_{ijl} = \operatorname{cov}(\xi_{ijl}, \boldsymbol{U}_i)\operatorname{cov}(\boldsymbol{U}_i, \boldsymbol{U}_i)^{-1}(\boldsymbol{U}_i - \boldsymbol{\mu}_i) = \lambda_l\big(\alpha_{i,j1}\boldsymbol{\phi}_{i1l}^{\top}, \ldots, \alpha_{i,jN_i}\boldsymbol{\phi}_{iN_il}^{\top}\big)\big\{(\Sigma_{i,jj'})_{1 \le j, j' \le N_i}\big\}^{-1}(\boldsymbol{U}_i - \boldsymbol{\mu}_i), \tag{2.14}$$

which is equal to $E[\xi_{ijl} \mid \boldsymbol{U}_i]$ when all quantities are Gaussian. Substituting the estimates

of model components, using the generic notation “ˆ”, the FACE estimates are

$$\hat{\xi}_{ijl} = \hat{\lambda}_l\big(\alpha_{i,j1}\hat{\boldsymbol{\phi}}_{i1l}^{\top}, \ldots, \alpha_{i,jN_i}\hat{\boldsymbol{\phi}}_{iN_il}^{\top}\big)\big\{(\hat{\Sigma}_{i,jj'})_{1 \le j, j' \le N_i}\big\}^{-1}(\boldsymbol{U}_i - \hat{\boldsymbol{\mu}}_i). \tag{2.15}$$


Since the environmental processes $e_{ij}$ are independent across individuals, the estimation of the FPC scores $\zeta_{ijm}$ is as in PACE, i.e., it uses only the observed data for that subject. Denoting $\boldsymbol{\psi}_{ijm} = (\psi_m(T_{ij1}), \ldots, \psi_m(T_{ijN_{ij}}))^{\top}$, a simple calculation by the BLUP principle yields the FACE formula $\tilde{\zeta}_{ijm}$ and its plug-in estimate $\hat{\zeta}_{ijm}$,
$$\tilde{\zeta}_{ijm} = \rho_m\boldsymbol{\psi}_{ijm}^{\top}\Sigma_{i,jj}^{-1}(\boldsymbol{U}_{ij} - \boldsymbol{\mu}_{ij}), \qquad \hat{\zeta}_{ijm} = \hat{\rho}_m\hat{\boldsymbol{\psi}}_{ijm}^{\top}\hat{\Sigma}_{i,jj}^{-1}(\boldsymbol{U}_{ij} - \hat{\boldsymbol{\mu}}_{ij}). \tag{2.16}$$
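Computationally, (2.15) is one linear solve per family, reused across subjects and components. The sketch below assumes vectorized callables for the estimated mean and covariance surfaces and a matrix of relationship coefficients; all names are placeholders rather than the thesis implementation.

```python
import numpy as np

def face_genetic_scores(t_list, U_list, mu_hat, Sigma_hat, G_hat,
                        phi_l, lam_l, alpha_i, sigma2):
    """BLUP estimates (2.15) of the l-th genetic FPC scores for one family;
    t_list/U_list hold observation times and values per subject."""
    n_sib = len(U_list)
    blocks = [[None] * n_sib for _ in range(n_sib)]
    for j in range(n_sib):
        for jp in range(n_sib):
            S, T = np.meshgrid(t_list[j], t_list[jp], indexing="ij")
            if j == jp:      # within-subject: Sigma plus noise on the diagonal
                blocks[j][jp] = Sigma_hat(S, T) + sigma2 * np.eye(len(t_list[j]))
            else:            # between siblings: alpha_{i,jj'} G
                blocks[j][jp] = alpha_i[j, jp] * G_hat(S, T)
    Sigma_Ui = np.block(blocks)                                    # M_i x M_i
    resid = np.concatenate([U_list[j] - mu_hat(t_list[j]) for j in range(n_sib)])
    sol = np.linalg.solve(Sigma_Ui, resid)                         # shared solve
    scores = np.empty(n_sib)
    for j in range(n_sib):
        c = np.concatenate([alpha_i[j, jp] * phi_l(t_list[jp]) for jp in range(n_sib)])
        scores[j] = lam_l * c @ sol
    return scores
```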

The reconstruction of the individual trajectories is straightforward once we obtain

the estimates of the FPC scores. It is customary to assume that the Xij’s are well

approximated by a low-dimensional expansion. Suppose we include the Kg and Ke leading

eigenfunctions of gij and eij in (2.6), respectively, so that

$$\hat{X}_{ij}(t) = \hat{\mu}(t) + \sum_{l=1}^{K_g} \hat{\xi}_{ijl}\hat{\phi}_l(t) + \sum_{m=1}^{K_e} \hat{\zeta}_{ijm}\hat{\psi}_m(t). \tag{2.17}$$

The values of Kg and Ke can be chosen by objective criteria, such as leave-one-family-out

cross-validation, or the AIC based on a pseudo-likelihood under Gaussian assumptions in a

spirit similar to that of Yao et al. (2005a). In practice, using the proportion of functional

variation explained (FVE) with a suitable threshold is often satisfactory.
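For instance, the FVE rule is a one-liner given estimated eigenvalues; the 98% threshold used in Sections 2.4-2.5 is a typical choice.

```python
import numpy as np

def choose_by_fve(eigvals, threshold=0.98):
    """Smallest number of leading components whose cumulative fraction of
    variance explained reaches the threshold."""
    fve = np.cumsum(eigvals) / np.sum(eigvals)
    return int(np.searchsorted(fve, threshold) + 1)

# e.g. Kg = choose_by_fve(lam_hat); Ke = choose_by_fve(rho_hat)
```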

2.4 Application to Weights of Beef Cattle

The dataset we analyze here is a subset of a larger dataset used in Meyer et al. (1993) and

Meyer (1999). Our data set contains weights in kilograms of 55 beef cattle from a total

of 15 independent families. The cows within a family were half-siblings, having the same

sire but different mothers. Thus the genetic correlation parameter αi,jj′ ≡ 1/4 is known

a priori, based on the half-sibling relationships. The phenotypic trajectories are notably

irregularly and sparsely observed. The number Ni of half-siblings per family ranges from


one to eight; see Figure 2.1a for the distribution of the Ni's. Weighings occurred at ages

ranging from 548 to 2553 days, i.e., T = [548, 2553]. The number Nij of weighings per

individual varied from 1 to 62, and a histogram of the Nij’s is shown in Figure 2.1b. Data

were affected by some additional environmental factors, but for simplicity, we have not

included them in our model. Including such fixed effects is, in general, straightforward,

and would allow the user to model variability that is not completely due to individual

effects.

Figure 2.1: Beef cattle data: frequency distributions. (a) Number of siblings per sire; (b) number of observations per cow.

The estimated mean function is shown in Figure 2.2, and shows, approximately, a

yearly cyclical pattern that depicts the seasonal weight changes of beef cattle. The non-

negative definite covariance estimates (2.13) for the genetic and environmental processes

are shown in Figure 2.3a and 2.3b. We see that the genetic covariance is not as strong

as the environmental covariance. Indeed, the environmental process explains about five and a half times as much variability as the genetic process. However, the two covariances do

exhibit similar patterns, with relatively high variation at late times. Another observation

is that the environmental covariance seems to increase over time, which is not surprising

as environmental influences may accumulate as the cows age. We used a threshold of 98%

to select the number of principal components for the genetic and environmental processes.


Thus, the $K_g = 3$ genetic principal components, with $\hat{\lambda}_1 = 4.4 \times 10^{5}$, $\hat{\lambda}_2 = 2.1 \times 10^{5}$, and $\hat{\lambda}_3 = 3.9 \times 10^{4}$, explained 62.5%, 29.9%, and 5.6% of the genetic variation, respectively. The $K_e = 4$ environmental principal components, with $\hat{\rho}_1 = 3.1 \times 10^{6}$, $\hat{\rho}_2 = 3.1 \times 10^{5}$, $\hat{\rho}_3 = 2.0 \times 10^{5}$, and $\hat{\rho}_4 = 1.3 \times 10^{5}$, explained 81.6%, 8.1%, 5.2%, and 3.4% of the environmental

variation, respectively. The estimated genetic and environmental eigenfunctions are given

in Figures 2.4a and 2.4b, respectively. From the first two eigenfunctions in each panel,

one can see that the dominant variation in the genetic process concentrates around 2000

days and includes a contrast between weights at 1200 days and at 2300 days. The

environmental effect shows a more constant influence over time with an early slow increase

followed by a sharp drop after 2000 days (or vice versa). The updating step of the genetic

and environmental covariances did not noticeably alter the estimates and was not needed

for this analysis.

Figure 2.2: Estimated mean function (dark) with observed trajectories (light) for the beef cattle data. Axes: age (days) versus weight (kg).

We are primarily interested in predicting the growth of beef cattle from sparsely ob-

served measurements. It is thus informative to assess the proposed method by comparing

it with the PACE method that treats all individuals independently, i.e., that does not take familial genetic correlation into account. We calculate the leave-one-family-out cross-validation error given by $\sum_i\sum_j\sum_k\{U_{ijk} - \hat{X}^{-i}_{ij}(T_{ijk})\}^2$, where $\hat{X}^{-i}_{ij}$ is the predicted

phenotype of the jth cow in the ith family. Specifically, the model components are es-


Figure 2.3: Non-negative definite estimates of the (a) genetic and (b) environmental covariance functions for the beef cattle data, plotted against age (days).

Figure 2.4: Shown are the first (solid), second (dashed), third (dash-dot), and fourth (dotted) eigenfunctions, plotted against age (days). Left: the first three eigenfunctions of the genetic process, accounting for 98% of the genetic variance. Right: the first four eigenfunctions of the environmental process, explaining 98.3% of the environmental variance.


timated based on data excluding family i using the method described in Section 2.3.1.

Then the FPC scores $\hat{\xi}^{-i}_{ijl}$ and $\hat{\zeta}^{-i}_{ijm}$ are obtained by substituting these leave-one-family-out estimates, $\hat{\mu}^{-i}, \hat{\lambda}^{-i}_l, \hat{\rho}^{-i}_m, \hat{\phi}^{-i}_l, \hat{\psi}^{-i}_m, \hat{\Sigma}^{-i}_{i,jj'}$, into (2.15) and (2.16), leading to $\hat{X}^{-i}_{ij}$. We use $K_g^{-i}$ and $K_e^{-i}$ leading eigenfunctions, chosen to explain 98% of, respectively, the genetic

and the environmental functional variation in the data. The reconstruction using the

PACE method is obtained in a similar manner. See Yao et al. (2005a) for details. Not

surprisingly, the proposed FACE method considerably improves upon the PACE method

by around 18%. Shown in Figure 2.5 are the cross-validated trajectory estimates for

offspring of two of the fifteen families using the FACE and PACE methods. We observe

that FACE offers improved predictions for these eight cows.

Figure 2.5: Estimated trajectories by leave-one-family-out cross-validation (CV) for two families of cows obtained using the FACE method (solid) and the PACE method (dashed), where the first row presents two half-siblings from one family and the bottom three rows present six half-siblings from another family. The legend shows the relative CV error of each cow, $\sum_{k=1}^{N_{ij}}\{U_{ijk} - \hat{X}^{-i}_{ij}(T_{ijk})\}^2/U_{ijk}^2$, obtained from the two methods, where $\hat{X}^{-i}_{ij}$ is as described in Section 2.4.


2.5 Simulated Examples

To further illustrate the performance of the proposed method, we carry out two simulation

studies. For Simulation I, we closely mimic the cow data, using the same design, e.g.,

the same family sizes and times of weighings. The underlying model is (2.7) with Kg

terms for the genetic component and Ke terms for the environmental component. The

environmental covariance is derived from the first four estimated eigenfunctions, i.e.,

Ke = 4. In view of the importance of the genetic component, we examine three values of

Kg: Kg = 1, 2, 3, and we use the corresponding genetic eigenfunctions estimated from the

data. We use the half-sibling relationship coefficient αi,jj′ = 1/4 for all i, j, and j′ ≠ j.

The genetic and environmental FPC scores ξijl and ζijm and the measurement errors εijk

are independently generated from normal distributions, respectively, using the estimated

eigenvalues and error variance from the data. To focus our attention on the covariances

and FPCs, we set the mean function µ to 0 in the data generation but still treat it

as unknown in our analysis. For each underlying model, we generate 100 Monte Carlo

samples, and produce two versions of $\hat{X}_{ij}$: the FACE estimate that respects the familial

genetic relationship, and the PACE estimate that ignores familial dependence. To select

Kg and Ke, we again use a 98% threshold for the fraction of variance explained. Within

each sample and for each estimation method, we calculate the integrated squared error

(ISE) for the $j$th individual in the $i$th family, $\mathrm{ISE}_{ij} = \int_{\mathcal{T}}\{\hat{X}_{ij}(t) - X_{ij}(t)\}^2\,dt$, and the overall ISE is defined as $\mathrm{ISE} = \sum_{i,j}\mathrm{ISE}_{ij}$. Improvements of the proposed FACE method

upon the PACE method are summarized in Table 2.1, which indicates a substantial

improvement of 21% to 25%.

In Simulation II, we again follow model (2.7), but with $\mu(t) = t + \sin(2\pi t)$, $\phi_1(t) = \psi_1(t) = -\cos(2\pi t/10)/\sqrt{5}$ and $\phi_2(t) = \psi_2(t) = \sin(2\pi t/10)/\sqrt{5}$, with corresponding eigenvalues $\lambda_1 = 10$, $\lambda_2 = 5$ and $\rho_1 = 100$, $\rho_2 = 10$. The genetic and environmental FPC

scores are generated from normal distributions, and the measurement error εijk is from

N(0, 0.01). We still generate data for 15 families, but the number of siblings within


family is chosen uniformly from {2, . . . , 6} and the number of observations per subject is

chosen uniformly from {5, . . . , 20}. The observation times are uniformly distributed on

[0, 10]. With 100 Monte Carlo samples, the ISE based on the FACE method incorporating

genetic correlation outperformed the PACE method by 30% for the case of half-sibling

families with αi,jj′ = 1/4 for j ≠ j′, and by 25% for the case of full-sibling families with αi,jj′ = 1/2 for j ≠ j′. See Table 2.1.

Table 2.1: ISE improvement (%) of the proposed FACE method upon PACE, where Simulation I uses data-based models with different values of (Kg, Ke) and Simulation II examines half-sibling (α = 0.25) and full-sibling (α = 0.5) family relationships.

                  (Kg, Ke)   Mean (SE)    1st Quartile   Median   3rd Quartile
  Simulation I    (1, 4)     21.4 (1.5)   15.1           23.5     28.7
                  (2, 4)     25.1 (1.6)   12.9           28.9     36.3
                  (3, 4)     21.9 (1.6)   10.9           24.7     32.6

                  α          Mean (SE)    1st Quartile   Median   3rd Quartile
  Simulation II   0.25       30.4 (3.1)   13.4           39.0     52.8
                  0.50       25.4 (3.0)   11.7           30.4     45.4

2.6 Conclusion

In this chapter, we propose a version of functional data analysis for trajectories of geneti-

cally related individuals from independent families. We are able to estimate various levels

of variation: the genetic covariance, the environmental covariance induced by external

factors, and the measurement error variance. A new method, named FACE, is proposed

to take into account the familial correlation when estimating the genetic random effects.

By making use of the auto-covariance function of each individual, we also develop a sim-

ple step to update estimates of the genetic and environmental covariance functions. We

apply our method to study the growth over time of families of half-sibling cows, and show via data analysis and simulation studies that, for predicting underlying trajectories, our proposal improves considerably upon the existing PACE method designed for a sample of independent subjects.


While our method does well on its own, it can also be part of a hybrid approach. Our

proposed methodology can be used for dimension reduction, specifically to determine a

handful of eigenfunctions that can then be used as basis functions in further analysis.

For instance, the basis functions might be used in a parsimonious mixed effects random

regression analysis, a method that is computationally burdensome with even a moderate

number of basis functions.


Chapter 3

Cumulative Slicing Estimation for

Dimension Reduction



3.1 Introduction

In functional data analysis (FDA), one is often interested in how a scalar response Y ∈ R

varies with a smooth trajectory X(t), where t is an index variable defined on a closed

interval T (see Ramsay and Silverman, 2005, for a comprehensive overview). To be specific,

one seeks to model the relationship Y = M(X; ε), where M is a smooth functional

and the error process ε has zero mean, finite variance σ2, and is independent of X.

While modeling M parametrically can be restrictive in many applications, modeling M

nonparametrically is practically infeasible due to slow convergence rates associated with

the “curse of dimensionality”. Therefore a class of semiparametric index models has been

proposed to approximate M(X; ε) with an unknown link function g : RK+1 → R,

$$Y = g\big(\langle\beta_1, X\rangle, \ldots, \langle\beta_K, X\rangle; \varepsilon\big), \tag{3.1}$$

where K is the reduced dimension of the model, β1, . . . , βK are linearly independent index

functions, and $\langle u, v\rangle = \int u(t)v(t)\,dt$ is the usual $L^2$ inner product. The functional linear model (FLM), $Y = \beta_0 + \int \beta_1(t)X(t)\,dt + \varepsilon$, is a special case and has been extensively studied

(Cardot et al., 1999, Muller and Stadtmuller, 2005, Yao et al., 2005b, Cai and Hall, 2006,

Hall and Horowitz, 2007, Yuan and Cai, 2010, among others).

In this chapter, we tackle the index model (3.1) from the perspective of effective di-

mension reduction (EDR), in the sense that the K linear projections 〈β1, X〉, . . . , 〈βK , X〉

form a sufficient statistic. This is particularly useful when the process X is infinite-

dimensional. Our primary goal is to offer a novel treatment of dimension reduction for

functional data, especially when the trajectories are corrupted with noise and sparsely

observed with a few observations for some, or even all of the subjects. Pioneered by Li

(1991) for multivariate data, EDR methods are typically “link-free”, requiring neither

specification nor estimation of the link function (Duan and Li, 1991), and the objective

is to characterize the K-dimensional EDR space SY |X = span(β1, . . . , βK) onto which to


project X. Such index functions βk are referred to as EDR directions, K is called the

structural dimension of the EDR space, and SY |X is also known as the central subspace

(Cook, 1998). Li (1991) characterized SY |X via the inverse mean E[X|Y ], namely the

sliced inverse regression (SIR), and has since motivated a large body of related works for

multivariate data: Cook and Weisberg (1991) estimated var(X|Y ), Li (1992) dealt with

the Hessian matrix of the regression curve, Xia et al. (2002) proposed minimum aver-

age variance estimation as an adaptive approach based on kernel methods, Chiaromonte

et al. (2002) used partial SIR for categorical predictors, Li and Wang (2007) worked with

empirical directions, Zhu et al. (2010) proposed cumulative slicing estimation (CUME)

to improve upon SIR, among others.

The literature of EDR methods for functional data has been relatively scarce. Notably,

Ferre and Yao (2003) extended SIR to completely observed functional data (FSIR), and

Li and Hsing (2010) developed sequential χ2 testing procedures to decide the structural

dimension of the EDR space obtained using FSIR. Besides EDR approaches, James and

Silverman (2005) estimated the index and link functions jointly for an additive form

$g(\langle\beta_1, X\rangle, \ldots, \langle\beta_K, X\rangle; \varepsilon) = \beta_0 + \sum_{k=1}^{K} g_k(\langle\beta_k, X\rangle) + \varepsilon$, assuming that the trajectories

were densely or completely observed, and the index and link functions were elements of a

finite dimensional spline space. Chen et al. (2011) estimated the index and additive link

functions nonparametrically and relaxed the finite dimensional assumption for theoretical

analysis, but retained the dense design as a crucial condition.

To the best of our knowledge, none of the existing work addresses dimension reduc-

tion for sparse functional data in the context of multiple-index type models (3.1). Similar

to suggestions from James and Silverman (2005) and Chen et al. (2011), Ferre and Yao

(2003) remarked that, in practice, the functional trajectories could first be recovered,

but this cumbersome two-step procedure deviates from the spirit of EDR analysis. In

contrast, we aim to estimate the EDR space directly by drawing our inspiration from cu-

mulative slicing for multivariate data (Zhu et al., 2010). When adapted to the functional


setting, cumulative slicing offers a novel way of borrowing strength across subjects to

handle sparsely observed trajectories. This key advantage has not been leveraged elsewhere. As we will demonstrate later, though the extension of cumulative slicing to

completely observed functional data is straightforward, it takes a materially different

strategy for sparse design by maximizing the use of available data. We also provide a

rigorous theoretical analysis of the proposed method, namely Functional Cumulative Slic-

ing (FCS), for sparse functional data, which reveals the bias-variance tradeoff associated

with the regularizing truncation and the decaying structures of the predictor process and

the EDR space.

The rest of the chapter is organized as follows. We present the proposed FCS method-

ology and its sample estimation procedure in Chapter 3.2. Chapter 3.3 details asymptotic

properties of the relevant estimates obtained from FCS. Chapter 3.4 provides numeri-

cal studies of simulated examples, and Chapter 3.5 offers two data applications, one on

sparsely observed functional data and the other on densely observed functional data.

Concluding remarks are given in Chapter 3.6, while technical proofs are relegated to the

Appendix.

3.2 Methodology

Let T be a compact interval, and X be a random variable defined on the real and

separable Hilbert space $H \equiv L^2(\mathcal{T})$ endowed with inner product $\langle f, g\rangle = \int_{\mathcal{T}} f(t)g(t)\,dt$ and norm $\|f\|_H = \sqrt{\langle f, f\rangle}$. We assume for simplicity that

Assumption 3.1. $X$ is centered and has a finite fourth moment, $\int_{\mathcal{T}} E[X^4(t)]\,dt < \infty$.

Under Assumption 3.1, the covariance surface of $X$ is given by $\Sigma(s, t) = E[X(s)X(t)]$, which generates a Hilbert-Schmidt operator $\Sigma$ on $H$ that maps $f$ to $(\Sigma f)(s) = \int_{\mathcal{T}} \Sigma(s, t)f(t)\,dt$. This operator can be written succinctly as $\Sigma = E[X \otimes X]$, where the tensor product $u \otimes v$ denotes the rank-one operator on $H$ that maps $w$ to $(u \otimes v)w = \langle u, w\rangle v$. By Mercer's


Theorem, $\Sigma$ admits a spectral decomposition $\Sigma = \sum_{j=1}^{\infty} \alpha_j\phi_j \otimes \phi_j$, where the eigenfunctions $\{\phi_j\}_{j=1,2,\ldots}$ form a complete and orthonormal system in $H$ and the eigenvalues $\{\alpha_j\}_{j=1,2,\ldots}$ are assumed to be strictly decreasing and positive such that $\sum_{j=1}^{\infty} \alpha_j < \infty$. Finally, recall

that the EDR directions β1, . . . , βK in model (3.1) are linearly independent functions in

H, and the response Y ∈ R is assumed to be conditionally independent of X given the

K projections 〈β1, X〉, . . . , 〈βK , X〉.

As a close comparison, we briefly review functional sliced inverse regression that

targets the EDR space through $\Lambda_{\mathrm{SIR}} = \operatorname{var}(E[X \mid Y])$, the operator associated with the covariance of the inverse mean. It partitions the range of $Y$ into a user-specified partition of $S$ slices $I_1, \ldots, I_S$, where $I_s$ denotes the interval $(y_{s-1}, y_s]$ with $-\infty = y_0 < y_1 < \ldots < y_S = +\infty$. Observe that $E[X \mid Y \in I_s] = E[X\mathbf{1}(Y \in I_s)]/P(Y \in I_s) \equiv m_s/p_s$. Then FSIR approximates $\Lambda_{\mathrm{SIR}}$ by its sliced version $\Lambda_0 = \sum_{s=1}^{S} p_s^{-1} m_s \otimes m_s$. From multivariate

SIR, it is well known that the number of slices is associated with a bias-variance tradeoff.

The number of slices must be larger than the structural dimension in order to fully

characterize SY |X , but if it is too large, the variance will increase as ps will be close

to zero. It is easy to see that applying FSIR to sparsely observed functional data is

practically infeasible, since the combination of the sparsely observed X and the delicate

need of choosing a sufficiently large number of slices would inevitably result in too few

observations in each slice with which to estimate Λ0.

3.2.1 Validity of Functional Cumulative Slicing

To avoid the nontrivial selection of the number of slices in SIR, Zhu et al. (2010) noted

that for a fixed y, using two slices I1 = (−∞, y] and I2 = (y,+∞) would maximize the use

of data and minimize the variability in each slice. In light of the foregoing discussion on

the limitations of FSIR for sparse functional data, the choice of two slices is thus critical to

ensure that each slice has a sufficient number of observations. The kernel of the operator

$\Lambda_0$ then reduces to $\Lambda_0(s, t; y) \propto m(s, y)m(t, y)$, where $m(\cdot, y) = E[X(\cdot)\mathbf{1}(Y \le y)]$ is an


unconditional expectation in contrast to the conditional expectation E[X(·)|Y ∈ Is] of

FSIR. However, such a kernel Λ0 can recover at most one direction of SY |X for a fixed

$y \in \mathbb{R}$. It is necessary to combine all possible estimates of $m(\cdot, y)$ by letting $y$ run across the support of $\tilde{Y}$, an independent copy of $Y$. Therefore the kernel of the proposed

functional cumulative slicing (FCS) is given by

$$\Lambda(s, t) = E\big[m(s, \tilde{Y})m(t, \tilde{Y})w(\tilde{Y})\big], \tag{3.2}$$

where w(y) is a known nonnegative weight function for generality. Denote the corre-

sponding integral operator of Λ(s, t) also by Λ. The following theorem establishes the

validity of FCS. Analogous to the multivariate case, the linearity assumption is needed,

Assumption 3.2 (Linearity). For any function $b \in H$, there exist constants $c_0, \ldots, c_K \in \mathbb{R}$ such that
$$E\big[\langle b, X\rangle \mid \langle\beta_1, X\rangle, \ldots, \langle\beta_K, X\rangle\big] = c_0 + \sum_{k=1}^{K} c_k\langle\beta_k, X\rangle.$$

Condition 3.2 is satisfied when X has an elliptically contoured distribution, which is

more general than, but bears a close connection to, a Gaussian process (Cambanis et al.,

1981, Li and Hsing, 2010).

Theorem 3.1. If assumptions 3.1 and 3.2 hold for model (3.1), then the linear space

spanned by m(t, y), y ∈ R, is contained in the linear space spanned by {Σβ1, . . . ,ΣβK},

i.e., $\operatorname{span}(\{m(t, y) : y \in \mathbb{R}\}) \subseteq \operatorname{span}(\Sigma\beta_1, \ldots, \Sigma\beta_K)$.

An important observation from Theorem 3.1 is that for any b ∈ H orthogonal to

the space spanned by {Σβ1, . . . ,ΣβK}, we have 〈b,Λx〉 = 0 for any x ∈ H, implying range(Λ) ⊆

span(Σβ1, . . . ,ΣβK). If Λ has K non-zero eigenvalues, the space spanned by its eigenfunc-

tions is precisely span(Σβ1, . . . ,ΣβK). In principle, we can deduce SY |X = span(β1, . . . , βK)

from Σ and Λ. Recall that our target is the central subspace SY |X , even though the EDR

directions themselves are not identifiable. For specificity, we regard these eigenfunc-


tions of Σ−1Λ associated with the K largest non-zero eigenvalues as the index functions

β1, . . . , βK themselves unless stated otherwise.

It is worth mentioning that since the covariance operator Σ is Hilbert-Schmidt, its

inverse Σ−1 is not well-defined so the EDR directions may not even exist in H. Following

He et al. (2003) for functional canonical correlation, let RΣ denote the range of Σ and

$R_\Sigma^{-1} = \big\{b \in H : \sum_{j=1}^{\infty} \alpha_j^{-1}\langle b, \phi_j\rangle^2 < \infty,\ b \in R_\Sigma\big\}$. Restricted to $R_\Sigma^{-1}$, $\Sigma$ is a one-to-one operator from $R_\Sigma^{-1} \subset H$ onto $R_\Sigma$ whose inverse is defined by $\Sigma^{-1} = \sum_{j=1}^{\infty} \alpha_j^{-1}\phi_j \otimes \phi_j$.

Let ξj = 〈X,φj〉 denote the jth principal component (or generalized Fourier coefficient)

of X, and assume that

Assumption 3.3. $\sum_{j=1}^{\infty}\sum_{l=1}^{\infty} \alpha_j^{-2}\alpha_l^{-1} E^2\big\{E[\xi_j\mathbf{1}(Y \le \tilde{Y}) \mid \tilde{Y}]\,E[\xi_l\mathbf{1}(Y \le \tilde{Y}) \mid \tilde{Y}]\big\} < \infty$.

Proposition 3.1. Under assumptions 3.1-3.3, the eigenspace associated with the K non-

null eigenvalues of Σ−1Λ is well defined in H.

This is a direct analogue to Theorem 4.8 in He et al. (2003) and Theorem 2.1 in Ferre

and Yao (2005), thus the proof is omitted for conciseness.

3.2.2 Functional Cumulative Slicing for Sparse Functional Data

For the data $\{(X_i, Y_i) : 1 \le i \le n\}$, independently and identically distributed (i.i.d.) as $(X, Y)$, the predictor trajectories $X_i$ are observed intermittently, contaminated with noise, and collected in the form of repeated measurements $\{(T_{ij}, U_{ij}) : 1 \le i \le n,\ 1 \le j \le N_i\}$, where $U_{ij} = X_i(T_{ij}) + \varepsilon_{ij}$ with i.i.d. measurement errors $\varepsilon_{ij}$ that are of zero mean, constant variance $\sigma_x^2$, and independent of all other random variables. When only

a few observations are available for some or even all subjects, individual smoothing to

recover Xi is infeasible and one must adopt the strategy of pooling together data from

across subjects for consistent estimation.

To estimate the FCS kernel Λ defined in (3.2), the key quantity is the unconditional

mean $m(t, y) = E[X(t)\mathbf{1}(Y \le y)]$. For sparsely and irregularly observed $X_i$, the cross-


sectional estimation used in multivariate cumulative slicing is no longer applicable. To

maximize the use of available data, we propose to pool together the repeated measure-

ments across subjects via a scatterplot smoother, which works seamlessly in conjunction

with the strategy of cumulative slicing. For specificity, we use a local linear estimator

$\hat{m}(t, y) = \hat{a}_0$ (Fan and Gijbels, 1996), minimizing

$$\min_{(a_0, a_1)} \sum_{i=1}^{n}\sum_{j=1}^{N_i} \big\{U_{ij}\mathbf{1}(Y_i \le y) - a_0 - a_1(T_{ij} - t)\big\}^2 K_1\!\left(\frac{T_{ij} - t}{h_1}\right), \tag{3.3}$$

where K1 is a non-negative and symmetric univariate kernel density and h1 = h1(n)

is the bandwidth to control the amount of smoothing. Here we follow the suggestion of

ignoring the dependency among the data from the same individual (Lin and Carroll, 2000,

for smoothing correlated data), and use leave-one-curve-out cross-validation to select h1

(Rice and Silverman, 1991). Then an estimator of the FCS kernel function Λ(s, t) is given

by its sample moment,

$$\hat{\Lambda}(s, t) = \frac{1}{n}\sum_{i=1}^{n} \hat{m}(s, Y_i)\hat{m}(t, Y_i)w(Y_i). \tag{3.4}$$
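A compact sketch of (3.3)-(3.4) follows, assuming the pooled observations are stacked into flat arrays with an index array `id_pool` mapping each measurement to its subject (a layout chosen for this sketch); the local linear weights are precomputed per grid point since they do not depend on y, and the Epanechnikov kernel is an illustrative choice.

```python
import numpy as np

def fcs_kernel(t_pool, u_pool, id_pool, Y, grid, h1, w=None,
               kernel=lambda v: 0.75 * np.maximum(1.0 - v ** 2, 0.0)):
    """Estimate m(t, y) at y = Y_i for all i via (3.3), then form the FCS
    kernel estimate (3.4) on grid x grid."""
    n = len(Y)
    w = np.ones(n) if w is None else w
    ind = Y[id_pool][None, :] <= Y[:, None]       # 1(Y_{i'} <= Y_i) per pooled obs
    m_hat = np.zeros((n, len(grid)))
    for g, t0 in enumerate(grid):
        d = t_pool - t0
        kw = kernel(d / h1)
        s0, s1, s2 = kw.sum(), (kw * d).sum(), (kw * d ** 2).sum()
        ll_w = kw * (s2 - d * s1) / max(s0 * s2 - s1 ** 2, 1e-12)  # local linear weights
        m_hat[:, g] = ind @ (u_pool * ll_w)       # mhat(t0, Y_i) for all i at once
    Lam = (m_hat * w[:, None]).T @ m_hat / n      # (3.4) evaluated on the grid
    return m_hat, Lam
```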

For the covariance operator Σ, following Yao et al. (2005a), denote the observed raw

covariances by $G_i(T_{ij}, T_{il}) = U_{ij}U_{il}$ and note that $E[G_i(T_{ij}, T_{il}) \mid T_{ij}, T_{il}] = \operatorname{cov}(X(T_{ij}), X(T_{il})) + \sigma_x^2\delta_{jl}$, where $\delta_{jl}$ is 1 if $j = l$ and 0 otherwise. This suggests the diagonal of the raw co-

variances should be removed, and minimizing

$$\min_{(b_0, b_1, b_2)} \sum_{i=1}^{n}\sum_{1 \le j \neq l \le N_i} \big\{G_i(T_{ij}, T_{il}) - b_0 - b_1(T_{ij} - s) - b_2(T_{il} - t)\big\}^2 K_2\!\left(\frac{T_{ij} - s}{h_2}, \frac{T_{il} - t}{h_2}\right) \tag{3.5}$$

yields $\hat{\Sigma}(s, t) = \hat{b}_0$, where $K_2$ is a non-negative bivariate kernel density and $h_2 = h_2(n)$ is the bandwidth chosen by leave-one-curve-out cross-validation; see Yao et al. (2005a)

for details on the implementation. Since the inverse operator Σ−1 is unbounded, we

regularize it by projection onto a truncated subspace. To be precise, let sn be a possibly


divergent sequence and $\Pi_{s_n} = \sum_{j=1}^{s_n} \phi_j \otimes \phi_j$ (resp. $\hat{\Pi}_{s_n} = \sum_{j=1}^{s_n} \hat{\phi}_j \otimes \hat{\phi}_j$) denote the orthogonal projector onto the eigensubspace associated with the $s_n$ largest eigenvalues of $\Sigma$ (resp. $\hat{\Sigma}$). Then $\Sigma_{s_n} = \Pi_{s_n}\Sigma\Pi_{s_n}$ (resp. $\hat{\Sigma}_{s_n} = \hat{\Pi}_{s_n}\hat{\Sigma}\hat{\Pi}_{s_n}$) is a sequence of finite-rank operators converging to $\Sigma$ (resp. $\hat{\Sigma}$) as $n \to \infty$, with bounded inverse
$$\Sigma_{s_n}^{-1} = \sum_{j=1}^{s_n} \alpha_j^{-1}\phi_j \otimes \phi_j, \qquad \hat{\Sigma}_{s_n}^{-1} = \sum_{j=1}^{s_n} \hat{\alpha}_j^{-1}\hat{\phi}_j \otimes \hat{\phi}_j, \tag{3.6}$$

respectively. Finally, we obtain the eigenfunctions associated with the $K$ largest nonzero eigenvalues of $\hat{\Sigma}_{s_n}^{-1}\hat{\Lambda}$ as the estimates $\{\hat{\beta}_{k,s_n}\}_{k=1,\ldots,K}$ of the EDR directions.
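On an equally spaced grid, this final step reduces to linear algebra; the sketch below forms the truncated inverse (3.6) and extracts the top-K eigenfunctions of the composed operator. It illustrates the construction under these discretization assumptions and is not the thesis implementation.

```python
import numpy as np

def edr_directions(Sigma_hat, Lam_hat, grid, s_n, K):
    """EDR direction estimates: top-K eigenfunctions of Sigma_sn^{-1} Lambda_hat,
    with both surfaces given as matrices on an equally spaced grid."""
    dt = grid[1] - grid[0]
    a, V = np.linalg.eigh(Sigma_hat * dt)             # discretized operator spectrum
    order = np.argsort(a)[::-1]
    a, V = a[order][:s_n], V[:, order][:, :s_n] / np.sqrt(dt)  # L2-normalized phis
    # matrix of the operator f -> Sigma_sn^{-1} (Lambda f) on the grid
    proj = (V / a) @ (V.T * dt) @ (Lam_hat * dt)
    evals, evecs = np.linalg.eig(proj)                # generally non-symmetric
    top = np.argsort(-evals.real)[:K]
    betas = evecs[:, top].real
    betas /= np.sqrt((betas ** 2).sum(axis=0) * dt)   # normalize in L2
    return betas
```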

The situation for completely observed $X_i$ is similar to the multivariate case and considerably simpler. The quantities $m(t, y)$ and $\Sigma(s, t)$ are easily estimated by their respective sample moments $\hat{m}(t, y) = n^{-1}\sum_{i=1}^{n} X_i(t)\mathbf{1}(Y_i \le y)$ and $\hat{\Sigma}(s, t) = n^{-1}\sum_{i=1}^{n} X_i(s)X_i(t)$,

while the estimate of Λ remains the same as (3.4). For densely observed Xi, individual

smoothing can be used as a preprocessing step to recover smooth trajectories and the

estimation error introduced in this step can be shown to be asymptotically negligible un-

der certain design conditions, i.e., it is equivalent to the ideal situation of the completely

observed Xi’s (Hall et al., 2006).

Remarks. (i) For small values of Yi, m(·, Yi) obtained by (3.3) may be unstable due

to the smaller number of pooled observations in the slice. A suitable weight function w

may be used to refine the estimator Λ(s, t). In our numerical studies, the naive choice

of w ≡ 1 performed fairly well compared to other methods. Analogous to the multivariate

case, choosing an optimal w remains an open question. (ii) Ferre and Yao (2005) avoided

inverting Σ with the claim that for a finite rank operator Λ, range(Λ−1Σ) = range(Σ−1Λ);

however, Cook et al. (2010) showed that this required more stringent conditions that

are not easily fulfilled. (iii) Regularization can also be tackled with a ridge penalty

(Σ + ρI)−1, where ρ > 0 and I is the identity operator. However, numerical results from

this regularization scheme are observed to be inferior to those from spectral truncation,


and thus not pursued further. (iv) For selecting the structural dimension K, the only

relevant work to date is Li and Hsing (2010), where sequential χ2 tests are developed to

determine K in the context of FSIR for completely observed functional data. How to

extend such tests (if feasible at all) to sparse functional data is a substantive problem

that deserves further exploration. Nevertheless, since prediction is the primary concern

in many applications, both K and sn can be easily chosen by minimizing prediction error

when there is a sensible model in place. In the simulated and real data examples we adopt this principle, which performs well empirically.

3.3 Asymptotic Properties

In this section we present asymptotic properties of the FCS kernel operator and the EDR

directions for sparsely observed functional data. Here the number of measurements Ni

and the observation times Tij are considered random to reflect a sparse and irregular

design. Specifically, we assume that

Assumption 3.4. The $N_i$ are random variables with $N_i \overset{\text{i.i.d.}}{\sim} N$, where $N$ is a bounded positive discrete random variable with $P\{N \ge 2\} > 0$, and $(\{T_{ij}, j \in J_i\}, \{U_{ij}, j \in J_i\})$ are independent of $N_i$ for $J_i \subseteq \{1, \ldots, N_i\}$.

Writing $T_i = (T_{i1}, \ldots, T_{iN_i})^{\top}$ and $U_i = (U_{i1}, \ldots, U_{iN_i})^{\top}$, the data quadruplets $Z_i = \{T_i, U_i, Y_i, N_i\}$ are thus i.i.d. Note that extremely sparse designs are covered, with only

a few measurements for each subject. Other regularity conditions are standard and listed

in the Appendix, including assumptions on the smoothness of the mean and covariance

functions of X, the distributions of the observation times, the bandwidths and kernel

functions used in the smoothing steps. Denote $\|A\|_H^2 = \int_{\mathcal{T}}\int_{\mathcal{T}} A^2(s, t)\,ds\,dt$ for $A \in L^2(\mathcal{T} \times \mathcal{T})$.

Theorem 3.2. Under assumptions 3.1, 3.4 and 3.7–3.10 in the Appendix, we have

$$\big\|\hat{\Lambda} - \Lambda\big\|_H = O_p\!\left(\frac{1}{\sqrt{n}\,h_1}\right), \qquad \big\|\hat{\Sigma} - \Sigma\big\|_H = O_p\!\left(\frac{1}{\sqrt{n}\,h_2}\right).$$


The key result here is the L2 convergence of the estimated FCS operator Λ, in which we

exploit the projections of nonparametric U -statistics coupled with an important decom-

position of m(·, y) to overcome the difficulty caused by the dependence among irregularly

spaced measurements. Note that Λ is obtained by averaging the smoothers m(·, Yi) over

Yi, which is crucial to achieve the univariate convergence rate for this bivariate estima-

tor. The convergence of the covariance operator Σ is presented for completeness, given

in Theorem 2 of Yao and Muller (2010).

We are now ready to characterize the estimation of the central subspace SY |X =

span(β1, . . . , βK). Unlike the multivariate or finite-dimensional case where the conver-

gence of SY |X follows immediately from the boundedness of Σ−1, we have to approximate

Σ−1 with a sequence of truncated estimates Σ−1sn in (3.6). Recall that we specifically

regarded the index functions {β1, . . . , βK} as the eigenfunctions associated with the K

largest eigenvalues of Σ−1Λ to suppress the identifiability concern. It is thus equivalent to

consider {β1,sn , . . . , βK,sn} in place of SY |X . For an arbitrary constant C > 0, we require

the eigenvalues of Σ to satisfy

Assumption 3.5. $\alpha_1 > \alpha_2 > \ldots > 0$, $E\xi_j^4 \le C\alpha_j^2$ for $j \ge 1$, and $\alpha_j - \alpha_{j+1} \ge C^{-1}j^{-a-1}$ for $j \ge 1$.

This condition on the decaying speed of eigenvalues αj prevents the spacings between

consecutive eigenvalues from being too small, which also implies αj ≥ Cj−a and, together

with the boundedness of $\Sigma$, $a > 1$. Expressing the index functions as $\beta_k = \sum_{j=1}^{\infty} b_{kj}\phi_j$, $k = 1, \ldots, K$, we impose a decaying structure on their generalized Fourier coefficients $b_{kj} = \langle\beta_k, \phi_j\rangle$,

Assumption 3.6. $|b_{kj}| \le Cj^{-b}$ for $j \ge 1$ and $1 \le k \le K$, where $a + \tfrac{1}{2} < b$.

This implies that {βk}k=1,...,K is smoother relative to Σ. Here, we require a stronger

condition than a/2 + 1 < b assumed by Hall and Horowitz (2007) for functional linear


model with completely observed Xi. This is not unexpected, as the index model (3.1) is

more flexible and we are dealing with sparse functional data.

Theorem 3.3. Under conditions 3.1–3.6 and 3.7–3.10 in the Appendix, for all k =

1, . . . , K, we have

$$\big\|\hat{\beta}_{k,s_n} - \beta_k\big\|_H = O_p\!\left(\frac{s_n^{\frac{3}{2}a+1}}{\sqrt{n}\,h_1} + \frac{s_n^{(2a-b+2)_+}}{\sqrt{n}\,h_2} + \frac{1}{s_n^{\,b-a-\frac{1}{2}}}\right),$$

where $(2a - b + 2)_+ = \max(0,\, 2a - b + 2)$.

This result explicitly associates the convergence of βk,sn with the regularizing trun-

cation size sn and the decay rates of αj and bkj. Specifically, the first two terms are

attributed to the variability of estimating $\Sigma_{s_n}^{-1}\Lambda$ by $\hat{\Sigma}_{s_n}^{-1}\hat{\Lambda}$, and the last to the approximation bias of $\Sigma_{s_n}^{-1}\Lambda$. This indicates a bias-variance tradeoff associated with the

truncation size sn. One may view sn as a tuning parameter that controls the resolution

or smoothness of the covariance estimation. Furthermore, the first term of the variance

is due to $\|\hat{\Sigma}_{s_n}^{-1}\hat{\Lambda}\Sigma_{s_n}^{-1/2} - \Sigma_{s_n}^{-1}\Lambda\Sigma_{s_n}^{-1/2}\|$ (details in the Appendix) and becomes increasingly unstable with a larger truncation. The bias and the second part of the variance, contributed by $\|(\hat{\Sigma}_{s_n}^{-1} - \Sigma_{s_n}^{-1})\Lambda\Sigma_{s_n}^{-1/2}\|$, are to some extent determined by the relative smoothness of

Σ and βk, i.e., a smoother βk with a larger b leads to less discrepancy.

3.4 Simulations

In this section we illustrate the performance of the proposed FCS method in terms of esti-

mation and prediction. We compare the proposed FCS to (i) FSIR with 5 slices (FSIR5),

(ii) FSIR with 10 slices (FSIR10), (iii) functional index model with nonparametric link

(FIND) proposed by Chen et al. (2011), and (iv) functional linear model (FLM) as a

misspecified baseline for assessing prediction. Although FCS and FSIR are “link-free”

for estimating index functions βk, a general index model (3.1) may lead to model pre-


dictions with high variability, especially given relatively small sample sizes frequently

encountered in functional data analysis. Thus we follow Chen et al. (2011) by assuming

an additive structure on the link function $g$ in (3.1), i.e., $Y = \beta_0 + \sum_{k=1}^{K} g_k(\langle\beta_k, X\rangle) + \varepsilon$.

In each Monte Carlo run, a sample of n = 200 functional trajectories are generated

from the process $X_i(t) = \sum_{j=1}^{50} \xi_{ij}\phi_j(t)$, $t \in [0, 10]$, where $\phi_j(t) = \sin(\pi t j/5)/\sqrt{5}$ for $j$ even and $\phi_j(t) = \cos(\pi t j/5)/\sqrt{5}$ for $j$ odd, and the FPC scores $\xi_{ij}$ are i.i.d. $N(0, j^{-1.5})$.

For the setting of sparsely observed functional data, the number of observations per

subject Ni is chosen uniformly from {15, . . . , 20}, the observational times Tij are i.i.d.

U [0, 10], and the measurement error εij is i.i.d. N(0, 0.1). For densely observed func-

tional data, let Tij = 0.1(j − 1) for j = 1, . . . , 101. The EDR directions are generated by

$\beta_1(t) = \sum_{j=1}^{50} b_j\phi_j(t)$, where $b_j = 1$ for $j = 1, 2, 3$ and $b_j = 4(j-2)^{-3}$ for $4 \le j \le 50$, and $\beta_2(t) = \sqrt{3/10}\,(t/5 - 1)$, which is not representable with finitely many Fourier terms.

Since neither FSIR nor FIND is directly applicable to sparse functional data for

estimating βk, we adopt a two-stage method as suggested by Ferre and Yao (2003) and

Chen et al. (2011): first we use the PACE (Yao et al., 2005a) method, a functional

principal component approach specifically designed for sparse functional data that is publicly available at http://www.stat.ucdavis.edu/PACE, to recover $X_i$ with very little dimension reduction (using a fraction of variance explained of 99%), denoted by $\hat{X}_i$; then we apply FSIR or FIND to obtain $\hat{\beta}_{k,s_n}$.

are considered

Model I: $Y = \sin\big(\pi\langle\beta_1, X\rangle/4\big) + \varepsilon$,
Model II: $Y = \arctan\big(\pi\langle\beta_1, X\rangle/2\big) + \varepsilon$,
Model III: $Y = \sin\big(\pi\langle\beta_1, X\rangle/3\big) + \exp\big(\langle\beta_2, X\rangle/3\big) + \varepsilon$,
Model IV: $Y = \arctan\big(\pi\langle\beta_1, X\rangle\big) + \sin\big(\pi\langle\beta_2, X\rangle/6\big)/2 + \varepsilon$,

where the regression error ε is i.i.d. N(0, 1) for all models. Due to the nonidentifiability


of the $\beta_k$'s, we examine the projection operator of the EDR space, i.e., $P = \sum_{k=1}^{K_0} \beta_k \otimes \beta_k$ with $K_0$ denoting the true structural dimension. To assess the estimation of the EDR space, we calculate the average of the singular values of $(\hat{P}_{K,s_n} - P)$ as the model error, i.e., its operator norm $\|\hat{P}_{K,s_n} - P\|$ normalized by the number of singular values, with $\hat{P}_{K,s_n} = \sum_{k=1}^{K} \hat{\beta}_{k,s_n} \otimes \hat{\beta}_{k,s_n}$. We compute the average model error and its standard error

over 100 Monte Carlo repetitions, shown in Table 3.1. The structural dimension K and

the truncation parameter sn are chosen by minimizing the average model error. One can

see that, for sparse functional data, the proposed FCS outperforms the other methods for

all models, while FSIR and FIND may suffer from the two-stage approach for estimating

index functions. As expected, the gains in the setting of dense functional data are less

noticeable.
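Under one reading of the model-error criterion above (the average of the nonzero singular values of the discretized operator), the computation on a grid is as follows; columns of each input hold L2-normalized index functions, and the discretized operators absorb the grid spacing.

```python
import numpy as np

def model_error(beta_hat, beta_true, grid):
    """Average of the nonzero singular values of P_hat - P on a grid, where
    each P projects onto the span of the (column-wise) index functions."""
    dt = grid[1] - grid[0]
    P_hat = beta_hat @ beta_hat.T * dt     # discretized projection operators
    P = beta_true @ beta_true.T * dt
    sv = np.linalg.svd(P_hat - P, compute_uv=False)
    nz = sv[sv > 1e-8]
    return float(nz.mean()) if nz.size else 0.0
```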

To assess model prediction, we use a backfitting algorithm (Hastie and Tibshirani,

1990) to nonparametrically estimate the link functions $g_k$ by fitting $Y_i = \beta_0 + \sum_{k=1}^{K} g_k(Z_{ik}) + \varepsilon_i$, where $Z_{ik} = \langle\hat{\beta}_{k,s_n}, X_i\rangle$. For dense functional data, $Z_{ik}$ is given by an integral approximation. When the $X_i$ are sparse, we substitute $X_i$ with its PACE estimate $\hat{X}_i$. Unlike FSIR and FCS, FIND jointly estimates the index and link functions.

To calculate the prediction error, we additionally generate a validation sample of size 500 in each run, and calculate the Monte Carlo average of the mean squared prediction error $\mathrm{MSPE} = 500^{-1}\sum_{i=1}^{500}\big(Y_i^* - \hat{Y}_i^*\big)^2$ over different values of $K$ and $s_n$, where $\hat{Y}_i^* = \hat{\beta}_0 + \sum_{k=1}^{K} \hat{g}_k(Z_{ik}^*)$ and $Z_{ik}^* = \langle\hat{\beta}_{k,s_n}, X_i^*\rangle$ with $X_i^*$ being the underlying trajectories in the testing sample.

We report the minimized average MSPE and its standard error with corresponding

choice of {K, sn}, shown in Table 3.2. We see that the FCS substantially improves pre-

diction for sparse functional data for all models. In the dense data setting, the predictions from FCS and FSIR are virtually indistinguishable, while FIND seems to be suboptimal

and the misspecified FLM fails as expected. The structural dimension K is the inherent

parameter of the underlying model, while the truncation sn plays a role of a tuning pa-


Table 3.1: Shown are the model errors in the form of the operator norm ‖P̂K,sn − P‖ with standard errors (in parentheses), at the optimal K and sn that minimize the average model error over 100 Monte Carlo repetitions. In all cases the optimum was sn = 3, with K = 1 for Models I-II and K = 2 for Models III-IV.

  Design   Model   FCS           FSIR5         FSIR10        FIND
  Sparse   I       .476 (.016)   .540 (.016)   .555 (.018)   .492 (.014)
           II      .415 (.013)   .508 (.014)   .511 (.016)   .424 (.010)
           III     .640 (.009)   .667 (.009)   .692 (.010)   .654 (.008)
           IV      .610 (.006)   .625 (.007)   .637 (.007)   .620 (.007)
  Dense    I       .305 (.008)   .302 (.009)   .309 (.007)   .310 (.009)
           II      .248 (.006)   .254 (.007)   .257 (.007)   .290 (.007)
           III     .584 (.007)   .590 (.006)   .589 (.007)   .581 (.009)
           IV      .539 (.005)   .535 (.005)   .543 (.005)   .537 (.008)

rameter that might vary with the purpose of estimation or prediction. In our simulation, the structural dimension K is correctly identified by both criteria, the average MSPE and the model error, in all cases. Since the model error is not obtainable in practice, we suggest approximating the prediction error with a suitable cross-validation procedure for choosing K together with sn.

3.5 Data Applications

3.5.1 eBay auction data

In this application, we study the relationship between the winning bid price of n = 156

Palm M515 PDA devices auctioned on eBay between March and May, 2003 and the

bidding history over the 7-day duration of each auction. The observation from a bidding


Table 3.2: Shown are the average MSPE with its standard error (in parentheses) and, in brackets, the optimal [K, sn] ([sn] for FLM) that minimize the average MSPE over 100 Monte Carlo repetitions.

  Design   Model   FCS                  FSIR5                FSIR10               FIND                 FLM
  Sparse   I       .129 (.005) [1, 3]   .149 (.005) [1, 3]   .155 (.006) [1, 3]   .135 (.005) [1, 3]   .225 (.003) [2]
           II      .117 (.005) [1, 3]   .148 (.006) [1, 3]   .156 (.006) [1, 3]   .125 (.007) [1, 3]   .180 (.003) [4]
           III     .168 (.005) [2, 3]   .182 (.006) [2, 3]   .191 (.004) [2, 3]   .190 (.008) [2, 3]   .227 (.004) [3]
           IV      .231 (.007) [2, 3]   .279 (.009) [2, 3]   .301 (.009) [2, 3]   .298 (.010) [2, 3]   .427 (.007) [3]
  Dense    I       .075 (.003) [1, 3]   .078 (.003) [1, 3]   .081 (.004) [1, 3]   .084 (.005) [1, 4]   .193 (.003) [3]
           II      .058 (.002) [1, 3]   .062 (.002) [1, 3]   .066 (.003) [1, 3]   .079 (.004) [1, 3]   .108 (.001) [3]
           III     .127 (.004) [2, 3]   .135 (.004) [2, 3]   .139 (.005) [2, 3]   .132 (.006) [2, 3]   .195 (.003) [3]
           IV      .141 (.005) [2, 3]   .147 (.005) [2, 3]   .150 (.005) [2, 3]   .157 (.006) [2, 3]   .285 (.002) [3]

Each observation from a bidding history represents a "live bid", the actual price a winning bidder would pay for the device, known as the "willingness-to-pay" price. Further details on the bidding mechanism can be found in Liu and Muller (2009). We adopt the view that the bidding histories are i.i.d. realizations of a smooth underlying price process. Due to the nature of online auctions, the jth bid of the ith auction usually arrives irregularly at time Tij, and the number of bids Ni varies widely, from 9 to 52 in this dataset. As is common in modeling prices, we take a log-transform of the bid prices. Figure 3.1 shows a sample of 9 randomly selected log bid histories over the 7-day duration of the auction. Typically, the bid histories are very sparse until the final hours of each auction, when "bid sniping" occurs. At this point, "snipers" place their bids at the last possible moments in an attempt to deny competing bidders the chance of placing a higher bid.

Since the main interest is the predictive power of price histories up to time T for

the winning bid prices, we consider the regression of the winning price on the history



Figure 3.1: Irregularly and sparsely observed log bid price trajectories of 9 randomly selected auctions over the 7-day auction duration (log bid price versus day of auction).

trajectory X(t), t ∈ [0, T], and set T = 4.5, 4.6, 4.7, . . . , 6.8 (in days). For each analysis on the domain [0, T], we select the optimal structural dimension K and the truncation parameter sn by minimizing the average 5-fold cross-validated prediction error over 20 random partitions. Shown in Figure 3.2 are the minimized average cross-validated prediction errors, compared with FSIR and FLM, where FSIR is obtained using 5 slices (superior to FSIR using 10 slices). We do not show the results for FIND, which had considerably larger errors. The results are not surprising: the prediction error decreases as the bidding histories encompass more data and approach the end of the auction. The proposed FCS clearly outperforms the other methods, while FLM yields the least favorable predictions, until the last moments of the auction when any sensible method can achieve high predictive power.

As an illustration, we present the analysis for the case T = 6. The estimated model components using FCS are shown in Figure 3.3, with the parameters chosen as K = 2 and sn = 2. The first index function assigns contrasting weights to bids made before and after the first day, indicating that some bidders tend to underbid at the beginning, only to quickly overbid relative to the mean. The second index represents a cautious type of bidding



Figure 3.2: Average 5-fold cross-validated prediction errors over 20 random partitions across various time domains [0, T], for the sparse eBay auction data (FCS, FLM, FSIR).

behavior, entering at a lower price and slowly increasing towards the average level. These features contribute the most towards the prediction of the winning bid prices. Also visible are the slightly nonlinear patterns in the estimated additive link functions.

3.5.2 Spectrometric data

In this example, we study the spectrometric data consisting of n = 215 pieces of finely chopped meat, publicly available at http://lib.stat.cmu.edu/datasets/tecator. For each meat sample, the moisture content and the absorbance spectrum, measured at 100 equally spaced wavelengths between 850 nm and 1050 nm, were recorded using a Tecator Infratec Food and Feed Analyzer. Each absorbance spectrum is treated as an i.i.d. realization of the absorbance process. Thus, the 215 absorbance trajectories, shown in Figure 3.4, can be regarded as densely observed functional data.

In Table 3.3, we present the minimized average 5-fold cross-validated prediction error

over 20 random partitions for different methods, together with the selected structural

dimensions and the truncation sizes. Similar to our simulation study for dense functional



Figure 3.3: Estimated model components for the sparse eBay auction data using FCS with K = 2 and sn = 2. The first and second rows of plots show the estimated index functions, i.e., the EDR directions, and the additive link functions, respectively.


Figure 3.4: Absorbance trajectories of 215 meat samples measured at 100 equally spaced wavelengths between 850 nm and 1050 nm.


Table 3.3: Average 5-fold cross-validated prediction error over 20 Monte Carlo runs with selected K and sn, for the dense spectrometric data.

FCS             FSIR5           FSIR10          FIND            FLM
.0093 (.0001)   .0096 (.0001)   .0095 (.0001)   .0222 (.0016)   .0128 (.0002)
K = 2, sn = 5   K = 2, sn = 5   K = 2, sn = 5   K = 2, sn = 5   sn = 8

data, the results for FCS and FSIR are virtually indistinguishable, and both improve

significantly upon FIND and FLM. The estimated EDR directions and additive link

functions are displayed in Figure 3.5 with K = 2 and sn = 5, where the link functions

appear to be nearly linear. The first index function emphasizes the rising trend above

the mean at wavelengths around 930 nm, and the second index picks up the contrast

between wavelengths 930 nm and 950 nm. Such EDR directions suggest that the rise

and fall around wavelengths 930 nm and 950 nm in the spectrometric trajectories, seen

in Figure 3.4, are important features for predicting moisture content.


Figure 3.5: Estimated model components for the spectrometric data using FCS with (K, sn) = (2, 5). The first and second rows of plots show the estimated EDR directions and additive link functions, respectively.


3.6 Concluding Remarks

In this chapter we introduce a new method of effective dimension reduction for sparse functional data, where one observes only a few noisy and irregular measurements for some or all of the subjects. The proposed FCS estimation is link-free and targets the EDR space directly by borrowing information across the entire sample. Theoretical analysis reveals the bias-variance tradeoff associated with the truncation parameter, and the impact of the decaying structures of the predictor process and the EDR directions. Numerical results from simulated and real examples show the proposal to be superior to existing methods for sparse functional data. It is worth mentioning that the proposed method in fact opens a door to more sophisticated dimension reduction approaches for sparse functional data. Following the strategy of "pooling information together", we may further extend the idea of functional cumulative slicing to variance estimation or directional regression, by analogy to the multivariate case (Zhu et al., 2010). The usefulness and justification of these extensions deserve further study and will be explored in future work.

3.A Regularity Conditions

Without loss of generality, we assume that the known weight function w(·) = 1. Denote T = [a, b] and T^δ = [a − δ, b + δ] for some δ > 0, a single observation time by T and a pair by (T₁, T₂)ᵀ, whose densities are f₁(t) and f₂(s, t), respectively. Recall that the unconditional mean function is m(t, y) = E[X(t)1(Y ≤ y)]. The regularity conditions for the underlying moment functions and design densities are as follows, where ℓ₁, ℓ₂ are non-negative integers.

Assumption 3.7. ∂²Σ/(∂s^{ℓ₁}∂t^{ℓ₂}) is continuous on T^δ × T^δ for ℓ₁ + ℓ₂ = 2, and ∂²m/∂t² is bounded and continuous with respect to t ∈ T for all y ∈ ℝ.

Assumption 3.8. f₁⁽¹⁾(t) is continuous on T^δ with f₁(t) > 0, and ∂f₂/(∂s^{ℓ₁}∂t^{ℓ₂}) is continuous on T^δ × T^δ for ℓ₁ + ℓ₂ = 1 with f₂(s, t) > 0.


Assumption 3.7 can be guaranteed by a twice-differentiable process, and Assumption 3.8 is standard and also implies the boundedness and Lipschitz continuity of f₁. Recall the bandwidths h₁ and h₂ used in the smoothing steps for m in (3.3) and Σ in (3.5), respectively. We assume that

Assumption 3.9. h₁ → 0, h₂ → 0, nh₁³/log n → ∞, nh₁⁵ < ∞, nh₂² → ∞, and nh₂⁶ < ∞.

We say that a bivariate kernel function K₂ is of order (ν, ℓ), where ν is a multi-index ν = (ν₁, ν₂)ᵀ, if

$$\int\!\!\int u^{\ell_1}v^{\ell_2}K_2(u,v)\,du\,dv = \begin{cases} 0, & 0\le \ell_1+\ell_2 < \ell,\ \ell_1\ne\nu_1,\ \ell_2\ne\nu_2,\\ (-1)^{|\nu|}\,\nu_1!\,\nu_2!, & \ell_1=\nu_1,\ \ell_2=\nu_2,\\ \ne 0, & \ell_1+\ell_2=\ell, \end{cases} \qquad (3.7)$$

where |ν| = ν₁ + ν₂ < ℓ. The univariate kernel K is said to be of order (ν, ℓ) for a univariate ν = ν₁ if (3.7) holds with ℓ₂ = 0 on the right-hand side, integrating only over the argument u on the left-hand side. The following standard conditions on the kernel densities are required.

Assumption 3.10. Kernel functions K₁ and K₂ are non-negative with compact supports, bounded, and of order (0, 2) and ((0, 0), 2), respectively.
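As a concrete check of this definition (an added illustration, not part of the original conditions), the Epanechnikov kernel $K_1(u) = \tfrac{3}{4}(1-u^2)1(|u|\le 1)$ satisfies Assumption 3.10 with $\nu = 0$ and $\ell = 2$:

$$\int_{-1}^{1}K_1(u)\,du = 1 = (-1)^0\,0!,\qquad \int_{-1}^{1}u\,K_1(u)\,du = 0,\qquad \int_{-1}^{1}u^2K_1(u)\,du = \tfrac{3}{4}\Big(\tfrac{2}{3}-\tfrac{2}{5}\Big) = \tfrac{1}{5}\ne 0,$$

so it is non-negative, bounded, compactly supported, and of order (0, 2).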

3.B Proof of Theorem 3.1

It is equivalent to show that if b ⊥ span(Σβ₁, . . . , Σβ_K), i.e., 〈b, Σβₖ〉 = 0 for k = 1, . . . , K, then 〈b, m(y)〉 = 0. Observe that

$$\langle b, m(y)\rangle = E\big[E\{\langle b, X1(Y\le y)\rangle \mid Y\}\big] = E\big\{E(\langle b, X\rangle \mid Y)\,1(Y\le y)\big\} = E\big\{E(\langle b, X\rangle \mid \langle\beta_1, X\rangle, \dots, \langle\beta_K, X\rangle)\,1(Y\le y)\big\},$$


where the last equality follows from the assumption of model (3.1). It suffices to show that the inner expectation E(〈b, X〉 | 〈β₁, X〉, . . . , 〈β_K, X〉) = 0, which is implied by E{E²(〈b, X〉 | 〈β₁, X〉, . . . , 〈β_K, X〉)} = 0. Invoking Assumptions 3.1-3.2 and the fact that E(〈βₖ, X〉〈b, X〉) = 〈b, Σβₖ〉,

$$\begin{aligned}
E\big\{E^2(\langle b, X\rangle \mid \langle\beta_1, X\rangle, \dots, \langle\beta_K, X\rangle)\big\} &= E\Big\{E(\langle b, X\rangle \mid \langle\beta_1, X\rangle, \dots, \langle\beta_K, X\rangle)\Big(c_0 + \sum_{k=1}^K c_k\langle\beta_k, X\rangle\Big)\Big\}\\
&= E\Big\{E\Big(c_0\langle b, X\rangle + \sum_{k=1}^K c_k\langle\beta_k, X\rangle\langle b, X\rangle \,\Big|\, \langle\beta_1, X\rangle, \dots, \langle\beta_K, X\rangle\Big)\Big\}\\
&= c_0E(\langle b, X\rangle) + \sum_{k=1}^K c_k\langle b, \Sigma\beta_k\rangle = 0,
\end{aligned}$$

as desired.

3.C Proof of Theorem 3.2

Let M denote the upper bound of the random variable N, and define

$$S_n(t) = \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{M}\frac{1(N_i\ge j)}{h_1 E(N)}K_1\Big(\frac{T_{ij}-t}{h_1}\Big)\begin{pmatrix}1 & \frac{T_{ij}-t}{h_1}\\ \frac{T_{ij}-t}{h_1} & \big(\frac{T_{ij}-t}{h_1}\big)^2\end{pmatrix}
\qquad\text{and}\qquad
S(t) = \begin{pmatrix}f_T(t) & 0\\ 0 & f_T(t)\sigma_K^2\end{pmatrix}.$$

The local linear estimator of m(t, y) with kernel K₁ is

$$\widehat m(t,y) = (1,0)\,S_n^{-1}(t)\begin{pmatrix}\displaystyle\sum_i\sum_j \frac{1(N_i\ge j)}{nh_1E(N)}K_1\Big(\frac{T_{ij}-t}{h_1}\Big)U_{ij}1(Y_i\le y)\\[6pt]\displaystyle\sum_i\sum_j \frac{1(N_i\ge j)}{nh_1E(N)}K_1\Big(\frac{T_{ij}-t}{h_1}\Big)\Big(\frac{T_{ij}-t}{h_1}\Big)U_{ij}1(Y_i\le y)\end{pmatrix}.$$
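Purely as an illustration of this estimator (a minimal numerical sketch, outside the formal argument), the intercept of the following kernel-weighted least squares fit coincides with the expression above; the factors 1(Nᵢ ≥ j)/{nh₁E(N)} cancel in the ratio once all observed pairs are pooled, and the Epanechnikov kernel is one admissible choice of K₁.

```python
import numpy as np

def local_linear_m(t, y, T_obs, U_obs, Y_obs, h1):
    """Pooled local linear estimate of m(t, y) = E[X(t) 1(Y <= y)].

    T_obs, U_obs : 1-d arrays pooling all (T_ij, U_ij) across subjects
    Y_obs        : responses Y_i repeated to match their (T_ij, U_ij) pairs
    """
    z = (T_obs - t) / h1
    k = 0.75 * np.clip(1.0 - z**2, 0.0, None)     # Epanechnikov kernel K_1
    w = U_obs * (Y_obs <= y)                      # U_ij 1(Y_i <= y)
    # weighted least squares on (1, (T_ij - t)/h1); intercept is m_hat(t, y)
    S0, S1, S2 = k.sum(), (k * z).sum(), (k * z**2).sum()
    R0, R1 = (k * w).sum(), (k * z * w).sum()
    return (S2 * R0 - S1 * R1) / (S0 * S2 - S1**2)
```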


Let $U^*_{ij}(t,y) = U_{ij}1(Y_i\le y) - m(t,y) - m^{(1)}(t,y)(T_{ij}-t)$ and $W_n(z,t) = (1,0)\,S_n^{-1}(t)\,(1,z)'K_1(z)$. Then

$$\widehat m(t,y) - m(t,y) = \frac{1}{n}\sum_{i=1}^n\sum_{j=1}^M \frac{1(N_i\ge j)}{h_1E(N)}W_n\Big(\frac{T_{ij}-t}{h_1},t\Big)U^*_{ij}(t,y).$$

If we denote a point between $T_{ij}$ and $t$ by $t^*_{ij}$, then by Taylor's Theorem $U^*_{ij}(t,y) = U_{ij}1(Y_i\le y) - m(T_{ij},y) + \frac{1}{2}m^{(2)}(t^*_{ij},y)(T_{ij}-t)^2$. Finally, if we let $e_{ij}(y)$ denote the "error" term, i.e., $e_{ij}(y) = U_{ij}1(Y_i\le y) - m(T_{ij},y)$, we then have

$$\begin{aligned}
\widehat m(t,y) - m(t,y) = {}& \frac{1}{n}\sum_{i=1}^n\sum_{j=1}^M\frac{1(N_i\ge j)}{h_1E(N)f_T(t)}K_1\Big(\frac{T_{ij}-t}{h_1}\Big)e_{ij}(y)\\
&+ \frac{1}{2n}\sum_{i=1}^n\sum_{j=1}^M\frac{h_1 1(N_i\ge j)}{E(N)f_T(t)}K_1\Big(\frac{T_{ij}-t}{h_1}\Big)\Big(\frac{T_{ij}-t}{h_1}\Big)^2 m^{(2)}(t^*_{ij},y) + A_n(t,y),
\end{aligned}$$

where $A_n(t,y) = \widehat m(t,y) - m(t,y) - \{nh_1E(N)f_T(t)\}^{-1}\sum_i\sum_j 1(N_i\ge j)K_1((T_{ij}-t)/h_1)U^*_{ij}(t,y)$.

This allows us to write $\widehat\Lambda(s,t) - \Lambda(s,t) = I_{1n}(s,t) + I_{2n}(s,t) + I_{3n}(s,t)$, where

$$\begin{aligned}
I_{1n}(s,t) &= \frac{1}{n}\sum_{k=1}^n\big\{m(s,Y_k)\big[\widehat m(t,Y_k) - m(t,Y_k)\big] + m(t,Y_k)\big[\widehat m(s,Y_k) - m(s,Y_k)\big]\big\},\\
I_{2n}(s,t) &= \frac{1}{n}\sum_{k=1}^n\big\{\widehat m(s,Y_k) - m(s,Y_k)\big\}\big\{\widehat m(t,Y_k) - m(t,Y_k)\big\},\\
I_{3n}(s,t) &= \frac{1}{n}\sum_{k=1}^n m(s,Y_k)m(t,Y_k) - \Lambda(s,t),
\end{aligned}$$

which implies by the Cauchy-Schwarz inequality that $\|\widehat\Lambda-\Lambda\|^2_H = O_p(\|I_{1n}\|^2_H + \|I_{2n}\|^2_H + \|I_{3n}\|^2_H)$. We will drop the subscript $H$ for brevity in the sequel. Recall that we defined $Z_i$ as the underlying data quadruplet $(T_i, U_i, Y_i, N_i)$. Further, let $\sum_{(p)} h_{i_1,\dots,i_p}$ denote the sum of $h_{i_1,\dots,i_p}$ over the permutations of $i_1,\dots,i_p$. We will repeatedly make use of the dominated convergence theorem (DCT) and its variant given in Prakasa-Rao (1983, p. 35), which we will call the PR proposition. We will refer to Corollary 1 of Martins-Filho and Yao (2006) as the MFY corollary. Unless otherwise stated, we will drop the dummy variable in all integrals for the sake of brevity. Finally, let $0 < \underline{B}_T \le f_T(t) \le \overline{B}_T < \infty$ denote the lower and upper bounds of the density function of $T$, let $|K_1(x)| \le B_K < \infty$ denote the bound on the kernel function $K_1$, and let $|\partial^2 m/\partial t^2| \le B_{2m} < \infty$ denote the bound on the second partial derivative of $m(t,y)$ with respect to $t$.

(a) We further decompose $I_{1n}(s,t)$ as $I_{1n}(s,t) = I_{11n}(s,t) + I_{12n}(s,t) + I_{13n}(s,t)$, where

$$\begin{aligned}
I_{11n}(s,t) &= \frac{1}{n^2}\sum_{k=1}^n\sum_{i=1}^n\sum_{j=1}^M\bigg\{\frac{1(N_i\ge j)}{h_1E(N)f_T(t)}K_1\Big(\frac{T_{ij}-t}{h_1}\Big)e_{ij}(Y_k)m(s,Y_k) + \frac{1(N_i\ge j)}{h_1E(N)f_T(s)}K_1\Big(\frac{T_{ij}-s}{h_1}\Big)e_{ij}(Y_k)m(t,Y_k)\bigg\},\\
I_{12n}(s,t) &= \frac{1}{2n^2}\sum_{k=1}^n\sum_{i=1}^n\sum_{j=1}^M\bigg\{\frac{h_1 1(N_i\ge j)}{E(N)f_T(t)}K_1\Big(\frac{T_{ij}-t}{h_1}\Big)\Big(\frac{T_{ij}-t}{h_1}\Big)^2 m^{(2)}(t^*_{ij},Y_k)m(s,Y_k)\\
&\qquad\qquad\qquad\quad + \frac{h_1 1(N_i\ge j)}{E(N)f_T(s)}K_1\Big(\frac{T_{ij}-s}{h_1}\Big)\Big(\frac{T_{ij}-s}{h_1}\Big)^2 m^{(2)}(t^*_{ij},Y_k)m(t,Y_k)\bigg\},\\
I_{13n}(s,t) &= \frac{1}{n}\sum_{k=1}^n\big\{m(s,Y_k)A_n(t,Y_k) + m(t,Y_k)A_n(s,Y_k)\big\},
\end{aligned}$$

which we analyze individually below.

(a-i) We will first show $E\|I_{11n}\|^2 = O(\{nh_1\}^{-1})$. We write $I_{11n}(s,t)$ as

$$I_{11n}(s,t) = \frac{1}{2n^2}\sum_{k=1}^n\sum_{i=1}^n\sum_{(2)}\big\{h_{ik}(s,t) + h_{ik}(t,s)\big\} = \frac{1}{2n^2}\sum_{k=1}^n\sum_{i=1}^n \psi_n(Z_i,Z_k;s,t) = \frac{1}{2}v_n(s,t),$$

where $v_n(s,t)$ is a $V$-statistic with symmetric kernel $\psi_n(Z_i,Z_k;s,t)$ and

$$h_{ik}(s,t) = \sum_{j=1}^M\frac{1(N_i\ge j)}{h_1E(N)f_T(t)}K_1\Big(\frac{T_{ij}-t}{h_1}\Big)e_{ij}(Y_k)m(s,Y_k).$$

Since $E[e_{ij}(Y_k)\mid T_{ij},Y_k] = 0$, it is easy to show that $E[h_{ik}(s,t)] = E[h_{ik}(t,s)] = E[h_{ki}(s,t)] = E[h_{ki}(t,s)] = 0$. Thus $\theta_n(s,t) = E[\psi_n(Z_i,Z_k;s,t)] = 0$. Additionally,

$$\psi_{1n}(Z_i;s,t) = E\big[\psi_n(Z_i,Z_k;s,t)\mid Z_i\big] = \sum_{j=1}^M\frac{1(N_i\ge j)}{h_1E(N)f_T(t)}K_1\Big(\frac{T_{ij}-t}{h_1}\Big)E\big[e_{ij}(Y_k)m(s,Y_k)\mid Z_i\big] + \sum_{j=1}^M\frac{1(N_i\ge j)}{h_1E(N)f_T(s)}K_1\Big(\frac{T_{ij}-s}{h_1}\Big)E\big[e_{ij}(Y_k)m(t,Y_k)\mid Z_i\big].$$

Provided $E[\psi_n^2(Z_i,Z_k;s,t)] = o(n)$, the MFY corollary gives $nE[v_n(s,t) - u_n(s,t)]^2 = o(1)$, where $u_n(s,t) = 2n^{-1}\sum_{i=1}^n\psi_{1n}(Z_i;s,t)$ is the projection of the corresponding $U$-statistic. Recall that the projection of a $U$-statistic is a sum of i.i.d. random variables $\psi_{1n}(Z_i;s,t)$. Thus,

$$E\|I_{11n}\|^2 \le \frac{2}{n}\int\!\!\int \mathrm{var}\big(E[h_{ik}(s,t)\mid Z_i]\big) + \frac{2}{n}\int\!\!\int \mathrm{var}\big(E[h_{ik}(t,s)\mid Z_i]\big) + o(n^{-1}),$$

and

$$\begin{aligned}
\frac{2}{n}\int\!\!\int \mathrm{var}\big(E[h_{ik}(s,t)\mid Z_i]\big) &\le \sum_{j=1}^M\frac{2P(N_i\ge j)}{nh_1^2E(N)}\int\!\!\int f_T^{-2}(t)\,E\Big\{K_1^2\Big(\frac{T_{ij}-t}{h_1}\Big)E^2\big[e_{ij}(Y_k)m(s,Y_k)\mid Z_i\big]\Big\}\\
&= \sum_{j=1}^M\frac{2P(N_i\ge j)}{nh_1E(N)}\int\!\!\int\!\!\int f_T^{-2}(t)K_1^2(u)\,E_{X_i,Y_i,\varepsilon_i}\Big\{E^2_{Y_k}\big[e_{ij}(Y_k)m(s,Y_k)\mid T_{ij}=t+uh_1\big]\Big\}f_T(t+uh_1)\,du\,ds\,dt\\
&\to \sum_{j=1}^M\frac{2\|K_1\|^2 P(N_i\ge j)}{nh_1E(N)}\int\!\!\int f_T^{-1}(t)\,E_{X_i,Y_i,\varepsilon_i}\Big\{E^2_{Y_k}\big[e_{ij}(Y_k)m(s,Y_k)\mid T_{ij}=t\big]\Big\}\\
&\le \frac{8\|K_1\|^2}{nh_1\underline{B}_T}E\|X\|^4 + \frac{4\|K_1\|^2\sigma^2}{nh_1\underline{B}_T}E\|X\|^2 = O\Big(\frac{1}{nh_1}\Big),
\end{aligned}$$

where the first line follows from the Cauchy-Schwarz inequality, the second line by letting $u = h_1^{-1}(T_{ij}-t)$ and observing that $T_{ij}$ is independent of $X_i, Y_i, \varepsilon_i$, and the third line by the DCT since the integrand is bounded by $4\underline{B}_T^{-2}\overline{B}_T B_K^2 E\|X\|^4 + 2\underline{B}_T^{-2}\overline{B}_T B_K^2\sigma^2 E\|X\|^2$.

Thus $E\|I_{11n}\|^2 = O(\{nh_1\}^{-1})$, provided that $E[\psi_n^2(Z_i,Z_k;s,t)] = o(n)$ for all $i,k$, which we will show below. For $i\ne k$,

$$E\big[\psi_n^2(Z_i,Z_k;s,t)\big] = 2E\big[h_{ik}^2(s,t)\big] + 2E\big[h_{ik}^2(t,s)\big] + 4E\big[h_{ik}(s,t)h_{ik}(t,s)\big] + 4E\big[h_{ik}(s,t)h_{ki}(s,t)\big] + 4E\big[h_{ik}(s,t)h_{ki}(t,s)\big].$$

Observe that

$$n^{-1}E\big[h_{ik}^2(s,t)\big] = \sum_{j=1}^M\sum_{l=1}^M\frac{P(N_i\ge \max(j,l))}{E^2(N)f_T^2(t)}E\Big\{(nh_1^2)^{-1}K_1\Big(\frac{T_{ij}-t}{h_1}\Big)K_1\Big(\frac{T_{il}-t}{h_1}\Big)e_{ij}(Y_k)e_{il}(Y_k)m^2(s,Y_k)\Big\}.$$

For $j=l$, the PR proposition applied to the expectation on the right-hand side gives $n^{-1}h_1^{-1}\|K_1\|^2 f_T(t)E[e_{ij}^2(Y_k)m^2(s,Y_k)\mid T_{ij}=t] = o(1)$ provided $nh_1\to\infty$. For $j\ne l$, a similar application gives $n^{-1}f_T^2(t)E[e_{ij}(Y_k)e_{il}(Y_k)m^2(s,Y_k)\mid T_{ij}=T_{il}=t] = o(1)$. The next two terms, $E[h_{ik}^2(t,s)]$ and $E[h_{ik}(s,t)h_{ik}(t,s)]$, can be handled similarly. For the remaining two terms, we apply the PR proposition twice to derive

$$n^{-1}E\big[h_{ik}(s,t)h_{ki}(s,t)\big] = \sum_{j=1}^M\sum_{l=1}^M\frac{P(N_i\ge j)P(N_k\ge l)}{nE^2(N)f_T^2(t)}\int\!\!\int K_1(u)K_1(v)f_T(t+uh_1)f_T(t+vh_1)\,E\big[e_{ij}(Y_k)e_{kl}(Y_i)m(s,Y_k)m(s,Y_i)\mid T_{ij}=t+uh_1,\ T_{kl}=t+vh_1\big] = o(1).$$

The calculations above can be used in the same manner to derive similar results for the case $i=k$. Thus we have $E[\psi_n^2(Z_i,Z_k;s,t)] = o(n)$.

(a-ii) We will now show $E\|I_{12n}\|^2 = O(h_1^4) + o(n^{-1})$, writing $I_{12n}(s,t)$ as

$$I_{12n}(s,t) = \frac{1}{4n^2}\sum_{i=1}^n\sum_{k=1}^n\sum_{(2)}\big[h_{ik}(s,t) + h_{ik}(t,s)\big] = \frac{1}{4n^2}\sum_{i=1}^n\sum_{k=1}^n\psi_n(Z_i,Z_k;s,t) = \frac{1}{4}v_n(s,t),$$

where $v_n(s,t)$ is a $V$-statistic with

$$h_{ik}(s,t) = \sum_{j=1}^M\frac{h_1 1(N_i\ge j)}{E(N)f_T(t)}K_1\Big(\frac{T_{ij}-t}{h_1}\Big)\Big(\frac{T_{ij}-t}{h_1}\Big)^2 m^{(2)}(t^*_{ij},Y_k)m(s,Y_k).$$


By the MFY corollary, $nE[v_n(s,t)-u_n(s,t)]^2 = o(1)$ provided $E[\psi_n^2(Z_i,Z_k;s,t)] = o(n)$ for all $i,k$. Hence

$$E\|I_{12n}\|^2 = \frac{1}{16}\int\!\!\int\Big\{E^2\big[u_n(s,t)\big] + \mathrm{var}\big(u_n(s,t)\big)\Big\} + o(n^{-1}),$$

where the projection of the $U$-statistic is $u_n(s,t) = 2n^{-1}\sum_{i=1}^n \psi_{1n}(Z_i;s,t) - \theta_n(s,t)$, with $\psi_{1n}(Z_i;s,t) = \sum_{(2)}\{E[h_{ik}(s,t)\mid Z_i] + E[h_{ik}(t,s)\mid Z_i]\}$ and mean $\theta_n(s,t) = E[u_n(s,t)] = 2E[h_{ik}(s,t)] + 2E[h_{ik}(t,s)]$. Observe that $E^2[u_n(s,t)] \le 4E^2[h_{ik}(s,t)] + 4E^2[h_{ik}(t,s)]$. Thus, arguing in the same way for $E^2[h_{ik}(t,s)]$, we use the DCT to derive

$$4h_1^{-4}\int\!\!\int E^2\big[h_{ik}(s,t)\big] \to \sum_{j=1}^M \frac{4B_K^2P^2(N_i\ge j)}{5E(N)}\int E\big[(m^{(2)}(t,Y_k))^2\big]\int E\big[m^2(s,Y_k)\big] \le 4C_1B_K^2B_{2m}^2E\|X\|^2 = O(1),$$

where $C_1 = \int_T u^4\,du$. This leads to $\int\!\!\int E^2[u_n(s,t)] = O(h_1^4)$. Next,

1). Next,

var(h−2

1 un(s, t))

= 4(nh41)−1

{E[E2(hik(s, t)|Zi)

]+ E

[E2(hik(t, s)|Zi)

]+ E

[E2(hki(s, t)|Zi)

]+ E

[E2(hki(t, s)|Zi)

]+ 2E

[E(hik(s, t)|Zi)E(hki(s, t)|Zi)

]+ 2E

[E(hik(s, t)|Zi)E(hki(t, s)|Zi)

]+ 2E

[E(hik(t, s)|Zi)E(hki(s, t)|Zi)

]+ 2E

[E(hik(t, s)|Zi)E(hki(t, s)|Zi)

]− 4E2

[hik(s, t)

]− 4E2(hik(t, s)

]− 4[E(hik(s, t))E(hki(t, s))

]}.

Firstly, for j 6= l, using the DCT, it can be shown that 4(nh41)−1

∫ ∫E[E2(hik(s, t)|Zi)]

is bounded by 4n−1B2mσ4KE‖X2‖ = O(n−1). For j = l, it can be shown to be bounded

by n−1h−11 B2

KB2mB−1T E‖X‖ = O({nh1}−1). Combining the previous two results shows

that 4(nh41)−1

∫ ∫E[E2(hik(s, t)|Zi)] = o(1), provided nh1 → ∞. All of the remaining

terms can be handled similarly using the DCT, so∫ ∫

var(un(s, t)) = o(h41). Thus we

have E‖I12n‖2 = O(h41) + o(n−1) provided E

[ψ2n(Zi, Zk; s, t)

]= o(n), which can be shown

Page 77: by Edwin Kam Fai Lei - University of Toronto T-Space · Edwin Kam Fai Lei Doctor of Philosophy Graduate Department of Statistical Sciences University of Toronto 2014 The primary aim

Chapter 3. Cumulative Slicing Estimation for Dimension Reduction 68

similarly using the PR proposition as before.

(a-iii) We now show $\|I_{13n}\|^2 = O_p(n^{-1}h_1 + h_1^6)$. Following Lemma 2 of Martins-Filho and Yao (2007),

$$\begin{aligned}
|A_n(t,Y_k)| &= \bigg|\sum_{j=1}^M\sum_{i=1}^n\frac{1(N_i\ge j)}{nh_1E(N)}\Big\{W_n\Big(\frac{T_{ij}-t}{h_1},t\Big) - f_T^{-1}(t)K_1\Big(\frac{T_{ij}-t}{h_1}\Big)\Big\}U^*_{ij}(t,Y_k)\bigg|\\
&\le h_1^{-1}\Big\{(1,0)\big(S_n^{-1}(t)-S^{-1}(t)\big)^2(1,0)'\Big\}^{1/2}\bigg(\bigg|\sum_j\sum_i\frac{1(N_i\ge j)}{nE(N)}K_1\Big(\frac{T_{ij}-t}{h_1}\Big)U^*_{ij}(t,Y_k)\bigg|\\
&\qquad\qquad + \bigg|\sum_j\sum_i\frac{1(N_i\ge j)}{nE(N)}K_1\Big(\frac{T_{ij}-t}{h_1}\Big)\Big(\frac{T_{ij}-t}{h_1}\Big)U^*_{ij}(t,Y_k)\bigg|\bigg)\\
&= h_1^{-1}\Big\{(1,0)\big(S_n^{-1}(t)-S^{-1}(t)\big)^2(1,0)'\Big\}^{1/2}R_n(t,Y_k).
\end{aligned}$$

If $nh_1^3/\log n\to\infty$, a direct application of Lemma 1(b) of Martins-Filho and Yao (2007) gives $\sup_{t\in T}h_1^{-1}\big|\{(1,0)(S_n^{-1}(t)-S^{-1}(t))^2(1,0)'\}^{1/2}\big| = O_p(1)$. Next,

$$R_n(t,Y_k) \le |R_{n1}(t,Y_k)| + |R_{n2}(t,Y_k)| + |R_{n3}(t,Y_k)| + |R_{n4}(t,Y_k)|,$$

where

$$\begin{aligned}
R_{n1}(t,Y_k) &= \sum_{j=1}^M\sum_{i=1}^n\frac{1(N_i\ge j)}{nE(N)}K_1\Big(\frac{T_{ij}-t}{h_1}\Big)e_{ij}(Y_k),\\
R_{n2}(t,Y_k) &= \sum_{j=1}^M\sum_{i=1}^n\frac{h_1^2\,1(N_i\ge j)}{2nE(N)}K_1\Big(\frac{T_{ij}-t}{h_1}\Big)\Big(\frac{T_{ij}-t}{h_1}\Big)^2 m^{(2)}(t^*_{ij},Y_k),\\
R_{n3}(t,Y_k) &= \sum_{j=1}^M\sum_{i=1}^n\frac{1(N_i\ge j)}{nE(N)}K_1\Big(\frac{T_{ij}-t}{h_1}\Big)\Big(\frac{T_{ij}-t}{h_1}\Big)e_{ij}(Y_k),\\
R_{n4}(t,Y_k) &= \sum_{j=1}^M\sum_{i=1}^n\frac{h_1^2\,1(N_i\ge j)}{2nE(N)}K_1\Big(\frac{T_{ij}-t}{h_1}\Big)\Big(\frac{T_{ij}-t}{h_1}\Big)^3 m^{(2)}(t^*_{ij},Y_k).
\end{aligned}$$

Thus $n^{-1}\sum_k m(s,Y_k)R_{n1}(t,Y_k) = h_1f_T(t)I_{11n}(s,t)$, and from the analysis of $I_{11n}$ this implies $\|h_1f_TI_{11n}\|^2 = O_p(n^{-1}h_1)$. Secondly, $n^{-1}\sum_k m(s,Y_k)R_{n2}(t,Y_k) = h_1f_T(t)I_{12n}(s,t)$, and from the analysis of $I_{12n}$ this implies $\|h_1f_TI_{12n}\|^2 = O_p(h_1^6)$. It follows similarly that the third and fourth remaining terms are $O_p(n^{-1}h_1)$ and $O_p(h_1^6)$, respectively. Hence $\|I_{13n}\|^2 = O_p(n^{-1}h_1 + h_1^6)$. Combining the previous results thus shows that $\|I_{1n}\|^2 = O_p(\{nh_1\}^{-1} + h_1^4)$.

(b) These terms are of higher order and are omitted for brevity.

(c) By the law of large numbers, $\|n^{-1}\sum_{i=1}^n m(s,Y_i)m(t,Y_i) - \Lambda(s,t)\|^2 = O_p(n^{-1})$. Combining the previous results leads to $\|\widehat\Lambda - \Lambda\|^2 = O_p\{(nh_1)^{-1}\}$, given $h_1^4 = O\{(nh_1)^{-1}\}$.

3.D Proof of Theorem 3.3

For a bounded linear operator $A$, let $\|A\|$ denote the norm defined on the space of bounded linear operators from $H$ to itself, i.e., $\|A\| = \sup\{\|Af\|_H : \|f\|_H \le 1\}$. To facilitate the theoretical analysis, for each $k = 1,\dots,K$, let $\eta_k = \Sigma^{1/2}\beta_k$ (resp. $\widehat\eta_{k,s_n} = \widehat\Sigma_{s_n}^{1/2}\widehat\beta_{k,s_n}$) be the normalized eigenvectors of the eigenvalue problem $\Sigma^{-1}\Lambda\Sigma^{-1/2}\eta_k = \lambda_k\beta_k$ (resp. $\widehat\Sigma_{s_n}^{-1}\widehat\Lambda\widehat\Sigma_{s_n}^{-1/2}\widehat\eta_{k,s_n} = \widehat\lambda_{k,s_n}\widehat\beta_{k,s_n}$). Recall that $\Sigma^{-1}$ and $\Sigma^{-1/2}$ are well-defined by suitably restricting the domain as in Proposition 1. This allows us to write

$$\begin{aligned}
\|\widehat\beta_{k,s_n} - \beta_k\|_H &\le \big\|\widehat\lambda_{k,s_n}^{-1}\widehat\Sigma_{s_n}^{-1}\widehat\Lambda\widehat\Sigma_{s_n}^{-1/2} - \lambda_k^{-1}\Sigma^{-1}\Lambda\Sigma^{-1/2}\big\| + \lambda_k^{-1}\big\|\Sigma^{-1}\Lambda\Sigma^{-1/2}\big\|\,\big\|\widehat\eta_{k,s_n} - \eta_k\big\|\\
&\le \widehat\lambda_{k,s_n}^{-1}\big\|\widehat\Sigma_{s_n}^{-1}\widehat\Lambda\widehat\Sigma_{s_n}^{-1/2} - \Sigma^{-1}\Lambda\Sigma^{-1/2}\big\| + \big\|\Sigma^{-1}\Lambda\Sigma^{-1/2}\big\|\Big(\big|\widehat\lambda_{k,s_n}^{-1} - \lambda_k^{-1}\big| + \lambda_k^{-1}\big\|\widehat\eta_{k,s_n} - \eta_k\big\|\Big),
\end{aligned}$$

using the inequality $\widehat\lambda_{k,s_n}^{-1} \le \lambda_k^{-1} + |\widehat\lambda_{k,s_n}^{-1} - \lambda_k^{-1}|$. Applying standard theory for self-adjoint compact operators (Bosq, 2000) gives

$$\big|\widehat\lambda_{k,s_n} - \lambda_k\big| \le \big\|\widehat\Sigma_{s_n}^{-1/2}\widehat\Lambda\widehat\Sigma_{s_n}^{-1/2} - \Sigma^{-1/2}\Lambda\Sigma^{-1/2}\big\|,\qquad
\big\|\widehat\eta_{k,s_n} - \eta_k\big\|_H \le 2\sqrt{2}\,\delta_k^{-1}\big\|\widehat\Sigma_{s_n}^{-1/2}\widehat\Lambda\widehat\Sigma_{s_n}^{-1/2} - \Sigma^{-1/2}\Lambda\Sigma^{-1/2}\big\|,$$


where $\delta_1 = \lambda_1 - \lambda_2$ and $\delta_k = \min(\lambda_{k-1}-\lambda_k,\ \lambda_k-\lambda_{k+1})$ for $k > 1$. Thus, for each $k = 1,\dots,K$, we have

$$\|\widehat\beta_{k,s_n} - \beta_k\|_H^2 = O_p\big(I_{1n} + I_{2n}\big),$$

where $I_{1n} = \|\widehat\Sigma_{s_n}^{-1}\widehat\Lambda\widehat\Sigma_{s_n}^{-1/2} - \Sigma^{-1}\Lambda\Sigma^{-1/2}\|^2$ and $I_{2n} = \|\widehat\Sigma_{s_n}^{-1/2}\widehat\Lambda\widehat\Sigma_{s_n}^{-1/2} - \Sigma^{-1/2}\Lambda\Sigma^{-1/2}\|^2$. Below, we show $I_{1n} = O_p\big(s_n^{3a+2}/(nh_1) + s_n^{(4a-2b+4)_+}/(nh_2^2) + 1/s_n^{2b-2a-1}\big)$; the calculations for $I_{2n}$ are similar. Observe $I_{1n} \le 3I_{11n} + 3I_{12n} + 3I_{13n}$, where

$$I_{11n} = \big\|\Sigma_{s_n}^{-1}\Lambda\Sigma_{s_n}^{-1/2} - \Sigma^{-1}\Lambda\Sigma^{-1/2}\big\|^2,\qquad
I_{12n} = \big\|\widehat\Sigma_{s_n}^{-1}\Lambda\widehat\Sigma_{s_n}^{-1/2} - \Sigma_{s_n}^{-1}\Lambda\Sigma_{s_n}^{-1/2}\big\|^2,\qquad
I_{13n} = \big\|\widehat\Sigma_{s_n}^{-1}\widehat\Lambda\widehat\Sigma_{s_n}^{-1/2} - \widehat\Sigma_{s_n}^{-1}\Lambda\widehat\Sigma_{s_n}^{-1/2}\big\|^2,$$

which we study separately below.

(a) Recall that $\Pi_{s_n} = \sum_{j=1}^{s_n}\phi_j\otimes\phi_j$ is the orthogonal projector onto the eigenspace associated with the $s_n$ largest eigenvalues of $\Sigma$. Let $I$ denote the identity operator and $\Pi_{s_n}^\perp = I - \Pi_{s_n}$, i.e., $\Pi_{s_n}^\perp$ is the orthogonal projector onto the eigenspace associated with the eigenvalues of $\Sigma$ less than $\alpha_{s_n}$. Thus $\Sigma_{s_n}^{-1}\Lambda\Sigma_{s_n}^{-1/2} = \Pi_{s_n}\Sigma^{-1}\Lambda\Sigma^{-1/2}\Pi_{s_n}$, which allows us to write $I_{11n} \le \|\Pi_{s_n}^\perp\Sigma^{-1}\Lambda\Sigma^{-1/2}\| + \|\Sigma^{-1}\Lambda\Sigma^{-1/2}\Pi_{s_n}^\perp\|$. Note that the range of $\Lambda\Sigma^{-1/2}$ is spanned by $\beta_1,\dots,\beta_K$, and a direct calculation leads to

$$\big\|\Pi_{s_n}^\perp\Sigma^{-1}\Lambda\Sigma^{-1/2}\big\|^2 \le \sum_{k=1}^K\lambda_k^2\Big\|\sum_{i>s_n}\alpha_i^{-1}\sum_{j=1}^\infty b_{kj}\langle\phi_i,\phi_j\rangle\phi_i\Big\|^2 \le \sum_{k=1}^K\lambda_k^2\sum_{j>s_n}\frac{b_{kj}^2}{\alpha_j^2} \le C_1\sum_{k=1}^K\lambda_k^2\sum_{j>s_n}j^{-2b+2a} = O\Big(\frac{1}{s_n^{2b-2a-1}}\Big),$$

and similarly for $\|\Sigma^{-1}\Lambda\Sigma^{-1/2}\Pi_{s_n}^\perp\|^2$.


(b) We decompose $I_{12n}$ as $I_{12n} \le 3I_{121n} + 3I_{122n} + 3I_{123n}$, where

$$I_{121n} = \big\|(\widehat\Sigma_{s_n}^{-1} - \Sigma_{s_n}^{-1})\Lambda\Sigma_{s_n}^{-1/2}\big\|^2,\qquad
I_{122n} = \big\|\Sigma_{s_n}^{-1}\Lambda(\widehat\Sigma_{s_n}^{-1/2} - \Sigma_{s_n}^{-1/2})\big\|^2,\qquad
I_{123n} = \big\|(\widehat\Sigma_{s_n}^{-1} - \Sigma_{s_n}^{-1})\Lambda(\widehat\Sigma_{s_n}^{-1/2} - \Sigma_{s_n}^{-1/2})\big\|^2.$$

(b-i) Note $I_{121n} \le 6\|\Lambda\Sigma^{-1/2}\Pi_{s_n}\|^2 I_{1211n} + 6\|\Lambda\Sigma^{-1/2}\Pi_{s_n}\|^2 I_{1212n}$, where

$$I_{1211n} = \Big\|\sum_{j=1}^{s_n}(\widehat\alpha_j^{-1} - \alpha_j^{-1})\,\widehat\phi_j\otimes\widehat\phi_j\Big\|^2,\qquad
I_{1212n} = \Big\|\sum_{j=1}^{s_n}\alpha_j^{-1}(\widehat\phi_j\otimes\widehat\phi_j - \phi_j\otimes\phi_j)\Big\|^2.$$

Then $I_{1211n} \le \sum_{j=1}^{s_n}(\widehat\alpha_j - \alpha_j)^2(\widehat\alpha_j\alpha_j)^{-2} \le C_2\|\widehat\Sigma - \Sigma\|^2\sum_{j=1}^{s_n}j^{4a} = O_p\big(s_n^{4a+1}/(nh_2^2)\big)$, where the second inequality follows from $|\widehat\alpha_j - \alpha_j| \le \|\widehat\Sigma - \Sigma\|$ for self-adjoint compact operators. Similarly, $I_{1212n} \le 2\sum_{j=1}^{s_n}\alpha_j^{-2}\|\widehat\phi_j - \phi_j\|^2 \le C_3\|\widehat\Sigma - \Sigma\|^2\sum_{j=1}^{s_n}j^{4a+2} = O_p\big(s_n^{4a+3}/(nh_2^2)\big)$, where the second inequality follows from $\|\widehat\phi_j - \phi_j\| \le C\delta_j^{-1}\|\widehat\Sigma - \Sigma\|$ for self-adjoint compact operators. Similar to the calculation for $I_{11n}$, $\|\Lambda\Sigma^{-1/2}\Pi_{s_n}\|^2 = O(s_n^{-2b+1})$. Thus $I_{121n} = O_p\big(s_n^{(4a-2b+4)_+}/(nh_2^2)\big)$.

(b-ii) Using similar decompositions as for $I_{121n}$, we write $I_{122n} \le 6\|\Pi_{s_n}\Sigma^{-1}\Lambda\|^2 I_{1221n} + 6\|\Pi_{s_n}\Sigma^{-1}\Lambda\|^2 I_{1222n}$, where

$$I_{1221n} = \Big\|\sum_{j=1}^{s_n}(\widehat\alpha_j^{-1/2} - \alpha_j^{-1/2})\,\widehat\phi_j\otimes\widehat\phi_j\Big\|^2,\qquad
I_{1222n} = \Big\|\sum_{j=1}^{s_n}\alpha_j^{-1/2}(\widehat\phi_j\otimes\widehat\phi_j - \phi_j\otimes\phi_j)\Big\|^2.$$

Then $I_{1221n} \le \sum_{j=1}^{s_n}(\widehat\alpha_j - \alpha_j)^2(\bar\alpha_j\alpha_j^2)^{-1} \le C_4\|\widehat\Sigma - \Sigma\|^2\sum_{j=1}^{s_n}j^{3a} = O_p\big(s_n^{3a+1}/(nh_2^2)\big)$, where the first inequality follows from the Mean Value Theorem with $\bar\alpha_j$ between $\widehat\alpha_j$ and $\alpha_j$. Next, $I_{1222n} \le C_4\|\widehat\Sigma - \Sigma\|^2\sum_{j=1}^{s_n}j^{3a+1} = O_p\big(s_n^{3a+2}/(nh_2^2)\big)$. Also, $\|\Pi_{s_n}\Sigma^{-1}\Lambda\|^2 = O(s_n^{-2b+1})$ as before, and thus $I_{122n} = o_p\big(s_n^{(4a-2b+2)_+}/(nh_2^2)\big)$.

(b-iii) Using similar calculations, $I_{123n}$ can also be shown to be $o_p\big(s_n^{(4a-2b+2)_+}/(nh_2^2)\big)$. This gives $I_{12n} = O_p\big(s_n^{(4a-2b+4)_+}/(nh_2^2)\big)$ as a result.


(c) Observe $I_{13n} \le \|\widehat\Sigma_{s_n}^{-1}\|^2\,\|\widehat\Lambda - \Lambda\|^2\,\|\widehat\Sigma_{s_n}^{-1/2}\|^2$, where $\|\widehat\Sigma_{s_n}^{-1}\|^2 \le \sum_{j=1}^{s_n}\widehat\alpha_j^{-2} \le C_5\sum_{j=1}^{s_n}j^{2a} = O_p(s_n^{2a+1})$ and similarly $\|\widehat\Sigma_{s_n}^{-1/2}\|^2 = O_p(s_n^{a+1})$. From Theorem 3.2 we have $\|\widehat\Lambda - \Lambda\|^2 = O_p(\{nh_1\}^{-1})$. Thus $I_{13n} = O_p\big(s_n^{3a+2}/(nh_1)\big)$. Combining the previous results leads to

$$I_{1n} = O_p\Big(\frac{1}{s_n^{2b-2a-1}} + \frac{s_n^{(4a-2b+4)_+}}{nh_2^2} + \frac{s_n^{3a+2}}{nh_1}\Big).$$


Chapter 4

Cumulative Variance Estimation for

Classification


4.1 Introduction

In a typical classification problem in functional data analysis (FDA), one observes a training set {(Xi, Yi) : 1 ≤ i ≤ n}, where Xi is a random function and Yi ∈ {0, 1, . . . , C − 1} is a known class label. Analogous to multivariate classification, the goal is to predict to which class a new observation X0 belongs. This problem has been studied extensively

in FDA. Pfeiffer et al. (2002) suggested a simple method of using summary statistics such

as the mode; James and Hastie (2001), Shin (2008) extended linear discriminant analysis;

James and Sugar (2003) developed a clustering method for sparse functional data; Hall

et al. (2001), Song et al. (2008) constructed classifiers based on functional principal com-

ponents (FPCs); Leng and Muller (2006) proposed functional logistic regression of FPCs;

Ferraty and Vieu (2003) estimated posterior probabilities using kernel estimators; Biau

et al. (2005) worked with a nearest neighbor-type classifier of FPCs; Ferraty et al. (2007)

extended multivariate factorial analysis; Cuevas et al. (2007), Cuesta-Albertos and Nieto-

Reyes (2008) considered classification based on data depth; Wang et al. (2007) studied

Bayesian classification using wavelets; Tian and James (2013) projected the functional

process onto simple piecewise constant and piecewise linear functions; Hall and Delaigle

(2012) showed that perfect asymptotic classification is possible if the functional process

satisfies certain smoothness conditions.

In this chapter, we study functional classification from the perspective of effective

dimension reduction (EDR). Recall from Chapter 3 that EDR methods assume a very

flexible semiparametric multiple index model

Y = g(〈β1, X〉, . . . , 〈βK, X〉; ε).  (4.1)

Dimension reduction is particularly useful when the process X is infinite dimensional

since it is natural to expect that information relevant to the separation of the C classes is

contained in only a small number of projections 〈β1, X〉, . . . , 〈βK , X〉. Despite the sizable


literature on EDR methods for multivariate regression, the corresponding literature for

classification has been relatively scarce. Cook (1996), Cook and Critchley (2000) pursued

an exploratory graphical approach by studying binary plots of at most three projections of

X onto the EDR space. Cook and Lee (1999) showed that SIR (Li, 1991) can only detect

differences between the means of the two underlying classes, while sliced average variance

estimation (SAVE, Cook and Weisberg, 1991) can detect both mean and covariance

differences. Li (2000), Cook and Yin (2001) further proved that SIR’s EDR directions

and Fisher’s linear discriminant (LDA) coordinates are proportional and thus span the

same subspace. However, Velilla (2008, 2010) showed that quadratic discriminant analysis

(QDA, Schott, 1993) and SAVE estimate vastly different subspaces since proportionality

does not hold.

By analogy to the multivariate case (Zhu et al., 2010), we extend the ideas in Chapter 3 for Functional Cumulative Slicing (FCS) to derive Functional Cumulative Variance (FCV), the cumulative slicing version of SAVE. Our primary motivation for developing FCV is that, although classification does not pose any conceptual or theoretical challenges to EDR methods in general, first-moment methods such as FCS suffer because, as we will demonstrate later, they do not adequately estimate the EDR space in practice. Following the same strategy of "pooling data together across subjects", our proposal is applicable to both densely/completely and sparsely observed functional data. The rest of this chapter is organized as follows: Chapter 4.2 presents the proposed FCV methodology and estimation procedure, Chapter 4.3 provides numerical studies of simulated examples, and Chapter 4.4 applies the method to temporal gene expression data.

4.2 Methodology

Although our method is applicable to C-class classification, we will assume C = 2 for simplicity. We observe data pairs {(Xi, Yi) : 1 ≤ i ≤ n}, independent and identically distributed (i.i.d.) as (X, Y), where Xi is a random variable defined on the real and separable Hilbert space H ≡ L²(T) for a compact interval T, and Yi is its class label, which equals k if Xi is sampled from the subpopulation Πk, k = 0, 1. Let π0 and π1 = 1 − π0 denote the probabilities that X is drawn from subpopulations Π0 and Π1, respectively. Finally, we make the same assumption as in Chapter 3 on the first and fourth moments of X.

Assumption 4.1. X is centered and has a finite fourth moment, ∫_T E[X⁴(t)]dt < ∞.

Recall that under Assumption 4.1, the covariance surface of X is given by Σ(s, t) = E[X(s)X(t)], which generates a Hilbert-Schmidt operator Σ = E[X ⊗ X] on H. By Mercer's Theorem, Σ admits the spectral decomposition Σ = ∑_{j=1}^∞ αj φj ⊗ φj, where the eigenfunctions {φj}_{j=1,2,...} form a complete and orthonormal system in H, and the eigenvalues {αj}_{j=1,2,...} are assumed to be strictly decreasing and positive such that ∑_{j=1}^∞ αj < ∞. Finally, recall that the EDR directions β1, . . . , βK in model (4.1) are linearly independent functions in H, and the response Y is assumed to be conditionally independent of X given the K projections 〈β1, X〉, . . . , 〈βK, X〉.

When the response is binary, the FCS operator defined in (3.2) reduces to Λ_FCS = π0 w(0) E[X1(Y = 0)] ⊗ E[X1(Y = 0)] and thus can recover only one EDR direction, regardless of the complexity of the underlying EDR space. In general, for C-class classification, first-moment EDR methods such as FSIR and FCS can recover at most C − 1 EDR directions. This limitation, combined with the restriction of being able to detect only differences between class means, motivates our development of FCV, a second-order EDR method.

4.2.1 Validity of Functional Cumulative Variance

Originally proposed to estimate the EDR space when SIR fails, SAVE captures second-moment information on X|Y and targets the EDR space through the operator Λ_SAVE = E{Σ − V[X|Y]}². By analogy to Zhu et al. (2010), who extended multivariate cumulative slicing to cumulative variance, we replace V[X|Y] with its cumulative version V[X1(Y ≤ y)] = E{(X1(Y ≤ y) − E[X1(Y ≤ y)]) ⊗ (X1(Y ≤ y) − E[X1(Y ≤ y)])}. This leads to the functional cumulative variance operator

$$\Lambda_{FCV} = E\big[\Lambda^2(Y)\big], \qquad (4.2)$$

where Λ(y) = V[X1(Y ≤ y)] − F(y)Σ and F(y) = P(Y ≤ y). The following theorem establishes the validity of FCV. Analogous to the multivariate case, the linearity and constant variance assumptions are needed. For any function b ∈ H,

Assumption 4.2. The conditional mean E[〈b,X〉|〈β1, X〉, . . . , 〈βK , X〉] is a linear func-

tion of 〈β1, X〉, . . . , 〈βK , X〉.

Assumption 4.3. The conditional variance V[〈b,X〉|〈β1, X〉, . . . , 〈βK , X〉] is non-random.

Assumption 4.2 is the same as Assumption 3.2 in Chapter 3, used to derive the validity of functional cumulative slicing. Recall that a sufficient condition for the linearity assumption is that X has an elliptically contoured distribution, which is more general than, but bears a close connection to, a Gaussian process (Cambanis et al., 1981, Li and Hsing, 2010). Assumption 4.3 is much more restrictive: it is satisfied if X is a Gaussian process, but holds only approximately if X has an elliptically contoured distribution (Shao et al., 2007).

Theorem 4.1. If assumptions 4.1-4.3 hold for model (4.1), span({Λ(y) : y ∈ R}) ⊆

span(Σβ1, . . . ,ΣβK).

A corollary to Theorem 4.1 is that range(Λ_FCV) ⊆ span(Σβ1, . . . , ΣβK). It is also easy to see that range(Λ_FCS) ⊆ range(Λ_FCV), and thus functional cumulative variance is a more comprehensive method for estimating the EDR space than FCS. If Λ_FCV has K non-zero eigenvalues, the space spanned by its eigenfunctions is precisely span(Σβ1, . . . , ΣβK). Similar to FCS in Chapter 3, recall that our target is the subcentral space S_{Y|X}, even though the EDR directions themselves are not identifiable. For specificity, we again regard the eigenfunctions of Σ⁻¹Λ_FCV associated with the K largest non-zero eigenvalues as the index functions β1, . . . , βK themselves unless stated otherwise.

We refer the reader to Chapter 3.2 for the treatment of the unboundedness of the operator Σ⁻¹. An assumption analogous to Assumption 3.3 on the principal components of X is needed to ensure that Σ⁻¹Λ_FCV is well-defined.

4.2.2 Functional Cumulative Variance for Sparse Functional Data

For the data {(Xi, Yi) : 1 ≤ i ≤ n}, independently and identically distributed (i.i.d.) as (X, Y), the predictor trajectories Xi are observed intermittently, contaminated with noise, and collected in the form of repeated measurements {(Tij, Uij) : 1 ≤ i ≤ n, 1 ≤ j ≤ Ni}, where Uij = Xi(Tij) + εij with i.i.d. measurement errors εij that have zero mean and constant variance σ²x and are independent of all other random variables. When only a few observations are available for some or even all subjects, individual smoothing to recover Xi is infeasible, and one must adopt the strategy of pooling together data from across subjects for consistent estimation.

From functional cumulative slicing in Chapter 3, both the unconditional mean m(t, y) = E[X(t)1(Y ≤ y)] and the covariance surface Σ(s, t) = E[X(s)X(t)] can be estimated by the local linear estimators defined in (3.3) and (3.5), respectively.

We use a local linear estimator similar to that of Σ(s, t) to estimate V[X1(Y ≤ y)]. Let Gi(Tij, Til; y) = {Uij1(Yi ≤ y) − m̂(Tij, y)}{Uil1(Yi ≤ y) − m̂(Til, y)} denote the "raw" covariances of X1(Y ≤ y). It is easy to check that E[Gi(Tij, Til; y) | Tij, Til] ≈ V(Tij, Til; y) + F(y)σ²δjl, where V(s, t; y) = cov(X(s)1(Y ≤ y), X(t)1(Y ≤ y)) and δjl is 1 if j = l and 0 otherwise. This suggests that the diagonal of Gi should be removed, and thus

$$\min_{(b_0,b_1,b_2)}\ \sum_{i=1}^n\sum_{1\le j\ne l\le N_i}\big\{G_i(T_{ij},T_{il};y) - b_0 - b_1(T_{ij}-s) - b_2(T_{il}-t)\big\}^2\,K_2\Big(\frac{T_{ij}-s}{h_2},\frac{T_{il}-t}{h_2}\Big) \qquad (4.3)$$

yields V̂(s, t; y) = b̂0, where K₂ is a non-negative bivariate kernel density and h₂ = h₂(n) is the bandwidth chosen by leave-one-curve-out cross-validation.
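As an illustrative sketch of the smoothing step in (4.3) (a simplified implementation under an assumed Epanechnikov product kernel, not the exact code behind the reported results), the surface value V̂(s, t; y) is the intercept of a kernel-weighted plane fitted to the pooled off-diagonal raw covariances; here m_hat stands for the estimate m̂ from (3.3), supplied as a vectorized callable.

```python
import numpy as np

def raw_covariances(T_i, U_i, Y_i, y, m_hat):
    """Off-diagonal raw covariances G_i(T_ij, T_il; y) for one subject."""
    r = U_i * (Y_i <= y) - m_hat(T_i, y)              # U_ij 1(Y_i <= y) - m_hat
    j, l = np.where(~np.eye(len(T_i), dtype=bool))    # all pairs with j != l
    return T_i[j], T_i[l], r[j] * r[l]

def local_linear_V(s, t, Ts, Tt, G, h2):
    """Local linear surface estimate of V(s, t; y) from pooled raw covariances."""
    zs, zt = (Ts - s) / h2, (Tt - t) / h2
    w = (0.75 * np.clip(1 - zs**2, 0, None)) * (0.75 * np.clip(1 - zt**2, 0, None))
    X = np.column_stack([np.ones_like(G), Ts - s, Tt - t])
    sw = np.sqrt(w)
    # weighted least squares: the intercept b0 estimates V(s, t; y)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], G * sw, rcond=None)
    return beta[0]
```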

Then, the kernel associated with the operator Λ_FCV in (4.2) can be estimated by its sample moment

$$\widehat\Lambda_{FCV}(s,t) = \frac{1}{n}\sum_{i=1}^n\big\{\widehat V(s,t;Y_i) - \widehat F(Y_i)\widehat\Sigma(s,t)\big\}^2, \qquad (4.4)$$

which reduces to Λ̂_FCV(s, t) = π̂0{V̂(s, t; 0) − π̂0 Σ̂(s, t)}² when the response is binary. Finally, the estimated EDR directions {β̂k,sn}k=1,...,K are the eigenfunctions associated with the K largest nonzero eigenvalues of Σ̂⁻¹sn Λ̂_FCV, where Σ̂⁻¹sn is defined in (3.6).
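To see why the binary case collapses to a single slice (a short verification that is implicit in the text): for Y ∈ {0, 1} and y ≥ 1 we have 1(Y ≤ y) ≡ 1, so that, X being centered,

$$\Lambda(y) = \mathrm{V}[X] - F(y)\Sigma = \Sigma - \Sigma = 0 \ \text{ for } y\ge 1, \qquad\text{hence}\qquad \Lambda_{FCV} = E\big[\Lambda^2(Y)\big] = \pi_0\Lambda^2(0) = \pi_0\big\{\mathrm{V}[X1(Y\le 0)] - \pi_0\Sigma\big\}^2,$$

using F(0) = P(Y = 0) = π0; only the slice at y = 0 contributes, with weight π0.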

The situation for completely observed Xi is similar to the multivariate case and considerably simpler. The quantity V(s, t; y) is easily estimated by its sample moment V̂(s, t; y) = n⁻¹ ∑_{i=1}^n {Xi(s)1(Yi ≤ y) − m̂(s, y)}{Xi(t)1(Yi ≤ y) − m̂(t, y)}, where m̂(t, y) = n⁻¹ ∑_{i=1}^n Xi(t)1(Yi ≤ y), while the estimate of Λ_FCV remains the same as (4.4). For densely observed Xi, individual smoothing can be used as a preprocessing step to recover smooth trajectories, and the estimation error introduced in this step can be shown to be asymptotically negligible under certain design conditions; that is, it is equivalent to the ideal situation of completely observed Xi (Hall et al., 2006).
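In the completely observed case the estimate takes only a few lines; the following minimal sketch (curves on a common grid, binary labels) computes the sample moments above and assembles Λ̂_FCV(s, t) = π̂0{V̂(s, t; 0) − π̂0 Σ̂(s, t)}².

```python
import numpy as np

def fcv_kernel_dense(X, Y):
    """FCV kernel estimate for fully observed curves and binary labels.

    X : (n, p) array with X[i, j] = X_i(t_j) on a common grid
    Y : (n,) array of labels in {0, 1}
    """
    n = X.shape[0]
    pi0 = np.mean(Y == 0)                        # estimated P(Y = 0)
    Xc = X - X.mean(axis=0)                      # center (X is assumed centered)
    Sigma = Xc.T @ Xc / n                        # covariance surface Sigma(s, t)
    Z = X * (Y == 0)[:, None]                    # X_i(t) 1(Y_i <= 0)
    Zc = Z - Z.mean(axis=0)
    V0 = Zc.T @ Zc / n                           # sample moment V(s, t; 0)
    return pi0 * (V0 - pi0 * Sigma) ** 2         # FCV kernel on the grid
```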


4.3 Simulations

In this section we illustrate the practical performance of the proposed FCV method, using reduced-rank quadratic discriminant analysis (see Hastie et al., 2009, chap. 4.3.3) to split the K-dimensional EDR space into C = 2 regions for class prediction. For i = 1, . . . , n, let Zi = (〈β̂1,sn, Xi〉, . . . , 〈β̂K,sn, Xi〉)ᵀ denote the K-variate random variable obtained by projecting Xi onto the EDR space estimated via FCV. For a new observation Z0 = (〈β̂1,sn, X0〉, . . . , 〈β̂K,sn, X0〉)ᵀ, we calculate the reduced-rank quadratic discriminant function

$$\delta_k(Z_0) = -\frac{1}{2}\log|\widehat\Sigma_k| - \frac{1}{2}(Z_0 - \widehat\mu_k)^\top\widehat\Sigma_k^{-1}(Z_0 - \widehat\mu_k) + \log\widehat\pi_k, \qquad (4.5)$$

where μ̂k and Σ̂k are the mean vector and covariance matrix of subpopulation Πk calculated from the reduced variables Zi, respectively, and π̂k is the estimated proportion of subpopulation Πk. We classify X0 to subpopulation Π0 if δ0(Z0) > δ1(Z0), and to Π1 if δ0(Z0) < δ1(Z0). We remind the reader from Chapter 3.4 that 〈β̂k,sn, Xi〉 is given by an integral approximation when the functional data are dense, while Xi is replaced by its PACE (Yao et al., 2005a) estimate X̂i when the functional data are sparse.
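A minimal sketch of this classification rule, taking the projected scores Zi and Z0 as given, might read as follows.

```python
import numpy as np

def qda_score(Z0, Z, labels, k):
    """Reduced-rank quadratic discriminant delta_k(Z0) from (4.5)."""
    Zk = Z[labels == k]                           # scores from subpopulation k
    mu = Zk.mean(axis=0)
    S = np.atleast_2d(np.cov(Zk, rowvar=False))   # class covariance Sigma_k
    pi_k = Zk.shape[0] / Z.shape[0]               # estimated class proportion
    d = Z0 - mu
    return (-0.5 * np.linalg.slogdet(S)[1]        # -log|Sigma_k| / 2
            - 0.5 * d @ np.linalg.solve(S, d)     # quadratic Mahalanobis term
            + np.log(pi_k))

def classify(Z0, Z, labels):
    """Assign X0 to the subpopulation with the larger discriminant score."""
    return max((0, 1), key=lambda k: qda_score(Z0, Z, labels, k))
```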

We compare our proposal to (i) functional SAVE in the same reduced-rank QDA framework, (ii) FCS in a reduced-rank LDA framework, (iii) QDA on the FPCs (Hall et al., 2001), and (iv) a naive Bayes (NB) classifier on the FPCs. In all of the following simulations we generate a total of n = 100 curves from Π0 and Π1 with respective sizes n0 = n/2 and n1 = n/2. For k = 0, 1, functional processes from Πk are generated as Xki(t) = ∑_{j=1}^{40} (θkj + μkj)φj(t), where θkj is i.i.d. N(−(λkj/2)^{1/2}, λkj/2) with probability 1/2 and N((λkj/2)^{1/2}, λkj/2) with probability 1/2. The λkj and μkj are selected depending on the property of FCV we want to illustrate below. In each case the measurement error on Xki is i.i.d. N(0, 0.01), the domain of observation is t ∈ [0, 1], and the eigenfunctions are φj(t) = sin(πtj/2)/√2 for j even and φj(t) = cos(πtj/2)/√2 for j odd. For dense


functional data the Tij are 101 equispaced points in [0, 1], while for sparse functional data

the number of observations per subject Ni is chosen uniformly from {5, . . . , 14} and the

observational times Tij are i.i.d. U(0, 1).
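To make the mixture construction of the scores θkj explicit, the sketch below generates curves from Πk on a dense grid; the choices λ0j = j⁻³ and μ0j = 1(j ≤ 4) for class 0 of model A are taken from Table 4.1 below.

```python
import numpy as np

def simulate_class(n_k, lam, mu, grid, rng):
    """Generate n_k curves X_ki(t) = sum_j (theta_kj + mu_kj) phi_j(t) + error."""
    J = len(lam)
    j = np.arange(1, J + 1)
    # eigenfunctions: sin(pi t j / 2)/sqrt(2) for even j, cos(...) for odd j
    Phi = np.where(j[:, None] % 2 == 0,
                   np.sin(np.pi * grid * j[:, None] / 2),
                   np.cos(np.pi * grid * j[:, None] / 2)) / np.sqrt(2)
    # theta_kj ~ N(-(lam/2)^{1/2}, lam/2) or N((lam/2)^{1/2}, lam/2), each w.p. 1/2
    signs = rng.choice([-1.0, 1.0], size=(n_k, J))
    theta = signs * np.sqrt(lam / 2) + rng.normal(0.0, np.sqrt(lam / 2), (n_k, J))
    X = (theta + mu) @ Phi                        # curves on the grid
    return X + rng.normal(0.0, 0.1, X.shape)      # N(0, 0.01) measurement error

rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 101)
j = np.arange(1, 41)
X0 = simulate_class(50, j**-3.0, np.where(j <= 4, 1.0, 0.0), grid, rng)
```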

Shown in Table 4.1 are the combinations of λkj and μkj that are considered. Model A captures the general classification problem where both the inter-class means and covariances differ, model B depicts the scenario where only the inter-class covariances differ, and model C the scenario where only the inter-class means differ. We compute the average percent of misclassification and its standard error over 100 Monte Carlo repetitions, shown in Table 4.2 for the sparse design. The structural dimension K and the truncation parameter sn are chosen by minimizing the misclassification rate. These results suggest that FCV is optimal when the inter-class covariances are distinct, but that FCS is optimal otherwise. The results for FCV and FSAVE when the inter-class covariances are equal corroborate those of Zhu and Hastie (2003), who showed that multivariate SAVE tends to over-emphasize second-order differences between classes while ignoring first-order differences.

Table 4.1: Shown are the combinations of λkj and μkj used in our simulation study.

Model   λ0j     λ1j     μ0j                          μ1j
A       j⁻³     4j⁻²    μ01 = μ02 = μ03 = μ04 = 1    0 for all j
B       j⁻³     4j⁻²    0 for all j                  0 for all j
C       3j⁻²    3j⁻²    μ01 = μ02 = μ03 = μ04 = 1    0 for all j

4.4 Data Applications

In this section we study temporal gene expression data for the yeast cell cycle (Spellman et al., 1998). Each trajectory contains 18 observations of gene expression, measured


Table 4.2: Shown are the average misclassification error (×100%) with its standard error (in parentheses), and the optimal K and sn that minimize the average misclassification error over 100 Monte Carlo repetitions for sparse functional data.

Model   FCV             FSAVE           FCS             QDA          NB
A       16.11 (1.36)    19.48 (1.52)    22.94 (.44)     21.93 (.59)  28.83 (.79)
        K = 3, sn = 3   K = 3, sn = 3   K = 1, sn = 3   sn = 3       sn = 2
B       21.82 (.35)     24.34 (.35)     47.72 (.43)     46.51 (.43)  30.80 (.48)
        K = 5, sn = 5   K = 5, sn = 6   K = 1, sn = 5   sn = 2       sn = 2
C       33.31 (.81)     37.87 (.86)     25.91 (.48)     27.28 (.47)  27.74 (.63)
        K = 2, sn = 2   K = 3, sn = 3   K = 1, sn = 4   sn = 3       sn = 3

Table 4.3: Shown are the average misclassification error (×100%) with its standard error (in parentheses), and the optimal K and sn that minimize the average 5-fold cross-validated classification error for the temporal gene expression data.

FCV             FSAVE           FCS             QDA          NB
15.11 (.25)     15.12 (.27)     21.76 (.34)     37.47 (.51)  40.01 (.69)
K = 2, sn = 2   K = 2, sn = 2   K = 1, sn = 2   sn = 3       sn = 4

every 7 minutes between 0 and 119 minutes. In total, 92 genes were identified, of which 43 are known to regulate the G1 (Y = 1) phase and the remaining 49 are known to regulate the non-G1 (Y = 0) phase. The functional trajectories are shown in Figure 4.1. To artificially create sparse functional trajectories from this dense dataset, we randomly select 9 observations from each trajectory. In Table 4.3, we present the minimized average 5-fold cross-validated classification error over 20 random partitions for the different methods, together with the selected structural dimensions and truncation sizes. The second-order EDR methods, FCV and FSAVE, are virtually indistinguishable from each other, but both compare very favorably to the other methods.



Figure 4.1: Temporal gene expression trajectories for genes regulating the G1 phase and the non-G1 phase.

4.A Appendix: Proof of Theorem 4.1

It suffices to show that for any b ∈ H, 〈b, Σβₖ〉 = 0 for all k = 1, . . . , K implies 〈b, Λ(y)b〉 = 0. First, observe that 〈b, Λ(y)b〉 = 〈b, V[X1(Y ≤ y)]b〉 − F(y)〈b, Σb〉. Then,

$$\begin{aligned}
\langle b, \mathrm{V}[X1(Y\le y)]b\rangle &= \langle b, E[X\otimes X\,1(Y\le y)]b\rangle - \langle b, (E[X1(Y\le y)]\otimes E[X1(Y\le y)])\,b\rangle\\
&= E\big[\langle b, X\rangle^2 1(Y\le y)\big] - E^2\big[\langle b, X\rangle 1(Y\le y)\big]\\
&= E\big\{E\big[\langle b, X\rangle^2 \mid \langle\beta_1, X\rangle, \dots, \langle\beta_K, X\rangle\big]1(Y\le y)\big\} - E^2\big\{E\big[\langle b, X\rangle \mid \langle\beta_1, X\rangle, \dots, \langle\beta_K, X\rangle\big]1(Y\le y)\big\}\\
&= E\big[1(Y\le y)\big]\,E\big\{E\big[\langle b, X\rangle^2 \mid \langle\beta_1, X\rangle, \dots, \langle\beta_K, X\rangle\big]\big\} = F(y)\langle b, \Sigma b\rangle,
\end{aligned}$$

where the second-to-last equality follows by invoking the linearity and constant variance assumptions. Thus 〈b, Λ(y)b〉 = 0, as desired.


Bibliography

Adler, R. J. and Taylor, J. E. (2007), Random Fields and Geometry, Springer Monographs

in Mathematics, Springer.

Ash, R. B. and Gardner, M. F. (1975), Topics in Stochastic Processes, Probability and Mathematical Statistics, Vol. 27, New York: Academic Press [Harcourt Brace Jovanovich Publishers].

Biau, G., Bunea, F., and Wegkamp, M. H. (2005), “Functional classification in Hilbert

spaces,” IEEE Transactions on Information Theory, 51, 2163–2172.

Bosq, D. (2000), Linear Processes in Function Spaces: Theory and Applications, vol. 149,

New York: Springer-Verlag Inc.

Cai, T. T. and Hall, P. (2006), "Prediction in functional linear regression," The Annals of Statistics, 34, 2159–2179.

Cambanis, S., Huang, S., and Simons, G. (1981), “On the theory of elliptically contoured

distributions," Journal of Multivariate Analysis, 11, 368–385.

Cardot, H., Ferraty, F., and Sarda, P. (1999), “Functional linear model,” Statistics &

Probability Letters, 45, 11–22.

Chen, D., Hall, P., and Muller, H.-G. (2011), “Single and multiple index functional

regression models with nonparametric link,” The Annals of Statistics, 39, 1720–1747.


Chiaromonte, F., Cook, D. R., and Li, B. (2002), “Sufficient Dimension Reduction in

Regressions with Categorical Predictors,” The Annals of Statistics, 30, 475–497.

Cook, D. R. (1996), “Graphics for regressions with a binary response,” Journal of the

American Statistical Association, 91, 983–992.

— (1998), Regression Graphics: Ideas for Studying Regressions through Graphics, vol.

318 of Probability and Statistics, Wiley.

Cook, D. R. and Critchley, F. (2000), “Identifying Regression Outliers and Mixtures

Graphically,” Journal of the American Statistical Association, 95, 781–794.

Cook, D. R., Forzani, L., and Yao, A.-F. (2010), “Necessary and sufficient conditions for

consistency of a method for smoothed functional inverse regression,” Statistica Sinica,

20, 235–238.

Cook, D. R. and Lee, H. (1999), “Dimension Reduction in Binary Response Regression,”

Journal of the American Statistical Association, 94, 1187–1200.

Cook, D. R. and Weisberg, S. (1991), “Comment on “Sliced Inverse Regression for Di-

mension Reduction”,” Journal of the American Statistical Association, 86, 328–332.

Cook, D. R. and Yin, X. (2001), “Special Invited Paper: Dimension Reduction and

Visualization in Discriminant Analysis (with discussion),” Australian and New Zealand

Journal of Statistics, 43, 147–199.

Cuesta-Albertos, J. and Nieto-Reyes, A. (2008), “The random Tukey depth,” Computa-

tional Statistics & Data Analysis, 52, 4979–4988.

Cuevas, A., Febrero, M., and Fraiman, R. (2007), "Robust estimation and classification for functional data via projection-based depth notions," Computational Statistics, 22, 481–496.


Demidenko, E. (2004), Mixed Models: Theory and Applications, Wiley Series in Proba-

bility and Statistics, Wiley.

Di, C.-Z., Crainiceanu, C. M., Caffo, B. S., and Punjabi, N. M. (2011), “Multilevel

functional principal component analysis,” Annals of Applied Statistics, 3, 458–488.

Duan, N. and Li, K.-C. (1991), “Slicing regression: a link-free regression method,” The

Annals of Statistics, 19, 505–530.

Fan, J. and Gijbels, I. (1996), Local polynomial modelling and its applications, vol. 66 of

Monographs on Statistics and Applied Probability, London: Chapman & Hall.

Ferraty, F. and Vieu, P. (2003), “Curves Discrimination: a Nonparametric Functional

Approach,” Computational Statistics & Data Analysis, 44, 161–173.

Ferraty, F., Vieu, P., and Pla-Viguier, S. (2007), “Factor-based comparison of groups of

curves,” Computational Statistics & Data Analysis, 51, 4903–4910.

Ferre, L. and Yao, A. F. (2003), “Functional sliced inverse regression analysis,” Statistics,

37, 475–488.

Ferre, L. and Yao, A.-F. (2005), “Smoothed functional inverse regression,” Statistica

Sinica, 15, 665–683.

Fisher, R. A. (1918), “The Correlation Between Relatives on the Supposition of

Mendelian Inheritance,” Transactions of the Royal Society of Edinburgh, 52, 399–433.

Griswold, C., Gomulkiewicz, R., and Heckman, N. (2008), “Hypothesis testing in com-

parative and experimental studies of function-valued traits,” Evolution, 62, 1229–1242.

Hall, P. and Delaigle, A. (2012), “Achieving near perfect classification for functional

data,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74,

267–286.


Hall, P. and Horowitz, J. L. (2007), “Methodology and convergence rates for functional

linear regression,” The Annals of Statistics, 35, 70–91.

Hall, P. and Hosseini-Nasab, M. (2006), “On properties of functional principal compo-

nents analysis,” Journal of the Royal Statistical Society: Series B (Statistical Method-

ology), 68, 109–126.

Hall, P., Muller, H.-G., and Wang, J.-L. (2006), “Properties of principal component

methods for functional and longitudinal data analysis,” The Annals of Statistics, 34,

1493–1517.

Hall, P., Muller, H. G., and Yao, F. (2008), “Modeling sparse generalized longitudinal

observations with latent Gaussian processes,” Journal of the Royal Statistical Society:

Series B (Statistical Methodology), 70, 703–723.

Hall, P., Poskitt, D. S., and Presnell, B. (2001), “A functional data-analytic approach to

signal discrimination,” Technometrics, 43, 1–9.

Hastie, T. and Tibshirani, R. (1990), Generalized additive models, vol. 43 of Monographs

on Statistics and Applied Probability, London: Chapman and Hall Ltd.

Hastie, T., Tibshirani, R., and Friedman, J. (2009), The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer Series in Statistics, New York: Springer-Verlag, 2nd ed.

He, G., Muller, H.-G., and Wang, J.-L. (2003), “Functional canonical analysis for square

integrable stochastic processes,” Journal of Multivariate Analysis, 85, 54–77.

Heckman, N. (2003), “Functional data analysis in evolutionary biology,” in Recent Ad-

vances and Trends in Nonparametric Statistics, eds. Akritas, M. G. and Politis, D. N.,

Elsevier, pp. 49–60.


Henderson, C. R. (1950), “Estimation of genetic parameters (abstract),” Annals of Math-

ematical Statistics, 21, 309–310.

James, G. M. and Hastie, T. J. (2001), “Functional linear discriminant analysis for irreg-

ularly sampled curves,” Journal of the Royal Statistical Society: Series B (Statistical

Methodology), 63, 533–550.

James, G. M., Hastie, T. J., and Sugar, C. A. (2000), “Principal component models for

sparse functional data,” Biometrika, 87, 587–602.

James, G. M. and Silverman, B. W. (2005), “Functional adaptive model estimation,”

Journal of the American Statistical Association, 100, 565–576.

James, G. M. and Sugar, C. A. (2003), “Clustering for sparsely sampled functional data,”

Journal of the American Statistical Association, 98, 397–408.

Kaslow, R. A., Ostrow, D. G., Detels, R., Phair, J. P., Polk, B. F., and Rinaldo, C. R.

(1987), “The Multicenter AIDS Cohort Study: Rationale, Organization and Selected

Characteristics of the Participants,” American Journal of Epidemiology, 126, 310–318.

Kato, T. (1995), Perturbation theory for linear operators, Berlin: Springer-Verlag.

Kirkpatrick, M. and Heckman, N. (1989), “A quantitative genetic model for growth, shape, reaction norms, and other infinite-dimensional characters,” Journal of Mathematical Biology, 27, 429–450.

Leng, X. and Müller, H.-G. (2006), “Classification using functional data analysis for temporal gene expression data,” Bioinformatics, 22, 68–76.

Li, B. and Wang, S. (2007), “On directional regression for dimension reduction,” Journal of the American Statistical Association, 102, 997–1008.

Li, K.-C. (1991), “Sliced inverse regression for dimension reduction,” Journal of the American Statistical Association, 86, 316–342, with discussion and a rejoinder by the author.

— (1992), “On principal Hessian directions for data visualization and dimension reduction: another application of Stein’s lemma,” Journal of the American Statistical Association, 87, 1025–1039.

— (2000), “High dimensional data analysis via the SIR/PHD approach.”

Li, Y. and Hsing, T. (2010), “Deciding the dimension of effective dimension reduction space for functional and high-dimensional data,” The Annals of Statistics, 38, 3028–3062.

Lin, X. and Carroll, R. J. (2000), “Nonparametric function estimation for clustered data when the predictor is measured without/with error,” Journal of the American Statistical Association, 95, 520–534.

Liu, B. and Müller, H.-G. (2009), “Estimating derivatives for samples of sparsely observed functions, with application to on-line auction dynamics,” Journal of the American Statistical Association, 104, 704–714.

Loève, M. (1978), Probability Theory II, vol. 46 of Graduate Texts in Mathematics, New York: Springer-Verlag.

Lynch, M. and Walsh, B. (1998), Genetics and analysis of quantitative traits, Sunderland, MA: Sinauer.

Martins-Filho, C. and Yao, F. (2006), “A note on the use of V and U statistics in nonparametric models of regression,” Annals of the Institute of Statistical Mathematics, 58, 389–406.

— (2007), “Nonparametric frontier estimation via local linear regression,” Journal of Econometrics, 141, 283–319.

Meyer, K. (1985), “Genetic parameters for dairy production of Australian Black and White cows,” Livestock Production Science, 12, 205–219.

— (1999), “Estimates of genetic and phenotypic covariance functions for postweaning growth and mature weight of beef cows,” Journal of Animal Breeding and Genetics, 116, 181–205.

— (2007), “WOMBAT – a tool for mixed model analyses in quantitative genetics by restricted maximum likelihood (REML),” Journal of Zhejiang University Science B, 8, 815–821.

Meyer, K., Carrick, M. J., and Donnelly, B. J. P. (1993), “Genetic parameters for growth traits of Australian beef cattle from a multi-breed selection experiment,” Journal of Animal Science, 71, 2614–2622.

Meyer, K. and Hill, W. (1997), “Estimation of genetic and phenotypic covariance functions for longitudinal or repeated records by restricted maximum likelihood,” Livestock Production Science, 47, 185–200.

Morris, J. S., Vannucci, M., Brown, P. J., and Carroll, R. J. (2003), “Wavelet-based nonparametric modeling of hierarchical functions in colon carcinogenesis,” Journal of the American Statistical Association, 98, 573–597, with comments and a rejoinder by the authors.

Müller, H.-G. (2005), “Functional modelling and classification of longitudinal data,” Scandinavian Journal of Statistics, 32, 223–240.

— (2008), “Functional modeling of longitudinal data,” in Longitudinal Data Analysis (Handbooks of Modern Statistical Methods), eds. Fitzmaurice, G., Davidian, M., Verbeke, G., and Molenberghs, G., New York: Chapman & Hall/CRC, pp. 223–252.

Müller, H.-G. and Prewitt, K. A. (1993), “Multiparameter bandwidth processes and adaptive surface smoothing,” Journal of Multivariate Analysis, 47, 1–21.

Müller, H.-G. and Stadtmüller, U. (2005), “Generalized functional linear models,” The Annals of Statistics, 33, 774–805.

Peng, J. and Paul, D. (2011), “Principal components analysis for sparsely observed correlated functional data using a kernel smoothing approach,” Electronic Journal of Statistics, 5, 1960–2003.

Pfeiffer, R. M., Bura, E., Smith, A., and Rutter, J. L. (2002), “Two approaches to mutation detection based on functional data,” Statistics in Medicine, 21, 3447–3464.

Prakasa Rao, B. L. S. (1983), Nonparametric functional estimation, Orlando, FL: Academic Press.

Ramsay, J. O., Bock, D. R., and Gasser, T. (1995), “Comparison of height acceleration curves in the Fels, Zurich, and Berkeley growth data,” Annals of Human Biology, 22, 413–426.

Ramsay, J. O. and Silverman, B. W. (2005), Functional data analysis, Springer Series in Statistics, New York: Springer, 2nd ed.

Rice, J. A. (2004), “Functional and longitudinal data analysis: perspectives on smoothing,” Statistica Sinica, 14, 631–647.

Rice, J. A. and Silverman, B. W. (1991), “Estimating the mean and covariance structure nonparametrically when the data are curves,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 53, 233–243.

Rice, J. A. and Wu, C. O. (2001), “Nonparametric mixed effects models for unequally sampled noisy curves,” Biometrics, 57, 253–259.

Schott, J. R. (1993), “Dimensionality reduction in quadratic discriminant analysis,” Computational Statistics & Data Analysis, 16, 161–174.

Shao, Y., Cook, R. D., and Weisberg, S. (2007), “Marginal tests with sliced average variance estimation,” Biometrika, 94, 285–296.

Shin, H. (2008), “An extension of Fisher’s discriminant analysis for stochastic processes,” Journal of Multivariate Analysis, 99, 1191–1216.

Song, J. J., Deng, W., Lee, H.-J., and Kwon, D. (2008), “Optimal classification for time-course gene expression data using functional data analysis,” Computational Biology and Chemistry, 32, 426–432.

Spellman, P. T., Sherlock, G., Zhang, M. Q., Iyer, V. R., Anders, K., Eisen, M. B., Brown, P. O., Botstein, D., and Futcher, B. (1998), “Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization,” Molecular Biology of the Cell, 9, 3273–3297.

Tian, T. S. and James, G. M. (2013), “Interpretable dimension reduction for classifying functional data,” Computational Statistics & Data Analysis, 57, 282–296.

Tuddenham, R. and Snyder, M. (1954), “Physical growth of California boys and girls from birth to age 18,” University of California Publications in Child Development, 1, 183–364.

Velilla, S. (2008), “A method for dimension reduction in quadratic classification problems,” Journal of Computational and Graphical Statistics, 17, 572–589.

— (2010), “On the structure of the quadratic subspace in discriminant analysis,” Journal of Multivariate Analysis, 101, 1239–1251.

Wang, X., Ray, S., and Mallick, B. K. (2007), “Bayesian curve classification using wavelets,” Journal of the American Statistical Association, 102, 962–973.

Xia, Y., Tong, H., Li, W. K., and Zhu, L.-X. (2002), “An adaptive estimation of dimension reduction space,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64, 363–410.

Yao, F. and Müller, H.-G. (2010), “Empirical dynamics for longitudinal data,” The Annals of Statistics, 38, 3458–3486.

Yao, F., Müller, H.-G., Clifford, A. J., Dueker, S. R., Follett, J., Lin, Y., Buchholz, B. A., and Vogel, J. S. (2003), “Shrinkage estimation for functional principal component scores with application to the population kinetics of plasma folate,” Biometrics, 59, 676–685.

Yao, F., Müller, H.-G., and Wang, J.-L. (2005a), “Functional data analysis for sparse longitudinal data,” Journal of the American Statistical Association, 100, 577–590.

— (2005b), “Functional linear regression analysis for longitudinal data,” The Annals of Statistics, 33, 2873–2903.

Yuan, M. and Cai, T. T. (2010), “A reproducing kernel Hilbert space approach to functional linear regression,” The Annals of Statistics, 38, 3412–3444.

Zhou, L., Huang, J. Z., Martinez, J. G., Maity, A., Baladandayuthapani, V., and Carroll, R. J. (2010), “Reduced rank mixed effects models for spatially correlated hierarchical functional data,” Journal of the American Statistical Association, 105, 390–400.

Zhu, L.-P., Zhu, L.-X., and Feng, Z.-H. (2010), “Dimension reduction in regressions through cumulative slicing estimation,” Journal of the American Statistical Association, 105, 1455–1466.

Zhu, M. and Hastie, T. J. (2003), “Feature extraction for nonparametric discriminant analysis,” Journal of Computational and Graphical Statistics, 12, 101–120.