Short courses on functional data analysis and statistical learning, part 2. CENATAV, Havana, Cuba, September 16th, 2008.
Several nonlinear models and methods for FDA
Nathalie Villa-Vialaneix - [email protected] - http://www.nathalievilla.org
Institut de Mathématiques de Toulouse - IUT de Carcassonne, Université de Perpignan, France
Havana, September 16th, 2008
Table of contents
1 Nonparametric kernel
2 Neural networks
3 References
Nonparametric model in FDA

In this section, X is a random variable taking its values in a semi-normed space (X, ‖·‖_X), where ‖·‖_X denotes a semi-norm (i.e., ‖x‖_X = 0 does not imply x = 0).

In the following presentation, we are interested in the following nonparametric functional model:

Y = Φ(X) + ε

where Y is a real random variable (regression case), X is a functional random variable taking its values in (X, ‖·‖_X), ε is a centered random variable independent of X, and Φ is an unknown operator from X to R. We also suppose that we are given a set of n i.i.d. realizations of the random pair (X, Y):

(x_1, y_1), (x_2, y_2), . . . , (x_n, y_n).

From this training set, we aim at building an estimate, Φ_n, of Φ such that Φ_n converges to the true Φ when n tends to infinity, in a sense that will be developed later.
Nadaraya-Watson kernel estimate [Nadaraya, 1964, Watson, 1964]

Returning to the real case (i.e., X ∈ R), the Nadaraya-Watson kernel estimate is the regression function:

Φ_n : x ∈ R ↦ ( ∑_{i=1}^n y_i K((x − x_i)/h) ) / ( ∑_{i=1}^n K((x − x_i)/h) )

where

- K is the so-called kernel, i.e., K : R → R is a bounded, integrable function. Additionally, K is often nonnegative and null everywhere except on a compact subset of R.
- h is the smoothing parameter: this parameter controls the smoothness of the estimate Φ_n.
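To fix ideas, here is a minimal NumPy sketch of this estimator (the function name, the vectorized layout and the Gaussian default kernel are our illustrative choices, not part of the original material):

```python
import numpy as np

def nw_estimate(x, x_train, y_train, h, kernel=lambda u: np.exp(-u ** 2)):
    """Nadaraya-Watson estimate Phi_n evaluated at the points x (real case)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    u = (x[:, None] - np.asarray(x_train)[None, :]) / h   # (x - x_i) / h
    w = kernel(u)                                         # kernel weights
    return (w * y_train).sum(axis=1) / w.sum(axis=1)      # weighted average of the y_i
```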
Remark on the parameters

Two parameters of the estimator Φ_n have to be set:

1. The kernel K: its choice does not have much influence on the accuracy of Φ_n. Several common choices for K are:
   - the uniform kernel: K : x ∈ R ↦ I_{[−1,1]}(x) (“moving average estimate”),
   - the Epanechnikov kernel: K : x ∈ R ↦ (1 − x²) I_{[−1,1]}(x),
   - the Gaussian kernel: K : x ∈ R ↦ e^{−x²},
   - . . .
2. The smoothing parameter h: it is of main importance to obtain a good approximation; in particular, h depends on n. Several methods have been proposed to choose it, such as cross-validation strategies (a sketch is given below).

[Figure: kernel estimates obtained with h = 1 and h = 0.5.]
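For instance, a minimal sketch of a leave-one-out cross-validation choice of h, under the same illustrative conventions as the previous block:

```python
import numpy as np

def loo_cv_bandwidth(x_train, y_train, h_grid, kernel=lambda u: np.exp(-u ** 2)):
    """Choose h by leave-one-out cross-validation on the training sample."""
    x = np.asarray(x_train, dtype=float)
    y = np.asarray(y_train, dtype=float)
    best_h, best_err = None, np.inf
    for h in h_grid:
        w = kernel((x[:, None] - x[None, :]) / h)
        np.fill_diagonal(w, 0.0)                   # leave x_i out of its own estimate
        pred = (w * y).sum(axis=1) / w.sum(axis=1)
        err = np.mean((y - pred) ** 2)             # leave-one-out quadratic error
        if err < best_err:
            best_h, best_err = h, err
    return best_h
```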
Generalization of the N.W. kernel to FDA [Ferraty and Vieu, 2002, Ferraty and Vieu, 2000]

When X takes its values in the Hilbert space (X, ⟨·,·⟩_X), this estimate can be generalized by

Φ_n : x ∈ X ↦ ∑_{i=1}^n w_i(x) y_i

where

w_i(x) = K(‖x_i − x‖_X / h) / ∑_{k=1}^n K(‖x_k − x‖_X / h)

and K and h are defined as previously (i.e., as in the real case).
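A minimal sketch of the functional version, assuming the curves are observed on a common sampling grid so that the (semi-)norm can be approximated by a Euclidean norm:

```python
import numpy as np

def fnw_estimate(x_new, X_train, y_train, h, kernel=lambda u: np.exp(-u ** 2)):
    """Functional Nadaraya-Watson estimate for one new curve x_new.

    Curves are assumed to be sampled on a common grid of d points, so that
    ||x_i - x||_X is approximated by the Euclidean norm of the sampled values.
    """
    dists = np.linalg.norm(np.asarray(X_train) - np.asarray(x_new), axis=1)
    w = kernel(dists / h)                # K(||x_i - x|| / h)
    return np.dot(w, y_train) / w.sum()  # sum_i w_i(x) y_i
```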
Basic assumption about the fractal dimension

The main assumption for theoretical results on the convergence of the N.W. kernel estimate in the Hilbertian case is:

(Afd1) Fractal dimension assumption (also called “small balls assumption”):

lim_{α→0⁺} P(X ∈ B(x, α)) / α^{a(x)} = c(x)

where a(x) and c(x) are positive real numbers and B(x, α) := {u ∈ X : ‖u − x‖_X ≤ α}.

a(x) is named the fractal dimension of the probability distribution of X.
Assumptions for pointwise convergence

(A1) lim_{n→+∞} h_n = 0 and lim_{n→+∞} n h_n^{a(x)} / log n = +∞

(A2) K is bounded and is null except on a compact subset of R⁺; moreover, K satisfies

∀ t, t′ ∈ R⁺, |K(t) − K(t′)| ≤ |t − t′|

(A3) Y is bounded

(A4) Φ is continuous at x ∈ X
Pointwise convergence

Theorem [Ferraty and Vieu, 2000, Ferraty and Vieu, 2002]
Under assumptions (A1)-(A4) and assumption (Afd1),

lim_{n→+∞} Φ_n(x) = Φ(x).
Assumptions for optimal rate of pointwise convergence

(Afd2) P(X ∈ B(x, α)) = α^{a(x)} c(x) + O(α^{a(x)+b(x)})

(A4′) There exist B > 0, C > 0 and β > 0 such that, for all u and v in B(x, B),

|Φ(u) − Φ(v)| ≤ C ‖u − v‖_X^β

(A1′) h_n = h (log n / n)^{1/(2γ(x)+a(x))}, where γ(x) = min{b(x), β}, and lim_{n→+∞} log n / (n h_n^{a(x)}) = 0
Rate of convergence for pointwise convergence

Theorem [Ferraty and Vieu, 2000, Ferraty and Vieu, 2002]
Under assumptions (A1′), (A2), (A3), (A4′) and assumption (Afd2),

Φ_n(x) − Φ(x) = O( (log n / n)^{γ(x)/(2γ(x)+a(x))} ).
Assumptions for uniform convergence on a compact subset C of X

(Afd3) lim_{α→0⁺} sup_{x∈C} | P(X ∈ B(x, α)) / α^{a(x)} − c(x) | = 0, where inf_{x∈C} c(x) > 0

(A5) The covering number of C, N(C, ℓ) (i.e., the minimum number of balls of radius ℓ that are needed to cover C), is such that there exist τ > 0 and C₀ > 0 with N(C, ℓ) = C₀ ℓ^{−τ}

(A4″) Φ is continuous on C
Uniform convergence

Theorem [Ferraty and Vieu, 2004, Ferraty and Vieu, 2008]
Under assumptions (A1)-(A3), (A4″), (A5) and assumption (Afd3),

Φ_n(x) − Φ(x) = O( (log n / n)^{γ(x)/(2γ(x)+a(x))} ).
Note on possible choices for the semi-norm

1. PCA semi-norm: suppose that X is a square integrable random variable of L² (i.e., E(‖X‖²_{L²}) < +∞). Then:
   - the covariance operator Γ_X can be written as ∑_{k≥1} λ_k v_k ⊗ v_k, where ((λ_k), (v_k))_k is the eigensystem of Γ_X, the (λ_k) are in decreasing order and the (v_k) are orthonormal vectors of L²;
   - this defines a semi-norm on X = L²: for a given K ∈ N*,

     ∀ x ∈ X, ‖x‖²_X = ∑_{k=1}^K ⟨v_k, x⟩²_{L²} = ‖P_{Span{v_1,...,v_K}}(x)‖²_{L²}.

   This semi-norm emphasizes the main directions for the representation of the random variable X (a sketch of its computation is given below).

2. q-th derivative semi-norm: suppose now that X = {h ∈ L² : h^{(q)} exists and is in L²}. Then,

   ∀ x ∈ X, ‖x‖²_X = ‖x^{(q)}‖²_{L²}.

   This semi-norm is strongly related to RKHS (Sobolev spaces) and splines and will be further investigated in Presentation 4.
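A minimal sketch of the PCA semi-norm computed from sampled curves (the L² inner product is approximated by the Euclidean one on the grid; names are ours):

```python
import numpy as np

def pca_seminorm(X_curves, K):
    """Empirical PCA semi-norm ||x||_X^2 = sum_{k<=K} <v_k, x>^2 of sampled curves.

    X_curves: (n, d) array of curves on a common grid; the eigenvectors v_k are
    those of the empirical covariance operator.
    """
    Xc = X_curves - X_curves.mean(axis=0)   # centered curves
    cov = Xc.T @ Xc / len(X_curves)         # empirical covariance Gamma_X
    _, eigvec = np.linalg.eigh(cov)         # eigenvalues in increasing order
    V = eigvec[:, ::-1][:, :K]              # K leading eigenvectors v_1, ..., v_K
    scores = X_curves @ V                   # <v_k, x> for each curve
    return (scores ** 2).sum(axis=1)        # squared semi-norm of each curve
```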
Generalization to curve classification [Ferraty and Vieu, 2003, Ferraty and Vieu, 2004]

Suppose that X ∈ X but also that Y ∈ {1, 2, . . . , G}. Then, a classification rule is given by:

∀ x ∈ X, g(x) = arg max_{g=1,...,G} P(Y = g | X = x).

This rule needs an estimate of the probability P(Y = g | X = x):

P_n(Y = g | X = x) := ∑_{i=1}^n w_i(x) I_{[y_i = g]}

where w_i(x) = K(‖x − x_i‖_X / h) / ∑_{l=1}^n K(‖x − x_l‖_X / h).
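A minimal sketch of this classification rule on sampled curves, under the same conventions as the previous blocks:

```python
import numpy as np

def fnw_classify(x_new, X_train, y_labels, h, G, kernel=lambda u: np.exp(-u ** 2)):
    """Assign the class g maximizing the kernel estimate of P(Y = g | X = x)."""
    labels = np.asarray(y_labels)
    dists = np.linalg.norm(np.asarray(X_train) - np.asarray(x_new), axis=1)
    w = kernel(dists / h)
    w = w / w.sum()                                   # weights w_i(x)
    probs = np.array([w[labels == g].sum() for g in range(1, G + 1)])
    return int(probs.argmax()) + 1                    # classes are numbered 1, ..., G
```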
Example of nonparametric curve classification with the PCA semi-norm

Problem: discriminating 5 phonemes from their log-periodograms.

Competitor methods:
- Ridge PDA (i.e., Principal Discriminant Analysis penalized by the ridge norm);
- Partial Least Squares with the L² norm (denoted by MPLSR);
- Partial Least Squares with the PCA semi-norm (denoted by NPCD/MPLSR);
- nonparametric kernel estimator with the PCA semi-norm (denoted by NPCD/PCA).
Obtained result

[Figure: results obtained by the four competitor methods.]
Generalization to time series [Ferraty and Vieu, 2004]

Problem and notations: given a time series (Z(t))_{t∈R}, one is often interested, knowing {Z(t), t ∈ [T_max − T, T_max]}, in predicting Z(T_max + τ).

Denoting

X = {Z(t), t ∈ [T_max − T, T_max]},    Y = Z(T_max + τ),

we can see that this problem is strongly related to a functional regression model. The observations are given by

x_i = {z(t), t ∈ [(i − 1)T, iT]},    y_i = z(iT + τ)

for i = 1, . . . , n, but they are not independent (a sketch of this construction is given below).
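A minimal sketch of this construction for a series sampled at integer times (indexing conventions are ours):

```python
import numpy as np

def series_to_samples(z, T, tau):
    """Cut a sampled series z into functional inputs x_i and targets y_i.

    z: 1-D array of the observed values of the series; T: window length in
    sampling steps; tau: prediction horizon. Consecutive samples come from
    the same trajectory, hence they are not independent.
    """
    n = (len(z) - tau) // T
    X = np.stack([z[i * T:(i + 1) * T] for i in range(n)])      # the curve x_{i+1}
    y = np.array([z[(i + 1) * T + tau - 1] for i in range(n)])  # the target z(iT + tau)
    return X, y
```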
Mixing assumptions

Under mixing assumptions on ((x_i, y_i))_i and other assumptions, the same convergence results hold. More precisely, suppose that, for

α(n) = sup_{k∈Z, A∈σ^k_{−∞}, B∈σ^{+∞}_{k+n}} | P(A ∩ B) − P(A) P(B) |,

where σ^l_k = σ({(x_i, y_i) : k ≤ i ≤ l}), we have

lim_{n→+∞} α(n) = 0.

α is named the mixing coefficient.
Table of contents
1 Nonparametric kernel
2 Neural networks
3 References
Biological neural networks

[Figure: schematic of a biological neuron: the incoming signals are summed (∑); if the sum exceeds the activation threshold, the neuron fires, otherwise it remains inactive.]
Mathematical multilayer perceptrons: multidimensional case

[Figure: diagram of a perceptron. The inputs (Variable 1, Variable 2, . . . , Variable d; with example values 2, 1.5, . . . , 1) are multiplied by weights (e.g., 0.5, −1, 0.2) and summed (here ∑ = 4.4); the sum is passed through an activation function G; the hidden units are then combined by a second weighted sum to produce the outputs.]
In summary, multilayer perceptrons with 1 hidden layer are a class of functions of the form:

φ^p_w : x ∈ R^d ↦ ∑_{k=1}^p w^{(2)}_k G( w^{(0)}_k + (w^{(1)}_k)^T x )

where

- G is given and called the activation function; popular examples are the linear and the sigmoid activation functions;
- p is also given and called the number of neurons on the hidden layer; p is a main parameter to obtain a good generalization ability;
- for all k = 1, . . . , p, w^{(2)}_k ∈ R, w^{(0)}_k ∈ R and w^{(1)}_k ∈ R^d are the weights, which have to be set from the learning data set.

More precisely, given ((x_i, y_i))_{i=1,...,n}, n i.i.d. realizations of the random pair (X, Y) taking its values in R^d × R, an estimate of Φ is given by Φ_n = φ_{w*_n}, where w*_n is a solution of:

w*_n = arg min_{w∈(R×R×R^d)^p} ∑_{i=1}^n (y_i − φ_w(x_i))².

A didactic sketch of this fitting procedure is given below.
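The sketch uses tanh as a sigmoid-type activation and plain gradient descent on the least-squares criterion; in practice the minimization is done by backpropagation with a more elaborate optimizer, and all names here are ours:

```python
import numpy as np

def mlp(X, W1, b, w2):
    """One-hidden-layer perceptron sum_k w2_k G(b_k + W1_k . x), with G = tanh."""
    return np.tanh(X @ W1.T + b) @ w2

def fit_mlp(X, y, p=5, lr=0.05, epochs=5000, seed=0):
    """Least-squares fit of the weights by plain gradient descent (didactic sketch)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(scale=0.5, size=(p, d))
    b = np.zeros(p)
    w2 = rng.normal(scale=0.5, size=p)
    for _ in range(epochs):
        H = np.tanh(X @ W1.T + b)              # hidden layer activations
        r = H @ w2 - y                         # residuals phi_w(x_i) - y_i
        g = np.outer(r, w2) * (1.0 - H ** 2)   # gradient through tanh
        W1 -= lr * g.T @ X / n
        b -= lr * g.mean(axis=0)
        w2 -= lr * H.T @ r / n
    return W1, b, w2
```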
Universal approximation

The popularity of MLP in the multidimensional context comes from two main properties. The first one is called the universal approximation capability:

Theorem [Hornik, 1991, Hornik, 1993, Stinchcombe, 1999]
If the activation function is continuous and non polynomial, then

{φ^p_w : w ∈ (R × R × R^d)^p, p ∈ N*}

is dense in the set of continuous functions from R^d to R for the topology induced by the uniform norm on compact sets.
Consistency to optimal weights

The second property deals with the fact that the optimal empirical weights tend to the optimal weights when the size of the training set tends to infinity.

Theorem [White, 1989, White, 1990]
Denote w* = arg min_{w∈(R×R×R^d)^p} E((Y − φ^p_w(X))²) and

Θ(p, ∆) = {φ^p_w : ∑_{k=1}^p |w^{(2)}_k| ≤ ∆ and ∑_{k=1}^p (|w^{(0)}_k| + ∑_{l=1}^d |w^{(1)}_{kl}|) ≤ ∆p}.

If

1. the activation function G is bounded;
2. lim_{n→+∞} p_n = +∞, lim_{n→+∞} ∆_n = +∞, ∆_n = o(n) and p_n ∆_n⁴ log(p_n ∆_n) = o(n);
3. w* exists and is unique;
4. w*_n = arg min_{w : φ_w∈Θ(p_n,∆_n)} ∑_{i=1}^n (y_i − φ_w(x_i))²;

then lim_{n→+∞} ‖w*_n − w*‖ = 0.
Multilayer perceptrons with functional input

Suppose now that (X, Y) is a random pair taking its values in X × R, where (X, ⟨·,·⟩_X) is a Hilbert space. Multilayer perceptrons with 1 hidden layer generalize to functional inputs by:

φ^p_w : x ∈ X ↦ ∑_{k=1}^p w^{(2)}_k G( w^{(0)}_k + ⟨x, w^{(1)}_k⟩_X )

where the weights w^{(1)}_k ∈ X (functional weights).

With relevant representations of the weights (w^{(1)}_k)_k (rich enough representations), the universal approximation property of this model remains valid.
Practical implementation: discrete sampling case

Suppose now that the (x_i)_i are known on a discrete sampling grid t_1, t_2, . . . , t_d. Then ⟨w^{(1)}_k, x_i⟩_X can be approximated by

⟨w^{(1)}_k, x_i⟩_X ≃ (1/d) ∑_{l=1}^d x_i(t_l) w^{(1)}_k(t_l).

w^{(1)}_k should be searched for in a class of functions F such that, for any set of real numbers (c_l)_{l=1,...,d}, there exists a function w ∈ F for which

∀ l = 1, . . . , d, w(t_l) = c_l.

Splines have such a property (see Presentation 4).
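A minimal sketch of the corresponding forward pass, with the inner products replaced by the grid approximation above (tanh again stands for G; names are ours):

```python
import numpy as np

def functional_mlp(x_grid, W1_grid, b, w2):
    """Forward pass of a functional MLP on a discrete sampling grid.

    x_grid: (d,) values x(t_1), ..., x(t_d) of the input curve;
    W1_grid: (p, d) values w_k^{(1)}(t_l) of the functional weights.
    The inner products <w_k^{(1)}, x>_X are approximated by
    (1/d) sum_l x(t_l) w_k^{(1)}(t_l), as above.
    """
    d = len(x_grid)
    inner = W1_grid @ x_grid / d     # approximate inner products, one per neuron
    return np.tanh(b + inner) @ w2   # tanh plays the role of G
```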
Practical implementation: projection approach

Suppose now that X admits a family of functions (ψ^q_k)_{q∈N*, 1≤k≤q} such that:

- for all q ∈ N*, (ψ^q_k)_k is an orthonormal system;
- if P_q denotes the projection in X on Span{ψ^q_1, . . . , ψ^q_q}, then the pointwise consistency property is satisfied:

  ∀ x ∈ X, ∀ ε > 0, ∃ Q ∈ N* : ∀ q ≥ Q, ‖P_q(x) − x‖_∞ ≤ ε.

Hence, ⟨w^{(1)}_k, x_i⟩_X can be approximated by

⟨w^{(1)}_k, x_i⟩_X ≃ ⟨w^{(1)}_k, P_q(x_i)⟩_X = ⟨P_q(w^{(1)}_k), x_i⟩_X = ⟨P_q(w^{(1)}_k), P_q(x_i)⟩_X.
Universal approximation for the projection based approach

Theorem [Rossi and Conan-Guez, 2005, Rossi and Conan-Guez, 2006]
If G is continuous and non polynomial, then the set

{ x ∈ X ↦ ∑_{k=1}^p w^{(2)}_k G( w^{(0)}_k + ∑_{l=1}^q β_{kl} (P_q(x))_l ) }

is dense in the set of all continuous functions defined on X for the uniform norm on any compact subset of X.

Remark 1: A similar result exists for the pointwise consistency approach.
Remark 2: A convergence result for the optimal weights of the functional MLP also exists but requires many technical assumptions, so we do not detail it here. See [Rossi and Conan-Guez, 2005] for more details.
Functional Inverse Regression

Suppose that we are given a random pair (X, Y) taking its values in X × R, for which we are given n i.i.d. realizations (x_1, y_1), . . . , (x_n, y_n).

Model: Moreover, we suppose that (X, Y) satisfies the following model:

Y = Ψ(⟨X, a_1⟩_X, . . . , ⟨X, a_q⟩_X, ε)

where ε is a centered real random variable independent of X, Ψ is an unknown function that has to be estimated, and {a_1, . . . , a_q} are unknown, linearly independent elements of X that also have to be estimated.

We call the space Span{a_1, . . . , a_q} the Effective Dimension Reduction subspace of X, denoted by EDR.
Fundamental property of the EDR space

Denote A = (⟨X, a_1⟩_X, . . . , ⟨X, a_q⟩_X)^T.

Li’s condition [Li, 1991]
If

(A-Li) ∀ u ∈ X, ∃ v ∈ R^q : E(⟨u, X⟩_X | A) = v^T A,

then E(X | Y) ∈ Γ_X(EDR).

⇒ The EDR space is thus estimated through the estimation of a_1, . . . , a_q, the eigenvectors of Γ_X^{−1} Γ_{E(X|Y)}. But, as for the functional linear model, Γ_X^{−1} has to be estimated by a penalized or a regularized approach.
PCA approach [Ferré and Yao, 2003, Ferré and Yao, 2005]

Note:

- ((λ^n_i, v^n_i))_{i≥1} the eigenvalue decomposition of Γ^n_X (the (λ^n_i)_i are ordered in decreasing order and at most n eigenvalues are not null; the (v^n_i)_i are orthonormal);
- k_n an integer such that k_n ≤ n and lim_{n→+∞} k_n = +∞;
- P_{k_n} the projector P_{k_n}(u) = ∑_{i=1}^{k_n} ⟨v^n_i, u⟩_X v^n_i;
- Γ^{n,k_n}_X = P_{k_n} ∘ Γ^n_X ∘ P_{k_n} = ∑_{i=1}^{k_n} λ^n_i ⟨v^n_i, ·⟩_X v^n_i;
- if (I_s)_{s=1,...,S} is a partition of the subset of R where Y takes its values, then we estimate:
  - P(Y ∈ I_s) by p^n_s = (1/n) ∑_{i=1}^n I_{{y_i∈I_s}},
  - E(X | Y ∈ I_s) by μ^n_s = (1/n) ∑_{i=1}^n x_i I_{{y_i∈I_s}},
  - Γ_{E(X|Y)} by Γ^n_{E(X|Y)} = ∑_{s=1}^S p^n_s μ^n_s ⊗ μ^n_s;
- ((α^{k_n}_k, a^n_k))_{k=1,...,q} the eigensystem of (Γ^{n,k_n}_X)^+ Γ^n_{E(X|Y)}.

A sketch of this pipeline on sampled curves is given below.
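The sketch below follows the steps above on curves sampled on a grid; here the slice means μ_s are computed as within-slice averages, as in standard SIR, and all names are ours:

```python
import numpy as np

def functional_sir(X, y, k, q, S=10):
    """Sketch of the PCA-based functional SIR estimate of the EDR directions.

    X: (n, d) sampled curves; y: (n,) real responses; k: number of PCA
    components kept (k_n); q: number of EDR directions; S: number of slices
    built from the empirical quantiles of y.
    """
    n, d = X.shape
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / n                               # Gamma_X^n on the grid
    eigval, eigvec = np.linalg.eigh(cov)
    V, lam = eigvec[:, ::-1][:, :k], eigval[::-1][:k]  # k leading eigenpairs
    edges = np.quantile(y, np.linspace(0.0, 1.0, S + 1))
    slice_idx = np.clip(np.searchsorted(edges, y, side="right") - 1, 0, S - 1)
    cov_e = np.zeros((d, d))
    for s in range(S):
        mask = slice_idx == s
        if mask.any():
            mu = Xc[mask].mean(axis=0)                 # slice mean of X
            cov_e += mask.mean() * np.outer(mu, mu)    # p_s * mu_s (x) mu_s
    pinv = V @ np.diag(1.0 / lam) @ V.T                # (Gamma_X^{n,k_n})^+
    vals, vecs = np.linalg.eig(pinv @ cov_e)
    order = np.argsort(-vals.real)
    return vecs[:, order[:q]].real                     # estimated a_1, ..., a_q
```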
Assumptions for convergence of the estimated EDR space

(A1) X has a 4th moment;

(A2) the (λ_k)_{k≥1} are all distinct and positive;

(A3) if we note b_1 = 2√2 / (λ_1 − λ_2) and b_j = 2√2 / min(λ_{j−1} − λ_j, λ_j − λ_{j+1}) for j ≥ 2, then

lim_{n→+∞} (1 / (√n λ_{k_n})) ∑_{j=1}^{k_n} b_j = 0  and  lim_{n→+∞} 1 / (√n λ²_{k_n}) = 0;

(A4) the eigenvalues of Γ_X^{−1} Γ_{E(X|Y)} are all distinct.
Convergence of the estimated EDR space

Theorem [Ferré and Yao, 2003, Ferré and Yao, 2005]
Under assumption (A-Li) and assumptions (A1)-(A4),

‖a^n_k − a_k‖_X → 0 in probability as n → +∞.

Remark: A similar result for a smoothing approach is described in [Ferré and Villa, 2005, Ferré and Villa, 2006].
FIR multilayer perceptrons

The idea of this model, developed in [Ferré and Villa, 2006], is to estimate the function Ψ by a multilayer perceptron.

[Figure: an MLP whose inputs are the projections ⟨X, a_1⟩_X, ⟨X, a_2⟩_X, . . . , ⟨X, a_q⟩_X; the hidden units compute weighted sums (weights w^{(1)}, plus a bias) followed by the activation, and a final weighted sum (weights w^{(2)}, bias w^{(0)}) produces the output Y.]
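As a usage illustration, the whole FIR-MLP pipeline can be sketched by chaining the hypothetical functional_sir, fit_mlp and mlp functions from the earlier code blocks:

```python
# Hypothetical FIR-MLP pipeline reusing functional_sir, fit_mlp and mlp from
# the sketches above; X_train, X_test are (n, d) sampled curves.
A = functional_sir(X_train, y_train, k=20, q=3)  # EDR directions a_1, ..., a_q
U_train = X_train @ A / X_train.shape[1]         # projections <x_i, a_j>_X
W1, b, w2 = fit_mlp(U_train, y_train, p=5)       # MLP on the projected data
y_pred = mlp(X_test @ A / X_test.shape[1], W1, b, w2)
```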
Notations

- Outputs: ∀ u ∈ R^q,

  φ_w(u) = ∑_{k=1}^p w^{(2)}_k G( (w^{(1)}_k)^T u + w^{(0)}_k );

- for a given error function L (e.g., the mean square error), ζ((u, y), w) = L(φ_w(u), y);
- variables: Z = ((⟨X, a_j⟩_X)_{j=1,...,q}, Y) and z^n_i = ((⟨x_i, a_j⟩_X)_{j=1,...,q}, y_i); Z and the (z^n_i)_i take their values in an open subset, O, of R^{q+1};
- the weights are chosen in a compact subset, W, of (R × R^q × R)^p:
  - theoretical: w* = arg min_{w∈W} E(ζ(Z, w)); W* is the set of all such w*;
  - empirical: w*_n = arg min_{w∈W} ∑_{i=1}^n ζ(z^n_i, w).
Assumptions for convergence of the optimal empirical weights

(A1) ∀ z ∈ O, ζ(z, ·) is continuous;

(A2) ∀ w ∈ W, ζ(·, w) is measurable;

(A3) there exists a measurable function ζ̃ from O to R such that, ∀ z ∈ O, ∀ w ∈ W, |ζ(z, w)| < ζ̃(z) and E(ζ̃(Z)) < +∞;

(A4) ∀ w ∈ W, ∃ C(w) > 0 such that, for all (x, y) and (x′, y) in O, |ζ((x, y), w) − ζ((x′, y), w)| ≤ C(w) ‖x − x′‖_{R^q}.
Convergence of the optimal empirical weights

Theorem [Ferré and Villa, 2006]
Under assumption (A-Li), assumptions ensuring the convergence of the estimated EDR space, and assumptions (A1)-(A4),

d(w*_n, W*) → 0 in probability as n → +∞,

where d is defined by d(w, W) = inf_{w̃∈W} ‖w − w̃‖_{R^{(q+2)p}}.
Application: Tecator Data Set

Aim: predicting the fat content of pieces of meat from their infrared spectra.

Comparison of:

- PCA and then MLP [Thodberg, 1996];
- functional MLP (with a discrete sampling based approach) [Rossi and Conan-Guez, 2005];
- smoothing FIR and then MLP [Ferré and Villa, 2006];
- projection based FIR and then MLP [Ferré and Yao, 2003, Ferré and Villa, 2006];
- smoothing FIR and then a linear model.

Methodology: repetition of 50 experiments with random training/test sets.
Results of the experiments

[Figure: boxplots (scale 1 to 5) of the test errors of the compared methods: PCA-MLP, NNf, SIRr-MLP, SIRp-MLP, SIRr-ML.]

Remark: A similar experiment has been made on the phoneme dataset to compare “FIR-MLP” to the functional nonparametric kernel.
Table of contents
1 Nonparametric kernel
2 Neural networks
3 References
References
Further details for the references are given in the joint document.
Ferraty, F. and Vieu, P. (2000). Dimension fractale et estimation de la régression dans des espaces vectoriels semi-normés. Comptes Rendus Mathématique. Académie des Sciences. Paris, 330:139–142.

Ferraty, F. and Vieu, P. (2002). The functional nonparametric model and application to spectrometric data. Computational Statistics, 17:515–561.

Ferraty, F. and Vieu, P. (2003). Curves discrimination: a nonparametric approach. Computational Statistics and Data Analysis, 44:161–173.

Ferraty, F. and Vieu, P. (2004). Nonparametric models for functional data, with application in regression, time series prediction and curves discrimination. Journal of Nonparametric Statistics, 16:111–125.

Ferraty, F. and Vieu, P. (2008). Erratum of: ‘Non-parametric models for functional data, with application in regression, time-series prediction and curve discrimination’. Journal of Nonparametric Statistics, 20(2):187–189.

Ferré, L. and Villa, N. (2005). Discrimination de courbes par régression inverse fonctionnelle. Revue de Statistique Appliquée, LIII(1):39–57.

Ferré, L. and Villa, N. (2006). Multi-layer perceptron with functional inputs: an inverse regression approach. Scandinavian Journal of Statistics, 33(4):807–823.

Ferré, L. and Yao, A. (2003). Functional sliced inverse regression analysis. Statistics, 37(6):475–488.

Ferré, L. and Yao, A. (2005). Smoothed functional inverse regression. Statistica Sinica, 15(3):665–683.

Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257.

Hornik, K. (1993). Some new results on neural network approximation. Neural Networks, 6(8):1069–1072.

Li, K. (1991). Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86:316–342.

Nadaraya, E. (1964). On estimating regression. Theory of Probability and its Applications, 10:186–196.

Rossi, F. and Conan-Guez, B. (2005). Functional multi-layer perceptron: a nonlinear tool for functional data analysis. Neural Networks, 18(1):45–60.

Rossi, F. and Conan-Guez, B. (2006). Theoretical properties of projection based multilayer perceptrons with functional inputs. Neural Processing Letters, 23(1):55–70.

Stinchcombe, M. (1999). Neural network approximation of continuous functionals and continuous functions on compactifications. Neural Networks, 12(3):467–477.

Thodberg, H. (1996). A review of Bayesian neural networks with an application to near infrared spectroscopy. IEEE Transactions on Neural Networks, 7(1):56–72.

Watson, G. (1964). Smooth regression analysis. Sankhyā, Series A, 26:359–372.

White, H. (1990). Connectionist nonparametric regression: multilayer feedforward networks can learn arbitrary mappings. Neural Networks, 3:535–549.

White, H. (1989). Learning in artificial neural networks: a statistical perspective. Neural Computation, 1:425–464.