Short courses on functional data analysis and statistical learning, part 2. CENATAV, Havana, Cuba, September 16th, 2008.
Several nonlinear models and methods for FDA
Nathalie Villa-Vialaneix - [email protected] - http://www.nathalievilla.org
Institut de Mathématiques de Toulouse - IUT de Carcassonne, Université de Perpignan, France
Havana, September 16th, 2008
Table of contents
1 Nonparametric kernel
2 Neural networks
3 References
Nonparametric model in FDA

In this section, X is a random variable taking its values in a semi-normed space (X, ‖·‖_X), where ‖·‖_X denotes a semi-norm (i.e., ‖x‖_X = 0 does not imply x = 0).

In the following presentation, we are interested in the following nonparametric functional model:

Y = Φ(X) + ε

where Y is a real random variable (regression case), X is a functional random variable taking its values in (X, ‖·‖_X), ε is a centered random variable independent of X, and Φ is an unknown operator from X to R. We also suppose that we are given a set of n i.i.d. realizations of the random pair (X, Y):

(x_1, y_1), (x_2, y_2), . . . , (x_n, y_n).

From this training set, we aim at building an estimate, Φ_n, of Φ such that Φ_n converges to the true Φ when n tends to infinity, in a sense that will be developed later.
Nadaraya-Watson kernel estimate [Nadaraya, 1964, Watson, 1964]

Returning to the real case (i.e., X ∈ R), the Nadaraya-Watson kernel estimate is the regression function:

Φ_n : x ∈ R ↦ ( ∑_{i=1}^n y_i K((x − x_i)/h) ) / ( ∑_{i=1}^n K((x − x_i)/h) )

where

- K is the so-called kernel, i.e., K : R → R is a bounded, integrable function. Additionally, K is often nonnegative and null everywhere except on a compact subset of R.
- h is the smoothing parameter: this parameter controls the smoothness of the estimate Φ_n.
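To fix ideas, here is a minimal NumPy sketch of this estimator (the function name, the vectorized layout and the Gaussian default kernel are our illustrative choices, not part of the original material):

```python
import numpy as np

def nw_estimate(x, x_train, y_train, h, kernel=lambda u: np.exp(-u ** 2)):
    """Nadaraya-Watson estimate Phi_n evaluated at the points x (real case)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    u = (x[:, None] - np.asarray(x_train)[None, :]) / h   # (x - x_i) / h
    w = kernel(u)                                         # kernel weights
    return (w * y_train).sum(axis=1) / w.sum(axis=1)      # weighted average of the y_i
```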
Remark on the parameters

Two parameters of the estimator Φ_n have to be set:

1. The kernel K: its choice does not have much influence on the accuracy of Φ_n. Several common choices for K are:
   - the uniform kernel: K : x ∈ R ↦ I_{[−1,1]}(x) (“moving average estimate”),
   - the Epanechnikov kernel: K : x ∈ R ↦ (1 − x²) I_{[−1,1]}(x),
   - the Gaussian kernel: K : x ∈ R ↦ e^{−x²},
   - . . .
2. The smoothing parameter h: it is of main importance to obtain a good approximation; in particular, h depends on n. Several methods have been proposed to choose it, such as cross-validation strategies (a sketch is given below).

[Figure: kernel estimates obtained with h = 1 and h = 0.5.]
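For instance, a minimal sketch of a leave-one-out cross-validation choice of h, under the same illustrative conventions as the previous block:

```python
import numpy as np

def loo_cv_bandwidth(x_train, y_train, h_grid, kernel=lambda u: np.exp(-u ** 2)):
    """Choose h by leave-one-out cross-validation on the training sample."""
    x = np.asarray(x_train, dtype=float)
    y = np.asarray(y_train, dtype=float)
    best_h, best_err = None, np.inf
    for h in h_grid:
        w = kernel((x[:, None] - x[None, :]) / h)
        np.fill_diagonal(w, 0.0)                   # leave x_i out of its own estimate
        pred = (w * y).sum(axis=1) / w.sum(axis=1)
        err = np.mean((y - pred) ** 2)             # leave-one-out quadratic error
        if err < best_err:
            best_h, best_err = h, err
    return best_h
```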
Generalization of the N.W. kernel to FDA [Ferraty and Vieu, 2002, Ferraty and Vieu, 2000]

When X takes its values in the Hilbert space (X, ⟨·,·⟩_X), this estimate can be generalized by

Φ_n : x ∈ X ↦ ∑_{i=1}^n w_i(x) y_i

where

w_i(x) = K(‖x_i − x‖_X / h) / ∑_{k=1}^n K(‖x_k − x‖_X / h)

and K and h are defined as previously (i.e., as in the real case).
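A minimal sketch of the functional version, assuming the curves are observed on a common sampling grid so that the (semi-)norm can be approximated by a Euclidean norm:

```python
import numpy as np

def fnw_estimate(x_new, X_train, y_train, h, kernel=lambda u: np.exp(-u ** 2)):
    """Functional Nadaraya-Watson estimate for one new curve x_new.

    Curves are assumed to be sampled on a common grid of d points, so that
    ||x_i - x||_X is approximated by the Euclidean norm of the sampled values.
    """
    dists = np.linalg.norm(np.asarray(X_train) - np.asarray(x_new), axis=1)
    w = kernel(dists / h)                # K(||x_i - x|| / h)
    return np.dot(w, y_train) / w.sum()  # sum_i w_i(x) y_i
```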
Basic assumption about the fractal dimension

The main assumption for theoretical results on the convergence of the N.W. kernel estimate in the Hilbertian case is:

(Afd1) Fractal dimension assumption (also called “small balls assumption”):

lim_{α→0⁺} P(X ∈ B(x, α)) / α^{a(x)} = c(x)

where a(x) and c(x) are positive real numbers and B(x, α) := {u ∈ X : ‖u − x‖_X ≤ α}.

a(x) is named the fractal dimension of the probability distribution of X.
Assumptions for pointwise convergence

(A1) lim_{n→+∞} h_n = 0 and lim_{n→+∞} n h_n^{a(x)} / log n = +∞

(A2) K is bounded and is null except on a compact subset of R⁺; moreover, K satisfies

∀ t, t′ ∈ R⁺, |K(t) − K(t′)| ≤ |t − t′|

(A3) Y is bounded

(A4) Φ is continuous at x ∈ X
Pointwise convergence

Theorem [Ferraty and Vieu, 2000, Ferraty and Vieu, 2002]
Under assumptions (A1)-(A4) and assumption (Afd1),

lim_{n→+∞} Φ_n(x) = Φ(x).
Assumptions for optimal rate of pointwise convergence

(Afd2) P(X ∈ B(x, α)) = α^{a(x)} c(x) + O(α^{a(x)+b(x)})

(A4′) There exist B > 0, C > 0 and β > 0 such that, for all u and v in B(x, B),

|Φ(u) − Φ(v)| ≤ C ‖u − v‖_X^β

(A1′) h_n = h (log n / n)^{1/(2γ(x)+a(x))}, where γ(x) = min{b(x), β}, and lim_{n→+∞} log n / (n h_n^{a(x)}) = 0
Rate of convergence for pointwise convergence

Theorem [Ferraty and Vieu, 2000, Ferraty and Vieu, 2002]
Under assumptions (A1′), (A2), (A3), (A4′) and assumption (Afd2),

Φ_n(x) − Φ(x) = O( (log n / n)^{γ(x)/(2γ(x)+a(x))} ).
Assumptions for uniform convergence on a compact subset C of X

(Afd3) lim_{α→0⁺} sup_{x∈C} | P(X ∈ B(x, α)) / α^{a(x)} − c(x) | = 0, where inf_{x∈C} c(x) > 0

(A5) The covering number of C, N(C, ℓ) (i.e., the minimum number of balls of radius ℓ that are needed to cover C), is such that there exist τ > 0 and C₀ > 0 with N(C, ℓ) = C₀ ℓ^{−τ}

(A4″) Φ is continuous on C
Uniform convergence

Theorem [Ferraty and Vieu, 2004, Ferraty and Vieu, 2008]
Under assumptions (A1)-(A3), (A4″), (A5) and assumption (Afd3),

Φ_n(x) − Φ(x) = O( (log n / n)^{γ(x)/(2γ(x)+a(x))} ).
Note on possible choices for the semi-norm

1. PCA semi-norm: suppose that X is a square integrable random variable of L² (i.e., E(‖X‖²_{L²}) < +∞). Then:
   - the covariance operator Γ_X can be written as ∑_{k≥1} λ_k v_k ⊗ v_k, where ((λ_k), (v_k))_k is the eigensystem of Γ_X, the (λ_k) are in decreasing order and the (v_k) are orthonormal vectors of L²;
   - this defines a semi-norm on X = L²: for a given K ∈ N*,

     ∀ x ∈ X, ‖x‖²_X = ∑_{k=1}^K ⟨v_k, x⟩²_{L²} = ‖P_{Span{v_1,...,v_K}}(x)‖²_{L²}.

   This semi-norm emphasizes the main directions for the representation of the random variable X (a sketch of its computation is given below).

2. q-th derivative semi-norm: suppose now that X = {h ∈ L² : h^{(q)} exists and is in L²}. Then,

   ∀ x ∈ X, ‖x‖²_X = ‖x^{(q)}‖²_{L²}.

   This semi-norm is strongly related to RKHS (Sobolev spaces) and splines and will be further investigated in Presentation 4.
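A minimal sketch of the PCA semi-norm computed from sampled curves (the L² inner product is approximated by the Euclidean one on the grid; names are ours):

```python
import numpy as np

def pca_seminorm(X_curves, K):
    """Empirical PCA semi-norm ||x||_X^2 = sum_{k<=K} <v_k, x>^2 of sampled curves.

    X_curves: (n, d) array of curves on a common grid; the eigenvectors v_k are
    those of the empirical covariance operator.
    """
    Xc = X_curves - X_curves.mean(axis=0)   # centered curves
    cov = Xc.T @ Xc / len(X_curves)         # empirical covariance Gamma_X
    _, eigvec = np.linalg.eigh(cov)         # eigenvalues in increasing order
    V = eigvec[:, ::-1][:, :K]              # K leading eigenvectors v_1, ..., v_K
    scores = X_curves @ V                   # <v_k, x> for each curve
    return (scores ** 2).sum(axis=1)        # squared semi-norm of each curve
```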
Generalization to curve classification [Ferraty and Vieu, 2003, Ferraty and Vieu, 2004]

Suppose that X ∈ X but also that Y ∈ {1, 2, . . . , G}. Then, a classification rule is given by:

∀ x ∈ X, g(x) = arg max_{g=1,...,G} P(Y = g | X = x).

This rule needs an estimate of the probability P(Y = g | X = x):

P_n(Y = g | X = x) := ∑_{i=1}^n w_i(x) I_{[y_i = g]}

where w_i(x) = K(‖x − x_i‖_X / h) / ∑_{l=1}^n K(‖x − x_l‖_X / h).
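A minimal sketch of this classification rule on sampled curves, under the same conventions as the previous blocks:

```python
import numpy as np

def fnw_classify(x_new, X_train, y_labels, h, G, kernel=lambda u: np.exp(-u ** 2)):
    """Assign the class g maximizing the kernel estimate of P(Y = g | X = x)."""
    labels = np.asarray(y_labels)
    dists = np.linalg.norm(np.asarray(X_train) - np.asarray(x_new), axis=1)
    w = kernel(dists / h)
    w = w / w.sum()                                   # weights w_i(x)
    probs = np.array([w[labels == g].sum() for g in range(1, G + 1)])
    return int(probs.argmax()) + 1                    # classes are numbered 1, ..., G
```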
Example of nonparametric curve classification with the PCA semi-norm

Problem: discriminating 5 phonemes from their log-periodograms.

Competitor methods:
- Ridge PDA (i.e., Principal Discriminant Analysis penalized by the ridge norm);
- Partial Least Squares with the L² norm (denoted by MPLSR);
- Partial Least Squares with the PCA semi-norm (denoted by NPCD/MPLSR);
- nonparametric kernel estimator with the PCA semi-norm (denoted by NPCD/PCA).
Obtained result

[Figure: results obtained by the four competitor methods.]
Generalization to time series [Ferraty and Vieu, 2004]

Problem and notations: given a time series (Z(t))_{t∈R}, one is often interested, knowing {Z(t), t ∈ [T_max − T, T_max]}, in predicting Z(T_max + τ).

Denoting

X = {Z(t), t ∈ [T_max − T, T_max]},    Y = Z(T_max + τ),

we can see that this problem is strongly related to a functional regression model. The observations are given by

x_i = {z(t), t ∈ [(i − 1)T, iT]},    y_i = z(iT + τ)

for i = 1, . . . , n, but they are not independent (a sketch of this construction is given below).
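A minimal sketch of this construction for a series sampled at integer times (indexing conventions are ours):

```python
import numpy as np

def series_to_samples(z, T, tau):
    """Cut a sampled series z into functional inputs x_i and targets y_i.

    z: 1-D array of the observed values of the series; T: window length in
    sampling steps; tau: prediction horizon. Consecutive samples come from
    the same trajectory, hence they are not independent.
    """
    n = (len(z) - tau) // T
    X = np.stack([z[i * T:(i + 1) * T] for i in range(n)])      # the curve x_{i+1}
    y = np.array([z[(i + 1) * T + tau - 1] for i in range(n)])  # the target z(iT + tau)
    return X, y
```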
Mixing assumptions

Under mixing assumptions on ((x_i, y_i))_i and other assumptions, the same convergence results hold. More precisely, suppose that, for

α(n) = sup_{k∈Z, A∈σ^k_{−∞}, B∈σ^{+∞}_{k+n}} | P(A ∩ B) − P(A) P(B) |,

where σ^l_k = σ({(x_i, y_i) : k ≤ i ≤ l}), we have

lim_{n→+∞} α(n) = 0.

α is named the mixing coefficient.
Table of contents
1 Nonparametric kernel
2 Neural networks
3 References
Biological neural networks

[Figure: schematic of a biological neuron: the incoming signals are summed (∑); if the sum exceeds the activation threshold, the neuron fires, otherwise it remains inactive.]
Mathematical multilayer perceptrons: multidimensional case

[Figure: diagram of a perceptron. The inputs (Variable 1, Variable 2, . . . , Variable d; with example values 2, 1.5, . . . , 1) are multiplied by weights (e.g., 0.5, −1, 0.2) and summed (here ∑ = 4.4); the sum is passed through an activation function G; the hidden units are then combined by a second weighted sum to produce the outputs.]
In summary, multilayer perceptrons with 1 hidden layer are a class of functions of the form:

φ^p_w : x ∈ R^d ↦ ∑_{k=1}^p w^{(2)}_k G( w^{(0)}_k + (w^{(1)}_k)^T x )

where

- G is given and called the activation function; popular examples are the linear and the sigmoid activation functions;
- p is also given and called the number of neurons on the hidden layer; p is a main parameter to obtain a good generalization ability;
- for all k = 1, . . . , p, w^{(2)}_k ∈ R, w^{(0)}_k ∈ R and w^{(1)}_k ∈ R^d are the weights, which have to be set from the learning data set.

More precisely, given ((x_i, y_i))_{i=1,...,n}, n i.i.d. realizations of the random pair (X, Y) taking its values in R^d × R, an estimate of Φ is given by Φ_n = φ_{w*_n}, where w*_n is a solution of:

w*_n = arg min_{w∈(R×R×R^d)^p} ∑_{i=1}^n (y_i − φ_w(x_i))².

A didactic sketch of this fitting procedure is given below.
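The sketch uses tanh as a sigmoid-type activation and plain gradient descent on the least-squares criterion; in practice the minimization is done by backpropagation with a more elaborate optimizer, and all names here are ours:

```python
import numpy as np

def mlp(X, W1, b, w2):
    """One-hidden-layer perceptron sum_k w2_k G(b_k + W1_k . x), with G = tanh."""
    return np.tanh(X @ W1.T + b) @ w2

def fit_mlp(X, y, p=5, lr=0.05, epochs=5000, seed=0):
    """Least-squares fit of the weights by plain gradient descent (didactic sketch)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(scale=0.5, size=(p, d))
    b = np.zeros(p)
    w2 = rng.normal(scale=0.5, size=p)
    for _ in range(epochs):
        H = np.tanh(X @ W1.T + b)              # hidden layer activations
        r = H @ w2 - y                         # residuals phi_w(x_i) - y_i
        g = np.outer(r, w2) * (1.0 - H ** 2)   # gradient through tanh
        W1 -= lr * g.T @ X / n
        b -= lr * g.mean(axis=0)
        w2 -= lr * H.T @ r / n
    return W1, b, w2
```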
Universal approximation

The popularity of MLP in the multidimensional context comes from two main properties. The first one is called the universal approximation capability:

Theorem [Hornik, 1991, Hornik, 1993, Stinchcombe, 1999]
If the activation function is continuous and non polynomial, then

{φ^p_w : w ∈ (R × R × R^d)^p, p ∈ N*}

is dense in the set of continuous functions from R^d to R for the topology induced by the uniform norm on compact sets.
Consistency to optimal weights

The second property deals with the fact that the optimal empirical weights tend to the optimal weights when the size of the training set tends to infinity.

Theorem [White, 1989, White, 1990]
Denote w* = arg min_{w∈(R×R×R^d)^p} E((Y − φ^p_w(X))²) and

Θ(p, ∆) = {φ^p_w : ∑_{k=1}^p |w^{(2)}_k| ≤ ∆ and ∑_{k=1}^p (|w^{(0)}_k| + ∑_{l=1}^d |w^{(1)}_{kl}|) ≤ ∆p}.

If

1. the activation function G is bounded;
2. lim_{n→+∞} p_n = +∞, lim_{n→+∞} ∆_n = +∞, ∆_n = o(n) and p_n ∆_n⁴ log(p_n ∆_n) = o(n);
3. w* exists and is unique;
4. w*_n = arg min_{w : φ_w∈Θ(p_n,∆_n)} ∑_{i=1}^n (y_i − φ_w(x_i))²;

then lim_{n→+∞} ‖w*_n − w*‖ = 0.
Multilayer perceptrons with functional input

Suppose now that (X, Y) is a random pair taking its values in X × R, where (X, ⟨·,·⟩_X) is a Hilbert space. Multilayer perceptrons with 1 hidden layer generalize to functional inputs by:

φ^p_w : x ∈ X ↦ ∑_{k=1}^p w^{(2)}_k G( w^{(0)}_k + ⟨x, w^{(1)}_k⟩_X )

where the weights w^{(1)}_k ∈ X (functional weights).

With relevant representations of the weights (w^{(1)}_k)_k (rich enough representations), the universal approximation property of this model remains valid.
Practical implementation: discrete sampling case

Suppose now that the (x_i)_i are known on a discrete sampling grid t_1, t_2, . . . , t_d. Then ⟨w^{(1)}_k, x_i⟩_X can be approximated by

⟨w^{(1)}_k, x_i⟩_X ≃ (1/d) ∑_{l=1}^d x_i(t_l) w^{(1)}_k(t_l).

w^{(1)}_k should be searched for in a class of functions F such that, for any set of real numbers (c_l)_{l=1,...,d}, there exists a function w ∈ F for which

∀ l = 1, . . . , d, w(t_l) = c_l.

Splines have such a property (see Presentation 4).
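A minimal sketch of the corresponding forward pass, with the inner products replaced by the grid approximation above (tanh again stands for G; names are ours):

```python
import numpy as np

def functional_mlp(x_grid, W1_grid, b, w2):
    """Forward pass of a functional MLP on a discrete sampling grid.

    x_grid: (d,) values x(t_1), ..., x(t_d) of the input curve;
    W1_grid: (p, d) values w_k^{(1)}(t_l) of the functional weights.
    The inner products <w_k^{(1)}, x>_X are approximated by
    (1/d) sum_l x(t_l) w_k^{(1)}(t_l), as above.
    """
    d = len(x_grid)
    inner = W1_grid @ x_grid / d     # approximate inner products, one per neuron
    return np.tanh(b + inner) @ w2   # tanh plays the role of G
```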
Practical implementation: projection approach

Suppose now that X admits a family of functions (ψ^q_k)_{q∈N*, 1≤k≤q} such that:

- for all q ∈ N*, (ψ^q_k)_k is an orthonormal system;
- if P_q denotes the projection in X on Span{ψ^q_1, . . . , ψ^q_q}, then the pointwise consistency property is satisfied:

  ∀ x ∈ X, ∀ ε > 0, ∃ Q ∈ N* : ∀ q ≥ Q, ‖P_q(x) − x‖_∞ ≤ ε.

Hence, ⟨w^{(1)}_k, x_i⟩_X can be approximated by

⟨w^{(1)}_k, x_i⟩_X ≃ ⟨w^{(1)}_k, P_q(x_i)⟩_X = ⟨P_q(w^{(1)}_k), x_i⟩_X = ⟨P_q(w^{(1)}_k), P_q(x_i)⟩_X.
Universal approximation for the projection based approach

Theorem [Rossi and Conan-Guez, 2005, Rossi and Conan-Guez, 2006]
If G is continuous and non polynomial, then the set

{ x ∈ X ↦ ∑_{k=1}^p w^{(2)}_k G( w^{(0)}_k + ∑_{l=1}^q β_{kl} (P_q(x))_l ) }

is dense in the set of all continuous functions defined on X for the uniform norm on any compact subset of X.

Remark 1: A similar result exists for the pointwise consistency approach.
Remark 2: A convergence result for the optimal weights of the functional MLP also exists but requires many technical assumptions, so we do not detail it here. See [Rossi and Conan-Guez, 2005] for more details.
Functional Inverse Regression

Suppose that we are given a random pair (X, Y) taking its values in X × R, for which we are given n i.i.d. realizations (x_1, y_1), . . . , (x_n, y_n).

Model: Moreover, we suppose that (X, Y) satisfies the following model:

Y = Ψ(⟨X, a_1⟩_X, . . . , ⟨X, a_q⟩_X, ε)

where ε is a centered real random variable independent of X, Ψ is an unknown function that has to be estimated, and {a_1, . . . , a_q} are unknown, linearly independent elements of X that also have to be estimated.

We call the space Span{a_1, . . . , a_q} the Effective Dimension Reduction subspace of X, denoted by EDR.
Fundamental property of the EDR space

Denote A = (⟨X, a_1⟩_X, . . . , ⟨X, a_q⟩_X)^T.

Li’s condition [Li, 1991]
If

(A-Li) ∀ u ∈ X, ∃ v ∈ R^q : E(⟨u, X⟩_X | A) = v^T A,

then E(X | Y) ∈ Γ_X(EDR).

⇒ The EDR space is thus estimated through the estimation of a_1, . . . , a_q, the eigenvectors of Γ_X^{−1} Γ_{E(X|Y)}. But, as for the functional linear model, Γ_X^{−1} has to be estimated by a penalized or a regularized approach.
PCA approach [Ferré and Yao, 2003, Ferré and Yao, 2005]

Note:

- ((λ^n_i, v^n_i))_{i≥1} the eigenvalue decomposition of Γ^n_X (the (λ^n_i)_i are ordered in decreasing order and at most n eigenvalues are not null; the (v^n_i)_i are orthonormal);
- k_n an integer such that k_n ≤ n and lim_{n→+∞} k_n = +∞;
- P_{k_n} the projector P_{k_n}(u) = ∑_{i=1}^{k_n} ⟨v^n_i, u⟩_X v^n_i;
- Γ^{n,k_n}_X = P_{k_n} ∘ Γ^n_X ∘ P_{k_n} = ∑_{i=1}^{k_n} λ^n_i ⟨v^n_i, ·⟩_X v^n_i;
- if (I_s)_{s=1,...,S} is a partition of the subset of R where Y takes its values, then we estimate:
  - P(Y ∈ I_s) by p^n_s = (1/n) ∑_{i=1}^n I_{{y_i∈I_s}},
  - E(X | Y ∈ I_s) by μ^n_s = (1/n) ∑_{i=1}^n x_i I_{{y_i∈I_s}},
  - Γ_{E(X|Y)} by Γ^n_{E(X|Y)} = ∑_{s=1}^S p^n_s μ^n_s ⊗ μ^n_s;
- ((α^{k_n}_k, a^n_k))_{k=1,...,q} the eigensystem of (Γ^{n,k_n}_X)^+ Γ^n_{E(X|Y)}.

A sketch of this pipeline on sampled curves is given below.
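The sketch below follows the steps above on curves sampled on a grid; here the slice means μ_s are computed as within-slice averages, as in standard SIR, and all names are ours:

```python
import numpy as np

def functional_sir(X, y, k, q, S=10):
    """Sketch of the PCA-based functional SIR estimate of the EDR directions.

    X: (n, d) sampled curves; y: (n,) real responses; k: number of PCA
    components kept (k_n); q: number of EDR directions; S: number of slices
    built from the empirical quantiles of y.
    """
    n, d = X.shape
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / n                               # Gamma_X^n on the grid
    eigval, eigvec = np.linalg.eigh(cov)
    V, lam = eigvec[:, ::-1][:, :k], eigval[::-1][:k]  # k leading eigenpairs
    edges = np.quantile(y, np.linspace(0.0, 1.0, S + 1))
    slice_idx = np.clip(np.searchsorted(edges, y, side="right") - 1, 0, S - 1)
    cov_e = np.zeros((d, d))
    for s in range(S):
        mask = slice_idx == s
        if mask.any():
            mu = Xc[mask].mean(axis=0)                 # slice mean of X
            cov_e += mask.mean() * np.outer(mu, mu)    # p_s * mu_s (x) mu_s
    pinv = V @ np.diag(1.0 / lam) @ V.T                # (Gamma_X^{n,k_n})^+
    vals, vecs = np.linalg.eig(pinv @ cov_e)
    order = np.argsort(-vals.real)
    return vecs[:, order[:q]].real                     # estimated a_1, ..., a_q
```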
Assumptions for convergence of the estimated EDR space

(A1) X has a 4th moment;

(A2) the (λ_k)_{k≥1} are all distinct and positive;

(A3) if we note b_1 = 2√2 / (λ_1 − λ_2) and b_j = 2√2 / min(λ_{j−1} − λ_j, λ_j − λ_{j+1}) for j ≥ 2, then

lim_{n→+∞} (1 / (√n λ_{k_n})) ∑_{j=1}^{k_n} b_j = 0  and  lim_{n→+∞} 1 / (√n λ²_{k_n}) = 0;

(A4) the eigenvalues of Γ_X^{−1} Γ_{E(X|Y)} are all distinct.
Convergence of the estimated EDR space

Theorem [Ferré and Yao, 2003, Ferré and Yao, 2005]
Under assumption (A-Li) and assumptions (A1)-(A4),

‖a^n_k − a_k‖_X → 0 in probability as n → +∞.

Remark: A similar result for a smoothing approach is described in [Ferré and Villa, 2005, Ferré and Villa, 2006].
FIR multilayer perceptrons

The idea of this model, developed in [Ferré and Villa, 2006], is to estimate the function Ψ by a multilayer perceptron.

[Figure: an MLP whose inputs are the projections ⟨X, a_1⟩_X, ⟨X, a_2⟩_X, . . . , ⟨X, a_q⟩_X; the hidden units compute weighted sums (weights w^{(1)}, plus a bias) followed by the activation, and a final weighted sum (weights w^{(2)}, bias w^{(0)}) produces the output Y.]
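As a usage illustration, the whole FIR-MLP pipeline can be sketched by chaining the hypothetical functional_sir, fit_mlp and mlp functions from the earlier code blocks:

```python
# Hypothetical FIR-MLP pipeline reusing functional_sir, fit_mlp and mlp from
# the sketches above; X_train, X_test are (n, d) sampled curves.
A = functional_sir(X_train, y_train, k=20, q=3)  # EDR directions a_1, ..., a_q
U_train = X_train @ A / X_train.shape[1]         # projections <x_i, a_j>_X
W1, b, w2 = fit_mlp(U_train, y_train, p=5)       # MLP on the projected data
y_pred = mlp(X_test @ A / X_test.shape[1], W1, b, w2)
```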
Notations

- Outputs: ∀ u ∈ R^q,

  φ_w(u) = ∑_{k=1}^p w^{(2)}_k G( (w^{(1)}_k)^T u + w^{(0)}_k );

- for a given error function L (e.g., the mean square error), ζ((u, y), w) = L(φ_w(u), y);
- variables: Z = ((⟨X, a_j⟩_X)_{j=1,...,q}, Y) and z^n_i = ((⟨x_i, a_j⟩_X)_{j=1,...,q}, y_i); Z and the (z^n_i)_i take their values in an open subset, O, of R^{q+1};
- the weights are chosen in a compact subset, W, of (R × R^q × R)^p:
  - theoretical: w* = arg min_{w∈W} E(ζ(Z, w)); W* is the set of all such w*;
  - empirical: w*_n = arg min_{w∈W} ∑_{i=1}^n ζ(z^n_i, w).
Assumptions for convergence of the optimal empirical weights

(A1) ∀ z ∈ O, ζ(z, ·) is continuous;

(A2) ∀ w ∈ W, ζ(·, w) is measurable;

(A3) there exists a measurable function ζ̃ from O to R such that, ∀ z ∈ O, ∀ w ∈ W, |ζ(z, w)| < ζ̃(z) and E(ζ̃(Z)) < +∞;

(A4) ∀ w ∈ W, ∃ C(w) > 0 such that, for all (x, y) and (x′, y) in O, |ζ((x, y), w) − ζ((x′, y), w)| ≤ C(w) ‖x − x′‖_{R^q}.
Convergence of the optimal empirical weights

Theorem [Ferré and Villa, 2006]
Under assumption (A-Li), assumptions ensuring the convergence of the estimated EDR space, and assumptions (A1)-(A4),

d(w*_n, W*) → 0 in probability as n → +∞,

where d is defined by d(w, W) = inf_{w̃∈W} ‖w − w̃‖_{R^{(q+2)p}}.
Application: Tecator Data Set

Aim: predicting the fat content of pieces of meat from their infrared spectra.

Comparison of:

- PCA and then MLP [Thodberg, 1996];
- functional MLP (with a discrete sampling based approach) [Rossi and Conan-Guez, 2005];
- smoothing FIR and then MLP [Ferré and Villa, 2006];
- projection based FIR and then MLP [Ferré and Yao, 2003, Ferré and Villa, 2006];
- smoothing FIR and then a linear model.

Methodology: repetition of 50 experiments with random training/test sets.
Results of the experiments

[Figure: boxplots (scale 1 to 5) of the test errors of the compared methods: PCA-MLP, NNf, SIRr-MLP, SIRp-MLP, SIRr-ML.]

Remark: A similar experiment has been made on the phoneme dataset to compare “FIR-MLP” to the functional nonparametric kernel.
Table of contents
1 Nonparametric kernel
2 Neural networks
3 References
References
Further details for the references are given in the joint document.
Ferraty, F. and Vieu, P. (2000). Dimension fractale et estimation de la régression dans des espaces vectoriels semi-normés. Comptes Rendus Mathématique. Académie des Sciences. Paris, 330:139–142.

Ferraty, F. and Vieu, P. (2002). The functional nonparametric model and application to spectrometric data. Computational Statistics, 17:515–561.

Ferraty, F. and Vieu, P. (2003). Curves discrimination: a nonparametric approach. Computational Statistics and Data Analysis, 44:161–173.

Ferraty, F. and Vieu, P. (2004). Nonparametric models for functional data, with application in regression, time series prediction and curves discrimination. Journal of Nonparametric Statistics, 16:111–125.

Ferraty, F. and Vieu, P. (2008). Erratum of: ‘Non-parametric models for functional data, with application in regression, time-series prediction and curve discrimination’. Journal of Nonparametric Statistics, 20(2):187–189.

Ferré, L. and Villa, N. (2005). Discrimination de courbes par régression inverse fonctionnelle. Revue de Statistique Appliquée, LIII(1):39–57.

Ferré, L. and Villa, N. (2006). Multi-layer perceptron with functional inputs: an inverse regression approach. Scandinavian Journal of Statistics, 33(4):807–823.

Ferré, L. and Yao, A. (2003). Functional sliced inverse regression analysis. Statistics, 37(6):475–488.

Ferré, L. and Yao, A. (2005). Smoothed functional inverse regression. Statistica Sinica, 15(3):665–683.

Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257.

Hornik, K. (1993). Some new results on neural network approximation. Neural Networks, 6(8):1069–1072.

Li, K. (1991). Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86:316–342.

Nadaraya, E. (1964). On estimating regression. Theory of Probability and its Applications, 10:186–196.

Rossi, F. and Conan-Guez, B. (2005). Functional multi-layer perceptron: a nonlinear tool for functional data analysis. Neural Networks, 18(1):45–60.

Rossi, F. and Conan-Guez, B. (2006). Theoretical properties of projection based multilayer perceptrons with functional inputs. Neural Processing Letters, 23(1):55–70.

Stinchcombe, M. (1999). Neural network approximation of continuous functionals and continuous functions on compactifications. Neural Networks, 12(3):467–477.

Thodberg, H. (1996). A review of Bayesian neural networks with an application to near infrared spectroscopy. IEEE Transactions on Neural Networks, 7(1):56–72.

Watson, G. (1964). Smooth regression analysis. Sankhyā, Series A, 26:359–372.

White, H. (1990). Connectionist nonparametric regression: multilayer feedforward networks can learn arbitrary mappings. Neural Networks, 3:535–549.

White, H. (1989). Learning in artificial neural networks: a statistical perspective. Neural Computation, 1:425–464.