
An introduction to functional data analysis

Cristian Preda

April, 2018

When the observations are curves:

[Figure: a sample curve X(t) plotted against time t]

Some words about the index t:

• continuous set of values (an interval),

• time in general, but also wavelength, fraction of a cycle, etc.,

• unidimensional in general, but possibly multidimensional (images).

When the observations are curves or . . . images:

A 3-D representation:

[Figure: a surface X(t, s) plotted over t and s]


[Figure: the same surface X(t, s) displayed as a 2-D image]

A first sample of functional data

Let us have a look at the tecator dataset. A detailed description of these data is available at http://lib.stat.cmu.edu/datasets/tecator.

The data we consider consist of 215 spectra (curves). Each spectrum is associated with an observation of a piece of meat and represents the absorbance at 100 wavelengths (near infrared: 850-1050 nm). For each spectrum one also has the fat content of the piece of meat.

The question is: can the spectrum predict the fat content?

The data are presented in the file http://math.univ-lille1.fr/~preda/FDA/tecator.txt. There are 215 rows and 101 columns (columns 1-100 for the spectra, column 101 for the fat content). The field separator between columns is the space " " character. There is neither a column header nor row names.

d0 = read.table("~/LO/DOC/tecator.txt", sep = " ", header = FALSE)
dim(d0)

[1] 215 101

names(d0)

[1] "V1" "V2" "V3" "V4" "V5" "V6" "V7" "V8" "V9" "V10"[11] "V11" "V12" "V13" "V14" "V15" "V16" "V17" "V18" "V19" "V20"[21] "V21" "V22" "V23" "V24" "V25" "V26" "V27" "V28" "V29" "V30"[31] "V31" "V32" "V33" "V34" "V35" "V36" "V37" "V38" "V39" "V40"[41] "V41" "V42" "V43" "V44" "V45" "V46" "V47" "V48" "V49" "V50"[51] "V51" "V52" "V53" "V54" "V55" "V56" "V57" "V58" "V59" "V60"[61] "V61" "V62" "V63" "V64" "V65" "V66" "V67" "V68" "V69" "V70"[71] "V71" "V72" "V73" "V74" "V75" "V76" "V77" "V78" "V79" "V80"[81] "V81" "V82" "V83" "V84" "V85" "V86" "V87" "V88" "V89" "V90"[91] "V91" "V92" "V93" "V94" "V95" "V96" "V97" "V98" "V99" "V100"

[101] "V101"
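As a side note, a minimal variant (assuming the URL above is still reachable and an internet connection is available): read.table accepts URLs, so the file can also be read directly from its address.

# hypothetical direct read from the URL, instead of a local copy
d0 = read.table("http://math.univ-lille1.fr/~preda/FDA/tecator.txt",
                sep = " ", header = FALSE)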


Plotting the data.

Let us first create the index t_index (time), here the range of wavelengths between 850 and 1050. Let us also rename the last column Y (the response, the fat content).

t_index = seq(850, 1050, by = (1050-850)/99)
names(d0)[101] = "Y"
# names(d0)

Let us plot the first spectrum:

plot(t_index, d0[1, 1:100], xlab = "t = wavelengths", ylab = "Absorbance : X(t)", main = "Spectrum no. 1")

[Figure: "Spectrum no. 1" — absorbance X(t) against t = wavelengths]

and then simply join the points with lines (linear interpolation):

par(mfrow = c(1,2))
plot(t_index, d0[1, 1:100], xlab = "t = wavelengths", ylab = "Absorbance : X(t)")
lines(t_index, d0[1, 1:100], lwd = 1)
plot(t_index, d0[1, 1:100], xlab = "t = wavelengths", ylab = "Absorbance : X(t)", type = "l")
title("Spectrum 1 : linear interpolation", outer = TRUE, line = -1)


[Figure: "Spectrum 1 : linear interpolation" — observed points (left) and the points joined by lines (right)]

The set of all 215 spectra is represented in the figure below:

matplot(t_index, t(d0[, 1:100]), type = "l", lty = 1, col = "black", ylab = "Absorbance : X(t)", xlab = "wavelengths", main = "All spectra")

[Figure: "All spectra" — the 215 absorbance curves over wavelengths 850-1050 nm]

Let us also have a look at the distribution of the response variable Y (the fat content):

par(mfrow = c(1, 2))
hist(d0$Y, xlab = "Fat content (%)", ylab = "Frequency", col = "blue", main = "Fat content histogram")
boxplot(d0$Y, main = "Fat content boxplot")

[Figure: "Fat content histogram" and "Fat content boxplot" — distribution of the fat content Y (%)]

A second sample of functional data: kneading data

For a given flour, the resistance of the dough is recorded during the kneading process over a time period of 8 minutes, from 0 to 480 seconds. Measurements are recorded every 2 seconds.

Data are available at http://math.univ-lille1.fr/~preda/FDA/flours.txt.

The data are presented in columns, as indicated by the header of the file. The first column contains the time points of observation. The next columns contain the resistance measurements of 90 flours (each referred to by a header F_i, i = 1, . . . , 90) at each time indicated in the first column. Thus one has 90 curves. Moreover, information on the flours is available: the first 50 flours (F_1 to F_50) produced cookies of good quality (Y = 1), whereas the next 40 flours (F_51 to F_90) produced cookies of bad quality.

The main question is: what is the predictive power of the resistance curves on the cookies' quality?

d1 = read.table("~/LO/DOC/flours.txt", sep = ";", header = TRUE)
dim(d1)

[1] 241 91

names(d1)

[1] "time" "F_1" "F_2" "F_3" "F_4" "F_5" "F_6" "F_7" "F_8" "F_9"[11] "F_10" "F_11" "F_12" "F_13" "F_14" "F_15" "F_16" "F_17" "F_18" "F_19"

5

Page 6: An introduction to functional data analysismaths.cnam.fr/IMG/pdf/cnam_df_preda_cle8fb16e.pdf · 2018-04-26 · An introduction to functional data analysis Cristian Preda April, 2018

[21] "F_20" "F_21" "F_22" "F_23" "F_24" "F_25" "F_26" "F_27" "F_28" "F_29"[31] "F_30" "F_31" "F_32" "F_33" "F_34" "F_35" "F_36" "F_37" "F_38" "F_39"[41] "F_40" "F_41" "F_42" "F_43" "F_44" "F_45" "F_46" "F_47" "F_48" "F_49"[51] "F_50" "F_51" "F_52" "F_53" "F_54" "F_55" "F_56" "F_57" "F_58" "F_59"[61] "F_60" "F_61" "F_62" "F_63" "F_64" "F_65" "F_66" "F_67" "F_68" "F_69"[71] "F_70" "F_71" "F_72" "F_73" "F_74" "F_75" "F_76" "F_77" "F_78" "F_79"[81] "F_80" "F_81" "F_82" "F_83" "F_84" "F_85" "F_86" "F_87" "F_88" "F_89"[91] "F_90"

Let us take a small snapshot of the data (the first 10 time points and the first 14 flours):

print(d1[1:10, 1:15])

   time F_1 F_2 F_3 F_4 F_5 F_6 F_7 F_8 F_9 F_10 F_11 F_12 F_13 F_14
1     0 224 226 156 137 136 224 197 112  97  142  138  136  164  126
2     2 286 276 202 179 189 227 245 145 139  162  175  171  120  166
3     4 348 338 248 221 242 364 293 156 181  180  216  152  164  134
4     6 360 394 294 263 295 434 226 211 223  176  182  170  160  246
5     8 382 324 220 241 290 496 385 164 229  228  276  180  182  194
6    10 312 282 231 210 266 351 238 163 232  179  197  176  156  211
7    12 313 438 233 222 267 362 295 210 218  227  211  184  165  210
8    14 314 304 235 234 268 373 264 244 296  203  225  208  174  213
9    16 315 315 237 246 269 384 277 190 234  215  239  200  183  220
10   18 316 326 239 258 270 395 290 199 242  203  253  208  192  227

Plot flour no. 1 (F_1):

par(mfrow = c(1,2))
plot(d1$time, d1$F_1, xlab = "time", ylab = "X(t) : Resistance", main = "Flour no. 1")
plot(d1$time, d1$F_1, xlab = "time", type = "l", ylab = "X(t) : Resistance",
     main = "Flour no. 1 as a function")

[Figure: "Flour no. 1" — resistance as points (left) and "Flour no. 1 as a function" (right)]

A graphical view of all the data:

matplot(d1$time, d1[,-c(1)], xlab = "time", ylab = "X(t) : Resistance", type = "l",
        col = "black", lty = 1, main = "All flours")

[Figure: "All flours" — the 90 resistance curves over time]

Let us also construct the response Q, the cookie quality: 1 = "good", 0 = "bad".

Q = rep(c(1,0), c(50, 40))
print(Q)

 [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[71] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

matplot(d1$time, d1[,-c(1)], xlab = "time", ylab = "X(t) : Resistance", type = "l",
        col = Q+1, lty = 1, main = "All flours and their quality")  # 1 = black, 2 = red
legend("topleft", c("good", "bad"), lty = c(1,1), col = c(1,2), title = "Quality")


[Figure: "All flours and their quality" — resistance curves colored black (good) and red (bad), with a "Quality" legend]

Remark:

A first difference between the two datasets is that in the second one the data seem to be subject to noise. It is difficult to accept that the resistance of the dough could have such local variation. The observed data are contaminated by a noise ε that has to be estimated:

X_observed(t) = X(t) + ε(t).
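To fix ideas, here is a toy simulation of this noise model; the "true" curve and the noise level below are purely hypothetical, chosen only for illustration.

# a smooth hypothetical signal observed with additive noise
t      = seq(0, 480, by = 2)
X_true = 300 + 100 * sin(t / 80)             # hypothetical true resistance curve
X_obs  = X_true + rnorm(length(t), sd = 20)  # observed = true + epsilon
plot(t, X_obs, col = "grey", xlab = "time", ylab = "X(t)")
lines(t, X_true, lwd = 2)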

The functional nature of curves

A curve {X(t), t ∈ [0, T ]} is then represented by a set of observations at the time points

0 ≤ t1 < t2 < . . . < ti < . . . < tp ≤ T :

time   t1      t2      . . .   ti      . . .   tp
X      X(t1)   X(t2)   . . .   X(ti)   . . .   X(tp)

Some points must be addressed at this stage:

- Missing data as a second nature

In the examples provided, the data are observed at time points defined by the design: {(ti, X(ti)), i ≥ 1}. Between two successive time points the data are not observed (because of technological capability, storage and other limitations). However, between ti and ti+1 the true curve exists but is not observed. Thus, functional data are a good example of data with missing observations!


- Multicollinearity

Obviously, two successive observations of X, X(ti) and X(ti+1), are strongly correlated; at the very least, the data are not a white noise! Clearly, this leads to inconsistency in the least squares estimation of a linear model on the raw data:

Y = β0 + β1X(t1) + β2X(t2) + . . . + βpX(tp) + ε.
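A quick numerical check of this on the tecator data — a sketch, assuming d0 is loaded as above: neighbouring wavelength columns are almost perfectly correlated, and the design matrix is severely ill-conditioned.

# correlation between two neighbouring wavelength columns
cor(d0$V50, d0$V51)           # expected to be extremely close to 1
# condition number of the design matrix: huge values signal near-singularity
kappa(as.matrix(d0[, 1:100]))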

- The curse of dimensionality: "n < p"

In many applications, the number of time points is larger than the number of curves. This again represents a source of inconsistency for the estimation of the linear model!

Examples: OLS, Ridge and PLS on the raw data

Let us try to fit a linear model on the tecator dataset in order to predict the fat content from the spectra.

Spectral data.

OLS

m0 = lm(Y ~ ., data = d0)
summary(m0)

plot(t_index, coef(m0)[-1], xlab = "wavelengths", ylab = "coefficients", main = "Coefficients of the linear model", type = "l")
points(t_index, coef(m0)[-1], type = "p", pch = 16, col = ifelse(summary(m0)$coefficients[-1,4] <= 0.05, "red", "black"))
abline(0, 0, lwd = 2, col = "blue")
legend("topright", c("p<=0.05", "p>0.05"), col = c("red", "black"), pch = 16, title = "Significance")


[Figure: "Coefficients of the linear model" — OLS coefficients over wavelengths; red points mark p <= 0.05]

Remark: The coefficients are difficult to interpret: their values are large and their significance is irregularly scattered across the range of wavelengths. Even procedures such as AIC-based selection and the Lasso face the same difficulty.

The Ridge model

Ridge regression could be a solution to obtain a "smooth" set of coefficients.

We will use the glmnet function, which fits a ridge model as a particular case of the elastic-net model (alpha = 0):

min_β { ‖Y − Xβ‖² + λ [ (1 − α) ‖β‖₂² + α ‖β‖₁ ] }

where ‖β‖₂² = ∑_{i=1}^{p} βᵢ² and ‖β‖₁ = ∑_{i=1}^{p} |βᵢ|.

Let us choose the optimal smoothing parameter lambda by cross-validation:

library("glmnet")
X = as.matrix(d0[, 1:100])
Y = as.vector(d0$Y)
cv_ridge = cv.glmnet(X, Y, alpha = 0, lambda = seq(0, 0.05, 0.0001), grouped = FALSE, nfolds = nrow(d0))
plot(cv_ridge, main = "Choice of the smoothing parameter")


[Figure: "Choice of the smoothing parameter" — mean-squared error vs log(Lambda)]

lambda_optimal = cv_ridge$lambda.min

print(lambda_optimal)

[1] 7e-04

Let us have a look at the ridge coefficients:

m0_ridge = glmnet(X, Y, alpha = 0, lambda = lambda_optimal)

plot(t_index, coef(m0_ridge)[-1], xlab = "wavelengths", ylab = "coefficients", main = "Coefficients of the Ridge linear model", type = "l")

points(t_index, coef(m0_ridge)[-1], type = "p", pch =16)

abline(0,0, col = "blue")


[Figure: "Coefficients of the Ridge linear model" — a much smoother coefficient curve over wavelengths]

This is much better in terms of regularity!
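As a quick usage note — a sketch only; the honest error estimate remains the cross-validated one above — the fitted ridge model can be used for prediction with predict():

# in-sample predictions and RMSE of the ridge model
Y_hat = predict(m0_ridge, newx = X)
sqrt(mean((Y - Y_hat)^2))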

PLS regression:

PCR and PLS could also be a solution to multicollinearity. Let us illustrate the PLS procedure.

library("pls")
m0pls = plsr(Y ~ ., scale = FALSE, validation = "LOO", jackknife = TRUE, data = d0)

plot(m0pls, "validation", estimate = c("train", "CV"), legendpos="topright")


[Figure: RMSEP for Y vs number of PLS components (train and LOO-CV curves)]

nb_comp_pls_opt = which.min(RMSEP(m0pls)$val[1, , ])-1

print(paste("Optimal number of PLS components : ", nb_comp_pls_opt))

[1] "Optimal number of PLS components : 19"obsfit =predplot(m0pls, ncomp = nb_comp_pls_opt, which = "validation", asp=1, line=TRUE, main = "Predicted versus observed")points(obsfit, pch=16, col="red")


[Figure: "Predicted versus observed" — LOO-CV predictions vs measured fat content]

The regression coefficients provided by the PLS regression are shown in the figure below:

coefplot(m0pls, ncomp = nb_comp_pls_opt, se.whiskers = TRUE, lwd = 2, cex.axis = 0.5, ylab = "PLS regression coefficients")

[Figure: PLS regression coefficients for Y over the 100 wavelength variables]

Good prediction, but the coefficients remain clearly difficult to interpret (Ridge is better for interpretation).
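Since the model was fitted with jackknife = TRUE, the pls package can attach approximate significance tests to the coefficients. A minimal sketch — jack.test is part of pls, and the array indexing below assumes its usual coefficient layout (predictors x responses x components):

# jackknife-based t-tests of the PLS regression coefficients
jt = jack.test(m0pls, ncomp = nb_comp_pls_opt)
# indices of the wavelengths with p-value below 5%
which(jt$pvalues[, 1, 1] <= 0.05)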


Kneading data

Let us also have a look at the kneading data in dataset d1. Recall that the curves are stored by columns and that the quality of the cookies is stored in the vector Q.

In these data we have n = 90 observations and p = 241 variables (time points). So n < p and, obviously, the logistic model or the Fisher discriminant score, for example, cannot be estimated.

In addition, the data are contaminated with some noise.

Recall that the Fisher discriminant score

s = β0 + β1X(t1) + . . . + βpX(tp),

which maximizes the between-to-within variance ratio, is given by the coefficients of the multiple regression of Q (after recoding) on {X(t1), . . . , X(tp)}:

n = ncol(d1) - 1   # the first column in d1 is the time column
p0 = length(Q[Q==0])/n
p1 = length(Q[Q==1])/n
Qrec = ifelse(Q==1, sqrt(p0/p1), -sqrt(p1/p0))  # Qrec has mean 0 and variance 1
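A quick sanity check of this recoding (the claim of mean 0 and variance 1 refers to the empirical, population-type moments):

mean(Qrec)     # = p1*sqrt(p0/p1) - p0*sqrt(p1/p0) = 0
mean(Qrec^2)   # = p1*(p0/p1) + p0*(p1/p0) = p0 + p1 = 1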

dfisher = cbind(data.frame(t(d1[,-1])), Qrec)
m1_fisher = lm(Qrec ~ ., data = dfisher)

summary(coef(m1_fisher)[-1])

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's
-0.14560 -0.03083 -0.00075 -0.00012  0.03141  0.17240      152

plot(d1[, 1], coef(m1_fisher)[-1], xlab = "time", ylab = "coefficients", main = "Coefficients of the Fisher score", type = "l")

abline(0,0, lwd =2, col="blue")

[Figure: "Coefficients of the Fisher score" — the estimable coefficients plotted over time]


From the set of 242 coefficients, 152 cannot be estimated (NA's).

The ridge and PLS solutions provide the following estimates.

Ridge

X = as.matrix(dfisher[, 1:241])
Y = Qrec
cv_ridge = cv.glmnet(X, Y, alpha = 0, lambda = seq(0, 10, 0.1), grouped = FALSE, nfolds = nrow(dfisher))
plot(cv_ridge, main = "Choice of the smoothing parameter")

[Figure: "Choice of the smoothing parameter" — mean-squared error vs log(Lambda) for the ridge Fisher score]

lambda_optimal = cv_ridge$lambda.min

print(lambda_optimal)

[1] 2.1

The ridge coefficients:

m1_ridge = glmnet(X, Y, alpha = 0, lambda = lambda_optimal)

plot(d1[,1], coef(m1_ridge)[-1], xlab = "time", ylab = "coefficients", main = "Ridge coefficients for the Fisher score", type = "l")

points(d1[,1], coef(m1_ridge)[-1], type = "p", pch =16)

abline(0,0, col = "blue")


[Figure: "Ridge coefficients for the Fisher score" — coefficients over time]

It looks like a curve with high instability!

PLS regression

m1pls = plsr(Qrec ~ ., scale = FALSE, validation = "LOO", jackknife = TRUE, data = dfisher)

plot(m1pls, "validation", estimate = c("train", "CV"), legendpos="topright")


[Figure: RMSEP for Qrec vs number of PLS components (train and LOO-CV curves)]

nb_comp_pls_opt = which.min(RMSEP(m1pls)$val[1, , ])-1

print(paste("Optimal number of PLS components : ", nb_comp_pls_opt))

[1] "Optimal number of PLS components : 2"obsfit =predplot(m1pls, ncomp = nb_comp_pls_opt, which = "validation", asp=1, line=TRUE,

main = "Predicted versus observed")points(obsfit, pch=16, col="red")


[Figure: "Predicted versus observed" — LOO-CV predicted scores vs recoded quality Qrec]

# the ROC curve for the PLS score
library(ROCR)
pr = prediction(obsfit[,"predicted"], Q)
perf <- performance(pr, "tpr", "fpr")
plot(perf, main = "The ROC curve for the Fisher score")

library(caTools)
auc = colAUC(obsfit[,"predicted"], Q)
legend("bottomright", paste("AUC=", auc))


[Figure: "The ROC curve for the Fisher score" — true positive rate vs false positive rate; AUC = 0.9755]

The PLS score is given by the regression coefficients:

coefplot(m1pls, ncomp = nb_comp_pls_opt, intercept = FALSE, se.whiskers = TRUE, lwd = 2, cex.axis = 0.5, ylab = "PLS regression coefficients", xlab = "the 241 time points")

[Figure: PLS regression coefficients for Qrec over the 241 time points]

The result is nice both in terms of interpretation and prediction. The local variation of the coefficients could be due to the noise in the data.


Indeed, the noise was not considered in the statistical analysis.

Conclusion.

Despite the possibility of working with functional data as a classical data matrix, both examples show that a preprocessing step is necessary in order to recover the functional nature of the observed curves: interpolation or smoothing (denoising) is required.

Pre-processing functional data: interpolation and smoothing

This step is necessary in order to give a functional form to the data representing a curve.

Let us consider that some curve is observed at the time points

0 ≤ t1 < t2 < . . . < ti < . . . < tp ≤ T :

time   t1      t2      . . .   ti      . . .   tp
X      X(t1)   X(t2)   . . .   X(ti)   . . .   X(tp)

Interpolation means that one looks for some function f defined on the same range as X such that

X(ti) = f(ti), ∀i = 1, . . . , p

[Figure: "Interpolation" — a function f(t) passing exactly through the observations X(t_i)]

Interpolation is convenient for data observed without error.
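Outside the fda package, base R can do this too. A minimal sketch with splinefun, assuming d0 and t_index from above, interpolating spectrum no. 1 with a natural cubic spline:

# cubic spline interpolation of spectrum no. 1 in base R
f1 = splinefun(t_index, as.numeric(d0[1, 1:100]), method = "natural")
all.equal(f1(t_index), as.numeric(d0[1, 1:100]))  # f reproduces the data at the knots
f1(c(875.3, 990.0))                               # ... and is defined in between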

Smoothing means, in general, that the observed data are contaminated with some noise; thus the observed values are

X(ti) = f(ti) + εi, ∀i = 1, . . . , p

where f is the unobserved true value of X and εi is some noise.
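Again in base R, smooth.spline implements exactly this kind of model for a single curve, choosing the penalty by (generalized) cross-validation. A sketch on flour F_1, assuming d1 is loaded:

# penalized cubic spline smoothing of one kneading curve
s1 = smooth.spline(d1$time, d1$F_1)   # GCV chooses the penalty
plot(d1$time, d1$F_1, col = "grey", xlab = "time", ylab = "X(t)")
lines(s1, lwd = 2)                    # the estimated smooth f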


[Figure: "Smoothing" — a smooth f(t) fitted through noisy observations X(t_i)]

Both procedures look for the estimation of f (and ε in the case of smoothing).

Projection onto a basis of functions.

There are several ways to perform interpolation and smoothing (parametric or not). We show here the methodology of projection onto a finite-dimensional space spanned by some basis of functions.

The idea is to suppose that the function f belongs to a space generated by a set of K functions (called "basis" functions) {φi, i = 1, . . . , K}, with K ≥ 1 the number of basis functions:

f(t) = ∑_{i=1}^{K} αi φi(t),  ∀t ∈ [0, T ].

We present here cubic spline smoothing, which has some theoretical optimality properties.

The cubic B-spline basis functions.

Given the set of time points (called knots) {t1 < . . . < tp}, a function f is said to be a cubic spline function if and only if:

• on each interval [ti, ti+1], f is a polynomial of degree 3;

• f has a continuous second derivative, i.e. f′′(ti−) = f′′(ti+) at each interior knot ti.

If, in addition, the second derivatives f′′(t1) and f′′(tp) satisfy f′′(t1) = f′′(tp) = 0, then f is called a natural cubic spline function.

The importance of these functions is stated by the following result (de Boor, 2001):


Under the penalized squared error criterion

PENSE = ∑_{i=1}^{p} (X(ti) − f(ti))² + λ ∫_{t1}^{tp} (f′′(t))² dt,

the function f that minimizes it is a natural cubic spline with knots {t1 < . . . < tp}.

Remark:

• note the degree of the spline: 3 (cubic)!

• when λ = 0, interpolation is performed;

• for smoothing, λ is determined by cross-validation (see later).

The dimension of the space of cubic spline functions with knots {t1 < . . . < tp} is p + 2.

A basis {φi, i = 1, . . . , K}, with K = p + 2, for that space can be constructed in a recursive way. This basis is due to de Boor (1976) and is called the B-spline basis. Nice references can be found at:

ftp://ftp.cs.wisc.edu/Approx/bsplbasic.pdf

and

https://en.wikipedia.org/wiki/De_Boor%27s_algorithm

The R package "fda" provides the function create.bspline.basis, which creates this basis.

Let us consider an example. Define the set of time points:

0, 2, 4, 6, 8, 10, 12, 14

So p = 8, and the dimension of the cubic spline space is K = p + 2 = 10. The 10 basis functions are plotted below.

library(fda)
t = seq(0, 14, by = 2)
b = create.bspline.basis(c(0,14), breaks = t, norder = 4)  # order = degree + 1
plot(b, main = "Basis b: the 10 B-spline basis functions (8 knots)")

[Figure: "Basis b: the 10 B-spline basis functions (8 knots)"]


Let us plot φ1, φ2, φ3, φ4:

c = rep(0, b$nbasis)
par(mfrow = c(2,2))
for(i in 1:4)
{
  c[i] = 1
  plot(fd(c,b), main = paste("phi_", i, sep = ""))
  c[i] = 0
}

[Figure: the basis functions phi_1, phi_2, phi_3 and phi_4 over [0, 14]]

Once we have a basis {φi, i = 1, . . . , K} and a set of coefficients {α1, α2, . . . , αK}, the function

f(t) = ∑_{i=1}^{K} αi φi(t)

is obtained using the function fd. For example:

alphas = c(1, -1, 2, 2, 0, -3, 0, -4, 0, 0)
f = fd(alphas, b)
plot(f, main = "Function expressed in the basis b with coefficients alpha's", xlab = "t", ylab = "f(t)")


[Figure: "Function expressed in the basis b with coefficients alpha's" — f(t) over [0, 14]]
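Once built, the functional object f can be evaluated at arbitrary points with eval.fd (part of fda); for example:

# evaluate f on a finer grid than the knots
t_new = seq(0, 14, by = 0.5)
head(eval.fd(t_new, f))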

[1] "done"

Interpolate and smooth curves.

Let us consider the two datasets (d0 and d1): the spectra and the flour resistance curves, respectively.

Interpolate spectra.

# build the basis
b0 = create.bspline.basis(c(min(t_index), max(t_index)), breaks = t_index, norder = 4)
# 100 knots, so K = 102

d0_fd = Data2fd(argvals = t_index, y = t(d0[,-101]), b0)  # Data2fd : compute the coefs
plot(d0_fd, lty = 1, col = "black")  # this is the functional data object


[Figure: the 215 spectra as a functional data object, plotted over 850-1050 nm]

[1] "done"str(coef(d0_fd)) # a matrix of alphas by column.

num [1:102, 1:215] 2.62 2.62 2.62 2.62 2.62 ...- attr(*, "dimnames")=List of 2..$ : chr [1:102] "bspl4.1" "bspl4.2" "bspl4.3" "bspl4.4" .....$ : NULL

alphas_curve_1 = coef(d0_fd)[,1]
print(alphas_curve_1)

 bspl4.1  bspl4.2  bspl4.3  bspl4.4  bspl4.5  bspl4.6  bspl4.7
2.617760 2.617881 2.618124 2.618581 2.619091 2.619774 2.620672
 bspl4.8  bspl4.9 bspl4.10 bspl4.11 bspl4.12 bspl4.13 bspl4.14
2.621797 2.623299 2.625046 2.627176 2.629568 2.632391 2.635566
bspl4.15 bspl4.16 bspl4.17 bspl4.18 bspl4.19 bspl4.20 bspl4.21
2.639243 2.643443 2.648166 2.653392 2.659264 2.665771 2.672750
bspl4.22 bspl4.23 bspl4.24 bspl4.25 bspl4.26 bspl4.27 bspl4.28
2.680087 2.687383 2.694363 2.700785 2.706878 2.712743 2.719008
bspl4.29 bspl4.30 bspl4.31 bspl4.32 bspl4.33 bspl4.34 bspl4.35
2.726064 2.734416 2.743993 2.754573 2.765674 2.776812 2.787817
bspl4.36 bspl4.37 bspl4.38 bspl4.39 bspl4.40 bspl4.41 bspl4.42
2.799318 2.811850 2.826782 2.843380 2.861056 2.878756 2.895339
bspl4.43 bspl4.44 bspl4.45 bspl4.46 bspl4.47 bspl4.48 bspl4.49
2.909709 2.921265 2.930331 2.938309 2.947192 2.959184 2.977211
bspl4.50 bspl4.51 bspl4.52 bspl4.53 bspl4.54 bspl4.55 bspl4.56
3.001830 3.033827 3.073220 3.118973 3.168668 3.218435 3.263853
bspl4.57 bspl4.58 bspl4.59 bspl4.60 bspl4.61 bspl4.62 bspl4.63
3.301393 3.329856 3.350002 3.364076 3.374215 3.381805 3.387684
bspl4.64 bspl4.65 bspl4.66 bspl4.67 bspl4.68 bspl4.69 bspl4.70
3.391918 3.394483 3.395231 3.393994 3.390755 3.385688 3.378952
bspl4.71 bspl4.72 bspl4.73 bspl4.74 bspl4.75 bspl4.76 bspl4.77
3.370642 3.360940 3.349977 3.337892 3.324596 3.310303 3.294971
bspl4.78 bspl4.79 bspl4.80 bspl4.81 bspl4.82 bspl4.83 bspl4.84
3.279033 3.262358 3.245456 3.228338 3.210872 3.192974 3.174452
bspl4.85 bspl4.86 bspl4.87 bspl4.88 bspl4.89 bspl4.90 bspl4.91
3.155197 3.134941 3.113541 3.091236 3.068475 3.045864 3.023830
bspl4.92 bspl4.93 bspl4.94 bspl4.95 bspl4.96 bspl4.97 bspl4.98
3.002398 2.981398 2.960710 2.940083 2.919738 2.899646 2.879638
 bspl4.99 bspl4.100 bspl4.101 bspl4.102
 2.859643  2.839389  2.825929  2.819200

# let us have a look at curve 1
par(mfrow = c(1,2))
plot(t_index, d0[1,-101], xlab = "wavelengths", ylab = "Absorbance", main = "Spectrum no. 1")
plot(t_index, d0[1,-101], xlab = "wavelengths", ylab = "Absorbance", main = "Interpolation of spectrum no. 1")
plot(d0_fd[1], col = "blue", add = TRUE, lwd = 2)

[Figure: "Spectrum no. 1" (raw points, left) and "Interpolation of spectrum no. 1" (points with the interpolating curve, right)]

Smoothing the kneading data.

b1 = create.bspline.basis(c(min(d1$time), max(d1$time)), breaks = d1$time, norder = 4)
# 241 knots, so K = 243 basis functions

lambda = Inf
d1_pen = fdPar(b1, 2, lambda)  ## penalty in PENSE
d1_smooth = smooth.basis(argvals = d1$time, y = as.matrix(d1[, -1]), d1_pen)

Warning in smooth.basis1(argvals, y, fdParobj, wtvec = wtvec, fdnames = fdnames, :
  lambda reduced to 990397567349.977 to prevent overflow


plot(d1_smooth$fd, lty=1, col= "black", main = "Extreme smoothing : lambda =+Inf")

[Figure: "Extreme smoothing : lambda = +Inf" — the 90 curves reduced to nearly straight lines]

lambda = 10
d1_pen = fdPar(b1, 2, lambda)  ## penalty in PENSE
d1_smooth = smooth.basis(argvals = d1$time, y = as.matrix(d1[, -1]), d1_pen)
plot(d1_smooth$fd, lty = 1, col = "black", main = "Smoothing : lambda = 10")


[Figure: "Smoothing : lambda = 10" — the 90 smoothed resistance curves]

Choice of lambda:

– a particular lambda for each curve, or

– a common value of lambda for all curves.

Let us consider the second option, with an interval of possible values for lambda, say [10, 100]. For each curve we have a cross-validation value, and we choose the lambda that minimizes the mean cross-validation value over the curves.

lambda = seq(10, 100, by = 1)
cv = c()
for(lam in lambda){
  print(paste("lambda =", lam))
  d1_pen = fdPar(b1, 2, lam)  ## penalty in PENSE
  d1_smooth = smooth.basis(argvals = d1$time, y = as.matrix(d1[, -1]), d1_pen)
  cv = c(cv, mean(d1_smooth$gcv))
}

[1] "lambda = 10"[1] "lambda = 11"[1] "lambda = 12"[1] "lambda = 13"[1] "lambda = 14"[1] "lambda = 15"[1] "lambda = 16"[1] "lambda = 17"[1] "lambda = 18"

29

Page 30: An introduction to functional data analysismaths.cnam.fr/IMG/pdf/cnam_df_preda_cle8fb16e.pdf · 2018-04-26 · An introduction to functional data analysis Cristian Preda April, 2018

[1] "lambda = 19"[1] "lambda = 20"[1] "lambda = 21"[1] "lambda = 22"[1] "lambda = 23"[1] "lambda = 24"[1] "lambda = 25"[1] "lambda = 26"[1] "lambda = 27"[1] "lambda = 28"[1] "lambda = 29"[1] "lambda = 30"[1] "lambda = 31"[1] "lambda = 32"[1] "lambda = 33"[1] "lambda = 34"[1] "lambda = 35"[1] "lambda = 36"[1] "lambda = 37"[1] "lambda = 38"[1] "lambda = 39"[1] "lambda = 40"[1] "lambda = 41"[1] "lambda = 42"[1] "lambda = 43"[1] "lambda = 44"[1] "lambda = 45"[1] "lambda = 46"[1] "lambda = 47"[1] "lambda = 48"[1] "lambda = 49"[1] "lambda = 50"[1] "lambda = 51"[1] "lambda = 52"[1] "lambda = 53"[1] "lambda = 54"[1] "lambda = 55"[1] "lambda = 56"[1] "lambda = 57"[1] "lambda = 58"[1] "lambda = 59"[1] "lambda = 60"[1] "lambda = 61"[1] "lambda = 62"[1] "lambda = 63"[1] "lambda = 64"[1] "lambda = 65"[1] "lambda = 66"[1] "lambda = 67"[1] "lambda = 68"[1] "lambda = 69"[1] "lambda = 70"[1] "lambda = 71"[1] "lambda = 72"

30

Page 31: An introduction to functional data analysismaths.cnam.fr/IMG/pdf/cnam_df_preda_cle8fb16e.pdf · 2018-04-26 · An introduction to functional data analysis Cristian Preda April, 2018

[1] "lambda = 73"[1] "lambda = 74"[1] "lambda = 75"[1] "lambda = 76"[1] "lambda = 77"[1] "lambda = 78"[1] "lambda = 79"[1] "lambda = 80"[1] "lambda = 81"[1] "lambda = 82"[1] "lambda = 83"[1] "lambda = 84"[1] "lambda = 85"[1] "lambda = 86"[1] "lambda = 87"[1] "lambda = 88"[1] "lambda = 89"[1] "lambda = 90"[1] "lambda = 91"[1] "lambda = 92"[1] "lambda = 93"[1] "lambda = 94"[1] "lambda = 95"[1] "lambda = 96"[1] "lambda = 97"[1] "lambda = 98"[1] "lambda = 99"[1] "lambda = 100"plot(lambda, cv)

[Figure: the mean GCV criterion cv plotted against lambda]
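The same search can be written more compactly (and silently) with sapply; a sketch, intended to be equivalent to the loop above:

# mean GCV over the 90 curves, for each candidate lambda
cv = sapply(lambda, function(lam)
       mean(smooth.basis(d1$time, as.matrix(d1[, -1]), fdPar(b1, 2, lam))$gcv))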

lambda_optimal = lambda[which.min(cv)] # lambda_optimal= 42


# fit the data with the optimal value of lambda
d1_pen = fdPar(b1, 2, lambda_optimal)  ## penalty in PENSE
d1_smooth = smooth.basis(argvals = d1$time, y = as.matrix(d1[, -1]), d1_pen)
plot(d1_smooth$fd, lty = 1, col = "black", main = paste("Smoothing : lambda_optimal =", lambda_optimal))

[Figure: "Smoothing : lambda_optimal = 42" — the 90 smoothed resistance curves]

In this functional representation, each flour is represented by a set of 243 coefficients. Since we have a set of 90 flours, these coefficients are stored (by column) in the matrix coef(d1_smooth$fd).

print(dim(coef(d1_smooth$fd)))

[1] 243  90
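As a usage note for what follows, the smoothed curves can now be manipulated through this coefficient matrix; for instance, a (hypothetical) feature matrix for a later classification step, with one row per flour:

coefs_d1 = t(coef(d1_smooth$fd))
dim(coefs_d1)   # 90 rows (flours) x 243 columns (basis coefficients)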

Let us now represent the kneading data before and after smoothing.

matplot(d1$time, d1[,-c(1)], xlab = "time", ylab = "X(t) : Resistance", type = "l",
        col = Q+1, lty = 1, main = "Before smoothing")  # 1 = black, 2 = red
legend("topleft", c("good", "bad"), lty = c(1,1), col = c(1,2), title = "Quality")


[Figure: "Before smoothing" — raw curves colored by quality (black = good, red = bad)]

plot(d1_smooth$fd, lty = 1, col = Q+1, main = "After smoothing")
legend("topleft", c("good", "bad"), lty = c(1,1), col = c(1,2), title = "Quality")

[Figure: "After smoothing" — smoothed curves colored by quality (black = good, red = bad)]


The pre-processing step is now finished. We can use the functional representation to perform:

• data visualisation and descriptive statistics (Functional Principal Components Analysis)

• Regression and prediction analysis

• Clustering
