

Pattern Recognition 45 (2012) 4389–4401

Contents lists available at SciVerse ScienceDirect

Pattern Recognition

0031-3203/$ - see front matter © 2012 Elsevier Ltd. All rights reserved.
http://dx.doi.org/10.1016/j.patcog.2012.05.016

* Tel.: +33 2 96 05 16 80.
E-mail address: [email protected]
URL: http://perso.rd.francetelecom.fr/boulle/

journal homepage: www.elsevier.com/locate/pr

Functional data clustering via piecewise constant nonparametricdensity estimation

Marc Boullé *

Orange Labs, 2 Avenue Pierre Marzin, 22300 Lannion, France

Article info

Article history:

Received 2 August 2011

Received in revised form 20 January 2012
Accepted 21 May 2012
Available online 7 June 2012

Keywords:

Functional data

Distributional data

Exploratory analysis

Clustering

Bayesianism

Model selection

Density estimation


Abstract

In this paper, we present a novel way of analyzing and summarizing a collection of curves, based on piecewise constant density estimation. The curves are partitioned into clusters, and the dimensions of the curve points are discretized into intervals. The cross-product of these univariate partitions forms a data grid of cells, which represents a nonparametric estimator of the joint density of the curves and point dimensions. The best model is selected using a Bayesian model selection approach and retrieved using combinatorial optimization algorithms. The proposed method requires no parameter setting and makes no assumption regarding the curves; beyond functional data, it can be applied to distributional data. The practical interest of the approach for functional data and distributional data exploratory analysis is presented on two real world datasets.

© 2012 Elsevier Ltd. All rights reserved.

1. Introduction

Functional data analysis [1–3] relates to data samples where each observation is described by a function or curve, represented by a variable-length set of measure vectors (points). Functional data arise in many domains, such as measurements of the heights of children over a wide range of ages, daily records of precipitation at a weather station, or hardware monitoring where each curve is a time series related to a physical quantity recorded at a specified sampling rate. Most statistical techniques designed for scalar data have their functional counterpart, including descriptive statistics, principal component analysis and supervised classification. In this paper, we focus on functional data exploratory analysis.

One of the key problems with functional data is that of data representation, with a preprocessing task of representing the curves by a fixed set of parameters or proposing a similarity between curves. A fixed-size instances × variables representation makes it possible to exploit most standard statistical techniques, whereas a similarity provides the basis for clustering methods such as K-means (exploited for example in [4–6]). This problem has been studied for functional data as well as for time series. A standard approach is to approximate a function using a linear combination of basis functions, such as Fourier series [7], discrete wavelet transform [8], low degree polynomial functions [9] or spline basis functions [10–12]. In [13,14], a hidden Markov model (HMM) is exploited as a parametric model of sequential data, and provides a similarity matrix according to the log-likelihood between sequence models and sequences. This similarity matrix is then used to build clusters of sequences, where each cluster is itself represented by an HMM. In [15], the self-organizing map (SOM) clustering algorithm is applied to functional data equipped with a similarity matrix. In [16], both the problem of segmentation of the curves (e.g. piecewise constant or linear) and clustering (K-means or SOM) are treated simultaneously. These approaches require both fixing some function parameters, such as polynomial degrees, the number of basis functions to use or the number of segments for the representation of curves, and setting the number of clusters for the clustering algorithm.

Nonparametric approaches have also been proposed, to better account for the potentially infinite dimensional models behind functional data. In [17,18], the functional data $Y = f(X) + \varepsilon$ is summarized using nonparametric regression techniques, with a focus on the conditional mode, median and quantiles. Kernel techniques are employed that mainly locally weight the data using smoothing parameters. In [19,20], the problem of density estimation of a random function is considered, by representing a function in the space of the eigenfunctions of principal component analysis. This kind of analysis reveals new patterns in functional data analysis, such as curves representing the mean or the mode in a curve dataset. Finally, clustering functional data is related to the problem of clustering time series data. In the survey [21], this problem is decomposed into three steps: choice of representation, with features extracted directly or indirectly from raw data or from models built from the raw data, choice of a similarity measure, and choice of a clustering method.

In this paper, we propose a novel exploratory method for functional data, based on data grid models [22]. The collection of curves is represented by a fixed size dataset where each observation corresponds to a point of a curve, with one categorical variable that stores the curve identifier and a finite dimensional numerical vector for the point variables. The categorical variable is partitioned into groups of curves and each numerical variable is discretized into intervals. The cross-product of these univariate partitions forms a multivariate partition, called a data grid. By counting the frequencies in the multivariate parts (called cells) of this data grid, we obtain a nonparametric estimator of the joint density of all the variables [22]. A model selection technique based on a Bayesian approach with a data dependent prior is applied to obtain an exact evaluation criterion for the posterior probability of joint density estimation data grid models. The prior is data dependent in that it exploits the number m of points and aims at modeling the data sample directly with a discrete distribution, not the true continuous-valued probability distribution. The best model is retrieved using combinatorial optimization algorithms, with a super-linear algorithmic complexity w.r.t. the number of points. In the case of functional data, grouping the values of the "curve identifier" variable can be interpreted as partitioning the curves into clusters, and discretizing each point variable provides an insightful summary of the curves, with an estimation of the joint density of the dimensions of each curve.

Compared to existing approaches, the benefit of our method is two-fold. It does not require any parameter, such as the choice of a family of basis functions, kernel parameters or a number of clusters, and it does not make any assumption regarding the curves, such as their simplicity [16], smoothness as in regularization [23,24] or capacity as in learning theory [25]. Compared to time series clustering approaches, our method exploits a single framework for the three-fold problem of choice of representation, similarity and clustering algorithm. It extends the functional data setting and can be applied to any distributional data, revealing new insights that have not previously been considered.

The rest of the paper is organized as follows. In Section 2, we present the MODL¹ approach for data grid models and apply it to joint density estimation and clustering for functional data. We illustrate the approach in Section 3 and present experimental results on two real world datasets in Section 4, which show what kind of exploratory analysis can be performed. Finally, we give a summary in Section 5.

2. MODL approach for functional data clustering

In this section, we first summarize the principles of data grid models, introduced in [22] in the data mining field for supervised and unsupervised data preparation, and show how these models can be applied to the problem of functional data clustering. We then adapt the approach to the case of functional data and finally describe the optimization algorithm.

1 MODL stands for Minimum Optimized Description Length.

2.1. Data grid models for data preparation in data mining

Data mining is "the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data" [26]. Most data mining techniques work on flat tabular data, with one instance per row and one variable, numerical or categorical, per column. Supervised data mining aims at predicting the value of one target variable given the other explanatory variables: the task is classification in the case of a categorical target variable and regression in the case of a numerical target variable. Unsupervised learning aims at discovering clusters in the data, association rules between the variables, or at modeling correlations or joint density.

Data grid models [27,22] have been introduced for the data preparation phase of the data mining process [28], which is a key phase, both time consuming and critical for the quality of the results. They make it possible to automatically, rapidly and reliably evaluate the class conditional probability of any subset of variables in supervised learning and the joint probability in unsupervised learning. Data grid models are based on a partitioning of each variable into intervals in the numerical case and into groups of values in the categorical case. The cross-product of the univariate partitions forms a multivariate partition of the representation space into a set of cells. This multivariate partition, called a data grid, is a piecewise constant nonparametric estimator of the conditional or joint probability. The best data grid is searched for using a Bayesian model selection approach and efficient combinatorial algorithms.

2.2. Application to functional data: principle

Let $\mathcal{C}$ be a collection of $n$ curves $c_i$, $1 \le i \le n$. Each curve $c_i = (p_{ij})_{j=1}^{m_i}$ has $m_i$ observed values, the curve points. Each point $p_{ij} = (p_{ij1}, \ldots, p_{ijd})$ is a vector of finite dimension $d$. In the rest of the paper, without loss of generality and to keep the notation simple, we focus on the case where $d = 2$ and use $X$ and $Y$ for the two point dimensions. We have $c_i = (x_{ij}, y_{ij})_{j=1}^{m_i}$.

Let us take an example, with two curves $c_1$ and $c_2$, drawn in Fig. 1, sampled at equidistant values for $x \in [0, 1]$ from the function $y = 1$ for $c_1$ and from the function $y = \cos(\pi x)$ for $c_2$. Our sample dataset consists of $n = 2$ curves with respectively $m_1 = 4$ and $m_2 = 5$ points:

$c_1$: $(0, 1)$, $(\tfrac{1}{3}, 1)$, $(\tfrac{2}{3}, 1)$, $(1, 1)$,
$c_2$: $(0, 1)$, $(\tfrac{1}{4}, \tfrac{\sqrt{2}}{2})$, $(\tfrac{1}{2}, 0)$, $(\tfrac{3}{4}, -\tfrac{\sqrt{2}}{2})$, $(1, -1)$.

We propose to represent the collection of $n$ curves as a unique dataset with three variables, $C$ to store the curve identifier, $X$ and $Y$ for the point coordinates, and $m = \sum_{i=1}^{n} m_i$ observations. This is illustrated in Table 1.

Instead of considering a dataset of curves, where each instance has a variable-length description, we have a dataset of points represented in tabular format. We then can apply the data grid models in the unsupervised setting to estimate the joint density between the three variables, $p(C, X, Y)$. The curve variable $C$ is grouped into clusters of curves, whereas each point dimension $X$ and $Y$ is discretized into intervals. The cross-product of these univariate partitions forms a data grid of cells, which is piecewise constant per triplet of curve cluster, $X$ interval and $Y$ interval. As $p(X, Y \mid C) = p(C, X, Y)/p(C)$, this can also be interpreted as an estimator of the joint density between the point dimensions, which is constant per cluster of curves. This means that similar curves with respect to the joint density of their point dimensions will tend to be grouped into the same clusters. We formalize this in the next section and illustrate it in Section 3.

Page 3: Functional data clustering via piecewise constant nonparametric density estimation

Fig. 1. Two sample curves.

Table 1
Two curves represented as a unique dataset with three variables.

C    X     Y
c1   0     1
c1   1/3   1
c1   2/3   1
c1   1     1
c2   0     1
c2   1/4   √2/2
c2   1/2   0
c2   3/4   -√2/2
c2   1     -1
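As an illustration of this reformatting, the following minimal sketch (in Python, with illustrative variable names; this is not code from the paper) builds the tabular point dataset of Table 1 from curves given as lists of (x, y) pairs.

```python
import math

# Hypothetical input: a dict mapping curve identifiers to lists of (x, y) points.
curves = {
    "c1": [(0.0, 1.0), (1/3, 1.0), (2/3, 1.0), (1.0, 1.0)],
    "c2": [(0.0, 1.0), (0.25, math.sqrt(2)/2), (0.5, 0.0),
           (0.75, -math.sqrt(2)/2), (1.0, -1.0)],
}

# Flatten into one tabular dataset with three variables: C (curve id), X, Y.
point_dataset = [(cid, x, y) for cid, points in curves.items() for (x, y) in points]

for row in point_dataset:
    print(row)  # m = sum of the m_i, here 4 + 5 = 9 rows
```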


2.3. Density estimation for functional data

We reformulate the data grid approach in the context of functional data clustering. A data grid provides a summary of a collection of curves with a piecewise constant joint density estimation of the curves and points. The finest representation is the data itself, with a data grid consisting of one cluster per curve and one interval per point value. The coarsest representation is obtained with a data grid consisting of one single cell containing all the points of all the curves, with the assumption that all the points are uniformly distributed in the X × Y domain. The issue is to find a trade-off between the informativeness of the joint density estimation and its reliability, on the basis of the granularity of the data grid.

We introduce in Definition 1 a family of functional data clustering models, based on clusters of curves, intervals for each point dimension, and a multinomial distribution of all the points on the cells of the resulting data grid.

Definition 1. A functional data clustering model is defined by

- a number of clusters of curves,
- a number of intervals for each point dimension,
- the repartition of the curves into the clusters of curves,
- the distribution of the points of the functional dataset on the cells of the data grid,
- for each cluster of curves, the distribution of the points that belong to the cluster on the curves of the cluster.

Notation

- $\mathcal{C}$: collection of curves
- $\mathcal{P}$: point dataset containing all points of $\mathcal{C}$ in tabular format
- $C$: curve variable
- $X, Y$: variables for the point dimensions
- $n = |\mathcal{C}|$: number of curves
- $m = |\mathcal{P}|$: total number of points
- $k_C$: number of clusters of curves
- $k_X, k_Y$: number of intervals for variables $X$ and $Y$
- $k = k_C k_X k_Y$: number of cells of the data grid
- $label_C(i)$: index of the cluster containing curve $i$
- $n_{i_C}$: number of curves in cluster $i_C$
- $m_i$: number of points for curve $i$
- $m_{i_C}$: cumulated number of points for curves of cluster $i_C$
- $m_{j_X}$: cumulated number of points for interval $j_X$ of $X$
- $m_{j_Y}$: cumulated number of points for interval $j_Y$ of $Y$
- $m_{i_C j_X j_Y}$: cumulated number of points for cell $(i_C, j_X, j_Y)$ of the data grid

We assume that the numbers of curves $n$ and of points $m$ are known in advance and we aim at modeling the joint distribution of the $m$ points on the curve and point dimensions.

Note. We do not assume that the numbers of points $m_i$ per curve are all equal, nor that the points are ordered or at the same locations, nor that there is a smooth function underlying the curve data such as $y_{ij} = x_{ij} + \varepsilon_{ij}$ with errors $\varepsilon_{ij}$.

The family of models introduced in Definition 1 is completely defined by the parameters describing the partition of the curves into clusters

$$k_C, \{label_C(i)\}_{1 \le i \le n},$$

by the numbers of intervals for the point dimensions

$$k_X, k_Y,$$

by the parameters of the multinomial distribution of the points on the $k$ cells of the data grid

$$\{m_{i_C j_X j_Y}\}_{1 \le i_C \le k_C,\, 1 \le j_X \le k_X,\, 1 \le j_Y \le k_Y},$$

and by the parameters of the multinomial distribution of the points belonging to each cluster of curves on the curves of the cluster

$$\{m_i\}_{1 \le i \le n}.$$

The numbers of curves per cluster $n_{i_C}$ are derived from the partition of the curves into clusters: they do not belong to the model parameters. Similarly, the cumulated numbers of points per cluster of curves $m_{i_C}$ or per interval $m_{j_X}$ and $m_{j_Y}$ can be deduced by adding the frequencies of cells, according to

$$m_{i_C} = \sum_{j_X=1}^{k_X} \sum_{j_Y=1}^{k_Y} m_{i_C j_X j_Y}, \quad
m_{j_X} = \sum_{i_C=1}^{k_C} \sum_{j_Y=1}^{k_Y} m_{i_C j_X j_Y}, \quad
m_{j_Y} = \sum_{i_C=1}^{k_C} \sum_{j_X=1}^{k_X} m_{i_C j_X j_Y}.$$

It is noteworthy that the model parameters for the point dimensions exploit the ranks of the values in the dataset rather than the values themselves. Therefore, any model is invariant w.r.t. any monotonous transformation of the point dimensions and robust w.r.t. atypical values (outliers).

A functional data clustering model as defined in Definition 1 can be seen as a generative model of the curve points, which corresponds to a piecewise constant joint density estimation of the ranks of $X$ and $Y$, per cluster of curves:

$$p(X \in interval_{j_X}, Y \in interval_{j_Y}, C \in cluster_{i_C}) = \frac{m_{i_C j_X j_Y}}{m}. \tag{1}$$

For a specific curve $i$ belonging to a cluster $i_C$, we have

$$p(X \in interval_{j_X}, Y \in interval_{j_Y}, C = curve_i) = \frac{m_{i_C j_X j_Y}}{m} \, \frac{m_i}{m_{i_C}}. \tag{2}$$

In order to select the best model, we apply a Bayesian approach, where the best model is found by maximizing the probability P(M|D) of the model given the data. Using Bayes rule and since the probability P(D) is constant while varying the model, this is equivalent to maximizing P(M)P(D|M), where P(M) is the prior distribution on the model parameters and P(D|M) is the likelihood. We choose the prior described in Definition 2, where the parameters for functional data clustering models are chosen hierarchically and uniformly at each level.

Definition 2. The uniform hierarchical prior for the parameters of functional data clustering models is defined as follows:

1. the numbers of clusters $k_C$ and of intervals $k_X, k_Y$ are independent from each other, and uniformly distributed between 1 and $n$ for the curves, and between 1 and $m$ for the point dimensions,
2. for a given number $k_C$ of clusters, every partition of the $n$ curves into $k_C$ clusters is equiprobable,
3. for a model of size $(k_C, k_X, k_Y)$, every distribution of the $m$ points on the $k = k_C k_X k_Y$ cells of the data grid is equiprobable,
4. for a given cluster of curves, every distribution of the points in the cluster on the curves of the cluster is equiprobable,
5. for a given interval of $X$ (resp. $Y$), every distribution of the ranks of the $X$ (resp. $Y$) values of the points is equiprobable.

Using these assumptions and taking the negative log of the probabilities, this provides the evaluation criterion given in Theorem 3, which specializes to functional data clustering the general criterion of unsupervised data grid models [29].

Theorem 3 (Evaluation criterion). A functional data clustering model M distributed according to a uniform hierarchical prior is Bayes optimal if the value of the following criterion is minimal:

$$\begin{aligned}
c(M) = {} & \log n + 2\log m + \log B(n, k_C) \\
& + \log \binom{m + k - 1}{k - 1} + \sum_{i_C=1}^{k_C} \log \binom{m_{i_C} + n_{i_C} - 1}{n_{i_C} - 1} \\
& + \log m! - \sum_{i_C=1}^{k_C} \sum_{j_X=1}^{k_X} \sum_{j_Y=1}^{k_Y} \log m_{i_C j_X j_Y}! \\
& + \sum_{i_C=1}^{k_C} \log m_{i_C}! - \sum_{i=1}^{n} \log m_i! + \sum_{j_X=1}^{k_X} \log m_{j_X}! + \sum_{j_Y=1}^{k_Y} \log m_{j_Y}!
\end{aligned} \tag{3}$$

Proof. Skipped for brevity (see Appendix for details). □

$B(n, k)$ is the number of divisions of $n$ elements into $k$ subsets (with possibly empty subsets). When $n = k$, $B(n, k)$ is the Bell number. In the general case, $B(n, k)$ can be written as $B(n, k) = \sum_{i=1}^{k} S(n, i)$, where $S(n, i)$ is the Stirling number of the second kind [30], which stands for the number of ways of partitioning a set of $n$ elements into $i$ nonempty subsets.

As negative logs of probabilities are nothing other than coding lengths [31], criterion (3) can be interpreted according to the minimum description length (MDL) approach [32], as the coding length of the model plus the coding length of the dataset $\mathcal{P}$ given the model. The first line in Formula (3) relates to the prior distribution of the numbers of clusters $k_C$ and of intervals $k_X$ and $k_Y$, and to the specification of the partition of the curves into clusters. The second line represents the specification of the parameters of the multinomial distribution of the $m$ points on the $k$ cells of the data grid, followed by the specification of the multinomial distribution of the points of each cluster on the curves of the cluster. The third line stands for the likelihood of the distribution of the points on the cells, by means of a multinomial term. The last line corresponds to the likelihood of the distribution of the points of each cluster on the curves of the cluster, followed by the likelihood of the distribution of the ranks of the $X$ values (resp. $Y$ values) in each interval.
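For concreteness, criterion (3) can be evaluated from the counts of a given data grid using log-factorials and the Stirling-number expression of B(n, k) given above. The sketch below is an illustrative Python implementation under the notation of Section 2.3, not the author's optimized code; all function and parameter names are ours.

```python
from math import lgamma, log

def log_fact(x):
    """log(x!) via the log-gamma function."""
    return lgamma(x + 1)

def log_binom(a, b):
    """log of the binomial coefficient C(a, b)."""
    return log_fact(a) - log_fact(b) - log_fact(a - b)

def log_B(n, k):
    """log B(n, k), with B(n, k) = sum_{i=1..k} S(n, i), S the Stirling numbers of the 2nd kind."""
    S = [[0] * (k + 1) for _ in range(n + 1)]
    S[0][0] = 1
    for a in range(1, n + 1):
        for b in range(1, min(a, k) + 1):
            S[a][b] = b * S[a - 1][b] + S[a - 1][b - 1]
    return log(sum(S[n][b] for b in range(1, k + 1)))

def criterion(n, m, k_C, k_X, k_Y, cell, m_cluster, n_cluster, m_curve, m_x, m_y):
    """Evaluation criterion (3) of a functional data clustering model.
    cell[(i_C, j_X, j_Y)] = points per nonempty grid cell; m_cluster[i_C], n_cluster[i_C],
    m_curve[i], m_x[j_X], m_y[j_Y] follow the notation of Section 2.3 (0-based lists)."""
    k = k_C * k_X * k_Y
    c = log(n) + 2 * log(m) + log_B(n, k_C)
    c += log_binom(m + k - 1, k - 1)
    c += sum(log_binom(m_cluster[i] + n_cluster[i] - 1, n_cluster[i] - 1)
             for i in range(k_C))
    c += log_fact(m) - sum(log_fact(v) for v in cell.values())
    c += sum(log_fact(v) for v in m_cluster) - sum(log_fact(v) for v in m_curve)
    c += sum(log_fact(v) for v in m_x) + sum(log_fact(v) for v in m_y)
    return c
```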

2.4. Asymptotic properties of the evaluation criterion

We present in this section an asymptotic approximation of criterion (3) and provide theoretical insights on why the method may work well for functional data clustering.

Null model. Let us first introduce the null model $M_\emptyset$, with one single cluster of curves, one single interval per point dimension, and one single cell containing all the points. Applying Formula (3), the cost $c(M_\emptyset)$ of the null model (its value according to evaluation criterion (3)) reduces to

$$c(M_\emptyset) = \log n + 2\log m + \log \binom{m + n - 1}{n - 1} + \log \frac{m!}{m_1! \, m_2! \cdots m_n!} + 2\log m!, \tag{4}$$

which corresponds to the posterior probability of the multinomial model for the distribution of the $m$ points on the $n$ curves and the posterior probability of the ranking of the values of each point dimension. This means that the curves and the dimensions of the points are described independently.

Let $C_M$, $X_M$, $Y_M$ be the discretized versions of variables $C$, $X$, $Y$ given a functional data clustering model $M$:

$$C_M = i_C \Leftrightarrow C \in Cluster_{i_C}, \quad X_M = j_X \Leftrightarrow X \in Interval_{j_X}, \quad Y_M = j_Y \Leftrightarrow Y \in Interval_{j_Y}.$$

These discrete variables are distributed marginally and jointly according to

$$C_M: \; p_{i_C} = \frac{m_{i_C}}{m}, \quad X_M: \; p_{j_X} = \frac{m_{j_X}}{m}, \quad Y_M: \; p_{j_Y} = \frac{m_{j_Y}}{m}, \quad (C_M, X_M, Y_M): \; p_{i_C j_X j_Y} = \frac{m_{i_C j_X j_Y}}{m}.$$

We introduce the notion of contrast of a model in Definition 4.

Definition 4. The contrast of a functional data clustering model $M$ is defined as the difference between the joint entropy of the variables $C_M, X_M, Y_M$ and the sum of their individual entropies:

$$Contrast(M) = H(C_M, X_M, Y_M) - H(C_M) - H(X_M) - H(Y_M). \tag{5}$$

Let us notice that the contrast is never positive, since the joint entropy of a set of variables is less than or equal to the sum of their individual entropies [33]. This inequality is an equality if and only if the variables are statistically independent.
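A minimal sketch of how Contrast(M) in (5) can be computed from the grid counts is given below (illustrative Python, assuming the same count structures as in the criterion sketch above; entropies are in nats).

```python
from math import log

def entropy(counts):
    """Entropy (in nats) of the empirical distribution given by nonnegative counts."""
    total = sum(counts)
    return -sum((c / total) * log(c / total) for c in counts if c > 0)

def contrast(cell, m_cluster, m_x, m_y):
    """Contrast(M) = H(C_M, X_M, Y_M) - H(C_M) - H(X_M) - H(Y_M), always <= 0.
    cell: dict of grid cell counts; m_cluster, m_x, m_y: marginal counts."""
    return (entropy(list(cell.values()))
            - entropy(m_cluster) - entropy(m_x) - entropy(m_y))
```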


We present in Theorem 5 our main result regarding the approximation of criterion (3).

Theorem 5 (Approximation of the evaluation criterion). The evaluation criterion of a functional data clustering model M distributed according to a uniform hierarchical prior can be approximated according to

$$\left| \left(c(M) - c(M_\emptyset)\right) - a(k, m) \right| < b(k_C, k, n, m), \tag{6}$$

where

$$a(k, m) = m \, Contrast(M) + (k - 1) \log(m + k) \tag{7}$$

and

$$b(k_C, k, n, m) = n \left( 1 + \log \frac{1}{k_C} \left( 1 + \frac{m}{n} \right) \right) + \left( \frac{1}{2} \log m + 2 \right) (k + 2) + k \log k. \tag{8}$$

Proof. Skipped for brevity (see Appendix for details). Mainly, the proof relies on a precise asymptotic approximation of $\log n!$ at the order $O(1/n)$ and exploits Jensen's inequality. The Contrast term comes from the likelihood terms in criterion (3), on the basis of the approximation of the log of multinomial terms by entropy terms. □

Assuming that the number $n$ of curves is fixed, we now exploit this approximation to derive a list of properties that provide theoretical grounding for the method.

Theorem 6. For a functional data clustering model M of fixed granularity $(k_C, k_X, k_Y)$, the normalized difference of costs $(c(M) - c(M_\emptyset))/m$ asymptotically converges to the difference between the joint entropy of the discretized variables $C_M, X_M, Y_M$ and the sum of their individual entropies:

$$\lim_{m \to \infty} \frac{c(M) - c(M_\emptyset)}{m} = Contrast(M). \tag{9}$$

Theorem 6 states that functional data clustering models allow one to asymptotically approximate the joint entropy of the point variables, using fine grained functional clustering models. As $H(X_M, Y_M \mid C_M) = H(C_M, X_M, Y_M) - H(C_M)$, this makes it possible to approximate the joint entropy of the curves. In particular, clusters of curves are likely to have a good cost c(M) in the case of correlated curves.

Theorem 7. For a set of functional data clustering models M(m) of nested granularities with the number of cells $k(m)$ such that

$$\lim_{m \to \infty} k(m) = \infty \quad \text{and} \quad \lim_{m \to \infty} \frac{k(m) \log m}{m} = 0,$$

the normalized difference of costs $(c(M) - c(M_\emptyset))/m$ can be decomposed into a goodness of fit term (10), a regularization term (11) and an approximation term (12):

$$\frac{c(M) - c(M_\emptyset)}{m} = Contrast(M) \tag{10}$$
$$\qquad + \alpha_m \frac{k(m) \log m}{m} \tag{11}$$
$$\qquad + \frac{1}{m}\left(O(\log m) + O(k(m) \log k(m))\right) \tag{12}$$

with $1/2 \le \alpha_m \le 3/2$.

Theorem 7 shows how the Bayesian selection approach works: coarse grained models bring a small regularization term, whereas fine grained models better fit the data, leading to a potentially smaller joint entropy in the goodness of fit term. Overall, the optimal model granularity results from a balance between robustness (coarse grained models) and good fit of the data (fine grained models).

In the case of statistically independent variables C, X, Y, the goodness of fit term is null whereas the regularization term grows with the granularity of the model: the null model will be the best one. In the case of dependent variables, in particular in the case of statistically significant clusters of curves, the goodness of fit term is strictly negative, such that an informative model with clusters will be asymptotically more probable than the null model.

Overall, the method is likely to be resilient to spurious clusters and to uncover fine grained clusters in the case of correlated curves, owing to a precise approximation of the joint entropy of the curve variables C, X, Y.

2.5. Optimization algorithm

Functional data clustering models are nothing other than data grid models [29] applied to the case of joint density estimation of the curve and dimensions of the points. The space of data grid models is so large that straightforward algorithms almost surely fail to obtain good solutions within a practicable computational time. Given that criterion (3) is optimal, the design of sophisticated optimization algorithms is both necessary and meaningful. Such algorithms are described in [29,34]. They finely exploit the sparseness of the data grid and the additivity of the criterion, and allow a deep search in the model space with $O(m)$ memory complexity and $O(m \sqrt{m} \log m)$ time complexity.

In this section, we give an overview of the optimization algorithms, which are fully detailed in [29], and rephrase them using the functional data terminology. The optimization of a data grid is a combinatorial problem. The number of possible partitions of the $n$ curves is equal to the Bell number $B(n) = (1/e)\sum_{k=1}^{\infty} k^n / k!$, and the number of discretizations of the $m$ values is equal to $2^m$. Even with very simple models having only two clusters of curves and two intervals per dimension, the number of models amounts to $2^n m^2$. An exhaustive search through the whole space of models is unrealistic. We describe in Algorithm 1 a greedy bottom up merge (GBUM) heuristic which optimizes the model criterion (3). The method starts with a fine grained model, with many clusters of curves and many intervals per dimension, up to the maximum model $M_{Max}$ with one curve per cluster and one value per interval. It considers all the merges between any clusters of curves or between adjacent intervals for the dimension variables, and performs the best merge if the criterion decreases after the merge. The process is reiterated until no further merge decreases the criterion.

Algorithm 1. Greedy bottom up merge (GBUM) heuristic.

Require: M {initial solution}
Ensure: M*, c(M*) ≤ c(M) {final solution with improved cost}

M* ← M
while improved solution
  M′ ← M*
  for all merges m between two clusters of C or two intervals of a dimension variable
    M+ ← M* + m {consider merge m for model M*}
    if c(M+) < c(M′) then
      M′ ← M+
    end if
  end for
  if c(M′) < c(M*) then
    M* ← M′ {improved solution}
  end if
end while
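A schematic Python rendering of Algorithm 1 is sketched below. The helpers cost, all_merges and apply_merge are hypothetical placeholders supplied by the caller (cost could, for instance, be criterion (3)); the actual implementation described in [29] relies on dedicated data structures to reach the complexity discussed next.

```python
def gbum(model, cost, all_merges, apply_merge):
    """Greedy bottom up merge (Algorithm 1): repeatedly apply the best cluster or
    interval merge as long as it decreases the cost.
    all_merges(model) enumerates candidate merges; apply_merge(model, mu) returns
    the merged model; both are assumed to be provided by the caller."""
    best_model, best_cost = model, cost(model)
    improved = True
    while improved:
        improved = False
        candidate_model, candidate_cost = best_model, best_cost
        for mu in all_merges(best_model):
            merged = apply_merge(best_model, mu)
            c = cost(merged)
            if c < candidate_cost:          # keep the best merge seen in this pass
                candidate_model, candidate_cost = merged, c
        if candidate_cost < best_cost:      # accept it only if it improves the cost
            best_model, best_cost = candidate_model, candidate_cost
            improved = True
    return best_model, best_cost
```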


Each evaluation of the criterion for a data grid model requires $O(nm^2)$ time, since the initial model contains up to $nm^2$ cells (see Formula (3)) in the case of the maximal model $M_{Max}$. Each step of the algorithm relies on $O(n^2)$ evaluations of merges of clusters of curves and $O(2m)$ evaluations of merges of intervals, and there are at most $O(n + 2m)$ steps, since the model becomes equal to the null model $M_\emptyset$ once all the possible merges have been performed. Overall, the time complexity of the algorithm is $O(n^4 m^2 + n^3 m^3 + n m^4)$ using a straightforward implementation. However, the method can be optimized in $O(m \sqrt{m} \log m)$ time, as demonstrated in [29]. The optimized algorithm mainly exploits the sparseness of the data, the additivity of the criterion, and starts from non-maximal models with pre- and post-optimization heuristics.

- Collections of curves represented with datasets of points are sparse. Although a data grid model may contain $O(nm^2)$ cells, at most $m$ cells are nonempty. Since the contribution of empty cells is null in the criterion (3), each evaluation of a data grid can be performed in $O(m)$ time owing to specific algorithmic data structures.
- The additivity of the criterion means that it can be decomposed on the hierarchy of the components of the models: variables (curve and dimensions), parts (clusters and intervals), cells. Using this additivity property, all the merges between adjacent parts can be evaluated in $O(m)$ time. Furthermore, when the best merge is performed, the only impacted merges that need to be reevaluated for the next optimization step are the merges that share points with the best merge. Since the dataset is sparse, the number of reevaluations of models is small on average.
- Finally, the algorithm starts from initial fine grained solutions containing at most $O(\sqrt{m})$ clusters. Specific preprocessing and postprocessing heuristics are exploited to locally improve the initial and final solutions of Algorithm 1 by moving curves across clusters and moving interval bounds. The post-optimization algorithms are applied alternately to each variable (curve or one dimension), for a frozen partition of the other variables. This keeps an $O(m)$ memory complexity and bounds the time complexity by $O(m \sqrt{m} \log m)$.

Sophisticated algorithmic data structures and algorithms are necessary to exploit these optimization principles and guarantee an $O(m \sqrt{m} \log m)$ time complexity for initial solutions exploiting at most $O(\sqrt{m})$ clusters of curves.

The optimized version of the greedy heuristic is time efficient, but it may fall into a local optimum. This problem is tackled using the variable neighborhood search (VNS) meta-heuristic [35], which mainly benefits from multiple runs of the algorithms with different random initial solutions. The first initial solution is generated with $\min(n, \sqrt[3]{m})$ random clusters of curves and $\sqrt[3]{m}$ random $X$ and $Y$ intervals in order to obtain $O(m)$ cells. Next initial solutions are generated in the neighborhood of previous local optima by randomly splitting clusters or intervals into subparts. In practice, the main heuristic described in Algorithm 1, with its guaranteed time complexity, is used to find a good solution as quickly as possible. The VNS meta-heuristic is exploited to perform anytime optimization: the more you optimize, the better the solution. In the experiments conducted throughout the paper, the anytime heuristic was used and stopped after 10 rounds of initialization.
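The anytime VNS loop described above can be sketched as follows (illustrative Python; initial_model, perturb and gbum_run are hypothetical caller-provided helpers, with perturb standing for the random splitting of clusters or intervals around the current best solution).

```python
def vns(initial_model, perturb, gbum_run, rounds=10):
    """Anytime VNS sketch: start from one random initial solution, then repeatedly
    perturb the best local optimum (e.g. by randomly splitting clusters or intervals),
    re-optimize with the greedy heuristic and keep the best model found so far."""
    best_model, best_cost = gbum_run(initial_model)
    for _ in range(rounds - 1):
        model, c = gbum_run(perturb(best_model))
        if c < best_cost:
            best_model, best_cost = model, c
    return best_model, best_cost
```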

The optimization algorithms summarized above have been extensively evaluated in [34], using a large variety of artificial datasets where the true data distribution is known. Overall, the method is both resilient to noise and able to detect complex fine grained patterns. It is able to approximate any data distribution, provided that there are enough instances in the train data sample.

3. Illustration on artificial data

Let us consider the two functions introduced in Section 2.2, on the domain of $x$ values [0, 1], with an additive white Gaussian noise $N(0, \sigma)$ and standard deviation $\sigma = 0.25$:

- $f_1$: $y = 1 + N(0, 0.25)$,
- $f_2$: $y = \cos(\pi x) + N(0, 0.25)$.

The conditional density $d(y \mid x)$ of $f_1$ and $f_2$ is drawn in Fig. 2.

Fig. 2. Two sample functions f1 and f2 drawn with their conditional density d(y|x).

Let us consider a collection of 10 curves generated using function $f_1$ and 10 curves with function $f_2$. We generate a dataset $\mathcal{P}$ of 20 000 points, on average 1000 per curve. Each point is a triple of values with a randomly chosen curve (among 20), a random $x$ value (on domain [0, 1]) and a $y$ value generated according to the function related to the curve.
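A minimal sketch of this data generation, assuming NumPy and an illustrative random seed, could read as follows (names are ours, not from the paper).

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed
n_curves, m_total, sigma = 20, 20000, 0.25

rows = []
for _ in range(m_total):
    c = rng.integers(n_curves)                     # randomly chosen curve among 20
    x = rng.uniform(0.0, 1.0)                      # random x on [0, 1]
    mean = 1.0 if c < 10 else np.cos(np.pi * x)    # f1 for the first 10 curves, f2 otherwise
    y = mean + rng.normal(0.0, sigma)              # additive white Gaussian noise N(0, 0.25)
    rows.append((f"curve_{c}", x, y))              # one (C, X, Y) triple per point
```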

We apply our functional data clustering method introduced in Section 2 on random subsets of $\mathcal{P}$ of increasing sizes. For very small sample sizes, there is not enough data to discover significant patterns, and our method produces one single cluster of curves, with one single interval for the $X$ and $Y$ variables. With only 5 points per curve on average, that is 100 points in the whole point dataset, our method recovers the underlying pattern and produces two clusters of 10 curves related to the $f_1$ and $f_2$ functions: the horizontal curves and the decreasing curves (cf. $f_1$ and $f_2$ in Fig. 3).


Fig. 3. Estimation of the conditional density d(y|x) and discovered clusters of curves, with 5 points per curve.

Fig. 4. Estimation of the conditional density d(y|x) and discovered clusters of curves, with 1000 points per curve.

2 Dataset available at http://www.math.univ-toulouse.fr/staph/npfda/npfda-datasets.html.


Our method is also a piecewise constant estimator of the joint probability p(C, X, Y) of the three variables C, X, Y, based on both the clusters of curves and the discretization of the point dimensions X and Y. In our sample, the C and X variables are both i.i.d. and independent. We thus have

$$p(Y \mid X, C) = \frac{p(C, X, Y)}{p(X, C)} = \frac{p(C, X, Y)}{p(X)\, p(C)} \propto p(C, X, Y).$$

For each cluster of curves, we have a piecewise constant estimation of the conditional probability $p(Y \mid C)$. Let us reuse the notation of Section 2.3, with $m_{i_C j_X j_Y}$ the number of points per cell $(i_C, j_X, j_Y)$ of the data grid, and $m_{i_C j_X} = \sum_{j_Y=1}^{k_Y} m_{i_C j_X j_Y}$ the number of points per cluster $i_C$ and interval $j_X$. We have

$$p(Y \in interval_{j_Y} \mid X \in interval_{j_X}, C \in cluster_{i_C}) = \frac{m_{i_C j_X j_Y}}{m_{i_C j_X}}.$$

We divide these estimated conditional probabilities by the width of $interval_{j_Y}$ to obtain conditional densities that we draw in Fig. 3, on the same basis as the true conditional densities pictured in Fig. 2.
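For concreteness, a minimal sketch of this conversion from grid counts to conditional densities is given below (illustrative Python under the notation of Section 2.3; the dictionary layout is an assumption of ours).

```python
def conditional_density(cell, y_bounds):
    """d(y in interval j_Y | x in interval j_X, cluster i_C) estimated from grid counts.
    cell[(i_C, j_X, j_Y)] = number of points in the cell;
    y_bounds[j_Y] = (lower, upper) bounds of the j_Y-th Y interval."""
    # m_{iC jX} = sum over j_Y of the cell counts for each (cluster, X interval) column.
    col_totals = {}
    for (i_c, j_x, j_y), count in cell.items():
        col_totals[(i_c, j_x)] = col_totals.get((i_c, j_x), 0) + count
    # Normalize within each column, then divide by the Y interval width.
    density = {}
    for (i_c, j_x, j_y), count in cell.items():
        width = y_bounds[j_y][1] - y_bounds[j_y][0]
        density[(i_c, j_x, j_y)] = (count / col_totals[(i_c, j_x)]) / width
    return density
```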

These estimations are very raw, with only two intervals for the X and Y variables, but they are obtained with only 100 points (5 points per curve) and provide a good summary of the underlying pattern: horizontal versus decreasing conditional density.

When our method is applied on a dataset of larger size, it still perfectly recovers the two clusters of curves and provides a refined version of the estimated conditional densities. With 1000 points per curve on average, that is 20 000 points in the whole point dataset, the conditional density estimator exploits a joint discretization of the X, Y variables with 9 intervals for X and 12 for Y. This estimation, drawn in Fig. 4, is a good approximation of the true conditional densities pictured in Fig. 2.

We also performed the same experiment with the subset of 10 curves related to the function $f_1: y = 1 + N(0, 0.25)$. Whatever the size of the dataset, the method always returns one single cluster, with one single interval for both the X and Y dimensions. The same experiment applied to the $f_2$ curves still produces one single cluster of curves, but several intervals for the X and Y dimensions. This experimentally confirms the theoretical analysis presented in Section 2.4: in the case of $f_1$, the three variables C, X, Y are independent, such that the shortest way to encode the data is to encode each variable independently. One striking benefit of our approach is its robustness: it never produces spurious clusters, whatever the sample size.

More results are available on a richer artificial dataset in [36], which demonstrate empirically that the method is able to approximate complex curve datasets provided that there is enough data, and never produces spurious clusters.

4. Experimental results on real data

In this section, we apply the proposed approach on two real datasets and show what kind of exploratory analysis can be performed.

4.1. Topex/Poseidon satellite

The first dataset² detailed in [37] consists of 472 waveforms registered by the Topex/Poseidon satellite around an area of 25 km upon the Amazon river, with a variability originating from the differences in the ground type. Each waveform is a curve measured at 70 points. Fig. 5 displays 10 curves chosen randomly from the dataset.

Fig. 5. Topex/Poseidon dataset: 10 sample curves.

Fig. 6. Topex/Poseidon dataset: all clusters.

The original data comes in a format with one waveform per row, containing the 70 measures. We first reformat the data as a three-dimensional dataset, as described in Section 2.2. We obtain a point dataset of 472 × 70 = 33 040 points with one curve variable, the instance of waveform, and two dimension variables, the measure index ($T \in \{1, 2, \ldots, 70\}$) and the measured value ($V \in \{0, 1, \ldots, 255\}$). We apply the proposed method and obtain 33 clusters, summarized by the conditional probability p(v|t) on a 7 × 4 bivariate discretization of the measure dimensions.

Fig. 6 displays a summary of all the clusters. They present a variety of forms: curves with one more or less heavy peak, curves with similar shapes but shifted peaks, and flat noisy curves with or without a stage.

Fig. 7 focuses on four examples of clusters, to better illustrate the kind of summary provided by our approach. Each cluster is summarized according to the conditional probability

$$p(V \in interval_{j_V} \mid T \in interval_{j_T}, C \in cluster_{i_C}) = \frac{m_{i_C j_T j_V}}{m_{i_C j_T}}.$$

All the curves assigned to each cluster are also drawn. The figure shows the richness of the information retrieved in the probability-based summary of the clusters, as well as the easy interpretation of the clusters. Although the individual curves are very noisy, the 7 × 4 bivariate discretization grid provides a simple summary, with a good global fit of all the curves in a cluster.



Fig. 7. Topex/Poseidon dataset: four examples of clusters.


Nonparametric estimation of probability distributions goes far beyond the standard regression-based approaches, where only the expectation of the curve target value is estimated. The method not only summarizes the mean shape of the curves in a cluster, but also the variability of the curves inside each cluster, without any assumption regarding constant Gaussian noise (as in the ordinary least squares regression technique) or any kind of parametric homoscedastic or heteroscedastic noise.

4.2. MNIST handwritten digits

The digit dataset³ detailed in [38] consists of 8-bit gray scale images of "0" through "9" digits. The dataset was originally designed for the classification task of handwritten digit recognition, with a train set of 60 000 examples and a test set of 10 000 examples. Each image is a 28 × 28 pixel box, with gray levels from 0 to 255. We consider each image as the picture of a curve, and chose to keep the pixels with gray level above 128 as belonging to the curves. We exploit both the initial train and test sets and obtain a curve dataset related to handwritten digits, with 70 000 curves represented on a two-dimensional space X, Y, leading to a point dataset $\mathcal{P}$ containing about 7.2 million points. Fig. 8 displays 100 curves chosen randomly from this curve dataset.
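A minimal sketch of this image-to-points conversion is given below (illustrative Python assuming a NumPy array of 28 × 28 gray-level images; names and the row/column orientation convention are ours).

```python
import numpy as np

def image_to_points(image, curve_id, threshold=128):
    """Turn one 28x28 gray-level image into (C, X, Y) points: one point per pixel
    whose gray level is above the threshold, with X the column and Y the row index."""
    ys, xs = np.nonzero(image > threshold)
    return [(curve_id, int(x), int(y)) for x, y in zip(xs, ys)]

# Hypothetical usage: `images` is an array of shape (n_images, 28, 28).
# point_dataset = [p for i, img in enumerate(images)
#                  for p in image_to_points(img, f"digit_{i}")]
```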

In this experiment, we apply our method as an exploratory analysis technique, and use the curve labels (digits) only in a second phase to evaluate the correlation between the clusters of curves and the curve labels. The interest of using this dataset for an exploratory analysis task is manifold:

- This is a large dataset, two orders of magnitude above the Topex/Poseidon satellite dataset. This provides a challenging benchmark to evaluate the scalability of an exploratory analysis technique.
- The curves are complex: they look closer to distributions of points than to functions. Any method assuming a functional relation between the point dimensions is likely to fail.
- Whereas this dataset has been extensively used for the classification task, it has received less attention for the task of exploratory analysis (see [39] for a deep learning approach). Many questions arise, related to the number and variety of "natural" patterns in this dataset, to the correlation between these patterns and the digits, to which digits exhibit a larger variety of shapes and whether they are harder to discriminate.
- As any educated person can be considered an expert in handwritten digit recognition, this alleviates the evaluation of the understandability of the results of the proposed exploratory analysis.

3 Dataset available at http://yann.lecun.com/exdb/mnist.

We apply the method presented in Section 2 and let the anytime heuristic run for one day, which corresponds to about 10 rounds of optimization of the main algorithm. We obtain 568 clusters, summarized on a bivariate grid of size 15 × 21. Since the curves are closer to point distributions than to functions, we focus on the joint probability $p(x, y \mid c)$ per curve rather than on the conditional probability. Let us reuse the notation of Section 2.3, with $m_{i_C j_X j_Y}$ the number of points for cell $(i_C, j_X, j_Y)$ of the data grid and $m_{i_C}$ the number of points for cluster $i_C$. We have

$$p(X \in interval_{j_X}, Y \in interval_{j_Y} \mid C \in cluster_{i_C}) = \frac{m_{i_C j_X j_Y}}{m_{i_C}}.$$

Fig. 9 displays a summary of a subset of 100 randomly chosen clusters. Each cluster summary is a representation of a joint distribution, which highlights the dense regions in the bidimensional X, Y space. Interestingly, the shapes in the joint distribution space are very close to digits, and even more readable than the original digit curves, such as those presented in Fig. 8.

Given this proximity of cluster summaries to digit shapes, we decide to reorganize the clusters according to their majority digit. Table 2 shows the number of clusters and the accuracy per digit. Fig. 10 shows all the clusters for six of the 10 digits, sorted by decreasing frequency of their majority digit.

Digit "1" is the easiest, with only 35 clusters and overall 98.0% of the curves assigned to the digit "1".


Fig. 8. Handwritten digits datasets: 100 sample curves.

Fig. 9. Handwritten digits datasets: 100 sample clusters with their X × Y density.

Table 2
Handwritten digits datasets: number of clusters and accuracy per digit.

Digit   Number of clusters   Accuracy (%)

0 65 96.7

1 35 98.0

2 76 96.4

3 74 94.5

4 44 96.1

5 52 95.1

6 51 96.9

7 47 97.4

8 78 84.8

9 46 83.7


The clusters are related to a few shapes, but with a variety of orientation, thickness and (slight) curvature. Digits "4", "5" and "7" are a bit more complex, with 44, 52 and 47 clusters and a probability of correct assignment of 96.1%, 95.1% and 97.4%. Digits "8" and "9" are clearly the most difficult, with 78 and 46 clusters and a probability of correct assignment of 84.8% and 83.7%. Digit "8" exhibits the largest variety of shapes. Although it always consists of two loops, the variety comes from the thickness of the curve itself, the width and orientation of the overall shape and the respective sizes of the upper and lower loops of the "8". It is noteworthy that in this representation space X, Y, without any preprocessing, the "8" shapes are not always close to each other and do not even share many pixels. Constraining the clustering technique by a maximum number of clusters may have blurred the summaries and hidden potentially informative insights.

To study the correlation between the clusters and the digits, we sort the clusters by decreasing majority frequency and report the probability of each digit inside the clusters.


Fig. 10. Handwritten digit datasets: clusters with their X × Y density, organized according to their majority digit, for digits "1", "4", "5", "7", "8", "9".

4 See http://yann.lecun.com/exdb/mnist for a synthesis with many results and reference papers.


The results are presented in Fig. 11. Each column represents all the clusters with the same majority digit, sorted by decreasing frequency of this digit. Each row represents the distribution of a given cluster on the digits. For example, the narrowest column "C1" reports the distribution of digits in each of the 35 clusters assigned to "1", which contain almost 100% of digit "1" (see row "D1"). The largest column "C8" reports the distribution of digits in the 78 clusters assigned to "8", which is a far more difficult digit. The last clusters have significant percentages of the other digits, with overall 0.9% of "0", 1.9% of "2", 4.5% of "3", 3.7% of "5", and 1.9% of "9". The most difficult digit is "9", but the errors are mainly related to two other digits, with 8.6% of "4" and 5.5% of "7".

Finally, we evaluate the clusters as a preprocessing method for digit classification, using a two-step method inspired by semi-supervised learning [40]. In a first step, all the train and test examples are used without the class labels (the digits) to train the unsupervised clustering model and produce the clusters. In a second step, each cluster is assigned to its majority class, using the train labels only. This class assignment is then used for prediction on the test examples. Using this protocol, where the test labels are ignored during the learning process, we obtain a test error rate of 5.9%.
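A minimal sketch of this two-step evaluation protocol is given below (illustrative Python; cluster_of denotes the hypothetical mapping from example to cluster produced by the unsupervised step).

```python
from collections import Counter, defaultdict

def evaluate_clusters(cluster_of, train_ids, train_labels, test_ids, test_labels):
    """Step 1: assign each cluster to its majority digit using train labels only.
    Step 2: predict each test example with the class of its cluster and report the error rate.
    cluster_of[example_id] -> cluster index (from the unsupervised clustering)."""
    votes = defaultdict(Counter)
    for ex, label in zip(train_ids, train_labels):
        votes[cluster_of[ex]][label] += 1
    majority = {c: counts.most_common(1)[0][0] for c, counts in votes.items()}
    errors = sum(majority.get(cluster_of[ex]) != label
                 for ex, label in zip(test_ids, test_labels))
    return errors / len(test_ids)
```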

Our reformatting of the initial handwritten digit dataset has kept only the pixels with gray level above 128, which results in some unknown information loss. Still, we can compare our test error rate with that of dedicated classification methods. This dataset has been widely used as a benchmark for tens of classification methods.⁴


Fig. 11. Handwritten digits datasets: distribution of digits per cluster.


The test error rate of a basic linear classifier (a one layer neural network) is 12.0% without preprocessing and 8.4% after deskewing. That of a k-nearest neighbors classifier with L2 norm and three neighbors is 5.0% without preprocessing and 2.4% after deskewing. Support vector machines were also tried with Gaussian or polynomial kernels with degree from 4 to 9; they obtained error rates as low as 0.8%. These first results were reported in a vast comparative benchmark [38] and have since been improved, down to a 0.4% error rate.

Our exploratory analysis method does not compete with the most sophisticated classifiers, but it performs remarkably well for a task it was not intended for and with an incomplete representation. Furthermore, it brings many informative insights w.r.t. the natural patterns in the dataset, which may suggest new preprocessing or normalization techniques to reduce the number of patterns and facilitate the task of digit recognition.

Overall, this last experiment on the MNIST handwritten digit dataset has demonstrated the scalability of our approach and its ability to process a complex curve dataset and discover potentially interesting patterns.

5. Conclusion

In this paper, we have presented a new exploratory analysis method for functional data. Instead of considering a functional dataset as a data sample where the curves are the observations with a variable-length representation, we chose to work with the fixed-size point dataset whose instances are the curve points and whose variables are both a "curve identifier" categorical variable and the numerical point variables. The MODL approach based on data grid models introduced in [22] is applied to the case of functional datasets. By clustering the curves and discretizing each point variable, the method behaves as a nonparametric estimator of the joint density of both the curve and point variables. Experiments on two medium size and large real datasets show the benefits of the approach, which brings new insights to the exploratory analysis task, such as discovering the "natural" granularity of the point variables, the number of clusters, and the estimation of the joint density of the point dimensions for each cluster of curves, with potentially multi-modal behavior, beyond the usual functional assumption.

Most alternative functional data analysis methods rely on strong assumptions, such as simple trends, smoothness or equally spaced observations, and require parameter tuning regarding the choice of the basis functions in the parametric case or the choice of the kernel parameters or the distance in the nonparametric case. On the contrary, our approach is both nonparametric, since it can fit any functional data and even density data, and parameter-free, since no user parameter is required. The main originality of the modeling approach is that it is data dependent and non-asymptotic in essence: it aims at modeling the finite functional sample directly. The modeling task is then easier, with a finite modeling space and model priors which essentially reduce to counting.

Interestingly, whereas the primary purpose of the method is density estimation, it comes with insightful byproducts such as the clustering of the curves and the discretization of the curve dimensions, which reduces the dimensionality of the curves to a fixed-size space. As the method automatically infers the optimal granularity of the clustering, it tends to build more and more clusters as the amount of data increases, up to potentially one cluster per curve. In future work, we plan to derive a similarity from the model evaluation criterion and organize the clusters into a hierarchy in order to alleviate the exploratory analysis task.

Appendix A. Supplementary material

Supplementary data associated with this article can be found in the online version at http://dx.doi.org/10.1016/j.patcog.2012.05.016.

References

[1] D. Bosq, Linear Processes in Function Spaces: Theory and Applications (Lecture Notes in Statistics), Springer, 2000.
[2] J. Ramsay, B. Silverman, Applied Functional Data Analysis: Methods and Case Studies, Springer-Verlag Inc., 2002.
[3] J. Ramsay, B. Silverman, Functional Data Analysis, Springer Series in Statistics, Springer, 2005.
[4] T. Tarpey, K. Kinateder, Clustering functional data, Journal of Classification 20 (2003) 93–114.
[5] J. Peng, H. Muller, Distance-based clustering of sparsely observed stochastic processes, with applications to online auctions, The Annals of Applied Statistics 2 (3) (2008) 1056–1077.
[6] L. Sangalli, P. Secchi, S. Vantini, V. Vitelli, K-mean alignment for curve clustering, Computational Statistics & Data Analysis 54 (5) (2010) 1219–1233.
[7] R. Agrawal, C. Faloutsos, A. Swami, Efficient similarity search in sequence databases, in: D. Lomet (Ed.), Proceedings of the 4th International Conference of Foundations of Data Organization and Algorithms (FODO), Springer Verlag, Chicago, IL, 1993, pp. 69–84.
[8] K. Chan, A. Fu, Efficient time series matching by wavelets, in: ICDE, 1999, pp. 126–133.
[9] F. Chamroukhi, A. Same, G. Govaert, P. Aknin, A hidden process regression model for functional data description. Application to curve discrimination, Neurocomputing 73 (7–9) (2010) 1210–1221.
[10] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning, Springer, 2001.
[11] C. Deboor, A Practical Guide to Splines, Springer, 2001.
[12] C. Abraham, P. Cornillon, E. Matzner-Lober, N. Molinari, Unsupervised curve clustering using B-splines, Scandinavian Journal of Statistics 30 (3) (2003) 581–595.

[13] P. Smyth, Clustering sequences with hidden Markov models, in: M. Mozer, M. Jordan, T. Petsche (Eds.), Advances in Neural Information Processing Systems, vol. 9, The MIT Press, 1997, pp. 648–654.
[14] P. Smyth, Probabilistic model-based clustering of multivariate and sequential data, in: Proceedings of Artificial Intelligence and Statistics, 1999, pp. 299–304.
[15] F. Rossi, B. Conan-Guez, A.E. Golli, Clustering functional data with the SOM algorithm, in: Proceedings of the ESANN, 2004, pp. 305–312.
[16] G. Hebrail, B. Hugueney, Y. Lechevallier, F. Rossi, Exploratory analysis of functional data via clustering and optimal segmentation, Neurocomputing 73 (7–9) (2010) 1125–1141.
[17] F. Ferraty, P. Vieu, Nonparametric Functional Data Analysis: Theory and Practice, Springer Verlag, 2006.
[18] C. Crambes, L. Delsol, A. Laksaci, Robust nonparametric estimation for functional data, Journal of Nonparametric Statistics 20 (7) (2008) 573–598.
[19] T. Gasser, P. Hall, B. Presnell, Nonparametric estimation of the mode of a distribution of random curves, Journal of the Royal Statistical Society 60 (1998) 681–691.
[20] G. Delaigle, P. Hall, Defining probability density for a distribution of random functions, Annals of Statistics 38 (2) (2010) 1171–1193.


[21] T.W. Liao, Clustering of time series data—a survey, Pattern Recognition 38 (2005) 1857–1874.
[22] M. Boulle, Data Grid Models for Preparation and Modeling in Supervised Learning, Microtome Publishing, 2011, pp. 99–130.
[23] A. Tikhonov, V. Arsenin, Solution of Ill-Posed Problems, John Wiley & Sons, 1977.
[24] J. Ramsay, Kernel smoothing approaches to nonparametric item characteristic curve estimation, Psychometrika 56 (4) (1991) 611–630.
[25] L. Devroye, L. Gyorfi, G. Lugosi, A Probabilistic Theory of Pattern Recognition, Springer, 1996.
[26] U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, Knowledge discovery and data mining: towards a unifying framework, in: KDD, 1996, pp. 82–88.
[27] M. Boulle, Multivariate Data Grid Models for Supervised and Unsupervised Learning, Technical Report NSM/R&D/TECH/EASY/TSI/5/MB, France Telecom R&D, http://perso.rd.francetelecom.fr/boulle/publications/BoulleNTTSI5MB08.pdf, 2008.
[28] P. Chapman, J. Clinton, R. Kerber, T. Khabaza, T. Reinartz, C. Shearer, R. Wirth, CRISP-DM 1.0: Step-by-Step Data Mining Guide, 2000.
[29] M. Boulle, Recherche d'une representation des donnees efficace pour la fouille des grandes bases de donnees, Ph.D. Thesis, Ecole Nationale Superieure des Telecommunications, http://perso.rd.francetelecom.fr/boulle/publications/BoulleThesis07.pdf, 2007.
[30] M. Abramowitz, I. Stegun, Handbook of Mathematical Functions, Dover Publications Inc., New York, 1970.
[31] C. Shannon, A Mathematical Theory of Communication, Technical Report 27, Bell Systems Technical Journal, 1948.
[32] J. Rissanen, Modeling by shortest data description, Automatica 14 (1978) 465–471.
[33] T. Cover, J. Thomas, Elements of Information Theory, Wiley-Interscience, New York, NY, USA, 1991.
[34] M. Boulle, Bivariate Data Grid Models for Supervised Learning, Technical Report NSM/R&D/TECH/EASY/TSI/4/MB, France Telecom R&D, http://perso.rd.francetelecom.fr/boulle/publications/BoulleNTTSI4MB08.pdf, 2008.

[35] P. Hansen, N. Mladenovic, Variable neighborhood search: principles and applications, European Journal of Operational Research 130 (2001) 449–467.
[36] M. Boulle, A Parameter-Free Method for Clustering Functional Data, Technical Report No FT/RD/TECH/11/01/76, France Telecom R&D, 2011.
[37] F. Frappart, S. Calmant, M. Cauhope, F. Seyler, A. Cazenave, Preliminary results of Envisat RA-2-derived water levels validation over the Amazon Basin, Remote Sensing of Environment 100 (2) (2006) 252–264.
[38] Y. Lecun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, in: Proceedings of the IEEE, vol. 86, 1998, pp. 2278–2324.
[39] D. Erhan, A. Courville, Y. Bengio, P. Vincent, S. Bengio, Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research 11 (2010) 625–660.
[40] O. Chapelle, B. Scholkopf, A. Zien, Semi-Supervised Learning, MIT Press, Cambridge, MA, 2006.

Marc Boulle was born in 1965 and graduated from Ecole Polytechnique (France) in 1987 and Sup Telecom Paris in 1989. Currently, he is a Senior Researcher in the data mining research group of France Telecom R&D. His main research interests include statistical data analysis and data mining, especially data preparation and modeling for large databases. He developed regularized methods for feature preprocessing, feature selection and construction, and model averaging of selective naive Bayes classifiers and regressors.