
Unsupervised Classification and Analysis of Objects Described by Nonparametric Probability Distributions

Richard Emilion

MAPMO Laboratory, Orléans University, Orléans, France

Received 9 December 2010; revised 26 June 2012; accepted 2 August 2012
DOI: 10.1002/sam.11160

Published online in Wiley Online Library (wileyonlinelibrary.com).

Abstract: Various objects can be summarily described by probability distributions: groups of raw data, paths of stochastic processes, neighborhoods of an image pixel and so on. Dealing with nonparametric distributions, we propose a method for classifying such objects by estimating a finite mixture of Dirichlet distributions when the observed distributions are assumed to be outcomes of a finite mixture of Dirichlet processes. We prove the consistency of such a classification by using the mutual singularity of two distinct Dirichlet processes and the martingale convergence theorem. Moreover, this consistency allows us to use some standard data analysis and statistical methods for analyzing the class labels of these objects. © 2012 Wiley Periodicals, Inc. Statistical Analysis and Data Mining, 2012

Keywords: classification; Dirichlet distributions; Dirichlet processes; EM-type algorithms; mixtures; martingale convergence theorem; random distributions

1. INTRODUCTION

Analyzing a data set with a huge number of records described by a simple numerical/categorical vector can turn out to be of great complexity, as is the case in many data mining problems. However, sometimes it can be more meaningful to group the records into new units according to some interesting user specifications. Analyzing these new units depends on their description, which can be, for example, just the mean of the attribute within the unit but also its range or its probability distribution. Although choosing the mean as a description of a new unit has the advantage of still dealing with a simple attribute on which usual statistical methods can work, it is clear that a lot of information may be lost, mainly when the variance within the new unit is large. On the other hand, choosing the range or the probability distribution as descriptors requires some new analysis methods which have to deal with more complex descriptions.

This last point is the purpose of what is called symbolic data analysis as introduced by Diday [1] (see also ref. 2).

This study is concerned with the probabilistic classification of objects described by probability distributions. This problem is very popular when the objects are described by real vectors, by estimating, for example, Gaussian (or exponential family) finite mixtures [3], the estimation generally being obtained by using the famous expectation maximization (EM) algorithm [4] or its variants stochastic expectation maximization (SEM) [5], stochastic approximation expectation maximization (SAEM) [6] and Monte Carlo expectation maximization (MCEM) [7]. Here, the problem is more complex as the descriptions are themselves distributions which are in addition assumed to be nonparametric, i.e. not belonging to any specific parametric family of distributions.

Correspondence to: Richard Emilion (richard.emilion@univ-orleans.fr)

This answers a question posed by Diday [8] that we summarily formalize as follows. Consider a sample of n probability measures di, i = 1, . . . , n, on a fixed measurable space, say V. Briefly, di ∈ P(V), where P(V) is defined as the set of all probability measures on V. The first point of the problem consists of proposing a family of parametric probability measures on P(V), say (D(α))α, α denoting a parameter to be specified. The second point consists of assuming that the di's are outcomes of a random variable (r.v.), say X, taking values in P(V) and having for its distribution a finite mixture

\[ \sum_{r=1}^{L} p_r\, D(\alpha_r), \qquad p_r > 0, \quad \sum_{r=1}^{L} p_r = 1, \qquad (1) \]


i.e. a finite convex combination of elements of the above family. The third and main point consists of proposing a method for estimating, given the di's, the parameters of the finite mixture, namely the weights pr and the parameters αr.

The term classification is used here because a r.v. C taking values in {1, . . . , L}, often called the class variable, can actually be introduced, r ∈ {1, . . . , L} denoting the label of class r. It is assumed that the proportion of objects in class r is pr and that the distribution of X within class r is the component D(αr), which rigorously can be written as

\[ \begin{cases} P(C = r) = p_r \\ P(X \mid C = r) = D(\alpha_r) \end{cases} \qquad (2) \]

where P(X|C = r) stands for the conditional distribution of X given that C takes the value r. Obviously, if X and C satisfy Eq. (2) then necessarily the distribution of X is given by Eq. (1).

The term unsupervised means that the class variable C is actually unobserved or latent: we just observe the di's but not their class. Therefore, the problem of estimating the parameters of mixture (1) is a particular case of the general problem of statistical inference and parametric estimation for incomplete data. We propose here a solution using the notion of random distribution (RD) with the celebrated Dirichlet process example.

The very interesting notion of RD is used, for example, in a paper of Kingman introducing the famous Poisson-Dirichlet RD [9]. Indeed, in that nice paper, Kingman recalls many previous papers where 'people are interested in describing the probability distribution of objects which are themselves probability distributions'. This is clearly our case.

The use of the Dirichlet process in what is called Bayesian classification first appeared in 1996, in the case of real vector data [10]. It was followed by, e.g., Brunner, Chan and Lo [11], Ishwaran and Zarepour [12], Ishwaran and James [13]. The case of functional data appears in refs 14 and 15 and the case of stochastic process paths in ref. 16. All these methods were illustrated with applications in various domains.

In our case, we deal with observations which are nonparametric probability distributions. This was also the case of Diday and Vrac [8] who used copula models, but their estimation procedure was done only in two dimensions. Our present approach using Dirichlet processes allows us to handle the general case, including a consistency theorem as the dimension increases to infinity, by using the mutual singularity of two distinct Dirichlet processes (a consequence of Korwar and Hollander [17]) and Doob's martingale convergence theorem [18]. This consistency study is crucial as the distributions di, i = 1, . . . , n, are not actually observable; they are just estimable through estimators in finite dimension.

The paper is organized as follows. In Section 2, we briefly present various examples of objects described by probability distributions, on which our classification method was tested. In Section 3, we explain our classification method based on Dirichlet finite mixtures, and in Section 4, we prove the consistency of the classification. An illustration is presented in Section 5 using a simulated dataset and a real one. The concluding section contains some possible applications of this consistent classification and a possible extension to the multihistogram case.

2. OBJECTS DESCRIBED BY PROBABILITY DISTRIBUTIONS

2.1. Local Times

Consider a time interval [0, T] and a stochastic process X = (Xt)0≤t≤T supposed to have a local time

\[ L(T, x) = \lim_{\varepsilon \downarrow 0} \frac{1}{2\varepsilon} \int_0^T \mathbf{1}_{(x-\varepsilon,\, x+\varepsilon)}(X_s)\, ds \]

which measures the amount of time the process spends in the neighborhood of x. If X is sampled at regular time intervals, say h, 2h, 3h, . . ., then the simplest estimator of the function L(T, ·)/T is the histogram of the sampled values taken by X. The observed paths are summarily described by such histograms, and the classification of these histograms will be considered as an informative classification of the paths, although the temporal dynamic aspect has been summarized. We have tested this method in various contexts which are briefly presented in the three following sections.
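As a minimal illustration (not code from the paper), this histogram estimator of L(T, ·)/T can be computed from a sampled path in a few lines of R; the simulated random walk, the step h and the number of bins K are arbitrary choices:

```r
# Sketch: describing a sampled path by the histogram of its values,
# i.e. the simplest estimator of L(T, .)/T mentioned above.
set.seed(1)
h <- 0.01; Tmax <- 10
x <- cumsum(rnorm(Tmax / h, sd = sqrt(h)))      # sampled values X_h, X_2h, ..., X_T
K <- 20
breaks <- seq(min(x), max(x), length.out = K + 1)
occupation <- hist(x, breaks = breaks, plot = FALSE)$counts / length(x)
# 'occupation' is the probability vector describing this path: the fraction
# of sampled times spent in each bin.
```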

2.2. Internet Flows

An Internet flow is the traffic of packets of information from a given source node of the Internet network to a given destination node. In ref. 19, a flow at a specific point is described by its bandwidth (or throughput) at time t, which is defined as the number of packets transmitted during a short time interval (t − h, t] with fixed length h. In ref. 19, h = 5 min was considered and the monitoring was performed during T = 24 h over 2000 active flows. Each flow was then described by the bandwidth histogram erected over K = 12 suitable adjacent intervals, also called bins.

2.3. Solar Radiation Sequences

In ref. 20, daily solar radiation was monitored at regular time intervals with h = 1 second at a specific geographic point. Each daily sequence of measurements from 6 a.m. to 6 p.m. was described by the clearness index histogram with K = 20 suitable bins. Measurements over 1 year then yield 365 histograms.

2.4. Wind Speed Sequences

In ref. 21, wind speed modulus (m/s) was monitored at regular time intervals with h = 1 second at a specific geographic point. Each sequence of measurements during a window of T = 10 min was described by the measurements histogram with K = 11 suitable bins. Sliding windows with a 1-s sliding step then yield around 1,000,000 histograms.

2.5. Local Histograms in an Image

In ref. 22, section 6.2, any neighborhood of an image pixel provided a histogram of the gray levels within this neighborhood. An image segmentation algorithm that uses the classification of such histograms was then proposed. Notice that histograms of gray levels can be replaced by those of the norm of the gradient vector or by those of any other pixel description.

2.6. Groups of Raw Data

Notice that in Sections 2.2–2.5, raw data of measurements were grouped according to some statistical units (time window, pixel neighborhood) fixed by the user. Such units can however also appear naturally in daily life situations.

2.6.1. Administrative districts

In ref. 23, UK residents, described by some variables such as age (6 ranges), origin (4 ranges), accommodation type (4 ranges) and so on, are grouped according to 406 administrative districts, the new units, which are described by the histograms of the preceding variables.

2.6.2. Age groups of patients

In ref. 2 (Table 3.10, p. 96), six age groups of female patients are each described by the histogram of the group patient weights. Many other examples of objects described by probability distributions can be found in ref. 2.

3. FINITE DIRICHLET MIXTURES

In all the sequel, the following notations will be adopted. Let (Ω, F, P) be a fixed probability space. Any r.v. mentioned hereafter will be defined on Ω. Recall that the distribution PX of a r.v. X taking values in a measurable space is a probability measure on the latter, X ∼ PX. Also recall that a support of a probability measure μ is any measurable set of probability 1. As two supports are μ-almost equal, any support will be denoted by Supp(μ). On the other hand, assume that μ is defined on a set, say Y; then it is said that a property which depends on y ∈ Y, say Property(y), holds for μ-almost all (a.a.) y ∈ Y, if μ{y : Property(y) is false} = 0, written as

Property(y) holds for μ-a.a. y ∈ Y.

The objects to be classified will be described by probability distributions on a set V of values which will be assumed to be a fixed interval of R^d, without loss of generality. In the examples of Section 2, we have V = [0, M] where M is the maximal bandwidth in the case of Internet flows, M = 1 for the clearness index in the case of daily solar radiations, M is the maximal wind speed in the case of wind sequences, and so on.

3.1. Random Distributions

Let P(V) denote the set of all probability measures defined on the Borel σ-algebra BV of V and consider the mappings

\[ \varphi_A : \mathcal{P}(V) \to [0, 1] \qquad (3) \]
\[ P \mapsto P(A) \qquad (4) \]

for any Borel set A ⊆ [0, 1]. On P(V), consider the σ-algebra

\[ \mathcal{G} = \sigma\{\varphi_A,\ A \ \text{Borelian} \subseteq [0, 1]\} \qquad (5) \]

generated by the mappings ϕA, i.e. the smallest σ-algebra of P(V) making any ϕA measurable. Notice that in the case of a finite set, say V = {1, . . . , K}, K ≥ 2, then P(V) can be identified as the simplex

\[ T_K = \Big\{ y = (y_1, \ldots, y_K),\ y_j \geq 0,\ \sum_{j=1}^{K} y_j = 1 \Big\}. \qquad (6) \]

Also, let SK denote the set

\[ S_K = \Big\{ y = (y_1, \ldots, y_{K-1}),\ y_j \geq 0,\ \sum_{j=1}^{K-1} y_j \leq 1 \Big\}. \qquad (7) \]

From ref. 9, we have the following definition:


DEFINITION 1: An RD is a measurable mapping from (Ω, F) to (P(V), G).

A consequence of this definition is that the distribution of an RD is a probability measure on P(V). A standard example when V is a finite set is the Dirichlet distribution. For an interesting recent survey on Dirichlet distributions and Dirichlet processes see ref. 24.

3.2. Dirichlet Distributions D(α1, . . . , αK)

Let α = (α1, . . . , αK) be a vector of positive numbers and let

\[ X_j \overset{\text{ind}}{\sim} \gamma(\alpha_j, 1), \qquad j = 1, \ldots, K, \]

be K independent real random variables with gamma distributions. Then, clearly the random vector

\[ X = \left( \frac{X_1}{\sum_{j=1}^{K} X_j}, \ \ldots, \ \frac{X_K}{\sum_{j=1}^{K} X_j} \right) \qquad (8) \]

takes values in TK and, from (6), is an RD when V = {1, . . . , K}, K ≥ 2.

DEFINITION 2: The distribution of the RD X defined by Eq. (8) is a probability measure on TK known as the Dirichlet distribution D(α1, . . . , αK) in dimension K.
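The construction (8) translates directly into a few lines of R; this is a standard simulation sketch rather than code from the paper (the function name rdirichlet_gamma is ours):

```r
# Sketch: drawing m vectors from D(alpha_1, ..., alpha_K) by normalizing
# independent gamma variables, exactly as in Eq. (8).
rdirichlet_gamma <- function(m, alpha) {
  K <- length(alpha)
  g <- matrix(rgamma(m * K, shape = rep(alpha, each = m), rate = 1), nrow = m)
  g / rowSums(g)                      # each row lies in the simplex T_K
}
y <- rdirichlet_gamma(5, c(2, 3, 5))  # five draws from D(2, 3, 5)
rowSums(y)                            # all equal to 1
```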

This distribution has many very interesting properties and characterizations. Notice that since TK is of Lebesgue measure 0 in R^K, the Dirichlet distribution has no density, but it can be seen that the random vector

\[ \left( \frac{X_1}{\sum_{j=1}^{K} X_j}, \ \ldots, \ \frac{X_{K-1}}{\sum_{j=1}^{K} X_j} \right) \]

has the following density on SK, known as the Dirichlet density

\[ D(y \mid \alpha) = \frac{\Gamma(\alpha_1 + \ldots + \alpha_K)}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_K)}\; y_1^{\alpha_1 - 1} \cdots y_{K-1}^{\alpha_{K-1} - 1} \Big( 1 - \sum_{j=1}^{K-1} y_j \Big)^{\alpha_K - 1} I_{S_K}(y) \qquad (9) \]

where I_{S_K} is the indicator function of the set SK.

The Dirichlet density completely determines the Dirichlet distribution as the last component \( X_K / \sum_{j=1}^{K} X_j \) is equal to 1 minus the sum of the other components.

3.3. Dirichlet Processes D(α)

The generalization of Dirichlet distributions when the measurable set V need not be finite is due to Ferguson [25]. By convention, a measurable partition B1, . . . , BK is a partition such that each Bi, i = 1, . . . , K, is a measurable subset.

DEFINITION 3: Let α be a finite nonnegative measure on V. An RD X : Ω → P(V) is a Dirichlet process with parameter α, X ∼ D(α), if for every K = 2, 3, . . . and every measurable partition B1, . . . , BK of V, the random vector (X(B1), . . . , X(BK)) has a K-dimensional Dirichlet distribution D(α(B1), . . . , α(BK)) with parameters α(B1), . . . , α(BK).

The distribution D(α) of a Dirichlet process X is therefore a probability measure on P(V). Ferguson [25] proved the existence of such RDs by using the Kolmogorov extension theorem [26]. In the finite case where V = {1, . . . , K}, K ≥ 2, since a finite measure α on V is determined by the numbers α1 = α({1}), . . . , αK = α({K}), it is seen that D(α) coincides with D(α1, . . . , αK).

Among the numerous properties of Dirichlet processes, we recall the following from Korwar and Hollander, Theorem 2.3 and Theorem 2.5 in ref. 17 (pages 708 and 709, respectively). They will be used to prove our main result in Section 4.3.

THEOREM 1: For two distinct nonatomic nonnegative finite measures α and β, two Dirichlet processes D(α) and D(β) have disjoint supports.

3.4. Finite Mixtures

A finite Dirichlet mixture is a finite convex combination of Dirichlet processes

\[ \sum_{r=1}^{L} p_r\, D(\alpha_r). \qquad (10) \]

The Dirichlet processes D(αr) appearing in Eq. (10) are called the components of the mixture, the numbers pr which satisfy pr > 0 and \( \sum_{r=1}^{L} p_r = 1 \) are the weights of the components, while the integer L ≥ 2 is called the number of components of the mixture.

Let X be a r.v. having distribution (10), X ∼ \( \sum_{r=1}^{L} p_r D(\alpha_r) \), for a given L but with unknown mixture parameters (i.e., the weights pr and the αr of (10)). Given some observed outcomes of X (which therefore are distributions), the mixture estimation problem consists of estimating the mixture parameters.


This estimation problem has a classification aspect, by introducing an unobserved/latent class variable (or mixing variable), say C : Ω → {1, . . . , L}, and the following hierarchical model

\[ \begin{cases} P(X \mid C = r) = D(\alpha_r) \\ P(C = r) = p_r \end{cases} \qquad (11) \]

so that the prior probability of class r is pr, while the Bayes formula shows that the posterior probability of class r given the observation X is equal to

\[ t_{rB} = P(C = r \mid X \in B) = \frac{p_r\, D(\alpha_r)(B)}{\sum_{k=1}^{L} p_k\, D(\alpha_k)(B)} \qquad (12) \]

for any B ∈ G.

An early and very nice study of general mixtures of Dirichlet processes when the class variable C need not be discrete was done by Antoniak [27]. More recent and complex mixture models can be found in refs 12, 13, 28–30.

3.5. EM-Type Algorithms and Dirichlet Distributions

A mixture of finite-dimensional Dirichlet distributions is identifiable and can be estimated by applying an estimation method of Dirichlet densities [31,32] combined with EM-type algorithms such as EM [4], SEM [5], SAEM [6] and MCEM [7], since such densities are seen to belong to the exponential family, i.e. they can be written as

\[ D(y \mid \theta) = C(\theta)\, e(y)\, \exp(\langle \theta, b(y) \rangle) \]

where the parameter θ is a vector, e and b are fixed but arbitrary functions and C(θ) is a normalizing factor.
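For instance, the Dirichlet density (9) fits this form with the following identification (our own rewriting, where yK stands for 1 − Σ_{j<K} yj):

\[ D(y \mid \alpha) = C(\alpha)\, e(y)\, \exp\big(\langle \alpha, b(y) \rangle\big), \qquad C(\alpha) = \frac{\Gamma(\alpha_1 + \cdots + \alpha_K)}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_K)}, \]
\[ e(y) = I_{S_K}(y) \prod_{j=1}^{K} y_j^{-1}, \qquad \theta = \alpha, \qquad b(y) = (\log y_1, \ldots, \log y_K). \]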

4. CONSISTENT CLASSIFICATION

Our consistency study is based on an arbitrary sequence of refined partitions, which induces a filtration and a martingale.

4.1. Partitions, Filtration and Martingale

Let σl, l = 1, 2, . . . , be a sequence of finite partitions of V into |σl| nonatomic intervals, respectively, say

σl = (σl,1, . . . , σl,|σl|).

Assume that σl+1 strictly refines σl, i.e. each interval of σl is the disjoint union of intervals of σl+1, and also assume that the union of all partitions σl generates the Borelian σ-algebra of V. A standard example of such a sequence of partitions is obtained with the following dyadic intervals:

\[ \sigma_l = \left\{ \left[ \frac{j-1}{2^l}, \frac{j}{2^l} \right),\ j = 1, \ldots, 2^l - 1,\ \left[ \frac{2^l - 1}{2^l}, 1 \right] \right\}, \qquad l = 1, 2, \ldots, \]

but we may deal with more general sequences than dyadic partitions.

For any P ∈ P(V), we will denote by P(σl) the probability vector

\[ P(\sigma_l) = (P(\sigma_{l,1}), \ldots, P(\sigma_{l,|\sigma_l|})) \in T_{|\sigma_l|}. \]

Consider the σ-algebra G of P(V) defined in Eq. (5) and its sub-σ-algebras Gl generated by the mappings ϕσl,j, j = 1, . . . , |σl|, or equivalently by the mappings

\[ \varphi_{\sigma_l} : P \mapsto P(\sigma_l) \]

from P(V) to T|σl|, i.e.

\[ \mathcal{G}_l = \sigma\{\varphi_{\sigma_{l,j}},\ j = 1, \ldots, |\sigma_l|\} = \sigma(\varphi_{\sigma_l}). \qquad (13) \]

Clearly, the hypotheses on the sequence of partitions imply that (Gl)l is a filtration of G, i.e.

\[ \mathcal{G}_l \subseteq \mathcal{G}_{l+1}, \qquad \sigma\big(\cup_{l=1}^{\infty} \mathcal{G}_l\big) = \mathcal{G}. \qquad (14) \]

Consider now two probability measures T and U on (P(V), G) and their respective restrictions Tl and Ul to the sub-σ-algebra Gl. The following well-known result is easily proved:

PROPOSITION 1: If Tl is absolutely continuous with respect to (w.r.t.) Ul for any l = 1, 2, . . ., then the sequence of the Radon–Nikodym derivatives \( \big(\frac{dT_l}{dU_l}\big)_l \) is an L1(P(V), G, U)-bounded nonnegative martingale w.r.t. the filtration (Gl)l.

Proof. For any G ∈ Gl, we have by definition

\[ T_l(G) = \int I_G\, \frac{dT_l}{dU_l}\, dU_l = \int I_G\, \frac{dT_l}{dU_l}\, dU. \]

However, as G ∈ Gl+1, we also have

\[ T_{l+1}(G) = \int I_G\, \frac{dT_{l+1}}{dU_{l+1}}\, dU, \]

so that Tl+1(G) = Tl(G) implies

\[ \int I_G\, \frac{dT_{l+1}}{dU_{l+1}}\, dU = \int I_G\, \frac{dT_l}{dU_l}\, dU \]

which yields the martingale property

\[ E_U\Big( \frac{dT_{l+1}}{dU_{l+1}} \,\Big|\, \mathcal{G}_l \Big) = \frac{dT_l}{dU_l} \]

where EU(·|Gl) denotes the conditional expectation on Gl w.r.t. the probability measure U. □

4.2. Classification w.r.t. a Partition

Consider n objects, each object i being described by a probability distribution di, i = 1, . . . , n, on V = [0, 1]. We assume that these distributions are an observed sample from an RD X : Ω → P(V) having for distribution a mixture of Dirichlet processes

\[ X \sim \sum_{r=1}^{L} p_r\, D(\alpha_r) \qquad (15) \]

where the parameters αr are distinct nonatomic nonnegative finite measures on V = [0, 1]. This means that di = X(i)(ω), i = 1, . . . , n, for some ω ∈ Ω, where

\[ X^{(i)} \overset{\text{i.i.d.}}{\sim} P_X = \sum_{r=1}^{L} p_r\, D(\alpha_r). \]

Let Supp(D(αr)) be the support of the component D(αr). The definition of Dirichlet processes and Eq. (15) then imply that

\[ d_i(\sigma_l) = (d_i(\sigma_{l,j}))_{j=1,\ldots,|\sigma_l|}, \qquad i = 1, \ldots, n, \]

is a sample from a finite-dimensional RD with distribution

\[ \sum_{r=1}^{L} p_r\, D(\alpha_r(\sigma_{l,1}), \ldots, \alpha_r(\sigma_{l,|\sigma_l|})) \qquad (16) \]

which is a finite mixture of Dirichlet distributions in dimension |σl| that can be estimated by using the algorithms mentioned in Section 3.5.

This is the classification method used in the examples mentioned in Section 2 and we refer to refs 19–22 for the details and the results obtained in each case. We briefly report that our method identified four classes of Internet flows, four classes of daily solar radiations, three classes of wind sequences and four classes of pixels.
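Concretely, the vectors di(σl) are nothing but histogram probability vectors computed on the chosen partition. A minimal sketch (our own illustration, with the dyadic partition of V = [0, 1] at level l and simulated raw data standing in for the real measurements):

```r
# Sketch: building the n x |sigma_l| table of binned distributions d_i(sigma_l)
# from raw measurements, with the dyadic partition of [0, 1] at level l.
set.seed(2)
n <- 100; r <- 500; l <- 5
raw <- matrix(rbeta(n * r, 3, 5), nrow = n)     # placeholder raw data, one row per object
breaks <- seq(0, 1, by = 1 / 2^l)               # sigma_l: 2^l dyadic intervals
d_sigma <- t(apply(raw, 1, function(x)
  hist(x, breaks = breaks, plot = FALSE)$counts / length(x)))
dim(d_sigma)   # n x 2^l, each row a probability vector in T_{2^l}
# These rows are the input of the finite Dirichlet mixture estimation (16).
```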

4.3. Consistency

The question we address now concerns the behavior of the classification method described in Section 4.2, i.e. the estimation of the parameters in Eq. (16), if we refine the partitions, dealing with σl+1 instead of σl, and more generally letting l → +∞. The consistency of the classification depends on the following result.

THEOREM 2: Let X be an RD as in Eq. (15). Let pk,l be some weights such that there exists a positive constant c > 0 with

\[ c \leq p_{k,l} \leq 1 \quad \text{for any } k, l = 1, 2, \ldots. \qquad (17) \]

Then,

\[ t_{r,l,P} = \frac{p_{r,l}\, D(P(\sigma_l) \mid \alpha_r(\sigma_l))}{\sum_{k=1}^{L} p_{k,l}\, D(P(\sigma_l) \mid \alpha_k(\sigma_l))} \qquad (18) \]

satisfies

\[ \lim_{l \to \infty} t_{r,l,P} = 1 \quad \text{for } P_X\text{-a.a. } P \in \mathrm{Supp}(D(\alpha_r)) \qquad (19) \]

\[ \lim_{l \to \infty} t_{r,l,P} = 0 \quad \text{for } P_X\text{-a.a. } P \notin \mathrm{Supp}(D(\alpha_r)). \qquad (20) \]

Before giving the proof, let us comment on the result and show how to apply it. First, Theorem 2 shows that

\[ \lim_{l \to +\infty} t_{r,l,P} = I_{\mathrm{Supp}(D(\alpha_r))}(P) \quad \text{for } P_X\text{-a.a. } P \]

and that the classes we are looking for, up to PX-null sets, are actually the disjoint supports of the mixture components, i.e. the supports of the Dirichlet processes. Indeed, from a practical and statistical point of view, an EM-type algorithm is applied to the vectors di(σl), i = 1, . . . , n, in order to estimate the finite mixture (16) for a given l. Since this mixture is identifiable, the algorithmic procedure should provide some weights pr,l estimating the true weights pr of Eqs. (15) and (16) and some parameters αr,l estimating the Dirichlet parameters αr(σl) of Eq. (16). Then, due to the convergence stated by Theorem 2, computing tr,l,di for l large enough with αr,l in place of αr(σl) will allow us to guess the class of di: this number will be close to 1 if and only if di belongs to the class r whose Dirichlet parameter is αr,l. In particular, the classes will not change when l is increasing. This is what we call consistency of the classification.
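A minimal sketch of that computation (our own illustration, not the paper's code; log_ddirichlet implements the log of density (9), while p_hat and alpha_hat stand for the estimated weights pr,l and parameters αr,l):

```r
# Sketch: posterior class probabilities t_{r,l,d_i} of Eq. (18), computed on
# the log scale for numerical stability.
log_ddirichlet <- function(y, alpha) {
  # assumes strictly positive entries; empty histogram bins would need smoothing
  lgamma(sum(alpha)) - sum(lgamma(alpha)) + sum((alpha - 1) * log(y))
}
posterior_classes <- function(d_sigma, p_hat, alpha_hat) {
  # d_sigma: n x K matrix of binned distributions; alpha_hat: list of L parameter vectors
  L <- length(p_hat)
  logw <- sapply(1:L, function(r)
    log(p_hat[r]) + apply(d_sigma, 1, log_ddirichlet, alpha = alpha_hat[[r]]))
  t_il <- exp(logw - apply(logw, 1, max))   # subtract row maxima before exponentiating
  t_il / rowSums(t_il)                      # n x L matrix of the t_{r,l,d_i}
}
```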

The condition 0 < c ≤ pr,l ≤ 1, which is required in Theorem 2, is used in some algorithms such as SEM [33] and SAEM [5] because they are reinitialized when any estimated weight pr,l is too small, meaning that the corresponding class has too few elements. Note that the αr become more and more precise as they are evaluated on smaller and smaller intervals since σl+1 strictly refines σl. Finally, it is worth mentioning that, in the SEM and SAEM algorithms, the pr,l are computed iteratively by simulating n multinomial vectors

\[ e_i = (e_{1,i}, \ldots, e_{L,i}), \qquad i = 1, \ldots, n, \]

with probability vectors (t_{r,l,d_i}), respectively, each multinomial vector having all its components equal to 0 but one equal to 1. Then, p_{r,l} is computed as

\[ p_{r,l} = \frac{|\{i : e_{i,r} = 1\}|}{n}, \]

so that we have

\[ E(p_{r,l}) = \frac{\sum_{i=1}^{n} t_{r,l,d_i}}{n}, \qquad \lim_{l \to +\infty} E(p_{r,l}) = \frac{|\{i : d_i \in \mathrm{Supp}(D(\alpha_r))\}|}{n} \]

and, due to the law of large numbers, this limit is close to PX(Supp(D(αr))) = pr for n large enough.
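A sketch of this stochastic step (our own illustration; t_il denotes the n × L matrix of posterior probabilities t_{r,l,d_i} computed as above):

```r
# Sketch: the stochastic classification step. Each object i is assigned a class
# drawn from its posterior probabilities, and the weights p_{r,l} are
# re-estimated as the resulting class frequencies.
sem_s_step <- function(t_il) {
  n <- nrow(t_il); L <- ncol(t_il)
  labels <- apply(t_il, 1, function(p) sample.int(L, size = 1, prob = p))
  p_new <- tabulate(labels, nbins = L) / n   # p_{r,l} = |{i : e_{i,r} = 1}| / n
  list(labels = labels, weights = p_new)
}
```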

Proof of Theorem 2. Since the nonnegative finite measures αr are nonatomic and since the intervals of the partition are nonatomic, the components of the vector αr(σl) are strictly positive and all the terms appearing in Eq. (18) are (strictly) positive, so that we may write

\[ \frac{1}{t_{r,l,P}} = 1 + \sum_{k=1, k \neq r}^{L} \frac{p_{k,l}\, D(P(\sigma_l) \mid \alpha_k(\sigma_l))}{p_{r,l}\, D(P(\sigma_l) \mid \alpha_r(\sigma_l))}. \qquad (21) \]

Now observe that the restriction of the distribution D(α) of a Dirichlet process to the sub-σ-algebra Gl is a Dirichlet distribution; more precisely:

\[ D(\alpha)|_{\mathcal{G}_l} = D(\alpha(\sigma_{l,1}), \ldots, \alpha(\sigma_{l,|\sigma_l|})). \qquad (22) \]

Indeed, by Eq. (13), for any G ∈ Gl, there is a measurable O ⊆ T|σl| such that G = \( \varphi_{\sigma_l}^{-1}(O) \) and then

\[ D(\alpha)(G) = D(\alpha)(\varphi_{\sigma_l}^{-1}(O)) = D(\alpha(\sigma_{l,1}), \ldots, \alpha(\sigma_{l,|\sigma_l|}))(O) \qquad (23) \]

because, if P ∼ D(α), then

\[ \varphi_{\sigma_l}(P) = (P(\sigma_{l,1}), \ldots, P(\sigma_{l,|\sigma_l|})) \sim D(\alpha(\sigma_{l,1}), \ldots, \alpha(\sigma_{l,|\sigma_l|})) \]

by definition of a Dirichlet process. This proves Eq. (22). Therefore, the restrictions of the distributions D(αk) and D(αr) to Gl are the Dirichlet distributions D(αk(σl)) and D(αr(σl)), respectively. As their parameters are strictly positive, these Dirichlet distributions are equivalent, and Proposition 1 applied to T = D(αk) and U = D(αr), for k ≠ r, and the martingale theorem [18], then implies that the martingale

\[ \frac{D(P(\sigma_l) \mid \alpha_k(\sigma_l))}{D(P(\sigma_l) \mid \alpha_r(\sigma_l))} \]

converges for D(αr)-a.a. P as l → +∞. Observe now that the support of the martingale function is included in Supp(D(αk)) because it is included in Supp(D(αk)|Gl) for each l. Moreover, by definition of a Radon–Nikodym derivative, it is also seen that the martingale function support is included in Supp(D(αr)|Gl) for each l and thus in Supp(D(αr)). As D(αk) and D(αr), for r ≠ k, have disjoint supports due to Theorem 1 since αk ≠ αr, the limit of the martingale is necessarily 0 for D(αr)-a.a. P and for any k ≠ r. Moreover, as Eq. (17) implies

\[ 0 \leq \frac{p_{k,l}}{p_{r,l}} \leq \frac{1}{p_{r,l}} \leq \frac{1}{c}, \]

it is seen by (21) that

\[ \lim_{l \to +\infty} \frac{1}{t_{r,l,P}} = 1 \quad \text{for } D(\alpha_r)\text{-a.a. } P \in \mathrm{Supp}(D(\alpha_r)). \]

Thus,

\[ \lim_{l \to +\infty} t_{r,l,P} = 1 \quad \text{for } D(\alpha_r)\text{-a.a. } P \in \mathrm{Supp}(D(\alpha_r)). \]

This implies

\[ \lim_{l \to +\infty} t_{r,l,P} = 1 \quad \text{for } P_X\text{-a.a. } P \in \mathrm{Supp}(D(\alpha_r)) \qquad (24) \]

because of Eq. (15) and because the disjointness of the supports of the mixture components implies that a subset of Supp(D(αr)) of D(αr)-measure 0 is necessarily of PX-measure 0. Further, as t1,l,P + . . . + tL,l,P = 1 and tj,l,P ≥ 0, the limit in Eq. (24) implies

\[ \lim_{l \to +\infty} t_{j,l,P} = 0 \quad \text{for } P_X\text{-a.a. } P \in \mathrm{Supp}(D(\alpha_r)), \ \text{for any } j \neq r. \]

In particular,

\[ \lim_{l \to +\infty} t_{r,l,P} = 0 \quad \text{for } P_X\text{-a.a. } P \notin \mathrm{Supp}(D(\alpha_r)),\ P \in \cup_{k=1, k \neq r}^{L} \mathrm{Supp}(D(\alpha_k)) \]

since (15) clearly implies that Supp(PX) = \( \cup_{k=1}^{L} \mathrm{Supp}(D(\alpha_k)) \). This last limit and the limit in Eq. (24) are those given in Eqs (20) and (19) in Theorem 2. □


Algorithm 1 Procedure DirEstim(x, iter1)

Algorithm 2 Procedure SEMDir(x, iter1, L, iter2, p0, α0)

5. EXPERIMENTS WITH DATA

We now illustrate the preceding results by performing our method on a simulated dataset and on a real one. Our codes were written in the R v.2.12.1 software language and tested on a standard 32-bit Mac OS X laptop.

5.1. Algorithm

First, we have implemented a procedure for estimating a Dirichlet distribution, following an iterative method due to Minka [34]. Only input and output are presented here; for details see ref. 34.
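For orientation only, here is a condensed sketch of a fixed-point iteration of this type (our own illustration in the spirit of ref. 34, not the paper's DirEstim procedure):

```r
# Sketch: maximum likelihood estimation of a Dirichlet distribution by a
# fixed-point iteration, with digamma inverted by a few Newton steps.
inv_digamma <- function(y, iters = 5) {
  x <- ifelse(y >= -2.22, exp(y) + 0.5, -1 / (y - digamma(1)))
  for (k in 1:iters) x <- x - (digamma(x) - y) / trigamma(x)
  x
}
dirichlet_mle <- function(y, iter1 = 30) {
  # y: n x K matrix of probability vectors (rows sum to 1, entries > 0)
  logp_bar <- colMeans(log(y))
  alpha <- rep(1, ncol(y))                 # crude initialization
  for (k in 1:iter1)
    alpha <- inv_digamma(digamma(sum(alpha)) + logp_bar)
  alpha
}
```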

Then, the estimation of a mixture of L Dirichlet distributions has been done by using the SEM algorithm, as briefly described below:
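The overall loop can be sketched as follows (our reconstruction of a generic SEM scheme, not the exact SEMDir procedure of Algorithm 2; it reuses the hypothetical helpers dirichlet_mle and posterior_classes sketched above):

```r
# Sketch of an SEM loop for a mixture of L Dirichlet distributions:
# E-step (posterior probabilities), S-step (random class assignment),
# M-step (weights as frequencies, Dirichlet parameters per class).
sem_dirichlet_mixture <- function(y, L, iter2 = 100, iter1 = 30) {
  n <- nrow(y)
  labels <- sample.int(L, n, replace = TRUE)   # random initial classes
  for (it in 1:iter2) {
    # assumes every class keeps at least one member; real implementations
    # restart when a class empties (cf. the condition on p_{r,l} in Section 4.3)
    p_hat     <- tabulate(labels, nbins = L) / n
    alpha_hat <- lapply(1:L, function(r)
      dirichlet_mle(y[labels == r, , drop = FALSE], iter1))
    t_il      <- posterior_classes(y, p_hat, alpha_hat)
    labels    <- apply(t_il, 1, function(p) sample.int(L, 1, prob = p))
  }
  list(weights = p_hat, alpha = alpha_hat, labels = labels)
}
```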

5.2. Simulated Dataset

We considered a mixture of L = 4 Dirichlet processes with mean distributions Beta(3, 5), Beta(4, 7), Beta(5, 9) and Beta(6, 10), precision parameters c1 = 4, c2 = 5, c3 = 6, c4 = 7, and weights 200/1000, 400/1000, 250/1000 and 150/1000, respectively. Then, n = 1000 distributions were drawn from this mixture and r = 2000 points were drawn from each of these distributions. This can be done using what is called a CRP (Chinese restaurant process) scheme, by drawing random Xi, i = 1, . . . , r, as follows [35]:

\[ \begin{cases} X_1 \sim \mathrm{Beta}(a, b) \\ X_i \mid X_1, \ldots, X_{i-1} \sim \dfrac{1}{c+i-1} \displaystyle\sum_{j=1}^{i-1} \delta_{X_j} + \dfrac{c}{c+i-1}\, \mathrm{Beta}(a, b), \quad i = 2, \ldots, r. \end{cases} \qquad (25) \]

This means that having simulated X1, . . . , Xi−1, then Xi is one of these i − 1 values, each with probability 1/(c + i − 1), or is a new draw from Beta(a, b) with probability c/(c + i − 1). The simulation (25) was repeated 200 times with a = 3, b = 5, c = 4, 400 times with a = 4, b = 7, c = 5, 250 times with a = 5, b = 9, c = 6 and 150 times with a = 6, b = 10, c = 7, yielding an n × r data table which represents the simulated raw data.
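A direct transcription of the scheme (25) (our own sketch; the parameters shown are those of the first mixture component, a = 3, b = 5, c = 4):

```r
# Sketch: drawing r points from one realization of a Dirichlet process with
# base measure c * Beta(a, b), via the Polya urn / CRP scheme of Eq. (25).
rcrp_beta <- function(r, a, b, c) {
  x <- numeric(r)
  x[1] <- rbeta(1, a, b)
  for (i in 2:r) {
    if (runif(1) < c / (c + i - 1)) x[i] <- rbeta(1, a, b)   # new draw from the base measure
    else x[i] <- x[sample.int(i - 1, 1)]                     # reuse one of the previous values
  }
  x
}
set.seed(3)
pts <- rcrp_beta(2000, a = 3, b = 5, c = 4)   # r = 2000 points from one simulated distribution
```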

From this table, a smaller n × K table was derived by computing, for each row i = 1, . . . , n, a histogram of K bins from the Xi, i = 1, . . . , r, and finally the SEMDir procedure of Section 5.1 was applied to this table.

By choosing iter1 = 30 and iter2 = 100, the convergence of the weights estimated by the procedure toward the weights of the mixture (15) was observed from K = 25 bins. The response times of the classification procedure as a function of K (Table 1) and when increasing the number n of histograms for K = 25 (Table 2) are linear and seem to indicate the tractability of our method. We omit the graphical representations, which are similar to the ones displayed in the next Section 5.3.

5.3. Real Dataset

We consider the example of daily solar radiation histograms described in Section 2.3. We performed the algorithm on 720 histograms (almost 2 years of data) with L = 4 classes and from K = 10 bins to K = 30 bins.

Table 1. Classifying 1000 histograms with four components.

Bins K = 25 K = 30 K = 35 K = 40 K = 45 K = 50 K = 55

Time (s) 13.3 13.8 14.9 15.6 16.4 17.3 18

Table 2. Classifying n histograms of 25 bins with fourcomponents.

n 400 800 1200 1600 2000 2400 2800 3200 3600

Time (s) 2.9 4.9 6.9 8.9 10.9 13 15.1 17.3 19.3


Table 3. Weight of classes w.r.t. the number of bins.

Class   K = 10 (%)   K = 16 (%)   K = 20 (%)   K = 25 (%)   K = 30 (%)
1              7.9         12.6          9.2          8.8          8.9
2             42.5         34.5         33.2         33.7         33.6
3              2.5         10.1          5.9          5.5          5.4
4             47.1         42.7         51.6         52.0         52.1

Table 4. Class transitions when passing from 10 bins to 16.

Class of a day      Class of the same day when K = 16
when K = 10            1        2        3        4
1                  48.3%     3.4%    34.5%    13.8%
2                   0.0%    11.0%     2.6%    86.4%
3                   0.0%     0.0%    88.9%    11.1%
4                  18.6%    62.8%     8.7%     9.9%

Table 5. Class transitions when passing from 16 bins to 20.

Class of a day      Class of the same day when K = 20
when K = 16            1        2        3        4
1                  55.0%    45.0%     0.0%     0.0%
2                   0.7%    68.3%    31.0%     0.0%
3                  24.3%    13.5%    54.1%     8.1%
4                   1.3%     2.6%     3.2%    93.0%

Table 6. Class transitions when passing from 20 bins to 25.

Class of a day      Class of the same day when K = 25
when K = 20            1        2        3        4
1                  57.1%    42.9%     0.0%     0.0%
2                   0.0%    69.5%     0.0%    30.5%
3                   0.0%    32.0%    60.0%     8.0%
4                   0.0%     0.0%     0.0%   100.0%

As shown in ref. 20, each result of the algorithm shows that the four classes correspond to a specific weather characteristic of the days, namely very clear, clear, very cloudy or cloudy, denoted by 1, 2, 3 and 4, respectively. Significant changes were observed for K = 16 and K = 20 and then very slight changes for K ≥ 20, mainly in the two largest classes 2 and 4. We have computed the weight of each class (Table 3), the percentage of days of a class which are assigned to another class when increasing K (Tables 4–7), and we have collected the response times of the estimation procedure (Table 8) according to K (Fig. 1) and to the number of histograms (Fig. 2). As in the case of the simulated data in Section 5.2, the estimation times on subsamples are linear w.r.t. the number of bins and w.r.t. the number of days (Figs 1 and 2), and seem to confirm the tractability of our method.

Fig. 1 Estimation time versus number of histogram bins

Table 7. Class transitions when passing from 25 bins to 30.

Class of a day      Class of the same day when K = 30
when K = 25            1        2        3        4
1                  67.1%    32.9%     0.0%     0.0%
2                   0.0%    89.5%     0.0%    11.5%
3                   0.0%    22.0%    70.0%     8.0%
4                   0.0%     0.0%     0.0%   100.0%

However, when the number of bins increases, the method requires a larger sample. For example, for 25 bins at least 400 days are needed and for 30 bins at least 500 days. It can first be observed from Table 3 that the ranking of the class sizes remains identical when K increases. Next, observe from Table 4 that most of the days (86.4%) of class 2 are classified as days of class 4 when passing from 10 bins to 16 and, conversely, 62.8% of the days of class 4 are classified in class 2, while a majority of the days of the small classes 1 and 3 remain in the same class. This may be due to the fact that, in reality, some clear days may be considered as cloudy and vice versa. In other words, classes overlap. Finally, it can be observed that the estimators of the largest class 4 can be considered as consistent from K = 20 (93% of days in this class remain in this class) while the second largest class 2 reaches 80%. The other two small classes 1 and 3 are more inconsistent: even for K larger than 25, more than 30% of the days of these classes move to another class (Tables 6 and 7). This illustrates the condition c < pr (condition (17)) required in Theorem 2 to prove consistency. Anyway, our method, as do many algorithms, has some problems in identifying small classes.


Fig. 2 Estimation time versus number of daily histograms

6. APPLICATIONS OF THE CLASSIFICATION

As a conclusion, we mention some applications which can be developed from the classification consistency result and we also mention a possible generalization to the multivariate case. For the applications, the idea consists of replacing each object by the class label of its corresponding distribution or by the value of a likelihood function. Note that the consistency of the classification plays a crucial role in order to obtain consistent results when applying standard methods to the class labels or to the values of a likelihood function.

6.1. Class Prediction

The prediction problem for a sequence of objects, as in the case of paths of stochastic processes, may be very complex. Describing each object by a distribution, classifying the distributions into a finite number of classes and replacing each object by the class label of its corresponding distribution then provide a sequence of class labels belonging to a finite set. This sequence of classes can then be predicted using some standard prediction models such as hidden Markov chains or categorical time-series models.
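As the simplest instance of such a model (our own illustration with an arbitrary label sequence, not an analysis from the paper), a first-order transition matrix can be estimated directly from the class labels:

```r
# Sketch: estimating a transition matrix from a sequence of class labels,
# the most elementary model for predicting the next class.
labels <- c(1, 1, 2, 4, 4, 4, 2, 1, 3, 4, 4, 2)   # illustrative label sequence
L <- 4
trans <- table(factor(head(labels, -1), levels = 1:L),
               factor(tail(labels, -1), levels = 1:L))
trans_prob <- prop.table(trans, margin = 1)       # row r: P(next class | current class r)
```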

6.2. Frequent Patterns

It can be of interest to find out significant frequent patterns appearing in the sequence of class labels of Section 6.1, as is done in the domain of sequential data mining.

Table 8. Mixture estimation time in seconds.

Days \ Bins   K = 10   K = 16   K = 20   K = 25   K = 30
300             5.40     5.90     7.13        —        —
400             6.78     7.61     8.53     9.58        —
500             8.22     8.99    10.19    10.95    11.96
600             9.45    10.65    11.90    12.70    13.74
720            10.98    12.21    13.90    15.17    15.84

6.3. Machine Learning Methods

Suppose that each of the n objects is described not only by one distribution but by a vector (of fixed size q, say) of distributions. We then deal with an n × q table of distributions, so that we can classify each column of distributions and replace the table by an n × q table of class labels. Standard machine learning methods, such as decision trees, random forests, or support vector machines, can then be applied to this simplified table for prediction, regression or discrimination.

Further, it can be thought that instead of using just the class label of an object, we can use the likelihood function of the distribution describing the object w.r.t. the Dirichlet distribution of its class.

6.4. Multivariate Case

We finally think that our method can be extended to the multivariate case. Indeed, we have dealt with a set of objects described by just one variable (for example, Internet throughput or solar radiation). If the objects are described by several variables that have to be taken into account together (for example, days described by solar radiation, temperature, wind speed, humidity), we suggest that multihistograms w.r.t. all the variables be built, i.e. multi-bins associated with a vector of percentages. Then, a finite mixture of Dirichlet distributions (the number of parameters of each Dirichlet distribution being equal to the number of multi-bins) obviously still remains a nice appropriate model, although some sparse data problems generally appear in high dimensional problems. The consistency theorem should extend to this case.

ACKNOWLEDGMENTS

The author gratefully acknowledges the two referees for their constructive suggestions that led to significant improvements of the paper, and also T. Soubdhan for making available his solar radiation dataset.


REFERENCES

[1] E. Diday, The symbolic approach in clustering and related methods of data analysis, In Proceedings of Classification and Related Methods of Data Analysis, IFCS, H. Bock, ed., North-Holland, 1987.
[2] L. Billard and E. Diday, Symbolic Data Analysis, San Francisco, John Wiley & Sons, 2007.
[3] R. E. Quandt and J. B. Ramsey, Estimating mixtures of normal distributions and switching regression, J Am Stat Assoc 73 (1978), 730–738.
[4] A. P. Dempster, N. M. Laird, and D. B. Rubin, Maximum likelihood from incomplete data via the EM algorithm, J R Stat Soc [Ser B] 39 (1977), 1–38.
[5] G. Celeux and J. Diebolt, A stochastic approximation type EM algorithm for the mixture problem, Stochastics Stochastics Rep 41 (1992), 119–134.
[6] B. Delyon, M. Lavielle, and E. Moulines, Convergence of a stochastic approximation version of the EM algorithm, Ann Stat 27 (1999), 94–128.
[7] G. Wei and M. A. Tanner, A Monte Carlo implementation of the EM algorithm and the poor man's data augmentation algorithms, J Am Stat Assoc 85(411) (1990), 699–704.
[8] E. Diday and M. Vrac, Mixture decomposition of distributions by copulas in the symbolic data analysis framework, Discrete Appl Math J 127 (2005), 271–284.
[9] J. F. Kingman, Random discrete distributions, J R Stat Soc [Ser B] 37 (1975), 1–22.
[10] L. J. Brunner and A. Lo, Bayesian Classification, 1999. Available: http://www.utstat.utoronto.ca/brunner/papers/BayesClass.pdf [Accessed November 4, 2011].
[11] L. J. Brunner, A. T. Chan, and A. Lo, Weighted Chinese Restaurant Processes and Bayesian Mixture Models, HKUST ISMT Department, Technical Report, Rev. 1.1, 1996 & 1998.
[12] H. Ishwaran and M. Zarepour, Markov chain Monte Carlo in approximate Dirichlet and beta two-parameter process hierarchical models, Biometrika 87(2) (2000), 371–390.
[13] H. Ishwaran and L. James, Approximate Dirichlet process computing in finite normal mixtures: smoothing and prior information, J Comput Graph Stat 11(3) (2002), 1–26.
[14] S. Ray and B. Mallick, Functional clustering by Bayesian wavelet methods, J R Stat Soc [Ser B] 68 (2006), 305–332.
[15] K. P. Lennox, D. B. Dahl, M. Vannucci, R. Day, and J. W. Tsai, A Dirichlet process mixture of hidden Markov models for protein structure prediction, Ann Appl Stat 4 (2010), 916–942.
[16] E. Jackson, M. Davy, and W. Fitzgerald, Unsupervised Classifications of Functions with Dirichlet Process Mixtures of Gaussian Processes, CUED/F-ING Technical Report 562, 2006.
[17] R. Korwar and M. Hollander, Contributions to the theory of Dirichlet processes, Ann Stat 1 (1973), 706–711.
[18] J. L. Doob, Stochastic Processes, New York, John Wiley & Sons, 1953.
[19] A. Soule, K. Salamatian, N. Taft, R. Emilion, and K. Papagiannaki, Flow classification by histograms, In Proceedings of Sigmetrics, New York, 2004. Available: http://www.univ-orleans.fr/mapmo/membres/emilion/publ/Sigm.pdf [Accessed March 10, 2012].
[20] T. Soubdhan, R. Emilion, and R. Calif, Classification of daily solar radiation distributions using a mixture of Dirichlet distributions, Solar Energy 83 (2009), 1056–1063.
[21] R. Calif, R. Emilion, T. Soubdhan, and R. Blonbou, Classification of wind speed distributions using a mixture of Dirichlet mixtures, Renewable Energy 36 (2011), 3091–3097.
[22] R. Emilion and D. Pasquignon, Random Distributions in Image Analysis, 2007. Available: http://www.univ-orleans.fr/mapmo/membres/emilion/publ/rdi.pdf [Accessed March 10, 2012].
[23] J. P. Aboa and R. Emilion, Decision tree for probabilistic data, In Proceedings of DaWaK 2000, Lecture Notes in Computer Science No. 1874, 393–398.
[24] B. A. Frigyik, A. Kapila, and M. R. Gupta, Introduction to the Dirichlet Distribution and Related Processes, University of Washington, Department of Electrical Engineering, UWEE Technical Report 2010-0006, 2010.
[25] T. S. Ferguson, A Bayesian analysis of some nonparametric problems, Ann Stat 1 (1973), 209–230.
[26] A. N. Kolmogorov, Foundations of the Theory of Probability (2nd ed.), New York, Chelsea Publishing Company, 1956.
[27] C. E. Antoniak, Mixtures of Dirichlet processes, Ann Stat 2 (1974), 1152–1174.
[28] F. Caron, M. Davy, A. Doucet, E. Duflos, and P. Vanheeghe, Bayesian inference for dynamic models with Dirichlet process mixtures, International Conference on Information Fusion, Florence, Italy, IEEE, 2006.
[29] D. B. Dunson and A. H. Herring, Semiparametric Bayesian latent trajectory models, ISDS Discussion Paper 16, Duke University, Durham, NC, USA, 2006.
[30] A. E. Gelfand, A. Kottas, and S. N. MacEachern, Bayesian nonparametric spatial modeling with Dirichlet process mixing, J Am Stat Assoc 100 (2005), 1021–1035.
[31] G. Ronning, Maximum likelihood estimation of Dirichlet distributions, J Stat Comput Simul 32 (1989), 215–221.
[32] A. Narayanan, Algorithm AS 266: maximum likelihood estimation of the parameters of the Dirichlet distribution, Appl Stat 40 (1991), 365–374.
[33] G. Celeux and J. Diebolt, The SEM algorithm: a probabilistic teacher algorithm derived from the EM algorithm for the mixture problem, Comput Stat Quart 2(1) (1985), 73–82.
[34] T. Minka, Estimating a Dirichlet distribution, Microsoft Research, Cambridge, Technical Report, 2003.
[35] D. Blackwell and J. B. MacQueen, Ferguson distributions via Polya urn schemes, Ann Stat 1 (1973), 353–355.
