Robust Estimation of Gaussian Mixtures from Noisy Input Data

Shaobo Hou Aphrodite Galata

School of Computer Science, The University of Manchester

Abstract

We propose a variational Bayes approach to the problem of robust estimation of Gaussian mixtures from noisy input data. The proposed algorithm explicitly takes into account the uncertainty associated with each data point, makes no assumptions about the structure of the covariance matrices and is able to automatically determine the number of Gaussian mixture components. Through the use of both synthetic and real world data examples, we show that by incorporating uncertainty information into the clustering algorithm, we get better results at recovering the true distribution of the training data compared to other variational Bayesian clustering algorithms.

1. Introduction

Standard EM-based clustering algorithms [7] assume that input data points are all equally important; the effect of noise or measurement errors is often ignored and not explicitly modelled during model estimation. However, this is not always a valid assumption since the input data can be corrupted by measurement errors. Uncertainties can also be introduced by additional transformations of the data, such as dimensionality reduction. In particular, in the case of non-linear transformations, it is no longer safe to assume that the uncertainties are uniform across the data. If the level of uncertainty can be quantified, it makes sense to incorporate it into the clustering algorithm to improve the estimation of the true data distribution.

In this paper we propose a novel algorithm for learning a mixture of Gaussians that takes into account the uncertainties of the input data. In our formulation we assume that the uncertainty on a data point can be modelled by a multivariate Gaussian distribution and is independent of the other data points. Intuitively, this allows a data point with large uncertainty to exert less influence on the mixture components than a data point with smaller uncertainty. The optimal mixture model that represents the data with uncertainties is found using a variational Bayesian algorithm that automatically chooses the appropriate number of components in the mixture model. We show that by taking the uncertainty information into account, our algorithm performs better at estimating the correct number of clusters and recovering the true distribution of the training data compared to other variational Bayesian clustering algorithms [6, 3]. The proposed algorithm is evaluated on a number of synthetic and real data sets and is shown to improve the results of various pattern recognition tasks such as motion segmentation and partitioning of microarray gene expressions.

2. Related Work

Previously, researchers have looked into the problem of incorporating uncertainty information into model fitting and have developed algorithms such as Total Least Squares [1], which assumes the noise on all data points is drawn from the same uncertainty distribution, and the Fundamental Numerical Scheme [5], which allows different data points to be associated with different uncertainty distributions. The problem has also been studied in the context of support vector machine classification [2].

Various researchers have also investigated the problem of unsupervised clustering of data with uncertainty. Chaudhuri and Bhowmik [4] proposed a modified K-means algorithm which assumes uniform uncertainties, such that the true position of a data point can be anywhere within a hypersphere centred on its observed position. Kumar and Patel [11] also proposed generalisations of K-means and hierarchical clustering to handle zero-mean Gaussian measurement uncertainty. However, their formulation is based on intuition and is not probabilistically well principled. Hamdan and Govaert [8] proposed an EM [7] clustering algorithm which modelled non-identically distributed uncertainty as rectangular error zones.

More recently in Bioinformatics, in order to group genes with similar expression patterns from microarray experiments, Liu et al. [12] proposed a probabilistic clustering algorithm for estimating the maximum likelihood mixture of Gaussians with spherical covariance matrices from data with zero-mean Gaussian measurement errors represented by diagonal covariance matrices. In their method, the parameters of the mixture components are updated using gradient descent based optimisation, and the Bayesian Information Criterion (BIC) [14] is used to determine the appropriate number of components. Our method is more general and does not make any assumption about the structure of the covariance matrices of the mixture components or the measurement uncertainties. The variational Bayesian model selection used by our method is well principled and handles datasets of various sizes, whereas BIC is an asymptotic result which may not handle small datasets well.

Sun et al. [16] proposed a generalised Expectation Maximisation (GEM) algorithm for estimating the maximum likelihood mixture of Student t-distributions from astrophysical datasets with measurement errors, thus improving the detection of peculiar quasars as statistical outliers. Our algorithm is similar to theirs in that both solve the intractable integration in the resulting formulation using a variational approximation. However, while their algorithm returns the maximum likelihood solution for a fixed number of components, our algorithm finds the optimal number of components during a single clustering run. We also place priors on the parameters of the mixture components to prevent singularities and components with degenerate shapes from forming.

As the level of uncertainty on the input data increases, maximum likelihood algorithms such as [12] and [16] tend to return mixtures with some very narrow components which collapse onto a point or a hyperplane. Although the t-distribution is somewhat more robust to this effect than the Gaussian distribution, it can still be affected by it. This is not an issue if the estimated mixture model is only going to be used in discriminative tasks; it does, however, pose a serious problem if the model is meant for generative tasks which require sampling and synthesising from the learnt distribution. This is not a problem in our clustering algorithm because we place priors on the parameters of the mixture components which constrain the shapes of the components and prevent them from collapsing.

3. Learning a mixture of Gaussians in the presence of errors

Our aim is to estimate a mixture of K Gaussian components which best represents the distribution of a set of N data points X = {x_1, . . . , x_N} observed in a D-dimensional space, where the optimal number of components K is unknown. It is assumed that each data point x_n is a noisy measurement of its true position and is drawn from a Gaussian distribution N(x_n | t_n, C_n). The mean t_n is the unknown true position of the data point and the covariance matrix C_n is the known uncertainty on the measurement.

Figure 1. Mixture of Gaussians with Uncertainty.

Under a latent variable model, we associate each data point x_n with a binary latent variable z_n = {z_{n1}, . . . , z_{nK}}, in which only one element is set to one, to indicate that the data point t_n was generated from that component, and all other elements of the latent variable are equal to zero. Given the latent variables Z = {z_1, . . . , z_N} and T = {t_1, . . . , t_N}, the conditional distribution of X can be written as:

p(X|C,T) = \prod_{n=1}^{N} \mathcal{N}(x_n | t_n, C_n)   (1)

and the conditional distributions of T and Z are given by:

p(T|Z, \mu, \Lambda) = \prod_{n=1}^{N} \prod_{k=1}^{K} \mathcal{N}(t_n | \mu_k, \Lambda_k^{-1})^{z_{nk}}   (2)

p(Z|\pi) = \prod_{n=1}^{N} \prod_{k=1}^{K} \pi_k^{z_{nk}}   (3)

where µ_k and Λ_k are the mean vector and precision matrix of the kth Gaussian component and π = {π_1, . . . , π_K} are the mixing coefficients of the components.

We place a Normal-Wishart prior on the means and precisions of the Gaussian components, as shown by Bishop [3]:

p(\mu, \Lambda) = \prod_{k=1}^{K} \mathcal{N}(\mu_k | m_0, (\beta_0 \Lambda_k)^{-1}) \, \mathcal{W}(\Lambda_k | W_0, \nu_0)   (4)

where m_0 is usually set to zero and β_0 is set to a very small value. W(Λ | W, ν) is a Wishart distribution with scale matrix W and ν degrees of freedom.

The model described can be represented as the directed graph shown in figure 1. Assuming a uniform prior on C, the joint distribution over all variables conditioned on π is:

p(X, C, T, Z, \mu, \Lambda | \pi) = p(X|C,T) \, p(T|Z, \mu, \Lambda) \, p(Z|\pi) \, p(\mu|\Lambda) \, p(\Lambda)   (5)
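To make the setup concrete, the sketch below shows one possible way of organising the inputs and the Normal-Wishart prior hyperparameters in code. It is only an illustration of the model described above: the function name, the default values of β_0 and ν_0, the choice of W_0 as the identity and the random initialisation of the component means are our own assumptions, not details given in the paper.

```python
import numpy as np

def init_prior_and_variational_state(X, C, K_init=20, beta0=1e-3, nu0=None, seed=0):
    """Set up Normal-Wishart prior hyperparameters and an initial variational state.

    X : (N, D) array of noisy observations x_n.
    C : (N, D, D) array of known per-point uncertainty covariances C_n.
    """
    rng = np.random.default_rng(seed)
    N, D = X.shape
    K_init = min(K_init, N)
    m0 = np.zeros(D)                                 # prior mean, "usually set to zero"
    nu0 = float(D) if nu0 is None else float(nu0)    # Wishart degrees of freedom (assumed default)
    W0 = np.eye(D)                                   # Wishart scale matrix (assumed; not specified in the paper)

    # Initialise component means at randomly chosen data points and precisions at the prior.
    m = X[rng.choice(N, size=K_init, replace=False)].copy()
    beta = np.full(K_init, beta0)
    W = np.tile(W0, (K_init, 1, 1))
    nu = np.full(K_init, nu0)
    pi = np.full(K_init, 1.0 / K_init)               # mixing coefficients, optimised as parameters
    return dict(m0=m0, beta0=beta0, W0=W0, nu0=nu0,
                m=m, beta=beta, W=W, nu=nu, pi=pi)
```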

3.1. Variational Inference

Variational inference is a framework for computing analytical approximations to otherwise intractable posterior distributions, by restricting the range of functions used for the approximation. Given that X denotes the set of observable variables, H denotes the set of latent variables and parameters, and q(H) is a variational distribution over H, the log marginal probability of X is:

\ln p(X) = \ln \int p(X, H) \, dH = \ln \int q(H) \frac{p(X, H)}{q(H)} \, dH \geq \int q(H) \ln \frac{p(X, H)}{q(H)} \, dH

\mathcal{L}(q) = \int q(H) \ln \frac{p(X, H)}{q(H)} \, dH   (6)

where L(q) is a strict lower bound on ln p(X) and is derived using Jensen's inequality. The difference between ln p(X) and L(q) is the Kullback-Leibler divergence between q and p, which is always non-negative.

Assuming that the q distribution factorises with respect to disjoint groups of variables in H, such that q(H) = \prod_{i=1}^{M} q_i(H_i), the optimal solution for the jth factorised distribution is given by the expectation of the log of the joint distribution over all the other q distributions [3]:

\ln q_j^*(H_j) = \mathbb{E}_{i \neq j}[\ln p(X, H)] + \text{const.}   (7)

3.2. Variational Bayes mixture of Gaussians with Uncertainty

In our case, the set of latent variables is {T, Z, µ, Λ} and its variational distribution can be factorised as:

q(H) = q(T|Z) \, q(Z) \, q(\mu, \Lambda)   (8)

which can be further factorised as:

q(\mu, \Lambda) = \prod_{k=1}^{K} q(\mu_k | \Lambda_k) \, q(\Lambda_k)   (9)

q(T|Z) = \prod_{n=1}^{N} \prod_{k=1}^{K} q(t_n | z_{nk} = 1)^{z_{nk}}   (10)

The optimal solution for q*(t_n | z_{nk} = 1) is the posterior distribution of the true position t_n, given the uncertainty C_n and conditioned on the kth mixture component:

q^*(t_n | z_{nk} = 1) = \mathcal{N}(t_n | r_{n|k}, E_{n|k})   (11)

where the mean and the covariance are computed as:

r_{n|k} = E_{n|k} (C_n^{-1} x_n + \nu_k W_k m_k)   (12)

E_{n|k} = (C_n^{-1} + \nu_k W_k)^{-1}   (13)
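Equations (12) and (13) translate directly into code; the short sketch below is a literal transcription, with r_nk and E_nk standing for r_{n|k} and E_{n|k} (the variable names are illustrative only).

```python
import numpy as np

def posterior_true_position(x_n, C_n, m_k, W_k, nu_k):
    """Posterior q*(t_n | z_nk = 1) = N(t_n | r_nk, E_nk), equations (12)-(13)."""
    C_inv = np.linalg.inv(C_n)
    E_nk = np.linalg.inv(C_inv + nu_k * W_k)          # eq. (13)
    r_nk = E_nk @ (C_inv @ x_n + nu_k * W_k @ m_k)    # eq. (12)
    return r_nk, E_nk
```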

The optimal solution for q*(Z) is computed as:

q^*(Z) = \prod_{n=1}^{N} \prod_{k=1}^{K} r_{nk}^{z_{nk}}   (14)

r_{nk} = \frac{\rho_{nk}}{\sum_{j=1}^{K} \rho_{nj}} = \mathbb{E}[z_{nk}]   (15)

where r_{nk} is the responsibility of the kth component for the nth data point and:

\ln \rho_{nk} = \ln \pi_k + \frac{1}{2} \mathbb{E}[\ln|\Lambda_k|] - \frac{D}{2} \ln(2\pi) - \frac{1}{2} \{ D \beta_k^{-1} + \nu_k (r_{n|k} - m_k)^T W_k (r_{n|k} - m_k) + \nu_k \mathrm{Tr}(E_{n|k} W_k) \} - \frac{D}{2} \ln(2\pi) - \frac{1}{2} \ln|C_n| - \frac{1}{2} \{ (r_{n|k} - x_n)^T C_n^{-1} (r_{n|k} - x_n) + \mathrm{Tr}(E_{n|k} C_n^{-1}) \} + \frac{D}{2} + \frac{D}{2} \ln(2\pi) + \frac{1}{2} \ln|E_{n|k}|   (16)

where ψ(·) is the digamma function and \mathbb{E}[\ln|\Lambda_k|] = \sum_{i=1}^{D} \psi((\nu_k + 1 - i)/2) + D \ln 2 + \ln|W_k|.
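The responsibilities of equations (15)-(16) can be computed as sketched below. The expected_log_det_Lambda helper implements the expression for E[ln|Λ_k|] given above, and the normalisation uses log-sum-exp for numerical stability, which is a standard implementation choice rather than something the paper prescribes.

```python
import numpy as np
from scipy.special import digamma

def expected_log_det_Lambda(W_k, nu_k):
    """E[ln |Lambda_k|] = sum_i psi((nu_k + 1 - i)/2) + D ln 2 + ln |W_k|."""
    D = W_k.shape[0]
    i = np.arange(1, D + 1)
    return digamma((nu_k + 1 - i) / 2.0).sum() + D * np.log(2.0) + np.linalg.slogdet(W_k)[1]

def log_rho_nk(x_n, C_n, r_nk, E_nk, m_k, beta_k, W_k, nu_k, pi_k):
    """ln rho_nk from equation (16), with the terms kept in the same order."""
    D = x_n.shape[0]
    C_inv = np.linalg.inv(C_n)
    d_m = r_nk - m_k
    d_x = r_nk - x_n
    val = np.log(pi_k) + 0.5 * expected_log_det_Lambda(W_k, nu_k) - 0.5 * D * np.log(2 * np.pi)
    val -= 0.5 * (D / beta_k + nu_k * d_m @ W_k @ d_m + nu_k * np.trace(E_nk @ W_k))
    val -= 0.5 * D * np.log(2 * np.pi) + 0.5 * np.linalg.slogdet(C_n)[1]
    val -= 0.5 * (d_x @ C_inv @ d_x + np.trace(E_nk @ C_inv))
    val += 0.5 * D + 0.5 * D * np.log(2 * np.pi) + 0.5 * np.linalg.slogdet(E_nk)[1]
    return val

def responsibilities(log_rho):
    """Normalise an (N, K) matrix of ln rho_nk row-wise (eq. 15) via log-sum-exp."""
    log_r = log_rho - log_rho.max(axis=1, keepdims=True)
    r = np.exp(log_r)
    return r / r.sum(axis=1, keepdims=True)
```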

The optimal solution for q*(µ_k | Λ_k) is similar to the derivation in [3] and is computed as:

q^*(\mu_k | \Lambda_k) = \mathcal{N}(\mu_k | m_k, (\beta_k \Lambda_k)^{-1})   (17)

where:

\beta_k = \beta_0 + N_k   (18)

m_k = \frac{1}{\beta_k} (\beta_0 m_0 + N_k r_k)   (19)

r_k = \frac{1}{N_k} \sum_{n=1}^{N} \mathbb{E}[z_{nk}] \, r_{n|k}   (20)

N_k = \sum_{n=1}^{N} \mathbb{E}[z_{nk}]   (21)

and the optimal solution for q*(Λ_k) is computed as:

q^*(\Lambda_k) = \mathcal{W}(\Lambda_k | W_k, \nu_k)   (22)

where:

W_k^{-1} = W_0^{-1} + N_k S_k + \frac{\beta_0 N_k}{\beta_0 + N_k} (r_k - m_0)(r_k - m_0)^T   (23)

\nu_k = \nu_0 + N_k   (24)

S_k = \frac{1}{N_k} \sum_{n=1}^{N} \mathbb{E}[z_{nk}] (r_{n|k} - r_k)(r_{n|k} - r_k)^T + \frac{1}{N_k} \sum_{n=1}^{N} \mathbb{E}[z_{nk}] E_{n|k}   (25)
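A vectorised sketch of the updates in equations (18)-(25) is given below. It assumes the responsibility matrix and the per-point posterior moments r_{n|k} and E_{n|k} have already been computed, and that components whose effective count N_k has been driven to zero are pruned beforehand (so the divisions by N_k are safe); the array layout is our own choice.

```python
import numpy as np

def update_normal_wishart(R, r_post, E_post, m0, beta0, W0, nu0):
    """Normal-Wishart variational updates, equations (18)-(25).

    R      : (N, K) responsibilities r_nk = E[z_nk].
    r_post : (N, K, D) posterior means r_{n|k}.
    E_post : (N, K, D, D) posterior covariances E_{n|k}.
    """
    N, K, D = r_post.shape
    Nk = R.sum(axis=0)                                        # eq. (21)
    beta = beta0 + Nk                                         # eq. (18)
    rk = np.einsum('nk,nkd->kd', R, r_post) / Nk[:, None]     # eq. (20)
    m = (beta0 * m0 + Nk[:, None] * rk) / beta[:, None]       # eq. (19)
    nu = nu0 + Nk                                             # eq. (24)

    W = np.empty((K, D, D))
    W0_inv = np.linalg.inv(W0)
    for k in range(K):
        diff = r_post[:, k, :] - rk[k]
        # eq. (25): expected scatter plus the average posterior covariance.
        Sk = (R[:, k, None, None] *
              (diff[:, :, None] * diff[:, None, :] + E_post[:, k])).sum(axis=0) / Nk[k]
        dm = rk[k] - m0
        # eq. (23)
        W_inv = W0_inv + Nk[k] * Sk + beta0 * Nk[k] / (beta0 + Nk[k]) * np.outer(dm, dm)
        W[k] = np.linalg.inv(W_inv)
    return Nk, beta, m, nu, W, rk
```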

Finally, since no priors are placed on the mixing coefficients π, they are not treated as latent variables but are instead optimised as parameters:

\pi_k = \frac{1}{N} \sum_{n=1}^{N} r_{nk}   (26)


Following the approach suggested by Corduneanu and Bishop [6] for choosing the correct number of mixture components, we initialise the mixture with a large number of components and remove any whose mixing coefficients have been driven to zero during the optimisation.
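A minimal sketch of this re-estimation and pruning step, following equation (26); the pruning threshold below is our own choice and is not a value given in the paper.

```python
import numpy as np

def update_and_prune_mixing_coefficients(R, threshold=1e-6):
    """Re-estimate pi_k (eq. 26) and drop components whose mixing coefficient
    has been driven (numerically) to zero. R is the (N, K) responsibility matrix."""
    pi = R.mean(axis=0)        # pi_k = (1/N) sum_n r_nk
    keep = pi > threshold      # boolean mask of surviving components
    return pi[keep], keep
```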

3.2.1 Variational Lower Bound

The optimal mixture model is found by repeatedly updating the variational distributions of T, Z, µ and Λ and then optimising the mixture coefficients π. The convergence of the iterative update is monitored by computing the variational lower bound L(q), which cannot decrease after each update. For our mixture model, the lower bound defined by equation (6) is computed as:

\mathcal{L}(q) = \mathbb{E}[\ln p(X, C, T, Z, \mu, \Lambda | \pi)] - \mathbb{E}[\ln q(T, Z, \mu, \Lambda)]
= \mathbb{E}[\ln p(X|C,T)] + \mathbb{E}[\ln p(T|Z, \mu, \Lambda)] + \mathbb{E}[\ln p(Z|\pi)] + \mathbb{E}[\ln p(\mu, \Lambda)] - \mathbb{E}[\ln q(T|Z)] - \mathbb{E}[\ln q(Z)] - \mathbb{E}[\ln q(\mu, \Lambda)]

where:

\mathbb{E}[\ln p(X|C,T)] = -\frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{K} \mathbb{E}[z_{nk}] \{ (r_{n|k} - x_n)^T C_n^{-1} (r_{n|k} - x_n) \} - \frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{K} \mathbb{E}[z_{nk}] \{ \mathrm{Tr}(E_{n|k} C_n^{-1}) + \ln|C_n| + D \ln(2\pi) \}   (27)

\mathbb{E}[\ln p(T|Z, \mu, \Lambda)] = \frac{1}{2} \sum_{k=1}^{K} N_k \{ \mathbb{E}[\ln|\Lambda_k|] - D \beta_k^{-1} - \nu_k \mathrm{Tr}(S_k W_k) - \nu_k \mathrm{Tr}(E_k W_k) - \nu_k (r_k - m_k)^T W_k (r_k - m_k) - D \ln(2\pi) \}   (28)

\mathbb{E}[\ln p(Z|\pi)] = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \ln \pi_k   (29)

\mathbb{E}[\ln p(\mu, \Lambda)] = \frac{1}{2} \sum_{k=1}^{K} \{ D \ln(\tfrac{\beta_0}{2\pi}) + \mathbb{E}[\ln|\Lambda_k|] - \frac{D \beta_0}{\beta_k} - \beta_0 \nu_k (m_k - m_0)^T W_k (m_k - m_0) \} + K \{ -\frac{\nu_0}{2} \ln|W_0| - \frac{\nu_0 D}{2} \ln 2 - \frac{D(D-1)}{4} \ln \pi - \sum_{i=1}^{D} \ln \Gamma(\tfrac{\nu_0 + 1 - i}{2}) \} + \frac{\nu_0 - D - 1}{2} \sum_{k=1}^{K} \mathbb{E}[\ln|\Lambda_k|] - \frac{1}{2} \sum_{k=1}^{K} \nu_k \mathrm{Tr}(W_0^{-1} W_k)   (30)

\mathbb{E}[\ln q(T|Z)] = -\frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{K} \mathbb{E}[z_{nk}] \{ D + D \ln(2\pi) + \ln|E_{n|k}| \}   (31)

\mathbb{E}[\ln q(Z)] = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \ln r_{nk}   (32)

\mathbb{E}[\ln q(\mu, \Lambda)] = \frac{1}{2} \sum_{k=1}^{K} \{ \mathbb{E}[\ln|\Lambda_k|] + D \ln(\tfrac{\beta_k}{2\pi}) - D \} + \sum_{k=1}^{K} \{ -\frac{\nu_k}{2} \ln|W_k| - \frac{\nu_k D}{2} \ln 2 - \frac{D(D-1)}{4} \ln \pi - \sum_{i=1}^{D} \ln \Gamma(\tfrac{\nu_k + 1 - i}{2}) + \frac{\nu_k - D - 1}{2} \mathbb{E}[\ln|\Lambda_k|] - \frac{\nu_k D}{2} \}   (33)
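Putting the pieces together, the optimisation described in Sections 3.2 and 3.2.1 amounts to a coordinate-ascent loop. The sketch below reuses the helper functions from the earlier sketches; for brevity it stops when the mixing coefficients stabilise rather than evaluating the full lower bound of equations (27)-(33), which is a simplification of the convergence check used in the paper.

```python
import numpy as np

def fit_mixture_with_uncertainty(X, C, K_init=20, max_iter=200, tol=1e-5):
    """Variational clustering of noisy data with per-point uncertainty covariances.

    Reuses init_prior_and_variational_state, posterior_true_position, log_rho_nk,
    responsibilities, update_normal_wishart and update_and_prune_mixing_coefficients
    from the sketches above."""
    N, D = X.shape
    state = init_prior_and_variational_state(X, C, K_init)
    m0, beta0, W0, nu0 = state['m0'], state['beta0'], state['W0'], state['nu0']
    m, beta, W, nu, pi = state['m'], state['beta'], state['W'], state['nu'], state['pi']

    for _ in range(max_iter):
        K = len(pi)
        r_post = np.empty((N, K, D))
        E_post = np.empty((N, K, D, D))
        log_rho = np.empty((N, K))
        # Update q(t_n | z_nk = 1) and the responsibilities (eqs. 11-16).
        for n in range(N):
            for k in range(K):
                r_post[n, k], E_post[n, k] = posterior_true_position(X[n], C[n], m[k], W[k], nu[k])
                log_rho[n, k] = log_rho_nk(X[n], C[n], r_post[n, k], E_post[n, k],
                                           m[k], beta[k], W[k], nu[k], pi[k])
        R = responsibilities(log_rho)
        # Update the Normal-Wishart factors (eqs. 18-25) and the mixing coefficients (eq. 26).
        _, beta, m, nu, W, _ = update_normal_wishart(R, r_post, E_post, m0, beta0, W0, nu0)
        new_pi, keep = update_and_prune_mixing_coefficients(R)
        m, beta, W, nu = m[keep], beta[keep], W[keep], nu[keep]
        converged = keep.all() and np.max(np.abs(new_pi - pi)) < tol
        pi = new_pi
        if converged:
            break
    return pi, m, beta, W, nu
```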

4. Results

We have tested our clustering algorithm on both synthetic and real data sets to demonstrate the benefit gained by taking the uncertainty information into account.

4.1. Synthetic Examples

Two synthetic datasets proposed by Corduneanu and Bishop [6] were used to test the performance of our clustering algorithm. The first data set contains 900 data points sampled from a mixture of three Gaussian components with means [0, −2], [0, 0] and [0, 2] and the same covariance [2, 0; 0, 0.2]. The second data set contains 600 data points sampled from a mixture of five Gaussian components with means [0, 0], [3, −3], [3, 3], [−3, 3] and [−3, −3] and covariances [1, 0; 0, 1], [1, 0.5; 0.5, 1], [1, −0.5; −0.5, 1], [1, 0.5; 0.5, 1] and [1, −0.5; −0.5, 1].

For each data point t_n, we generate the covariance matrix of uncertainty about the data point as C_n = [|x_n|/10, 0; 0, |y_n|/5] · λ and sample a noisy data point x_n from N(t_n, C_n), where λ is an experimental parameter controlling the overall scale of the noise. The noisy data points are then clustered using the variational Bayesian clustering algorithm from [6], and the results are compared with those of our clustering algorithm, which also takes the uncertainty information into account. The noisy input data are shown in figures 2(a) and 2(d) and the corresponding clustering results are shown in figures 2(b) and 2(e). The clustering algorithms were initialised with a mixture of 20 components and, while both were able to find the correct number of mixture components, the mixtures estimated by our algorithm were closer to the ground truth, as can be seen from visual inspection of the clusters and from the symmetric KL divergence scores computed against the ground truth mixtures.
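The noise-generation step just described can be sketched as follows. We read |x_n| and |y_n| as the absolute coordinates of the true point t_n, which is our interpretation of the paper's notation, and the small diagonal jitter is our own addition to keep the covariances invertible.

```python
import numpy as np

def make_noisy_synthetic_data(T_true, noise_scale, seed=0):
    """Per-point uncertainty C_n = diag(|x_n|/10, |y_n|/5) * lambda and a noisy
    observation drawn from N(t_n, C_n), as described in Section 4.1.

    T_true : (N, 2) true positions sampled from the ground-truth mixture."""
    rng = np.random.default_rng(seed)
    N = T_true.shape[0]
    C = np.zeros((N, 2, 2))
    C[:, 0, 0] = np.abs(T_true[:, 0]) / 10.0 * noise_scale
    C[:, 1, 1] = np.abs(T_true[:, 1]) / 5.0 * noise_scale
    C += 1e-6 * np.eye(2)                 # keep the covariances non-singular
    X = np.array([rng.multivariate_normal(t, Cn) for t, Cn in zip(T_true, C)])
    return X, C
```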


[Figure 2 panels: (a) noisy input data with uncertainty; (b) clustering results, without uncertainty sKL = 0.5029, with uncertainty sKL = 0.2531, noise scale = 1.5; (c) symmetric KL divergence against noise scale factor; (d) noisy input data with uncertainty; (e) clustering results, without uncertainty sKL = 0.4413, with uncertainty sKL = 0.1665, noise scale = 1.5; (f) symmetric KL divergence against noise scale factor.]

Figure 2. In the left column, each red ellipse represents the uncertainty on the data point it is centred on; for clarity, the uncertainty is shown for only one in every ten data points. In the middle column, the black ellipses represent the ground truth, the red ellipses represent clustering without uncertainty and the green ellipses represent clustering with uncertainty. In the right column, the red plot represents clustering without uncertainty and the green plot represents clustering with uncertainty.

Figures 2(c) and 2(f) show that our clustering algorithm gives consistently lower symmetric KL scores as λ varies from 0 to 3. Even in the simplifying case of negligible uncertainty and in the extreme case of very large uncertainty, the resulting mixture is at least as good as the one obtained without taking uncertainty into account.

4.2. Motion Segmentation

We applied our clustering algorithm to the problem of multi-body motion segmentation, which is an important preprocessing step for many computer vision applications. We tracked a number of feature points through a video sequence using the Kanade-Lucas-Tomasi (KLT) tracker [15]. The uncertainty of each tracked feature point was evaluated using the method described by Nickels and Hutchinson [13] for estimating uncertainty in SSD-based feature tracking, of which KLT tracking is an example.

We perform motion segmentation for the tth frame by clustering the velocities of the tracked features at frames t and t−1, therefore x_n = [f_n^t − f_n^{t−1}, f_n^{t−1} − f_n^{t−2}], where f_n^t denotes the position of the nth tracked feature at the tth frame. The uncertainty on f_n^t is denoted by the matrix C_n^t, and it follows that the uncertainty on x_n is C_n = [C_n^t, 0; 0, C_n^{t−1}].
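A sketch of this feature construction is given below, following the block-diagonal form of C_n stated above; the function and argument names are illustrative.

```python
import numpy as np

def velocity_features(f_t, f_tm1, f_tm2, C_t, C_tm1):
    """Clustering inputs for motion segmentation at frame t.

    f_t, f_tm1, f_tm2 : (N, 2) feature positions at frames t, t-1 and t-2.
    C_t, C_tm1        : (N, 2, 2) tracking uncertainties at frames t and t-1.
    Returns x_n = [f_t - f_{t-1}, f_{t-1} - f_{t-2}] and C_n = [C_t, 0; 0, C_{t-1}]."""
    X = np.hstack([f_t - f_tm1, f_tm1 - f_tm2])   # (N, 4) velocity vectors
    N = X.shape[0]
    C = np.zeros((N, 4, 4))
    C[:, :2, :2] = C_t
    C[:, 2:, 2:] = C_tm1
    return X, C
```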

Figure 3 shows the motion segmentation result on a well known sequence of two persons walking past each other. For this sequence, the features tracked on the clothing of the person on the left are often not well localised along the Y axis due to its stripey nature, and similarly some features tracked on the clothing of the person on the right are not well localised due to its lack of texture. The figure shows that our algorithm was able to appropriately take into account the uncertainty on the feature positions and find a mixture with three components to represent the motion vectors, corresponding to the two moving persons and the background. Occasionally, a fourth component is found, which usually corresponds to outliers such as features that have latched onto the wrong image patches because the original features have disappeared due to occlusion.


(a) Original video frames.

(b) Visualisation of the uncertainty on the tracked features; the green ellipses represent features tracked in the current frame and the red ellipses represent features tracked in the previous frame.

(c) Segmentation results obtained using the standard variational Bayes clustering algorithm [6] without taking into account the uncertainties of the tracked features.

(d) Segmentation results obtained when the uncertainties have been taken into account.

(e) Segmentation results obtained using Kanatani and Sugaya's method in [10].

Figure 3. Comparison of motion segmentation results on a well known sequence of two persons walking past each other. Frames 5, 7, 9, 11 and 13 are shown. The coloured circles serve only to distinguish between different clusters in the same frame; although the visualisation tries to keep the assignment of colours to clusters consistent from frame to frame, this is not guaranteed. The figure is best viewed in colour; please refer to the supplemental material for the full-size version and videos.

In comparison, if the uncertainty information were not taken into account, many features that were not well localised would be incorrectly segmented.

We also compared our method to the multi-stage optimisation based segmentation method by Kanatani and Sugaya [10], which is known to give very accurate results. Although their method is designed to operate on the trajectories of tracked features over the entire sequence, this is not possible for the sequence in figure 3 as one person is occluded by the other for part of the sequence. In our experiment, we therefore test their method at each frame on features that have been successfully tracked for the last four frames, so the input data consists of feature positions from frame t−4 to frame t. We also provided their method with the information that there are three objects in the scene. It can be seen that although our approach only uses a simple transformation model and velocity information from the last two frames, there are very few feature points that are incorrectly segmented, and even these can be eliminated if a longer temporal window is used.

We also tested our algorithm on the three datasets used in [10] (figure 4), though in this case, because only features that are consistently tracked are provided, i.e. those with low tracking uncertainty, the improvement to motion segmentation is not as significant. For these experiments we try to use feature velocity information from as many frames as possible by concatenating the velocity vectors from multiple frames into a single feature vector and clustering that instead.

One particular problem with our approach to motion segmentation is that sometimes the moving objects are much smaller than the background, and consequently the background features significantly outnumber the foreground features, causing too few components to be selected or too many features to be assigned to the background component, which has a very large prior probability. This can be mitigated by placing a strong prior on the mixing coefficients π and encouraging them to remain close to 1/K.

4.3. Clustering Microarray Gene Expression

Clustering algorithms are important tools in Bioinformatics for analysing microarray gene expression data and grouping genes with similar expression patterns [12]. Since microarray experiments are complex multi-stage procedures, the acquired data can exhibit high levels of measurement error introduced at the various stages of the experiments. It is therefore important to explicitly take these measurement errors into account during clustering to make the algorithm more robust to noise. Liu et al. [12] presented a clustering algorithm which estimates a mixture of Gaussians with spherical covariance matrices and takes into account zero-mean Gaussian measurement errors represented by diagonal covariance matrices.

We applied our clustering algorithm to the six-group dataset with 10 conditions and the seven-group dataset with 10 conditions from [12]. Tables 1 and 2 compare the average adjusted Rand index [9] of the data partitionings obtained by our algorithm and by the algorithm described in [12]; the top row of each table shows the variance of the zero-mean Gaussian noise added to the datasets.

In all experiments, we assume that the number of classes is unknown and needs to be estimated along with the mixture. Tables 1 and 2 show that when our algorithm is restricted to estimating spherical Gaussian components, it achieves results comparable to those obtained by Liu et al. [12]. If diagonal Gaussian components are estimated instead, significant improvements are obtained. Our algorithm also has the advantage that the appropriate number of mixture components is found during the clustering process, which is more efficient than estimating mixtures with different numbers of components and then choosing the best one using BIC.

5. Conclusion

In this paper, we presented a variational Bayes solution to the problem of estimating a mixture of Gaussians from noisy input data with known uncertainty distributions. The algorithm was tested on both synthetic and real datasets. We have shown that by incorporating uncertainty information into the clustering algorithm, we can improve the results of pattern recognition tasks such as multi-body motion segmentation of feature point trajectories and partitioning of microarray gene expressions. In all experiments, we also assumed that the correct number of mixture components is unknown and model selection is done within the variational Bayesian framework [6].

Acknowledgements

Thanks to Xuejun Liu and Magnus Rattray for providing the microarray gene expression data.

References

[1] S. V. Huffel and P. Lemmerling, editors. Total Least Squares and Errors-in-Variables Modeling. Kluwer Academic Publishers, Dordrecht, 2002.

[2] J. Bi and T. Zhang. Support vector classification with input data uncertainty. In Advances in Neural Information Processing Systems 17, pages 161-168, 2005.

[3] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

[4] B. B. Chaudhuri and P. R. Bhowmik. An approach of clustering data with noisy or imprecise feature measurement. Pattern Recognition Letters, 19(14):1307-1317, 1998.

[5] W. Chojnacki, M. J. Brooks, A. van den Hengel, and D. Gawley. On the fitting of surfaces to data with covariances. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(11):1294-1303, 2000.

[6] A. Corduneanu and C. M. Bishop. Variational Bayesian model selection for mixture distributions. In Proc. Artificial Intelligence and Statistics, 2001.

[7] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39:1-38, 1977.

[8] H. Hamdan and G. Govaert. Mixture model clustering of uncertain data. In Proc. Fuzzy Systems, pages 879-884, 2005.

[9] L. Hubert and P. Arabie. Comparing partitions. Journal of Classification, 2:193-218, 1985.

[10] K. Kanatani and Y. Sugaya. Multi-stage optimization for multi-body motion segmentation. In Australia-Japan Advanced Workshop on Computer Vision, 2003.

[11] M. Kumar and N. R. Patel. Clustering data with measurement errors. Technical report, 1998.

[12] X. Liu, K. K. Lin, B. Andersen, and M. Rattray. Including probe-level uncertainty in model-based gene expression clustering. BMC Bioinformatics, 8:98, 2007.

[13] K. Nickels and S. Hutchinson. Estimating uncertainty in SSD-based feature tracking. Image and Vision Computing, 20(1):47-58, 2002.

[14] G. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461-464, 1978.

[15] J. Shi and C. Tomasi. Good features to track. In Proc. CVPR, pages 593-600, 1994.

[16] J. Sun, A. Kaban, and S. Raychaudhury. Robust mixtures in the presence of measurement errors. In Proc. ICML, pages 847-854, 2007.


(a) Segmentation result of the first dataset (30 frames) obtained by clustering the concatenation of feature velocities at every other frame.

(b) Segmentation result of the second dataset (17 frames) obtained by clustering the concatenation of feature velocities at every frame.

(c) Segmentation result of the third dataset (100 frames) obtained by clustering the concatenation of feature velocities at every 5th frame for each 50-frame segment. The velocity of a feature at frame t is also computed with respect to its position at frame t−5.

Figure 4. Motion segmentation results on the three datasets used by Kanatani and Sugaya [10]. For each dataset, the top row shows the result when tracking uncertainty is not taken into account and the bottom row shows the result when it is.

Table 1. Average Adjusted Rand Index for the 6 group dataset (columns: variance of the added zero-mean Gaussian noise)

                              0.01              0.1               0.2
Without uncertainty           0.6048 ± 0.0462   0.6173 ± 0.0398   0.6167 ± 0.0204
PUMACLUST                     0.7474 ± 0.0308   0.7123 ± 0.0482   0.6882 ± 0.0301
Our algorithm (spherical)     0.7527 ± 0.0363   0.7065 ± 0.0406   0.6905 ± 0.0291
Our algorithm (diagonal)      0.8036 ± 0.0428   0.7543 ± 0.0558   0.7488 ± 0.0307

Table 2. Average Adjusted Rand Index for the 7 group dataset (columns: variance of the added zero-mean Gaussian noise)

                              0.0               0.01              0.1
Without uncertainty           0.5440 ± 0.0428   0.5463 ± 0.0441   0.5552 ± 0.0473
PUMACLUST                     0.6980 ± 0.0464   0.6866 ± 0.0428   0.6532 ± 0.0570
Our algorithm (spherical)     0.6990 ± 0.0415   0.6982 ± 0.0475   0.6467 ± 0.0442
Our algorithm (diagonal)      0.7380 ± 0.0451   0.7357 ± 0.0456   0.6824 ± 0.0557
