
Scalable Model-based Clustering for Large Databases Based on Data Summarization

Huidong Jin∗, Man-Leung Wong+, and Kwong-Sak Leung‡

(∗) Corresponding author. H.-D. Jin is with Division of Mathematical and Information Sciences, Commonwealth Scientific and Industrial Research Organisation, Australia. The corresponding address is GPO Box 664, Canberra, ACT 2601, Australia. Email: [email protected]. Phone: +61 2 62167258. Fax: +61 2 62167111.

(+) M.-L. Wong is with Department of Computing and Decision Sciences, Lingnan University, Tuen Mun, Hong Kong. E-mail: [email protected]. Phone: +852 26168093. Fax: +852 28922442.

(‡) K.-S. Leung is with Department of Computer Science and Engineering, the Chinese University of Hong Kong, Shatin, N.T., Hong Kong. E-mail: [email protected]. Phone: +852 26098408. Fax: +852 26035024.


Abstract

The scalability problem in data mining involves the development of methods for handling large databases with limited computational resources such as memory and computation time. In this paper, two scalable clustering algorithms, bEMADS and gEMADS, are presented based on the Gaussian mixture model. Both summarize data into subclusters and then generate Gaussian mixtures from their data summaries. Their core algorithm, EMADS, is defined on data summaries and approximates the aggregate behavior of each subcluster of data under the Gaussian mixture model. EMADS is provably convergent. Experimental results substantiate that both algorithms can run several orders of magnitude faster than expectation-maximization with little loss of accuracy.

Index Terms

Scalable clustering, Gaussian mixture model, expectation-maximization, data summary, maximum penalized likelihood estimate

I. INTRODUCTION

It is a challenge to discover valuable patterns, such as clusters, from large databases with limited memory and computation time. A data mining algorithm is said to be scalable when its running time grows linearly or sub-linearly with data size, given computational resources such as main memory [1]–[5]. It bridges the gap between the limited computational resources and large databases. Due to its wide applications, scalable clustering has drawn much attention recently [2], [4], [6]–[10]. Model-based clustering techniques assume a record $x_i \in \mathbb{R}^D$ ($i = 1, \cdots, N$) is drawn from a $K$-component mixture model $\Phi$ with probability $p(x_i|\Phi) = \sum_{k=1}^{K} [p_k \phi(x_i|\theta_k)]$. The component density $\phi(x_i|\theta_k)$ indicates cluster $k$; and $p_k$ is the prior on cluster $k$ ($p_k > 0$ and $\sum_{k=1}^{K} p_k = 1$). In the Gaussian mixture model, each component is a multivariate Gaussian


distribution with parameter $\theta_k$ consisting of a mean vector $\mu_k$ and a covariance matrix $\Sigma_k$:

$$\phi(x_i|\theta_k) = \frac{\exp\left\{-\frac{1}{2}(x_i - \mu_k)^T \Sigma_k^{-1} (x_i - \mu_k)\right\}}{(2\pi)^{\frac{D}{2}} |\Sigma_k|^{\frac{1}{2}}}. \quad (1)$$

Given $\Phi$, a crisp clustering is obtained by assigning each record $x_i$ to cluster $k$ where its posterior probability is maximal, i.e., $k = \arg\max_l p_l \phi(x_i|\theta_l)$. Among many clustering techniques [4], [8], [11], model-based clustering techniques have attracted much research interest [2], [6], [7], [9], [10], [12]. They have solid probabilistic foundations [12]–[16], and can handle clusters of various shapes and complicated databases [14], [16], [17]. They, especially the Gaussian mixture model, have been successfully applied to various real applications [17]–[20]. Due to its theoretical and practical significance, we focus on scalable clustering based on the Gaussian mixture model hereafter.

Expectation-Maximization (EM) effectively estimates maximum likelihood parameter values of a mixture model. Given the number of clusters $K$, the traditional EM algorithm for the Gaussian mixture model iteratively estimates the parameters to maximize the log-likelihood $L(\Phi) = \log\left[\prod_{i=1}^{N} p(x_i|\Phi)\right]$ as follows.

1) E-step: Given the mixture model parameters at iteration $j$, compute the membership probability $t_{ik}^{(j)}$:

$$t_{ik}^{(j)} = \left[p_k^{(j)} \phi\left(x_i \mid \mu_k^{(j)}, \Sigma_k^{(j)}\right)\right] \Big/ \sum_{l=1}^{K} \left[p_l^{(j)} \phi\left(x_i \mid \mu_l^{(j)}, \Sigma_l^{(j)}\right)\right]. \quad (2)$$

2) M-step: Given $t_{ik}^{(j)}$, update the mixture model parameters for $k = 1, \cdots, K$:

$$p_k^{(j+1)} = \sum_{i=1}^{N} t_{ik}^{(j)} / N, \quad (3)$$

$$\mu_k^{(j+1)} = \sum_{i=1}^{N} \left(t_{ik}^{(j)} x_i\right) \Big/ \left(N p_k^{(j+1)}\right), \quad (4)$$

$$\Sigma_k^{(j+1)} = \sum_{i=1}^{N} \left[t_{ik}^{(j)} \left(x_i - \mu_k^{(j+1)}\right)\left(x_i - \mu_k^{(j+1)}\right)^T\right] \Big/ \left(N p_k^{(j+1)}\right). \quad (5)$$
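For concreteness, the following is a minimal NumPy sketch of one iteration of these E- and M-steps (Eqs. (2)-(5)); the function and variable names (em_step, X, pi, mu, Sigma) are illustrative and not from the paper.

```python
import numpy as np

def em_step(X, pi, mu, Sigma):
    """One EM iteration for a Gaussian mixture, following Eqs. (2)-(5).

    X: (N, D) records; pi: (K,) priors; mu: (K, D) means; Sigma: (K, D, D) covariances.
    Returns the updated parameters and the membership probabilities t of shape (N, K).
    """
    N, D = X.shape
    K = len(pi)
    # E-step (Eq. (2)): t[i, k] is proportional to pi_k * phi(x_i | mu_k, Sigma_k).
    t = np.empty((N, K))
    for k in range(K):
        diff = X - mu[k]
        inv = np.linalg.inv(Sigma[k])
        maha = np.einsum('nd,de,ne->n', diff, inv, diff)   # (x - mu)^T Sigma^{-1} (x - mu)
        norm = (2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma[k]))
        t[:, k] = pi[k] * np.exp(-0.5 * maha) / norm
    t /= t.sum(axis=1, keepdims=True)
    # M-step (Eqs. (3)-(5)).
    Nk = t.sum(axis=0)                                      # N * p_k^{(j+1)}
    pi_new = Nk / N
    mu_new = (t.T @ X) / Nk[:, None]
    Sigma_new = np.empty_like(Sigma)
    for k in range(K):
        diff = X - mu_new[k]
        Sigma_new[k] = (t[:, k, None] * diff).T @ diff / Nk[k]
    return pi_new, mu_new, Sigma_new, t
```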


EM normally generates more accurate results than do hierarchical model-based clustering and the incremental EM algorithm [9]. In each iteration, EM scans the whole data set, prohibiting it from handling large databases [2]. Though some attempts have been made to speed up the algorithm, EM and its extensions are still computationally expensive for large databases [2], [6], [7]. For example, the lazy EM algorithm [6] evaluates the significance of each record at scheduled iterations and then proceeds for multiple iterations using only the significant records. But its speedup factor is less than 3. Moore [7] used a KD-tree to cache sufficient statistics of interesting data regions and then applied EM to the KD-tree nodes. His algorithm only suits very low-dimensional data sets [7]. The Scalable EM (SEM) algorithm [2] uses Extended EM (ExEM) to identify compressible data regions, and then only caches their sufficient statistics before loading the next batch of data. It invokes ExEM many times, hence its speedup factor is less than 10 [2]. Moreover, it is not easy to show whether its core algorithm ExEM converges or not [10].

In this paper, we propose two scalable clustering algorithms that can run several orders of magnitude faster than EM. Moreover, there is little loss of accuracy. They can generate much more accurate results than other scalable model-based clustering algorithms. Their basic idea is to categorize a data set into subclusters and then generate a mixture from their summary statistics by a specifically designed EM algorithm, EMADS (EM Algorithm for Data Summaries). EMADS can approximate the aggregate behavior of each subcluster under the Gaussian mixture model. Thus, EMADS can effectively generate good Gaussian mixtures.

The rest of the paper is organized as follows. The two proposed algorithms, bEMADS and gEMADS, are outlined in Section II. EMADS is developed in Section III. In Section IV, experimental results are presented for both real and synthetic data sets, followed by concluding comments in Section V.


II. TWO SCALABLE MODEL-BASED CLUSTERING ALGORITHMS

Our model-based clustering techniques are motivated by the following observations. In scalable clustering, a group of similar records usually needs to be handled as a single object in order to save computational resources. In model-based clustering, a component density function essentially determines the clustering results. A new component density function may be defined to remedy the possible accuracy loss caused by treating groups of records as single objects. For example, it can be defined on their summary statistics to approximate the aggregate behavior of groups of records under the original density function. Finally, its associated clustering algorithm, e.g., one derived from the general EM algorithm [14], can effectively generate a good mixture from the summary statistics.

Our scalable clustering algorithms have the following two phases.

1) A data set is partitioned into mutually exclusive subclusters, and only their summary statistics are cached in main memory in order to work within the restricted memory. Each subcluster contains data records that are similar to one another.

2) A Gaussian mixture is generated from these summary statistics directly using a specific EM algorithm, EMADS. EMADS will be derived in Section III based on a pseudo mixture model corresponding to the Gaussian mixture model.

As the summary statistics of subclusters are the only information passed from phase 1 to phase 2, they play a crucial role in the clustering quality. Note that a Gaussian distribution contains mean and covariance information. We include the zeroth, first, and second moments of a subcluster of records in the summary statistics as follows.

Definition 1: The data summary for subcluster $m$ is a triplet $DS_m = \{n_m, \nu_m, \Gamma_m\}$ ($m = 1, \cdots, M$, where $M$ indicates the number of subclusters), where $n_m$ is the number of its members; $\nu_m = \frac{\sum_{x_i \in DS_m} x_i}{n_m}$ is its mean; and $\Gamma_m = \frac{\sum_{x_i \in DS_m} x_i x_i^T}{n_m}$ is the mean of the cross products of the subcluster of data records. $x_i \in DS_m$ indicates that $x_i$ belongs to subcluster $m$.

Fig. 1. Illustration of gEMADS and bEMADS on the first synthetic data set DS1, plotted in the Attribute 1 vs. Attribute 2 space. (a) DS1 (the first synthetic data set; 10% of the samples are plotted). (b) A Gaussian mixture generated by gEMADS using the 16*16 grid structure. An "o" and its associated ellipse represent a generated Gaussian component. A "+" and its associated dashed ellipse indicates an original Gaussian component. Data summaries and records are indicated by "*" and "·" respectively.

We now outline two data summarization procedures for phase 1. Both of them read the data only once and sum up similar records into data summaries according to the definitions of subclusters. Both attempt to generate good data summaries using the restricted main memory. The grid-based data summarization procedure partitions a data set by imposing a multidimensional grid structure on the data space, and then incrementally sums up the records within a cell into its associated data summary. That is, the records within a cell form a subcluster. For simplicity, each attribute is partitioned into equal-width segments by the grids. For example, for the data illustrated in Fig. 1(a), we partition each attribute into 16 segments and obtain 197 data summaries, as shown in Fig. 1(b). This grid structure is termed 16*16 hereafter. To operate within the given main memory, we only store data summaries for the non-empty cells in a data summary array: the DS-array. A hash function is used to index these cells. For high-dimensional data, the grid structure is adaptively determined in order to better use the given main memory [10].
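As an illustration only, here is a minimal single-scan sketch of the grid-based data summarization under Definition 1; it uses a Python dict in place of the hash-indexed DS-array, assumes equal-width segments per attribute, and all names (grid_data_summaries, n_segments, etc.) are ours rather than the paper's.

```python
import numpy as np

def grid_data_summaries(X, n_segments=16):
    """Grid-based data summarization (cf. Definition 1), one pass over the data.

    Each attribute is split into n_segments equal-width segments; records falling
    into the same cell form a subcluster.  Only non-empty cells are kept, indexed
    by a dict that plays the role of the hash-indexed DS-array.
    Returns arrays of n_m, nu_m (means), and Gamma_m (means of cross products).
    """
    lo, hi = X.min(axis=0), X.max(axis=0)
    width = (hi - lo) / n_segments
    width[width == 0] = 1.0                       # guard against constant attributes
    cells = {}
    for x in X:                                   # single scan of the data set
        idx = np.minimum(((x - lo) / width).astype(int), n_segments - 1)
        n, s, sq = cells.get(tuple(idx), (0, 0.0, 0.0))
        # accumulate the zeroth, first, and second moments of the cell
        cells[tuple(idx)] = (n + 1, s + x, sq + np.outer(x, x))
    n_m = np.array([c[0] for c in cells.values()])
    nu_m = np.array([c[1] / c[0] for c in cells.values()])
    Gamma_m = np.array([c[2] / c[0] for c in cells.values()])
    return n_m, nu_m, Gamma_m
```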

Fig. 2. Illustration of gEMADS and bEMADS on the California Housing Data in the scaled Latitude-Longitude space. (a) The California Housing Data. (b) A Gaussian mixture generated by bEMADS from the 783 data summaries generated by BIRCH. An "o" and its associated ellipse represent a generated Gaussian component. Data summaries and records are indicated by "*" and "·" respectively.

If Euclidean distance is used to define the similarity between two records, we may employ existing distance-based clustering techniques, say, BIRCH, to generate subclusters. BIRCH uses Clustering Features (CF) and a CF-tree to summarize cluster representations [4]. It scans the data set to build an initial in-memory CF-tree, which can be viewed as a multilevel compression of the data set that tries to preserve its inherent clustering structure. It then applies a hierarchical agglomerative clustering algorithm to cluster the leaf nodes [4]. If clusters are not spherical in shape, such as the ones in Fig. 2(a), BIRCH does not perform well since it uses the notion of radius to control the boundary of a cluster [2]. It was modified to generate data summaries in our implementation [10]. The 783 data summaries generated by BIRCH for the data in Fig. 2(a) are plotted in Fig. 2(b). Compared with the hash indexing in the grid-based data summarization procedure, BIRCH's data summarization procedure uses tree indexing. It can make better use of memory, while its counterpart is simpler to implement and manipulate.

Combining BIRCH's and the grid-based data summarization procedures with EMADS in phase 2, we construct two scalable model-based clustering algorithms, bEMADS and gEMADS, respectively.

III. EMADS

Before deriving EMADS, we first examine the aggregate behavior of each subcluster under the Gaussian mixture model. Since the similar records within a subcluster have similar membership probability vectors $t_{ik}^{(j)}$, their aggregate behavior can be approximated. If $t_{ik}^{(j)}$ is approximated by $r_{mk}^{(j)}$ for $x_i$ in subcluster $m$, we may rewrite Eq.(5) in the M-step of EM as

$$N p_k^{(j+1)} \cdot \Sigma_k^{(j+1)} = \sum_{i=1}^{N} \left[t_{ik}^{(j)} \left(x_i - \mu_k^{(j+1)}\right)\left(x_i - \mu_k^{(j+1)}\right)^T\right] \quad (6)$$

$$= \sum_{m=1}^{M} \sum_{x_i \in DS_m} \left[t_{ik}^{(j)} \left(x_i - \mu_k^{(j+1)}\right)\left(x_i - \mu_k^{(j+1)}\right)^T\right] \quad (7)$$

$$\approx \sum_{m=1}^{M} r_{mk}^{(j)} n_m \left[\left(\Gamma_m - \nu_m\nu_m^T\right) + \left(\nu_m - \mu_k^{(j+1)}\right)\left(\nu_m - \mu_k^{(j+1)}\right)^T\right]. \quad (8)$$

Intuitively, approximating $\left(\Gamma_m - \nu_m\nu_m^T\right)$ in Eq.(8) with the cross product of a vector, say, $\delta_m\delta_m^T \approx \left(\Gamma_m - \nu_m\nu_m^T\right)$, we can then treat this vector $\delta_m$ in the same way as the vector $\left(\nu_m - \mu_k^{(j+1)}\right)$. We let $\delta_m$ be the first covariance vector of the matrix $\left(\Gamma_m - \nu_m\nu_m^T\right)$, i.e., $\delta_m = \sqrt{\lambda_m}\, c_m$, where $c_m$ is the component vector corresponding to the largest eigenvalue $\lambda_m$ of the matrix. The first covariance vector $\delta_m$ closely approximates the matrix in the sense that $\delta_m = \arg\min_y \left\|\left(\Gamma_m - \nu_m\nu_m^T\right) - yy^T\right\|$ [10]. Here $\|\cdot\|$ indicates the Frobenius norm. We call $s_m = \{n_m, \nu_m, \delta_m\}$ the simplified data summary for the $m$th subcluster. Similar to the term $(x_i - \mu_k)^T \Sigma_k^{-1} (x_i - \mu_k)$ in the Gaussian density, we can insert another term $\delta_m^T \Sigma_k^{-1} \delta_m$ into our new density function. This term may reflect the aggregate behavior, say, the variance information, of the subcluster in the density function. In addition, since each record is inaccessible when we compute over subclusters, we replace $x_i$ in subcluster $m$ with $\nu_m$. This gives us a pseudo density function based only on the subcluster to which a record $x_i$ belongs.
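The first covariance vector can be computed from a data summary with a single symmetric eigendecomposition. The sketch below (our notation, not the paper's) converts a full data summary $DS_m$ into the simplified summary $s_m = \{n_m, \nu_m, \delta_m\}$.

```python
import numpy as np

def first_covariance_vector(Gamma_m, nu_m):
    """delta_m = sqrt(lambda_max) * c_max for the subcluster covariance
    Gamma_m - nu_m nu_m^T, i.e. its best rank-one approximation in Frobenius norm."""
    S = Gamma_m - np.outer(nu_m, nu_m)
    eigvals, eigvecs = np.linalg.eigh(S)      # eigenvalues in ascending order
    lam = max(eigvals[-1], 0.0)               # guard against tiny negative round-off
    return np.sqrt(lam) * eigvecs[:, -1]
```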

Definition 2: For $x_i \in DS_m$, its probability under the pseudo probability density function $\psi$ having the same parameter $\theta_k = (\mu_k, \Sigma_k)$ as the $k$th Gaussian component is

$$\psi(x_i \in DS_m|\theta_k) \triangleq \psi(s_m|\theta_k) = \frac{\exp\left\{-\frac{1}{2}\left[\delta_m^T \Sigma_k^{-1} \delta_m + (\nu_m - \mu_k)^T \Sigma_k^{-1} (\nu_m - \mu_k)\right]\right\}}{(2\pi)^{\frac{D}{2}} |\Sigma_k|^{\frac{1}{2}}}. \quad (9)$$
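As a small illustration, Eq.(9) can be evaluated directly from a simplified data summary; the sketch below uses our own names and a naive matrix inverse purely for readability.

```python
import numpy as np

def pseudo_density(nu_m, delta_m, mu_k, Sigma_k):
    """psi(s_m | theta_k) from Eq. (9): a Gaussian-like density evaluated at the
    subcluster mean nu_m with the extra term delta_m^T Sigma_k^{-1} delta_m."""
    D = len(nu_m)
    inv = np.linalg.inv(Sigma_k)
    diff = nu_m - mu_k
    expo = -0.5 * (delta_m @ inv @ delta_m + diff @ inv @ diff)
    norm = (2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma_k))
    return np.exp(expo) / norm
```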

This density function is not a genuine density function, and is mainly designed for our algorithm derivation and analyses. The value of every $x_i$ in the $D$-dimensional data space under this density function is positive, and normally smaller than the one under the $k$th Gaussian component. The difference may be quite large when $\Sigma_k$ is degenerate, i.e., $|\Sigma_k| \approx 0$. But it is normally insignificant, especially when the subcluster is not too skewed and its granularity is reasonably small, as empirically shown in Section IV-C. By including the first covariance vector $\delta_m$, this pseudo density function unifies the covariance (or distribution) information of a subcluster under the Gaussian density and the aggregate behavior of the subcluster in the traditional EM algorithm. On the one hand, if a subcluster of data distributes along the principal components of $\Sigma_k$, e.g., $\delta_m$ is parallel with the first covariance vector of $\Sigma_k$, then $\delta_m^T \Sigma_k^{-1} \delta_m$ is relatively small, and the pseudo density function is relatively large. This accords with the Gaussian density under which data in the denser region have higher probabilities. On the other hand, its associated algorithm derived from the general EM algorithm [14], EMADS, approximates the aggregate behavior indicated by Eq.(8) using Eq.(15), since $\delta_m\delta_m^T$ is one of the best approximations of $\left(\Gamma_m - \nu_m\nu_m^T\right)$. This point is also substantiated by the experiments in Section IV-C. Thus, the pseudo density function is practicable from the computational viewpoint.

Based on Eq.(9), a pseudo mixture model $\Psi$ is easily constructed. The probability for $x_i \in DS_m$ is

$$p(x_i \in DS_m|\Psi) \triangleq p(s_m|\Psi) = \sum_{k=1}^{K} [p_k \psi(s_m|\mu_k, \Sigma_k)]. \quad (10)$$

The pseudo mixture model $\Psi$ has the same parameters as the Gaussian mixture model $\Phi$. The pseudo model approximates the aggregate behavior of each subcluster under $\Phi$. Thus, we can get good Gaussian mixtures $\Phi$ through finding good estimates of $\Psi$. To filter out degenerate mixtures [13], i.e., $|\Sigma_k| \approx 0$ for some $k$, we choose a conjugate prior of each covariance matrix $\Sigma_k$. The conjugate prior for $\Sigma_k^{-1}$ is a Wishart distribution

$$W_k\left(\Sigma_k^{-1} \mid \alpha_k, \Omega_k\right) = \frac{|\Omega_k|^{\frac{\alpha_k}{2}} \left|\Sigma_k^{-1}\right|^{\frac{\alpha_k - D - 1}{2}} \exp\left\{-\frac{\mathrm{tr}\left(\Omega_k \Sigma_k^{-1}\right)}{2}\right\}}{2^{\frac{\alpha_k D}{2}} \pi^{\frac{(D-1)D}{4}} \prod_{d=1}^{D} \Gamma\left(\frac{\alpha_k + 1 - d}{2}\right)},$$

where the constant $\alpha_k$ and the matrix $\Omega_k$ are parameters. Then we have a penalized log-likelihood to measure the fitness of the mixture over the subclusters:

$$L_p(\Psi) = L(\Psi) + \log p(\Psi) = \sum_{m=1}^{M} n_m \log\left\{\sum_{k=1}^{K} [p_k \psi(s_m|\mu_k, \Sigma_k)]\right\} + \sum_{k=1}^{K} \log W_k\left(\Sigma_k^{-1} \mid \alpha_k, \Omega_k\right). \quad (11)$$

Then, through finding maximum penalized likelihood estimates of the pseudo model, we get good Gaussian mixtures. EMADS, described in Algorithm 1, calculates maximum penalized likelihood estimates iteratively. It is derived based on the general EM algorithm [14]. The basic idea is to view the cluster labels of subclusters as missing values and associate this incomplete-data problem with a complete-data problem for which the maximum penalized likelihood estimate is computationally tractable.

If $x_i \in DS_m$ is from cluster $k$, its zero-one indicator vector $z_m = [z_{1m}, \cdots, z_{Km}]^T$ equals 0 except $z_{km} = 1$. The complete data vector $y_i$, augmented by $z_m$, is $\left[x_i^T, z_m^T\right]^T$. The likelihood of the $N$ complete records is

$$L_c(\mathbf{y}|\Psi) \triangleq L_c(y_1, \cdots, y_N|\Psi) = \prod_{i=1}^{N} \left[p(x_i \in DS_m|z_m, \Psi)\, p(z_m|\Psi)\right] = \prod_{i=1}^{N} \left[\psi(x_i|\theta_k) p_k\right] = \prod_{i=1}^{N} \prod_{k=1}^{K} \left[\psi(x_i|\theta_k) p_k\right]^{z_{km}} = \prod_{m=1}^{M} \prod_{k=1}^{K} \left[\psi(s_m|\theta_k) p_k\right]^{z_{km} n_m}.$$

The incomplete-data log-likelihood $L(\Psi)$ is obtained from $L_c(\mathbf{y}|\Psi)$ by integrating over all possible $\mathbf{y}$ in which $\mathbf{x}$ is embedded,

$$L(\Psi) \triangleq \log p(\mathbf{x}|\Psi) = \int \log L_c(\mathbf{y}|\Psi)\, d\mathbf{z} = \sum_{m=1}^{M} \left[n_m \log p(s_m|\Psi)\right]. \quad (16)$$


Algorithm 1: (EMADS)

1) Initialization: Set the parameters $\alpha_k$ and $\Omega_k$ for the prior Wishart distribution, set iteration $j = 0$, and initialize the parameters in the mixture model: $p_k^{(j)} (> 0)$, $\mu_k^{(j)}$ and $\Sigma_k^{(j)}$ such that $\sum_{k=1}^{K} p_k^{(j)} = 1$ and $\Sigma_k^{(j)}$ is symmetric and positive definite ($k = 1, \cdots, K$).

2) E-step: Given the mixture $\Psi^{(j)}$, compute the membership probability $r_{mk}^{(j)}$ for $s_m$:

$$r_{mk}^{(j)} = p_k^{(j)} \psi\left(s_m \mid \mu_k^{(j)}, \Sigma_k^{(j)}\right) \Big/ \sum_{l=1}^{K} \left[p_l^{(j)} \psi\left(s_m \mid \mu_l^{(j)}, \Sigma_l^{(j)}\right)\right]. \quad (12)$$

3) M-step: Given $r_{mk}^{(j)}$, update the mixture model parameters using $s_m$ for $k = 1, \cdots, K$:

$$p_k^{(j+1)} = \frac{1}{N} \sum_{m=1}^{M} \left(n_m r_{mk}^{(j)}\right), \quad (13)$$

$$\mu_k^{(j+1)} = \sum_{m=1}^{M} \left(n_m r_{mk}^{(j)} \nu_m\right) \Big/ \sum_{m=1}^{M} \left(n_m r_{mk}^{(j)}\right) = \sum_{m=1}^{M} \left(n_m r_{mk}^{(j)} \nu_m\right) \Big/ \left(N p_k^{(j+1)}\right), \quad (14)$$

$$\Sigma_k^{(j+1)} = \frac{\sum_{m=1}^{M} n_m r_{mk}^{(j)} \left[\delta_m\delta_m^T + \left(\nu_m - \mu_k^{(j+1)}\right)\left(\nu_m - \mu_k^{(j+1)}\right)^T\right] + \Omega_k}{N p_k^{(j+1)} + (\alpha_k - D - 1)}. \quad (15)$$

4) Termination: If $\left|L_p\left(\Psi^{(j+1)}\right) - L_p\left(\Psi^{(j)}\right)\right| \geq \varepsilon \left|L_p\left(\Psi^{(j)}\right)\right|$, then set $j$ to $j + 1$ and go to step 2.
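Putting Eqs.(12)-(15) together, a minimal NumPy sketch of one EMADS iteration over the simplified data summaries might look as follows; the names (emads_step, n, nu, delta, and so on) are ours, and the Wishart prior parameters alpha and Omega are assumed to be chosen to satisfy Theorem 1 below (e.g. alpha_k = zeta + D + 1 and Omega_k = zeta * I).

```python
import numpy as np

def emads_step(n, nu, delta, pi, mu, Sigma, alpha, Omega):
    """One EMADS iteration (Eqs. (12)-(15)) on simplified data summaries.

    n: (M,) subcluster sizes; nu: (M, D) subcluster means; delta: (M, D) first
    covariance vectors; pi: (K,); mu: (K, D); Sigma: (K, D, D); alpha: (K,) and
    Omega: (K, D, D) are the Wishart prior parameters.
    """
    M, D = nu.shape
    K = len(pi)
    N = n.sum()
    # E-step (Eq. (12)): membership probabilities r[m, k] from the pseudo density.
    r = np.empty((M, K))
    for k in range(K):
        inv = np.linalg.inv(Sigma[k])
        diff = nu - mu[k]
        expo = -0.5 * (np.einsum('md,de,me->m', delta, inv, delta)
                       + np.einsum('md,de,me->m', diff, inv, diff))
        norm = (2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma[k]))
        r[:, k] = pi[k] * np.exp(expo) / norm
    r /= r.sum(axis=1, keepdims=True)
    # M-step (Eqs. (13)-(15)): the weights n_m * r_mk replace the per-record t_ik.
    w = n[:, None] * r
    Nk = w.sum(axis=0)                            # N * p_k^{(j+1)}
    pi_new = Nk / N
    mu_new = (w.T @ nu) / Nk[:, None]
    Sigma_new = np.empty_like(Sigma)
    for k in range(K):
        diff = nu - mu_new[k]
        S = (w[:, k, None] * delta).T @ delta + (w[:, k, None] * diff).T @ diff
        Sigma_new[k] = (S + Omega[k]) / (Nk[k] + alpha[k] - D - 1)
    return pi_new, mu_new, Sigma_new, r
```

In a full run one would iterate this step, tracking the penalized log-likelihood of Eq.(11) until the relative change falls below the tolerance used in the termination test of Algorithm 1.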

As discussed above, we maximize the penalized likelihood, i.e., $L_p(\Psi, \mathbf{x}) = L(\Psi) + \log p(\Psi)$. According to the general EM algorithm, we calculate the Q-function in the E-step, i.e., the expected complete-data posterior conditional on the current parameter value $\Psi^{(j)}$ and $\mathbf{x}$ (which is replaced by $\mathbf{s}$), as follows.

$$Q\left(\Psi; \Psi^{(j)}\right) = E\left[\log L_c(\mathbf{y}|\Psi) + \log p(\Psi) \mid \mathbf{x}, \Psi^{(j)}\right] \quad (17)$$

$$\triangleq E\left[\log L_c(\mathbf{y}|\Psi) + \log p(\Psi) \mid \mathbf{s}, \Psi^{(j)}\right] \quad (18)$$

$$= \sum_{m=1}^{M} n_m \left\{\sum_{k=1}^{K} E\left[Z_{km}|\mathbf{s}, \Psi^{(j)}\right]\left[\log p_k + \log \psi(s_m|\mu_k, \Sigma_k)\right]\right\} + \sum_{k=1}^{K} \log W_k\left(\Sigma_k^{-1}|\alpha_k, \Omega_k\right)$$

$$= \sum_{m=1}^{M} n_m \left\{\sum_{k=1}^{K} r_{mk}^{(j)} \left[\log p_k + \log \psi(s_m|\mu_k, \Sigma_k)\right]\right\} + \sum_{k=1}^{K} \log W_k\left(\Sigma_k^{-1}|\alpha_k, \Omega_k\right). \quad (19)$$

Here $Z_{km}$ is a random variable corresponding to $z_{km}$, and $r_{mk}^{(j)}$ is the posterior probability that subcluster $m$ belongs to component $k$. Based on Bayes' rule,

$$r_{mk}^{(j)} \triangleq E\left[Z_{km} \mid \mathbf{s}, \Psi^{(j)}\right] = p_{\Psi^{(j)}}\left(Z_{km} = 1 \mid \mathbf{s}\right) = \frac{p_k^{(j)} \psi\left(s_m \mid \mu_k^{(j)}, \Sigma_k^{(j)}\right)}{\sum_{l=1}^{K} p_l^{(j)} \psi\left(s_m \mid \mu_l^{(j)}, \Sigma_l^{(j)}\right)} = \frac{p_k^{(j)} \psi\left(s_m \mid \mu_k^{(j)}, \Sigma_k^{(j)}\right)}{p\left(s_m \mid \Psi^{(j)}\right)}.$$

This leads to Eq.(12).

Now we maximize $Q\left(\Psi; \Psi^{(j)}\right)$ with respect to $\Psi$. We introduce a Lagrange multiplier $\lambda$ to handle the constraint $\sum_{k=1}^{K} p_k = 1$. Differentiating $Q\left(\Psi; \Psi^{(j)}\right) - \lambda\left(\sum_{k=1}^{K} p_k - 1\right)$ with respect to $p_k$ and setting these derivatives to 0, we have $\sum_{m=1}^{M} \left(n_m r_{mk}^{(j)}\right)\frac{1}{p_k} - \lambda = 0$ for $k = 1, \cdots, K$. We then sum up these $K$ equations to get

$$\lambda \sum_{k=1}^{K} p_k = \sum_{k=1}^{K} \sum_{m=1}^{M} \left(n_m r_{mk}^{(j)}\right) = \sum_{m=1}^{M} \left(n_m \sum_{k=1}^{K} r_{mk}^{(j)}\right) = N. \quad (20)$$

This leads to $\lambda = N$, and then Eq.(13). Differentiating $Q\left(\Psi; \Psi^{(j)}\right)$ with respect to $\mu_k$ and equating the partial derivative to zero gives

$$\frac{\partial Q\left(\Psi; \Psi^{(j)}\right)}{\partial \mu_k} = \sum_{m=1}^{M} \left[n_m r_{mk}^{(j)} \Sigma_k^{-1} \left(\nu_m - \mu_k\right)\right] = 0.$$

This gives the re-estimation of $\mu_k$ as in Eq.(14). For the parameters $\Sigma_k$, we first have the following partial derivatives:

$$\frac{\partial \log \psi(s_m|\mu_k, \Sigma_k)}{\partial \Sigma_k^{-1}} = \frac{1}{2}\left\{2\Sigma_k - \mathrm{diag}(\Sigma_k) - 2\left[\delta_m\delta_m^T + (\nu_m - \mu_k)(\nu_m - \mu_k)^T\right] + \mathrm{diag}\left(\delta_m\delta_m^T + (\nu_m - \mu_k)(\nu_m - \mu_k)^T\right)\right\}, \quad (21)$$

$$\frac{\partial \log W_k\left(\Sigma_k^{-1} \mid \alpha_k, \Omega_k\right)}{\partial \Sigma_k^{-1}} = \frac{1}{2}\left\{(\alpha_k - D - 1)\left(2\Sigma_k - \mathrm{diag}(\Sigma_k)\right) - \left[2\Omega_k - \mathrm{diag}(\Omega_k)\right]\right\}. \quad (22)$$


Taking the derivative of $Q\left(\Psi; \Psi^{(j)}\right)$ with respect to $\Sigma_k^{-1}$, we get

$$\frac{\partial Q\left(\Psi; \Psi^{(j)}\right)}{\partial \Sigma_k^{-1}} = \frac{1}{2}\left[2A_k - \mathrm{diag}(A_k)\right]$$

where

$$A_k = \sum_{m=1}^{M} n_m r_{mk}^{(j)} \left\{\Sigma_k - \left[\delta_m\delta_m^T + (\nu_m - \mu_k)(\nu_m - \mu_k)^T\right]\right\} + \left[(\alpha_k - D - 1)\Sigma_k - \Omega_k\right].$$

Setting the derivative to zero, i.e., $2A_k - \mathrm{diag}(A_k) = 0$, implies that $A_k = 0$. This leads to Eq.(15).

Our EMADS can directly and effectively generate good Gaussian mixtures from the data summaries due to its good approximation to the EM algorithm for the Gaussian mixture model and the elaborate inclusion of the first covariance vector of each subcluster in the pseudo density function. EMADS can also save some main memory by using the first covariance vectors rather than the full covariance matrices [10]. Similar to EM, EMADS is easy to implement because only four main equations (Eqs.(12)-(15)) are involved. Furthermore, EMADS is guaranteed to terminate, as supported by the following theorem.

Theorem 1: Assume $\alpha_k > D + 1$ and $\lambda_{\min}(\Omega_k) \geq \zeta > 0$ for any $k$, where $\lambda_{\min}(\Omega_k)$ indicates $\Omega_k$'s smallest eigenvalue. Then, the penalized log-likelihood $L_p(\Psi)$ for EMADS converges to a value $L_p^*$.

Proof: First, we prove the feasibility of EMADS. By induction, we show that $\Sigma_k^{(j)}$ in Eq.(15) is always symmetric and positive definite, and that $p_k^{(j)}$ and $r_{mk}^{(j)}$ are positive. This is correct at the initialization, where $\Sigma_k^{(0)}$ is symmetric and positive definite, and $p_k^{(0)}$ is positive. According to Eq.(12), $r_{mk}^{(0)}$ is positive too. If $\Sigma_k^{(j)}$ is symmetric and positive definite, and $p_k^{(j)}$ and $r_{mk}^{(j)}$ are positive in iteration $j$, then we prove that this is true in iteration $j + 1$. Both $\delta_m\delta_m^T$ and $\left(\nu_m - \mu_k^{(j+1)}\right)\left(\nu_m - \mu_k^{(j+1)}\right)^T$ are symmetric positive semi-definite matrices. Noting $\lambda_{\min}(\Omega_k) \geq \zeta > 0$ and Eq.(15), $\Sigma_k^{(j+1)}$ must be symmetric and positive definite. Moreover,

$$\lambda_{\min}\left(\Sigma_k^{(j+1)}\right) > \frac{\lambda_{\min}(\Omega_k)}{N + \alpha_k - D - 1} \geq \frac{\zeta}{N + \alpha_k - D - 1} > 0.$$

Clearly, $r_{mk}^{(j+1)} > 0$ according to Eq.(12), and so is $p_k^{(j+1)}$.

We then prove the non-decrease of the penalized likelihood value. As we can see in the derivation above, the Q-function value does not decrease, i.e., $Q\left(\Psi^{(j+1)}; \Psi^{(j)}\right) \geq Q\left(\Psi^{(j)}; \Psi^{(j)}\right)$. In addition, we derive EMADS following the general EM algorithm, thus EMADS is an instance of it. The non-decrease of the Q-function value leads to the non-decrease of the penalized likelihood value, i.e., $L_p\left(\Psi^{(j+1)}\right) \geq L_p\left(\Psi^{(j)}\right)$ [16].

We finally show that the penalized likelihood is bounded. From $\frac{\partial W_k\left(\Sigma_k^{-1}\mid\alpha_k, \Omega_k\right)}{\partial \Sigma_k^{-1}} = 0$, the maximum of the Wishart distribution, denoted by $W_k^{\max}$, is reached at $\Sigma_k = \frac{\Omega_k}{\alpha_k - D - 1}$. We rewrite the penalized log-likelihood as

$$L_p(\Psi) = L(\Psi) + \log p(\Psi) = \sum_{m=1}^{M} n_m \log\left\{\left(\prod_{k=1}^{K} W_k\left(\Sigma_k^{-1}\mid\alpha_k, \Omega_k\right)\right)^{\frac{1}{N}} \sum_{k=1}^{K} \left(p_k \psi(s_m|\mu_k, \Sigma_k)\right)\right\}.$$

For each $k$, we have the following inequality:

$$\left[\prod_{k=1}^{K} W_k\left(\Sigma_k^{-1}\mid\alpha_k, \Omega_k\right)\right]^{\frac{1}{N}} p_k \psi(s_m|\mu_k, \Sigma_k) \leq \left[c_k \left|\Sigma_k^{-1}\right|^{\frac{N + \alpha_k - D - 1}{2}} \exp\left(-\frac{D}{2}\left(|\Omega_k|\left|\Sigma_k^{-1}\right|\right)^{\frac{1}{D}}\right)\right]^{\frac{1}{N}}, \quad (23)$$

where $c_k = \frac{|\Omega_k|^{\frac{\alpha_k}{2}} (2\pi)^{-\frac{ND}{2}} \prod_{j \neq k} W_j^{\max}}{2^{\frac{\alpha_k D}{2}} \pi^{\frac{(D-1)D}{4}} \prod_{d=1}^{D} \Gamma\left(\frac{\alpha_k + 1 - d}{2}\right)}$ is a constant. The right-hand side of Eq.(23) is positive, and reaches its maximum at $\left|\Sigma_k^{-1}\right| = \left(\frac{N + \alpha_k - D - 1}{|\Omega_k|^{\frac{1}{D}}}\right)^{D}$. In particular, $|\Sigma_k| \geq \left(\lambda_{\min}(\Sigma_k)\right)^D > \left(\frac{\zeta}{N + \alpha_k - D - 1}\right)^D$. The right-hand side of Eq.(23) is therefore not greater than $(c_k)^{\frac{1}{N}} \left(\frac{\zeta}{N + \alpha_k - D - 1}\right)^{-\frac{D(N + \alpha_k - D - 1)}{2N}}$ and thus has an upper bound. So does $L_p(\Psi)$.

Thus, $L_p\left(\Psi^{(j)}\right)$ converges monotonically to a value $L_p^*$. This completes the proof.

We can easily set $\alpha_k$ and $\Omega_k$ to satisfy the requirements of Theorem 1; in our experiments, e.g., $\alpha_k = \zeta + D + 1$ and $\Omega_k = \zeta \cdot U$, where $\zeta$ is positive and $U$ is a $D \times D$ identity matrix. We shall see in Section IV-A that a reasonable setting of $\zeta$ has little effect on the performance of EMADS.


IV. EXPERIMENTAL RESULTS

To highlight the performance of bEMADS and gEMADS, we compare them with several model-based clustering algorithms: EM, which is the traditional EM algorithm for the Gaussian mixture model; sampEM, which is EM working on 5% random samples; and bWEM and gWEM, which are WEM (Weighted EM) working on the data summaries generated by the two data summarization procedures, respectively, but without considering the covariance information. WEM can be viewed as a simplified EMADS with $\delta_m = 0$ in Eqs.(12)-(15) and (9). The last two algorithms can be interpreted as density-biased-sampling model-based clustering techniques [8], [10].

All the algorithms were coded in MATLAB and run on a Sun Enterprise E4500 server. The results are reported based on 10 independent runs. The data summarization procedures were set to generate at most 4,000 subclusters, while there was no restriction on the amount of memory used by either EM or sampEM [10]. Though gEMADS performs as well as bEMADS (especially for low-dimensional data), it is mainly used to examine the approximation of EMADS to EM in Section IV-C.

We do not include SEM [2] in our detailed experimental comparison in this paper. The core algorithm of SEM, ExEM, is derived in a heuristic way, and it is not easy to show whether it converges or not [10]. In contrast, the algorithms we consider, such as EM, WEM, and EMADS, are provably convergent. Furthermore, SEM invokes ExEM to identify the compressible regions of data in memory, and then compresses these regions and reads in more data. In order to squash the whole data set into memory, ExEM has to be invoked many times, which leads to SEM's speedup factor being smaller than 10 with respect to EM [2]. As a comparison, our scalable algorithms can run more than 200 times faster than EM [5]. In addition, if the attributes are independent within each cluster, i.e., the covariance matrix $\Sigma_k$ of each Gaussian distribution is a diagonal matrix, the clustering accuracy of ExEM is significantly worse than that of EM and EMACF, as shown in our previous study [15]. EMACF is a simplified version of EMADS where data summaries are replaced with clustering features. EMACF can only run when attributes are independent within each cluster [15].

TABLE I
PERFORMANCE OF FOUR ALGORITHMS ON THREE REAL DATA SETS. FOR THE FOREST COVERTYPE DATA, EM RUNS ON 15% SAMPLES. N, D, K, AND M INDICATE THE NUMBERS OF RECORDS, ATTRIBUTES, CLUSTERS, AND SUBCLUSTERS, RESPECTIVELY.

Forest CoverType Data (N = 581,012, D = 5, K = 15, M = 3,836):
  log-likelihood: bEMADS -3.083 ± 0.011; bWEM -3.278 ± 0.053; EM* -3.078 ± 0.017; sampEM -3.086 ± 0.018
  time (sec.): bEMADS 7,985.5 ± 3,635.2; bWEM 6,039.7 ± 1,313.5; EM* 173,672.5 ± 80,054.2; sampEM 49,745.8 ± 10,328.9

California Housing Data (N = 20,640, D = 8, K = 7, M = 2,907):
  log-likelihood: bEMADS 7.517 ± 0.191; bWEM 6.882 ± 0.153; EM 7.682 ± 0.159; sampEM 6.776 ± 0.239
  time (sec.): bEMADS 3,232.4 ± 525.8; bWEM 3,488.6 ± 317.7; EM 16,405.5 ± 2,906.2; sampEM 1,433.9 ± 514.7

Census-Income Database (N = 299,285, D = 3, K = 10, M = 3,186):
  log-likelihood: bEMADS -0.741 ± 0.004; bWEM -0.744 ± 0.005; EM -0.740 ± 0.004; sampEM -0.743 ± 0.006
  time (sec.): bEMADS 4,283.9 ± 1,040.1; bWEM 4,281.8 ± 1,413.4; EM 495,056.4 ± 87,312.1; sampEM 16,359.4 ± 5,873.3

A. Performance on Three Real Data Sets

We first examine the performance of bEMADS on three real data sets downloaded from the UCI KDD Archive (http://kdd.ics.uci.edu/). The performance of bEMADS, bWEM, EM, and sampEM is summarized in Table I, where $\zeta$ is set to zero, i.e., no penalty term is applied to the likelihood function. The numbers of clusters were set based on some preliminary experiments [10]. Both the average and standard deviation of the log-likelihood and the execution time are listed.

For the Forest CoverType Data, EM cannot generate a good mixture after running for 200 hours, so we run it on 15% random samples and denote it as sampEM(15%). In fact, even sampEM(15%) and sampEM take about 48.2 and 13.8 hours, respectively. On average, bEMADS takes about 2.2 hours. It runs 21.7 and 6.2 times faster than sampEM(15%) and sampEM, respectively. The average log-likelihood value of the Gaussian mixtures generated by bEMADS is −3.083, which lies between the value of −3.086 for sampEM and the value of −3.078 for sampEM(15%). The one-tailed t-Test does not indicate a statistically significant difference among them at the 0.05 level. Though bWEM runs a bit faster than bEMADS, the one-tailed t-Test indicates that its log-likelihood value of −3.278 is significantly lower than those of its three counterparts at the 0.05 level.

The performance of the four algorithms with different $\zeta$, which specifies the prior distribution for the covariance matrices, is presented in Fig. 3. Similar to Table I, the log-likelihood value, rather than the penalized one, of the generated Gaussian mixtures is used in order to make a fair comparison. The line on each bar indicates the standard deviation among 10 runs. For these 6 different $\zeta$ values, the log-likelihood of the Gaussian mixtures generated by bEMADS varies from −3.098 to −3.079, and its standard deviation varies from 0.010 to 0.026. There is no significant difference. Except that bWEM and sampEM generate worse results when $\zeta = 10$, the performance of all four algorithms is stable with a reasonable $\zeta$. Thus, we set $\zeta = 0$ hereafter.

For the California Housing Data, a 7-component Gaussian mixture generated by bEMADS can identify its cluster structure in the scaled Latitude-Longitude space, as shown in Fig. 2(b). For this 8-dimensional data set, EM spends 16,405.5 seconds, which is about 5.1 times longer than bEMADS. Though the log-likelihood of 7.517 for bEMADS is slightly lower than the value of 7.682 for EM, this difference is not significant. For this moderate data set, sampEM runs faster than bEMADS, but its average log-likelihood value is significantly lower than that of bEMADS. The log-likelihood of bWEM is also significantly lower than that of bEMADS, though bEMADS and bWEM spend similar execution time.

Fig. 3. The average log-likelihood and its standard deviation of the four algorithms (bEMADS, sampEM(15%), sampEM(5%), and bWEM) for the Forest CoverType Data with different ζ.

For the Census-Income Database, EM and sampEM spend about 137.5 and 4.5 hours respectively, which are 115.6 and 3.8 times longer than the time of bEMADS. The average log-likelihood values of bEMADS, EM, and sampEM are −0.741, −0.740, and −0.743, respectively. The one-tailed t-Test does not show that a significant difference exists at the 0.05 level. Though bWEM runs as fast as bEMADS, it generates the worst mixtures, with a log-likelihood value of −0.744, which is significantly lower than that of EM.

B. Performance on Synthetic Data

To examine the performance of bEMADS, we also generated 7 synthetic data sets according to some random Gaussian mixtures. Among these data sets, the number of records N varies from 108,000 to 1,100,000, the number of attributes D is from 2 to 5, and the number of clusters K varies from 9 to 20 [10]. All of these parameters are listed in Table II, where M indicates the number of subclusters generated. The first data set DS1 is illustrated in Fig. 1(a). The clustering accuracy is used to measure a generated Gaussian mixture. It indicates the proportion of records that are correctly clustered by the mixture with respect to the clustering result based on the original one [9].
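One common way to compute such an accuracy between a reference crisp clustering and a generated one is to match cluster labels one-to-one and count agreements; the sketch below takes this reading (the matching step via the Hungarian method is our assumption, not a detail given in the paper), with illustrative names.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(labels_ref, labels_gen):
    """Proportion of records whose generated cluster agrees with the reference
    clustering under the best one-to-one matching of cluster labels."""
    labels_ref = np.asarray(labels_ref)
    labels_gen = np.asarray(labels_gen)
    K = int(max(labels_ref.max(), labels_gen.max())) + 1
    conf = np.zeros((K, K), dtype=int)            # confusion matrix of the two clusterings
    for t, p in zip(labels_ref, labels_gen):
        conf[t, p] += 1
    row, col = linear_sum_assignment(-conf)       # maximize the matched counts
    return conf[row, col].sum() / len(labels_ref)
```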

TABLE II
THE PARAMETERS OF SEVEN SYNTHETIC DATA SETS. N, D, K, AND M INDICATE THE NUMBER OF DATA RECORDS, THE DATA DIMENSIONALITY, THE NUMBER OF CLUSTERS, AND THE NUMBER OF SUBCLUSTERS, RESPECTIVELY.

Data Set   N          D   K    M
DS1        108,000    2   9    2,986
DS2        500,000    2   12   2,499
DS3        1,100,000  2   20   3,818
DS4        200,000    2   10   2,279
DS5        200,000    3   10   3,227
DS6        240,000    4   12   3,982
DS7        280,000    5   14   2,391

Fig. 4 illustrates the performance of bEMADS, bWEM, EM, and sampEM. Among the 7 data sets, bEMADS generates the most accurate results on the 2nd and the 3rd ones. On average, the clustering accuracy values of bEMADS, bWEM, EM, and sampEM are 89.6%, 84.7%, 90.6%, and 86.3%, respectively. Though EM generates slightly more accurate clustering results than bEMADS does, the one-tailed paired t-Test indicates the difference is not significant at the 0.05 level. The average difference between bEMADS and bWEM is 4.9%, and it is significant at the 0.05 level. The average difference between bEMADS and sampEM is 3.3%, which is also significant. As shown in Fig. 4(b), both bEMADS and bWEM spend several thousand seconds. Compared with EM, bEMADS runs 27.8 to 95.7 times faster. Compared with sampEM, bEMADS runs 1.9 to 16.1 times faster. Hence, bEMADS greatly outperforms the other three algorithms in terms of execution time and/or clustering accuracy. The scalability examination of bEMADS in [5] also substantiates this point.

Fig. 4. Performance of bEMADS, bWEM, EM, and sampEM for the 7 synthetic data sets (DS1-DS7 and their average). (a) Average clustering accuracy and standard deviation. (b) Execution time (in seconds).

C. Approximation Examination

Finally, we examine EMADS' approximation to the aggregate behavior of each subcluster in EM, mainly using gEMADS. Fig. 5 illustrates, from three different aspects, to what degree EMADS approximates EM on the subclusters of DS1 under 10 different grid structures and on the subclusters of the Forest CoverType Data. The grid structures determine the subcluster shape and size. Some grid structures are quite skewed, e.g., the cell width is 7.2 times larger than the height in the 12*86 grid structure.

In Fig. 5(a), P(EM) denotes the average probability of a subcluster of records under the best Gaussian mixture we have; P(EMADS) denotes the probability of the subcluster under the pseudo mixture model in Eq.(10); P(WEM) is the probability of the subcluster mean $\nu_m$ under the Gaussian mixture. P(EM) represents an aggregate aspect of each subcluster under the Gaussian mixture. As plotted in Fig. 5(a), P(EMADS) is only a bit smaller than P(EM) on average. The paired t-Test indicates that P(EMADS) is not significantly different from P(EM) at the 0.05 level except for the 12*86 grid structure for DS1. Note that the average of P(WEM) is sometimes larger, and sometimes smaller, than that of P(EM). Moreover, P(WEM) is significantly different from P(EM) for the 12*12 grid structure for DS1 and for the real data.

In Fig. 5(b), R(EM) denotes the average of the membership probability vectors $t_i = [t_{i1}, \cdots, t_{iK}]$ in Eq.(2) of a subcluster; R(EMADS) corresponds to its membership probability $r_m$ according to Eq.(12); and R(WEM) corresponds to $r_m$ in Eq.(12) with $\delta_m = 0$. R(EMADS) is very close to R(EM). On average, it is closer to R(EM) than R(WEM) is, except for the last two skewed grid structures. There is no significant difference between R(EMADS) and R(EM) at the 0.05 level except for the 12*12, 16*64, and 12*86 grid structures.

For the M-step, we should examine the closeness of the two covariance re-estimations according to Eqs.(15) and (5), respectively. For simplicity, we investigate the closeness between $\delta_m\delta_m^T$ and the matrix $\left(\Gamma_m - \nu_m\nu_m^T\right)$. The ratios between them are illustrated in Fig. 5(c) for the 11 data summarization results. On average, these ratios are greater than 82.8%. In particular, for the last three skewed grid structures, they are greater than 97.1%. In other words, most covariance information of a matrix is embedded in its first covariance vector.
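A possible reading of this ratio, used here only as an illustration, is the Frobenius-norm share of the subcluster covariance captured by its rank-one approximation; it can be computed per subcluster as in the sketch below (reusing first_covariance_vector from the earlier sketch).

```python
import numpy as np

def rank_one_ratio(Gamma_m, nu_m):
    """||delta_m delta_m^T||_F / ||Gamma_m - nu_m nu_m^T||_F for one subcluster."""
    S = Gamma_m - np.outer(nu_m, nu_m)
    delta = first_covariance_vector(Gamma_m, nu_m)
    return np.linalg.norm(np.outer(delta, delta)) / np.linalg.norm(S)
```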

Thus, EMADS' approximation to EM is acceptable if the subclusters are not too skewed and their granularity is reasonably small. This point is also supported by the sensitivity examination of EMADS in [5]. These results substantiate the promising accuracy of bEMADS and gEMADS.

Fig. 5. EMADS' approximation to EM on data summaries of DS1 using 10 different grid structures (56*56, 48*48, 40*40, 32*32, 24*24, 16*16, 12*12, 24*43, 16*64, and 12*86) and the Forest CoverType Data. (a) Mixture model density approximation: P(EMADS)−P(EM) and P(WEM)−P(EM). (b) The E-step approximation: ||R(EMADS)−R(EM)|| and ||R(WEM)−R(EM)||. (c) Covariance matrix approximation: the ratio ||the cross product of the first covariance vector|| / ||covariance matrix||.

V. CONCLUSION

Through sophisticated manipulation of summary statistics, we have established two scalable clustering algorithms, bEMADS and gEMADS, based on the Gaussian mixture model. The main novelties are the pseudo component density function for data summaries and its associated algorithm EMADS (Expectation-Maximization Algorithm for Data Summaries). EMADS embodies the cardinality, mean, and covariance information of each subcluster to generate Gaussian mixtures, and it is provably convergent. The experimental results have shown that bEMADS runs one or two orders of magnitude faster than the traditional EM (Expectation-Maximization) algorithm with little or no loss of accuracy. Using comparable computational resources, it has generated significantly more accurate clustering results than existing model-based clustering algorithms. Using gEMADS and bEMADS, we have illustrated EMADS' good approximation to the aggregate behavior of each subcluster in EM.

We are interested in exploring the potential for more suitable pseudo density functions for the Gaussian distribution. For example, more covariance vectors may be included in the pseudo density function in the same way as the first covariance vector. Another future research issue is to efficiently determine the optimal number of clusters for large databases. The underlying idea of the paper is also applicable to scaling up other finite mixture models, say, a mixture of Markov chains.

ACKNOWLEDGMENTS

This work was submitted when Dr. Jin was with Lingnan University, Hong Kong. It appeared previously in ICDM'03 [5]. The authors would like to thank the editor and the reviewers for their constructive comments and suggestions, and thank T. Zhang, R. Ramakrishnan, M. Livny, and V. Ganti for their BIRCH code. It was partially supported by RGC Grants CUHK 4212/01E and LU 3009/02E of Hong Kong.

REFERENCES

[1] J. Han and M. Kamber, Data Mining: Concepts and Techniques. San Francisco, CA, USA: Morgan Kaufmann Publishers, 2001.

[2] P. Bradley, U. Fayyad, and C. Reina, "Clustering very large databases using EM mixture models," in Proceedings of 15th International Conference on Pattern Recognition, vol. 2, 2000, pp. 76–80.

[3] V. Ganti, J. Gehrke, and R. Ramakrishnan, "Mining very large databases," IEEE Computer, vol. 32, no. 8, pp. 38–45, Aug. 1999.

[4] T. Zhang, R. Ramakrishnan, and M. Livny, "BIRCH: A new data clustering algorithm and its applications," Data Mining and Knowledge Discovery, vol. 1, no. 2, pp. 141–182, 1997.

[5] H.-D. Jin, M.-L. Wong, and K.-S. Leung, "Scalable model-based clustering by working on data summaries," in Proceedings of Third IEEE International Conference on Data Mining (ICDM 2003), Melbourne, USA, Nov. 2003, pp. 91–98.

[6] B. Thiesson, C. Meek, and D. Heckerman, "Accelerating EM for large databases," Machine Learning, vol. 45, pp. 279–299, 2001.

[7] A. Moore, "Very fast EM-based mixture model clustering using multiresolution KD-trees," in Advances in Neural Information Processing Systems 11, 1999, pp. 543–549.

[8] C. Palmer and C. Faloutsos, "Density biased sampling: An improved method for data mining and clustering," in Proceedings of the 2000 ACM SIGMOD, 2000, pp. 82–92.

[9] M. Meila and D. Heckerman, "An experimental comparison of model-based clustering methods," Machine Learning, vol. 42, no. 1/2, pp. 9–29, 2001.

[10] H.-D. Jin, "Scalable model-based clustering algorithms for large databases and their applications," Ph.D. thesis, the Chinese University of Hong Kong, Hong Kong, Aug. 2002. See errata, codes, and data in http://www.cmis.csiro.au/Warren.Jin/PhDthesisWork.htm.

[11] P. A. Pantel, "Clustering by committee," Ph.D. dissertation, University of Alberta, Canada, 2003.

[12] M. Figueiredo and A. K. Jain, "Unsupervised learning of finite mixture models," IEEE Trans. Pattern Anal. Machine Intell., vol. 24, no. 3, pp. 381–396, Mar. 2002.

[13] S. Wang, D. Schuurmans, F. Peng, and Y. Zhao, "Learning mixture models with the latent maximum entropy principle," in Proceedings of the Twentieth International Conference on Machine Learning. Washington, DC, USA: AAAI Press, 2003, pp. 784–791.

[14] A. Dempster, N. Laird, and D. Rubin, "Maximum-likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society Series B, vol. 39, pp. 1–38, 1977.

[15] H.-D. Jin, K.-S. Leung, M.-L. Wong, and Z.-B. Xu, "Scalable model-based cluster analysis using clustering features," Pattern Recognition, vol. 38, no. 5, pp. 637–649, May 2005.

[16] G. McLachlan and T. Krishnan, The EM Algorithm and Extensions. New York: John Wiley & Sons, Inc., 1997.

[17] P. Cheeseman and J. Stutz, "Bayesian classification (AutoClass): Theory and results," in Advances in Knowledge Discovery and Data Mining, U. Fayyad et al., Eds., Menlo Park, CA, USA, 1996, pp. 153–180.

[18] B. J. Frey and N. Jojic, "Transformation-invariant clustering using the EM algorithm," IEEE Trans. Pattern Anal. Machine Intell., vol. 25, no. 1, pp. 1–17, 2003.

[19] C. Fraley, "Algorithms for model-based Gaussian hierarchical clustering," SIAM Journal on Scientific Computing, vol. 20, no. 1, pp. 270–281, Jan. 1999.

[20] J. Shanmugasundaram, U. Fayyad, and P. Bradley, "Compressed data cubes for OLAP aggregate query approximation on continuous dimensions," in Proceedings of the Fifth ACM SIGKDD, San Diego, CA, USA, 1999, pp. 223–232.


Biography

Huidong Jin received his B.Sc. degree from the Department of Applied Mathematics in 1995 and his M.Sc. degree from the Institute of Information and System Sciences in 1998, both from Xi'an Jiaotong University, P.R. China. In 2002, he received his Ph.D. degree in Computer Science and Engineering from the Chinese University of Hong Kong, Shatin, Hong Kong. He is currently with the Division of Mathematical and Information Sciences, CSIRO, Australia. His research interests are data mining, health informatics, and intelligent computation. He has authored and co-authored over 15 papers in these areas. He is a member of the ACM and the IEEE.

Man-Leung Wong is an associate professor at the Department of Computing and Decision Sciences of Lingnan University, Tuen Mun, Hong Kong. Before joining the university, he worked as an assistant professor at the Department of Systems Engineering and Engineering Management, the Chinese University of Hong Kong, and at the Department of Computing Science, Hong Kong Baptist University. He worked as a research engineer at Hypercom Asia Ltd. in 1997. His research interests are evolutionary computation, data mining, machine learning, knowledge acquisition, and approximate reasoning. He has authored and co-authored over 50 papers and 1 book in these areas. He received his B.Sc., M.Phil., and Ph.D. in computer science from the Chinese University of Hong Kong in 1988, 1990, and 1995, respectively. He is a member of the IEEE and the ACM.


Kwong-Sak Leung received his B.Sc. (Eng.) and Ph.D. degrees in 1977 and 1980, respectively, from the University of London, Queen Mary College. He worked as a senior engineer on contract R&D at ERA Technology and later joined the Central Electricity Generating Board to work on nuclear power station simulators in England. He joined the Computer Science and Engineering Department at the Chinese University of Hong Kong in 1985, where he is currently professor and chairman of the Department.

Dr. Leung's research interests are in soft computing, including evolutionary computation, neural computation, probabilistic search, information fusion and data mining, and fuzzy data and knowledge engineering. He has published over 180 papers and 2 books in fuzzy logic and evolutionary computation. He has been chair and member of many program and organizing committees of international conferences. He is on the Editorial Board of Fuzzy Sets and Systems and an associate editor of the International Journal of Intelligent Automation and Soft Computing. He is a senior member of the IEEE, a chartered engineer, a member of the IEE and the ACM, and a fellow of the HKCS and the HKIE.