Jian Pei: CMPT 459/741 Clustering (4) 1
Fuzzy Clustering
• Each point xi takes a probability wij to belong to a cluster Cj
• Requirements
  – For each point xi: $\sum_{j=1}^{k} w_{ij} = 1$
  – For each cluster Cj: $0 < \sum_{i=1}^{m} w_{ij} < m$
Fuzzy C-Means (FCM)
Select an initial fuzzy pseudo-partition, i.e., assign values to all the wij
Repeat
  Compute the centroid of each cluster using the fuzzy pseudo-partition
  Recompute the fuzzy pseudo-partition, i.e., the wij
Until the centroids do not change (or the change is below some threshold)
Critical Details
• Optimization on the sum of the squared error (SSE):
  $SSE(C_1, \dots, C_k) = \sum_{j=1}^{k} \sum_{i=1}^{m} w_{ij}^p \, dist(x_i, c_j)^2$
• Computing centroids:
  $c_j = \sum_{i=1}^{m} w_{ij}^p x_i \,\Big/\, \sum_{i=1}^{m} w_{ij}^p$
• Updating the fuzzy pseudo-partition:
  $w_{ij} = \left(1/dist(x_i, c_j)^2\right)^{\frac{1}{p-1}} \Big/ \sum_{q=1}^{k} \left(1/dist(x_i, c_q)^2\right)^{\frac{1}{p-1}}$
  – When p = 2:
  $w_{ij} = \left(1/dist(x_i, c_j)^2\right) \Big/ \sum_{q=1}^{k} \left(1/dist(x_i, c_q)^2\right)$
Choice of p
• When p → 1, FCM behaves like traditional k-means
• When p is larger, the cluster centroids approach the global centroid of all data points
• The partition becomes fuzzier as p increases
Effectiveness
Mixture Models
• A cluster can be modeled as a probability distribution
  – Practically, assume a distribution can be approximated well using a multivariate normal distribution
• Multiple clusters form a mixture of different probability distributions
• A data set is a set of observations from a mixture of models
Object Probability
• Suppose there are k clusters and a set X of m objects
  – Let the j-th cluster have parameter θj = (µj, σj)
  – The probability that a point is in the j-th cluster is wj, with w1 + … + wk = 1
• The probability of an object x is
  $prob(x|\Theta) = \sum_{j=1}^{k} w_j \, p_j(x|\theta_j)$
• The probability of the data set X is
  $prob(X|\Theta) = \prod_{i=1}^{m} prob(x_i|\Theta) = \prod_{i=1}^{m} \sum_{j=1}^{k} w_j \, p_j(x_i|\theta_j)$
Example
$prob(x|\theta_i) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$

$\theta_1 = (-4, 2) \qquad \theta_2 = (4, 2)$

$prob(x|\Theta) = \frac{1}{2\sqrt{2\pi}}\, e^{-\frac{(x+4)^2}{8}} + \frac{1}{2\sqrt{2\pi}}\, e^{-\frac{(x-4)^2}{8}}$
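The two-Gaussian example above can be evaluated numerically. The equal mixing weights of 0.5 in `mixture` are an assumption made for this sketch so that the result is a proper density:

```python
import math

def gaussian(x, mu, sigma):
    # N(x; mu, sigma) density
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def mixture(x):
    # equal-weight (0.5 / 0.5) mixture of N(-4, 2) and N(4, 2);
    # the weights are an illustrative assumption
    return 0.5 * gaussian(x, -4, 2) + 0.5 * gaussian(x, 4, 2)
```

The density is symmetric about 0 and peaks near the two component means, −4 and 4.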
Maximum Likelihood Estimation
• Maximum likelihood principle: if we know that a set of objects comes from one distribution but do not know the parameter, we choose the parameter that maximizes the probability of the data
• Maximize
  $prob(X|\Theta) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}$
  – Equivalently, maximize
  $\log prob(X|\Theta) = -\sum_{i=1}^{m} \frac{(x_i-\mu)^2}{2\sigma^2} - 0.5\, m \log 2\pi - m \log \sigma$
EM Algorithm
• Expectation-Maximization algorithm
  Select an initial set of model parameters
  Repeat
    Expectation step: for each object, calculate the probability that it belongs to each distribution θi, i.e., prob(xi|θi)
    Maximization step: given the probabilities from the expectation step, find the new estimates of the parameters that maximize the expected likelihood
Until the parameters are stable
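The E- and M-steps can be sketched for a 1-d Gaussian mixture. This is an illustrative toy, not library code: the deterministic initialization (spreading the initial means over the sorted data) and the small variance floor are assumptions made to keep the sketch stable.

```python
import math

def em_gmm_1d(xs, k=2, n_iter=200):
    """Sketch of EM for a 1-d Gaussian mixture."""
    xs_sorted = sorted(xs)
    # spread initial means over the sorted data (illustrative choice)
    mus = [xs_sorted[(2 * j + 1) * len(xs) // (2 * k)] for j in range(k)]
    sigmas = [1.0] * k
    ws = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point
        resp = []
        for x in xs:
            ps = [w * math.exp(-(x - mu) ** 2 / (2 * s * s)) / (math.sqrt(2 * math.pi) * s)
                  for w, mu, s in zip(ws, mus, sigmas)]
            total = sum(ps) or 1e-300
            resp.append([p / total for p in ps])
        # M-step: re-estimate weights, means, and standard deviations
        for j in range(k):
            nj = sum(r[j] for r in resp) or 1e-300
            ws[j] = nj / len(xs)
            mus[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            sigmas[j] = max(math.sqrt(sum(r[j] * (x - mus[j]) ** 2
                                          for r, x in zip(resp, xs)) / nj), 1e-3)
    return ws, mus, sigmas
```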
Advantages and Disadvantages
• Mixture models are more general than k-means and fuzzy c-means
• Clusters can be characterized by a small number of parameters
• The results may satisfy the statistical assumptions of the generative models
• Computationally expensive
• Need large data sets
• Hard to estimate the number of clusters
Grid-based Clustering Methods
• Ideas
  – Using multi-resolution grid data structures
  – Using dense grid cells to form clusters
• Several interesting methods
  – CLIQUE
  – STING
  – WaveCluster
CLIQUE
• Clustering In QUEst
• Automatically identifies subspaces of a high-dimensional data space
• Both density-based and grid-based
CLIQUE: the Ideas
• Partition each dimension into the same number of equal-length intervals
  – This partitions an m-dimensional data space into non-overlapping rectangular units
• A unit is dense if the number of data points in the unit exceeds a threshold
• A cluster is a maximal set of connected dense units within a subspace
CLIQUE: the Method
• Partition the data space and find the number of points in each cell of the partition
  – Apriori: a k-d cell cannot be dense if one of its (k−1)-d projections is not dense
• Identify clusters:
  – Determine dense units in all subspaces of interest and connected dense units in all subspaces of interest
• Generate minimal descriptions for the clusters
  – Determine the minimal cover for each cluster
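The counting step can be sketched as follows. This toy finds dense cells in the full space only; real CLIQUE works bottom-up per subspace with the Apriori pruning above. The function name and the equal-width binning over the observed data range are illustrative choices.

```python
from collections import Counter

def dense_cells(points, n_intervals, threshold):
    """Find dense grid cells: partition each dimension into equal-length
    intervals and keep cells whose point count exceeds the threshold.
    Cells are tuples of interval indices."""
    dims = len(points[0])
    los = [min(p[d] for p in points) for d in range(dims)]
    his = [max(p[d] for p in points) for d in range(dims)]
    widths = [((hi - lo) / n_intervals) or 1.0 for lo, hi in zip(los, his)]
    def cell(p):
        # clamp the max value into the last interval
        return tuple(min(int((p[d] - los[d]) / widths[d]), n_intervals - 1)
                     for d in range(dims))
    counts = Counter(cell(p) for p in points)
    return {c for c, n in counts.items() if n > threshold}
```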
CLIQUE: An Example
[Figure: grid partitions of two 2-d subspaces, salary (10,000) vs. age and vacation (week) vs. age, with age divided into intervals over 20–60; dense units in the 2-d subspaces suggest candidate dense units in the 3-d space]
CLIQUE: Pros and Cons
• Automatically finds subspaces of the highest dimensionality with high-density clusters
• Insensitive to the order of input
  – Does not presume any canonical data distribution
• Scales linearly with the size of input
• Scales well with the number of dimensions
• The quality of the clustering result may be degraded as the price of the simplicity of the method
Bad Cases for CLIQUE
Parts of a cluster may be missed
A cluster from CLIQUE may contain noise
Dimensionality Reduction
• Clustering a high-dimensional data set is challenging
  – The distance between two points could be dominated by noise
• Dimensionality reduction: choosing the informative dimensions for clustering analysis
  – Feature selection: choosing a subset of existing dimensions
  – Feature construction: constructing a new (small) set of informative attributes
Variance and Covariance
• Given a set of 1-d points, how different are those points?
  – Standard deviation: $s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{X})^2}{n-1}}$
  – Variance: $s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{X})^2}{n-1}$
• Given a set of 2-d points, are the two dimensions correlated?
  – Covariance: $cov(X, Y) = \frac{\sum_{i=1}^{n}(x_i - \bar{X})(y_i - \bar{Y})}{n-1}$
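The three formulas above translate directly into code (with the n−1 denominator in each case):

```python
def variance(xs):
    """Sample variance with the n-1 denominator."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def covariance(xs, ys):
    """Sample covariance of two dimensions, also with n-1."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
```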
Principal Components
Art work and example from http://csnet.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf
Step 1: Mean Subtraction
• Subtract the mean from each dimension for each data point
• Intuition: centralizing the data set
Step 2: Covariance Matrix
$C = \begin{pmatrix} cov(D_1, D_1) & cov(D_1, D_2) & \cdots & cov(D_1, D_n) \\ cov(D_2, D_1) & cov(D_2, D_2) & \cdots & cov(D_2, D_n) \\ \vdots & \vdots & \ddots & \vdots \\ cov(D_n, D_1) & cov(D_n, D_2) & \cdots & cov(D_n, D_n) \end{pmatrix}$
Step 3: Eigenvectors and Eigenvalues
• Compute the eigenvectors and the eigenvalues of the covariance matrix
  – Intuition: find the direction-invariant vectors as candidates for new attributes
  – Eigenvalues indicate how much the direction-invariant vectors are scaled – the larger, the better for manifesting the data variance
Step 4: Forming New Features
• Choose the principal components and form new features
  – Typically, choose the top-k components
New Features
NewData = RowFeatureVector x RowDataAdjust
The first principal component is used
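The four steps above can be sketched end to end in NumPy (the function name is illustrative; `numpy.cov` and `numpy.linalg.eigh` do the heavy lifting):

```python
import numpy as np

def pca(X, k):
    """PCA sketch following the steps above: mean-subtract, covariance,
    eigendecomposition, then project onto the top-k components."""
    X_adj = X - X.mean(axis=0)                 # step 1: mean subtraction
    C = np.cov(X_adj, rowvar=False)            # step 2: covariance matrix
    evals, evecs = np.linalg.eigh(C)           # step 3: eigenvectors/eigenvalues
    order = np.argsort(evals)[::-1][:k]        # largest eigenvalues first
    feature_vec = evecs[:, order]              # step 4: keep top-k components
    return X_adj @ feature_vec                 # NewData, one row per object
```

For points that lie exactly on a line, all the variance is captured by the first component and the second coordinate of the output is zero.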
Clustering in Derived Space
[Figure: 2-d points in the X-Y plane projected onto the derived axis −0.707x + 0.707y]
Spectral Clustering
[Pipeline: data → affinity matrix W = [Wij] → A = f(W) → computing the k leading eigenvectors of A (Av = λv) → clustering in the new space → projecting back to cluster the original data]
Affinity Matrix
• Using a distance measure, where σ is a scaling parameter controlling how fast the affinity Wij decreases as the distance increases
• In the Ng-Jordan-Weiss algorithm, Wii is set to 0
$W_{ij} = e^{-\frac{dist(o_i, o_j)}{\sigma^2}}$
Clustering
• In the Ng-Jordan-Weiss algorithm, we define a diagonal matrix D such that
  $D_{ii} = \sum_{j=1}^{n} W_{ij}$
• Then, $A = D^{-\frac{1}{2}} W D^{-\frac{1}{2}}$
• Use the k leading eigenvectors of A to form a new space
• Map the original data to the new space and conduct clustering
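The whole pipeline can be sketched in NumPy. This is an illustrative toy, not the original algorithm verbatim: it uses the common Gaussian affinity $e^{-dist^2/(2\sigma^2)}$ (the slide's scaling may differ), row-normalizes the eigenvector matrix, and runs a simple deterministic k-means in the new space.

```python
import numpy as np

def njw_spectral(X, k, sigma=1.0, n_iter=50):
    """Sketch of Ng-Jordan-Weiss spectral clustering."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-d2 / (2 * sigma ** 2))         # Gaussian affinity (assumed form)
    np.fill_diagonal(W, 0.0)                   # W_ii = 0
    D_inv_sqrt = np.diag(1.0 / np.sqrt(W.sum(axis=1)))
    A = D_inv_sqrt @ W @ D_inv_sqrt            # A = D^(-1/2) W D^(-1/2)
    evals, evecs = np.linalg.eigh(A)
    V = evecs[:, np.argsort(evals)[::-1][:k]]  # k leading eigenvectors as columns
    V = V / np.linalg.norm(V, axis=1, keepdims=True)   # row-normalize
    # farthest-point initialization + plain k-means on the rows of V
    centers = [V[0]]
    for _ in range(1, k):
        d = np.min([((V - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(V[d.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = ((V[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        centers = np.array([V[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels
```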
Is a Clustering Good?
• Feasibility
  – Applying any clustering method on a uniformly distributed data set is meaningless
• Quality
  – Are the clustering results meeting users' interest?
  – Clustering patients into clusters corresponding to various diseases or sub-phenotypes is meaningful
  – Clustering patients into clusters corresponding to male or female is not meaningful
Major Tasks
• Assessing clustering tendency – Are there non-random structures in the data?
• Determining the number of clusters or other critical parameters
• Measuring clustering quality
Uniformly Distributed Data
• Clustering uniformly distributed data is meaningless
• A uniformly distributed data set is generated by a uniform data distribution
Hopkins Statistic
• Hypothesis: the data is generated by a uniform distribution in a space
• Sample n points, p1, …, pn, uniformly from the space of D
• For each point pi, find the nearest neighbor of pi in D, let xi be the distance between pi and its nearest neighbor in D
$x_i = \min_{v \in D} \{ dist(p_i, v) \}$
Hopkins Statistic
• Sample n points, q1, …, qn, uniformly from D
• For each qi, find the nearest neighbor of qi in D − {qi}; let yi be the distance between qi and its nearest neighbor in D − {qi}
• Calculate the Hopkins Statistic H
$y_i = \min_{v \in D,\, v \neq q_i} \{ dist(q_i, v) \}$

$H = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i + \sum_{i=1}^{n} y_i}$
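The two sampling steps and the ratio above can be sketched for 2-d points. The uniform samples are drawn from the bounding box of D here, which is a simplifying assumption about "the space of D":

```python
import math
import random

def hopkins(D, n, seed=0):
    """Sketch of the Hopkins Statistic for a list of 2-d points."""
    rnd = random.Random(seed)
    xmin = min(p[0] for p in D); xmax = max(p[0] for p in D)
    ymin = min(p[1] for p in D); ymax = max(p[1] for p in D)
    def nn_dist(q, pts):
        return min(math.dist(q, v) for v in pts)
    # x_i: nearest-neighbor distances from uniform samples to D
    xs = [nn_dist((rnd.uniform(xmin, xmax), rnd.uniform(ymin, ymax)), D)
          for _ in range(n)]
    # y_i: nearest-neighbor distances within D, excluding the point itself
    sample = rnd.sample(D, n)
    ys = [nn_dist(q, [v for v in D if v != q]) for q in sample]
    return sum(ys) / (sum(xs) + sum(ys))
```

For strongly clustered data the within-data distances yi are small, so H comes out close to 0; for uniform data it hovers around 0.5.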
Explanation
• If D is uniformly distributed, then $\sum_{i=1}^{n} x_i$ and $\sum_{i=1}^{n} y_i$ would be close to each other, and thus H would be around 0.5
• If D is skewed, then $\sum_{i=1}^{n} y_i$ would be substantially smaller, and thus H would be close to 0
• If H > 0.5, then it is unlikely that D has statistically significant clusters
Finding the Number of Clusters
• Depends on many factors
  – The shape and scale of the distribution in the data set
  – The clustering resolution required by the user
• Many methods exist
  – Set $k = \sqrt{n/2}$, so each cluster has $\sqrt{2n}$ points on average
  – Plot the sum of within-cluster variances with respect to k, and find the first (or the most significant) turning point
A Cross-Validation Method
• Divide the data set D into m parts
• Use m − 1 parts to find a clustering
• Use the remaining part as the test set to test the quality of the clustering
  – For each point in the test set, find the closest centroid or cluster center
  – Use the squared distances between all points in the test set and the corresponding centroids to measure how well the clustering model fits the test set
• Repeat m times for each value of k, and use the average as the quality measure
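The procedure above can be sketched as follows. The helper `kmeans` with farthest-point initialization is an illustrative stand-in for whatever clustering method is being evaluated; the function names are assumptions of this sketch.

```python
import math
import random

def kmeans(pts, k, n_iter=20):
    """Tiny k-means with deterministic farthest-point initialization."""
    centroids = [pts[0]]
    while len(centroids) < k:
        centroids.append(max(pts, key=lambda p: min(math.dist(p, c)
                                                    for c in centroids)))
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for p in pts:
            j = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            groups[j].append(p)
        new_centroids = []
        for j, g in enumerate(groups):
            new_centroids.append(tuple(sum(vals) / len(g) for vals in zip(*g))
                                 if g else centroids[j])
        centroids = new_centroids
    return centroids

def cv_quality(D, k, m_folds=5, seed=0):
    """Average held-out squared distance to the closest centroid over m folds."""
    rnd = random.Random(seed)
    pts = list(D)
    rnd.shuffle(pts)
    folds = [pts[i::m_folds] for i in range(m_folds)]
    scores = []
    for f in range(m_folds):
        train = [p for i, fold in enumerate(folds) if i != f for p in fold]
        test = folds[f]
        centroids = kmeans(train, k)
        scores.append(sum(min(math.dist(p, c) ** 2 for c in centroids)
                          for p in test) / len(test))
    return sum(scores) / m_folds
```

A lower average score for some k suggests that value of k fits the data better.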
Measuring Clustering Quality
• Ground truth: the ideal clustering determined by human experts
• Two situations
  – There is a known ground truth – the extrinsic (supervised) methods, comparing the clustering against the ground truth
  – The ground truth is unavailable – the intrinsic (unsupervised) methods, measuring how well the clusters are separated
Quality in Extrinsic Methods
• Cluster homogeneity: the purer the clusters in a clustering are, the better the clustering
• Cluster completeness: objects in the same cluster in the ground truth should be clustered together
• Rag bag: putting a heterogeneous object into a pure cluster is worse than putting it into a rag bag
• Small cluster preservation: splitting a small cluster in the ground truth into pieces is worse than splitting a bigger one
Bcubed Precision and Recall
• D = {o1, …, on} – L(oi) is the cluster of oi given by the ground truth
• C is a clustering on D – C(oi) is the cluster-id of oi in C
• For two objects oi and oj, the correctness is 1 if L(oi) = L(oj) ⇔ C(oi) = C(oj), and 0 otherwise
Bcubed Precision and Recall
• Precision
  $Precision_{BCubed} = \frac{\sum_{i=1}^{n} \frac{\sum_{o_j:\, i \neq j,\, C(o_i)=C(o_j)} Correctness(o_i, o_j)}{\|\{o_j \mid i \neq j,\, C(o_i)=C(o_j)\}\|}}{n}$
• Recall
  $Recall_{BCubed} = \frac{\sum_{i=1}^{n} \frac{\sum_{o_j:\, i \neq j,\, L(o_i)=L(o_j)} Correctness(o_i, o_j)}{\|\{o_j \mid i \neq j,\, L(o_i)=L(o_j)\}\|}}{n}$
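The two definitions translate directly into code over label lists (the function name is illustrative; `max(…, 1)` guards the denominator for singleton objects):

```python
def bcubed(L, C):
    """BCubed precision and recall. L[i] is the ground-truth category of
    object i; C[i] is its cluster id in the clustering."""
    n = len(L)
    def correctness(i, j):
        # 1 iff L(oi) = L(oj) <=> C(oi) = C(oj)
        return 1 if (L[i] == L[j]) == (C[i] == C[j]) else 0
    precision = sum(
        sum(correctness(i, j) for j in range(n) if j != i and C[i] == C[j]) /
        max(sum(1 for j in range(n) if j != i and C[i] == C[j]), 1)
        for i in range(n)) / n
    recall = sum(
        sum(correctness(i, j) for j in range(n) if j != i and L[i] == L[j]) /
        max(sum(1 for j in range(n) if j != i and L[i] == L[j]), 1)
        for i in range(n)) / n
    return precision, recall
```

Merging everything into one cluster keeps recall perfect but hurts precision, as expected.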
Silhouette Coefficient
• No ground truth is assumed
• Suppose a data set D of n objects is partitioned into k clusters, C1, …, Ck
• For each object o,
  – Calculate a(o), the average distance between o and every other object in the same cluster – compactness of a cluster; the smaller, the better
  – Calculate b(o), the minimum average distance from o to all objects in a cluster that o does not belong to – degree of separation from other clusters; the larger, the better
Silhouette Coefficient
• Then
  $a(o) = \frac{\sum_{o' \in C_i,\, o' \neq o} dist(o, o')}{|C_i| - 1}$
  $b(o) = \min_{C_j:\, o \notin C_j} \left\{ \frac{\sum_{o' \in C_j} dist(o, o')}{|C_j|} \right\}$
  $s(o) = \frac{b(o) - a(o)}{\max\{a(o), b(o)\}}$
• Use the average silhouette coefficient of all objects as the overall measure
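The average silhouette coefficient can be sketched directly from these formulas. This toy assumes every cluster has at least two objects (otherwise a(o) is undefined); the function name is illustrative.

```python
import math

def silhouette(D, labels):
    """Average silhouette coefficient; D is a list of points,
    labels gives each point's cluster. Assumes clusters of size >= 2."""
    clusters = {c: [D[i] for i in range(len(D)) if labels[i] == c]
                for c in set(labels)}
    def avg_dist(p, pts):
        return sum(math.dist(p, q) for q in pts) / len(pts)
    total = 0.0
    for i, o in enumerate(D):
        own = [p for j, p in enumerate(D) if labels[j] == labels[i] and j != i]
        a = sum(math.dist(o, p) for p in own) / len(own)   # compactness
        b = min(avg_dist(o, pts) for c, pts in clusters.items()
                if c != labels[i])                          # separation
        total += (b - a) / max(a, b)
    return total / len(D)
```

Values near 1 indicate compact, well-separated clusters; negative values indicate objects that sit closer to another cluster than to their own.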
Multi-Clustering
• A data set may be clustered in different ways
  – In different subspaces, that is, using different attributes
  – Using different similarity measures
  – Using different clustering methods
• Some different clusterings may capture different meanings of categorization
  – Orthogonal clusterings
• Putting users in the loop
To-Do List
• Read Chapters 10.5, 10.6, and 11.1
• Find out how Gaussian mixtures can be used in Spark MLlib
• (For thesis-based graduate students only) Learn LDA (Latent Dirichlet Allocation) by yourself