
Page 1

Machine Learning

Lecture # 11: Clustering

Page 2

Cluster Analysis

Page 3

What is Cluster Analysis?

• Cluster: a collection of data objects

– Similar to one another within the same cluster

– Dissimilar to the objects in other clusters

• Cluster analysis

– Finding similarities between data according to the characteristics found in the data, and grouping similar data objects into clusters

Page 4

What is Cluster Analysis?

• Cluster analysis is an important human activity

• Early in childhood, we learn how to distinguish between cats and dogs

• Unsupervised learning: no predefined classes

• Typical applications

– As a stand-alone tool to get insight into data distribution

– As a preprocessing step for other algorithms

Page 5

Clustering

• Hard vs. Soft

– Hard: each object can belong to only a single cluster

– Soft: the same object can belong to multiple clusters

Page 6

Clustering

• Flat vs. Hierarchical

– Flat: clusters form a single, non-nested partition of the data

– Hierarchical: clusters form a tree

• Agglomerative

• Divisive

Page 7

Clustering: Rich Applications and Multidisciplinary Efforts

• Pattern Recognition

• Spatial Data Analysis

– Create thematic maps in GIS by clustering feature spaces

– Detect spatial clusters or support other spatial mining tasks

• Image Processing

• Economic Science (especially market research)

• WWW

– Document classification

– Cluster Weblog data to discover groups of similar access patterns

Page 8

Quality: What Is Good Clustering?

• A good clustering method will produce high quality clusters with

– high intra-class similarity

(Similar to one another within the same cluster)

– low inter-class similarity

(Dissimilar to the objects in other clusters)

Page 9

What is Cluster Analysis?

• Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups

Inter-cluster distances are maximized; intra-cluster distances are minimized

Page 10

Similarity and Dissimilarity Between Objects

• Distances are normally used to measure the similarity or dissimilarity between two data objects

• Some popular ones include the Minkowski distance:

$d(i,j) = \left( |x_{i1}-x_{j1}|^q + |x_{i2}-x_{j2}|^q + \cdots + |x_{ip}-x_{jp}|^q \right)^{1/q}$

where $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two p-dimensional data objects, and q is a positive integer

• If q = 1, d is the Manhattan distance:

$d(i,j) = |x_{i1}-x_{j1}| + |x_{i2}-x_{j2}| + \cdots + |x_{ip}-x_{jp}|$

Page 11

Similarity and Dissimilarity Between Objects (Cont.)

• If q = 2, d is the Euclidean distance:

$d(i,j) = \sqrt{ |x_{i1}-x_{j1}|^2 + |x_{i2}-x_{j2}|^2 + \cdots + |x_{ip}-x_{jp}|^2 }$

• Also, one can use weighted distance, parametric Pearson correlation, or other dissimilarity measures
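As a quick illustration of these formulas, here is a minimal Python sketch (the function names and the sample vectors are my own, purely illustrative) computing the Minkowski distance and its Manhattan and Euclidean special cases:

```python
def minkowski(x, y, q):
    """Minkowski distance between two equal-length numeric sequences."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

def manhattan(x, y):
    """q = 1 special case."""
    return minkowski(x, y, 1)

def euclidean(x, y):
    """q = 2 special case."""
    return minkowski(x, y, 2)

if __name__ == "__main__":
    i = (1.0, 2.0, 3.0)      # example p-dimensional objects
    j = (4.0, 0.0, 3.0)
    print(manhattan(i, j))   # 5.0
    print(euclidean(i, j))   # ~3.606
```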

Page 12

Major Clustering Approaches

• Partitioning approach:

– Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of square errors

– Typical methods: k-means, k-medoids, CLARANS

• Hierarchical approach:

– Create a hierarchical decomposition of the set of data (or objects) using some criterion

– Typical methods: AGNES, DIANA, BIRCH, ROCK, CHAMELEON

• Density-based approach:

– Based on connectivity and density functions

– Typical methods: DBSCAN, OPTICS, DenClue

Page 13

Clustering Approaches

1. Partitioning Methods

2. Hierarchical Methods

3. Density-Based Methods

Page 14

Partitioning Algorithms: Basic Concept

• Given k, find a partition of the data into k clusters that optimizes the chosen partitioning criterion

• k-means and k-medoids algorithms

• k-means (MacQueen’67): Each cluster is represented by the center of the cluster

• k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw’87): Each cluster is represented by one of the objects in the cluster

Page 15

The K-Means Clustering Method

• Given k, the k-means algorithm is implemented in four steps:

1. Partition the objects into k nonempty subsets

2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster)

3. Assign each object to the cluster with the nearest seed point

4. Go back to Step 2; stop when no object changes its assignment
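The four steps above translate almost line for line into code. Below is a minimal, self-contained Python sketch of k-means (plain lists, squared Euclidean distance, random initial seeds); it is an illustrative implementation, not the exact code used in the course:

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Cluster a list of numeric vectors into k groups (Lloyd's algorithm)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # Step 1: pick k initial seed points
    clusters = []
    for _ in range(max_iter):
        # Step 3: assign each object to the nearest seed point
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d.index(min(d))].append(p)
        # Step 2: recompute centroids as the mean of each cluster
        new_centroids = []
        for c, cl in zip(centroids, clusters):
            if cl:
                new_centroids.append(tuple(sum(x) / len(cl) for x in zip(*cl)))
            else:
                new_centroids.append(c)        # keep the old centroid for an empty cluster
        if new_centroids == centroids:         # Step 4: centroids (and hence assignments) stopped changing
            break
        centroids = new_centroids
    return centroids, clusters
```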

Pages 16-20

K-means Clustering (illustration figures)

Page 21

The K-Means Clustering Method

(Figure: k-means on a 2-D scatter of points with K = 2. Arbitrarily choose K objects as initial cluster centers, assign each object to its most similar center, update the cluster means, and reassign; the update and reassign steps repeat until no assignment changes.)

Page 22

Example

• Run K-means clustering with 3 clusters (initial centroids: 3, 16, 25) for at least 2 iterations

Page 23

Example

• Centroid 3: assigned points 2, 3, 4, 7, 9; new centroid: 5

• Centroid 16: assigned points 10, 11, 12, 16, 18, 19; new centroid: 14.33

• Centroid 25: assigned points 23, 24, 25, 30; new centroid: 25.5

Page 24

Example

• Centroid 5: assigned points 2, 3, 4, 7, 9; new centroid: 5

• Centroid 14.33: assigned points 10, 11, 12, 16, 18, 19; new centroid: 14.33

• Centroid 25.5: assigned points 23, 24, 25, 30; new centroid: 25.5

The assignments and centroids no longer change, so the algorithm has converged.
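For a quick sanity check of this worked example, the 1-D data set implied by the two iterations above (2, 3, 4, 7, 9, 10, 11, 12, 16, 18, 19, 23, 24, 25, 30) can be fed to a tiny k-means loop; this sketch is illustrative only:

```python
data = [2, 3, 4, 7, 9, 10, 11, 12, 16, 18, 19, 23, 24, 25, 30]
centroids = [3.0, 16.0, 25.0]                  # initial centroids from the example

for it in range(2):                            # the two iterations shown on the slides
    clusters = [[] for _ in centroids]
    for x in data:
        dists = [abs(x - c) for c in centroids]
        clusters[dists.index(min(dists))].append(x)
    centroids = [sum(c) / len(c) for c in clusters]
    print(it + 1, [round(c, 2) for c in centroids])
# iteration 1 -> [5.0, 14.33, 25.5]; iteration 2 -> unchanged
```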

Page 25

In class Practice

• Run K-means clustering with 3 clusters (initial centroids: 3, 12, 19) for at least 2 iterations

Page 26

Hierarchical Clustering

• Two main types of hierarchical clustering

– Agglomerative:

• Start with the points as individual clusters

• At each step, merge the closest pair of clusters until only one cluster (or k clusters) is left

(Matlab, Statistics Toolbox: clusterdata performs all of these steps; see also pdist, linkage, cluster)

– Divisive:

• Start with one, all-inclusive cluster

• At each step, split a cluster until each cluster contains a single point (or there are k clusters)

• Traditional hierarchical algorithms use a similarity or distance matrix

– Merge or split one cluster at a time

– Image segmentation mostly uses simultaneous merge/split
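The slide mentions MATLAB's clusterdata / pdist / linkage / cluster; an equivalent sketch in Python with SciPy (assuming SciPy and NumPy are available, and using a small made-up data array) looks like this:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Small made-up 2-D data set for illustration
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.8], [9.0, 1.0]])

D = pdist(X, metric="euclidean")                 # condensed pairwise distances (like MATLAB pdist)
Z = linkage(D, method="single")                  # agglomerative merges (like MATLAB linkage)
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters (like MATLAB cluster)
print(labels)
```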

Page 27

Hierarchical clustering

• Agglomerative (Bottom-up)

– Compute all pairwise pattern-pattern similarity coefficients

– Place each of the n patterns into a class of its own

– Merge the two most similar clusters into one

• Replace the two clusters with the new cluster

• Re-compute inter-cluster similarity scores w.r.t. the new cluster

– Repeat the above step until there are k clusters left (k can be 1)

Pages 28-34

Hierarchical clustering

• Agglomerative (Bottom-up)

(Figures: iterations 1 through 5 of successive merges on an example point set, ending when k clusters remain.)

Page 35

Hierarchical clustering

• Divisive (Top-down)

– Start at the top with all patterns in one cluster

– The cluster is split using a flat clustering algorithm

– This procedure is applied recursively until each pattern is in its own singleton cluster

Page 36

Hierarchical clustering

• Divisive (Top-down)

Page 37

Hierarchical Clustering: The Algorithm

• Hierarchical clustering takes as input a set of points

• It creates a tree in which the points are leaves and the internal nodes reveal the similarity structure of the points

– The tree is often called a "dendrogram"

• The method is summarized below:

Place all points into their own clusters

While there is more than one cluster, do

Merge the closest pair of clusters

The behavior of the algorithm depends on how the "closest pair of clusters" is defined
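The while-loop above can be written out directly. The following Python sketch (an illustrative, O(n^3) single-link implementation over 1-D values; the function name is my own) merges the closest pair of clusters until k remain:

```python
def single_link_agglomerative(points, k):
    """Merge the closest clusters (single link, absolute distance on scalars) until k remain."""
    clusters = [[p] for p in points]             # each point starts in its own cluster
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single link: distance between the closest pair of members
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]   # merge the closest pair of clusters
        del clusters[j]
    return clusters

print(single_link_agglomerative([100, 200, 500, 900, 1100], 2))
# -> [[100, 200, 500], [900, 1100]]
```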

Page 38

Hierarchical Clustering: Example

(Figure: six points A, B, C, D, E, F in the plane, together with the dendrogram built over them.)

This example illustrates single-link clustering in Euclidean space on 6 points.

Page 39

Hierarchical Clustering

• Produces a set of nested clusters organized as a hierarchical tree

• Can be visualized as a dendrogram

– A tree-like diagram that records the sequences of merges or splits

(Figure: a dendrogram over points 1-6, with merge heights between roughly 0.05 and 0.2, shown next to the corresponding nested clusters.)

Page 40

Strengths of Hierarchical Clustering

• Do not have to assume any particular number of clusters

– Any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level

Page 41

Hierarchical Clustering: Merging Clusters

Single Link: The distance between two clusters is the distance between their closest pair of points. Also called nearest-neighbor linkage.

Average Link: The distance between two clusters is the average of the pairwise distances between their points; the closely related centroid linkage uses the distance between the cluster centroids (as in the worked example below).

Complete Link: The distance between two clusters is the distance between their farthest pair of points.
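To make these criteria concrete, here is a small Python sketch (illustrative helper names) computing the single-link, complete-link, average-link, and centroid distances between two clusters of 1-D points:

```python
def single_link(a, b):
    return min(abs(x - y) for x in a for y in b)    # closest pair (MIN)

def complete_link(a, b):
    return max(abs(x - y) for x in a for y in b)    # farthest pair (MAX)

def average_link(a, b):
    return sum(abs(x - y) for x in a for y in b) / (len(a) * len(b))  # group average

def centroid_link(a, b):
    return abs(sum(a) / len(a) - sum(b) / len(b))   # distance between centroids

A, B = [100, 200], [900, 1100]
print(single_link(A, B), complete_link(A, B), average_link(A, B), centroid_link(A, B))
# -> 700 1000 850.0 850.0
```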

Page 42

How to Define Inter-Cluster Similarity

(Figure: two candidate clusters over points p1-p5, and their proximity matrix.)

• MIN

• MAX

• Group Average

• Distance Between Centroids

• Other methods driven by an objective function

– Ward's Method uses squared error


Page 47

An example

Let us consider a gene measured in a set of 5 experiments: A, B, C, D and E. The values measured in the 5 experiments are:

A=100 B=200 C=500 D=900 E=1100

We will construct the hierarchical clustering of these values using Euclidean distance, centroid linkage and an agglomerative approach.

Page 48

An example

SOLUTION:

• The closest two values are 100 and 200 => the centroid of these two values is 150.

• Now we are clustering the values: 150, 500, 900, 1100

• The closest two values are 900 and 1100 => the centroid of these two values is 1000.

• The remaining values to be joined are: 150, 500, 1000.

• The closest two values are 150 and 500 => the centroid of these two values is 325.

• Finally, the two resulting subtrees are joined at the root of the tree.

Page 49

An example: Two hierarchical clusterings of the expression values of a single gene measured in 5 experiments.

(Figure: two dendrograms over the leaves A=100, B=200, C=500, D=900, E=1100, drawn with different leaf orderings.)

The dendrograms are identical; both diagrams show that:

• A is most similar to B

• C is most similar to the group (A, B)

• D is most similar to E

In the left dendrogram, A and E are plotted far from each other; in the right dendrogram, A and E are immediate neighbors.

THE PROXIMITY IN A HIERARCHICAL CLUSTERING DOES NOT NECESSARILY CORRESPOND TO SIMILARITY

Page 50

What Is the Problem of the K-Means Method?

• The k-means algorithm is sensitive to outliers!

– An object with an extremely large value may substantially distort the distribution of the data.

• K-Medoids: Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster.


Page 51

Limitations of K-means: Differing Sizes

(Figure: Original Points vs. K-means, 3 Clusters)

Page 52

Limitations of K-means: Differing Density

(Figure: Original Points vs. K-means, 3 Clusters)

Page 53

Limitations of K-means: Non-globular Shapes

(Figure: Original Points vs. K-means, 2 Clusters)

Page 54

The K-Medoids Clustering Method

• Find representative objects, called medoids, in clusters

• Medoids are located in the center of the clusters

– Given data points, how do we find the medoid?


Page 55

Categorical Values

• Handling categorical data: k-modes (Huang'98)

– Replaces the means of clusters with modes

• Mode of an attribute: its most frequent value

• Mode of instances: for an attribute A, mode(A) = the most frequent value of A among the cluster's instances

• K-modes otherwise proceeds like k-means

– Uses a frequency-based method to update the modes of clusters

– For a mixture of categorical and numerical data: the k-prototype method
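As a sketch of the frequency-based update (illustrative only; the attribute values and cluster contents below are made up), the mode of each categorical attribute within a cluster can be computed like this:

```python
from collections import Counter

def cluster_modes(cluster):
    """cluster: list of records, each a tuple of categorical attribute values.
    Returns the per-attribute mode, i.e., the k-modes cluster representative."""
    return tuple(Counter(values).most_common(1)[0][0] for values in zip(*cluster))

cluster = [("red", "small"), ("red", "large"), ("blue", "small")]
print(cluster_modes(cluster))   # -> ('red', 'small')
```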

Page 56

K-medoids example

Data points (x, y):

X1  (2, 6)
X2  (3, 4)
X3  (3, 8)
X4  (4, 7)
X5  (6, 2)
X6  (6, 4)
X7  (7, 3)
X8  (7, 4)
X9  (8, 5)
X10 (7, 6)

Page 57

• Initialize k medoids

• Let us assume c1 = (3, 4) and c2 = (7, 4)

• Calculate the distance (the costs below are Manhattan distances) from each data object to each medoid, so as to associate every data object with its nearest medoid.

Page 58

Distances to medoid c1 = (3, 4):

i    Xi       cost
1    (2, 6)   3
3    (3, 8)   4
4    (4, 7)   4
5    (6, 2)   5
6    (6, 4)   3
7    (7, 3)   5
9    (8, 5)   6
10   (7, 6)   6

Distances to medoid c2 = (7, 4):

i    Xi       cost
1    (2, 6)   7
3    (3, 8)   8
4    (4, 7)   6
5    (6, 2)   3
6    (6, 4)   1
7    (7, 3)   1
9    (8, 5)   2
10   (7, 6)   2

Cluster1 = {(3,4), (2,6), (3,8), (4,7)}
Cluster2 = {(7,4), (6,2), (6,4), (7,3), (8,5), (7,6)}

Page 59

• Select one of the non-medoids, O′. Let us assume O′ = (7, 3)

• Now the medoids are c1 = (3, 4) and O′ = (7, 3)

Distances to O′ = (7, 3):

i    Xi       cost
1    (2, 6)   8
3    (3, 8)   9
4    (4, 7)   7
5    (6, 2)   2
6    (6, 4)   2
8    (7, 4)   1
9    (8, 5)   3
10   (7, 6)   3

Distances to medoid c1 = (3, 4):

i    Xi       cost
1    (2, 6)   3
3    (3, 8)   4
4    (4, 7)   4
5    (6, 2)   5
6    (6, 4)   3
8    (7, 4)   4
9    (8, 5)   6
10   (7, 6)   6

• The total cost with {c1, O′} is 22, versus 20 with {c1, c2}, so the swap cost S = 22 - 20 = 2 > 0: do not change the medoid.
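The swap test above can be checked mechanically. Here is a minimal Python sketch (illustrative only; Manhattan distance, and only the single swap considered on the slide):

```python
def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def total_cost(points, medoids):
    """Sum of distances from each point to its nearest medoid."""
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

points = [(2, 6), (3, 4), (3, 8), (4, 7), (6, 2),
          (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)]

cost_before = total_cost(points, [(3, 4), (7, 4)])   # current medoids c1, c2
cost_after = total_cost(points, [(3, 4), (7, 3)])    # candidate swap: c2 -> O' = (7, 3)
print(cost_before, cost_after, cost_after - cost_before)   # -> 20 22 2  (S > 0, keep c2)
```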

Page 60

Fuzzy C-means Clustering

• Fuzzy c-means (FCM) is a method of clustering which allows one piece of data to belong to two or more clusters.

• This method (developed by Dunn in 1973 and improved by Bezdek in 1981) is frequently used in pattern recognition.

Pages 61-68

Fuzzy C-means Clustering

(Figures; see http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/cmeans.html)

Page 69

Fuzzy C-means Clustering

• For example: we have initial centroids 3 & 11 (with m = 2)

• For node 2 (the 1st element), the memberships follow the FCM formula $u_{ij} = 1 \big/ \sum_{k} (d_{ij}/d_{ik})^{2/(m-1)}$; with m = 2 the exponent is 2:

$U_{11} = \dfrac{1}{\left(\frac{|2-3|}{|2-3|}\right)^2 + \left(\frac{|2-3|}{|2-11|}\right)^2} = \dfrac{1}{1 + \frac{1}{81}} = \dfrac{81}{82} \approx 98.78\%$

(the membership of the first node in the first cluster)

$U_{12} = \dfrac{1}{\left(\frac{|2-11|}{|2-3|}\right)^2 + \left(\frac{|2-11|}{|2-11|}\right)^2} = \dfrac{1}{81 + 1} = \dfrac{1}{82} \approx 1.22\%$

(the membership of the first node in the second cluster)

Page 70

Fuzzy C-means Clustering

• For example: we have initial centroids 3 & 11 (with m = 2)

• For node 3 (the 2nd element), which coincides with centroid 3:

U21 = 100% (the membership of the second node in the first cluster)

U22 = 0% (the membership of the second node in the second cluster)

Page 71

Fuzzy C-means Clustering

• For example: we have initial centroids 3 & 11 (with m = 2)

• For node 4 (the 3rd element):

$U_{31} = \dfrac{1}{\left(\frac{|4-3|}{|4-3|}\right)^2 + \left(\frac{|4-3|}{|4-11|}\right)^2} = \dfrac{1}{1 + \frac{1}{49}} = \dfrac{49}{50} = 98\%$

(the membership of the third node in the first cluster)

$U_{32} = \dfrac{1}{\left(\frac{|4-11|}{|4-3|}\right)^2 + \left(\frac{|4-11|}{|4-11|}\right)^2} = \dfrac{1}{49 + 1} = \dfrac{1}{50} = 2\%$

(the membership of the third node in the second cluster)

Page 72

Fuzzy C-means Clustering

• For example: we have initial centroids 3 & 11 (with m = 2)

• For node 7 (the 4th element), which is equally far from both centroids:

$U_{41} = \dfrac{1}{\left(\frac{|7-3|}{|7-3|}\right)^2 + \left(\frac{|7-3|}{|7-11|}\right)^2} = \dfrac{1}{1 + 1} = 50\%$

(the membership of the fourth node in the first cluster)

$U_{42} = \dfrac{1}{\left(\frac{|7-11|}{|7-3|}\right)^2 + \left(\frac{|7-11|}{|7-11|}\right)^2} = \dfrac{1}{1 + 1} = 50\%$

(the membership of the fourth node in the second cluster)

Page 73

Fuzzy C-means Clustering

• The updated first centroid is the membership-weighted mean of the data points:

$C_1 = \dfrac{(98.78\%)^2 \cdot 2 + (100\%)^2 \cdot 3 + (98\%)^2 \cdot 4 + (50\%)^2 \cdot 7 + \cdots}{(98.78\%)^2 + (100\%)^2 + (98\%)^2 + (50\%)^2 + \cdots}$
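A minimal Python sketch of these two FCM steps (membership update and centroid update) for 1-D data, assuming m = 2 and using only the four points listed on the slides (the slide's list continues with further points):

```python
def memberships(x, centroids, m=2):
    """FCM membership of point x in each cluster; handles coincidence with a centroid."""
    d = [abs(x - c) for c in centroids]
    if 0.0 in d:                       # the point sits exactly on a centroid
        return [1.0 if di == 0.0 else 0.0 for di in d]
    p = 2.0 / (m - 1)
    return [1.0 / sum((d[j] / d[k]) ** p for k in range(len(centroids)))
            for j in range(len(centroids))]

data = [2, 3, 4, 7]                    # first four elements from the example
centroids = [3.0, 11.0]
U = [memberships(x, centroids) for x in data]
print([[round(u, 4) for u in row] for row in U])
# -> [[0.9878, 0.0122], [1.0, 0.0], [0.98, 0.02], [0.5, 0.5]]

# centroid update: membership-weighted mean with weights u^m
m = 2
c1 = sum(U[i][0] ** m * data[i] for i in range(len(data))) / sum(U[i][0] ** m for i in range(len(data)))
print(round(c1, 3))                    # ~3.31 using only these four points
```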

Page 74

FCM Soft Clustering

Page 75

Gaussian Mixture Model

Pages 76-78

(Figures: two Gaussian component densities p(x) over x, and the resulting mixture-model density.)

Page 79

The EM Algorithm

• Dempster, Laird, and Rubin

– general framework for likelihood-based parameter estimation with missing data

• start with initial guesses of parameters

• E step: estimate memberships given params

• M step: estimate params given memberships

• Repeat until convergence

– converges to a (local) maximum of likelihood

– E step and M step are often computationally simple

– generalizes to maximum a posteriori (with priors)

Pages 80-86

(Figures: "ANEMIA PATIENTS AND CONTROLS": Red Blood Cell Hemoglobin Concentration plotted against Red Blood Cell Volume, with the fitted mixture shown at EM iterations 1, 3, 5, 10, 15, and 25.)

Page 87

Gaussian Mixture

• The Gaussian mixture architecture estimates probability density functions (PDFs) for each class, and then performs classification based on Bayes' rule:

$P(C_i \mid X) = \dfrac{P(X \mid C_i)\, P(C_i)}{P(X)}$

where P(X | Ci) is the PDF of class i, evaluated at X, P(Ci) is the prior probability of class i, and P(X) is the overall PDF, evaluated at X.

Page 88

Gaussian Mixture

• Unlike the unimodal Gaussian architecture, which assumes P(X | Cj) to be in the form of a single Gaussian, the Gaussian mixture model estimates P(X | Cj) as a weighted average of multiple Gaussians:

$P(X \mid C_j) = \sum_{k=1}^{N_c} w_k\, G_k$

where wk is the weight of the k-th Gaussian Gk and the weights sum to one. One such PDF model is produced for each class.

Page 89

Gaussian Mixture

• Each Gaussian component is defined as:

$G_k = \dfrac{1}{(2\pi)^{n/2}\,|V_k|^{1/2}}\; e^{-\frac{1}{2}(X - M_k)^T V_k^{-1}(X - M_k)}$

where Mk is the mean of the Gaussian and Vk is the covariance matrix of the Gaussian.

Page 90

Gaussian Mixture

• Free parameters of the Gaussian mixture model consist of the means and covariance matrices of the Gaussian components and the weights indicating the contribution of each Gaussian to the approximation of P(X | Cj).

Page 91

Composition of Gaussian Mixture

(Figure: Class 1 modeled as a combination of weighted Gaussians G1,w1; G2,w2; G3,w3; G4,w4; G5,w5.)

$P(C_j \mid X) = \dfrac{P(X \mid C_j)\, P(C_j)}{P(X)}, \qquad P(X \mid C_j) = \sum_{k=1}^{N_c} w_k\, G_k, \qquad p(X \mid G_i) = \dfrac{1}{(2\pi)^{d/2}\,|V_i|^{1/2}}\; e^{-\frac{1}{2}(X - \mu_i)^T V_i^{-1}(X - \mu_i)}$

Variables: μi, Vi, wk

We use the EM (estimate-maximize) algorithm to approximate these variables.

Page 92

Gaussian Mixture

• These parameters are tuned using a complex iterative procedure called the estimate-maximize (EM) algorithm, which aims at maximizing the likelihood of the training set under the estimated PDF.

Page 93

Gaussian Mixture Training Flow Chart (1)

• Initialize the Gaussian means μi, i = 1, …, G, using the k-means clustering algorithm

• Initialize the covariance matrices Vi to the distance to the nearest cluster

• Initialize the weights πi = 1/G so that all Gaussians are equally likely

• Present each pattern X of the training set and model each of the classes K as a weighted sum of Gaussians:

$p_s(X) = \sum_{i=1}^{G} \pi_i\, p(X \mid G_i)$

where G is the number of Gaussians, the πi's are the weights, and

$p(X \mid G_i) = \dfrac{1}{(2\pi)^{d/2}\,|V_i|^{1/2}}\; e^{-\frac{1}{2}(X - \mu_i)^T V_i^{-1}(X - \mu_i)}$

where Vi is the covariance matrix.

Page 94

Gaussian Mixture Training Flow Chart (2)

E-Step for EM. Compute, for each pattern Xp and each Gaussian Gi:

$\tau_{ip} = P(G_i \mid X_p, C_k) = \dfrac{p(X_p \mid G_i, C_k)\,\pi_i}{p(X_p \mid C_k)} = \dfrac{p(X_p \mid G_i, C_k)\,\pi_i}{\sum_{j=1}^{G} p(X_p \mid G_j, C_k)\,\pi_j}$

M-Step for EM. Iteratively update the weights, means and covariances:

$\pi_i(t+1) = \dfrac{1}{N_c}\sum_{p=1}^{N_c} \tau_{ip}(t)$

$\mu_i(t+1) = \dfrac{1}{N_c\,\pi_i(t+1)} \sum_{p=1}^{N_c} \tau_{ip}(t)\, X_p$

$V_i(t+1) = \dfrac{1}{N_c\,\pi_i(t+1)} \sum_{p=1}^{N_c} \tau_{ip}(t)\, \big(X_p - \mu_i(t)\big)\big(X_p - \mu_i(t)\big)^T$

Note: Here is the PE which we did in the class
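A compact NumPy sketch of these E and M steps for a 1-D, two-component mixture (illustrative only: the data, the initialization, and the fixed iteration count are assumptions, and it fits a single class rather than one mixture per class):

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic 1-D data drawn from two Gaussians, just for illustration
X = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(5.0, 1.5, 300)])

G = 2
pi = np.full(G, 1.0 / G)                   # mixture weights
mu = np.array([X.min(), X.max()])          # crude initial means
var = np.full(G, X.var())                  # initial variances

def gauss(x, m, v):
    return np.exp(-0.5 * (x - m) ** 2 / v) / np.sqrt(2 * np.pi * v)

for _ in range(50):
    # E-step: responsibilities tau[i, p] = P(G_i | X_p)
    num = np.array([pi[i] * gauss(X, mu[i], var[i]) for i in range(G)])
    tau = num / num.sum(axis=0, keepdims=True)
    # M-step: update weights, means and variances
    Ni = tau.sum(axis=1)
    pi = Ni / len(X)
    mu = (tau @ X) / Ni
    var = np.array([(tau[i] * (X - mu[i]) ** 2).sum() / Ni[i] for i in range(G)])

print(np.round(pi, 2), np.round(mu, 2), np.round(var, 2))
```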

Page 95

Gaussian Mixture Training Flow Chart (3)

Recompute τip using the new weights, means and covariances. Stop training if

$\big|\tau_{ip}(t+1) - \tau_{ip}(t)\big| < \text{threshold}$

or the number of epochs reaches the specified value. Otherwise, continue the iterative updates.

Page 96

Gaussian Mixture Test Flow Chart

Present each input pattern X and compute the confidence for each class j:

$P(C_j)\, P(X \mid C_j)$

where $P(C_j) = \dfrac{N_{c_j}}{N}$ is the prior probability of class Cj, estimated by counting the number of training patterns. Classify pattern X as the class with the highest confidence.

Gaussian Mixture End

Page 97

Acknowledgements

Material in these slides has been taken from the following resources:

• Introduction to Machine Learning, Alpaydin

• "Pattern Classification" by Duda et al., John Wiley & Sons

• Read GMM from "Automated Detection of Exudates in Colored Retinal Images for Diagnosis of Diabetic Retinopathy", Applied Optics, Vol. 51 No. 20, 4858-4866, 2012