
Page 1

Clustering Algorithms

Sunida Ratanothayanon

Page 2

What is Clustering?

Page 3

Clustering

Clustering is a classification pattern that divides data into meaningful and useful groups.

It is an unsupervised classification pattern.


Page 5

Outline

K-Means Algorithm

Hierarchical Clustering Algorithm

Page 6

K-Means Algorithm

A partitional clustering algorithm that produces k clusters (the number k is specified by the user).

Each cluster has a cluster center called the centroid.

The algorithm iteratively groups data into k clusters based on a distance function.

Page 7

K-Means Algorithm

The centroid is obtained as the mean of all data points in the cluster.

Stop when the centers no longer change.
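As a rough sketch of these steps, the following minimal NumPy implementation (written for this note; the function name, the max_iter cap and the convergence check are illustrative, not from the slides) alternates assignment and centroid updates until the centers stop moving.

```python
import numpy as np

def kmeans(points, centers, max_iter=100):
    """Minimal k-means sketch: assign points to the nearest center,
    recompute centroids, and stop when the centers no longer change."""
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    for _ in range(max_iter):
        # Euclidean distance of every point to every center, then assign to the closest.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # New centroid = mean of all data points currently in the cluster.
        new_centers = np.array([points[labels == k].mean(axis=0)
                                for k in range(len(centers))])
        if np.allclose(new_centers, centers):  # no change of centers -> stop
            return labels, centers
        centers = new_centers
    return labels, centers
```

An empty cluster would make the mean undefined; this sketch does not handle that corner case.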

Page 8

A numerical example

Page 9

K-Means example

We have five data points with 2 attributes and want to group the data into 2 clusters (k = 2).

Data Point   x1   x2
1            22   21
2            19   20
3            18   22
4             1    3
5             4    2

Page 10

K-Means example

We can plot the five data points as follows.

[Figure: plot of the 5 data points over x1 and x2, with clusters C1 and C2 marked.]

Page 11

K-Means example (1st iteration)

Step 1: Choosing the centers and defining k

Data Point   x1   x2
1            22   21
2            19   20
3            18   22
4             1    3
5             4    2

C1 = (18, 22), C2 = (4, 2)

Step 2: Computing cluster centers. We have already defined C1 and C2.

Step 3: Finding the Euclidean distance of each data point from each center and assigning each data point to a cluster,

where the distance between two points x and y with n attributes is

d(x, y) = sqrt( sum_{i=1..n} (x_i - y_i)^2 )
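As a quick check of Step 3, the distance table on the next page can be reproduced with a few lines of NumPy (a small illustrative script, not part of the original slides):

```python
import numpy as np

points = np.array([[22, 21], [19, 20], [18, 22], [1, 3], [4, 2]], dtype=float)
centers = np.array([[18, 22], [4, 2]], dtype=float)  # C1 and C2 chosen above

# Euclidean distance of every data point to C1 and C2.
dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
print(np.round(dists, 2))    # rows = data points, columns = C1, C2
print(dists.argmin(axis=1))  # 0 -> closer to C1, 1 -> closer to C2
```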

Page 12

K-Means example (1st iteration)

Step 3 (cont.): Distance table for all data points

Data Point   C1 (18, 22)   C2 (4, 2)
(22, 21)     4.13          26.9
(19, 20)     2.23          23.43
(18, 22)     0             24.41
(1, 3)       25.49         3.1
(4, 2)       24.41         0

Then we assign each data point to a cluster by comparing its distances to the two centers; each data point is assigned to its closest cluster.

Page 13

K-Means example (2nd iteration)

Step 2: Computing cluster centers. We compute new cluster centers.

The members of cluster 1 are (22, 21), (19, 20) and (18, 22). We take the average of these data points:

x1: 22 + 19 + 18 = 59, and 59 / 3 = 19.7
x2: 21 + 20 + 22 = 63, and 63 / 3 = 21

New C1 is (19.7, 21).

The members of cluster 2 are (1, 3) and (4, 2):

x1: 1 + 4 = 5, and 5 / 2 = 2.5
x2: 3 + 2 = 5, and 5 / 2 = 2.5

New C2 is (2.5, 2.5).
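The same centroid update can be checked directly with NumPy (illustrative only):

```python
import numpy as np

cluster1 = np.array([[22, 21], [19, 20], [18, 22]], dtype=float)
cluster2 = np.array([[1, 3], [4, 2]], dtype=float)

print(cluster1.mean(axis=0))  # [19.67 21.  ]  -> new C1 (19.7, 21 after rounding)
print(cluster2.mean(axis=0))  # [2.5 2.5]      -> new C2
```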

Page 14

K-Means example (2nd iteration)

Step 3: Finding the Euclidean distance of each data point from each new center and assigning each data point to a cluster

Distance table for all data points with the new centers:

Data Point   C1' (19.7, 21)   C2' (2.5, 2.5)
(22, 21)     2.3              26.88
(19, 20)     1.22             24.05
(18, 22)     1.97             24.91
(1, 3)       25.96            1.58
(4, 2)       24.65            1.58

Assign each data point to a cluster by comparing its distances to the centers; each data point is assigned to its closest cluster.

Repeat steps 2 and 3 for the next iteration because the centers have changed.

Page 15

K-Means example (3rd iteration)

Step 2: Computing cluster centers. We compute new cluster centers.

The members of cluster 1 are still (22, 21), (19, 20) and (18, 22), so averaging them again gives C1 = (19.7, 21).

The members of cluster 2 are still (1, 3) and (4, 2), so C2 = (2.5, 2.5).

Page 16

K-Means example (3rd iteration)

Step 3: Finding the Euclidean distance of each data point from each center and assigning each data point to a cluster

Distance table for all data points with the new centers:

Data Point   C1'' (19.7, 21)   C2'' (2.5, 2.5)
(22, 21)     2.3               26.88
(19, 20)     1.22              24.05
(18, 22)     1.97              24.91
(1, 3)       25.96             1.58
(4, 2)       24.65             1.58

Assign each data point to a cluster by comparing its distances to the centers; each data point is assigned to its closest cluster. The assignments do not change.

Stop the algorithm because the centers remain the same.
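For readers who want to verify the whole run, scikit-learn's KMeans gives the same result when started from the same initial centers (this assumes scikit-learn is installed; it is an illustrative check, not part of the original slides):

```python
import numpy as np
from sklearn.cluster import KMeans  # assumes scikit-learn is available

X = np.array([[22, 21], [19, 20], [18, 22], [1, 3], [4, 2]], dtype=float)
init = np.array([[18, 22], [4, 2]], dtype=float)  # same starting centers as the example

km = KMeans(n_clusters=2, init=init, n_init=1).fit(X)
print(km.labels_)           # expected [0 0 0 1 1]: first three points in C1, last two in C2
print(km.cluster_centers_)  # approximately [[19.67, 21.0], [2.5, 2.5]]
```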

Page 17

Hierarchical Clustering Algorithm

Produces a nested sequence of clusters, like a tree, and allows subclusters.

Individual data points at the bottom of the tree are called "singleton clusters".

[Figure: example dendrogram over the points A, B, C, D, E.]

Page 18

Hierarchical Clustering Algorithm

Agglomerative method: the tree is built up from the bottom level, merging the nearest pair of clusters at each level to go one level up.

Continue until all the data points are merged into a single cluster.
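A compact sketch of this agglomerative procedure is shown below (written for this note; the cluster naming and the dictionary-based distance bookkeeping are illustrative). At each step it merges the closest pair of clusters and, as in the worked example that follows, scores the distance between a merged cluster and any other cluster as the average of the two previously computed distances:

```python
import numpy as np

def agglomerative(points, names):
    """Bottom-up clustering: start from singleton clusters and repeatedly
    merge the closest pair until a single cluster remains."""
    points = np.asarray(points, dtype=float)
    names = list(names)
    # Pairwise Euclidean distances between the initial singleton clusters.
    dist = {frozenset((a, b)): float(np.linalg.norm(points[i] - points[j]))
            for i, a in enumerate(names) for j, b in enumerate(names) if i < j}
    merges = []
    while len(names) > 1:
        pair = min(dist, key=dist.get)        # most similar pair of clusters
        a, b = sorted(pair)
        merged = a + "&" + b
        merges.append((a, b, dist.pop(pair)))
        # Distance from the merged cluster to every other cluster:
        # the average of the two previously computed distances.
        for c in names:
            if c in (a, b):
                continue
            dist[frozenset((merged, c))] = (dist.pop(frozenset((a, c)))
                                            + dist.pop(frozenset((b, c)))) / 2
        names = [c for c in names if c not in (a, b)] + [merged]
    return merges

# The five 3-attribute points used in the example below (A, B, C, D, E):
pts = [[9, 3, 7], [10, 2, 9], [1, 9, 4], [6, 5, 5], [1, 10, 3]]
for a, b, d in agglomerative(pts, ["A", "B", "C", "D", "E"]):
    print(f"merge {a} + {b} at distance {d:.2f}")
# merge C + E at 1.41, A + B at 2.45, A&B + D at 5.26, A&B&D + C&E at about 9.4
```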

Page 19

A numerical example

Page 20

Hierarchical Clustering example

We have five data points with 3 attributes

Data Point x1 x2 x3

A 9 3 7

B 10 2 9

C 1 9 4

D 6 5 5

E 1 10 3

Page 21

Hierarchical Clustering example (1st iteration)

Step 1: Calculating the Euclidean distance between each pair of points

We obtain the following distance table:

Data Point     A (9, 3, 7)   B (10, 2, 9)   C (1, 9, 4)   D (6, 5, 5)   E (1, 10, 3)
A (9, 3, 7)    0             2.45           10.44         4.12          11.36
B (10, 2, 9)   -             0              12.45         6.4           13.45
C (1, 9, 4)    -             -              0             6.48          1.41
D (6, 5, 5)    -             -              -             0             7.35
E (1, 10, 3)   -             -              -             -             0
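This distance table can be reproduced with SciPy's pairwise distance helpers (an illustrative check, assuming SciPy is available):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform  # assumes SciPy is installed

pts = np.array([[9, 3, 7], [10, 2, 9], [1, 9, 4], [6, 5, 5], [1, 10, 3]], dtype=float)

D = squareform(pdist(pts))   # full 5x5 Euclidean distance matrix (A, B, C, D, E order)
print(np.round(D, 2))        # e.g. the C-E entry is about 1.41, the A-B entry about 2.45
```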

Page 22

Hierarchical Clustering example (1st iteration)

Step 2: Forming a tree. Consider the most similar pair of data points from the previous distance table.

C and E are the most similar (distance 1.41), so we obtain the first cluster:

[Figure: dendrogram with C and E joined into one cluster.]

Repeat steps 1 and 2 until all data points are merged into a single cluster.

Page 23

Hierarchical Clustering example (2nd iteration)

Step 1: Calculating the Euclidean distance between clusters

We redraw the distance table, treating the merged pair C&E as a single entity:

Data Point          A (9, 3, 7)   B (10, 2, 9)   D (6, 5, 5)   C&E (1, 9.5, 3.5)
A (9, 3, 7)         0             2.45           4.12          10.9
B (10, 2, 9)        -             0              6.4           12.95
D (6, 5, 5)         -             -              0             6.90
C&E (1, 9.5, 3.5)   -             -              -             0

The distance from C&E to A is obtained from the previous table, using the distances from C to A and from E to A:

d(C&E, A) = avg( d(C, A), d(E, A) ) = avg(10.44, 11.36) = 10.9
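The other entries of the C&E column follow the same rule; a short illustrative script (values agree with the table up to small rounding differences):

```python
import numpy as np

pts = np.array([[9, 3, 7], [10, 2, 9], [1, 9, 4], [6, 5, 5], [1, 10, 3]], dtype=float)
D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)  # rows/cols in A, B, C, D, E order

# Distance from the merged cluster C&E to A, B and D:
# the average of the C-row and E-row distances from the original table.
for name, idx in [("A", 0), ("B", 1), ("D", 3)]:
    print(name, round((D[2, idx] + D[4, idx]) / 2, 2))  # about 10.9, 12.95, 6.9
```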

Page 24

Hierarchical Clustering example (2nd iteration)

Step 2: Forming a tree. Consider the most similar pair of clusters from the previous distance table.

A and B are the most similar (distance 2.45), so we obtain the second cluster:

[Figure: dendrogram with C&E and A&B each joined into a cluster.]

Repeat steps 1 and 2 until all data points are merged into a single cluster.

Page 25

Hierarchical Clustering example (3rd iteration)

Step 1: Calculating the Euclidean distance between clusters

From the previous table we can obtain the distances for the new distance table, redrawn with the merged entities C&E and A&B:

Data Point    A&B   D (6, 5, 5)   C&E
A&B           0     5.26          11.93
D (6, 5, 5)   -     0             6.9
C&E           -     -             0

d(A&B, D) = avg( d(A, D), d(B, D) ) = avg(4.12, 6.40) = 5.26

d(C&E, A&B) = avg( d(C&E, A), d(C&E, B) ) = avg(10.9, 12.95) = 11.93

d(C&E, D) = 6.90 (unchanged from the previous table)

Page 26

Hierarchical Clustering example (3rd iteration)

Step 2: Forming a tree. Consider the most similar pair of clusters from the previous distance table.

A&B and D are the most similar (distance 5.26), so we obtain the new cluster A&B&D:

[Figure: dendrogram with D joined to the A&B cluster; C&E remains a separate cluster.]

Repeat steps 1 and 2 until all data points are merged into a single cluster.

Page 27

Hierarchical Clustering example (4th iteration)

Step 1: Calculating the Euclidean distance between clusters

From the previous table we can obtain the distance from cluster A&B&D to C&E. We redraw the distance table with the merged entities A&B&D and C&E:

Data Point   A&B&D   C&E
A&B&D        0       9.4
C&E          -       0

d(A&B&D, C&E) = avg( d(A&B, C&E), d(D, C&E) ) = avg(11.93, 6.9) = 9.4

Page 28

Hierarchical Clustering example (4th iteration)

Step 2: Forming a tree. Consider the most similar pair of clusters from the previous distance table.

Only one pair remains, so no more recalculation is needed and we can form the final tree: all data points are merged into the single cluster A&B&D&C&E.

Stop the algorithm.

[Figure: final dendrogram joining A&B&D with C&E at the top level.]
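As a final sanity check, SciPy's hierarchical clustering reproduces this merge order. Averaging previously computed cluster distances, as done above, corresponds to SciPy's "weighted" (WPGMA) linkage; this is an illustrative check assuming SciPy is installed:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

pts = np.array([[9, 3, 7], [10, 2, 9], [1, 9, 4], [6, 5, 5], [1, 10, 3]], dtype=float)

# 'weighted' linkage scores a merged cluster against others by averaging
# the two previously computed cluster distances, as in the example above.
Z = linkage(pdist(pts), method="weighted")
print(np.round(Z, 2))
# Each row is (cluster i, cluster j, merge distance, size): C+E at ~1.41, A+B at ~2.45,
# then A&B + D at ~5.26 and finally A&B&D + C&E at ~9.4, matching the dendrogram above.
```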

Page 29

Conclusion

We covered two major clustering algorithms.

K-Means algorithm: iteratively groups data into k clusters based on a distance function; the number of clusters k is specified by the user.

Hierarchical Clustering algorithm: produces a nested sequence of clusters, like a tree. The tree is built up from the bottom level, continuing until all the data points are merged into a single cluster.

Page 30

References

[1] Hastie, T., Tibshirani, R., & Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Chapter: Unsupervised Learning, pp. 453-480.

[2] Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data Clustering: A Review. ACM Computing Surveys, 31(3), 264-330.

[3] Liu, B. (2006). Web Data Mining. Chapter: Unsupervised Learning. Springer, pp. 117-150.

[4] Tan, P.-N., Steinbach, M., & Kumar, V. Introduction to Data Mining. Chapter: Cluster Analysis: Basic Concepts and Algorithms, pp. 487-553.

Page 31

Thank you