Slide 1
EE3J2 Data Mining
Lecture 18: K-means and Agglomerative Algorithms
Slide 2
Today
Unsupervised learning
Clustering
K-means
Slide 3
Distortion
The distortion for the centroid set C = {c_1, …, c_M} is defined by:

Dist(C) = \sum_{t=1}^{T} d(y_t, c_{i(t)})

where y_1, …, y_T are the data points and c_{i(t)} is the centroid closest to y_t. In other words, the distortion is the sum of distances between each data point and its nearest centroid.

The task of clustering is to find a centroid set C such that the distortion Dist(C) is minimised.
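As a minimal sketch of this computation (assuming numpy and Euclidean distance; the variable names are illustrative, not from the lecture):

```python
import numpy as np

def distortion(data, centroids):
    """Dist(C): sum over all points of the distance to the nearest centroid."""
    data = np.asarray(data, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    total = 0.0
    for y in data:                                     # each data point y_t
        dists = np.linalg.norm(centroids - y, axis=1)  # d(y_t, c_m) for every centroid
        total += dists.min()                           # nearest centroid c_i(t)
    return total
```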
Slide 4
The K-Means Clustering Method
Given k, the k-means algorithm is implemented in four steps (a code sketch follows the list):

1. Initialisation: define the number of clusters (k) and designate a cluster centre (a vector of the same dimensionality as the data) for each cluster.
2. Assign each data point to the closest cluster centre (centroid); that data point is now a member of that cluster.
3. Calculate the new cluster centre of each cluster (the average, i.e. geometric centre, of all the members of that cluster).
4. Calculate the within-cluster sum-of-squares. If this value has not changed significantly over a certain number of iterations, exit the algorithm; otherwise, go back to Step 2.

Remember: you converge when you have found the minimum overall distance between the centroids and the objects.
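A minimal, self-contained sketch of these four steps (assuming numpy, Euclidean distance, and random initialisation from the data; the tolerance and iteration cap are illustrative choices, not part of the lecture):

```python
import numpy as np

def k_means(data, k, max_iter=100, tol=1e-6, seed=0):
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: initialise centroids as k randomly chosen data points
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    prev_wcss = np.inf
    for _ in range(max_iter):
        # Step 2: assign each point to its closest centroid
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its members
        for j in range(k):
            members = data[labels == j]
            if len(members) > 0:            # keep the old centre if a cluster empties
                centroids[j] = members.mean(axis=0)
        # Step 4: stop when the within-cluster sum-of-squares settles
        wcss = ((data - centroids[labels]) ** 2).sum()
        if abs(prev_wcss - wcss) < tol:
            break
        prev_wcss = wcss
    return centroids, labels
```

For example, k_means(points, k=2) runs the two-cluster case illustrated on the next slide.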
Slide 5
K-Means Example (K=2)
[Figure: 2-D points clustered over successive iterations]
Pick seeds
Reassign clusters
Compute centroids
Reassign clusters
Compute centroids
Reassign clusters
Converged!
[From Mooney]
Slide 6
So… Basically

Start with k randomly chosen data points (objects) as the initial centroids C^0_k.
Find the set of data points that are closest to C^0_k (call it Y^0_k).
Compute the average of these points to get the new centroid C^1_k.
Now repeat the process: find the objects closest to C^1_k, compute their average to get the new centroid C^2_k, and so on, until convergence.
Slide 7
Comments on the K-Means Method

Strengths:
Relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations; normally k, t << n.
Often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms.

Weaknesses:
Applicable only when the mean is defined – what about categorical data?
Need to specify k, the number of clusters, in advance.
Unable to handle noisy data and outliers.
Not suitable for discovering clusters with non-convex shapes.
Slide 8
Hierarchical Clustering
Grouping data objects into a tree of clusters.
Agglomerative clustering: begin by assuming that every data point is a separate centroid, then combine the closest centroids until the desired number of clusters is reached (see the sketch below).
Divisive clustering: begin by assuming that there is just one centroid/cluster, then split clusters until the desired number of clusters is reached.
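As a rough sketch of the agglomerative variant (assuming numpy; clusters are compared via their mean vectors, matching the worked example that follows):

```python
import numpy as np

def agglomerate(data, n_clusters):
    """Merge the two closest clusters (compared via their mean vectors)
    until n_clusters remain; returns a list of member-index lists."""
    data = np.asarray(data, dtype=float)
    clusters = [[i] for i in range(len(data))]   # every point starts as its own cluster
    while len(clusters) > n_clusters:
        means = [data[c].mean(axis=0) for c in clusters]
        best = (np.inf, 0, 1)                    # (distance, index a, index b)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = np.linalg.norm(means[a] - means[b])
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters[b]               # combine the closest pair
        del clusters[b]
    return clusters
```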
Slide 9
Agglomerative Clustering - Example
Student   Exam 1   Exam 2   Exam 3
Mike         9        3        7
Tom         10        2        9
Bill         1        9        4
T Ren        6        5        5
Ali          1       10        3
Slide 10
Distances between objects
Using the Euclidean distance measure, what is the distance between Mike and Tom?
Mike: (9, 3, 7); Tom: (10, 2, 9)
d(x, y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_N - y_N)^2}

d(Mike, Tom) = \sqrt{(9 - 10)^2 + (3 - 2)^2 + (7 - 9)^2} = \sqrt{6} \approx 2.5
Slide 11
Distance Matrix
        Mike    Tom     Bill    T Ren   Ali
Mike    0       2.5     10.44   4.12    11.36
Tom     -       0       12.45   6.4     13.45
Bill    -       -       0       6.48    1.41
T Ren   -       -       -       0       7.35
Ali     -       -       -       -       0
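A quick numpy check of this table (a sketch; the marks array simply restates the example data above):

```python
import numpy as np

students = ["Mike", "Tom", "Bill", "T Ren", "Ali"]
marks = np.array([[9, 3, 7],
                  [10, 2, 9],
                  [1, 9, 4],
                  [6, 5, 5],
                  [1, 10, 3]], dtype=float)

# pairwise Euclidean distances; e.g. d(Mike, Tom) = sqrt(6) ~ 2.45,
# shown rounded as 2.5 in the table above
dist = np.linalg.norm(marks[:, None, :] - marks[None, :, :], axis=2)
print(np.round(dist, 2))
```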
Slide 12
The Algorithm – Step 1

Identify the entities which are most similar; this can be read directly off the distance table.
In this example, Bill and Ali are the most similar, with a distance of 1.41. They are therefore the most 'related'.
[Dendrogram fragment: Bill and Ali joined]
Slide 13
The Algorithm – Step 2
The two entities that are most similar can now be merged so that they represent a single cluster (or new entity).
So Bill and Ali can now be considered to be a single entity. How do we compare this entity with others? We use the average linkage between the two – here, the distance to their averaged vector.
So the new average vector is [1, 9.5, 3.5] – see the first table and average the marks for Bill and Ali.
We now need to redraw the distance table, including the merger of the two entities, and new distance calculations.
Slide 14
The Algorithm – Step 3
              Mike   Tom    T Ren   {Bill & Ali}
Mike          -      2.5    4.12    10.9
Tom           -      -      6.4     12.9
T Ren         -      -      -       6.9
{Bill & Ali}  -      -      -       -
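To sanity-check the {Bill & Ali} column (a sketch reusing the same marks data; "distance to the averaged vector" follows the Step 2 description):

```python
import numpy as np

marks = np.array([[9, 3, 7], [10, 2, 9], [1, 9, 4], [6, 5, 5], [1, 10, 3]], dtype=float)
merged = (marks[2] + marks[4]) / 2            # Bill & Ali -> [1. , 9.5, 3.5]
for i, name in [(0, "Mike"), (1, "Tom"), (3, "T Ren")]:
    print(name, round(float(np.linalg.norm(marks[i] - merged)), 1))
    # -> Mike 10.9, Tom 12.9, T Ren 6.9
```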
Slide 15
Next closest students
Mike and Tom with 2.5! So, now we have 2 clusters!
[Dendrogram fragments: {Bill, Ali} and {Mike, Tom}]
Slide 16
The distance matrix now
              {Mike & Tom}   T Ren   {Bill & Ali}
{Mike & Tom}  -              5.2     11.9
T Ren         -              -       6.9
{Bill & Ali}  -              -       -
Now the smallest distance is 5.2, so T Ren joins the {Mike & Tom} cluster; the final merge then combines the two remaining clusters into one.
Slide 17
The final dendrogram

[Dendrogram: Bill and Ali merge first, then Mike and Tom; T Ren joins next, and the last merge leaves one cluster.]

Many 'sub-clusters' within one cluster.
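To reproduce the tree programmatically, SciPy's hierarchical-clustering routines can be used (a sketch assuming scipy and matplotlib; method='centroid' compares clusters by the distance between their mean vectors, matching the averaged-vector linkage used above):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

marks = np.array([[9, 3, 7], [10, 2, 9], [1, 9, 4], [6, 5, 5], [1, 10, 3]], dtype=float)
Z = linkage(marks, method="centroid")   # merge history, closest pair first
dendrogram(Z, labels=["Mike", "Tom", "Bill", "T Ren", "Ali"])
plt.show()
```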
Slide 18
Conclusions
K-Means algorithm – memorise the equations and the algorithm.
Hierarchical clustering: agglomerative clustering.