Clustering methods Course code: 175314

Pasi Fränti

9.2.2017

Machine Learning, School of Computing

University of Eastern Finland, Joensuu, FINLAND

Part 1: Introduction

Clustering methods

Sample data

Sources of RGB vectors

Red-Green plot of the vectors

Sample data

Employment statistics:

Application examples

Color reconstruction

Image with compression artifacts

Image with original colors

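Speaker modeling for voice biometrics

[Figure: training data from speakers Matti, Mikko and Tomi goes through feature extraction and clustering to produce speaker models. An unknown sample (?) goes through feature extraction and is compared with the models. Best match: Matti!]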

Speaker modeling

Figure panels: speech data, result of clustering

Image segmentation

Normalized color plots according to red and green components.

Image with 4 color clusters

Plot axes: red, green

Signal quantization

Figure: original signal and quantized signal

Quantization approximates a continuous range of values (or a very large set of possible discrete values) by a small set of discrete symbols or integer values.

http://en.wikipedia.org/wiki/Image:Quantized.signal.svg
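As a minimal sketch of the definition above, the following Python snippet maps continuous values to a small set of integer symbols and back to representative values. It assumes a uniform quantizer with equal-width intervals, which is only one possible choice; the function name `quantize_uniform` and its parameters `levels`, `lo` and `hi` are hypothetical and not taken from the slides.

```python
import numpy as np

def quantize_uniform(signal, levels, lo, hi):
    """Uniform scalar quantizer (illustrative assumption, not the slides' method).

    Returns the integer symbol of each sample and the value that
    symbol represents (the midpoint of its interval).
    """
    step = (hi - lo) / levels
    # Interval index of each sample; the top edge hi is clipped into the last interval.
    idx = np.clip(np.floor((signal - lo) / step), 0, levels - 1).astype(int)
    approx = lo + (idx + 0.5) * step
    return idx, approx

signal = np.array([0.0, 0.1, 0.49, 0.5, 1.0])
idx, approx = quantize_uniform(signal, levels=4, lo=0.0, hi=1.0)
print(idx)     # integer symbols in 0..3
print(approx)  # interval midpoints used as approximations
```

Clustering-based quantization replaces the fixed, equal-width intervals with representatives fitted to the data.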

Color quantization of images

Color image RGB samples

Clustering

Users on map

Clustering the users

Clustering of photos in two ways

Clustering of photos

Clustering timeline

Photo clusters on map

Clusters in the timeline view

Figure labels: user and date, number of photos, clusters, last known location of the user

Functions: Open cluster, Start slideshow

Clustering GPS tracks

Mobile users, taxi routes, fleet management

Conclusions from clusters

Cluster 1: Office

Cluster 2: Home

Clustering keywords

Keywords in the figure: hostel, auberge, lodge, hostelry, film, cinema, movie, luncheon, meal, lunch, arena, stadium, gym, gymnasium, cafe, eatery, cafeteria, restaurant, coffeehouse, snack, collation, saloon, bar, barroom, ginmill, store, shop, cubicle, market, stall, pharmacy, kiosk, booth, outlet, drugstore

Values shown in the figure: 0, 0.48, 0.56, 0.74, 0.85, 0.88, 0.91, 0.92

Clustering text descriptions

Home take care services

Clustering user preferences

Part I: Clustering problem

Subproblems of clustering

1. Where are the clusters? (Algorithmic problem)

2. How many clusters? (Methodological problem: which criterion?)

3. Selection of attributes (Application related problem)

4. Preprocessing the data (Practical problems: normalization, outliers)

Definitions and data

Set of N data points: X = {x1, x2, …, xN}

Partition of the data: P = {p1, p2, …, pN}, where pi is the index of the cluster to which xi belongs

Set of M cluster prototypes (centroids): C = {c1, c2, …, cM}

Distance and cost function

Euclidean distance of data vectors:

$$d(x_i, x_j) = \sqrt{\sum_{k=1}^{K} (x_{ik} - x_{jk})^2}$$

Total square error:

$$TSE(C, P) = \sum_{i=1}^{N} \lVert x_i - c_{p_i} \rVert^2$$
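The two definitions on this slide, the Euclidean distance between data vectors and the total square error of a clustering, can be sketched in Python as follows. The function names and the toy data are hypothetical; the labels in `P` are 0-based here, whereas the slides index clusters from 1.

```python
import numpy as np

def euclidean_distance(xi, xj):
    """d(x_i, x_j): square root of the summed squared attribute differences."""
    xi = np.asarray(xi, dtype=float)
    xj = np.asarray(xj, dtype=float)
    return float(np.sqrt(np.sum((xi - xj) ** 2)))

def total_square_error(X, C, P):
    """TSE(C, P): sum of squared distances from each x_i to its centroid c_{p_i}."""
    X = np.asarray(X, dtype=float)
    C = np.asarray(C, dtype=float)
    P = np.asarray(P, dtype=int)
    diff = X - C[P]          # row i holds x_i - c_{p_i}
    return float(np.sum(diff ** 2))

# Toy data (made up for illustration): four points, two clusters.
X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [10.0, 4.0]])
C = np.array([[1.0, 0.0], [10.0, 2.0]])
P = np.array([0, 0, 1, 1])

print(euclidean_distance([0, 0], [3, 4]))
print(total_square_error(X, C, P))
```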

Clustering result as partition

Illustrated by Voronoi diagram

Illustrated by Convex hulls

Cluster prototypes and partition of data

Centroids as prototypes

Partition by nearest prototype mapping

Duality of partition and centroids

Dependency of data structures

Centroid condition: for a given partition (P), the optimal cluster centroids (C) for minimizing MSE are the average vectors of the clusters:

$$c_j = \frac{\sum_{p_i = j} x_i}{\sum_{p_i = j} 1}, \quad \forall j \in [1, M]$$

Optimal partition: for given centroids (C), the optimal partition is the one that maps each data vector to its nearest centroid:

$$p_i = \arg\min_{1 \le j \le M} d(x_i, c_j)^2, \quad \forall i \in [1, N]$$
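A small worked example may make the two conditions concrete: the centroid condition computes cluster averages from a given partition, and the optimal partition condition assigns each data vector to its nearest centroid. The sketch below uses made-up data, 0-based labels and hypothetical function names, and it assumes that no cluster is empty.

```python
import numpy as np

def optimal_centroids(X, P, M):
    """Centroid condition: c_j is the average of the vectors with p_i = j.
    Assumes every cluster has at least one member."""
    return np.array([X[P == j].mean(axis=0) for j in range(M)])

def optimal_partition(X, C):
    """Optimal partition: p_i is the index of the centroid nearest to x_i."""
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)  # shape (N, M)
    return d2.argmin(axis=1)

X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [10.0, 4.0]])
P = np.array([0, 0, 0, 1])            # deliberately poor starting partition

C = optimal_centroids(X, P, M=2)      # averages of the two groups
P_new = optimal_partition(X, C)       # third point moves to the second cluster
print(C)
print(P_new)
```

Applying the two steps in turn is what links the partition and the centroids: each one can be derived from the other.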

K-means algorithm

X = Data set
C = Cluster centroids
P = Partition

K-Means(X, C) → (C, P)

REPEAT
    Cprev ← C;
    FOR all i ∈ [1, N] DO
        pi ← FindNearest(xi, C);                  (optimal partition)
    FOR all j ∈ [1, M] DO
        cj ← Average of xi for which pi = j;      (optimal centroids)
UNTIL C = Cprev
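The pseudocode above can be sketched in Python with NumPy roughly as follows. This is an illustration rather than the course's reference implementation: the iteration cap `max_iter` and the rule of keeping the previous centroid when a cluster becomes empty are assumptions added here, and labels are 0-based.

```python
import numpy as np

def k_means(X, C, max_iter=100):
    """Alternate the optimal partition and optimal centroid steps until C stops changing."""
    X = np.asarray(X, dtype=float)
    C = np.array(C, dtype=float)           # copy, so the caller's array is untouched
    for _ in range(max_iter):              # safety cap (assumption, not in the pseudocode)
        C_prev = C.copy()
        # Optimal partition: nearest centroid for every data vector.
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        P = d2.argmin(axis=1)
        # Optimal centroids: average of the vectors assigned to each cluster.
        for j in range(len(C)):
            members = X[P == j]
            if len(members) > 0:           # empty cluster keeps its old centroid (assumption)
                C[j] = members.mean(axis=0)
        if np.array_equal(C, C_prev):      # UNTIL C = Cprev
            break
    return C, P

X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [10.0, 4.0]])
C, P = k_means(X, X[[0, 1]])               # first two points as the initial centroids
print(C)
print(P)
```

The result depends on the initial centroids, so in practice the algorithm is usually run from several starting points.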

Summary

How to solve?

Solve the clustering (algorithmic problem): Given input data (X) of N data vectors and the number of clusters (M), find the clusters. The result is given as a set of prototypes or as a partition.

Solve the number of clusters (mathematical problem): Define an appropriate cluster validity function f. Repeat the clustering algorithm for several values of M. Select the best result according to f.

Solve the problem efficiently (computer science problem).
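The procedure for solving the number of clusters (repeat the clustering for several M, then select the best result according to f) could look like the sketch below. The slides do not name a specific validity function; the Calinski-Harabasz index is used here purely as one possible choice of f. The deterministic farthest-point initialization, the function names and the synthetic data are likewise assumptions made for the example.

```python
import numpy as np

def k_means(X, C, max_iter=100):
    """Plain k-means; returns centroids and 0-based labels."""
    C = np.array(C, dtype=float)
    for _ in range(max_iter):
        C_prev = C.copy()
        P = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        for j in range(len(C)):
            if np.any(P == j):
                C[j] = X[P == j].mean(axis=0)
        if np.array_equal(C, C_prev):
            break
    return C, P

def maxmin_init(X, M):
    """Farthest-point initialization (assumption: chosen only to make the sketch deterministic)."""
    idx = [0]
    d2 = ((X - X[0]) ** 2).sum(axis=1)
    for _ in range(1, M):
        nxt = int(d2.argmax())
        idx.append(nxt)
        d2 = np.minimum(d2, ((X - X[nxt]) ** 2).sum(axis=1))
    return X[idx]

def calinski_harabasz(X, C, P):
    """One possible validity function f: between-cluster vs. within-cluster variance."""
    N, M = len(X), len(C)
    ssw = ((X - C[P]) ** 2).sum()
    sizes = np.bincount(P, minlength=M)
    ssb = (sizes[:, None] * (C - X.mean(axis=0)) ** 2).sum()
    return (ssb / (M - 1)) / (ssw / (N - M))

def select_number_of_clusters(X, candidates, f=calinski_harabasz):
    """Run the clustering for each candidate M and keep the M with the highest f."""
    scores = {}
    for M in candidates:
        C, P = k_means(X, maxmin_init(X, M))
        scores[M] = f(X, C, P)
    return max(scores, key=scores.get), scores

# Synthetic data: three well-separated groups of 20 points each.
rng = np.random.default_rng(0)
centers = [(0.0, 0.0), (10.0, 0.0), (5.0, 9.0)]
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(20, 2)) for c in centers])

best_M, scores = select_number_of_clusters(X, candidates=range(2, 7))
print(best_M)
```

Other validity functions can be passed as `f` as long as a larger value means a better clustering; an index that should be minimized would need the selection rule reversed.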

Challenges in clustering

Incorrect cluster allocation

Incorrect number of clusters

Figure labels: cluster missing, clusters missing, too many clusters

Taxonomy of clustering [Jain, Murty, Flynn, Data clustering: A review, ACM Computing Surveys, 1999.]

• One possible classification is based on the cost function.

• MSE is well defined and the most popular.

Clustering method

Clustering method = defines the problem

Clustering algorithm = solves the problem

Problem defined as a cost function:

- Goodness of one cluster
- Similarity vs. distance
- Global vs. local (“merge cost”, “cut”)

Solution: algorithm to solve the problem

Complexity of clustering

• Clustering problem is NP-complete [Garey et al., 1982]

• Optimal solution by branch-and-bound in exponential time.

• Practical solutions by heuristic algorithms.

• Number of possible clusterings:

$$\left\{ {N \atop M} \right\} = \frac{1}{M!} \sum_{j=1}^{M} (-1)^{M-j} \binom{M}{j} j^N$$
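The number of possible clusterings of N data vectors into M non-empty clusters is the Stirling number of the second kind, (1/M!) Σ_{j=1..M} (-1)^(M-j) · C(M, j) · j^N. A short sketch evaluating this sum shows how quickly the count grows; the function name is hypothetical.

```python
from math import comb, factorial

def number_of_clusterings(N, M):
    """Stirling number of the second kind: partitions of N items into M non-empty clusters."""
    total = sum((-1) ** (M - j) * comb(M, j) * j ** N for j in range(1, M + 1))
    return total // factorial(M)   # the sum is always divisible by M!

print(number_of_clusterings(4, 2))    # 7
print(number_of_clusterings(10, 3))   # 9330
print(number_of_clusterings(100, 5))  # far too many to enumerate
```

Even for small inputs exhaustive search is out of the question, which is why heuristic algorithms are used in practice.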

Software

Animator: http://cs.uef.fi/sipu/clustering/animator/

Clusterator: http://cs.uef.fi/paikka/Radu/clusterator/

Cluster software

• Main area: working space for data

• Input area: inputs to be processed

• Output area: obtained results

• Menu Process: selection of operation

http://cs.uef.fi/sipu/soft/cluster2009.exe

ftp://cs.joensuu.fi/franti/softat/vq/cluster.zip

Clustering

Figure labels: image, data set, codebook, partition

Procedure to simulate k-means

Open data set (file *.ts), move it into Input area

Process – Random codebook, select number of clusters

REPEAT

    Move obtained codebook from Output area into Input area

    Process – Optimal partition, select Error function

    Move codebook into Main area, partition into Input area

    Process – Optimal codebook

UNTIL DESIRED CLUSTERING

Conclusions

Clustering is a fundamental tool needed everywhere in computer science and beyond.

Failing to do clustering properly may invalidate the application analysis.

A good clustering tool is needed so that researchers can focus on application requirements.

Literature

1. S. Theodoridis and K. Koutroumbas, Pattern Recognition, Academic Press, 3rd edition, 2006.

2. C. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.

3. A.K. Jain, M.N. Murty and P.J. Flynn, Data clustering: A review, ACM Computing Surveys, 31(3): 264-323, September 1999.

4. M.R. Garey, D.S. Johnson and H.S. Witsenhausen, The complexity of the generalized Lloyd-Max problem, IEEE Transactions on Information Theory, 28(2): 255-256, March 1982.

5. F. Aurenhammer, Voronoi diagrams: a survey of a fundamental geometric data structure, ACM Computing Surveys, 23(3): 345-405, September 1991.
