Click to add title - WordPress.com · · 11/7/2016 What is Cluster Analysis? Clustering is the process of grouping large data sets according to their similarity. A cluster is a

Clustering

YZM 3226 – Makine Öğrenmesi

Outline

◘ What is Cluster Analysis?

◘ Similarity Measure

◘ Clustering Applications

◘ Clustering Methods

◘ Clustering Algorithm Selection

◘ Cluster Validation

http://www.newcastle-schools.org.uk/nsn/chemistry/Radioactivity/Contents Page.htm

http://www.newcastle-schools.org.uk/nsn/chemistry/Radioactivity/Contents Page.htm

What is Cluster Analysis?

◘ Clustering is the process of grouping large data sets according to

their similarity.

◘ A cluster is a collection of data objects that are similar to one

another within the same cluster.

Size Based Geographic Distance Based

Each point represents a house

Clustering Definition

◘ Grouping similar data objects into clusters.

◘ Clustering

– Given a set of data points

– Data points have a set of attributes find clusters

– A similarity measure

Clustering Applications

◘ Spatial Data Analysis

– Detect spatial clusters

– e.g. Create thematic maps in GIS

◘ Image Processing

◘ WWW

– Document clustering

• To find groups of documents that are similar to each other based on the

important terms appearing in them.

– Cluster Weblog data

• To discover groups of similar access patterns

◘ Marketting

◘ Medical Applications

◘ Science Applications

Example Clustering Applications

◘ Marketing: – Customer Segmentation: Find clusters of similar customers

– Market Segmentation: Subdivide a market into subsets

◘ Medical: Clustering of disease

◘ Insurance: Identifying groups of insurance policy holders

◘ City-planning: Identifying groups of houses according to their house

type, value, and geographical location

◘ Earth-quake studies: Observed earth quake epicenters should be

clustered along continent faults

Income

Data Types

Categorical

Nominal Ordinal Continous Discrete

Numeric

Continous Discrete Binary Categorical

(Nominal)

Categorical

(Ordinal)

12 0-18 Smoker Mountain bicycle Very Unhappy

45 18-40 Non-Smoker Utility bicycle Unhappy

34 40-100 Racing bicycle Neutral

9 Happy

48 Very Happy

Similarity Measures

Similarity Measures

◘ Numeric Data

– If attributes are continuous:

• Manhattan Distance (p=1)

• Euclidean Distance (p=2)

• Minkowski Distance

◘ Categorical Data

– Jaccard's distance

– ...

◘ Others

– Problem-specific measures

cbacb jid

),(

p1 p2 p3 p4

p1 0 2.828 3.162 5.099

p2 2.828 0 1.414 3.162

p3 3.162 1.414 0 2

p4 5.099 3.162 2 0

1- Similarity Measures for Numeric Data

0

1

2

3

0 1 2 3 4 5 6

p1

p2

p3 p4

point x y

p1 0 2

p2 2 0

p3 3 1

p4 5 1

Euclidean Distance Matrix

p1 p2 p3 p4

p1 0 4 4 6

p2 4 0 2 4

p3 4 2 0 2

p4 6 4 2 0

Manhattan Distance Matrix

p1 p2 p3 p4

p1 0 2.828 3.162 5.099

p2 2.828 0 1.414 3.162

p3 3.162 1.414 0 2

p4 5.099 3.162 2 0

Similarity Measures for Numeric Data

0

1

2

3

0 1 2 3 4 5 6

p1

p2

p3 p4

point x y

p1 0 2

p2 2 0

p3 3 1

p4 5 1

Euclidean Distance Matrix

p1 p2 p3 p4

p1 0 4 4 6

p2 4 0 2 4

p3 4 2 0 2

p4 6 4 2 0

Manhattan Distance Matrix

Example for Clustering Numeric Data

◘ Document Clustering

– Each document becomes a `term' vector,

• each term is a component (attribute) of the vector,

• the value of each component is the number of times the corresponding term occurs in

the document.

Document 1

se

aso

n

time

ou

t

lost

wi

n

ga

me

sco

re

ba

ll

play

co

ach

tea

m

Document 2

Document 3

3 0 5 0 2 6 0 2 0 2

0

0

7 0 2 1 0 0 3 0 0

1 0 0 1 2 2 0 3 0

Doc Doc Doc Doc Doc Doc Doc Doc

2- Similarity Measures for Categorical Data

◘ Categorical Data:

◘ e.g. Binary Variables - 0/1 - presence/absence

– Jaccard's coefficient (measure similarity)

– Jaccard's distance (measure dissimilarity)

cbaa jisim

Jaccard ),(

pdbcasum

dcdc

baba

sum

0

1

01

Object i

Object j

cbacb jid

),(

Example for Clustering Categorical Data

◘ Find the Jaccard's distance between Apple and Banana.

Feature of Fruit Sphere shape Sweet Sour Crunchy

Object i =Apple Yes Yes Yes Yes

Object j =Banana No Yes No No

pdbcasum

dcdc

baba

sum

0

1

01

Object i

Object j

cbacb jid

),((a = 1, b = 3, c = 0, d= 0)

(3+0) / (1 + 3 + 0) = 3/4 = 0.75

Example for Clustering Categorical Data

◘ Who are the most likely to have a similar disease?

Let the values Y and P be set to 1, and the value N be set to 0

Result: Jim and Mary are unlikely to have a similar disease.

Jack and Mary are the most likely to have a similar disease.

Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4

Jack M Y N P N N N

Mary F Y N P N P N

Jim M Y P N N N N

75.0211

21),(

67.0111

11),(

33.0102

10),(

maryjimd

jimjackd

maryjackd

pdbcasum

dcdc

baba

sum

0

1

01

Object i

Object j

cbacb jid

),(

Clustering Methods

◘ Partitioning Methods

– K-Means, K-Medoids, PAM, CLARA, CLARANS, ...

◘ Hierarchical Methods

– AGNES, DIANA, BIRCH, CURE, CHAMELEON, ...

◘ Density-Based Methods

– DBSCAN, OPTICS, DENCLUE, ...

◘ Grid-Based Methods

– STING, WaveCluster, CLIQUE ...

◘ Model-Based Methods

– COBWEB, CLASSIT, SOM (Self-Organizing Feature Maps) ...

Clustering Algorithm Selection

1. Scalability

– Efficiently execution on large databases

– Scanning database only several times

2. Running on different data types

– Continuous, discrete, binary, nominal, ordinal, …

3. Updateability

– Updating clusters after insertion and deletion of some data values

4. Efficient memory usage

5. Input parameters

– Results in different outputs on different inputs

– Un-understandable and too many input parameters

6. Without any foreknowledge

7. Different cluster shapes

8. Workable on dirty data

– Workable on missing, wrong and noise data

9. Insensitivity on data ordering

10. Multi dimension

– Workability on multi-dimensional datasets

11. Usability for different areas

Cluster Characteristics

◘ Each cluster is represented by the following characteristics

Categories of Clustering Algorithms

Partitioning

Methods

Hierarchical

Methods

K-Means

K-Medoid

PAM

CLARA

CLARANS

AGNES

DIANA

BIRCH

CURE

CHAMELEON

OPTICS

DENCLUE

STING

WaveCluster

CLIQUE

Model

Based

Methods

Density

Based

Methods

Grid

Based

Methods

1

2

3

4

5

6

1

23 4

5

COBWEB

CLASSIT

SOM

DBSCAN

Partitioning Methods

◘ Construct a partition of a database D of n objects into a set of k

clusters

◘ Given a k, find a partition of k clusters that optimizes the chosen

partitioning criterion e.g. minimize SSE (Sum of Squared Distance)

K-Means

◘ K-Means is an algorithm to cluster n objects based on attributes into k

partitions, k < n.

K-Means Example

K-Means Example

◘ http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html

K-Means Demo

http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html

K-Means Adv. DisAdv.

◘ Strength:

– Relatively efficient: O(tkn) n is # objects, k is # clusters, and t is # iterations.

– Easy to understand

◘ Weakness

– Applicable only when mean is defined, then what about categorical data?

– Need to specify k, the number of clusters, in advance

– Unable to handle noisy data and outliers

– Not suitable to discover clusters with non-convex shapes

K-Means result Other clustering

algorithm result

K-Medoids Method

◘ K-Medoids: Instead of taking the mean value of the object in a cluster as a

reference point, medoids can be used, which is the most centrally located

object in a cluster.

K-Medoids Method

K-Means

K-Medoids


Partitioning

Methods

Hierarchical

Methods

K-Means

K-Medoid

PAM

CLARA

CLARANS

OPTICS

DENCLUE

STING

WaveCluster

CLIQUE

Model

Based

Methods

Density

Based

Methods

Grid

Based

Methods

1

2

3

4

5

6

1

23 4

5

COBWEB

CLASSIT

SOM

DBSCAN

AGNES

DIANA

BIRCH

CURE

CHAMELEON

Hierarchical Clustering

◘ Create a hierarchical decomposition of the set of data using some criterion

◘ Strength: This method does not require the number of clusters k as an input.

◘ Weakness: But it needs a termination condition.

Step 0 Step 1 Step 2 Step 3 Step 4

b

d

c

e

a a b

d e

c d e

a b c d e

Step 4 Step 3 Step 2 Step 1 Step 0

agglomerative

(AGNES)

divisive

(DIANA)

Hierarchical Clustering

How the Clusters are Merged?

◘ Given: Matrix of similarity between every point pair

◘ Start with each point in a separate cluster and merge clusters based on some

criteria:

– Single link: The minimum of the distances between the members of the clusters.

– Complete link: The maximum of the distances between the members of the clusters.

– Average link: The average of the distances between the members of the clusters.

Min

distance

Average

distance

Max

distance


Single Link: Kümelerin en yakın

elemanları arasındaki uzaklık

Complete Link: Kümelerin en uzak

elemanları arasındaki uzaklık

Average Link: Bir kümedeki elemanlar ve diğer

kümedeki başka elemanlar arasındaki ortalama

uzaklık

Centroid Link: İki kümenin

centroidleri arasındaki uzaklık


Single Link

1

2

3

4

5

6

1

2

3

4

5

3 6 2 5 4 10

0.05

0.1

0.15

0.2 1

2

3

4

5

6

1

2 5

3

4

3 6 4 1 2 50

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

Complete Link

1

2

3

4

5

6 1

2

5

3

4

Average Link

3 6 4 1 2 50

0.05

0.1

0.15

0.2

0.25

Hierarchical Clustering Demo

◘ http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletH.html

http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletH.html


Partitioning

Methods

Hierarchical

Methods

K-Means

K-Medoid

PAM

CLARA

CLARANS

OPTICS

DENCLUE

STING

WaveCluster

CLIQUE

Model

Based

Methods

Density

Based

Methods

Grid

Based

Methods

1

2

3

4

5

6

1

23 4

5

COBWEB

CLASSIT

SOM

DBSCAN

AGNES

DIANA

BIRCH

CURE

CHAMELEON

Density-Based Clustering

◘ Dense objects should be grouped together into one cluster.

◘ They use a fixed threshold value to determine dense regions.

◘ Density-based clustering algorithms

– DBSCAN (1996)

– OPTICS (1999)

– DENCLUE (1998)

Check the number of points within a specified radius of the point

Core

Border

Outlier

Eps = 1cm

MinPts = 5


Partitioning

Methods

Hierarchical

Methods

K-Means

K-Medoid

PAM

CLARA

CLARANS

OPTICS

DENCLUE

STING

WaveCluster

CLIQUE

Model

Based

Methods

Density

Based

Methods

Grid

Based

Methods

1

2

3

4

5

6

1

23 4

5

COBWEB

CLASSIT

SOM

DBSCAN

AGNES

DIANA

BIRCH

CURE

CHAMELEON

Grid Based Clustering Methods

◘ Simplest approach is to divide region into a number of rectangular

cells of equal volume and define density as # of points the cell

contains


Partitioning

Methods

Hierarchical

Methods

K-Means

K-Medoid

PAM

CLARA

CLARANS

OPTICS

DENCLUE

STING

WaveCluster

CLIQUE

Model

Based

Methods

Density

Based

Methods

Grid

Based

Methods

1

2

3

4

5

6

1

23 4

5

COBWEB

CLASSIT

SOM

DBSCAN

AGNES

DIANA

BIRCH

CURE

CHAMELEON

Model Based Methods

Attempt to optimize the fit between the given data and some mathematical model

It uses statistical functions

Clustering Algorithms

General Overview

Factors Affecting Clustering Results

◘ Outliers

◘ Inappropriate value for parameters

◘ Drawbacks of the clustering algorithm themselves

INPUT DATASET

GOOD CLUSTERING BAD CLUSTERING

Parameter (k=6) Parameter (k=20)

Cluster Validation - SSE

◘ Clustering • Data points in one cluster are more similar to one another.

• Data points in separate clusters are less similar to one another.

◘ A good clustering method will produce high quality clusters with • high intra-class similarity

• low inter-class similarity

Intracluster distances

are minimized

Intercluster distances

are maximized

Clustering in 3-D space.

Sum of Squared Error (SSE)

Cluster Validation - Correlation

Between -1.0 and +1.0

Documents

Click to add title - WordPress.com · · 11/7/2016 What is Cluster Analysis? Clustering is the process of grouping large data sets according to their similarity. A cluster is a