Clustering
YZM 3226 – Machine Learning
Outline
◘ What is Cluster Analysis?
◘ Similarity Measure
◘ Clustering Applications
◘ Clustering Methods
◘ Clustering Algorithm Selection
◘ Cluster Validation
What is Cluster Analysis?
◘ Clustering is the process of grouping large data sets according to
their similarity.
◘ A cluster is a collection of data objects that are similar to one
another within the same cluster and dissimilar to the objects in other clusters.
[Figure: the same set of houses clustered two ways, size based and geographic distance based; each point represents a house]
Clustering Definition
◘ Grouping similar data objects into clusters.
◘ Clustering
– Given a set of data points,
– where each data point has a set of attributes,
– and a similarity measure among them,
– find clusters.
Clustering Applications
◘ Spatial Data Analysis
– Detect spatial clusters
– e.g. Create thematic maps in GIS
◘ Image Processing
◘ WWW
– Document clustering
• To find groups of documents that are similar to each other based on the
important terms appearing in them.
– Cluster Weblog data
• To discover groups of similar access patterns
◘ Marketing
◘ Medical Applications
◘ Science Applications
Example Clustering Applications
◘ Marketing:
– Customer Segmentation: Find clusters of similar customers
– Market Segmentation: Subdivide a market into subsets
◘ Medical: Clustering of diseases
◘ Insurance: Identifying groups of insurance policy holders
◘ City planning: Identifying groups of houses according to their house
type, value, and geographical location
◘ Earthquake studies: Observed earthquake epicenters should be
clustered along continental faults
Data Types
◘ Categorical: Nominal, Ordinal
◘ Numeric: Continuous, Discrete
◘ Example values for each type
– Continuous: 12, 45, 34, 9, 48
– Discrete: 0-18, 18-40, 40-100
– Binary: Smoker, Non-Smoker
– Categorical (Nominal): Mountain bicycle, Utility bicycle, Racing bicycle
– Categorical (Ordinal): Very Unhappy, Unhappy, Neutral, Happy, Very Happy
Similarity Measures
◘ Numeric Data
– If attributes are continuous:
• Manhattan Distance (p=1)
• Euclidean Distance (p=2)
• Minkowski Distance
◘ Categorical Data
– Jaccard's distance
– ...
◘ Others
– Problem-specific measures
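Jaccard's distance: $d(i,j) = \frac{b + c}{a + b + c}$ (a, b, and c are the counts from the contingency table of objects i and j, given under Similarity Measures for Categorical Data)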
1- Similarity Measures for Numeric Data
[Figure: points p1, p2, p3, and p4 plotted in the x-y plane]
point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1
Euclidean Distance Matrix
      p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0
Manhattan Distance Matrix
      p1  p2  p3  p4
p1    0   4   4   6
p2    4   0   2   4
p3    4   2   0   2
p4    6   4   2   0
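The two matrices can be reproduced with a few lines of Python. This is only a sketch: the helper names `minkowski` and `distance_matrix` are made up for illustration, and the Minkowski formula used is the usual one, the p-th root of the sum of |u_k - v_k|^p, which gives Manhattan distance for p=1 and Euclidean distance for p=2.

```python
def minkowski(u, v, p):
    """Minkowski distance of order p between two points of equal length."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)

# The four points from the example.
points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

def distance_matrix(points, p):
    """Return a dict mapping each (name_i, name_j) pair to its distance."""
    names = list(points)
    return {(i, j): minkowski(points[i], points[j], p)
            for i in names for j in names}

manhattan = distance_matrix(points, p=1)   # p = 1
euclidean = distance_matrix(points, p=2)   # p = 2

print(round(euclidean[("p1", "p2")], 3))   # 2.828
print(manhattan[("p1", "p4")])             # 6.0
```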
Example for Clustering Numeric Data
◘ Document Clustering
– Each document becomes a 'term' vector,
• each term is a component (attribute) of the vector,
• the value of each component is the number of times the corresponding term occurs in
the document.
             team  coach  play  ball  score  game  win  lost  timeout  season
Document 1   3     0      5     0     2      6     0    2     0        2
Document 2   0     7      0     2     1      0     0    3     0        0
Document 3   0     1      0     0     1      2     2    0     3        0
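A minimal sketch of how such a term vector could be built, assuming whitespace tokenization and a fixed vocabulary. The function name `term_vector` and the sample sentence are invented for illustration and do not come from the slides.

```python
from collections import Counter

# Vocabulary taken from the example; the document text below is made up.
vocabulary = ["team", "coach", "play", "ball", "score",
              "game", "win", "lost", "timeout", "season"]

def term_vector(text, vocabulary):
    """Count how many times each vocabulary term occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

doc = "team play game play team lost game score"
vector = term_vector(doc, vocabulary)
print(vector)  # [2, 0, 2, 0, 1, 2, 0, 1, 0, 0]
```

Once every document is a numeric vector of the same length, the numeric distance measures above can be applied to pairs of documents.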
2- Similarity Measures for Categorical Data
◘ Categorical Data:
◘ e.g. Binary Variables - 0/1 - presence/absence
– Jaccard's coefficient (measure similarity)
– Jaccard's distance (measure dissimilarity)
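Contingency table for two binary objects i and j:
                  Object j
                  1      0      sum
Object i   1      a      b      a+b
           0      c      d      c+d
           sum    a+c    b+d    p
$sim_{Jaccard}(i,j) = \frac{a}{a + b + c}$
$d(i,j) = \frac{b + c}{a + b + c}$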
Example for Clustering Categorical Data
◘ Find the Jaccard's distance between Apple and Banana.
Feature of Fruit    Sphere shape   Sweet   Sour   Crunchy
Object i = Apple    Yes            Yes     Yes    Yes
Object j = Banana   No             Yes     No     No
With Yes = 1 and No = 0: a = 1, b = 3, c = 0, d = 0
$d(i,j) = \frac{b + c}{a + b + c} = \frac{3 + 0}{1 + 3 + 0} = \frac{3}{4} = 0.75$
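The Apple and Banana computation can be checked with a short sketch. The function names are hypothetical. Returning 0.0 when a + b + c = 0 (both objects all zeros) is an assumption added here to avoid a division by zero; the slides do not cover that case.

```python
def contingency(obj_i, obj_j):
    """Return the counts (a, b, c, d) for two binary vectors."""
    pairs = list(zip(obj_i, obj_j))
    a = sum(1 for x, y in pairs if x == 1 and y == 1)  # present in both
    b = sum(1 for x, y in pairs if x == 1 and y == 0)  # only in object i
    c = sum(1 for x, y in pairs if x == 0 and y == 1)  # only in object j
    d = sum(1 for x, y in pairs if x == 0 and y == 0)  # absent in both
    return a, b, c, d

def jaccard_distance(obj_i, obj_j):
    """d(i, j) = (b + c) / (a + b + c); d is ignored."""
    a, b, c, _ = contingency(obj_i, obj_j)
    if a + b + c == 0:
        return 0.0  # assumption: two all-zero objects are treated as identical
    return (b + c) / (a + b + c)

# Features: sphere shape, sweet, sour, crunchy (Yes = 1, No = 0)
apple = [1, 1, 1, 1]
banana = [0, 1, 0, 0]

print(contingency(apple, banana))       # (1, 3, 0, 0)
print(jaccard_distance(apple, banana))  # 0.75
```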
Example for Clustering Categorical Data
◘ Who are the most likely to have a similar disease?
Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack   M        Y       N       P        N        N        N
Mary   F        Y       N       P        N        P        N
Jim    M        Y       P       N        N        N        N
Let the values Y and P be set to 1, and the value N be set to 0.
The distances below count only Fever, Cough, and Test-1 to Test-4; Gender is left out of the computation.
$d(jack, mary) = \frac{0 + 1}{2 + 0 + 1} = 0.33$
$d(jack, jim) = \frac{1 + 1}{1 + 1 + 1} = 0.67$
$d(jim, mary) = \frac{1 + 2}{1 + 1 + 2} = 0.75$
Result: Jim and Mary are unlikely to have a similar disease.
Jack and Mary are the most likely to have a similar disease.
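The pairwise comparison of the three patients can be sketched as follows. The encoding (Y and P as 1, N as 0, Gender left out) follows the example; the variable and function names are illustrative only.

```python
from itertools import combinations

# Columns: Fever, Cough, Test-1, Test-2, Test-3, Test-4
patients = {
    "Jack": [1, 0, 1, 0, 0, 0],
    "Mary": [1, 0, 1, 0, 1, 0],
    "Jim":  [1, 1, 0, 0, 0, 0],
}

def jaccard_distance(u, v):
    """d = (b + c) / (a + b + c) for two binary vectors."""
    mismatches = sum(1 for x, y in zip(u, v) if x != y)               # b + c
    both_present = sum(1 for x, y in zip(u, v) if x == 1 and y == 1)  # a
    return mismatches / (both_present + mismatches)

# Distance for every unordered pair of patients.
distances = {
    pair: jaccard_distance(patients[pair[0]], patients[pair[1]])
    for pair in combinations(patients, 2)
}

closest = min(distances, key=distances.get)
print(closest)  # ('Jack', 'Mary')
```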
Clustering Methods
◘ Partitioning Methods
– K-Means, K-Medoids, PAM, CLARA, CLARANS, ...
◘ Hierarchical Methods
– AGNES, DIANA, BIRCH, CURE, CHAMELEON, ...
◘ Density-Based Methods
– DBSCAN, OPTICS, DENCLUE, ...
◘ Grid-Based Methods
– STING, WaveCluster, CLIQUE ...
◘ Model-Based Methods
– COBWEB, CLASSIT, SOM (Self-Organizing Feature Maps) ...
Clustering Algorithm Selection
1. Scalability
– Efficient execution on large databases
– Scanning the database only a few times
2. Support for different data types
– Continuous, discrete, binary, nominal, ordinal, …
3. Updateability
– Updating clusters after insertion and deletion of some data values
4. Efficient memory usage
5. Input parameters
– Different parameter values lead to different outputs
– Parameters should be few and easy to understand
6. No need for prior knowledge
7. Ability to find different cluster shapes
8. Robustness to dirty data
– Works on missing, erroneous, and noisy data
9. Insensitivity to data ordering
10. High dimensionality
– Works on multi-dimensional datasets
11. Usability in different application areas
Cluster Characteristics
◘ Each cluster is represented by the following characteristics
Categories of Clustering Algorithms
◘ Partitioning Methods: K-Means, K-Medoid, PAM, CLARA, CLARANS
◘ Hierarchical Methods: AGNES, DIANA, BIRCH, CURE, CHAMELEON
◘ Density-Based Methods: DBSCAN, OPTICS, DENCLUE
◘ Grid-Based Methods: STING, WaveCluster, CLIQUE
◘ Model-Based Methods: COBWEB, CLASSIT, SOM
Partitioning Methods
◘ Construct a partition of a database D of n objects into a set of k
clusters
◘ Given a k, find a partition of k clusters that optimizes the chosen
partitioning criterion, e.g. minimizes the SSE (Sum of Squared Error)
K-Means
◘ K-Means is an algorithm to cluster n objects based on attributes into k
partitions, k < n.
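The slides do not spell out the iteration itself, so the following is a sketch of the commonly used procedure (assign each point to its nearest centroid, recompute each centroid as the mean of its members, repeat until the assignment stops changing). Picking the initial centroids at random from the data, the `seed` and `max_iter` parameters, and the sample points are all assumptions made for this illustration.

```python
import random

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: random data points as start
    assignment = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # nothing changed, so the partition is stable
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean(members)
    return centroids, assignment

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, assignment = k_means(points, k=2)
print(assignment)  # the first three points share one label, the last three the other
```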
K-Means Example
K-Means Demo
◘ http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html
K-Means Advantages and Disadvantages
◘ Strengths
– Relatively efficient: O(tkn), where n is the number of objects, k is the number of clusters, and t is the number of iterations.
– Easy to understand
◘ Weaknesses
– Applicable only when the mean is defined, so it cannot be applied directly to categorical data
– Need to specify k, the number of clusters, in advance
– Unable to handle noisy data and outliers
– Not suitable for discovering clusters with non-convex shapes
[Figure: K-Means result compared with the result of another clustering algorithm]
K-Medoids Method
◘ K-Medoids: Instead of taking the mean value of the objects in a cluster as a
reference point, a medoid can be used, which is the most centrally located
object in the cluster.
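To make the difference between a mean and a medoid concrete, here is a small sketch. It assumes that "most centrally located" means the member with the smallest total Euclidean distance to the other members, which is a common reading but is not stated in the slides. The sample cluster is made up.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def medoid(cluster):
    """Member of the cluster with the smallest total distance to all members."""
    return min(cluster, key=lambda p: sum(euclidean(p, q) for q in cluster))

def mean(cluster):
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

# Four nearby points plus one outlier at (10, 10).
cluster = [(1, 1), (2, 1), (1, 2), (2, 2), (10, 10)]

print(mean(cluster))    # (3.2, 3.2), pulled toward the outlier, not a data point
print(medoid(cluster))  # (2, 2), an actual member of the cluster
```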
[Figure: K-Means compared with K-Medoids]
Hierarchical Clustering
◘ Create a hierarchical decomposition of the set of data using some criterion
◘ Strength: This method does not require the number of clusters k as an input.
◘ Weakness: It needs a termination condition.
[Figure: agglomerative (AGNES) clustering of the objects a, b, c, d, e runs from Step 0 to Step 4, merging a b, then d e, then c d e, then a b c d e; divisive (DIANA) clustering proceeds in the opposite direction, splitting a b c d e back into single objects]
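A bottom-up (agglomerative) pass can be sketched in a few lines. This sketch assumes Euclidean distance and the minimum pairwise distance between clusters as the merge criterion, and it uses a target number of clusters, `stop_at`, as the termination condition. These choices and the sample points are illustrative and are not prescribed by the slides.

```python
import math
from itertools import combinations

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def min_distance(c1, c2):
    """Smallest distance between a member of c1 and a member of c2."""
    return min(euclidean(p, q) for p in c1 for q in c2)

def agglomerative(points, stop_at):
    """Merge the two closest clusters until only stop_at clusters remain."""
    clusters = [[p] for p in points]  # Step 0: every point is its own cluster
    while len(clusters) > stop_at:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: min_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]  # merge cluster j into cluster i
        del clusters[j]                          # j > i, so index i stays valid
    return clusters

points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
result = agglomerative(points, stop_at=3)
print(result)  # [[(0, 0), (0, 1)], [(5, 5), (5, 6)], [(10, 0)]]
```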
How Are the Clusters Merged?
◘ Given: a matrix of similarities between every pair of points
◘ Start with each point in a separate cluster and merge clusters based on some
criterion:
– Single link: The minimum of the distances between the members of the clusters.
– Complete link: The maximum of the distances between the members of the clusters.
– Average link: The average of the distances between the members of the clusters.
[Figure: min distance (single link), average distance (average link), and max distance (complete link) between two clusters]
◘ Single Link: the distance between the closest members of the clusters
◘ Complete Link: the distance between the farthest members of the clusters
◘ Average Link: the average distance between the members of one cluster and
the members of the other cluster
◘ Centroid Link: the distance between the centroids of the two clusters
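The four linkage criteria can be written as small functions over two clusters of points. This is a sketch assuming Euclidean distance between points; the function names and the two sample clusters are made up for illustration.

```python
import math
from itertools import product

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def centroid(cluster):
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

def single_link(c1, c2):
    """Distance between the closest pair of members."""
    return min(euclidean(p, q) for p, q in product(c1, c2))

def complete_link(c1, c2):
    """Distance between the farthest pair of members."""
    return max(euclidean(p, q) for p, q in product(c1, c2))

def average_link(c1, c2):
    """Average distance over all cross-cluster pairs."""
    return sum(euclidean(p, q) for p, q in product(c1, c2)) / (len(c1) * len(c2))

def centroid_link(c1, c2):
    """Distance between the two cluster centroids."""
    return euclidean(centroid(c1), centroid(c2))

c1 = [(0, 0), (1, 0)]
c2 = [(3, 0), (5, 0)]
print(single_link(c1, c2))    # 2.0
print(complete_link(c1, c2))  # 5.0
print(average_link(c1, c2))   # 3.5
print(centroid_link(c1, c2))  # 3.5
```

Average link and centroid link coincide here only because all points lie on one line; in general they differ.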
[Figure: clusterings of six points and their dendrograms under Single Link, Complete Link, and Average Link]
Hierarchical Clustering Demo
◘ http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletH.html
Density-Based Clustering
◘ Dense objects should be grouped together into one cluster.
◘ These methods use a fixed threshold value to determine dense regions.
◘ Density-based clustering algorithms
– DBSCAN (1996)
– OPTICS (1999)
– DENCLUE (1998)
◘ Check the number of points within a specified radius of the point
[Figure: core, border, and outlier points for Eps = 1 cm and MinPts = 5]
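The core, border, and outlier labels can be sketched as below. The rules used are the usual DBSCAN-style ones and are an assumption here, since the slides only show the figure: a point is a core point if at least MinPts points (counting itself) lie within Eps of it, a border point if it is not core but lies within Eps of a core point, and an outlier otherwise. The sample points are invented.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def classify_points(points, eps, min_pts):
    """Label every point as 'core', 'border', or 'outlier'."""
    def neighbors(p):
        # Assumption: the neighborhood of p includes p itself.
        return [q for q in points if euclidean(p, q) <= eps]

    core = {p for p in points if len(neighbors(p)) >= min_pts}
    labels = {}
    for p in points:
        if p in core:
            labels[p] = "core"
        elif any(q in core for q in neighbors(p)):
            labels[p] = "border"
        else:
            labels[p] = "outlier"
    return labels

points = [(0, 0), (0.5, 0), (0, 0.5), (-0.5, 0), (0, -0.5), (1.2, 0), (5, 5)]
labels = classify_points(points, eps=1.0, min_pts=5)
print(labels[(0, 0)])    # core
print(labels[(1.2, 0)])  # border
print(labels[(5, 5)])    # outlier
```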
Grid Based Clustering Methods
◘ The simplest approach is to divide the region into a number of rectangular
cells of equal volume and to define the density of a cell as the number of points
it contains
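A minimal sketch of this cell-counting idea for 2-D points, assuming square cells of side `cell_size` aligned to the origin. The density threshold of 3 used to pick dense cells is an arbitrary value chosen for the illustration, as are the sample points.

```python
import math
from collections import Counter

def grid_density(points, cell_size):
    """Map each grid cell (column, row) to the number of points inside it."""
    counts = Counter()
    for x, y in points:
        cell = (math.floor(x / cell_size), math.floor(y / cell_size))
        counts[cell] += 1
    return counts

points = [(0.2, 0.3), (0.8, 0.9), (0.1, 0.7), (1.5, 0.2), (3.9, 3.1)]
density = grid_density(points, cell_size=1.0)

# Assumed threshold: a cell is dense if it holds at least 3 points.
dense_cells = [cell for cell, n in density.items() if n >= 3]
print(dense_cells)  # [(0, 0)]
```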
Model Based Methods
◘ Attempt to optimize the fit between the given data and some mathematical model
◘ They use statistical functions
Clustering Algorithms
General Overview
Factors Affecting Clustering Results
◘ Outliers
◘ Inappropriate value for parameters
◘ Drawbacks of the clustering algorithms themselves
[Figure: an input dataset with a good clustering (parameter k=6) and a bad clustering (parameter k=20)]
Cluster Validation - SSE
◘ Clustering
• Data points in one cluster are more similar to one another.
• Data points in separate clusters are less similar to one another.
◘ A good clustering method will produce high-quality clusters with
• high intra-class similarity
• low inter-class similarity
[Figure: clustering in 3-D space; intracluster distances are minimized and intercluster distances are maximized]
Sum of Squared Error (SSE)
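The formula itself is not preserved on this slide, so the sketch below uses the standard definition as an assumption: for every cluster, sum the squared Euclidean distances from each member to the cluster centroid, then add up the results over all clusters. A lower SSE means tighter clusters. The two example partitions are made up.

```python
def sse(clusters):
    """Sum of squared distances from each point to its cluster centroid."""
    total = 0.0
    for members in clusters:
        n = len(members)
        centroid = tuple(sum(coords) / n for coords in zip(*members))
        for p in members:
            total += sum((a - b) ** 2 for a, b in zip(p, centroid))
    return total

# The same four points partitioned in two different ways.
tight = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
loose = [[(0, 0), (10, 0)], [(0, 2), (10, 2)]]

print(sse(tight))  # 4.0
print(sse(loose))  # 100.0
```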
Cluster Validation - Correlation
◘ The correlation value lies between -1.0 and +1.0