View
112
Download
4
Tags:
Embed Size (px)
DESCRIPTION
Citation preview
DBM630: Data Mining and
Data Warehousing
MS.IT. Rangsit University
1
Semester 2/2011
Lecture 9
Clustering
by Kritsada Sriphaew (sriphaew.k AT gmail.com)
Topics
2
What is Cluster Analysis?
Types of Attributes in Cluster Analysis
Major Clustering Approaches
Partitioning Algorithms
Hierarchical Algorithms
Data Warehousing and Data Mining by Kritsada Sriphaew
Clustering Analysis 3
Classification vs. Clustering
Classification: Supervised learning
Learns a method for predicting the instance class from pre-labeled (classified) instances
4
Clustering
Unsupervised learning:
Finds “natural” grouping of instances given un-labeled data
Clustering Analysis
Clustering Analysis 5
Clustering Methods
Many different method and algorithms:
For numeric and/or symbolic data
Deterministic vs. probabilistic
Exclusive vs. overlapping
Hierarchical vs. flat
Top-down vs. bottom-up
Clustering Analysis 6
What is Cluster Analysis ? Cluster: a collection of data objects
High similarity of objects within a cluster
Low similarity of objects across clusters
Cluster analysis
Grouping a set of data objects into clusters
Clustering is an unsupervised classification: no predefined classes
Typical applications
As a stand-alone tool to get insight into data distribution
As a preprocessing step for other algorithms
Clustering Analysis 7
General Applications of Clustering Pattern Recognition
In biology, derive plant and animal taxonomies, categorize genes
Spatial Data Analysis
create thematic maps in GIS by clustering feature spaces
detect spatial clusters and explain them in spatial data mining
Image Processing
Economic Science (e.g. market research) discovering distinct groups in their customer bases and characterize customer
groups based on purchasing patterns
WWW
Document classification
Cluster Weblog data to discover groups of similar access patterns
Clustering Analysis 8
Examples of Clustering Applications Marketing: Help marketers discover distinct groups in their
customer bases, and then use this knowledge to develop targeted marketing programs
Land use: Identification of areas of similar land use in an earth observation database
Insurance: Identifying groups of motor insurance policy holders with a high average claim cost
City-planning: Identifying groups of houses according to their house type, value, and geographical location
Earth-quake studies: Observed earth quake epicenters should be clustered along continent faults
Clustering Analysis 9
Clustering Evaluation
Manual inspection
Benchmarking on existing labels
Cluster quality measures
distance measures
high similarity within a cluster, low across clusters
Clustering Analysis 10
Criteria for Clustering A good clustering method will produce high quality clusters
with
high intra-class similarity
low inter-class similarity
The quality of a clustering result depends on:
both the similarity measure used by the method and its implementation to the new cases
ability to discover some or all of the hidden patterns
Clustering Analysis 11
Requirements of Clustering in Data Mining Scalability Ability to deal with different types of attributes Discovery of clusters with arbitrary shape Minimal requirements for domain knowledge to
determine input parameters Able to deal with noise and outliers Insensitive to order of input records High dimensionality Incorporation of user-specified constraints Interpretability and usability
Clustering Analysis 12
The distance function
Simplest case: one numeric attribute A
Distance(X,Y) = A(X) – A(Y)
Several numeric attributes:
Distance(X,Y) = Euclidean distance between X,Y
Nominal attributes: distance is set to 1 if values are different, 0 if they are equal
Are all attributes equally important?
Weighting the attributes might be necessary
Clustering Analysis 13
From Data Matrix to Similarity or Dissimilarity Matrices
Data matrix (or object-by-attribute structure)
m objects with n attributes, e.g., relational data
Similarity and dissimilarity matrices a collection of proximities for all pairs of m objects.
mnmjm
iniji
nj
xxx
xxx
xxx
1
1
1111
0
0
0
1
1
mjm
i
dd
d
1
1
1
1
1
mjm
i
ss
s
10
ij
jiij
d
dd
10
ij
jiij
s
ss
Clustering Analysis 14
Distance Functions (Overview)
To transform a data matrix to similarity or dissimilarity matrices, we need a definition of distance.
Some definitions of distance functions depend on the type of attributes interval-scaled attributes
Boolean attributes
nominal, ordinal and ratio attributes.
Weights should be associated with different attributes based on applications and data semantics.
It is hard to define “similar enough” or “good enough” the answer is typically highly subjective.
Clustering Analysis 15
Similarity and Dissimilarity Between Objects (I)
Distances are normally used to measure the similarity or dissimilarity between two data objects
Some popular ones include: Minkowski distance:
where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects, and q is a positive integer
If q = 1, d is Manhattan distance
q q
jpip
q
ji
q
jiij xxxxxxd )||...|||(| 2211
||...|||| 2211 jpipjijiij xxxxxxd
Clustering Analysis 16
Similarity and Dissimilarity Between Objects (II)
If q = 2, d is Euclidean distance:
Properties
d(i,j) >= 0
d(i,i) = 0
d(i,j) = d(j,i)
d(i,j) <= d(i,k) + d(k,j)
Also one can use weighted distance, parametric Pearson product
moment correlation, or other dissimilarity measures.
)||...|||(| 22
22
2
11 jpipjijiij xxxxxxd
)||...||||( 22
222
2
111 jpipijijixxwxxwxxwd
ij
Clustering Analysis: Types of Attributes 17
Types of Attributes in Clustering Interval-scaled attributes
Continuous measures of a roughly linear scale
Binary attributes
Two-state measures: 0 or 1
Nominal, ordinal, and ratio attributes
More than two states, nominal or ordinal or nonlinear scale
Mixed types
Mixture of interval-scaled, symmetric binary, asymmetric binary,
nominal, ordinal, or ratio-scaled attributes
18
Interval-valued Attributes Standardize data Calculate the mean absolute deviation:
where
Calculate the standardized measurement (mean-absolute-based z-
score)
Using mean absolute deviation is more robust than using standard
deviation since the z-scores of outliers do not become too small. Hence, the outliers remain detectable.
|)|...|||(|121 fnffffff mxmxmxns
.)...21
1nffff
xx(xn m
f
fif
if s
mx z
Clustering Analysis: Types of Attributes
19
Binary Attributes A contingency table for binary data
Simple matching coefficient (invariant, if the binary variable is symmetric):
Jaccard coefficient (noninvariant if the binary variable is asymmetric):
pdbcasum
dcdc
baba
sum
0
1
01
Object i
Object j
dcbacb d
ij
cbacb d
ij
A binary variable contains two
possible outcomes: 1
(positive/present) or 0
(negative/absent).
• If there is no preference for
which outcome should be
coded as 0 and which as 1, the
binary variable is
called symmetric.
• If the outcomes of a binary
variable are not equally
important, the binary variable is
called asymmetric, such as "is
color-blind" for a human being.
The most important outcome
is usually coded as 1 (present)
and the other is coded as 0
(absent).
Clustering Analysis: Types of Attributes
20
Dissimilarity on Binary Attributes Example
The gender is symmetric attribute and the remaining attributes are asymmetric
binary. Here, let the values Y and P be set to 1, and the value N be set to 0. Then calculate only asymmetric binary
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N
75.0211
21),(
67.0111
11),(
33.0102
10),(
maryjimd
jimjackd
maryjackd
Clustering Analysis: Types of Attributes
21
Nominal Attributes A generalization of the binary variable in that it can take
more than 2 states, e.g., red, yellow, blue, green
Method 1: Simple matching
m is the # of matches, p is the total # of nominal attributes
Method 2: Use a large number of binary variables creating a new binary variable for each of the M nominal
states
p
mpdij
Clustering Analysis: Types of Attributes
22
Ordinal Attributes
},...,1{fif
Mr
An ordinal variable can be discrete or continuous
order is important, e.g., rank
Can be treated like interval-scaled
replacing xif by their rank
map the range of each variable onto [0, 1] by replacing
i-th object in the f-th variable by
compute the dissimilarity using methods for interval-scaled
variables
1
1
f
if
if M
rz
Clustering Analysis: Types of Attributes
23
Ratio-Scaled Attributes Ratio-scaled variable: a positive measurement on a nonlinear
scale, approximately at exponential scale, such as AeBt or Ae-Bt
Methods:
(1) treat them like interval-scaled attributes
not a good choice!
(2) apply logarithmic transformation
yif = log(xif)
(3) treat them as continuous ordinal data and
treat their rank as interval-scaled.
Clustering Analysis: Types of Attributes
24
Mixed Types
A database may contain all the six types of attributes
symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio.
One may use a weighted formula to combine their effects.
f is binary or nominal:
dij(f) = 0 if xif = xjf , or dij(f) = 1 otherwise
f is interval-based: use the normalized distance
f is ordinal or ratio-scaled
compute ranks rif and
treat (normalized) zif as interval-scaled
)(1
)()(1
fij
pf
fij
fij
pf
ij
dd
Clustering Analysis: Types of Attributes
Clustering Analysis: Clustering Approaches 25
Major Clustering Approaches Partitioning algorithms: Construct various partitions and then
evaluate them by some criterion
Hierarchy algorithms: Create a hierarchical decomposition of the set of data (or objects) using some criterion
Density-based: based on connectivity and density functions
Grid-based: based on a multiple-level granularity structure
Model-based: A model is hypothesized for each of the clusters and the idea is to find the best fit of that model to each other
Clustering Analysis: Partitioning Algorithms 26
Partitioning Approach Construct a partition of a database D of n objects into a set of k
clusters
Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
Global optimal: exhaustively enumerate all partitions
Heuristic methods: k-means and k-medoids algorithms
k-means (MacQueen’67): Each cluster is represented by the center of the cluster
k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw’87): Each cluster is represented by one of the objects in the cluster
27
The K-Means Clustering Method (Overview)
Given k, the k-means algorithm is implemented in 4 steps:
Partition objects into k nonempty subsets
Compute seed points as the centroids of the clusters of the current partition. The centroid is the center (mean point) of the cluster.
Assign each object to the cluster with the nearest seed point.
Go back to Step 2, stop when no more new assignment.
Clustering Analysis: Partitioning Algorithms
28
0
1
2
3
4
5
6
7
8
9
10
0 1 2 3 4 5 6 7 8 9 10
0
1
2
3
4
5
6
7
8
9
10
0 1 2 3 4 5 6 7 8 9 10
0
1
2
3
4
5
6
7
8
9
10
0 1 2 3 4 5 6 7 8 9 10
0
1
2
3
4
5
6
7
8
9
10
0 1 2 3 4 5 6 7 8 9 10
The K-Means Clustering Method (An Graphical Example)
Clustering Analysis: Partitioning Algorithms
29
K-means example, step 1
k1
k2
k3
X
Y
Pick 3
initial
cluster
centers
(randomly)
Clustering Analysis: Partitioning Algorithms
Given k=3,
30
K-means example, step 2
k1
k2
k3
X
Y
Assign
each point
to the closest
cluster
center
Clustering Analysis: Partitioning Algorithms
31
K-means example, step 3
X
Y
Move
each cluster
center
to the mean
of each cluster
k1
k2
k2
k1
k3
k3
Clustering Analysis: Partitioning Algorithms
32
K-means example, step 4
X
Y
Reassign
points
closest to a
different new
cluster center
Q: Which
points are
reassigned?
k1
k2
k3
Clustering Analysis: Partitioning Algorithms
33
K-means example, step 4 …
X
Y
A: three
points with
animation
k1
k3 k2
Clustering Analysis: Partitioning Algorithms
34
K-means example, step 4b
X
Y
re-compute
cluster
means
k1
k3 k2
Clustering Analysis: Partitioning Algorithms
35
K-means example, step 5
X
Y
move cluster
centers to
cluster means
k2
k1
k3
Clustering Analysis: Partitioning Algorithms
36
Problems to be considered What can be the problems with K-means clustering? Result can vary significantly depending on initial choice of seeds
(number and position) Can get trapped in local minimum
Example:
Q: What can be done? A: To increase chance of finding global optimum: restart with
different random seeds. What can be done about outliers?
instances
initial cluster
centers
Clustering Analysis: Partitioning Algorithms
37
The K-Means Clustering Method (Strength and Weakness)
Strength
Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n
Good for finding clusters with spherical shapes
Often terminates at a local optimum. The global optimum may be found using techniques such as: deterministic annealing and genetic algorithms
Weakness
Applicable only when mean is defined, then what about categorical data?
Need to specify k, no. of clusters, in advance
Unable to handle noisy data and outliers
Not suitable to discover clusters with non-convex shapes
Clustering Analysis: Partitioning Algorithms
38
The K-Means Clustering Method (Variations – I)
A few variants of the k-means which differ in
Selection of the initial k means
Dissimilarity calculations
Strategies to calculate cluster means K-medoids – instead of mean, use medians of each cluster
For large databases, use sampling
Handling categorical data: k-modes (Huang’98)
Replacing means of clusters with modes
Using new dissimilarity measures to deal with categorical objects
Using a frequency-based method to update modes of clusters
A mixture of categorical/numerical data: k-prototype method
Mean of 1, 3, 5, 7, 9 is 5 Mean of 1, 3, 5, 7, 1009 is 205 Median of 1, 3, 5, 7, 1009 is 5 Median advantage: not affected by extreme values
Clustering Analysis: Partitioning Algorithms
39
The K-Medoids Clustering Method (Overview)
Find representative objects, called medoids, in clusters
PAM (Partitioning Around Medoids, 1987)
starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering
PAM works effectively for small data sets, but does not scale well for large data sets
CLARA (Kaufmann & Rousseeuw, 1990)
CLARANS (Ng & Han, 1994): Randomized sampling
Clustering Analysis: Partitioning Algorithms
40
The K-Medoids Clustering Method (PAM - Partitioning Around Medoids)
PAM (Kaufman and Rousseeuw, 1987), built in Splus
Use real object to represent the cluster
Select k representative objects arbitrarily
For each pair of non-selected object h and selected object i, calculate the total swapping cost Sih
For each pair of i and h, If Sih < 0, i is replaced by h
Then assign each non-selected object to the most similar representative object
repeat steps 2-3 until there is no change
Clustering Analysis: Partitioning Algorithms
PAM example
Clustering Analysis 41
Cluster the following data set of ten objects into two clusters i.e k = 2.
X1 2 6
X2 3 4
X3 3 8
X4 4 7
X5 6 2
X6 6 4
X7 7 3
X8 7 4
X9 8 5
X10 7 6
PAM example, step 1
Clustering Analysis 42
Initialise k medoids. Let assume c1 = (3,4) and c2 = (7,4)
Calculating distance so as to associate each data object to its nearest medoid. Assume that cost is calculated using Minkowski distance metric with r = 1.
c1 Data objects
(Xi)
Cost
(distan
ce)
3 4 2 6 3
3 4 3 8 4
3 4 4 7 4
3 4 6 2 5
3 4 6 4 3
3 4 7 3 5
3 4 8 5 6
3 4 7 6 6
c2 Data objects
(Xi)
Cost
(distan
ce)
7 4 2 6 7
7 4 3 8 8
7 4 4 7 6
7 4 6 2 3
7 4 6 4 1
7 4 7 3 1
7 4 8 5 2
7 4 7 6 2
PAM example, step 1b
Clustering Analysis 43
Then the clusters become: Cluster1 = {(3,4)(2,6)(3,8)(4,7)} Cluster2 = {(7,4)(6,2)(6,4)(7,3)(8,5)(7,6)}
Total cost = 3 + 4 + 4 + 3 + 1 + 1 + 2 + 2 = 20
c1 Data objects
(Xi)
Cost
(distan
ce)
3 4 2 6 3
3 4 3 8 4
3 4 4 7 4
3 4 6 2 5
3 4 6 4 3
3 4 7 3 5
3 4 8 5 6
3 4 7 6 6
c2 Data objects
(Xi)
Cost
(distan
ce)
7 4 2 6 7
7 4 3 8 8
7 4 4 7 6
7 4 6 2 3
7 4 6 4 1
7 4 7 3 1
7 4 8 5 2
7 4 7 6 2
PAM example, step 2
Clustering Analysis 44
Selection of nonmedoid O′ randomly. Let us assume O′ = (7,3). So now the medoids are c1(3,4) and O′(7,3)
Calculate the cost of new medoid by using the formula in the step1. Total cost = 3+4+4+2+2+1+3+3 = 22
c1 Data objects
(Xi)
Cost
(distan
ce)
3 4 2 6 3
3 4 3 8 4
3 4 4 7 4
3 4 6 2 5
3 4 6 4 3
3 4 7 3 5
3 4 8 5 6
3 4 7 6 6
O′ Data objects
(Xi)
Cost
(distan
ce)
7 3 2 6 8
7 3 3 8 9
7 3 4 7 7
7 3 6 2 2
7 3 6 4 2
7 3 7 4 1
7 3 8 5 3
7 3 7 6 3
PAM example, step 2b
Clustering Analysis 45
So cost of swapping medoid from c2 to O′ is
S = current total cost – past total cost = 22-20 = 2 >0
So moving to O′ would be bad idea, so the previous choice was good, and algorithm terminates here (i.e there is no change in the medoids).
It may happen some data points may shift from one cluster to another cluster depending upon their closeness to medoid
46
CLARA (Clustering Large Applications) (1990)
CLARA (Kaufmann and Rousseeuw in 1990)
Built in statistical analysis packages, such as S+
It draws multiple samples of the data set, applies PAM on each
sample, and gives the best clustering as the output
Strength: deals with larger data sets than PAM
Weakness:
Efficiency depends on the sample size
A good clustering based on samples will not necessarily represent a good
clustering of the whole data set if the sample is biased
Clustering Analysis: Partitioning Algorithms
47
CLARANS (“Randomized” CLARA) (1994)
CLARANS (A Clustering Algorithm based on Randomized Search) (Ng
and Han’94)
CLARANS draws sample of neighbors dynamically
The clustering process can be presented as searching a graph where
every node is a potential solution, that is, a set of k medoids
If the local optimum is found, CLARANS starts with new randomly
selected node in search for a new local optimum
It is more efficient and scalable than both PAM and CLARA
Focusing techniques and spatial access structures may further improve
its performance (Ester et al.’95)
Clustering Analysis: Partitioning Algorithms
48
The Partition-Based Clustering (Discussion)
Result can vary significantly based on initial choice of seeds
Algorithm can get trapped in a local minimum
Example: four instances at the vertices of a two-dimensional rectangle
Local minimum: two cluster centers at the midpoints of the rectangle’s long sides
Simple way to increase chance of finding a global optimum: restart with different random seeds
Clustering Analysis: Hierarchical Algorithms
49
Hierarchical Clustering Use distance matrix as clustering criteria.
This method does not require the number of clusters k as an
input, but needs a termination condition Step 0 Step 1 Step 2 Step 3 Step 4
b
d
c
e
a a b
d e
c d e
a b c d e
Step 4 Step 3 Step 2 Step 1 Step 0
Agglomerative
(AGNES)
Divisive
(DIANA) Clustering Analysis: Hierarchical Algorithms
50
AGNES (Agglomerative Nesting) Introduced in Kaufmann and Rousseeuw (1990)
Implemented in statistical analysis packages, e.g., Splus
Use the Single-Link method and the dissimilarity matrix.
Merge nodes that have the least dissimilarity
Go on in a non-descending fashion
Eventually all nodes belong to the same cluster
0
1
2
3
4
5
6
7
8
9
10
0 1 2 3 4 5 6 7 8 9 10
0
1
2
3
4
5
6
7
8
9
10
0 1 2 3 4 5 6 7 8 9 10
0
1
2
3
4
5
6
7
8
9
10
0 1 2 3 4 5 6 7 8 9 10
Clustering Analysis: Hierarchical Algorithms
51
Dendrogram for Hierarchical Clustering Decompose data objects into a several levels of nested
partitioning (tree of clusters), called a dendrogram.
A clustering of the data objects is obtained by cutting the dendrogram at the desired level, then each connected component forms a cluster.
Clustering Analysis: Hierarchical Algorithms
52
DIANA - Divisive Analysis Introduced in Kaufmann and Rousseeuw (1990)
Implemented in statistical analysis packages, e.g., Splus
Inverse order of AGNES
Eventually each node forms a cluster on its own
0
1
2
3
4
5
6
7
8
9
10
0 1 2 3 4 5 6 7 8 9 10
0
1
2
3
4
5
6
7
8
9
10
0 1 2 3 4 5 6 7 8 9 10
0
1
2
3
4
5
6
7
8
9
10
0 1 2 3 4 5 6 7 8 9 10
Clustering Analysis: Hierarchical Algorithms
53
Hierarchical Clustering Major weakness of agglomerative clustering methods do not scale well: time complexity of at least O(n2),
where n is the number of total objects can never undo what was done previously
Clustering Analysis: Hierarchical Algorithms
54
Distances - Hierarchical Clustering (Overview)
Four measures for distance between clusters are: Single linkage (Minimum distance):
Complete linkage (Maximum distance):
Centroid comparison (Mean distance):
Element comparison (Average distance):
p'pminCCd Cjp'Cpjimin i ,),(
p'pmaxCCd Cjp'Cpjimax i ,),(
jijimean mmCCd ),(
Clustering Analysis: Hierarchical Algorithms
i jCp Cp'ji
jiavg p'pnn
CCd1
),(
55
Distances - Hierarchical Clustering (Graphical Representation)
Four measures for distance between clusters are (1) single linkage, (2) complete linkage, (3) centroid comparison and (4) element comparison
x x
x
Cluster 1 Cluster 2
Cluster 3 (2)complete
(3)centroid
(1)single
x = centroids
(4) Element comparison:
average distance among all
elements in two clusters
Clustering Analysis: Hierarchical Algorithms
Practice
Clustering Analysis 56
Use single and complete link agglomerative clustering to group the data described by the following distance matrix. Show the dendrograms.
A B C D
A 0 1 4 5
B 0 2 6
C 0 3
D 0
Clustering Analysis 57
Other Clustering Methods Incremental Clustering
Probability-based Clustering, Bayesian Clustering
EM Algorithm Cluster Schemes Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Clustering Analysis 58
Advanced Method: BIRCH (Overview)
Balanced Iterative Reducing and Clustering using Hierarchies [Tian Zhang, Raghu Ramakrishnan, Miron Livny, 1996]
Incremental, hierarchical, one scan Save clustering information in a tree Each entry in the tree contains information about one cluster New nodes inserted in closest entry in tree Only works with "metric" attributes
Must have Euclidean coordinates Designed for very large data sets
Time and memory constraints are explicit Treats dense regions of data points as sub-clusters
Not all data points are important for clustering Only one scan of data is necessary
Clustering Analysis 59
BIRCH (Merits) Incremental, distance-based approach
Decisions are made without scanning all data points, or all currently existing clusters
Does not need the whole data set in advance
Unique approach: Distance-based algorithms generally need all the data points to work
Make best use of available memory while minimizing I/O costs
Does not assume that the probability distributions on attributes is independent
Clustering Analysis 60
BIRCH – Clustering Feature and Clustering Feature Tree
BIRTH introduces two concepts, clustering feature and clustering feature tree (CF Tree), which are used to summarize cluster representations.
These structures help the clustering method achieve good speed and scalability in large databases and make it effective for incremental and dynamic clustering of incoming object
Given n d-dimensional data objects or points in a cluster, we can define the centroid x0, radius R and diameter D of the cluster as follows:
Clustering Analysis 61
BIRCH – Centroid, Radius and Diameter
• Given a cluster of instances , we define: • Centroid: the center of a cluster
• Radius: average distance from member points to centroid
• Diameter: average pair-wise distance within a cluster
Clustering Analysis 62
BIRCH – Centroid Euclidean and Manhattan distances
• The centroid Euclidean distance and centroid Manhattan distance are defined between any two clusters. • Centroid Euclidean distance
• Centroid Manhattan distance
Clustering Analysis 63
BIRCH (Average inter-cluster, Average intra-cluster, Variance increase)
• The average inter-cluster, the average intra-cluster, and the variance increase distances are defined as follows
• Average inter-cluster
• Average intra-cluster
• Variance increase distances
Clustering Analysis 64
Clustering Feature CF = (N,LS,SS)
N: Number of points in cluster
LS: Sum of points in the cluster
SS: Sum of squares of points in the cluster CF Tree
Balanced search tree
Node has CF triple for each child
Leaf node represents cluster and has CF value for each subcluster in it.
Subcluster has maximum diameter
Clustering Analysis 65
0
1
2
3
4
5
6
7
8
9
10
0 1 2 3 4 5 6 7 8 9 10
CF = (5, (16,30),(54,190))
(3,4) (4,7)
(2,6) (3,8)
(4,5)
Clustering Feature Vector
Clustering Feature: CF = (N, LS, SS)
N: Number of data points
LS: Ni=1=Xi
SS: Ni=1=Xi
2
),,( 21212121SSSSLSLSNNCFCF
(6,2) (8,4)
(7,2) (8,5)
(7,4)
CF = (5, (36,17),(262,65)) CF = (10, (52,47),(316,255))
Clustering Analysis 66
Merging two clusters Cluster {Xi}:
i = 1, 2, …, N1
Cluster {Xj}:
j = N1+1, N1+2, …, N1+N2
Cluster Xl = {Xi} + {Xj}:
l = 1, 2, …, N1, N1+1, N1+2, …, N1+N2
Clustering Analysis 67
CF Tree CF1
child1
CF3
child3
CF2
child2
CF6
child6
CF1
child1
CF3
child3
CF2
child2
CF5
child5
CF1 CF2 CF6 prev next CF1 CF2 CF4
prev next
B = 7
L = 6
Root
Non-leaf node
Leaf node Leaf node
Clustering Analysis 68
Properties of CF-Tree Each non-leaf node has at most B entries Each leaf node has at most L CF entries which each satisfy threshold
T Node size is determined by dimensionality of data space and input
parameter P (page size)
Branching Factor and Thread hold
Clustering Analysis 69
BIRCH Algorithm (CF-Tree Insertion)
Recurse down from root, find the appropriate leaf Follow the "closest"-CF path, w.r.t. D0 / … / D4 Modify the leaf If the closest-CF leaf cannot absorb, make a new CF entry. If there is no room for new leaf, split the parent node Traverse back & up Updating CFs on the path or splitting nodes
Clustering Analysis 70
Improve Clusters
Clustering Analysis 71
BIRCH Algorithm (Overall steps)
Clustering Analysis 72
Details of Each Step Phase 1: Load data into memory
Build an initial in-memory CF-tree with the data (one scan) Subsequent phases become fast, accurate, less order sensitive
Phase 2: Condense data Rebuild the CF-tree with a larger T Condensing is optional
Phase 3: Global clustering Use existing clustering algorithm on CF entries Helps fix problem where natural clusters span nodes
Phase 4: Cluster refining Do additional passes over the dataset & reassign data points to the
closest centroid from phase 3 Refining is optional
Clustering Analysis 73
Summary of BIRCH
BIRCH works with very large data sets
Explicitly bounded by computational resources.
The computation complexity is O(n), where n is the number of objects to be clustered.
Runs with specified amount of memory (P)
Superior to CLARANS and k-MEANS
Quality, speed, stability and scalability
Clustering Analysis 74
CURE (Clustering Using REpresentatives) CURE was proposed by Guha, Rastogi & Shim, 1998
It stops the creation of a cluster hierarchy if a level consists of k clusters
Each cluster has c representatives (instead of one)
Choose c well scattered points in the cluster
Shrink them towards the mean of the cluster by a fraction of
The representatives capture the physical shape and geometry of the cluster
It can treat arbitrary shaped clusters and avoid single-link effect.
Merge the closest two clusters
Distance of two clusters: the distance between the two closest representatives
Clustering Analysis 75
CURE Algorithm
Clustering Analysis 76
CURE Algorithm
Clustering Analysis 77
CURE Algorithm (Another form)
Clustering Analysis 78
CURE Algorithm (for Large Databases)
Clustering Analysis 79
Data Partitioning and Clustering s = 50
p = 2
s/p = 25
s/pq = 5
x x
x
y
y y
y
x
y
x
Clustering Analysis 80
Cure: Shrinking Representative Points Shrink the multiple representative points towards the gravity center by
a fraction of .
Multiple representatives capture the shape of the cluster
x
y
x
y
Clustering Analysis 81
Clustering Categorical Data: ROCK
ROCK: RObust Clustering using linKs, by S. Guha, R. Rastogi, K. Shim (ICDE’99).
Use links to measure similarity/proximity
Not distance-based with categorical attributes
Computational complexity:
Basic ideas (Jaccard coefficient):
Similarity function and neighbors:
Let T1 = {1,2,3}, T2={3,4,5}
O n nm m n nm a
( log )2 2
Sim T TT T
T T( , )
1 2
1 2
1 2
Sim T T( , ){ }
{ , , , , }.1 2
3
1 2 3 4 5
1
50 2
Clustering Analysis 82
ROCK: An Example
Links: The number of common neighbors for the two points. Using Jaccard
Use Distances to determine neighbors
(pt1,pt4) = 0, (pt1,pt2) = 0, (pt1,pt3) = 0
(pt2,pt3) = 0.6, (pt2,pt4) = 0.2
(pt3,pt4) = 0.2
Use 0.2 as threshold for neighbors
Pt2 and Pt3 have 3 common neighbors
Pt3 and Pt4 have 3 common neighbors
Pt2 and Pt4 have 3 common neighbors
Resulting clusters (1), (2,3,4) which makes more sense
Clustering Analysis 83
ROCK: Property & Algorithm
Links: The number of common neighbours for the two points.
Algorithm Draw random sample
Cluster with links (maybe agglomerative hierarchical)
Label data in disk
{1,2,3}, {1,2,4}, {1,2,5}, {1,3,4}, {1,3,5}
{1,4,5}, {2,3,4}, {2,3,5}, {2,4,5}, {3,4,5}
{1,2,3} {1,2,4} 3
Clustering Analysis 84
CHAMELEON
CHAMELEON: hierarchical clustering using dynamic modeling, by G. Karypis, E.H. Han and V. Kumar’99
Measures the similarity based on a dynamic model
Two clusters are merged only if the interconnectivity and closeness (proximity) between two clusters are high relative to the internal interconnectivity of the clusters and closeness of items within the clusters
A two phase algorithm
1. Use a graph partitioning algorithm: cluster objects into a large number of relatively small sub-clusters
2. Use an agglomerative hierarchical clustering algorithm: find the genuine clusters by repeatedly combining these sub-clusters
Clustering Analysis 85
Graph-based clustering
Sparsification techniques keep the connections to the most similar (nearest) neighbors of a point while breaking the connections to less similar points.
The nearest neighbors of a point tend to belong to the same class as the point itself.
This reduces the impact of noise and outliers and sharpens the distinction between clusters.
Clustering Analysis 86
Overall Framework of CHAMELEON
Construct
Sparse Graph Partition the Graph
Merge Partition
Final Clusters
Data Set
Clustering Analysis 87
Density-Based Clustering Methods
Clustering based on density (local cluster criterion), such as density-connected points
Major features: Discover clusters of arbitrary shape Handle noise One scan Need density parameters as termination condition
Several interesting studies: DBSCAN: Ester, et al. (KDD’96)
OPTICS: Ankerst, et al (SIGMOD’99).
DENCLUE: Hinneburg & D. Keim (KDD’98)
CLIQUE: Agrawal, et al. (SIGMOD’98)
Clustering Analysis 88
Density-Based Clustering (Background)
Two parameters:
Eps: Maximum radius of the neighbourhood
MinPts: Minimum number of points in an Eps-neighbourhood of that point
NEps(p): {q belongs to D | dist(p,q) <= Eps}
Directly density-reachable: A point p is directly density -reachable from a point q wrt. Eps, MinPts if
1) p belongs to NEps(q)
2) core point condition:
|NEps (q)| >= MinPts
p
q
MinPts = 5
Eps = 1 cm
Clustering Analysis 89
Density-Based Clustering (Background)
The neighborhood with in a radius Eps of a given object is call the neighborhood of the object.
If the neighborhood of an object contains at least a minimum number, MinPts, of objects, then the object is called a core object.
Given a set of objects, D, we say that an object p is directly density-reachable from object q if p is within the neighborhood of q, and q is a core object.
An object p is density-reachable from object q with respect to and MinPts in a set of objects D, if there is a chain of object p1, …, pn, where p1=q and pn=p such that p(i+1) is directly density-reachable from pi with respect to and MinPts, for 1≤i≤n, from pi D.
An object p is density-connected to object q with respect to
Clustering Analysis 90
Density-Based Clustering
Density-reachable:
A point p is density-reachable from a point q wrt. Eps, MinPts if there is a chain of points p1, …, pn, p1 = q, pn = p such that pi+1 is directly density-reachable from pi
Density-connected
A point p is density-connected to a point q wrt. Eps, MinPts if there is a point o such that both, p and q are density-reachable from o wrt. Eps and MinPts.
p
q p1
p q
o
Clustering Analysis 91
DBSCAN: Density Based Spatial Clustering of Applications with
Noise
Relies on a density-based notion of cluster: A cluster is defined as a maximal set of density-connected points
Discovers clusters of arbitrary shape in spatial databases with noise
Core
Border
Outlier
Eps = 1cm
MinPts = 5
Clustering Analysis 92
DBSCAN: The Algorithm
Arbitrary select a point p
Retrieve all points density-reachable from p wrt Eps and MinPts.
If p is a core point, a cluster is formed.
If p is a border point, no points are density-reachable from p and
DBSCAN visits the next point of the database.
Continue the process until all of the points have been processed.