Dbm630 lecture09

DBM630: Data Mining and

Data Warehousing

MS.IT. Rangsit University

1

Semester 2/2011

Lecture 9

Clustering

by Kritsada Sriphaew (sriphaew.k AT gmail.com)

Topics

2

What is Cluster Analysis?

Types of Attributes in Cluster Analysis

Major Clustering Approaches

Partitioning Algorithms

Hierarchical Algorithms

Data Warehousing and Data Mining by Kritsada Sriphaew

Clustering Analysis 3

Classification vs. Clustering

Classification: Supervised learning

Learns a method for predicting the instance class from pre-labeled (classified) instances

4

Clustering

Unsupervised learning:

Finds “natural” grouping of instances given un-labeled data

Clustering Analysis


Clustering Methods

Many different method and algorithms:

For numeric and/or symbolic data

Deterministic vs. probabilistic

Exclusive vs. overlapping

Hierarchical vs. flat

Top-down vs. bottom-up


What is Cluster Analysis ? Cluster: a collection of data objects

High similarity of objects within a cluster

Low similarity of objects across clusters

Cluster analysis

Grouping a set of data objects into clusters

Clustering is an unsupervised classification: no predefined classes

Typical applications

As a stand-alone tool to get insight into data distribution

As a preprocessing step for other algorithms


General Applications of Clustering Pattern Recognition

In biology, derive plant and animal taxonomies, categorize genes

Spatial Data Analysis

create thematic maps in GIS by clustering feature spaces

detect spatial clusters and explain them in spatial data mining

Image Processing

Economic Science (e.g. market research) discovering distinct groups in their customer bases and characterize customer

groups based on purchasing patterns

WWW

Document classification

Cluster Weblog data to discover groups of similar access patterns


Examples of Clustering Applications Marketing: Help marketers discover distinct groups in their

customer bases, and then use this knowledge to develop targeted marketing programs

Land use: Identification of areas of similar land use in an earth observation database

Insurance: Identifying groups of motor insurance policy holders with a high average claim cost

City-planning: Identifying groups of houses according to their house type, value, and geographical location

Earth-quake studies: Observed earth quake epicenters should be clustered along continent faults


Clustering Evaluation

Manual inspection

Benchmarking on existing labels

Cluster quality measures

distance measures

high similarity within a cluster, low across clusters


Criteria for Clustering A good clustering method will produce high quality clusters

with

high intra-class similarity

low inter-class similarity

The quality of a clustering result depends on:

both the similarity measure used by the method and its implementation to the new cases

ability to discover some or all of the hidden patterns


Requirements of Clustering in Data Mining Scalability Ability to deal with different types of attributes Discovery of clusters with arbitrary shape Minimal requirements for domain knowledge to

determine input parameters Able to deal with noise and outliers Insensitive to order of input records High dimensionality Incorporation of user-specified constraints Interpretability and usability


The distance function

Simplest case: one numeric attribute A

Distance(X,Y) = A(X) – A(Y)

Several numeric attributes:

Distance(X,Y) = Euclidean distance between X,Y

Nominal attributes: distance is set to 1 if values are different, 0 if they are equal

Are all attributes equally important?

Weighting the attributes might be necessary


From Data Matrix to Similarity or Dissimilarity Matrices

Data matrix (or object-by-attribute structure)

m objects with n attributes, e.g., relational data

Similarity and dissimilarity matrices a collection of proximities for all pairs of m objects.

mnmjm

iniji

nj

xxx

xxx

xxx

1

1

1111

0

0

0

1

1

mjm

i

dd

d

1

1

1

1

1

mjm

i

ss

s

10

ij

jiij

d

dd

10

ij

jiij

s

ss


Distance Functions (Overview)

To transform a data matrix to similarity or dissimilarity matrices, we need a definition of distance.

Some definitions of distance functions depend on the type of attributes interval-scaled attributes

Boolean attributes

nominal, ordinal and ratio attributes.

Weights should be associated with different attributes based on applications and data semantics.

It is hard to define “similar enough” or “good enough” the answer is typically highly subjective.


Similarity and Dissimilarity Between Objects (I)

Distances are normally used to measure the similarity or dissimilarity between two data objects

Some popular ones include: Minkowski distance:

where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects, and q is a positive integer

If q = 1, d is Manhattan distance

q q

jpip

q

ji

q

jiij xxxxxxd )||...|||(| 2211

||...|||| 2211 jpipjijiij xxxxxxd


Similarity and Dissimilarity Between Objects (II)

If q = 2, d is Euclidean distance:

Properties

d(i,j) >= 0

d(i,i) = 0

d(i,j) = d(j,i)

d(i,j) <= d(i,k) + d(k,j)

Also one can use weighted distance, parametric Pearson product

moment correlation, or other dissimilarity measures.

)||...|||(| 22

22

2

11 jpipjijiij xxxxxxd

)||...||||( 22

222

2

111 jpipijijixxwxxwxxwd

ij

Clustering Analysis: Types of Attributes 17

Types of Attributes in Clustering Interval-scaled attributes

Continuous measures of a roughly linear scale

Binary attributes

Two-state measures: 0 or 1

Nominal, ordinal, and ratio attributes

More than two states, nominal or ordinal or nonlinear scale

Mixed types

Mixture of interval-scaled, symmetric binary, asymmetric binary,

nominal, ordinal, or ratio-scaled attributes

18

Interval-valued Attributes Standardize data Calculate the mean absolute deviation:

where

Calculate the standardized measurement (mean-absolute-based z-

score)

Using mean absolute deviation is more robust than using standard

deviation since the z-scores of outliers do not become too small. Hence, the outliers remain detectable.

|)|...|||(|121 fnffffff mxmxmxns

.)...21

1nffff

xx(xn m

f

fif

if s

mx z

Clustering Analysis: Types of Attributes

19

Binary Attributes A contingency table for binary data

Simple matching coefficient (invariant, if the binary variable is symmetric):

Jaccard coefficient (noninvariant if the binary variable is asymmetric):

pdbcasum

dcdc

baba

sum

0

1

01

Object i

Object j

dcbacb d

ij

cbacb d

ij

A binary variable contains two

possible outcomes: 1

(positive/present) or 0

(negative/absent).

• If there is no preference for

which outcome should be

coded as 0 and which as 1, the

binary variable is

called symmetric.

• If the outcomes of a binary

variable are not equally

important, the binary variable is

called asymmetric, such as "is

color-blind" for a human being.

The most important outcome

is usually coded as 1 (present)

and the other is coded as 0

(absent).


20

Dissimilarity on Binary Attributes Example

The gender is symmetric attribute and the remaining attributes are asymmetric

binary. Here, let the values Y and P be set to 1, and the value N be set to 0. Then calculate only asymmetric binary

Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4

Jack M Y N P N N N

Mary F Y N P N P N

Jim M Y P N N N N

75.0211

21),(

67.0111

11),(

33.0102

10),(

maryjimd

jimjackd

maryjackd


21

Nominal Attributes A generalization of the binary variable in that it can take

more than 2 states, e.g., red, yellow, blue, green

Method 1: Simple matching

m is the # of matches, p is the total # of nominal attributes

Method 2: Use a large number of binary variables creating a new binary variable for each of the M nominal

states

p

mpdij


22

Ordinal Attributes

},...,1{fif

Mr

An ordinal variable can be discrete or continuous

order is important, e.g., rank

Can be treated like interval-scaled

replacing xif by their rank

map the range of each variable onto [0, 1] by replacing

i-th object in the f-th variable by

compute the dissimilarity using methods for interval-scaled

variables

1

1

f

if

if M

rz


23

Ratio-Scaled Attributes Ratio-scaled variable: a positive measurement on a nonlinear

scale, approximately at exponential scale, such as AeBt or Ae-Bt

Methods:

(1) treat them like interval-scaled attributes

not a good choice!

(2) apply logarithmic transformation

yif = log(xif)

(3) treat them as continuous ordinal data and

treat their rank as interval-scaled.


24

Mixed Types

A database may contain all the six types of attributes

symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio.

One may use a weighted formula to combine their effects.

f is binary or nominal:

dij(f) = 0 if xif = xjf , or dij(f) = 1 otherwise

f is interval-based: use the normalized distance

f is ordinal or ratio-scaled

compute ranks rif and

treat (normalized) zif as interval-scaled

)(1

)()(1

fij

pf

fij

fij

pf

ij

dd


Clustering Analysis: Clustering Approaches 25

Major Clustering Approaches Partitioning algorithms: Construct various partitions and then

evaluate them by some criterion

Hierarchy algorithms: Create a hierarchical decomposition of the set of data (or objects) using some criterion

Density-based: based on connectivity and density functions

Grid-based: based on a multiple-level granularity structure

Model-based: A model is hypothesized for each of the clusters and the idea is to find the best fit of that model to each other

Clustering Analysis: Partitioning Algorithms 26

Partitioning Approach Construct a partition of a database D of n objects into a set of k

clusters

Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion

Global optimal: exhaustively enumerate all partitions

Heuristic methods: k-means and k-medoids algorithms

k-means (MacQueen’67): Each cluster is represented by the center of the cluster

k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw’87): Each cluster is represented by one of the objects in the cluster

27

The K-Means Clustering Method (Overview)

Given k, the k-means algorithm is implemented in 4 steps:

Partition objects into k nonempty subsets

Compute seed points as the centroids of the clusters of the current partition. The centroid is the center (mean point) of the cluster.

Assign each object to the cluster with the nearest seed point.

Go back to Step 2, stop when no more new assignment.

Clustering Analysis: Partitioning Algorithms

28

0

1

2

3

4

5

6

7

8

9

10

0 1 2 3 4 5 6 7 8 9 10

0

1

2

3

4

5

6

7

8

9

10

0 1 2 3 4 5 6 7 8 9 10

0

1

2

3

4

5

6

7

8

9

10

0 1 2 3 4 5 6 7 8 9 10

0

1

2

3

4

5

6

7

8

9

10

0 1 2 3 4 5 6 7 8 9 10

The K-Means Clustering Method (An Graphical Example)


29

K-means example, step 1

k1

k2

k3

X

Y

Pick 3

initial

cluster

centers

(randomly)


Given k=3,

30


k1

k2

k3

X

Y

Assign

each point

to the closest

cluster

center


31


X

Y

Move

each cluster

center

to the mean

of each cluster

k1

k2

k2

k1

k3

k3


32


X

Y

Reassign

points

closest to a

different new

cluster center

Q: Which

points are

reassigned?

k1

k2

k3


33

K-means example, step 4 …

X

Y

A: three

points with

animation

k1

k3 k2


34

K-means example, step 4b

X

Y

re-compute

cluster

means

k1

k3 k2


35


X

Y

move cluster

centers to

cluster means

k2

k1

k3


36

Problems to be considered What can be the problems with K-means clustering? Result can vary significantly depending on initial choice of seeds

(number and position) Can get trapped in local minimum

Example:

Q: What can be done? A: To increase chance of finding global optimum: restart with

different random seeds. What can be done about outliers?

instances

initial cluster

centers


37

The K-Means Clustering Method (Strength and Weakness)

Strength

Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n

Good for finding clusters with spherical shapes

Often terminates at a local optimum. The global optimum may be found using techniques such as: deterministic annealing and genetic algorithms

Weakness

Applicable only when mean is defined, then what about categorical data?

Need to specify k, no. of clusters, in advance

Unable to handle noisy data and outliers

Not suitable to discover clusters with non-convex shapes


38

The K-Means Clustering Method (Variations – I)

A few variants of the k-means which differ in

Selection of the initial k means

Dissimilarity calculations

Strategies to calculate cluster means K-medoids – instead of mean, use medians of each cluster

For large databases, use sampling

Handling categorical data: k-modes (Huang’98)

Replacing means of clusters with modes

Using new dissimilarity measures to deal with categorical objects

Using a frequency-based method to update modes of clusters

A mixture of categorical/numerical data: k-prototype method

Mean of 1, 3, 5, 7, 9 is 5 Mean of 1, 3, 5, 7, 1009 is 205 Median of 1, 3, 5, 7, 1009 is 5 Median advantage: not affected by extreme values


39

The K-Medoids Clustering Method (Overview)

Find representative objects, called medoids, in clusters

PAM (Partitioning Around Medoids, 1987)

starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering

PAM works effectively for small data sets, but does not scale well for large data sets

CLARA (Kaufmann & Rousseeuw, 1990)

CLARANS (Ng & Han, 1994): Randomized sampling


40

The K-Medoids Clustering Method (PAM - Partitioning Around Medoids)

PAM (Kaufman and Rousseeuw, 1987), built in Splus

Use real object to represent the cluster

Select k representative objects arbitrarily

For each pair of non-selected object h and selected object i, calculate the total swapping cost Sih

For each pair of i and h, If Sih < 0, i is replaced by h

Then assign each non-selected object to the most similar representative object

repeat steps 2-3 until there is no change


PAM example


Cluster the following data set of ten objects into two clusters i.e k = 2.

X1 2 6

X2 3 4

X3 3 8

X4 4 7

X5 6 2

X6 6 4

X7 7 3

X8 7 4

X9 8 5

X10 7 6

http://en.wikipedia.org/wiki/File:Kmedoid1.jpg

PAM example, step 1


Initialise k medoids. Let assume c1 = (3,4) and c2 = (7,4)

Calculating distance so as to associate each data object to its nearest medoid. Assume that cost is calculated using Minkowski distance metric with r = 1.

c1 Data objects

(Xi)

Cost

(distan

ce)

3 4 2 6 3

3 4 3 8 4

3 4 4 7 4

3 4 6 2 5

3 4 6 4 3

3 4 7 3 5

3 4 8 5 6

3 4 7 6 6

c2 Data objects

(Xi)

Cost

(distan

ce)

7 4 2 6 7

7 4 3 8 8

7 4 4 7 6

7 4 6 2 3

7 4 6 4 1

7 4 7 3 1

7 4 8 5 2

7 4 7 6 2

http://en.wikipedia.org/wiki/Minkowski_distance



PAM example, step 1b


Then the clusters become: Cluster1 = {(3,4)(2,6)(3,8)(4,7)} Cluster2 = {(7,4)(6,2)(6,4)(7,3)(8,5)(7,6)}

Total cost = 3 + 4 + 4 + 3 + 1 + 1 + 2 + 2 = 20

c1 Data objects

(Xi)

Cost

(distan

ce)

3 4 2 6 3

3 4 3 8 4

3 4 4 7 4

3 4 6 2 5

3 4 6 4 3

3 4 7 3 5

3 4 8 5 6

3 4 7 6 6

c2 Data objects

(Xi)

Cost

(distan

ce)

7 4 2 6 7

7 4 3 8 8

7 4 4 7 6

7 4 6 2 3

7 4 6 4 1

7 4 7 3 1

7 4 8 5 2

7 4 7 6 2

http://en.wikipedia.org/wiki/File:Kmedoid2.jpg

PAM example, step 2


Selection of nonmedoid O′ randomly. Let us assume O′ = (7,3). So now the medoids are c1(3,4) and O′(7,3)

Calculate the cost of new medoid by using the formula in the step1. Total cost = 3+4+4+2+2+1+3+3 = 22

c1 Data objects

(Xi)

Cost

(distan

ce)

3 4 2 6 3

3 4 3 8 4

3 4 4 7 4

3 4 6 2 5

3 4 6 4 3

3 4 7 3 5

3 4 8 5 6

3 4 7 6 6

O′ Data objects

(Xi)

Cost

(distan

ce)

7 3 2 6 8

7 3 3 8 9

7 3 4 7 7

7 3 6 2 2

7 3 6 4 2

7 3 7 4 1

7 3 8 5 3

7 3 7 6 3

PAM example, step 2b


So cost of swapping medoid from c2 to O′ is

S = current total cost – past total cost = 22-20 = 2 >0

So moving to O′ would be bad idea, so the previous choice was good, and algorithm terminates here (i.e there is no change in the medoids).

It may happen some data points may shift from one cluster to another cluster depending upon their closeness to medoid

46

CLARA (Clustering Large Applications) (1990)

CLARA (Kaufmann and Rousseeuw in 1990)

Built in statistical analysis packages, such as S+

It draws multiple samples of the data set, applies PAM on each

sample, and gives the best clustering as the output

Strength: deals with larger data sets than PAM

Weakness:

Efficiency depends on the sample size

A good clustering based on samples will not necessarily represent a good

clustering of the whole data set if the sample is biased


47

CLARANS (“Randomized” CLARA) (1994)

CLARANS (A Clustering Algorithm based on Randomized Search) (Ng

and Han’94)

CLARANS draws sample of neighbors dynamically

The clustering process can be presented as searching a graph where

every node is a potential solution, that is, a set of k medoids

If the local optimum is found, CLARANS starts with new randomly

selected node in search for a new local optimum

It is more efficient and scalable than both PAM and CLARA

Focusing techniques and spatial access structures may further improve

its performance (Ester et al.’95)


48

The Partition-Based Clustering (Discussion)

Result can vary significantly based on initial choice of seeds

Algorithm can get trapped in a local minimum

Example: four instances at the vertices of a two-dimensional rectangle

Local minimum: two cluster centers at the midpoints of the rectangle’s long sides

Simple way to increase chance of finding a global optimum: restart with different random seeds

Clustering Analysis: Hierarchical Algorithms

49

Hierarchical Clustering Use distance matrix as clustering criteria.

This method does not require the number of clusters k as an

input, but needs a termination condition Step 0 Step 1 Step 2 Step 3 Step 4

b

d

c

e

a a b

d e

c d e

a b c d e

Step 4 Step 3 Step 2 Step 1 Step 0

Agglomerative

(AGNES)

Divisive

(DIANA) Clustering Analysis: Hierarchical Algorithms

50

AGNES (Agglomerative Nesting) Introduced in Kaufmann and Rousseeuw (1990)

Implemented in statistical analysis packages, e.g., Splus

Use the Single-Link method and the dissimilarity matrix.

Merge nodes that have the least dissimilarity

Go on in a non-descending fashion

Eventually all nodes belong to the same cluster

0

1

2

3

4

5

6

7

8

9

10

0 1 2 3 4 5 6 7 8 9 10

0

1

2

3

4

5

6

7

8

9

10

0 1 2 3 4 5 6 7 8 9 10

0

1

2

3

4

5

6

7

8

9

10

0 1 2 3 4 5 6 7 8 9 10


51

Dendrogram for Hierarchical Clustering Decompose data objects into a several levels of nested

partitioning (tree of clusters), called a dendrogram.

A clustering of the data objects is obtained by cutting the dendrogram at the desired level, then each connected component forms a cluster.


52

DIANA - Divisive Analysis Introduced in Kaufmann and Rousseeuw (1990)

Implemented in statistical analysis packages, e.g., Splus

Inverse order of AGNES

Eventually each node forms a cluster on its own

0

1

2

3

4

5

6

7

8

9

10

0 1 2 3 4 5 6 7 8 9 10

0

1

2

3

4

5

6

7

8

9

10

0 1 2 3 4 5 6 7 8 9 10

0

1

2

3

4

5

6

7

8

9

10

0 1 2 3 4 5 6 7 8 9 10


53

Hierarchical Clustering Major weakness of agglomerative clustering methods do not scale well: time complexity of at least O(n2),

where n is the number of total objects can never undo what was done previously


54

Distances - Hierarchical Clustering (Overview)

Four measures for distance between clusters are: Single linkage (Minimum distance):

Complete linkage (Maximum distance):

Centroid comparison (Mean distance):

Element comparison (Average distance):

p'pminCCd Cjp'Cpjimin i ,),(

p'pmaxCCd Cjp'Cpjimax i ,),(

jijimean mmCCd ),(


i jCp Cp'ji

jiavg p'pnn

CCd1

),(

55

Distances - Hierarchical Clustering (Graphical Representation)

Four measures for distance between clusters are (1) single linkage, (2) complete linkage, (3) centroid comparison and (4) element comparison

x x

x

Cluster 1 Cluster 2

Cluster 3 (2)complete

(3)centroid

(1)single

x = centroids

(4) Element comparison:

average distance among all

elements in two clusters


Practice


Use single and complete link agglomerative clustering to group the data described by the following distance matrix. Show the dendrograms.

A B C D

A 0 1 4 5

B 0 2 6

C 0 3

D 0


Other Clustering Methods Incremental Clustering

Probability-based Clustering, Bayesian Clustering

EM Algorithm Cluster Schemes Partitioning Methods

Hierarchical Methods

Density-Based Methods

Grid-Based Methods

Model-Based Clustering Methods


Advanced Method: BIRCH (Overview)

Balanced Iterative Reducing and Clustering using Hierarchies [Tian Zhang, Raghu Ramakrishnan, Miron Livny, 1996]

Incremental, hierarchical, one scan Save clustering information in a tree Each entry in the tree contains information about one cluster New nodes inserted in closest entry in tree Only works with "metric" attributes

Must have Euclidean coordinates Designed for very large data sets

Time and memory constraints are explicit Treats dense regions of data points as sub-clusters

Not all data points are important for clustering Only one scan of data is necessary


BIRCH (Merits) Incremental, distance-based approach

Decisions are made without scanning all data points, or all currently existing clusters

Does not need the whole data set in advance

Unique approach: Distance-based algorithms generally need all the data points to work

Make best use of available memory while minimizing I/O costs

Does not assume that the probability distributions on attributes is independent


BIRCH – Clustering Feature and Clustering Feature Tree

BIRTH introduces two concepts, clustering feature and clustering feature tree (CF Tree), which are used to summarize cluster representations.

These structures help the clustering method achieve good speed and scalability in large databases and make it effective for incremental and dynamic clustering of incoming object

Given n d-dimensional data objects or points in a cluster, we can define the centroid x0, radius R and diameter D of the cluster as follows:


BIRCH – Centroid, Radius and Diameter

• Given a cluster of instances , we define: • Centroid: the center of a cluster

• Radius: average distance from member points to centroid

• Diameter: average pair-wise distance within a cluster


BIRCH – Centroid Euclidean and Manhattan distances

• The centroid Euclidean distance and centroid Manhattan distance are defined between any two clusters. • Centroid Euclidean distance

• Centroid Manhattan distance


BIRCH (Average inter-cluster, Average intra-cluster, Variance increase)

• The average inter-cluster, the average intra-cluster, and the variance increase distances are defined as follows

• Average inter-cluster

• Average intra-cluster

• Variance increase distances


Clustering Feature CF = (N,LS,SS)

N: Number of points in cluster

LS: Sum of points in the cluster

SS: Sum of squares of points in the cluster CF Tree

Balanced search tree

Node has CF triple for each child

Leaf node represents cluster and has CF value for each subcluster in it.

Subcluster has maximum diameter


0

1

2

3

4

5

6

7

8

9

10

0 1 2 3 4 5 6 7 8 9 10

CF = (5, (16,30),(54,190))

(3,4) (4,7)

(2,6) (3,8)

(4,5)

Clustering Feature Vector

Clustering Feature: CF = (N, LS, SS)

N: Number of data points

LS: Ni=1=Xi

SS: Ni=1=Xi

2

),,( 21212121SSSSLSLSNNCFCF

(6,2) (8,4)

(7,2) (8,5)

(7,4)

CF = (5, (36,17),(262,65)) CF = (10, (52,47),(316,255))


Merging two clusters Cluster {Xi}:

i = 1, 2, …, N1

Cluster {Xj}:

j = N1+1, N1+2, …, N1+N2

Cluster Xl = {Xi} + {Xj}:

l = 1, 2, …, N1, N1+1, N1+2, …, N1+N2


CF Tree CF1

child1

CF3

child3

CF2

child2

CF6

child6

CF1

child1

CF3

child3

CF2

child2

CF5

child5

CF1 CF2 CF6 prev next CF1 CF2 CF4

prev next

B = 7

L = 6

Root

Non-leaf node

Leaf node Leaf node


Properties of CF-Tree Each non-leaf node has at most B entries Each leaf node has at most L CF entries which each satisfy threshold

T Node size is determined by dimensionality of data space and input

parameter P (page size)

Branching Factor and Thread hold


BIRCH Algorithm (CF-Tree Insertion)

Recurse down from root, find the appropriate leaf Follow the "closest"-CF path, w.r.t. D0 / … / D4 Modify the leaf If the closest-CF leaf cannot absorb, make a new CF entry. If there is no room for new leaf, split the parent node Traverse back & up Updating CFs on the path or splitting nodes


Improve Clusters


BIRCH Algorithm (Overall steps)


Details of Each Step Phase 1: Load data into memory

Build an initial in-memory CF-tree with the data (one scan) Subsequent phases become fast, accurate, less order sensitive

Phase 2: Condense data Rebuild the CF-tree with a larger T Condensing is optional

Phase 3: Global clustering Use existing clustering algorithm on CF entries Helps fix problem where natural clusters span nodes

Phase 4: Cluster refining Do additional passes over the dataset & reassign data points to the

closest centroid from phase 3 Refining is optional


Summary of BIRCH

BIRCH works with very large data sets

Explicitly bounded by computational resources.

The computation complexity is O(n), where n is the number of objects to be clustered.

Runs with specified amount of memory (P)

Superior to CLARANS and k-MEANS

Quality, speed, stability and scalability


CURE (Clustering Using REpresentatives) CURE was proposed by Guha, Rastogi & Shim, 1998

It stops the creation of a cluster hierarchy if a level consists of k clusters

Each cluster has c representatives (instead of one)

Choose c well scattered points in the cluster

Shrink them towards the mean of the cluster by a fraction of

The representatives capture the physical shape and geometry of the cluster

It can treat arbitrary shaped clusters and avoid single-link effect.

Merge the closest two clusters

Distance of two clusters: the distance between the two closest representatives


CURE Algorithm


CURE Algorithm


CURE Algorithm (Another form)


CURE Algorithm (for Large Databases)


Data Partitioning and Clustering s = 50

p = 2

s/p = 25

s/pq = 5

x x

x

y

y y

y

x

y

x


Cure: Shrinking Representative Points Shrink the multiple representative points towards the gravity center by

a fraction of .

Multiple representatives capture the shape of the cluster

x

y

x

y


Clustering Categorical Data: ROCK

ROCK: RObust Clustering using linKs, by S. Guha, R. Rastogi, K. Shim (ICDE’99).

Use links to measure similarity/proximity

Not distance-based with categorical attributes

Computational complexity:

Basic ideas (Jaccard coefficient):

Similarity function and neighbors:

Let T1 = {1,2,3}, T2={3,4,5}

O n nm m n nm a

( log )2 2

Sim T TT T

T T( , )

1 2

1 2

1 2

Sim T T( , ){ }

{ , , , , }.1 2

3

1 2 3 4 5

1

50 2


ROCK: An Example

Links: The number of common neighbors for the two points. Using Jaccard

Use Distances to determine neighbors

(pt1,pt4) = 0, (pt1,pt2) = 0, (pt1,pt3) = 0

(pt2,pt3) = 0.6, (pt2,pt4) = 0.2

(pt3,pt4) = 0.2

Use 0.2 as threshold for neighbors

Pt2 and Pt3 have 3 common neighbors



Resulting clusters (1), (2,3,4) which makes more sense


ROCK: Property & Algorithm

Links: The number of common neighbours for the two points.

Algorithm Draw random sample

Cluster with links (maybe agglomerative hierarchical)

Label data in disk

{1,2,3}, {1,2,4}, {1,2,5}, {1,3,4}, {1,3,5}

{1,4,5}, {2,3,4}, {2,3,5}, {2,4,5}, {3,4,5}

{1,2,3} {1,2,4} 3


CHAMELEON

CHAMELEON: hierarchical clustering using dynamic modeling, by G. Karypis, E.H. Han and V. Kumar’99

Measures the similarity based on a dynamic model

Two clusters are merged only if the interconnectivity and closeness (proximity) between two clusters are high relative to the internal interconnectivity of the clusters and closeness of items within the clusters

A two phase algorithm

1. Use a graph partitioning algorithm: cluster objects into a large number of relatively small sub-clusters

2. Use an agglomerative hierarchical clustering algorithm: find the genuine clusters by repeatedly combining these sub-clusters


Graph-based clustering

Sparsification techniques keep the connections to the most similar (nearest) neighbors of a point while breaking the connections to less similar points.

The nearest neighbors of a point tend to belong to the same class as the point itself.

This reduces the impact of noise and outliers and sharpens the distinction between clusters.


Overall Framework of CHAMELEON

Construct

Sparse Graph Partition the Graph

Merge Partition

Final Clusters

Data Set


Density-Based Clustering Methods

Clustering based on density (local cluster criterion), such as density-connected points

Major features: Discover clusters of arbitrary shape Handle noise One scan Need density parameters as termination condition

Several interesting studies: DBSCAN: Ester, et al. (KDD’96)

OPTICS: Ankerst, et al (SIGMOD’99).

DENCLUE: Hinneburg & D. Keim (KDD’98)

CLIQUE: Agrawal, et al. (SIGMOD’98)


Density-Based Clustering (Background)

Two parameters:

Eps: Maximum radius of the neighbourhood

MinPts: Minimum number of points in an Eps-neighbourhood of that point

NEps(p): {q belongs to D | dist(p,q) <= Eps}

Directly density-reachable: A point p is directly density -reachable from a point q wrt. Eps, MinPts if

1) p belongs to NEps(q)

2) core point condition:

|NEps (q)| >= MinPts

p

q

MinPts = 5

Eps = 1 cm


Density-Based Clustering (Background)

The neighborhood with in a radius Eps of a given object is call the neighborhood of the object.

If the neighborhood of an object contains at least a minimum number, MinPts, of objects, then the object is called a core object.

Given a set of objects, D, we say that an object p is directly density-reachable from object q if p is within the neighborhood of q, and q is a core object.

An object p is density-reachable from object q with respect to and MinPts in a set of objects D, if there is a chain of object p1, …, pn, where p1=q and pn=p such that p(i+1) is directly density-reachable from pi with respect to and MinPts, for 1≤i≤n, from pi D.

An object p is density-connected to object q with respect to


Density-Based Clustering

Density-reachable:

A point p is density-reachable from a point q wrt. Eps, MinPts if there is a chain of points p1, …, pn, p1 = q, pn = p such that pi+1 is directly density-reachable from pi

Density-connected

A point p is density-connected to a point q wrt. Eps, MinPts if there is a point o such that both, p and q are density-reachable from o wrt. Eps and MinPts.

p

q p1

p q

o


DBSCAN: Density Based Spatial Clustering of Applications with

Noise

Relies on a density-based notion of cluster: A cluster is defined as a maximal set of density-connected points

Discovers clusters of arbitrary shape in spatial databases with noise

Core

Border

Outlier

Eps = 1cm

MinPts = 5


DBSCAN: The Algorithm

Arbitrary select a point p

Retrieve all points density-reachable from p wrt Eps and MinPts.

If p is a core point, a cluster is formed.

If p is a border point, no points are density-reachable from p and

DBSCAN visits the next point of the database.

Continue the process until all of the points have been processed.

Technology

Dbm630 lecture09