
Chapter 8: Clustering


Page 1: Chapter 8:    Clustering

1

Chapter 8: Clustering

Page 2: Chapter 8:    Clustering

2

Searching for groups

Clustering is unsupervised (undirected): unlike classification, there is no pre-classified data. We search for groups, or clusters, of data points (records) that are similar to one another.

Similar points may mean, for example, similar customers or products that will behave in similar ways.

Page 3: Chapter 8:    Clustering

3

Group similar points together

Group points into classes using some distance measure; the key quantities are the within-cluster distance and the between-cluster distance.

Applications:
- As a stand-alone tool to get insight into the data distribution
- As a preprocessing step for other algorithms

Page 4: Chapter 8:    Clustering

4

An Illustration

Page 5: Chapter 8:    Clustering

5

Examples of Clustering Applications

Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs

Insurance: Identifying groups of motor insurance policy holders with some interesting characteristics.

City-planning: Identifying groups of houses according to their house type, value, and geographical location

Page 6: Chapter 8:    Clustering

6

Concepts of Clustering

Clusters: there are different ways of representing clusters
- Divisions with boundaries
- Spheres
- Probabilistic (e.g., a table listing, for items I1, ..., In, membership probabilities over clusters 1, 2, 3; for I1 these might be 0.5, 0.2, 0.3)
- Dendrograms
- ...

Page 7: Chapter 8:    Clustering

7

Clustering quality

A good clustering maximizes the inter-cluster distance and minimizes the intra-cluster distance.

The quality of a clustering result depends on both the similarity measure used by the method and its application.

The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.

Clustering vs. classification: which one is more difficult, and why? There is a huge number of clustering techniques.

Page 8: Chapter 8:    Clustering

8

Dissimilarity/Distance Measure

Dissimilarity/Similarity metric: similarity is expressed in terms of a distance function, which is typically a metric: d(i, j)

The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal and ratio variables.

Weights should be associated with different variables based on applications and data semantics.

It is hard to define “similar enough” or “good enough”. The answer is typically highly subjective.

Page 9: Chapter 8:    Clustering

9

Types of data in clustering analysis

Interval-scaled variables

Binary variables

Nominal, ordinal, and ratio variables

Variables of mixed types

Page 10: Chapter 8:    Clustering

10

Interval-valued variables

Continuous measurements on a roughly linear scale, e.g., weight, height, temperature, etc.

Standardize the data (depending on the application):

Calculate the mean absolute deviation

$$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \dots + |x_{nf} - m_f|\right)$$

where $m_f = \frac{1}{n}(x_{1f} + x_{2f} + \dots + x_{nf})$.

Calculate the standardized measurement (z-score)

$$z_{if} = \frac{x_{if} - m_f}{s_f}$$
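As an illustration (not part of the original slides), here is a minimal Python sketch of this standardization; the sample weights are hypothetical.

def standardize(values):
    # z-scores using the mean absolute deviation defined above
    n = len(values)
    m = sum(values) / n                          # mean m_f
    s = sum(abs(x - m) for x in values) / n      # mean absolute deviation s_f
    return [(x - m) / s for x in values]

print(standardize([64.0, 58.5, 70.2, 61.3]))     # hypothetical weights in kg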

Page 11: Chapter 8:    Clustering

11

Similarity Between Objects

Distance: measures the similarity or dissimilarity between two data objects. Some popular measures include the Minkowski distance:

$$d(i,j) = \sqrt[q]{|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \dots + |x_{ip} - x_{jp}|^q}$$

where $(x_{i1}, x_{i2}, \dots, x_{ip})$ and $(x_{j1}, x_{j2}, \dots, x_{jp})$ are two p-dimensional data objects, and q is a positive integer.

If q = 1, d is the Manhattan distance:

$$d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \dots + |x_{ip} - x_{jp}|$$

Page 12: Chapter 8:    Clustering

12

Similarity Between Objects (Cont.)

If q = 2, d is the Euclidean distance:

$$d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \dots + |x_{ip} - x_{jp}|^2}$$

Properties:
- d(i,j) >= 0
- d(i,i) = 0
- d(i,j) = d(j,i)
- d(i,j) <= d(i,k) + d(k,j)

Also, one can use weighted distance and many other similarity/distance measures.
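A small Python sketch of the Minkowski distance (added for illustration, not from the slides); the two 3-dimensional points are hypothetical.

def minkowski(x, y, q=2):
    # q = 1 gives the Manhattan distance, q = 2 the Euclidean distance
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

i = (1.0, 3.0, 5.0)          # hypothetical 3-dimensional objects
j = (4.0, 1.0, 5.0)
print(minkowski(i, j, q=1))  # Manhattan: 5.0
print(minkowski(i, j, q=2))  # Euclidean: ~3.61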

Page 13: Chapter 8:    Clustering

13

Binary Variables

A contingency table for binary data:

                 Object j
                 1        0        sum
  Object i  1    a        b        a + b
            0    c        d        c + d
            sum  a + c    b + d    p

Simple matching coefficient (invariant if the binary variable is symmetric):

$$d(i,j) = \frac{b + c}{a + b + c + d}$$

Jaccard coefficient (noninvariant if the binary variable is asymmetric):

$$d(i,j) = \frac{b + c}{a + b + c}$$

Page 14: Chapter 8:    Clustering

14

Dissimilarity of Binary Variables

Example

Gender is a symmetric attribute (not used below); the remaining attributes are asymmetric. Let the values Y and P be set to 1, and the value N be set to 0.

  Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
  Jack  M       Y      N      P       N       N       N
  Mary  F       Y      N      P       N       P       N
  Jim   M       Y      P      N       N       N       N

$$d(\mathrm{jack},\mathrm{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33$$
$$d(\mathrm{jack},\mathrm{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67$$
$$d(\mathrm{jim},\mathrm{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75$$
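A short Python sketch (illustrative, not from the slides) that reproduces these three values using the Jaccard coefficient for asymmetric binary variables.

def asym_binary_dissim(x, y):
    # Jaccard-style dissimilarity d(i,j) = (b + c) / (a + b + c)
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    return (b + c) / (a + b + c)

# Fever, Cough, Test-1, Test-2, Test-3, Test-4 (gender omitted: symmetric); Y/P -> 1, N -> 0
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]

print(round(asym_binary_dissim(jack, mary), 2))  # 0.33
print(round(asym_binary_dissim(jack, jim), 2))   # 0.67
print(round(asym_binary_dissim(jim, mary), 2))   # 0.75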

Page 15: Chapter 8:    Clustering

15

Nominal Variables

A generalization of the binary variable in that it can take more than two states, e.g., red, yellow, blue, green, etc.

Method 1: simple matching, where m is the number of matches and p is the total number of variables:

$$d(i,j) = \frac{p - m}{p}$$

Method 2: use a large number of binary variables, creating a new binary variable for each of the M nominal states.
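A one-function Python sketch of the simple-matching dissimilarity; the color/size/shape records are hypothetical examples.

def nominal_dissim(x, y):
    # simple matching: d(i,j) = (p - m) / p
    p = len(x)
    m = sum(1 for u, v in zip(x, y) if u == v)   # number of matching states
    return (p - m) / p

print(nominal_dissim(["red", "small", "round"],
                     ["red", "large", "round"]))  # 1/3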

Page 16: Chapter 8:    Clustering

16

Ordinal Variables

An ordinal variable can be discrete or continuous; order is important, e.g., rank.

It can be treated like an interval-scaled variable (f is a variable):
- replace each $x_{if}$ by its rank $r_{if} \in \{1, \dots, M_f\}$
- map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
  $$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$
- compute the dissimilarity using methods for interval-scaled variables
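A small Python sketch of this rank-to-[0, 1] mapping; the 4-level rating scale is a hypothetical example.

def ordinal_to_interval(ranks, M):
    # z_if = (r_if - 1) / (M_f - 1), with ranks r_if in {1, ..., M_f}
    return [(r - 1) / (M - 1) for r in ranks]

# hypothetical ranks on a 4-level scale (poor < fair < good < excellent)
print(ordinal_to_interval([1, 2, 4, 3], M=4))   # [0.0, 0.333..., 1.0, 0.666...]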

Page 17: Chapter 8:    Clustering

17

Ratio-Scaled Variables

Ratio-scaled variable: a measurement on a nonlinear scale, approximately exponential, such as $Ae^{Bt}$ or $Ae^{-Bt}$, e.g., the growth of a bacteria population.

Methods:
- treat them like interval-scaled variables: not a good idea! (why? the scale can be distorted)
- apply a logarithmic transformation: $y_{if} = \log(x_{if})$
- treat them as continuous ordinal data and then treat their ranks as interval-scaled

Page 18: Chapter 8:    Clustering

18

Variables of Mixed Types

A database may contain all six types of variables: symmetric binary, asymmetric binary, nominal, ordinal, interval, and ratio.

One may use a weighted formula to combine their effects:

$$d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)}\, d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$

- If f is binary or nominal: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, otherwise $d_{ij}^{(f)} = 1$
- If f is interval-based: use the normalized distance
- If f is ordinal or ratio-scaled: compute the ranks $r_{if}$, set $z_{if} = \frac{r_{if} - 1}{M_f - 1}$, and treat $z_{if}$ as interval-scaled
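A minimal Python sketch of this weighted combination (illustrative; the per-variable values below are hypothetical). Each variable f contributes an indicator delta_f (0 when the variable is missing, 1 otherwise) and a dissimilarity d_f in [0, 1].

def mixed_dissim(per_variable):
    # per_variable: list of (delta_f, d_f) pairs for one object pair (i, j)
    num = sum(delta * d for delta, d in per_variable)
    den = sum(delta for delta, _ in per_variable)
    return num / den if den else 0.0

# hypothetical pair: one nominal mismatch (d = 1), one normalized interval
# distance of 0.4, and one missing variable (delta = 0)
print(mixed_dissim([(1, 1.0), (1, 0.4), (0, 0.0)]))   # 0.7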

Page 19: Chapter 8:    Clustering

19

Major Clustering Techniques

Partitioning algorithms: Construct various partitions and then evaluate them by some criterion

Hierarchy algorithms: Create a hierarchical decomposition of the set of data (or objects) using some criterion

Density-based: based on connectivity and density functions

Model-based: A model is hypothesized for each of the clusters, and the idea is to find the best fit of the models to the data.

Page 20: Chapter 8:    Clustering

20

Partitioning Algorithms: Basic Concept

Partitioning method: Construct a partition of a database D of n objects into a set of k clusters

Given k, find a partition into k clusters that optimizes the chosen partitioning criterion:
- Global optimum: exhaustively enumerate all partitions
- Heuristic methods: the k-means and k-medoids algorithms
  - k-means: each cluster is represented by the center of the cluster
  - k-medoids or PAM (Partitioning Around Medoids): each cluster is represented by one of the objects in the cluster

Page 21: Chapter 8:    Clustering

21

The K-Means Clustering

Given k, the k-means algorithm is as follows:
1) Choose k cluster centers to coincide with k randomly chosen points
2) Assign each data point to the closest cluster center
3) Recompute the cluster centers using the current cluster memberships
4) If a convergence criterion is not met, go to 2)

Typical convergence criteria are: no (or minimal) reassignment of data points to new cluster centers, or a minimal decrease in the squared error

$$E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2$$

where p is a point and $m_i$ is the mean of cluster $C_i$
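A minimal Python sketch of steps 1-4 (added for illustration; not an optimized implementation). Points are tuples, and convergence is declared when the centers stop changing.

import random

def dist2(p, q):
    # squared Euclidean distance between two tuples
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(cluster):
    n = len(cluster)
    return tuple(sum(xs) / n for xs in zip(*cluster))

def kmeans(points, k, max_iter=100):
    centers = random.sample(points, k)                    # 1) random seeds
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                                  # 2) assign to nearest center
            i = min(range(k), key=lambda c: dist2(p, centers[c]))
            clusters[i].append(p)
        new_centers = [mean(c) if c else centers[i]       # 3) recompute the means
                       for i, c in enumerate(clusters)]
        if new_centers == centers:                        # 4) convergence test
            break
        centers = new_centers
    return clusters, centers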

Page 22: Chapter 8:    Clustering

22

Example

For simplicity, 1-dimensional data and k = 2. Data: 1, 2, 5, 6, 7.

K-means:
- Randomly select 5 and 6 as the initial centroids => two clusters {1, 2, 5} and {6, 7}; meanC1 = 8/3, meanC2 = 6.5
- => {1, 2}, {5, 6, 7}; meanC1 = 1.5, meanC2 = 6 => no change
- Aggregate dissimilarity = 0.5^2 + 0.5^2 + 1^2 + 0^2 + 1^2 = 2.5
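Running the k-means sketch from the previous slide on this data (assuming that sketch is in scope) should reproduce the result above, although a different random seeding could in principle give a different local optimum.

data = [(1.0,), (2.0,), (5.0,), (6.0,), (7.0,)]
clusters, centers = kmeans(data, k=2)
E = sum(dist2(p, centers[i]) for i, c in enumerate(clusters) for p in c)
print(clusters, centers, E)   # expected: {1, 2} and {5, 6, 7}, means 1.5 and 6, E = 2.5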

Page 23: Chapter 8:    Clustering

23

Comments on K-Means

Strength: efficient: O(tkn), where n is the number of data points, k is the number of clusters, and t is the number of iterations. Normally, k, t << n.

Comment: often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms.

Weaknesses:
- Applicable only when a mean is defined; difficult for categorical data
- Need to specify k, the number of clusters, in advance
- Sensitive to noisy data and outliers
- Not suitable for discovering clusters with non-convex shapes
- Sensitive to the initial seeds

Page 24: Chapter 8:    Clustering

24

Variations of the K-Means Method

A few variants of k-means differ in:
- Selection of the initial k seeds
- Dissimilarity measures
- Strategies to calculate cluster means

Handling categorical data: k-modes
- Replacing the means of clusters with modes
- Using new dissimilarity measures to deal with categorical objects
- Using a frequency-based method to update the modes of clusters

Page 25: Chapter 8:    Clustering

25

k-Medoids clustering method

The k-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data.

Medoid: the most centrally located point in a cluster, used as a representative point of the cluster. In contrast, a centroid is not necessarily inside a cluster.

[Figure: an example showing initial medoids]

Page 26: Chapter 8:    Clustering

26

Partition Around Medoids

PAM:
1. Given k
2. Randomly pick k instances as the initial medoids
3. Assign each data point to the nearest medoid x
4. Calculate the objective function: the sum of dissimilarities of all points to their nearest medoids (squared-error criterion)
5. Randomly select a point y
6. Swap x with y if the swap reduces the objective function
7. Repeat (3-6) until no change
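A minimal Python sketch of this loop (illustrative). Here `dist` is any dissimilarity function, e.g. the Minkowski sketch from an earlier slide, and a fixed number of random swap trials stands in for the "repeat until no change" test.

import random

def total_cost(points, medoids, dist):
    # objective: sum of dissimilarities of all points to their nearest medoid
    return sum(min(dist(p, m) for m in medoids) for p in points)

def pam(points, k, dist, trials=500):
    medoids = random.sample(points, k)                        # 2) initial medoids
    best = total_cost(points, medoids, dist)
    for _ in range(trials):                                   # 5)-7) random swap attempts
        x = random.choice(medoids)
        y = random.choice([p for p in points if p not in medoids])
        candidate = [y if m == x else m for m in medoids]
        c = total_cost(points, candidate, dist)
        if c < best:                                          # 6) keep only improving swaps
            medoids, best = candidate, c
    return medoids, best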

Page 27: Chapter 8:    Clustering

27

Comments on PAM

PAM is more robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean (why?).

PAM works well for small data sets but does not scale well to large data sets: O(k(n-k)^2) for each change, where n is the number of data points and k is the number of clusters.

[Figure: an outlier 100 units away]

Page 28: Chapter 8:    Clustering

28

CLARA: Clustering Large Applications

CLARA is built into statistical analysis packages such as S+.

It draws multiple samples of the data set, applies PAM on each sample, and returns the best clustering as the output.

Strength: deals with larger data sets than PAM.

Weaknesses:
- Efficiency depends on the sample size
- A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased

There are other scale-up methods, e.g., CLARANS.
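A minimal Python sketch of this sample-then-PAM idea, reusing the `pam` and `total_cost` sketches from the previous slides; the sample size and number of samples are arbitrary choices, not values from the slides.

import random

def clara(points, k, dist, n_samples=5, sample_size=40):
    best_medoids, best_cost = None, float("inf")
    for _ in range(n_samples):
        sample = random.sample(points, min(sample_size, len(points)))
        medoids, _ = pam(sample, k, dist)          # cluster the sample with PAM
        c = total_cost(points, medoids, dist)      # evaluate on the whole data set
        if c < best_cost:
            best_medoids, best_cost = medoids, c
    return best_medoids, best_cost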

Page 29: Chapter 8:    Clustering

29

Hierarchical Clustering

Uses a distance matrix for clustering. This method does not require the number of clusters k as an input, but it needs a termination condition.

[Figure: objects a, b, c, d, e merged step by step from Step 0 to Step 4 (agglomerative, left to right) and split in the reverse order (divisive, right to left)]

Page 30: Chapter 8:    Clustering

30

Agglomerative Clustering

At the beginning, each data point forms a cluster (also called a node). Merge the nodes/clusters that have the least dissimilarity, and go on merging; eventually all nodes belong to the same cluster.

[Figure: three scatter plots on a 0-10 grid showing successive merging steps]
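A naive Python sketch of agglomerative (single-link) merging, added for illustration; it is meant only to show the merge loop, not to be efficient. Here `dist` is any dissimilarity function, e.g. the Minkowski sketch from an earlier slide.

def single_link(c1, c2, dist):
    # smallest pairwise distance between two clusters
    return min(dist(p, q) for p in c1 for q in c2)

def agglomerative(points, n_clusters, dist):
    clusters = [[p] for p in points]              # each point starts as its own cluster
    while len(clusters) > n_clusters:
        pairs = ((i, j) for i in range(len(clusters))
                        for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: single_link(clusters[ij[0]],
                                                     clusters[ij[1]], dist))
        clusters[i].extend(clusters[j])           # merge the closest pair
        del clusters[j]
    return clusters

# usage, assuming the earlier minkowski sketch:
# agglomerative(points, 3, lambda p, q: minkowski(p, q, 2))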

Page 31: Chapter 8:    Clustering

31

A Dendrogram Shows How the Clusters are Merged Hierarchically

Decompose the data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram.

A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.
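For comparison, a SciPy-based sketch of building and cutting a dendrogram (assuming SciPy is installed); the data reuses the 1-D example from the k-means slide, and the exact label numbering may differ.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0], [2.0], [5.0], [6.0], [7.0]])   # 1-D example data
Z = linkage(X, method="single")                     # agglomerative, single link
labels = fcluster(Z, t=2.0, criterion="distance")   # cut the dendrogram at height 2
print(labels)                                       # e.g. [1 1 2 2 2]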

Page 32: Chapter 8:    Clustering

32

Divisive Clustering

Inverse order of agglomerative clustering

Eventually each node forms a cluster on its own

[Figure: three scatter plots on a 0-10 grid showing successive splitting steps]

Page 33: Chapter 8:    Clustering

33

More on Hierarchical Methods

Major weaknesses of agglomerative clustering methods:
- They do not scale well: time complexity is at least O(n^2), where n is the total number of objects
- They can never undo what was done previously

Integration of hierarchical with distance-based clustering to scale up these clustering methods:
- BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of sub-clusters
- CURE (1998): selects well-scattered points from the cluster and then shrinks them towards the center of the cluster by a specified fraction

Page 34: Chapter 8:    Clustering

34

Summary

Cluster analysis groups objects based on their similarity and has wide applications.

Measures of similarity can be computed for various types of data.

Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, etc.

Clustering can also be used for outlier detection, which is useful for applications such as fraud detection.

What is the best clustering algorithm?