
Page 1

Data Mining: Clustering

Hamid Beigy

Sharif University of Technology

Fall 1396

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 1 / 41

Page 2

Table of contents

1 Introduction

2 Data matrix and dissimilarity matrix

3 Proximity Measures

4 Clustering methods: Partitioning methods, Hierarchical methods, Model-based clustering, Density based clustering, Grid-based clustering

5 Cluster validation and assessment

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 2 / 41

Page 3

Table of contents

1 Introduction

2 Data matrix and dissimilarity matrix

3 Proximity Measures

4 Clustering methods: Partitioning methods, Hierarchical methods, Model-based clustering, Density based clustering, Grid-based clustering

5 Cluster validation and assessment

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 3 / 41

Page 4

Introduction

Clustering is the process of grouping a set of data objects into multiple groups or clusters so that objects within a cluster have high similarity, but are very dissimilar to objects in other clusters.

Dissimilarities and similarities are assessed based on the attribute values describing the objects and often involve distance measures.

Clustering as a data mining tool has its roots in many application areas such as biology, security, business intelligence, and Web search.

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 3 / 41

Page 5

Requirements for cluster analysis

Clustering is a challenging research field and the following are its typical requirements.

Scalability
Ability to deal with different types of attributes
Discovery of clusters with arbitrary shape
Requirements for domain knowledge to determine input parameters
Ability to deal with noisy data
Incremental clustering and insensitivity to input order
Capability of clustering high-dimensionality data
Constraint-based clustering
Interpretability and usability

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 4 / 41

Page 6

Comparing clustering methods

The clustering methods can be compared using the following aspects:

The partitioning criteria: In some methods, all the objects are partitioned so that no hierarchy exists among the clusters.

Separation of clusters: In some methods, data are partitioned into mutually exclusive clusters, while in other methods the clusters may not be exclusive; that is, a data object may belong to more than one cluster.

Similarity measure: Some methods determine the similarity between two objects by the distance between them, while in other methods the similarity may be defined by connectivity based on density or contiguity.

Clustering space: Many clustering methods search for clusters within the entire data space. These methods are useful for low-dimensional data sets. With high-dimensional data, however, there can be many irrelevant attributes, which can make similarity measurements unreliable. Consequently, clusters found in the full space are often meaningless. It is often better to instead search for clusters within different subspaces of the same data set.

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 5 / 41

Page 7

Table of contents

1 Introduction

2 Data matrix and dissimilarity matrix

3 Proximity Measures

4 Clustering methods: Partitioning methods, Hierarchical methods, Model-based clustering, Density based clustering, Grid-based clustering

5 Cluster validation and assessment

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 6 / 41

Page 8

Data matrix and dissimilarity matrix

Suppose that we have n objects described by p attributes. The objects are x1 = (x11, x12, . . . , x1p), x2 = (x21, x22, . . . , x2p), and so on, where xij is the value for object xi of the j-th attribute. For brevity, we hereafter refer to object xi as object i. The objects may be tuples in a relational database, and are also referred to as data samples or feature vectors. Main memory-based clustering and nearest-neighbor algorithms typically operate on either of the following two data structures:

Data matrix: This structure stores the n objects in the form of a table or n × p matrix:

$$\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}$$

Dissimilarity matrix: This structure stores a collection of proximities that are available for all pairs of objects. It is often represented by an n × n matrix or table:

$$\begin{bmatrix}
0 & d(1,2) & d(1,3) & \cdots & d(1,n) \\
d(2,1) & 0 & d(2,3) & \cdots & d(2,n) \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
d(n,1) & d(n,2) & d(n,3) & \cdots & 0
\end{bmatrix}$$
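As a concrete illustration of how the two structures relate, here is a minimal sketch (assuming NumPy is available; the sample data matrix X and the helper name dissimilarity_matrix are illustrative) that derives the n × n dissimilarity matrix from an n × p data matrix using Euclidean distance.

```python
import numpy as np

# Hypothetical n x p data matrix: 4 objects described by 2 attributes.
X = np.array([[1.0, 2.0],
              [2.0, 1.5],
              [8.0, 8.5],
              [9.0, 9.0]])

def dissimilarity_matrix(X):
    """Return the n x n matrix of pairwise Euclidean distances d(i, j)."""
    diff = X[:, None, :] - X[None, :, :]      # shape (n, n, p)
    return np.sqrt((diff ** 2).sum(axis=-1))  # d(i, i) = 0 on the diagonal

D = dissimilarity_matrix(X)
print(np.round(D, 2))  # symmetric, zero diagonal, as in the matrix above
```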

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 6 / 41

Page 9

Table of contents

1 Introduction

2 Data matrix and dissimilarity matrix

3 Proximity Measures

4 Clustering methods: Partitioning methods, Hierarchical methods, Model-based clustering, Density based clustering, Grid-based clustering

5 Cluster validation and assessment

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 7 / 41

Page 10

Proximity Measures

Proximity measures for nominal attributes: Let the number of states of a nominal attribute be M. The dissimilarity between two objects i and j can be computed based on the ratio of mismatches:

$$d(i, j) = \frac{p - m}{p}$$

where m is the number of matches and p is the total number of attributes describing the objects.

Proximity measures for binary attributes: Binary attributes are either symmetric or asymmetric.

Alternatively, similarity can be computed as

$$sim(i, j) = 1 - d(i, j) = \frac{m}{p}$$

Proximity between objects described by nominal attributes can also be computed with an alternative encoding: a nominal attribute can be encoded using asymmetric binary attributes by creating a new binary attribute for each of its M states. For an object with a given state value, the binary attribute representing that state is set to 1, while the remaining binary attributes are set to 0. For example, to encode the nominal attribute map color, a binary attribute can be created for each possible color; for an object whose color is yellow, the yellow attribute is set to 1 and the remaining attributes are set to 0. Proximity for this encoding can then be computed with the binary measures below.

Recall that a binary attribute has only two states, 0 and 1, where 0 means that the attribute is absent and 1 means that it is present. Given the attribute smoker describing a patient, for instance, 1 indicates that the patient smokes, while 0 indicates that the patient does not. Treating binary attributes as if they were numeric can be misleading, so methods specific to binary data are needed for computing dissimilarity.

If all binary attributes are thought of as having the same weight, we have the 2 × 2 contingency table of Table 2.3, where q is the number of attributes that equal 1 for both objects i and j, r is the number of attributes that equal 1 for object i but 0 for object j, s is the number of attributes that equal 0 for object i but 1 for object j, and t is the number of attributes that equal 0 for both objects. The total number of attributes is p = q + r + s + t.

For symmetric binary attributes, each state is equally valuable; dissimilarity based on symmetric binary attributes is called symmetric binary dissimilarity.

Table 2.3: Contingency table for binary attributes

                      Object j
                      1        0        sum
  Object i   1        q        r        q + r
             0        s        t        s + t
             sum      q + s    r + t    p

If objects i and j are described by symmetric binary attributes, their dissimilarity is calculated as

$$d(i, j) = \frac{r + s}{q + r + s + t}$$

For asymmetric binary attributes, where the number of negative matches, t, is unimportant and the number of positive matches, q, is important, dissimilarity is calculated as

$$d(i, j) = \frac{r + s}{q + r + s}$$

The coefficient 1 − d(i, j) is called the Jaccard coefficient.

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 7 / 41
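For illustration, a minimal sketch of how q, r, s, t and the two dissimilarities above could be computed for a pair of objects described by 0/1 attributes; the vectors and the function name are illustrative, not part of the original slides.

```python
def binary_dissimilarity(x, y, asymmetric=False):
    """Dissimilarity between two objects described by binary (0/1) attributes."""
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)  # positive matches
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    t = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)  # negative matches
    if asymmetric:
        return (r + s) / (q + r + s)          # 1 - Jaccard coefficient
    return (r + s) / (q + r + s + t)          # symmetric binary dissimilarity

i = [1, 0, 1, 0, 1, 0]
j = [1, 1, 0, 0, 1, 0]
print(binary_dissimilarity(i, j))                   # symmetric: 2/6
print(binary_dissimilarity(i, j, asymmetric=True))  # asymmetric (1 - Jaccard): 2/4
```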

Page 11

Proximity Measures (cont.)

Dissimilarity of numeric attributes:

The most popular distance measure is Euclidean distance

$$d(i, j) = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{ip} - x_{jp})^2}$$

Another well-known measure is Manhattan distance

$$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$$

Minkowski distance is a generalization of Euclidean and Manhattan distances

$$d(i, j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h}$$

Dissimilarity of ordinal attributes: We first replace each x_{if} by its corresponding rank r_{if} ∈ {1, . . . , M_f} and then normalize it using

$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$

Dissimilarity can then be computed by applying the distance measures for numeric attributes to the z_{if}.
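A minimal sketch of the Minkowski distance (h = 1 gives Manhattan, h = 2 gives Euclidean) and of the rank normalization used for ordinal attributes; the function names and sample values are illustrative.

```python
def minkowski(x, y, h=2):
    """Minkowski distance between two numeric vectors; h=1 Manhattan, h=2 Euclidean."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1.0 / h)

def normalize_ordinal(rank, M):
    """Map a rank r in {1, ..., M} to z = (r - 1) / (M - 1) in [0, 1]."""
    return (rank - 1) / (M - 1)

x, y = [1.0, 3.0, 5.0], [2.0, 1.0, 4.0]
print(minkowski(x, y, h=1), minkowski(x, y, h=2))
# Ordinal attribute with M = 3 states (e.g. low < medium < high), compared as numerics:
print(abs(normalize_ordinal(3, 3) - normalize_ordinal(1, 3)))
```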

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 8 / 41

Page 12

Proximity Measures (cont.)

Dissimilarity for attributes of mixed types: A preferable approach is to process all attribute types together, performing a single analysis:

$$d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)}\, d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$

where the indicator $\delta_{ij}^{(f)} = 0$ if either

(1) x_{if} or x_{jf} is missing, or
(2) x_{if} = x_{jf} = 0 and attribute f is asymmetric binary,

and otherwise $\delta_{ij}^{(f)} = 1$.

The distance $d_{ij}^{(f)}$ is computed based on the type of attribute f.
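A sketch of the combined dissimilarity above for objects with mixed attribute types; the per-attribute distance rules follow the numeric, nominal, and asymmetric-binary measures from the previous slides, and the attribute-specification format (types, ranges) is an illustrative convention, not from the slides.

```python
def mixed_dissimilarity(x, y, types, ranges):
    """d(i, j) = sum_f delta_ij^(f) d_ij^(f) / sum_f delta_ij^(f).

    types[f] is 'numeric', 'nominal', or 'asym_binary'; ranges[f] is the value
    range used to normalize numeric attributes (illustrative convention).
    """
    num, den = 0.0, 0.0
    for f, (a, b) in enumerate(zip(x, y)):
        if a is None or b is None:                      # missing value: delta = 0
            continue
        if types[f] == 'asym_binary' and a == 0 and b == 0:
            continue                                    # negative match ignored: delta = 0
        if types[f] == 'numeric':
            d = abs(a - b) / ranges[f]                  # normalized numeric distance
        else:                                           # nominal or asymmetric binary
            d = 0.0 if a == b else 1.0
        num += d
        den += 1.0
    return num / den if den > 0 else 0.0

x = [5.0, 'red', 1]
y = [2.0, 'blue', 0]
print(mixed_dissimilarity(x, y, ['numeric', 'nominal', 'asym_binary'], [10.0, None, None]))
```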

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 9 / 41

Page 13

Table of contents

1 Introduction

2 Data matrix and dissimilarity matrix

3 Proximity Measures

4 Clustering methods: Partitioning methods, Hierarchical methods, Model-based clustering, Density based clustering, Grid-based clustering

5 Cluster validation and assessment

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 10 / 41

Page 14

Clustering methods

There are many clustering algorithms in the literature. It is difficult to provide a crisp categorization of clustering methods because these categories may overlap, so that a method may have features from several categories. In general, the major fundamental clustering methods can be classified into the following categories.

Grid-based methods: Grid-based methods quantize the object space into a finite number of cells that form a grid structure. All the clustering operations are performed on the grid structure (i.e., on the quantized space). The main advantage of this approach is its fast processing time, which is typically independent of the number of data objects and depends only on the number of cells in each dimension of the quantized space. Grids are often an efficient approach to many spatial data mining problems, including clustering, and grid-based methods can be integrated with other clustering methods such as density-based and hierarchical methods.

Some clustering algorithms integrate the ideas of several clustering methods, so it is sometimes difficult to classify a given algorithm as uniquely belonging to only one category; furthermore, some applications may have clustering criteria that require the integration of several clustering techniques. The major methods are summarized below (Figure 10.1: overview of clustering methods).

Partitioning methods – Find mutually exclusive clusters of spherical shape; distance-based; may use the mean or medoid (etc.) to represent a cluster center; effective for small- to medium-size data sets.

Hierarchical methods – Clustering is a hierarchical decomposition (i.e., multiple levels); cannot correct erroneous merges or splits; may incorporate other techniques such as microclustering, or consider object "linkages".

Density-based methods – Can find arbitrarily shaped clusters; clusters are dense regions of objects in space separated by low-density regions; each point must have a minimum number of points within its "neighborhood"; may filter out outliers.

Grid-based methods – Use a multiresolution grid data structure; fast processing time (typically independent of the number of data objects, yet dependent on grid size).

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 10 / 41

Page 15

Partitioning methods

The simplest and most fundamental version of cluster analysis is partitioning, which organizes the objects of a set into several exclusive groups or clusters.

Formally, given a data set, D, of n objects, and k, the number of clusters to form, a partitioning algorithm organizes the objects into k partitions (k ≤ n), where each partition represents a cluster.

The clusters are formed to optimize an objective partitioning criterion, such as a dissimilarity function based on distance, so that the objects within a cluster are similar to one another and dissimilar to objects in other clusters in terms of the data set attributes.

Figure 10.3: Clustering of a set of objects using the k-means method; panels show (a) the initial clustering, (b) an iteration in which cluster centers are updated and objects reassigned accordingly (the mean of each cluster is marked by a +), and (c) the final clustering.

Example (clustering by k-means partitioning): Consider a set of objects located in 2-D space, as depicted in Figure 10.3(a), and let k = 3; that is, the user would like the objects to be partitioned into three clusters.

Three objects are arbitrarily chosen as the three initial cluster centers, marked by a +. Each object is assigned to the cluster whose center is nearest to it. Such a distribution forms the silhouettes encircled by dotted curves in Figure 10.3(a).

Next, the cluster centers are updated: the mean value of each cluster is recalculated based on the current objects in the cluster. Using the new cluster centers, the objects are redistributed to the clusters based on which cluster center is nearest. Such a redistribution forms new silhouettes, encircled by dashed curves in Figure 10.3(b).

This process iterates, leading to Figure 10.3(c). The process of iteratively reassigning objects to clusters to improve the partitioning is referred to as iterative relocation. Eventually, no reassignment of the objects in any cluster occurs, the process terminates, and the resulting clusters are returned.

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 11 / 41

Page 16

k-Means clustering algorithm

Suppose a data set, D, contains n objects in Euclidean space. Partitioning methods distribute the objects in D into k clusters, C1, . . . , Ck, that is, Ci ⊂ D and Ci ∩ Cj = ∅ for 1 ≤ i, j ≤ k with i ≠ j.

An objective function is used to assess the partitioning quality so that objects within a cluster are similar to one another but dissimilar to objects in other clusters.

That is, the objective function aims for high intracluster similarity and low intercluster similarity.

A centroid-based partitioning technique uses the centroid of a cluster, Ci, to represent that cluster.

The difference between an object p ∈ Ci and µi, the representative of the cluster, is measured by ||p − µi||. The quality of cluster Ci can be measured by the within-cluster variation, which is the sum of squared errors between all objects in Ci and the centroid µi, defined as

$$E = \sum_{i=1}^{k} \sum_{p \in C_i} \lVert p - \mu_i \rVert^2$$
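A minimal sketch of centroid-based partitioning (Lloyd's k-means) that minimizes the within-cluster variation E above; NumPy is assumed to be available, and the initialization scheme and names are illustrative rather than the slides' own code.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's k-means: returns cluster labels, centroids, and the SSE E."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]   # arbitrary initial centers
    for _ in range(n_iter):
        # Assignment step: each object goes to its nearest centroid.
        d = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        # Update step: recompute each centroid as the mean of its cluster.
        new_mu = np.array([X[labels == i].mean(axis=0) if np.any(labels == i) else mu[i]
                           for i in range(k)])
        if np.allclose(new_mu, mu):       # no centroid moved: converged
            break
        mu = new_mu
    E = sum(np.sum((X[labels == i] - mu[i]) ** 2) for i in range(k))
    return labels, mu, E

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, mu, E = kmeans(X, k=2)
print(mu, E)
```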

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 12 / 41

Page 17

k-Means clustering algorithm (cont.)

Figure 13.1 (K-means in one dimension) on the dataset {2, 3, 4, 10, 11, 12, 20, 25, 30}: the cluster means evolve over the iterations as t = 1: µ1 = 2, µ2 = 4; t = 2: µ1 = 2.5, µ2 = 16; t = 3: µ1 = 3, µ2 = 18; t = 4: µ1 = 4.75, µ2 = 19.60; t = 5: µ1 = 7, µ2 = 25 (converged).

Example 13.2 (K-means in two dimensions): In Figure 13.2 the K-means algorithm is illustrated on the Iris dataset, using the first two principal components as the two dimensions. Iris has n = 150 points, and we want to find k = 3 clusters, corresponding to the three types of Irises. A random initialization of the cluster means yields

µ1 = (−0.98, −1.24)^T,  µ2 = (−2.96, 1.16)^T,  µ3 = (−1.69, −0.80)^T

as shown in Figure 13.2a. With these initial clusters, K-means takes eight iterations to converge. Figure 13.2b shows the clusters and their means after one iteration:

µ1 = (1.56, −0.08)^T,  µ2 = (−2.86, 0.53)^T,  µ3 = (−1.50, −0.05)^T

Finally, Figure 13.2c shows the clusters on convergence. The final means are as follows:

µ1 = (2.64, 0.19)^T,  µ2 = (−2.35, 0.27)^T,  µ3 = (−0.66, −0.33)^T

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 13 / 41

Page 18

k-Means clustering algorithm (cont.)

The k-means method is not guaranteed to converge to the global optimum and often terminates at a local optimum.

The results may depend on the initial random selection of cluster centers.

To obtain good results in practice, it is common to run the k-means algorithm multiple times with different initial cluster centers.

The time complexity of the k-means algorithm is O(nkt), where n is the total number of objects, k is the number of clusters, and t is the number of iterations.

Normally, k ≪ n and t ≪ n. Therefore, the method is relatively scalable and efficient in processing large data sets.

There are several variants of the k-means method. These can differ in the selection of the initial k means, the calculation of dissimilarity, and the strategies for calculating cluster means.

The k-modes method is a variant of k-means that extends the k-means paradigm to cluster nominal data by replacing the means of clusters with modes. The partitioning around medoids (PAM) algorithm is a realization of the k-medoids method (which reduces sensitivity to outliers).

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 14 / 41

Page 19

Hierarchical methods

A hierarchical clustering method works by grouping data objects into a hierarchy or tree of clusters.

Figure 10.6: Agglomerative (AGNES) and divisive (DIANA) hierarchical clustering on data objects {a, b, c, d, e}; the agglomerative method merges clusters step by step (steps 0–4), while the divisive method splits them in the opposite direction.

Figure 10.7: Dendrogram representation for hierarchical clustering of data objects {a, b, c, d, e}, with levels l = 0, . . . , 4 shown against a similarity scale from 1.0 down to 0.0.

In AGNES, the agglomerative method, clusters are merged according to the minimum distance between objects belonging to different clusters. This is a single-linkage approach, in that each cluster is represented by all the objects in the cluster and the similarity between two clusters is measured by the similarity of the closest pair of data points belonging to different clusters. The cluster-merging process repeats until all the objects are eventually merged to form one cluster.

DIANA, the divisive method, proceeds in the contrasting way. All the objects are used to form one initial cluster. The cluster is split according to some principle, such as the maximum Euclidean distance between the closest neighboring objects in the cluster. The cluster-splitting process repeats until, eventually, each new cluster contains only a single object.

A tree structure called a dendrogram is commonly used to represent the process of hierarchical clustering. It shows how objects are grouped together (in an agglomerative method) or partitioned (in a divisive method) step by step. Figure 10.7 shows a dendrogram for the five objects of Figure 10.6, where l = 0 shows the five objects as singleton clusters at level 0, and at l = 1 objects a and b are grouped together to form the first cluster.

Hierarchical clustering methods:

Agglomerative hierarchical clustering
Divisive hierarchical clustering

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 15 / 41

Page 20

Distance measures in hierarchical methods

Whether using an agglomerative method or a divisive method, a core need is to measure the distance between two clusters, where each cluster is generally a set of objects.

Four widely used measures for the distance between clusters are given below, where |p − q| is the distance between two objects or points p and q, µi is the mean of cluster Ci, and ni is the number of objects in Ci. They are also known as linkage measures.

Minimum distance:

$$d_{min}(C_i, C_j) = \min_{p \in C_i,\, q \in C_j} \{|p - q|\}$$

Maximum distance:

$$d_{max}(C_i, C_j) = \max_{p \in C_i,\, q \in C_j} \{|p - q|\}$$

Mean distance:

$$d_{mean}(C_i, C_j) = |\mu_i - \mu_j|$$

Average distance:

$$d_{avg}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{p \in C_i,\, q \in C_j} |p - q|$$
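A sketch of the four linkage measures for two clusters of numeric objects; NumPy is assumed to be available and the function name is illustrative.

```python
import numpy as np

def linkage_distances(Ci, Cj):
    """Return (d_min, d_max, d_mean, d_avg) between two clusters of points."""
    Ci, Cj = np.asarray(Ci), np.asarray(Cj)
    pair = np.linalg.norm(Ci[:, None, :] - Cj[None, :, :], axis=-1)  # all |p - q|
    d_min = pair.min()                                     # single linkage
    d_max = pair.max()                                     # complete linkage
    d_mean = np.linalg.norm(Ci.mean(axis=0) - Cj.mean(axis=0))
    d_avg = pair.mean()                                    # average linkage
    return d_min, d_max, d_mean, d_avg

Ci = [[0.0, 0.0], [1.0, 0.0]]
Cj = [[4.0, 0.0], [6.0, 0.0]]
print(linkage_distances(Ci, Cj))   # (3.0, 6.0, 4.5, 4.5)
```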

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 16 / 41

Page 21

Hierarchical methods

Figure 10.6: Agglomerative (AGNES) and divisive (DIANA) hierarchical clustering on data objects {a, b, c, d, e}.

Figure 10.7: Dendrogram representation for hierarchical clustering of data objects {a, b, c, d, e}.

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 17 / 41

Page 22

Model-based clustering

k-means is closely related to a probabilistic model known as the Gaussian mixture model.

$$p(x) = \sum_{k=1}^{K} \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)$$

πk, µk, Σk are parameters. The πk are called mixing proportions and each Gaussian is called a mixture component.

The model is simply a weighted sum of Gaussians. But it is much more powerful than a single Gaussian, because it can model multi-modal distributions.

Figures (Gaussian mixture model examples): a mixture of three Gaussians; a single Gaussian fit to some data compared with a Gaussian mixture fit to the same data.

Note that for p(x) to be a probability distribution, we require that ∑k πk = 1 and that πk > 0 for all k. Thus, we may interpret the πk as probabilities themselves.

The set of parameters is θ = {{πk}, {µk}, {Σk}}.

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 18 / 41
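To make the mixture density concrete, here is a small sketch that evaluates p(x) = ∑k πk N(x | µk, Σk) for a given parameter set θ; SciPy's multivariate normal density is assumed to be available, and the parameter values are purely illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative parameters theta = {pi_k, mu_k, Sigma_k} for K = 2 components in 2-D.
pis = [0.4, 0.6]
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), 0.5 * np.eye(2)]

def mixture_density(x, pis, mus, Sigmas):
    """p(x) = sum_k pi_k N(x | mu_k, Sigma_k)."""
    return sum(pi * multivariate_normal.pdf(x, mean=mu, cov=S)
               for pi, mu, S in zip(pis, mus, Sigmas))

print(mixture_density(np.array([1.0, 1.0]), pis, mus, Sigmas))
```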

Page 23

Model-based clustering (cont.)

Let us use a K-dimensional binary random variable z in which a particular element zk equals 1 and all other elements are 0.

The values of zk therefore satisfy zk ∈ {0, 1} and ∑k zk = 1.

We define the joint distribution p(x, z) in terms of a marginal distribution p(z) and a conditional distribution p(x | z). The marginal distribution over z is specified in terms of πk, such that

$$p(z_k = 1) = \pi_k$$

We can write this distribution in the form

$$p(z) = \prod_{k=1}^{K} \pi_k^{z_k}$$

The conditional distribution of x given a particular value for z is a Gaussian

$$p(x \mid z_k = 1) = \mathcal{N}(x \mid \mu_k, \Sigma_k)$$

which can also be written in the form

$$p(x \mid z) = \prod_{k=1}^{K} \mathcal{N}(x \mid \mu_k, \Sigma_k)^{z_k}$$

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 19 / 41

Page 24

Model-based clustering (cont.)

The marginal distribution of x equals

$$p(x) = \sum_{z} p(z)\, p(x \mid z) = \sum_{k=1}^{K} \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)$$

We can write p(zk = 1 | x) as

$$\gamma(z_k) = p(z_k = 1 \mid x) = \frac{p(z_k = 1)\, p(x \mid z_k = 1)}{p(x)} = \frac{p(z_k = 1)\, p(x \mid z_k = 1)}{\sum_{j=1}^{K} p(z_j = 1)\, p(x \mid z_j = 1)} = \frac{\pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x \mid \mu_j, \Sigma_j)}$$

We shall view πk as the prior probability of zk = 1, and the quantity γ(zk) as the corresponding posterior probability once we have observed x.

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 20 / 41

Page 25

Gaussian mixture model (example)

Figure 2.23: Illustration of a mixture of 3 Gaussians in a two-dimensional space. (a) Contours of constant density for each of the mixture components (shown in red, blue, and green), with mixing coefficients 0.5, 0.3, and 0.2 shown below each component. (b) Contours of the marginal probability density p(x) of the mixture distribution. (c) A surface plot of the distribution p(x).

Figure 9.5: Example of 500 points drawn from the mixture of 3 Gaussians shown in Figure 2.23. (a) Samples from the joint distribution p(z)p(x | z), in which the three states of z, corresponding to the three mixture components, are depicted in red, green, and blue (the complete data set). (b) The corresponding samples from the marginal distribution p(x), obtained by ignoring the values of z (the incomplete data set). (c) The same samples coloured according to the responsibilities γ(znk) associated with each data point.

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 21 / 41

Page 26

Model-based clustering (cont.)

Let X = {x1, . . . , xN} be drawn i.i.d. from a mixture of Gaussians. The log-likelihood of the observations equals

$$\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left[ \sum_{k=1}^{K} \pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right]$$

Setting the derivative of ln p(X | π, µ, Σ) with respect to µk equal to zero, we obtain

$$0 = -\sum_{n=1}^{N} \underbrace{\frac{\pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}}_{\gamma(z_{nk})} \Sigma_k (x_n - \mu_k)$$

Multiplying by $\Sigma_k^{-1}$ and then simplifying, we obtain

$$\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, x_n, \qquad N_k = \sum_{n=1}^{N} \gamma(z_{nk})$$

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 22 / 41

Page 27

Model-based clustering (cont.)

Setting the derivative of ln p(X | π, µ, Σ) with respect to Σk equal to zero, we obtain

$$\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k)(x_n - \mu_k)^T$$

We maximize ln p(X | π, µ, Σ) with respect to πk under the constraint ∑k πk = 1. This can be achieved using a Lagrange multiplier and maximizing the quantity

$$\ln p(X \mid \pi, \mu, \Sigma) + \lambda \left( \sum_{k=1}^{K} \pi_k - 1 \right)$$

which gives

$$0 = \sum_{n=1}^{N} \frac{\mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} + \lambda$$

If we now multiply both sides by πk and sum over k, making use of the constraint ∑k πk = 1, we find λ = −N. Using this to eliminate λ and rearranging, we obtain

$$\pi_k = \frac{N_k}{N}$$

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 23 / 41

Page 28

EM for Gaussian mixture models

1 Initialize µk, Σk, and πk, and evaluate the initial value of the log-likelihood.

2 E step: Evaluate γ(znk) using the current parameter values

$$\gamma(z_{nk}) = \frac{\pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}$$

3 M step: Re-estimate the parameters using the current values of γ(znk)

$$\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, x_n$$

$$\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k)(x_n - \mu_k)^T$$

$$\pi_k = \frac{N_k}{N}$$

where $N_k = \sum_{n=1}^{N} \gamma(z_{nk})$.

4 Evaluate the log-likelihood

$$\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left[ \sum_{k=1}^{K} \pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right]$$

and check for convergence of either the parameters or the log-likelihood. If the convergence criterion is not satisfied, return to step 2.
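A compact sketch of the E and M steps above for a Gaussian mixture; NumPy and SciPy's multivariate normal density are assumed to be available, and the initialization and regularization choices are deliberately simple, illustrative ones.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, tol=1e-6, seed=0):
    """EM for a Gaussian mixture model; returns (pis, mus, Sigmas, responsibilities)."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    pis = np.full(K, 1.0 / K)
    mus = X[rng.choice(n, size=K, replace=False)]        # simple initialization
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E step: responsibilities gamma(z_nk).
        dens = np.column_stack([pis[k] * multivariate_normal.pdf(X, mus[k], Sigmas[k])
                                for k in range(K)])
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M step: re-estimate mu_k, Sigma_k, pi_k.
        Nk = gamma.sum(axis=0)
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
        pis = Nk / n
        # Convergence check on the log-likelihood.
        ll = np.log(dens.sum(axis=1)).sum()
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return pis, mus, Sigmas, gamma

X = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + 4])
pis, mus, Sigmas, gamma = em_gmm(X, K=2)
print(pis, mus)
```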

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 24 / 41

Page 29

Model-based clustering (example)

Figure 9.8: Illustration of the EM algorithm for a mixture of two Gaussians applied to the rescaled Old Faithful data set (the same data used to illustrate K-means). The panels show the initial configuration and the mixture after L = 1, 2, 5, and 20 complete EM cycles; the centres are initialized with the same values as for the K-means illustration and the precision matrices are initialized proportional to the unit matrix. In the E step, the current parameter values are used to evaluate the posterior probabilities, or responsibilities; in the M step, these responsibilities are used to re-estimate the means, covariances, and mixing coefficients. Each update resulting from an E step followed by an M step is guaranteed to increase the log-likelihood; in practice, the algorithm is deemed to have converged when the change in the log-likelihood, or alternatively in the parameters, falls below some threshold.

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 25 / 41

Page 30

Density based clustering

The general idea of these methods is to continue growing a given cluster as long as the density in the neighborhood exceeds some threshold.

How can we find dense regions in density-based clustering?

The density of an object x can be measured by the number of objects close to x .

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds core objects, that is, objects that have dense neighborhoods.

It connects core objects and their neighborhoods to form dense regions as clusters.

Partitioning and hierarchical methods are designed to find spherical-shaped clusters and have difficulty finding clusters of arbitrary shape, such as the "S"-shaped and oval clusters in Figure 10.13; given such data, they would likely inaccurately identify convex regions in which noise or outliers are included in the clusters.

To find clusters of arbitrary shape, we can instead model clusters as dense regions in the data space separated by sparse regions. This is the main strategy behind density-based clustering methods, which can discover clusters of nonspherical shape; representative methods are DBSCAN, OPTICS, and DENCLUE.

Figure 10.13: Clusters of arbitrary shape.

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 26 / 41
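As a quick illustration that density-based methods recover non-spherical clusters, here is a short sketch using scikit-learn's DBSCAN implementation (assumed to be installed); eps and min_samples correspond to the ϵ and MinPts parameters introduced on the following slides, and the generated data are illustrative.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two ring-like (non-spherical) clusters plus a few noise points.
theta = np.linspace(0, 2 * np.pi, 200)
X = np.vstack([
    np.c_[np.cos(theta), np.sin(theta)],            # inner ring
    np.c_[3 * np.cos(theta), 3 * np.sin(theta)],    # outer ring
    np.random.uniform(-4, 4, size=(10, 2)),         # noise
])

labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)  # eps ~ epsilon, min_samples ~ MinPts
print(sorted(set(labels)))   # cluster ids; -1 marks noise points
```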

Page 31

Density based clustering (cont.)

How does DBSCAN quantify the neighborhood of an object?

A user-specified parameter ϵ > 0 is used to specify the radius of the neighborhood we consider for every object.

Definition (ϵ-neighborhood)

The ϵ-neighborhood of an object x is the space within a radius ϵ centered at x .

Due to the fixed neighborhood size parameterized by ϵ, the density of a neighborhood can be measured simply by the number of objects in the neighborhood.

Definition (Core object)

An object is a core object if the ϵ-neighborhood of the object contains at least MinPts objects.

Figure 15.1: A density-based dataset. Figure 15.2: (a) The ϵ-neighborhood of a point. (b) Core, border, and noise points.

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 27 / 41

Page 32

Density based clustering (cont.)

Given a set, D, of objects, we can identify all core objects with respect to the given parameters, ϵ and MinPts. The clustering task is reduced to using core objects and their neighborhoods to form dense regions, where the dense regions are clusters.

Definition (Directly density-reachable)

For a core object q and an object p, we say that p is directly density-reachable from q (with respect to ϵ and MinPts) if p is within the ϵ-neighborhood of q.

An object p is directly density-reachable from another object q if and only if q is a core object and p is in the ϵ-neighborhood of q.

Figure 10.14: Density-reachability and density-connectivity in density-based clustering (MinPts = 3). Source: based on Ester, Kriegel, Sander, and Xu [EKSX96].

If a spatial index is used, the computational complexity of DBSCAN is O(n log n), where n is the number of database objects; otherwise, the complexity is O(n²). With appropriate settings of the user-defined parameters ϵ and MinPts, the algorithm is effective at finding arbitrary-shaped clusters.

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 28 / 41

Page 33

Density based clustering (cont.)

How can we assemble a large dense region using small dense regions centered by core objects?

Definition (Density-reachable)

An object p is density-reachable from q (with respect to ϵ and MinPts in D) if there is a chain of objects p1, . . . , pn, such that p1 = q, pn = p, and pi+1 is directly density-reachable from pi with respect to ϵ and MinPts, for 1 ≤ i ≤ n, pi ∈ D.

Figure 10.14: Density-reachability and density-connectivity in density-based clustering (MinPts = 3).

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 29 / 41

Page 34

Density based clustering (cont.)

To connect core objects as well as their neighbors in a dense region, DBSCAN uses the notion of density-connectedness.

Definition (Density-connected)

Two objects p1, p2 ∈ D are density-connected with respect to ϵ and MinPts if there is an object q ∈ D such that both p1 and p2 are density-reachable from q with respect to ϵ and MinPts.

Figure 10.14: Density-reachability and density-connectivity in density-based clustering (MinPts = 3).

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 30 / 41

Page 35

Density based clustering (cont.)

How does DBSCAN find clusters? (A code sketch follows the steps below.)

1 Initially, all objects in data set D are marked as unvisited.

2 It randomly selects an unvisited object p, marks p as visited, and checks whether p is a core point or not.

3 If p is not a core point, then p is marked as a noise point. Otherwise, a new cluster C is created for p, and all the objects in the ϵ-neighborhood of p are added to a candidate set N.

4 DBSCAN iteratively adds to C those objects in N that do not belong to any cluster.

5 In this process, for an object p′ ∈ N that carries the label unvisited, DBSCAN marks it as visited and checks its ϵ-neighborhood.

6 If p′ is a core point, then those objects in its ϵ-neighborhood are added to N.

7 DBSCAN continues adding objects to C until C can no longer be expanded, that is, N is empty. At this time, cluster C is completed, and thus is output.

8 To find the next cluster, DBSCAN randomly selects an unvisited object from the remaining ones.

9 The clustering process continues until all objects are visited.
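A minimal from-scratch sketch of the steps above (not an optimized implementation); NumPy is assumed to be available and the function name and label conventions are illustrative.

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """DBSCAN following the steps above; -1 marks noise, cluster ids start at 0."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbors = [np.where(D[i] <= eps)[0] for i in range(n)]   # eps-neighborhoods
    labels = np.full(n, -2)        # -2 = unvisited, -1 = noise, >= 0 = cluster id
    cluster_id = -1
    for p in range(n):
        if labels[p] != -2:
            continue
        if len(neighbors[p]) < min_pts:      # not a core point (for now): noise
            labels[p] = -1
            continue
        cluster_id += 1                      # start a new cluster C at core point p
        labels[p] = cluster_id
        candidates = list(neighbors[p])      # candidate set N
        while candidates:
            q = candidates.pop()
            if labels[q] == -1:              # previously noise: becomes a border point
                labels[q] = cluster_id
            if labels[q] != -2:
                continue
            labels[q] = cluster_id
            if len(neighbors[q]) >= min_pts: # q is a core point: expand N
                candidates.extend(neighbors[q])
        # C can no longer be expanded; move on to the next unvisited object.
    return labels

X = np.vstack([np.random.randn(30, 2), np.random.randn(30, 2) + 8])
print(dbscan(X, eps=1.0, min_pts=4))
```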

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 31 / 41

Page 36

Density based clustering (example)

Figure 15.1: A density-based dataset. Figure 15.2: (a) The ϵ-neighborhood of a point. (b) Core, border, and noise points.

DBSCAN first computes the ϵ-neighborhood Nϵ(xi) of each point xi in the dataset D and checks whether it is a core point, initially leaving every point unassigned. Then, starting from each unassigned core point, it recursively finds all of that point's density-connected points and assigns them to the same cluster.

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 32 / 41


Grid-based clustering

The grid-based clustering approach uses a multiresolution grid data structure.


(Figure 10.19: Hierarchical structure for STING clustering, showing the first layer, the (i − 1)-st layer, and the i-th layer.)

The distribution type of a bottom-level cell can be supplied by the user if it is known beforehand or obtained by hypothesis tests such as the χ2 test. The type of distribution of a higher-level cell can be computed based on the majority of distribution types of its corresponding lower-level cells in conjunction with a threshold filtering process. If the distributions of the lower-level cells disagree with each other and fail the threshold test, the distribution type of the higher-level cell is set to none.

“How is this statistical information useful for query answering?” The statistical parameters can be used in a top-down, grid-based manner as follows. First, a layer within the hierarchical structure is determined from which the query-answering process is to start. This layer typically contains a small number of cells. For each cell in the current layer, we compute the confidence interval (or estimated probability range) reflecting the cell's relevancy to the given query. The irrelevant cells are removed from further consideration. Processing of the next lower level examines only the remaining relevant cells. This process is repeated until the bottom layer is reached. At this time, if the query specification is met, the regions of relevant cells that satisfy the query are returned. Otherwise, the data that fall into the relevant cells are retrieved and further processed until they meet the query's requirements.

An interesting property of STING is that it approaches the clustering result of DBSCAN if the granularity approaches 0 (i.e., toward very low-level data). In other words, using the count and cell size information, dense clusters can be identified approximately using STING. Therefore, STING can also be regarded as a density-based clustering method.

“What advantages does STING offer over other clustering methods?” STING offers several advantages: (1) the grid-based computation is query-independent because the statistical information stored in each cell represents the summary information of the data in the grid cell, independent of the query; (2) the grid structure facilitates parallel processing and incremental updating; and (3) the method's efficiency is a major advantage: STING goes through the database once to compute the statistical parameters of the cells, and hence the time complexity of generating clusters is O(n), where n is the total number of objects. After generating the hierarchical structure, the query processing time is O(g), where g is the total number of grid cells at the lowest level, which is usually much smaller than n.
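As a rough illustration of the last observation (dense clusters can be identified from cell counts), the following is a hedged, single-level sketch of grid-based clustering: it buckets 2-D points into square cells, keeps per-cell counts, and merges adjacent dense cells into clusters. It is not the full multiresolution STING structure; the function names, cell size, and threshold are illustrative.

```python
from collections import defaultdict, deque

def grid_cluster(points, cell_size, density_threshold):
    """Bucket 2-D points into square cells, keep cells with at least
    density_threshold points, and merge neighboring dense cells into clusters."""
    # One pass over the data: per-cell point lists (the "summary" of each cell).
    cells = defaultdict(list)
    for p in points:
        cells[(int(p[0] // cell_size), int(p[1] // cell_size))].append(p)

    dense = {c for c, pts in cells.items() if len(pts) >= density_threshold}

    def neighbors(cell):
        i, j = cell
        return [(i + di, j + dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)
                if (di, dj) != (0, 0)]

    # Breadth-first merge of adjacent dense cells.
    clusters, seen = [], set()
    for start in dense:
        if start in seen:
            continue
        seen.add(start)
        queue, members = deque([start]), []
        while queue:
            cell = queue.popleft()
            members.extend(cells[cell])
            for nb in neighbors(cell):
                if nb in dense and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        clusters.append(members)
    return clusters

# Two dense blobs plus one scattered point that falls in a sparse cell.
data = [(1.0, 1.1), (1.2, 0.9), (0.8, 1.0), (5.0, 5.1), (5.2, 4.9), (4.8, 5.0), (9.0, 0.5)]
print([len(c) for c in grid_cluster(data, cell_size=2.0, density_threshold=2)])  # [3, 3]
```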

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 33 / 41


Table of contents

1 Introduction

2 Data matrix and dissimilarity matrix

3 Proximity Measures

4 Clustering methods: Partitioning methods, Hierarchical methods, Model-based clustering, Density based clustering, Grid-based clustering

5 Cluster validation and assessment

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 34 / 41


Cluster validation and assessment

Cluster evaluation assesses the feasibility of clustering analysis on a data set and the quality of the results generated by a clustering method. The major tasks of clustering evaluation include the following:

1 Assessing clustering tendency: In this task, for a given data set, we assess whether a nonrandom structure exists in the data. Clustering analysis on a data set is meaningful only when there is a nonrandom structure in the data.

2 Determining the number of clusters in a data set: Algorithms such as k-means require the number of clusters in a data set as a parameter. Moreover, the number of clusters can be regarded as an interesting and important summary statistic of a data set. Therefore, it is desirable to estimate this number even before a clustering algorithm is used to derive detailed clusters. A simple method is to set the number of clusters to about √(n/2) for a data set of n points.

3 Measuring clustering quality: After applying a clustering method on a data set, we want to assess how good the resulting clusters are. There are also measures that score clusterings and thus can compare two sets of clustering results on the same data set.

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 34 / 41


Assessing clustering tendency

(Figure 10.21: A data set that is uniformly distributed in the data space.)

The Hopkins statistic tests the spatial randomness of a variable as distributed in a space. Given a data set D, regarded as a sample of a random variable o, we want to determine how far away o is from being uniformly distributed in the data space. We calculate the Hopkins statistic as follows:

1 Sample n points p1, . . . , pn uniformly from the data space of D (i.e., generate n points uniformly at random within the space containing D). For each point pi (1 ≤ i ≤ n), find the nearest neighbor of pi in D, and let xi be the distance between pi and its nearest neighbor in D. That is, xi = min{dist(pi, v) : v ∈ D}.

2 Sample n points q1, . . . , qn uniformly from D; that is, each point in D has the same probability of being included in this sample. For each qi (1 ≤ i ≤ n), find the nearest neighbor of qi in D − {qi}, and let yi be the distance between qi and its nearest neighbor in D − {qi}. That is, yi = min{dist(qi, v) : v ∈ D, v ≠ qi}.

3 Calculate the Hopkins statistic as H = ∑i yi / (∑i xi + ∑i yi).

“What does the Hopkins statistic tell us about how likely data set D follows a uniform distribution in the data space?” If D were uniformly distributed, then ∑i yi and ∑i xi would be close to each other, and thus H would be about 0.5. However, if D were highly skewed, then ∑i yi would be substantially smaller than ∑i xi in expectation, and thus H would be close to 0.
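A minimal sketch of this computation follows; it assumes the first sample is drawn uniformly from the bounding box of D (one simple way to realize “uniformly from the data space”), and the names are illustrative.

```python
import math
import random

def hopkins_statistic(D, n, seed=0):
    """Hopkins statistic of a list of d-dimensional points D, using n points per sample."""
    rng = random.Random(seed)
    d = len(D[0])
    lo = [min(p[k] for p in D) for k in range(d)]     # bounding box of the data
    hi = [max(p[k] for p in D) for k in range(d)]

    def nn_dist(p, exclude=None):
        # Distance from p to its nearest neighbor in D, optionally excluding one point.
        return min(math.dist(p, v) for v in D if v is not exclude)

    # x_i: nearest-data-point distances of points sampled uniformly from the bounding box.
    xs = [nn_dist([rng.uniform(lo[k], hi[k]) for k in range(d)]) for _ in range(n)]

    # y_i: nearest-neighbor distances of n actual data points (excluding the point itself).
    ys = [nn_dist(q, exclude=q) for q in rng.sample(D, n)]   # requires n <= len(D)

    return sum(ys) / (sum(xs) + sum(ys))

# A clearly clustered data set gives H close to 0; uniform data gives H around 0.5.
clustered = [(0 + 0.01 * i, 0 + 0.02 * i) for i in range(50)] + \
            [(10 + 0.01 * i, 10 + 0.02 * i) for i in range(50)]
print(round(hopkins_statistic(clustered, n=20), 2))
```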

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 35 / 41


Cluster validation and assessment

How good is the clustering generated by a method?

How can we compare the clusterings generated by different methods?

Clustering is an unsupervised learning technique and it is hard to evaluate the quality of the output of any given method.

If we use probabilistic models, we can always evaluate the likelihood of a test set, but this has two drawbacks:

1 It does not directly assess any clustering that is discovered by the model.
2 It does not apply to non-probabilistic methods.

We discuss some performance measures not based on likelihood.

The goal of clustering is to assign points that are similar to the same cluster, and to ensure that points that are dissimilar are in different clusters.

There are several ways of measuring these quantities:

1 Internal criterion: Typical objective functions in clustering formalize the goal of attaining high intra-cluster similarity and low inter-cluster similarity. But good scores on an internal criterion do not necessarily translate into good effectiveness in an application. An alternative to internal criteria is direct evaluation in the application of interest.

2 External criterion: Suppose we have labels for each object. Then we can compare the clustering with the labels using various metrics. We will use some of these metrics later, when we compare clustering methods.

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 36 / 41


Purity

Purity is a simple and transparent evaluation measure. Consider the following clustering.

(Figure 25.1 of Murphy: three clusters with labeled objects inside, based on Figure 16.4 of Manning et al. 2008. Cluster 1 contains 5 x's and 1 o, cluster 2 contains 4 o's, 1 x, and 1 diamond, and cluster 3 contains 3 diamonds and 2 x's.)

Let Nij be the number of objects in cluster i that belong to class j, and let Ni = ∑j Nij be the total number of objects in cluster i (the sum runs over the C classes).

We define the purity of cluster i as pi ≜ maxj (Nij / Ni), and the overall purity of a clustering as purity ≜ ∑i (Ni / N) pi.

For the above figure, the purity is (6/17)(5/6) + (6/17)(4/6) + (5/17)(3/5) = (5 + 4 + 3)/17 ≈ 0.71.

Bad clusterings have purity values close to 0, while a perfect clustering has a purity of 1.

High purity is easy to achieve when the number of clusters is large. In particular, purity is 1 if each point gets its own cluster. Thus, we cannot use purity to trade off the quality of the clustering against the number of clusters.
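A minimal sketch of the purity computation follows. The label lists reproduce the figure's composition (5 x's + 1 o, 4 o's + 1 x + 1 diamond, 3 diamonds + 2 x's), which is consistent with the arithmetic above; the names are illustrative.

```python
from collections import Counter, defaultdict

def purity(cluster_labels, class_labels):
    """Overall purity: sum of each cluster's majority-class count, divided by N."""
    by_cluster = defaultdict(list)
    for c, y in zip(cluster_labels, class_labels):
        by_cluster[c].append(y)
    majority_total = sum(max(Counter(ys).values()) for ys in by_cluster.values())
    return majority_total / len(class_labels)

# The 17-point figure: cluster 1 = 5 x + 1 o, cluster 2 = 4 o + 1 x + 1 d, cluster 3 = 3 d + 2 x.
clusters = [1]*6 + [2]*6 + [3]*5
classes  = ['x']*5 + ['o'] + ['o']*4 + ['x'] + ['d'] + ['d']*3 + ['x']*2
print(round(purity(clusters, classes), 2))   # (5 + 4 + 3) / 17 = 0.71
```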

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 37 / 41


Rand index

Let U = {u1, . . . , uR} and V = {v1, . . . , vC} be two different clusterings of the N data points.

For example, U might be the estimated clustering and V a reference clustering derived from the class labels.

Define a 2 × 2 contingency table containing the following numbers:
1 TP is the number of pairs that are in the same cluster in both U and V (true positives);
2 TN is the number of pairs that are in different clusters in both U and V (true negatives);
3 FN is the number of pairs that are in different clusters in U but in the same cluster in V (false negatives);
4 FP is the number of pairs that are in the same cluster in U but in different clusters in V (false positives).

The Rand index is defined as

RI ≜ (TP + TN) / (TP + FP + FN + TN).

The Rand index can be interpreted as the fraction of clustering decisions that are correct. Clearly RI ∈ [0, 1].
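A minimal pair-counting sketch of the Rand index; the names are illustrative, and the explicit loop over all pairs is fine for small N.

```python
from itertools import combinations

def rand_index(U, V):
    """Rand index of two label sequences of the same length N,
    computed by looping over all N(N-1)/2 pairs of objects."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(U)), 2):
        same_u, same_v = U[i] == U[j], V[i] == V[j]
        if same_u and same_v:
            tp += 1        # same cluster in both U and V
        elif same_u:
            fp += 1        # same cluster in U, different in V
        elif same_v:
            fn += 1        # different in U, same cluster in V
        else:
            tn += 1        # different clusters in both
    return (tp + tn) / (tp + fp + fn + tn)
```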

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 38 / 41


Rand index (example)

Consider the following clustering

(Figure 25.1 of Murphy: the same three labeled clusters as on the purity slide.)

The three clusters contain 6, 6, and 5 points, so the number of pairs placed in the same cluster is

TP + FP = C(6, 2) + C(6, 2) + C(5, 2) = 15 + 15 + 10 = 40.

The number of true positives is

TP = C(5, 2) + C(4, 2) + C(3, 2) + C(2, 2) = 10 + 6 + 3 + 1 = 20.

Then FP = 40 − 20 = 20. Similarly, FN = 24 and TN = 72.

Hence the Rand index is

RI = (20 + 72) / (20 + 20 + 24 + 72) = 92/136 ≈ 0.68.
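The counts above can be checked with a short pair-counting script; the per-cluster class composition is taken from the figure (5 x's + 1 o, 4 o's + 1 x + 1 diamond, 3 diamonds + 2 x's).

```python
from itertools import combinations

# Estimated clustering U and class labels V for the 17 points of the figure.
U = [1]*6 + [2]*6 + [3]*5
V = ['x']*5 + ['o'] + ['o']*4 + ['x'] + ['d'] + ['d']*3 + ['x']*2

tp = fp = fn = tn = 0
for i, j in combinations(range(len(U)), 2):
    same_u, same_v = U[i] == U[j], V[i] == V[j]
    tp += same_u and same_v
    fp += same_u and not same_v
    fn += not same_u and same_v
    tn += not same_u and not same_v

print(tp, fp, fn, tn)                              # 20 20 24 72
print(round((tp + tn) / (tp + fp + fn + tn), 2))   # 0.68
```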

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 39 / 41


Rand index (example)

Consider again the clustering of the previous slide, for which TP = 20, FP = 20, FN = 24, and TN = 72, giving RI = (20 + 72)/136 ≈ 0.68.


The Rand index only achieves its lower bound of 0 if TP = TN = 0, which is a rare event. We can define an adjusted Rand index as

ARI ≜ (index − E[index]) / (max index − E[index]).
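One common instantiation is the Hubert–Arabie adjusted Rand index, in which E[index] is computed from the cluster-size marginals of the contingency table (this is what sklearn.metrics.adjusted_rand_score implements). A minimal sketch follows; the names are illustrative and the printed value comes from this formula.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(U, V):
    """Hubert-Arabie ARI computed from the contingency table of two label sequences."""
    n = len(U)
    contingency = Counter(zip(U, V))            # N_ij: objects in cluster i with class j
    a = Counter(U)                              # row sums (cluster sizes in U)
    b = Counter(V)                              # column sums (cluster sizes in V)

    index = sum(comb(nij, 2) for nij in contingency.values())
    sum_a = sum(comb(ai, 2) for ai in a.values())
    sum_b = sum(comb(bj, 2) for bj in b.values())
    expected = sum_a * sum_b / comb(n, 2)       # E[index] under random labelings
    max_index = (sum_a + sum_b) / 2
    return (index - expected) / (max_index - expected)

U = [1]*6 + [2]*6 + [3]*5
V = ['x']*5 + ['o'] + ['o']*4 + ['x'] + ['d'] + ['d']*3 + ['x']*2
print(round(adjusted_rand_index(U, V), 2))      # about 0.24 for the running example
```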

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 40 / 41


Mutual information

We can measure cluster quality by computing the mutual information between U and V.

Let PUV(i, j) = |ui ∩ vj| / N be the probability that a randomly chosen object belongs to cluster ui in U and to cluster vj in V.

Let PU(i) = |ui| / N be the probability that a randomly chosen object belongs to cluster ui in U.

Let PV(j) = |vj| / N be the probability that a randomly chosen object belongs to cluster vj in V.

Then the mutual information is defined as

I(U, V) ≜ ∑i ∑j PUV(i, j) log [ PUV(i, j) / (PU(i) PV(j)) ],

where i ranges over the R clusters of U and j over the C clusters of V.

This lies between 0 and min{H(U), H(V)}. The maximum value can be achieved by using lots of small clusters, which have low entropy. To compensate for this, we can use the normalized mutual information (NMI),

NMI(U, V) ≜ I(U, V) / ( (1/2) [H(U) + H(V)] ).

This lies between 0 and 1.
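A minimal sketch of I(U, V) and NMI computed from two label sequences (natural logarithms; the base cancels in the NMI ratio); the names are illustrative.

```python
from collections import Counter
from math import log

def mutual_info(U, V):
    """I(U, V) in nats, from the empirical joint distribution of two label sequences."""
    n = len(U)
    p_uv, p_u, p_v = Counter(zip(U, V)), Counter(U), Counter(V)
    return sum((nij / n) * log((nij / n) / ((p_u[i] / n) * (p_v[j] / n)))
               for (i, j), nij in p_uv.items())

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def nmi(U, V):
    """NMI with the normalization used on this slide: I(U,V) / ((H(U) + H(V)) / 2)."""
    return mutual_info(U, V) / ((entropy(U) + entropy(V)) / 2)

U = [1]*6 + [2]*6 + [3]*5
V = ['x']*5 + ['o'] + ['o']*4 + ['x'] + ['d'] + ['d']*3 + ['x']*2
print(round(nmi(U, V), 2))   # about 0.36 for the running example
```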

Please read section 25.1 of Murphy.

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 41 / 41