
Lesson 6

Prof. Enza Messina


We discussed many clustering algorithms.

Each algorithm can partition the data, but different algorithms or input parameters produce different clusters, or reveal different clustering structures.

Problem: objectively and quantitatively evaluating the resulting clusters.

Is the resulting clustering structure meaningful? Are the clusters meaningful?


Supervised classification:

Class labels are known (ground truth)

Accuracy, precision, recall

Cluster analysis:

No class labels

Validation is needed to:

Compare clustering algorithms

Determine the number of clusters

Avoid finding patterns in noise

Cluster validation

Example (retrieving oranges and apples):

Oranges: Precision = 5/5 = 100%, Recall = 5/7 = 71%

Apples: Precision = 3/5 = 60%, Recall = 3/3 = 100%

It is necessary to find a way to validate the goodness of partitions after clustering. Otherwise, it would be difficult to make use of different clustering results

Clustering validation has long been recognized as one of the vital issues essential to the success of clustering applications

How to evaluate the “goodness” of the resulting clusters?

The most suitable measures to use in practice remain unknown

[Figures: scatter plots of 100 random points in the unit square, clustered by K-means, DBSCAN, and complete link. Each algorithm imposes a clustering structure even on purely random data.]

Cluster validation refers to procedures that evaluate the results of clustering in a quantitative and objective fashion. [Jain & Dubes, 1988]

◦ How to be "quantitative": employ the measures.

◦ How to be "objective": validate the measures!

Cluster validation process: the clustering algorithm is run on the input data set X for different numbers of clusters m; a validity index evaluates the resulting partitions and selects the best one (and its m*).

Dataset X, objective function F

Algorithms: A1, A2, …, Ak

Question: which algorithm is the best for this objective function?

Compute R1 = A1(X), R2 = A2(X), …, Rk = Ak(X)

Compare F(R1), F(R2), …, F(Rk)
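The comparison protocol above can be sketched in a few lines. The WSS objective and the two toy "algorithms" below are illustrative assumptions, not part of the lesson:

```python
import numpy as np

def wss(X, labels):
    """Within-cluster sum of squares: a common internal objective F."""
    total = 0.0
    for k in np.unique(labels):
        pts = X[labels == k]
        total += ((pts - pts.mean(axis=0)) ** 2).sum()
    return total

def best_algorithm(X, algorithms):
    """Run each algorithm on X; return the name minimizing F = WSS, plus all scores."""
    results = {name: alg(X) for name, alg in algorithms.items()}
    scores = {name: wss(X, labels) for name, labels in results.items()}
    return min(scores, key=scores.get), scores

# Toy data with two obvious groups and two hypothetical "algorithms".
X = np.array([[0.0], [0.1], [5.0], [5.1]])
algorithms = {
    "A1_good": lambda X: np.array([0, 0, 1, 1]),   # splits the groups
    "A2_bad":  lambda X: np.array([0, 1, 0, 1]),   # mixes them
}
winner, scores = best_algorithm(X, algorithms)
print(winner)  # A1_good: far lower WSS than A2_bad
```

The same skeleton works for any internal index F; only the `wss` function changes.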


1. Determining the clustering tendency of a set of data, i.e., determining whether non-random structure actually exists in the data.

2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.

3. Evaluating how well the results of a cluster analysis fit the data without reference to external information (use only the data).

4. Comparing the results of two different cluster analyses to determine which is better.

5. Determining the 'correct' number of clusters.

For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.


We need a framework to interpret any measure. For example, if our evaluation measure has the value 10, is that good, fair, or poor?

Statistics provide a framework for cluster validity:

◦ The more "atypical" a clustering result is, the more likely it represents valid structure in the data.

◦ We can compare the values of an index obtained on random data or random clusterings to those of an actual clustering result. If the value of the index is unlikely under randomness, then the cluster results are valid.

◦ These approaches are more complicated and harder to understand.

For comparing the results of two different cluster analyses, a framework is less necessary. However, there is the question of whether the difference between two index values is significant.

Numerical measures applied to judge various aspects of cluster validity are classified into two types:

◦ External index: measures the extent to which cluster labels match externally supplied class labels (e.g., entropy).

◦ Internal index: measures the goodness of a clustering structure without respect to external information (e.g., sum of squared errors, SSE).

Sometimes these are referred to as criteria instead of indices. However, sometimes "criterion" denotes the general strategy and "index" the numerical measure that implements the criterion.

Measuring clustering validity

Internal index:

• Validates without external information

• Applies across different numbers of clusters

• Can be used to determine the number of clusters

External index:

• Validates against ground truth

• Compares two clusterings (how similar are they?)

External validation measures use external information not present in the data to evaluate the extent to which the clustering structure discovered by a clustering algorithm matches some external structure, e.g., the one specified by the given class labels.

Such labels are rarely available!

External measures are based on a matrix that summarizes the number of correct and wrong predictions

Precision and recall

Precision is the fraction of retrieved instances that are relevant, while recall is the fraction of relevant instances that are retrieved:

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)
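A minimal sketch of these two formulas, reproducing the numbers from the oranges/apples example earlier (the TP/FP/FN counts are read off that example):

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP),  Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Oranges: 5 retrieved, all oranges (TP=5, FP=0), 2 oranges missed (FN=2)
p_o, r_o = precision_recall(tp=5, fp=0, fn=2)
print(p_o, round(r_o, 2))   # 1.0 0.71

# Apples: 5 retrieved, 3 apples (TP=3, FP=2), none missed (FN=0)
p_a, r_a = precision_recall(tp=3, fp=2, fn=0)
print(p_a, r_a)   # 0.6 1.0
```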

Given a data set D with n objects, assume that there is a partition

P = {P1, P2, …, PK}, where ∪_{i=1..K} Pi = D and Pi ∩ Pj = ∅ for 1 ≤ i ≠ j ≤ K,

and K is the number of clusters. If the "true" class labels for the data are given, another partition can be generated on D:

C = {C1, C2, …, CK'}, where ∪_{j=1..K'} Cj = D and Ci ∩ Cj = ∅ for 1 ≤ i ≠ j ≤ K',

where K' is the number of classes.

Let n_ij be the number of objects in cluster Pi from class Cj; the information on the overlap between the two partitions can then be written in the form of a contingency matrix.
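A minimal sketch of building the contingency matrix n_ij from two label vectors (the toy labels are illustrative):

```python
import numpy as np

def contingency(cluster_labels, class_labels):
    """n[i, j] = number of objects in cluster i carrying class label j."""
    clusters = sorted(set(cluster_labels))
    classes = sorted(set(class_labels))
    n = np.zeros((len(clusters), len(classes)), dtype=int)
    for p, c in zip(cluster_labels, class_labels):
        n[clusters.index(p), classes.index(c)] += 1
    return n

P = [0, 0, 0, 1, 1, 1]      # clustering with K = 2
C = [0, 0, 1, 1, 1, 1]      # ground truth with K' = 2
M = contingency(P, C)
print(M)   # [[2 1] [0 3]]
```

Row sums give the cluster sizes n_i and column sums the class sizes n_j, which the measures below all build on.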

Purity: measures the purity of each cluster with respect to the given class labels.

F-measure: originally designed for hierarchical clustering; it combines the precision and recall concepts from the information retrieval community.

Entropy, mutual information, variation of information: developed in the field of information theory. MI measures how much information one random variable tells about another; VI measures the information that is gained or lost in changing from the class set to the cluster set.

Pairwise measures: evaluate the clustering quality by the agreement and/or disagreement of the pairs of data objects in different partitions.

Classification error: tries to map each class to a different cluster so as to minimize the misclassification rate.

Van Dongen criterion: first introduced for graph clustering; it measures the representativeness of the majority objects in each cluster and in each class.

Micro-average precision: equivalent to purity, but at a micro level.

Equivalent to Rand statistics.

Entropy = - Σ_{i=1..K} p_i Σ_{j=1..K'} (p_ij / p_i) log (p_ij / p_i)

If a random-variable view on cluster P and class C is taken, then p_ij = n_ij / n is the probability of {P = Pi ∧ C = Cj} and p_i = n_i / n is the marginal probability.

Therefore

Entropy = Σ_{i=1..K} p_i Σ_{j=1..K'} ( -p(Cj | Pi) log p(Cj | Pi) ) = Σ_{i=1..K} p_i H(C | Pi) = H(C | P)

where H(·) is the Shannon entropy:

H(X) = - Σ_{i=1..n} P(x_i) log P(x_i)

[Figure: entropy of a binary source.]
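The conditional-entropy measure H(C | P) can be sketched directly from the formula above (toy labels are illustrative; natural logarithms are assumed throughout, as in the formulas):

```python
import math
from collections import Counter

def clustering_entropy(cluster_labels, class_labels):
    """H(C | P): cluster-size-weighted entropy of class labels within clusters."""
    n = len(cluster_labels)
    total = 0.0
    for p in set(cluster_labels):
        members = [c for pl, c in zip(cluster_labels, class_labels) if pl == p]
        pi = len(members) / n                       # marginal p_i
        h = 0.0
        for count in Counter(members).values():
            q = count / len(members)                # p(C_j | P_i)
            h -= q * math.log(q)
        total += pi * h
    return total

# Perfect clustering: each cluster is pure -> entropy 0
e_pure = clustering_entropy([0, 0, 1, 1], ["a", "a", "b", "b"])
print(e_pure)   # 0.0
# Completely mixed clusters -> entropy log 2
e_mixed = clustering_entropy([0, 1, 0, 1], ["a", "a", "b", "b"])
print(round(e_mixed, 4))   # 0.6931
```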

- Based on the concept of entropy.

- Mutual information (MI) measures the information that two clusterings share, and variation of information (VI) is its complement.

[Diagram: H(P) and H(C) overlap in MI; the non-shared parts are H(P|C) and H(C|P), whose sum is VI(P,C).]

MI(P, C) = Σ_{i=1..K} Σ_{j=1..K'} p_ij log ( p_ij / (p_i p_j) ) = H(C) - H(C | P) = H(C) - Entropy

where n_i is the size of cluster Pi, n_j is the size of class Cj, and n_ij is the number of objects shared by Pi and Cj.

VI(P, C) = H(P | C) + H(C | P)

Mutual information measures the information that X and Y share, i.e., how much knowing one of these variables reduces uncertainty about the other.

For example, if X and Y are independent, then knowing X does not give any information about Y and vice versa, so their mutual information is zero.

At the other extreme, if X is a deterministic function of Y and Y is a deterministic function of X, then all information conveyed by X is shared with Y: knowing X determines the value of Y and vice versa.
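A sketch of MI and VI computed from label vectors, following the formulas above (natural logarithms assumed; VI is computed via the equivalent identity VI = H(P) + H(C) - 2 MI):

```python
import math
from collections import Counter

def mutual_information(P, C):
    """MI(P, C) = sum_ij p_ij log( p_ij / (p_i p_j) )."""
    n = len(P)
    pi, pj, pij = Counter(P), Counter(C), Counter(zip(P, C))
    return sum((nij / n) * math.log((nij / n) / ((pi[i] / n) * (pj[j] / n)))
               for (i, j), nij in pij.items())

def variation_of_information(P, C):
    """VI(P, C) = H(P|C) + H(C|P) = H(P) + H(C) - 2 MI(P, C)."""
    def H(labels):
        n = len(labels)
        return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())
    return H(P) + H(C) - 2 * mutual_information(P, C)

# Identical partitions: MI equals the full entropy log 2, and VI is 0.
P = [0, 0, 1, 1]
C = ["a", "a", "b", "b"]
mi = mutual_information(P, C)
vi = variation_of_information(P, C)
print(round(mi, 4), round(vi, 4))   # 0.6931 0.0
```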

Purity measures the "purity" of each cluster by the ratio of objects from the majority class:

Purity = Σ_{i=1..K} p_i max_j ( p_ij / p_i )
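The purity formula reduces to summing each cluster's majority-class count and dividing by n; a minimal sketch (toy labels illustrative):

```python
from collections import Counter

def purity(cluster_labels, class_labels):
    """Purity = (1/n) * sum over clusters of the majority-class count."""
    n = len(cluster_labels)
    total = 0
    for p in set(cluster_labels):
        members = [c for pl, c in zip(cluster_labels, class_labels) if pl == p]
        total += max(Counter(members).values())   # majority class in cluster p
    return total / n

P = [0, 0, 0, 1, 1, 1]
C = ["a", "a", "b", "b", "b", "b"]
pur = purity(P, C)
print(round(pur, 3))   # (2 + 3) / 6 ≈ 0.833
```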


Entropy, mutual information and purity cannot capture the uniform effect of clustering algorithms such as K-means!

VI is an improved version of the entropy and mutual information measures.

VD is an improved version of the purity measure (it also reflects the integrity of a class).

VI = - Σ_{i=1..K} p_i log p_i - Σ_{j=1..K'} p_j log p_j - 2 Σ_{i=1..K} Σ_{j=1..K'} p_ij log ( p_ij / (p_i p_j) )

(i.e., VI = H(P) + H(C) - 2 MI(P, C))

VD = ( 2n - Σ_{i=1..K} max_j n_ij - Σ_{j=1..K'} max_i n_ij ) / 2n

A clustering can be considered “valid” if it has an unusually high or low value, as measured with respect to the baseline distribution.

Normalization of R, FM, Γ, Γ’, J and MS

Sn = ( S - E(S) ) / ( max(S) - E(S) )

where max(S) is the maximum value of the measure S and E(S) is its expected value under the baseline distribution.

Hubert and Arabie (1985) suggested using the multivariate hypergeometric distribution as the baseline distribution in which row and column sums are fixed, but the partitions are randomly selected

E( Σ_i Σ_j (n_ij choose 2) ) = [ Σ_i (n_i choose 2) · Σ_j (n_j choose 2) ] / (n choose 2)

from which the normalized R, FM, Γ, Γ', J and MS can be derived.

Another normalization scheme, used by VI and VD, is

Sn = ( S - min(S) ) / ( max(S) - min(S) )

Exact values of min(S) and max(S) are often impossible to know; therefore they are approximated by a lower and an upper bound, respectively. For VI:

VI~ = VI / ( 2 log(max(K, K')) )

with LB(VI) = 0 and UB(VI) = 2 log(max(K, K')).

The coefficient of variation (CV) is used to measure the dispersion degree of the class sizes:

CV = s / x̄   (standard deviation over mean)

DCV = CV1 - CV0

where CV0 is the CV of the true class sizes and CV1 is the CV of the cluster sizes.
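A small sketch of CV and DCV; the class and cluster sizes below are invented to illustrate the uniform effect of K-means on a skewed data set:

```python
import statistics

def cv(sizes):
    """Coefficient of variation: standard deviation over mean."""
    return statistics.pstdev(sizes) / statistics.mean(sizes)

true_sizes = [10, 10, 80]       # skewed ground-truth class sizes (assumed)
kmeans_sizes = [33, 33, 34]     # K-means tends toward equal-sized clusters

dcv = cv(kmeans_sizes) - cv(true_sizes)   # DCV = CV1 - CV0
print(round(dcv, 3))   # strongly negative: clusters far more uniform than true classes
```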

External indices are commonly used to compare different clustering algorithms.

A real-life data set for clustering has no class labels.

◦ Thus, although an algorithm may perform very well on some labeled data sets, there is no guarantee that it will perform well on the actual application data at hand.

Still, the fact that it performs well on some labeled data sets does give us some confidence in the quality of the algorithm.


Two matrices:

◦ Proximity matrix

◦ "Incidence" matrix: one row and one column for each data point; an entry is 1 if the associated pair of points belongs to the same cluster, and 0 if it belongs to different clusters.

Compute the correlation between the two matrices. Since the matrices are symmetric, only the correlation between the n(n-1)/2 upper-triangle entries needs to be calculated.

High correlation indicates that points that belong to the same cluster are close to each other.

Not a good measure for some density- or contiguity-based clusters.
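A sketch of this correlation computed over the n(n-1)/2 pairwise entries; Euclidean distance and the toy data are assumptions. Because small distances pair with incidence 1, compact well-separated clusters give a strongly negative correlation:

```python
import numpy as np

def proximity_incidence_corr(X, labels):
    """Correlation between pairwise distances and same-cluster incidence."""
    n = len(X)
    dist, same = [], []
    for i in range(n):
        for j in range(i + 1, n):          # the n(n-1)/2 upper-triangle entries
            dist.append(np.linalg.norm(X[i] - X[j]))
            same.append(1.0 if labels[i] == labels[j] else 0.0)
    return np.corrcoef(dist, same)[0, 1]

# Two tight, well-separated groups -> correlation close to -1
X = np.array([[0, 0], [0, 1], [10, 10], [10, 11]], dtype=float)
labels = [0, 0, 1, 1]
corr = proximity_incidence_corr(X, labels)
print(corr)
```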

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets.

[Figures: two scatter plots, with Corr = -0.9235 and Corr = -0.5810 respectively.]

Order the similarity matrix with respect to cluster labels and inspect visually.

[Figure: three well-separated clusters and the corresponding sorted similarity matrix, which shows a crisp block-diagonal structure.]

Clusters in random data are not so crisp:

[Figures: sorted similarity matrices for DBSCAN, K-means, and complete link clusterings of random points; the block structure is much weaker.]

[Figure: DBSCAN clustering of a larger data set (clusters 1-7) and its sorted similarity matrix.]

As the goal of clustering is to make objects within the same cluster similar and objects in different clusters distinct, internal validation measures are often based on the following two criteria:

Compactness: how closely related the objects in a cluster are.

Separation: how distinct or well-separated a cluster is from other clusters.

Cluster cohesion: measures how closely related the objects in a cluster are.

Cluster separation: measures how distinct or well-separated a cluster is from other clusters.

◦ Cohesion is measured by the within-cluster sum of squares (WSS):

WSS = Σ_i Σ_{x ∈ C_i} (x - c_i)²

◦ Separation is measured by the between-cluster sum of squares (BSS):

BSS = Σ_i |C_i| (c - c_i)²

where |C_i| is the size of cluster i, c_i its centroid, and c the overall mean.

Example (SSE): BSS + WSS = constant.

Points 1, 2, 4, 5 on a line; overall mean c = 3.

K = 1 cluster:

WSS = (1-3)² + (2-3)² + (4-3)² + (5-3)² = 10

BSS = 4 × (3-3)² = 0

Total = 10

K = 2 clusters (centroids c1 = 1.5, c2 = 4.5):

WSS = (1-1.5)² + (2-1.5)² + (4-4.5)² + (5-4.5)² = 1

BSS = 2 × (3-1.5)² + 2 × (4.5-3)² = 9

Total = 10
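The worked example can be checked numerically with a small sketch of WSS and BSS for 1-D data, following the formulas above:

```python
import numpy as np

def wss_bss(X, labels):
    """WSS = sum_i sum_{x in C_i} (x - c_i)^2,  BSS = sum_i |C_i| (c - c_i)^2."""
    c = X.mean()                              # overall mean
    wss = bss = 0.0
    for k in set(labels):
        pts = X[np.array(labels) == k]
        ci = pts.mean()                       # cluster centroid
        wss += ((pts - ci) ** 2).sum()
        bss += len(pts) * (c - ci) ** 2
    return wss, bss

X = np.array([1.0, 2.0, 4.0, 5.0])
w1, b1 = wss_bss(X, [0, 0, 0, 0])
print(w1, b1)   # K=1: WSS=10, BSS=0
w2, b2 = wss_bss(X, [0, 0, 1, 1])
print(w2, b2)   # K=2: WSS=1,  BSS=9  (total stays 10)
```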

A proximity graph based approach can also be used for cohesion and separation.

◦ Cluster cohesion is the sum of the weight of all links within a cluster.

◦ Cluster separation is the sum of the weights between nodes in the cluster and nodes outside the cluster.

[Figure: cohesion as edge weights within a cluster vs. separation as edge weights between clusters.]

Root-mean-square standard deviation (RMSSTD) is the square root of the pooled sample variance of all the attributes. It measures the homogeneity of the formed clusters

R-squared (RS) is the ratio of sum of squares between clusters to the total sum of squares of the whole data set. It measures the degree of difference between clusters

Modified Hubert Γ statistic (Γ) evaluates the difference between clusters by counting the disagreements of pairs of data objects in two partitions.

Calinski–Harabasz index (CH) evaluates the cluster validity based on the average between- and within-cluster sum of squares

Index I (I) measures separation based on the maximum distance between cluster centers, and measures compactness based on the sum of distances between objects and their cluster center

Dunn’s index (D) uses the minimum pairwise distance between objects in different clusters as the intercluster separation and the maximum diameter among all clusters as the intracluster compactness

Silhouette index (S) validates the clustering performance based on the pairwise difference of between- and within-cluster distances. In addition, the optimal cluster number is determined by maximizing the value of this index

Silhouette coefficient

• Cohesion a(x): average distance of x to all other vectors in the same cluster.

• Separation b(x): average distance of x to the vectors in each other cluster; take the minimum among the clusters.

• Silhouette s(x):

s(x) = ( b(x) - a(x) ) / max{ a(x), b(x) }

• s(x) ∈ [-1, +1]: -1 = bad, 0 = indifferent, +1 = good.

• Silhouette coefficient (SC):

SC = (1/N) Σ_{i=1..N} s(x_i)

[Figure: a(x) is the average distance within the cluster (cohesion); b(x) is the minimal average distance to the other clusters (separation).]
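A direct sketch of s(x) and SC from the definitions above (Euclidean distance assumed; toy data illustrative):

```python
import numpy as np

def silhouette_coefficient(X, labels):
    """SC = (1/N) sum_x s(x),  s(x) = (b(x) - a(x)) / max(a(x), b(x))."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    scores = []
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        same = (labels == labels[i])
        # a(x): average distance to the other members of x's cluster
        a = d[same].sum() / max(same.sum() - 1, 1)
        # b(x): minimum over other clusters of the average distance to them
        b = min(d[labels == k].mean() for k in set(labels) if k != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Two tight, well-separated clusters -> SC close to +1
X = [[0, 0], [0, 1], [10, 10], [10, 11]]
sc = silhouette_coefficient(X, [0, 0, 1, 1])
print(sc)
```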

Davies–Bouldin index (DB): for each cluster C, the similarities between C and all other clusters are computed, and the highest value is assigned to C as its cluster similarity. The DB index is then obtained by averaging all the cluster similarities. The smaller the index, the better the clustering result.

Xie-Beni index (XB) defines the intercluster separation as the minimum square distance between cluster centers, and the intracluster compactness as the mean square distance between each data object and its cluster center. The optimal cluster number is reached when the minimum of XB is found.

SD index (SD) is based on the concepts of the average scattering and the total separation of clusters. The first term evaluates compactness based on variances of cluster objects, and the second term evaluates separation difference based on distances between cluster centers. The SD index is the summation of these two terms, and the optimal number of clusters can be obtained by minimizing the value of SD.

SDbw index (SDbw) takes density into account to measure the intercluster separation. The basic idea is that for each pair of cluster centers, at least one of their densities should be larger than the density of their midpoint.


The Clustering Validation index based on Nearest Neighbors (CV NN) evaluates the intercluster separation based on objects that carry the geometrical information of each cluster. If an object is located in the center of a cluster and is surrounded by objects in the same cluster, it is well separated from other clusters and thus contributes little to the intercluster separation. If an object is located at the edge of a cluster and is surrounded mostly by objects in other clusters, it connects to other clusters tightly and thus contributes a lot to the intercluster separation. The two contributions are normalized and summed up.

The first three indices monotonically increase or decrease as the cluster number NC increases: they take only either separation or compactness into account. It is claimed that the optimal cluster number is reached at the shift point of the curves, also known as "the elbow". However, since the judgment of the shift point is very subjective and hard to determine, these three measures are excluded from further study.

The index CH is computed as (BSS/WSS) × ((n−NC)/(NC−1)), where WSS is the within-group sum of squares, BSS the between-group sum of squares, and NC the number of clusters. When noise is introduced, WSS increases much more significantly than BSS; therefore, for the same NC, CH decreases under the influence of noise, which makes the value of CH unstable.

D uses the minimum pairwise distance between objects in different clusters as the intercluster separation, and the maximum diameter among all clusters as the intracluster compactness. When noise is introduced, the intercluster separation can decrease sharply, since it uses only the minimum pairwise distance, rather than the average pairwise distance, between objects in different clusters. Thus, the value of D may change dramatically, and the corresponding optimal cluster number will be influenced by the noise.

The indices other than CH and D will also be influenced by noise, though in a less sensitive way. Thus, to minimize the adverse effect of noise, in practice it is always good to remove noise before clustering.

I measures compactness based on the sum of distances between objects and their cluster center. When NC is small, objects with high density are likely in the same cluster, which makes the sum of distances remain almost the same. Since most of the objects are in one cluster, the total sum will not change too much. Therefore, as NC increases, I will decrease since NC is in the denominator.

S uses the average minimum distance between clusters as the intercluster separation. For a data set with subclusters, the intercluster separation achieves its maximum value when subclusters close to each other are considered as one big cluster. The other measures suffer from the same problem.

Since K-means has the uniform effect of tending to divide objects into relatively equal sizes, it does not perform well on skewed distributed data sets. CH is based on WSS, which shares the same idea as K-means; therefore, a similar conclusion applies to CH.

A data set with arbitrary shapes is always hard to handle. Indices that use the minimum pairwise distance between objects in different clusters to measure the intercluster separation are not good for dealing with arbitrary shaped data sets, nor are indices that use cluster centers.

CVNN evaluates the intercluster separation based on objects that carry the geometrical information of each cluster. Using multiple representatives, this method is more effective than the others when dealing with arbitrary shapes.

"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."

— Jain and Dubes, Algorithms for Clustering Data

Efficient implementation

Brute force: solve the clustering for all possible numbers of clusters, searching each separately (100% of the running time).

Stepwise: as in brute force, but start from the previous solution and iterate less (30-40%).

Criterion-guided search: integrate the cost function directly into the optimization function (3-6%).