*Lesson 6 Prof. Enza Messina - .MI measures how much information a random variable can tell about*

Lesson 6

Prof. Enza Messina

We discussed many clustering algorithms

Each algorithm can partition data, but different algorithms or input parameters cause different clusters, or reveal different clustering structures.

Problem: objectively and quantitatively evaluating the resulting clusters

Is the resulting clustering structure meaningful?

Are the clusters meaningful?

Supervised classification:

Class labels known for ground truth

Accuracy, precision, recall

Cluster analysis

No class labels

Validation is needed to:

Compare clustering algorithms

Determine the number of clusters

Avoid finding patterns in noise

Cluster validation

Oranges: Precision = 5/5 = 100%, Recall = 5/7 = 71%

Apples: Precision = 3/5 = 60%, Recall = 3/3 = 100%

It is necessary to find a way to validate the goodness of partitions after clustering. Otherwise, it would be difficult to make use of different clustering results.

Clustering validation has long been recognized as one of the vital issues essential to the success of clustering applications.

How to evaluate the goodness of the resulting clusters?

The most suitable measures to use in practice remain unknown.

[Figure: scatter plots with axes x and y (0 to 1). Panels: Random Points, K-means, DBSCAN, Complete Link.]

Cluster validation refers to procedures that evaluate the results of clustering in a quantitative and objective fashion. [Jain & Dubes, 1988]

How to be quantitative: To employ the measures.

How to be objective: To validate the measures!

Cluster validation process

INPUT: DataSet(X) -> Clustering Algorithm (different number of clusters m) -> Partitions -> Validity Index -> m*
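The process above can be sketched in a few lines. This is only an illustration under stated assumptions: the clustering algorithm is a toy one (cut sorted 1-D data at its largest gaps), the validity index is the mean silhouette width, and the data values and the names `split_at_largest_gaps` and `mean_silhouette` are made up for this example. Neither the algorithm nor the index is prescribed by the slides.

```python
def split_at_largest_gaps(values, m):
    """Toy clustering algorithm: cut sorted 1-D data at its m-1 largest gaps."""
    s = sorted(values)
    # Indices i where a cut between s[i-1] and s[i] is possible, widest gap first.
    gaps = sorted(range(1, len(s)), key=lambda i: s[i] - s[i - 1], reverse=True)
    bounds = [0] + sorted(gaps[: m - 1]) + [len(s)]
    return [s[a:b] for a, b in zip(bounds, bounds[1:])]


def mean_silhouette(clusters):
    """Validity index (assumed here): mean silhouette width, larger is better."""
    scores = []
    for i, own in enumerate(clusters):
        for p in own:
            if len(own) == 1:
                scores.append(0.0)  # common convention for singleton clusters
                continue
            # a: mean distance to the other members of the same cluster
            a = sum(abs(p - q) for q in own) / (len(own) - 1)
            # b: mean distance to the nearest other cluster
            b = min(
                sum(abs(p - q) for q in other) / len(other)
                for j, other in enumerate(clusters)
                if j != i
            )
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Illustrative data set X with three visible groups.
X = [1.0, 1.1, 1.2, 5.0, 5.1, 5.2, 10.0, 10.1, 10.2]

# One partition per candidate number of clusters m, each scored by the index.
index_by_m = {m: mean_silhouette(split_at_largest_gaps(X, m)) for m in range(2, 5)}
m_star = max(index_by_m, key=index_by_m.get)
print(index_by_m, "m* =", m_star)
```

Any other algorithm or index could be substituted; the point is only the loop over m followed by the selection of m*.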

Dataset X, Objective function F

Algorithms: A1, A2, ..., Ak

Question: Which algorithm is the best for this objective function?

R1 = A1(X), R2 = A2(X), ..., Rk = Ak(X)

Compare F(R1), F(R2), ..., F(Rk)
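As a minimal sketch of this comparison, assume F is the sum of squared error (SSE) and that two label vectors stand in for the outputs of two hypothetical algorithms A1 and A2. The data values and labelings below are invented for illustration.

```python
def sse(points, labels):
    """Objective F (assumed): sum of squared distances to each cluster centroid."""
    clusters = {}
    for p, c in zip(points, labels):
        clusters.setdefault(c, []).append(p)
    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        centroid = [sum(m[d] for m in members) / len(members) for d in range(dim)]
        total += sum((m[d] - centroid[d]) ** 2 for m in members for d in range(dim))
    return total


# Illustrative data set X: two separated groups on a line.
X = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]

# R1 and R2 play the role of A1(X) and A2(X); both labelings are hypothetical.
results = {
    "A1": [0, 0, 0, 1, 1, 1],
    "A2": [0, 0, 1, 1, 1, 1],
}

# Compare F(R1) and F(R2); for SSE, smaller is better.
scores = {name: sse(X, labels) for name, labels in results.items()}
best = min(scores, key=scores.get)
print(scores, "best:", best)
```

Note that the ranking depends entirely on the chosen F; a different objective function may prefer a different algorithm.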

1. Determining the clustering tendency of a set of data, i.e., determine whether non-random structure actually exists in the data (cluster).

2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.

3. Evaluating how well the results of a cluster analysis fit the data without reference to external information.

- Use only the data

4. Comparing the results of two different sets of cluster analysis to determine which is better.

5. Determining the correct number of clusters.

For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.

Need a framework to interpret any measure. For example, if our measure of evaluation has the value 10, is that good, fair, or poor?

Statistics provide a framework for cluster validity. The more atypical a clustering result is, the more likely it represents valid structure in the data.

Can compare the values of an index that result from random data or random clusterings to those of a clustering result.

If the value of the index is unlikely, then the cluster results are valid.

These approaches are more complicated and harder to understand.

For comparing the results of two different sets of cluster analyses, a framework is less necessary. However, there is the question of whether the difference between two index values is significant.
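One way to make "unlikely" concrete is a small Monte Carlo experiment: compute the index on the real data, compute it again on many random data sets of the same size and range, and see how often random data does as well. The sketch below assumes 1-D data, uses the best two-cluster SSE as the index, and draws synthetic data with a fixed seed. All of these choices, and the name `best_two_cluster_sse`, are illustrative assumptions.

```python
import random


def sse_1d(values):
    """Sum of squared deviations from the mean of a 1-D group."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_two_cluster_sse(values):
    """Index (assumed): smallest SSE over all splits of sorted data into two clusters."""
    s = sorted(values)
    return min(sse_1d(s[:i]) + sse_1d(s[i:]) for i in range(1, len(s)))


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Synthetic "observed" data: two tight groups (illustrative values only).
data = [rng.gauss(0.2, 0.02) for _ in range(20)] + [
    rng.gauss(0.8, 0.02) for _ in range(20)
]
observed = best_two_cluster_sse(data)

# Reference distribution: the same index on uniformly random data
# with the same number of points and the same range.
low, high = min(data), max(data)
reference = [
    best_two_cluster_sse([rng.uniform(low, high) for _ in range(len(data))])
    for _ in range(200)
]

# Fraction of random data sets whose index is at least as good (as low) as observed.
p_value = sum(r <= observed for r in reference) / len(reference)
print("observed index:", observed, "empirical p-value:", p_value)
```

A small fraction suggests that the observed index value would be atypical for structureless data, which is the sense in which the result is called valid.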

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following two types.

External Index: Used to measure the extent to which cluster labels match externally supplied class labels. Example: entropy.

Internal Index: Used to measure the goodness of a clustering structure without respect to external information. Example: Sum of Squared Error (SSE).

Sometimes these are referred to as criteria instead of indices. However, sometimes criterion is the general strategy and index is the numerical measure that implements the criterion.
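To illustrate an external index, the sketch below computes a size-weighted entropy of class labels within each cluster. This is one common definition and is assumed here, since the slides name entropy without giving a formula. The labels "a" and "b" and the function name `clustering_entropy` are invented for the example.

```python
import math
from collections import Counter


def clustering_entropy(cluster_labels, class_labels):
    """Weighted average, over clusters, of the class-label entropy in each cluster.

    0 means every cluster contains a single class; larger values mean more mixing.
    """
    n = len(cluster_labels)
    by_cluster = {}
    for c, y in zip(cluster_labels, class_labels):
        by_cluster.setdefault(c, []).append(y)

    total = 0.0
    for members in by_cluster.values():
        size = len(members)
        counts = Counter(members)
        entropy = -sum((k / size) * math.log2(k / size) for k in counts.values())
        total += (size / n) * entropy
    return total


# Two clusters of five objects each: the first is pure, the second is mixed 3/2.
clusters = [0] * 5 + [1] * 5
classes = ["a"] * 5 + ["b"] * 3 + ["a"] * 2

print(clustering_entropy(clusters, classes))
```

An internal index such as SSE, by contrast, would be computed from the data and the cluster assignment alone, with no class labels involved.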

Measuring clustering validity

Internal Index:

Validate without external info

With different number of clusters

Determine the number of clusters

External Index:

Validate against ground truth

Compare two clusterings (how similar they are)

External validation measures use external information not present in the data to evaluate the extent to which the clustering structure discovered by a clustering algorithm matches some external structure, e.g., the one specified by the given class labels.

Such external labels are rarely available!

External measures are based on a matrix that summarizes the number of correct and wrong predictions.

Precision and Recall

Precision is the fraction of retrieved instances that are relevant, while recall is the fraction of relevant instances that are retrieved.

Precision = TP / ( TP + FP )

Recall = TP / ( TP + FN )
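The two formulas translate directly into code. In this small sketch the function names are illustrative, and the counts are chosen to reproduce fractions that appear earlier in the lesson (5/5 with 5/7, and 3/5 with 3/3).

```python
def precision(tp, fp):
    """Fraction of retrieved instances that are relevant: TP / (TP + FP)."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Fraction of relevant instances that are retrieved: TP / (TP + FN)."""
    return tp / (tp + fn)


# 5 retrieved, all relevant, but 2 relevant instances were missed.
print(precision(tp=5, fp=0), recall(tp=5, fn=2))

# 5 retrieved, only 3 relevant, and no relevant instance was missed.
print(precision(tp=3, fp=2), recall(tp=3, fn=0))
```

The two cases show the usual trade-off: the first has perfect precision with incomplete recall, the second has perfect recall with lower precision.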

Given a data set D with n objects, assume that there is a partition

$P = \{P_1, P_2, \ldots, P_K\}$, where $\bigcup_{i=1}^{K} P_i = D$ and $P_i \cap P_j = \emptyset$ for $1 \le i \ne j \le K$,

and K is the number of clusters. If the true class labels for the data are given, another partition can be generated