
  • Lesson 6

    Prof. Enza Messina


  • We discussed many clustering algorithms

    Each algorithm can partition the data, but different algorithms, or different input parameters, produce different clusters and reveal different clustering structures.

    Problem: how to evaluate the resulting clusters objectively and quantitatively

    Is the resulting clustering structure meaningful?

  • Are the clusters meaningful?

  • Supervised classification:

    Class labels are known (ground truth)

    Accuracy, precision, recall

    Cluster analysis:

    No class labels

    Validation is needed to:

    Compare clustering algorithms

    Determine the number of clusters

    Avoid finding patterns in noise

    Cluster validation example:

    Oranges: Precision = 5/5 = 100%, Recall = 5/7 = 71%

    Apples: Precision = 3/5 = 60%, Recall = 3/3 = 100%
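
The apples-and-oranges numbers above can be reproduced with a short sketch. The label lists below are hypothetical data chosen to match the slide's counts (7 true oranges, 3 true apples, two oranges mislabelled as apples):

```python
def precision_recall(true_labels, pred_labels, label):
    """Per-class precision and recall from two parallel label lists."""
    tp = sum(t == label and p == label for t, p in zip(true_labels, pred_labels))
    fp = sum(t != label and p == label for t, p in zip(true_labels, pred_labels))
    fn = sum(t == label and p != label for t, p in zip(true_labels, pred_labels))
    return tp / (tp + fp), tp / (tp + fn)

# Hypothetical data reproducing the slide's numbers:
# 5 oranges found correctly, 2 oranges mislabelled as apples, 3 apples correct.
true_labels = ['orange'] * 7 + ['apple'] * 3
pred_labels = ['orange'] * 5 + ['apple'] * 2 + ['apple'] * 3

print(precision_recall(true_labels, pred_labels, 'orange'))  # (1.0, 0.714...)
print(precision_recall(true_labels, pred_labels, 'apple'))   # (0.6, 1.0)
```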

  • It is necessary to find a way to validate the goodness of partitions after clustering; otherwise it is difficult to make use of the different clustering results.

    Clustering validation has long been recognized as one of the vital issues essential to the success of clustering applications.

    How do we evaluate the goodness of the resulting clusters?

    The most suitable measures to use in practice remain unknown.

  • [Figure: random points scattered uniformly on the unit square]

  • [Figure: the same random points, and the clusters K-means finds in them]

  • [Figure: the random points, with the K-means and DBSCAN results on them]

  • [Figure: the random points, with the K-means, DBSCAN, and complete-link results; each algorithm imposes a clustering on structureless data]

  • Cluster validation refers to procedures that evaluate the results of clustering in a quantitative and objective fashion. [Jain & Dubes, 1988]

    How to be quantitative: employ the measures.

    How to be objective: validate the measures!

    Cluster validation process: the dataset X is fed to the clustering algorithm with different numbers of clusters m; the resulting partitions are scored by a validity index, which selects the best number of clusters m*.

  • Dataset X, objective function F

    Algorithms: A1, A2, ..., Ak

    Question: which algorithm is the best for this objective function?

    R1 = A1(X), R2 = A2(X), ..., Rk = Ak(X)

    Compare F(R1), F(R2), ..., F(Rk)
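
The comparison above can be sketched with SSE as the objective F. The toy points and the two stand-in "algorithms" (nearest-seed assignment versus a deliberately naive alternating assignment) are assumptions for illustration:

```python
def sse(points, assignment, k):
    """Objective F: sum of squared distances of points to their cluster centroid."""
    total = 0.0
    for c in range(k):
        members = [p for p, a in zip(points, assignment) if a == c]
        if not members:
            continue
        cx = sum(x for x, _ in members) / len(members)
        cy = sum(y for _, y in members) / len(members)
        total += sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in members)
    return total

def a1(points, k, seeds):
    """Toy algorithm A1: assign each point to its nearest seed."""
    return [min(range(k), key=lambda c: (x - seeds[c][0]) ** 2 + (y - seeds[c][1]) ** 2)
            for x, y in points]

def a2(points, k):
    """Toy algorithm A2: alternate cluster labels, ignoring geometry."""
    return [i % k for i in range(len(points))]

# Two well-separated groups of three points each.
points = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.15), (0.8, 0.9), (0.9, 0.8), (0.85, 0.85)]
r1 = a1(points, 2, seeds=[(0.15, 0.15), (0.85, 0.85)])
r2 = a2(points, 2)

# Lower F(R) means a better partition under this objective.
print(sse(points, r1, 2), sse(points, r2, 2))
```

Under this objective, A1 recovers the two groups (F = 0.02) while A2's partition scores far worse.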


  • 1. Determining the clustering tendency of a set of data, i.e., determining whether non-random (cluster) structure actually exists in the data.

    2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.

    3. Evaluating how well the results of a cluster analysis fit the data without reference to external information.

    - Use only the data

    4. Comparing the results of two different cluster analyses to determine which is better.

    5. Determining the correct number of clusters.

    For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.


  • We need a framework to interpret any measure. For example, if our evaluation measure has the value 10, is that good, fair, or poor?

    Statistics provides a framework for cluster validity: the more atypical a clustering result is, the more likely it represents valid structure in the data.

    We can compare the values of an index obtained on random data or random clusterings to those obtained on the actual clustering result.

    If the observed index value is unlikely under randomness, then the cluster results are valid.

    These approaches are more complicated and harder to understand.

    For comparing the results of two different cluster analyses, a framework is less necessary. However, there is still the question of whether the difference between two index values is significant.
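
The statistical framework above can be sketched as a Monte Carlo test. The index (mean nearest-neighbour distance), the two synthetic point clouds, and the number of trials are all assumptions for illustration:

```python
import random

def mean_nn_distance(points):
    """Index: average distance from each point to its nearest neighbour.
    Clustered data tends to score lower than uniformly random data."""
    total = 0.0
    for i, (x, y) in enumerate(points):
        d2 = min((x - u) ** 2 + (y - v) ** 2
                 for j, (u, v) in enumerate(points) if j != i)
        total += d2 ** 0.5
    return total / len(points)

rng = random.Random(42)

# Hypothetical clustered data: two tight Gaussian groups on the unit square.
clustered = [(rng.gauss(0.25, 0.03), rng.gauss(0.25, 0.03)) for _ in range(15)] + \
            [(rng.gauss(0.75, 0.03), rng.gauss(0.75, 0.03)) for _ in range(15)]
observed = mean_nn_distance(clustered)

# Reference distribution: the same index on uniformly random point sets.
trials = 200
null_values = [mean_nn_distance([(rng.random(), rng.random()) for _ in range(30)])
               for _ in range(trials)]

# Empirical p-value: how often random data looks at least as "clustered".
p_value = sum(v <= observed for v in null_values) / trials
print(f"observed index = {observed:.3f}, empirical p-value = {p_value:.3f}")
```

A small p-value says the observed index is atypical for random data, i.e., the structure is unlikely to be an artifact of noise.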

  • Numerical measures that are applied to judge various aspects of cluster validity are classified into the following two types.

    External index: used to measure the extent to which cluster labels match externally supplied class labels, e.g., entropy.

    Internal index: used to measure the goodness of a clustering structure without respect to external information, e.g., the Sum of Squared Errors (SSE).

    Sometimes these are referred to as criteria instead of indices. However, sometimes "criterion" denotes the general strategy, and "index" the numerical measure that implements the criterion.
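
The entropy index named above can be sketched as follows; the example cluster and class labels are hypothetical:

```python
from collections import Counter
from math import log2

def clustering_entropy(cluster_labels, class_labels):
    """External index: weighted average entropy of the class labels inside
    each cluster. 0 means every cluster is pure; larger values are worse."""
    n = len(cluster_labels)
    total = 0.0
    for c in set(cluster_labels):
        members = [cls for cl, cls in zip(cluster_labels, class_labels) if cl == c]
        counts = Counter(members)
        h = -sum((m / len(members)) * log2(m / len(members)) for m in counts.values())
        total += (len(members) / n) * h
    return total

# Hypothetical example: cluster 0 is pure, cluster 1 mixes two classes evenly.
clusters = [0, 0, 0, 1, 1, 1, 1]
classes  = ['a', 'a', 'a', 'a', 'a', 'b', 'b']
print(clustering_entropy(clusters, classes))  # (4/7) * 1 bit ≈ 0.571
```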

  • Measuring clustering validity

    Internal index:

    Validates without external information

    Compares partitions with different numbers of clusters

    Helps determine the number of clusters

    External index:

    Validates against ground truth

    Compares two clusterings (how similar are they?)

  • External validation measures use external information not present in the data to evaluate the extent to which the clustering structure discovered by a clustering algorithm matches some external structure, e.g., the one specified by the given class labels.

    They are rarely available!

  • External measures are based on a matrix that summarizes the number of correct and wrong predictions
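
Such a matrix can be built directly from the two label sequences; the fruit labels below are hypothetical:

```python
from collections import Counter

def contingency_matrix(class_labels, cluster_labels):
    """Count how many objects fall in each (true class, predicted cluster)
    pair: matching pairs are correct predictions, the rest are errors."""
    return Counter(zip(class_labels, cluster_labels))

class_labels   = ['orange', 'orange', 'orange', 'apple', 'apple']
cluster_labels = ['orange', 'orange', 'apple',  'apple', 'apple']

m = contingency_matrix(class_labels, cluster_labels)
print(m[('orange', 'orange')], m[('orange', 'apple')], m[('apple', 'apple')])  # 2 1 2
```

Precision and recall (next slide) are simple ratios of these counts.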

  • Precision and recall

    Precision is the fraction of retrieved instances that are relevant, while recall is the fraction of relevant instances that are retrieved.

    Precision = TP / (TP + FP)

    Recall = TP / (TP + FN)

  • Given a data set D with n objects, assume that there is a partition P = {P1, P2, ..., PK}, where P1 ∪ P2 ∪ ... ∪ PK = D and Pi ∩ Pj = ∅ for 1 ≤ i ≠ j ≤ K, and K is the number of clusters. If the true class labels for the data are given, another partition can be generated