Learning the threshold in Hierarchical Agglomerative Clustering
Kristine Daniels, Christophe Giraud-Carrier
Speaker: Ngai Wang Kay


Page 1:

Learning the threshold in Hierarchical Agglomerative Clustering

Kristine Daniels

Christophe Giraud-Carrier

Speaker: Ngai Wang Kay

Page 2:

Hierarchical clustering

[Figure: a dendrogram over objects d1, d2, d3, with a horizontal line marking the distance threshold at which the tree is cut into clusters]
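As an illustration of cutting a dendrogram at a threshold, here is a minimal sketch using SciPy's hierarchical-clustering utilities (my own code, with invented data values, not from the slides):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Three illustrative 1-D objects d1, d2, d3.
X = np.array([[0.0], [1.0], [5.0]])

# Build the dendrogram bottom-up with single-link agglomeration.
Z = linkage(X, method='single')

# Cutting at threshold 2.0 merges d1 and d2 but leaves d3 on its own.
labels = fcluster(Z, t=2.0, criterion='distance')
print(labels)  # e.g. [1 1 2]: clusters {d1, d2} and {d3}
```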

Page 3:

Distance metric

• Single-link distance metric – the minimum of the simple distances (e.g. Euclidean distances) between the objects in the two clusters.

Page 4:

Distance metric

• Complete-link distance metric – the maximum of the simple distances between the objects in the two clusters.
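To make the two linkage metrics concrete, here is a minimal Python sketch (my own code, not from the slides; the function names are assumptions):

```python
import numpy as np

def pairwise_distances(A, B):
    """All Euclidean distances between rows of cluster A and cluster B."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

def single_link(A, B):
    """Minimum pairwise distance between the two clusters."""
    return pairwise_distances(A, B).min()

def complete_link(A, B):
    """Maximum pairwise distance between the two clusters."""
    return pairwise_distances(A, B).max()

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 0.0], [9.0, 0.0]])
print(single_link(A, B))    # 3.0: closest pair, (1,0) and (4,0)
print(complete_link(A, B))  # 9.0: farthest pair, (0,0) and (9,0)
```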

Page 5:

Threshold determination

• Some applications may just want a set of clusters for a particular threshold instead of a dendrogram.

• A more efficient clustering algorithm may be developed for such cases.

• There are many possible thresholds (see the sketch below for how to enumerate them).

• So, it is hard to determine the threshold that gives an accurate clustering result (based on a measure against the correct clusters).
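The candidate thresholds can be enumerated directly: the clustering only changes when the cut crosses a merge, so the merge distances recorded in the dendrogram list every distinct threshold. A minimal SciPy sketch (my own, with toy data):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.random.rand(10, 2)                # toy data
Z = linkage(X, method='complete')        # complete-link dendrogram
candidate_thresholds = np.sort(Z[:, 2])  # column 2 holds merge distances
print(candidate_thresholds)              # 9 merges -> 9 candidate thresholds
```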

Page 6:

Threshold determination

• Suppose $C_1, \ldots, C_n$ are the correct clusters and $H_1, \ldots, H_m$ are the computed clusters.

• An F-measure is used to determine the accuracy of the computed clusters, built from per-pair precision and recall as follows:

$$P(i, j) = \frac{|C_i \cap H_j|}{|H_j|} \qquad R(i, j) = \frac{|C_i \cap H_j|}{|C_i|}$$

Page 7:

Threshold determination

$$F(i, j) = \frac{2\, P(i, j)\, R(i, j)}{P(i, j) + R(i, j)} \qquad F(i) = \max_{1 \le j \le m} F(i, j)$$

$$F(C) = \frac{1}{N} \sum_{i=1}^{n} |C_i|\, F(i), \quad \text{where } N \text{ is the dataset size}$$
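A minimal Python sketch of this F-measure (my own code, not from the paper; the function and variable names are assumptions):

```python
# Overall F-measure of computed clusters against correct clusters.
# Clusters are represented as sets of object ids.
def f_measure(correct, computed, N):
    """correct: list of sets C_1..C_n; computed: list of sets H_1..H_m;
    N: dataset size. Returns F(C)."""
    total = 0.0
    for C in correct:
        best = 0.0
        for H in computed:
            overlap = len(C & H)          # |C_i intersect H_j|
            if overlap == 0:
                continue
            P = overlap / len(H)          # precision P(i, j)
            R = overlap / len(C)          # recall R(i, j)
            F = 2 * P * R / (P + R)       # harmonic mean F(i, j)
            best = max(best, F)           # F(i) = max_j F(i, j)
        total += len(C) * best            # weight by cluster size |C_i|
    return total / N

# Toy check: two correct clusters, two computed clusters.
print(f_measure([{0, 1, 2}, {3, 4}], [{0, 1}, {2, 3, 4}], N=5))  # 0.8
```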

Page 8:

Semi-supervised algorithm

1. Select a random subset S of the dataset.
2. Label the correct clusters of the data in S.
3. Cluster S using the previous algorithm.
4. Compute the F-measure value for each threshold in the dendrogram.
5. Find the threshold with the highest F-measure value.
6. Cluster the dataset using this threshold (see the sketch below).
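A minimal Python sketch of these six steps (my own reconstruction, not the authors' code). It reuses the f_measure function sketched earlier; label_correct_clusters is a hypothetical stand-in for the human labelling of step 2:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def learn_threshold(X, sample_size=50, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Select a random subset S of the dataset.
    idx = rng.choice(len(X), size=sample_size, replace=False)
    S = X[idx]
    # 2. Label the correct clusters of the data in S (human-supplied).
    #    Hypothetical helper; must return the correct clusters as sets of
    #    positions 0..len(S)-1 within the sample S.
    correct = label_correct_clusters(idx)
    # 3. Cluster S hierarchically (complete-link, as in the experiments).
    Z = linkage(S, method='complete')
    # 4.-5. Score every merge height as a candidate threshold; keep the best.
    best_t, best_f = None, -1.0
    for t in Z[:, 2]:
        labels = fcluster(Z, t=t, criterion='distance')
        computed = [set(np.flatnonzero(labels == c)) for c in np.unique(labels)]
        f = f_measure(correct, computed, N=len(S))
        if f > best_f:
            best_t, best_f = t, f
    return best_t

# 6. Cluster the full dataset with the learned threshold:
# labels = fcluster(linkage(X, method='complete'),
#                   t=learn_threshold(X), criterion='distance')
```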

Page 9:

Sample set

• Preliminary experiments show that a sample set of size 50 gives reasonable clustering results.

• The time complexity of hierarchical clustering is usually O(N²) or higher, measured in simple distance computations and numerical comparisons.

• So, learning the threshold on a small sample incurs only a very small cost compared to clustering the whole dataset.
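As a rough illustration: for the largest dataset used later (Kr-vs-Kp, N = 3196), an O(N²) clustering performs on the order of 3196² ≈ 1.0 × 10⁷ pairwise-distance computations, whereas clustering a 50-point sample takes about 50² = 2500, roughly 4000 times fewer.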

Page 10:

Experimental results

• Experiments are conducted with complete-link clustering on various real datasets from the UCI repository (http://www.ics.uci.edu/~mlearn/mlrepository.html).

• These datasets were originally collected for classification problems.

• The class labels of the data are used as the cluster labels in these experiments.

Page 11:

Experimental results

Dataset            Size   # Classes
Breast-Wisconsin    699       2
Car                1728       4
Diabetes            768       2
Glass               214       2
Hepatitis           155       2
Ionosphere          351       2
Kr-vs-Kp           3196       2
Tic-Tac-Toe         958       2
Vehicle             946       4

Page 12:

Experimental results

Dataset            Target threshold   Learned threshold
Breast-Wisconsin        13.17               11.91
Car                      7.35                6.68
Diabetes                 8.84               11.61
Glass                    9.39                8.06
Hepatitis               17.12               14.50
Ionosphere              24.81               24.00
Kr-vs-Kp              1605.28               50.37
Tic-Tac-Toe              7.52                7.45
Vehicle                 13.09                6.11

Page 13:

Experimental results

• Because of the nature of the data, there may be many good threshold values.

• So, a large difference between the target and learned thresholds need not yield a large difference between the corresponding F-measure values.

Page 14:

Experimental results

Dataset            F-measure (Target / Learned)   # Clusters (Target / Learned)
Breast-Wisconsin           0.97 / 0.97                      2 / 2
Car                        0.90 / 0.64                      2 / 5
Diabetes                   0.71 / 0.65                     13 / 4
Glass                      0.82 / 0.82                     11 / 13
Hepatitis                  0.77 / 0.77                      1 / 2
Ionosphere                 0.69 / 0.66                      1 / 2
Kr-vs-Kp                   0.67 / 0.67                      1 / 4
Tic-Tac-Toe                0.69 / 0.58                      1 / 2
Vehicle                    0.46 / 0.31                      3 / 36

Page 15:

Experimental results

• The Vehicle dataset shows a huge difference in the number of clusters but a moderate difference in the F-measure.

• The Car dataset suffers a serious drop in F-measure, even though the difference in the number of clusters is small.

• These anomalies may be explained, in part, by the sparseness of the data, the skewness of the underlying class distributions, and the fact that the cluster labels are derived from classification labels.

Page 16:

Experimental results

• The Diabetes dataset achieves an F-measure value close to optimal with fewer clusters when using the learned threshold.

• In summary, the learned threshold achieves clustering results close to the optimal ones at a fraction of the computational cost of clustering the whole dataset.

Page 17:

Conclusion

• Hierarchical clustering does not produce a single clustering result but a dendrogram, a series of nested clusters based on distance thresholds.

• This leads to the open problem of choosing the preferred threshold.

• An efficient semi-supervised algorithm is proposed to obtain such a threshold.

• Experimental results show that the clustering results obtained using the learned threshold are close to optimal.