Learning the threshold in Hierarchical Agglomerative Clustering
Kristine Daniels, Christophe Giraud-Carrier
Speaker: Ngai Wang Kay


Page 1:

Learning the threshold in Hierarchical Agglomerative Clustering

Kristine Daniels

Christophe Giraud-Carrier

Speaker: Ngai Wang Kay

Page 2:

Hierarchical clustering

[Figure: a dendrogram over objects d1, d2, d3, with a horizontal line marking the distance threshold at which the tree is cut into clusters]
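As an illustration of cutting a dendrogram at a threshold, here is a minimal sketch using SciPy's hierarchical-clustering utilities (my own code, with invented data values, not from the slides):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Three illustrative 1-D objects d1, d2, d3.
X = np.array([[0.0], [1.0], [5.0]])

# Build the dendrogram bottom-up with single-link agglomeration.
Z = linkage(X, method='single')

# Cutting at threshold 2.0 merges d1 and d2 but leaves d3 on its own.
labels = fcluster(Z, t=2.0, criterion='distance')
print(labels)  # e.g. [1 1 2]: clusters {d1, d2} and {d3}
```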

Page 3:

Distance metric

• Single-link distance metric – the minimum of the simple distances (e.g. Euclidean distances) between the objects in the two clusters.

Page 4:

Distance metric

• Complete-link distance metric – the maximum of the simple distances between the objects in the two clusters.
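To make the two linkage metrics concrete, here is a minimal Python sketch (my own code, not from the slides; the function names are assumptions):

```python
import numpy as np

def pairwise_distances(A, B):
    """All Euclidean distances between rows of cluster A and cluster B."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

def single_link(A, B):
    """Minimum pairwise distance between the two clusters."""
    return pairwise_distances(A, B).min()

def complete_link(A, B):
    """Maximum pairwise distance between the two clusters."""
    return pairwise_distances(A, B).max()

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 0.0], [9.0, 0.0]])
print(single_link(A, B))    # 3.0: closest pair, (1,0) and (4,0)
print(complete_link(A, B))  # 9.0: farthest pair, (0,0) and (9,0)
```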

Page 5:

Threshold determination

• Some applications may just want a set of clusters for a particular threshold instead of a dendrogram.

• A more efficient clustering algorithm may be developed for such cases.

• There are many possible thresholds (see the sketch below for how to enumerate them).

• So, it is hard to determine the threshold that gives an accurate clustering result (based on a measure against the correct clusters).
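The candidate thresholds can be enumerated directly: the clustering only changes when the cut crosses a merge, so the merge distances recorded in the dendrogram list every distinct threshold. A minimal SciPy sketch (my own, with toy data):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.random.rand(10, 2)                # toy data
Z = linkage(X, method='complete')        # complete-link dendrogram
candidate_thresholds = np.sort(Z[:, 2])  # column 2 holds merge distances
print(candidate_thresholds)              # 9 merges -> 9 candidate thresholds
```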

Page 6:

Threshold determination

• Suppose $C_1, \ldots, C_n$ are the correct clusters and $H_1, \ldots, H_m$ are the computed clusters.

• An F-measure is used to determine the accuracy of the computed clusters, built from per-pair precision and recall as follows:

$$P(i, j) = \frac{|C_i \cap H_j|}{|H_j|} \qquad R(i, j) = \frac{|C_i \cap H_j|}{|C_i|}$$

Page 7:

Threshold determination

$$F(i, j) = \frac{2\, P(i, j)\, R(i, j)}{P(i, j) + R(i, j)} \qquad F(i) = \max_{1 \le j \le m} F(i, j)$$

$$F(C) = \frac{1}{N} \sum_{i=1}^{n} |C_i|\, F(i), \quad \text{where } N \text{ is the dataset size}$$
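A minimal Python sketch of this F-measure (my own code, not from the paper; the function and variable names are assumptions):

```python
# Overall F-measure of computed clusters against correct clusters.
# Clusters are represented as sets of object ids.
def f_measure(correct, computed, N):
    """correct: list of sets C_1..C_n; computed: list of sets H_1..H_m;
    N: dataset size. Returns F(C)."""
    total = 0.0
    for C in correct:
        best = 0.0
        for H in computed:
            overlap = len(C & H)          # |C_i intersect H_j|
            if overlap == 0:
                continue
            P = overlap / len(H)          # precision P(i, j)
            R = overlap / len(C)          # recall R(i, j)
            F = 2 * P * R / (P + R)       # harmonic mean F(i, j)
            best = max(best, F)           # F(i) = max_j F(i, j)
        total += len(C) * best            # weight by cluster size |C_i|
    return total / N

# Toy check: two correct clusters, two computed clusters.
print(f_measure([{0, 1, 2}, {3, 4}], [{0, 1}, {2, 3, 4}], N=5))  # 0.8
```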

Page 8:

Semi-supervised algorithm

1. Select a random subset S of the dataset.
2. Label the correct clusters of the data in S.
3. Cluster S using the previous algorithm.
4. Compute the F-measure value for each threshold in the dendrogram.
5. Find the threshold with the highest F-measure value.
6. Cluster the dataset using this threshold (see the sketch below).
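A minimal Python sketch of these six steps (my own reconstruction, not the authors' code). It reuses the f_measure function sketched earlier; label_correct_clusters is a hypothetical stand-in for the human labelling of step 2:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def learn_threshold(X, sample_size=50, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Select a random subset S of the dataset.
    idx = rng.choice(len(X), size=sample_size, replace=False)
    S = X[idx]
    # 2. Label the correct clusters of the data in S (human-supplied).
    #    Hypothetical helper; must return the correct clusters as sets of
    #    positions 0..len(S)-1 within the sample S.
    correct = label_correct_clusters(idx)
    # 3. Cluster S hierarchically (complete-link, as in the experiments).
    Z = linkage(S, method='complete')
    # 4.-5. Score every merge height as a candidate threshold; keep the best.
    best_t, best_f = None, -1.0
    for t in Z[:, 2]:
        labels = fcluster(Z, t=t, criterion='distance')
        computed = [set(np.flatnonzero(labels == c)) for c in np.unique(labels)]
        f = f_measure(correct, computed, N=len(S))
        if f > best_f:
            best_t, best_f = t, f
    return best_t

# 6. Cluster the full dataset with the learned threshold:
# labels = fcluster(linkage(X, method='complete'),
#                   t=learn_threshold(X), criterion='distance')
```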

Page 9:

Sample set

• Preliminary experiments show that a sample set of size 50 gives reasonable clustering results.

• The time complexity of hierarchical clustering is usually O(N²) or higher, measured in simple distance computations and numerical comparisons.

• So, learning the threshold on a small sample incurs only a very small cost compared to clustering the whole dataset.
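As a rough illustration: for the largest dataset used later (Kr-vs-Kp, N = 3196), an O(N²) clustering performs on the order of 3196² ≈ 1.0 × 10⁷ pairwise-distance computations, whereas clustering a 50-point sample takes about 50² = 2500, roughly 4000 times fewer.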

Page 10:

Experimental results

• Experiments are conducted with complete-link clustering on various real datasets from the UCI repository (http://www.ics.uci.edu/~mlearn/mlrepository.html).

• These datasets were originally collected for classification problems.

• The class labels of the data are used as the cluster labels in these experiments.

Page 11:

Experimental results

Dataset            Size   # Classes
Breast-Wisconsin    699       2
Car                1728       4
Diabetes            768       2
Glass               214       2
Hepatitis           155       2
Ionosphere          351       2
Kr-vs-Kp           3196       2
Tic-Tac-Toe         958       2
Vehicle             946       4

Page 12:

Experimental results

Dataset            Target threshold   Learned threshold
Breast-Wisconsin        13.17               11.91
Car                      7.35                6.68
Diabetes                 8.84               11.61
Glass                    9.39                8.06
Hepatitis               17.12               14.50
Ionosphere              24.81               24.00
Kr-vs-Kp              1605.28               50.37
Tic-Tac-Toe              7.52                7.45
Vehicle                 13.09                6.11

Page 13:

Experimental results

• Because of the nature of the data, there may be many good threshold values.

• So, a large difference between the target and learned thresholds need not yield a large difference between the corresponding F-measure values.

Page 14:

Experimental results

Dataset            F-measure (Target / Learned)   # Clusters (Target / Learned)
Breast-Wisconsin           0.97 / 0.97                      2 / 2
Car                        0.90 / 0.64                      2 / 5
Diabetes                   0.71 / 0.65                     13 / 4
Glass                      0.82 / 0.82                     11 / 13
Hepatitis                  0.77 / 0.77                      1 / 2
Ionosphere                 0.69 / 0.66                      1 / 2
Kr-vs-Kp                   0.67 / 0.67                      1 / 4
Tic-Tac-Toe                0.69 / 0.58                      1 / 2
Vehicle                    0.46 / 0.31                      3 / 36

Page 15:

Experimental results

• The Vehicle dataset shows a huge difference in the number of clusters but a moderate difference in the F-measure.

• The Car dataset suffers a serious drop in F-measure, even though the difference in the number of clusters is small.

• These anomalies may be explained, in part, by the sparseness of the data, the skewness of the underlying class distributions, and the fact that the cluster labels are derived from classification labels.

Page 16:

Experimental results

• The Diabetes dataset achieves an F-measure value close to optimal with fewer clusters when using the learned threshold.

• In summary, the learned threshold achieves clustering results close to the optimal ones at a fraction of the computational cost of clustering the whole dataset.

Page 17:

Conclusion

• Hierarchical clustering does not produce a single clustering result but a dendrogram, a series of nested clusters based on distance thresholds.

• This leads to the open problem of choosing the preferred threshold.

• An efficient semi-supervised algorithm is proposed to obtain such a threshold.

• Experimental results show that the clustering results obtained using the learned threshold are close to optimal.