
Learning the threshold in Hierarchical Agglomerative Clustering

Kristine Daniels

Christophe Giraud-Carrier

Speaker: Ngai Wang Kay

Hierarchical clustering

[Figure: dendrogram of objects d1, d2, d3, cut by a horizontal distance threshold into clusters]

Distance metric

• Single-link distance metric – the minimum of the simple distances (e.g. Euclidean distances) between the objects in the two clusters.

Distance metric

• Complete-link distance metric – the maximum of the simple distances between the objects in the two clusters.
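
• The two linkage rules differ only in how they aggregate the pairwise distances. Below is a minimal sketch (not from the paper) of both rules in Python; the helper pairwise_distances and the toy clusters are assumptions for illustration only.

```python
# Minimal sketch of single-link and complete-link cluster distances,
# assuming clusters are given as lists of points (NumPy-compatible).
import numpy as np

def pairwise_distances(cluster_a, cluster_b):
    """All Euclidean distances between points of two clusters (|A| x |B| matrix)."""
    a = np.asarray(cluster_a)
    b = np.asarray(cluster_b)
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)

def single_link(cluster_a, cluster_b):
    """Single-link: minimum simple distance between the two clusters."""
    return pairwise_distances(cluster_a, cluster_b).min()

def complete_link(cluster_a, cluster_b):
    """Complete-link: maximum simple distance between the two clusters."""
    return pairwise_distances(cluster_a, cluster_b).max()

# Example: two tiny clusters on a line.
c1 = [[0.0, 0.0], [1.0, 0.0]]
c2 = [[3.0, 0.0], [5.0, 0.0]]
print(single_link(c1, c2))    # 2.0
print(complete_link(c1, c2))  # 5.0
```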

Threshold determination

• Some applications may just want a set of clusters for a particular threshold instead of a dendrogram.

• A more efficient clustering algorithm can be developed for such a case.

• There are many possible thresholds.

• So, it is hard to determine the threshold that gives an accurate clustering result (based on a measure against the correct clusters).

Threshold determination

• Suppose C1, …, Cn are the correct clusters and H1, …, Hm are the computed clusters.

• An F-measure is used to determine the accuracy of the computed clusters as follows:

$$P(i,j) = \frac{|C_i \cap H_j|}{|H_j|}, \qquad R(i,j) = \frac{|C_i \cap H_j|}{|C_i|}$$

Threshold determination

$$F(i,j) = \frac{2\,P(i,j)\,R(i,j)}{P(i,j) + R(i,j)}, \qquad F(i) = \max_{1 \le j \le m} F(i,j)$$

$$F(C) = \frac{1}{N}\sum_{i=1}^{n} |C_i|\,F(i), \quad \text{where } N \text{ is the dataset size}$$
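
• As an illustration, the following sketch computes this F-measure for clusters given as sets of object indices; the function name f_measure and the toy example are assumptions, not the authors' code.

```python
# Sketch of the cluster F-measure defined above, assuming the correct and
# computed clusters are given as sets of object indices.
def f_measure(correct, computed, n_objects):
    """correct: sets C_1..C_n; computed: sets H_1..H_m; n_objects: N."""
    total = 0.0
    for c in correct:
        best = 0.0
        for h in computed:
            overlap = len(c & h)
            if overlap == 0:
                continue
            p = overlap / len(h)                   # P(i, j) = |C_i ∩ H_j| / |H_j|
            r = overlap / len(c)                   # R(i, j) = |C_i ∩ H_j| / |C_i|
            best = max(best, 2 * p * r / (p + r))  # F(i, j)
        total += len(c) * best                     # weight F(i) by |C_i|
    return total / n_objects                       # F(C) = (1/N) Σ_i |C_i| F(i)

# Example: three correct clusters scored against two computed clusters.
C = [{0, 1, 2}, {3, 4}, {5}]
H = [{0, 1, 2, 3}, {4, 5}]
print(round(f_measure(C, H, 6), 3))  # ~0.706
```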

Semi-supervised algorithm

1. Select a random subset S of the dataset.
2. Label the correct clusters of the data in S.
3. Cluster S using the previous algorithm.
4. Compute the F-measure value for each threshold in the dendrogram.
5. Find the threshold with the highest F-measure value.
6. Cluster the dataset using this threshold.
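
• A hedged sketch of these six steps is given below, using SciPy's complete-link linkage and fcluster in place of the paper's own clustering implementation; it assumes data is a NumPy array of feature vectors, labels a NumPy array of class labels, and reuses the f_measure helper from the previous sketch. The sample size of 50 and the name learn_threshold are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def learn_threshold(data, labels, sample_size=50, seed=0):
    """Learn a distance threshold from a small labelled sample of the data."""
    rng = np.random.default_rng(seed)

    # Steps 1-2: draw a random subset S and take its correct cluster labels.
    idx = rng.choice(len(data), size=min(sample_size, len(data)), replace=False)
    sample, sample_labels = data[idx], labels[idx]

    # Step 3: cluster S with complete-link hierarchical clustering (Euclidean).
    dendrogram = linkage(sample, method='complete')

    # Steps 4-5: evaluate every merge height as a candidate threshold, keep the best.
    correct = [set(np.flatnonzero(sample_labels == c)) for c in np.unique(sample_labels)]
    best_threshold, best_f = 0.0, -1.0
    for t in dendrogram[:, 2]:
        assignment = fcluster(dendrogram, t=t, criterion='distance')
        computed = [set(np.flatnonzero(assignment == k)) for k in np.unique(assignment)]
        f = f_measure(correct, computed, len(sample))  # helper from the F-measure sketch
        if f > best_f:
            best_threshold, best_f = t, f
    return best_threshold

# Step 6: cluster the whole dataset using the learned threshold, e.g.
#   t = learn_threshold(data, labels)
#   clusters = fcluster(linkage(data, method='complete'), t=t, criterion='distance')
```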

Sample set

• Preliminary experiments show that a sample set of size 50 gives reasonable clustering results.

• The time complexity of the hierarchical clustering is usually O(N²) or higher in simple distance computations and numerical comparisons.

• So, learning the threshold may be a very small cost in comparison to that of clustering the dataset.

Experimental results

• Experiments are conducted by complete-link clustering on various real datasets in the UCI repository (http://www.ics.uci.edu/~mlearn/mlrepository.html).

• These datasets were originally collected for classification problems.

• The class labels of the data are used as the cluster labels in these experiments.

Experimental results

Dataset Size # Classes

Breast-Wisconsin 699 2

Car 1728 4

Diabetes 768 2

Glass 214 2

Hepatitis 155 2

Ionosphere 351 2

Kr-vs-Kp 3196 2

Tic-Tac-Toe 958 2

Vehicle 946 4

Experimental results

Dataset Target threshold Learned threshold

Breast-Wisconsin 13.17 11.91

Car 7.35 6.68

Diabetes 8.84 11.61

Glass 9.39 8.06

Hepatitis 17.12 14.50

Ionosphere 24.81 24.00

Kr-vs-Kp 1605.28 50.37

Tic-Tac-Toe 7.52 7.45

Vehicle 13.09 6.11

Experimental results

• Because of the nature of the data, there may be many good threshold values.

• So, a large difference between the target and learned thresholds does not necessarily yield a large difference between the corresponding F-measure values.

Experimental results

Dataset F-measure (Target / Learned) # Clusters (Target / Learned)

Breast-Wisconsin 0.97 / 0.97 2 / 2

Car 0.90 / 0.64 2 / 5

Diabetes 0.71 / 0.65 13 / 4

Glass 0.82 / 0.82 11 / 13

Hepatitis 0.77 / 0.77 1 / 2

Ionosphere 0.69 / 0.66 1 / 2

Kr-vs-Kp 0.67 / 0.67 1 / 4

Tic-Tac-Toe 0.69 / 0.58 1 / 2

Vehicle 0.46 / 0.31 3 / 36

Experimental results

• The Vehicle dataset shows a huge difference in the number of clusters but a moderate difference in the F-measure.

• The Car dataset suffers from a serious loss of the F-measure, but the difference in the number of clusters is small.

• These anomalies may be explained, in part, by the sparseness of the data, the skewness of the underlying class distributions, and the fact that the cluster labels are derived from the classification labels.

Experimental results

• The Diabetes dataset achieves an F-measure value close to optimal with fewer clusters when using the learned threshold.

• In summary, the learned threshold achieves clustering results close to the optimal ones at a fraction of the computational cost of clustering the whole dataset.

Conclusion

• Hierarchical clustering does not produce a single clustering result but a dendrogram, a series of nested clusters based on distance thresholds.

• This leads to the open problem of choosing the preferred threshold.

• An efficient semi-supervised algorithm is proposed to obtain such a threshold.

• Experimental results show the clustering results obtained using the learned threshold are close to optimal.
