
Learning the threshold in Hierarchical Agglomerative Clustering

Kristine Daniels

Christophe Giraud-Carrier

Speaker: Ngai Wang Kay

Hierarchical clustering

[Figure: dendrogram of objects d1, d2, d3, cut by a horizontal distance threshold into clusters]

Distance metric

• Single-link distance metric – the minimum of the simple distances (e.g. Euclidean distances) between the objects in the two clusters.

Distance metric

• Complete-link distance metric – the maximum of the simple distances between the objects in the two clusters.
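
• The two linkage rules differ only in how they aggregate the pairwise distances. Below is a minimal sketch (not from the paper) of both rules in Python; the helper pairwise_distances and the toy clusters are assumptions for illustration only.

```python
# Minimal sketch of single-link and complete-link cluster distances,
# assuming clusters are given as lists of points (NumPy-compatible).
import numpy as np

def pairwise_distances(cluster_a, cluster_b):
    """All Euclidean distances between points of two clusters (|A| x |B| matrix)."""
    a = np.asarray(cluster_a)
    b = np.asarray(cluster_b)
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)

def single_link(cluster_a, cluster_b):
    """Single-link: minimum simple distance between the two clusters."""
    return pairwise_distances(cluster_a, cluster_b).min()

def complete_link(cluster_a, cluster_b):
    """Complete-link: maximum simple distance between the two clusters."""
    return pairwise_distances(cluster_a, cluster_b).max()

# Example: two tiny clusters on a line.
c1 = [[0.0, 0.0], [1.0, 0.0]]
c2 = [[3.0, 0.0], [5.0, 0.0]]
print(single_link(c1, c2))    # 2.0
print(complete_link(c1, c2))  # 5.0
```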

Threshold determination

• Some applications may just want a set of clusters for a particular threshold instead of a dendrogram.

• A more efficient clustering algorithm can be developed for such a case.

• There are many possible thresholds.

• So, it is hard to determine the threshold that gives an accurate clustering result (based on a measure against the correct clusters).

Threshold determination

• Suppose C1, …, Cn are the correct clusters and H1, …, Hm are the computed clusters.

• An F-measure is used to determine the accuracy of the computed clusters as follows:

$$P(i,j) = \frac{|C_i \cap H_j|}{|H_j|}, \qquad R(i,j) = \frac{|C_i \cap H_j|}{|C_i|}$$

Threshold determination

$$F(i,j) = \frac{2\,P(i,j)\,R(i,j)}{P(i,j) + R(i,j)}, \qquad F(i) = \max_{1 \le j \le m} F(i,j)$$

$$F(C) = \frac{1}{N}\sum_{i=1}^{n} |C_i|\,F(i), \quad \text{where } N \text{ is the dataset size}$$
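
• As an illustration, the following sketch computes this F-measure for clusters given as sets of object indices; the function name f_measure and the toy example are assumptions, not the authors' code.

```python
# Sketch of the cluster F-measure defined above, assuming the correct and
# computed clusters are given as sets of object indices.
def f_measure(correct, computed, n_objects):
    """correct: sets C_1..C_n; computed: sets H_1..H_m; n_objects: N."""
    total = 0.0
    for c in correct:
        best = 0.0
        for h in computed:
            overlap = len(c & h)
            if overlap == 0:
                continue
            p = overlap / len(h)                   # P(i, j) = |C_i ∩ H_j| / |H_j|
            r = overlap / len(c)                   # R(i, j) = |C_i ∩ H_j| / |C_i|
            best = max(best, 2 * p * r / (p + r))  # F(i, j)
        total += len(c) * best                     # weight F(i) by |C_i|
    return total / n_objects                       # F(C) = (1/N) Σ_i |C_i| F(i)

# Example: three correct clusters scored against two computed clusters.
C = [{0, 1, 2}, {3, 4}, {5}]
H = [{0, 1, 2, 3}, {4, 5}]
print(round(f_measure(C, H, 6), 3))  # ~0.706
```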

Semi-supervised algorithm

1. Select a random subset S of the dataset.
2. Label the correct clusters of the data in S.
3. Cluster S using the previous algorithm.
4. Compute the F-measure value for each threshold in the dendrogram.
5. Find the threshold with the highest F-measure value.
6. Cluster the dataset using this threshold.
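
• A hedged sketch of these six steps is given below, using SciPy's complete-link linkage and fcluster in place of the paper's own clustering implementation; it assumes data is a NumPy array of feature vectors, labels a NumPy array of class labels, and reuses the f_measure helper from the previous sketch. The sample size of 50 and the name learn_threshold are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def learn_threshold(data, labels, sample_size=50, seed=0):
    """Learn a distance threshold from a small labelled sample of the data."""
    rng = np.random.default_rng(seed)

    # Steps 1-2: draw a random subset S and take its correct cluster labels.
    idx = rng.choice(len(data), size=min(sample_size, len(data)), replace=False)
    sample, sample_labels = data[idx], labels[idx]

    # Step 3: cluster S with complete-link hierarchical clustering (Euclidean).
    dendrogram = linkage(sample, method='complete')

    # Steps 4-5: evaluate every merge height as a candidate threshold, keep the best.
    correct = [set(np.flatnonzero(sample_labels == c)) for c in np.unique(sample_labels)]
    best_threshold, best_f = 0.0, -1.0
    for t in dendrogram[:, 2]:
        assignment = fcluster(dendrogram, t=t, criterion='distance')
        computed = [set(np.flatnonzero(assignment == k)) for k in np.unique(assignment)]
        f = f_measure(correct, computed, len(sample))  # helper from the F-measure sketch
        if f > best_f:
            best_threshold, best_f = t, f
    return best_threshold

# Step 6: cluster the whole dataset using the learned threshold, e.g.
#   t = learn_threshold(data, labels)
#   clusters = fcluster(linkage(data, method='complete'), t=t, criterion='distance')
```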

Sample set

• Preliminary experiments show that a sample set of size 50 gives reasonable clustering results.

• The time complexity of the hierarchical clustering is usually O(N²) or higher in simple distance computations and numerical comparisons.

• So, learning the threshold may be a very small cost in comparison to that of clustering the dataset.

Experimental results

• Experiments are conducted by complete-link clustering on various real datasets in the UCI repository (http://www.ics.uci.edu/~mlearn/mlrepository.html).

• These datasets were originally collected for classification problems.

• The class labels of the data are used as the cluster labels in these experiments.

Experimental results

Dataset Size # Classes

Breast-Wisconsin 699 2

Car 1728 4

Diabetes 768 2

Glass 214 2

Hepatitis 155 2

Ionosphere 351 2

Kr-vs-Kp 3196 2

Tic-Tac-Toe 958 2

Vehicle 946 4

Experimental results

Dataset Target threshold Learned threshold

Breast-Wisconsin 13.17 11.91

Car 7.35 6.68

Diabetes 8.84 11.61

Glass 9.39 8.06

Hepatitis 17.12 14.50

Ionosphere 24.81 24.00

Kr-vs-Kp 1605.28 50.37

Tic-Tac-Toe 7.52 7.45

Vehicle 13.09 6.11

Experimental results

• Because of the nature of the data, there may be many good threshold values.

• So, a large difference between the target and learned thresholds does not necessarily yield a large difference between the corresponding F-measure values.

Experimental results

Dataset F-measure (Target / Learned) # Clusters (Target / Learned)

Breast-Wisconsin 0.97 / 0.97 2 / 2

Car 0.90 / 0.64 2 / 5

Diabetes 0.71 / 0.65 13 / 4

Glass 0.82 / 0.82 11 / 13

Hepatitis 0.77 / 0.77 1 / 2

Ionosphere 0.69 / 0.66 1 / 2

Kr-vs-Kp 0.67 / 0.67 1 / 4

Tic-Tac-Toe 0.69 / 0.58 1 / 2

Vehicle 0.46 / 0.31 3 / 36

Experimental results

• The Vehicle dataset shows a huge difference in the number of clusters but a moderate difference in the F-measure.

• The Car dataset suffers from a serious loss of the F-measure, but the difference in the number of clusters is small.

• These anomalies may be explained, in part, by the sparseness of the data, the skewness of the underlying class distributions, and the fact that the cluster labels are derived from the classification labels.

Experimental results

• The Diabetes dataset achieves an F-measure value close to optimal with fewer clusters when using the learned threshold.

• In summary, the learned threshold achieves clustering results close to the optimal ones at a fraction of the computational cost of clustering the whole dataset.

Conclusion

• Hierarchical clustering does not produce a single clustering result but a dendrogram, a series of nested clusters based on distance thresholds.

• This leads to the open problem of choosing the preferred threshold.

• An efficient semi-supervised algorithm is proposed to obtain such a threshold.

• Experimental results show the clustering results obtained using the learned threshold are close to optimal.
