What’s in a Label? Business value of “soft” vs “hard” cluster ensembles
solutions-2
Nicole Huyghe & Anita Prinzie
Answers the who and the why
Theme 1
Theme 2
Theme 3
...
Theme 9
Theme 10
Cluster Ensemble
HARD OR SOFT CLUSTER ENSEMBLE
Stability, Integrity, Accuracy, Size
Stability
Similarity Index (Lange et al., 2004): the percentage of pairs of observations that belong to the same cluster in both clustering C and clustering C’.
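The pair-based definition above can be illustrated with a minimal Python sketch. It assumes the denominator is the number of all observation pairs, which the slide does not state; the function name `pair_similarity` is hypothetical and this is not the implementation used in the study.

```python
from itertools import combinations


def pair_similarity(labels_c, labels_c_prime):
    """Share of observation pairs that are co-clustered in both clusterings.

    Assumption: the denominator is the number of all pairs of observations.
    """
    if len(labels_c) != len(labels_c_prime):
        raise ValueError("both clusterings must label the same observations")
    n = len(labels_c)
    if n < 2:
        raise ValueError("at least two observations are needed to form a pair")

    total_pairs = 0
    same_in_both = 0
    for i, j in combinations(range(n), 2):
        total_pairs += 1
        # The pair counts only if it shares a cluster in C and in C'.
        if labels_c[i] == labels_c[j] and labels_c_prime[i] == labels_c_prime[j]:
            same_in_both += 1
    return same_in_both / total_pairs


# Cluster names differ between the two runs, but the grouping is identical.
print(pair_similarity([0, 0, 1, 1], [1, 1, 0, 0]))
```

Because only co-membership is compared, the result does not depend on how the clusters are numbered in either run.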
Cluster Integrity - Heterogeneity
Total separation of clusters: based on the distance between cluster centers
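A center-distance measure of this kind can be sketched as follows. The sketch assumes the total separation term of the SD validity index from Halkidi et al. (2000), where lower values indicate better separated clusters; the slide itself only says the measure is based on distances between cluster centers, and the function name is hypothetical.

```python
import math
from itertools import combinations


def total_separation(centers):
    """Total separation (Dis), assuming the Halkidi et al. (2000) form:

    Dis = (D_max / D_min) * sum_k 1 / (sum_z ||v_k - v_z||)

    where v_k are the cluster centers and D_max, D_min are the largest and
    smallest distances between two centers.
    """
    if len(centers) < 2:
        raise ValueError("at least two cluster centers are required")

    pair_distances = [math.dist(a, b) for a, b in combinations(centers, 2)]
    d_max, d_min = max(pair_distances), min(pair_distances)
    if d_min == 0:
        raise ValueError("two cluster centers coincide")

    # For each center, sum its distances to all centers (distance to itself is 0).
    inverse_sums = sum(
        1.0 / sum(math.dist(center, other) for other in centers)
        for center in centers
    )
    return (d_max / d_min) * inverse_sums


print(total_separation([(0.0, 0.0), (3.0, 4.0)]))
```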
Cluster Integrity - Homogeneity
Scatter (compactness): average ratio of the cluster variance to the variance of the dataset.
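As an illustration of the scatter ratio, here is a small sketch. It assumes the per-dimension variance vectors and Euclidean norms of the SD validity index (Halkidi et al., 2000) and uses population variances; the helper names are hypothetical.

```python
import math


def variance_vector(points):
    """Per-dimension population variance of a list of equal-length points."""
    n = len(points)
    dims = len(points[0])
    means = [sum(p[d] for p in points) / n for d in range(dims)]
    return [sum((p[d] - means[d]) ** 2 for p in points) / n for d in range(dims)]


def vector_norm(vector):
    return math.sqrt(sum(x * x for x in vector))


def scatter(points, labels):
    """Average ratio of cluster variance to dataset variance (lower = more compact)."""
    if len(points) != len(labels):
        raise ValueError("each point needs exactly one cluster label")

    # Group the points by cluster label.
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    dataset_norm = vector_norm(variance_vector(points))
    if dataset_norm == 0:
        raise ValueError("the dataset has zero variance")

    cluster_norms = [vector_norm(variance_vector(m)) for m in clusters.values()]
    return sum(cluster_norms) / (len(clusters) * dataset_norm)


# Two tight, well separated groups on one dimension.
print(scatter([(0.0,), (2.0,), (10.0,), (12.0,)], [0, 0, 1, 1]))
```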
Accuracy
Adjusted Rand Index (Hubert and Arabie, 1985): level of agreement between the predicted segment and the real segment, corrected for the level of agreement expected by chance.
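The chance correction can be made concrete with a short sketch of the Adjusted Rand Index computed from the contingency table of real versus predicted segments. The function name is hypothetical, and returning 1.0 when both partitions are trivial is a convention chosen here to avoid a division by zero.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(real, predicted):
    """Adjusted Rand Index between real and predicted segment labels."""
    if len(real) != len(predicted):
        raise ValueError("both label lists must cover the same observations")
    total_pairs = comb(len(real), 2)
    if total_pairs == 0:
        raise ValueError("at least two observations are required")

    # Pairs placed together in both partitions, in reality, and in prediction.
    together_both = sum(comb(c, 2) for c in Counter(zip(real, predicted)).values())
    together_real = sum(comb(c, 2) for c in Counter(real).values())
    together_pred = sum(comb(c, 2) for c in Counter(predicted).values())

    expected = together_real * together_pred / total_pairs  # agreement by chance
    maximum = (together_real + together_pred) / 2           # best possible agreement
    if maximum == expected:
        return 1.0  # convention for degenerate partitions (assumption)
    return (together_both - expected) / (maximum - expected)


print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # same grouping, renamed labels
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # every real pair is split
```

A value of 1 means the predicted segments reproduce the real ones exactly, values near 0 mean agreement is no better than chance, and negative values mean it is worse than chance.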
[Figure: observations 1 to 9 grouped into segments, Reality vs. Prediction]
Size
Uniformity deviation: average deviation of the segment sizes from the uniform segment size (1/number of segments).
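A minimal sketch of this size criterion follows. It assumes that segment size is expressed as a share of all observations and that the deviation is taken in absolute value; neither detail is stated on the slide, and the function name is hypothetical.

```python
from collections import Counter


def uniformity_deviation(labels):
    """Mean absolute deviation of segment shares from the uniform share 1/k.

    Assumptions: sizes are shares of all observations, deviations are absolute,
    and k is the number of segments that actually occur in `labels`.
    """
    if not labels:
        raise ValueError("at least one observation is required")
    counts = Counter(labels)
    k = len(counts)
    n = len(labels)
    uniform_share = 1.0 / k
    return sum(abs(count / n - uniform_share) for count in counts.values()) / k


print(uniformity_deviation([0, 0, 0, 1]))  # one large and one small segment
print(uniformity_deviation([0, 0, 1, 1]))  # perfectly balanced segments
```

Under these assumptions a value of 0 means all segments have the same size, and larger values indicate a more uneven split.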
Data sets: Rheumatism, Osteoporosis, Software journey
Methods: Soft CCEA, Soft LC, Hard LC, Hard CCEA
[Figure: Stability, Heterogeneity, Accuracy and Homogeneity by data set and method; panel annotations: H>S, H>S, H>S, H>S, S>H, S>H, S>H]
LC gives smaller segments
MIXED EVIDENCE
Fixed Factors
x 10100 100 100 100
High confidence
Low confidence
Sim. Index soft > hard
Sim. Index hard > soft
Stability: SOFT is better
Strong similarity
Weak similarity
High confidence
Low confidence
Homogeneity: SOFT is better
Scatter hard > soft
Strong similarity
Weak similarity
High confidence
Low confidence
Heterogeneity: Hard is better
Tot. Sep. soft > hard
Strong similarity
Weak similarity
High confidence
Low confidence
Size: Hard is better
Strong similarity
Weak similarity
Uni. dev. soft > hard
High confidence
Low confidence
HARD ENSEMBLES GIVE BETTER BUSINESS SEGMENTS
References
• Fred, A.L.N. and Jain, A.K. (2005), Combining Multiple Clusterings Using Evidence Accumulation, IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6), 835-850.
• Lange, T., Roth, V., Braun, M.L. and Buhmann, J.M. (2004), Stability-Based Validation of Clustering Solutions, Neural Computation, 16, 1299-1323.
• Halkidi, M., Vazirgiannis, M. and Batistakis, Y. (2000), Quality Scheme Assessment in the Clustering Process, Proc. of the 4th European Conference on Principles of Data Mining and Knowledge Discovery, 265-276.
• Hubert, L. and Arabie, P. (1985), Comparing Partitions, Journal of Classification, 193-218.
• Nieweglowski, L. (2007), CLV package, R software.
• Martin, A., Quinn, K.M. and Park, J.H. (2003-2012), Markov Chain Monte Carlo Package (MCMCpack), R software.