Sawtooth 2012: What's in a Label?

Sawtooth Conference 2012, Orlando, Florida. What's in a label? The business value of hard versus soft clustering, by Nicole Huyghe and Anita Prinzie.

What’s in a Label? Business value of “soft” vs “hard” cluster ensembles

solutions-2

Nicole Huyghe & Anita Prinzie

Answers the who and the why

Theme 1, Theme 2, Theme 3, ..., Theme 9, Theme 10

Cluster Ensemble

HARD OR SOFT CLUSTER ENSEMBLE

Stability Integrity Accuracy Size

Stability

Similarity Index (Lange et al., 2004): the percentage of pairs of observations that belong to the same cluster in both clustering C and clustering C’.
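A minimal Python sketch of this pair-counting idea. It reads the definition literally and divides by the total number of observation pairs, which is an assumption; the slides do not state the denominator, and the function name is illustrative only.

```python
from itertools import combinations


def similarity_index(labels_c, labels_c_prime):
    """Share of observation pairs placed in the same cluster by both clusterings.

    Assumption: the denominator is the total number of pairs.
    """
    if len(labels_c) != len(labels_c_prime):
        raise ValueError("both clusterings must label the same observations")
    pairs = list(combinations(range(len(labels_c)), 2))
    if not pairs:
        raise ValueError("need at least two observations")
    together_in_both = sum(
        1
        for i, j in pairs
        if labels_c[i] == labels_c[j] and labels_c_prime[i] == labels_c_prime[j]
    )
    return together_in_both / len(pairs)
```

Under this reading, a higher value means the two clusterings keep more of the same observations together, that is, a more stable solution.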

Cluster Integrity – Heterogeneity

Total separation of clusters: based on the distance between cluster centers
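The slides do not give a formula for total separation. One common formulation, from the SD validity index in the Halkidi et al. reference, can be sketched in Python as follows; treating it as the measure used here is an assumption, and the function name is illustrative.

```python
import math


def total_separation(centers):
    """Total separation (Dis) of a set of cluster centers.

    Assumed form: (D_max / D_min) * sum_k 1 / (sum_z ||v_k - v_z||),
    where D_max and D_min are the largest and smallest distances
    between two distinct centers.
    """
    k = len(centers)
    if k < 2:
        raise ValueError("need at least two cluster centers")
    dist = [[math.dist(a, b) for b in centers] for a in centers]
    pairwise = [dist[i][j] for i in range(k) for j in range(i + 1, k)]
    d_max, d_min = max(pairwise), min(pairwise)
    if d_min == 0:
        raise ValueError("two cluster centers coincide")
    return (d_max / d_min) * sum(1.0 / sum(row) for row in dist)
```

In this formulation a lower value means the centers lie further apart and more evenly spread, so lower is better.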

Cluster Integrity - Homogeneity

Scatter (compactness): average ratio of the cluster variance to the variance of the dataset.
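A minimal Python sketch of the scatter ratio. It assumes the Halkidi et al. form, where "variance" is the Euclidean norm of the per-dimension (population) variance vector; the slides do not spell this out, and the helper names are illustrative.

```python
import math


def _variance_norm(points):
    """Euclidean norm of the per-dimension population variance vector."""
    n = len(points)
    dims = len(points[0])
    means = [sum(p[d] for p in points) / n for d in range(dims)]
    variances = [
        sum((p[d] - means[d]) ** 2 for p in points) / n for d in range(dims)
    ]
    return math.sqrt(sum(v * v for v in variances))


def scatter(points, labels):
    """Average ratio of within-cluster variance to the variance of the dataset."""
    total = _variance_norm(points)
    if total == 0:
        raise ValueError("dataset has zero variance")
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    ratios = [_variance_norm(members) / total for members in clusters.values()]
    return sum(ratios) / len(ratios)
```

Lower scatter means more compact, more homogeneous clusters.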

Accuracy

Adjusted Rand Index (Hubert and Arabie, 1985): level of agreement between the predicted segment and the real segment correcting for the expected level of agreement.
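The Hubert and Arabie correction can be sketched in plain Python by counting pairs in the contingency table of real versus predicted segments. The function name and the handling of the degenerate case are illustrative choices, not taken from the slides.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(real, predicted):
    """Pair-counting agreement between two partitions, corrected for chance."""
    if len(real) != len(predicted):
        raise ValueError("both partitions must label the same observations")
    n = len(real)
    if n < 2:
        raise ValueError("need at least two observations")
    # Pairs that sit together in a cell, a real segment, a predicted segment.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(real, predicted)).values())
    sum_real = sum(comb(c, 2) for c in Counter(real).values())
    sum_pred = sum(comb(c, 2) for c in Counter(predicted).values())
    expected = sum_real * sum_pred / comb(n, 2)  # agreement expected by chance
    maximum = (sum_real + sum_pred) / 2
    if maximum == expected:
        # Degenerate case, e.g. both partitions are a single segment.
        return 1.0
    return (sum_cells - expected) / (maximum - expected)
```

A value of 1 means the predicted segments match the real ones exactly (up to relabelling), values near 0 mean agreement no better than chance, and negative values mean less agreement than chance.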

Figure: observations 1 to 9 grouped into segments, Reality versus Prediction

Size

Uniformity deviation: average deviation of each segment's size from the uniform segment size (1 / number of segments).
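A minimal Python sketch of the uniformity deviation. It assumes segment size is measured as a share of all observations and that "deviation" means absolute deviation; neither detail is stated on the slide, and the function name is illustrative.

```python
from collections import Counter


def uniformity_deviation(labels):
    """Mean absolute gap between each segment's share and 1 / number of segments."""
    if not labels:
        raise ValueError("need at least one observation")
    counts = Counter(labels)
    n_segments = len(counts)
    uniform_share = 1 / n_segments
    shares = [count / len(labels) for count in counts.values()]
    return sum(abs(share - uniform_share) for share in shares) / n_segments
```

For example, a two-segment solution that splits the sample 75% / 25% deviates by 0.25 on average from the uniform 50% / 50% split, while a perfectly balanced solution scores 0.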

Data sets: Rheumatism, Osteoporosis, Software journey

Methods: Soft CCEA, Soft LC, Hard LC, Hard CCEA

Criteria: Stability, Heterogeneity, Accuracy, Homogeneity

Outcomes: H>S on some comparisons, S>H on others

LC gives smaller segments

MIXED EVIDENCE

Fixed Factors

x 10100 100 100 100

High confidence

Low confidence

High confidence

Low confidence

Stability: SOFT is better

Sim. Index soft > hard; Sim. Index hard > soft

Conditions: strong or weak similarity, high or low confidence

Homogeneity: SOFT is better

Scatter hard > soft

Conditions: strong or weak similarity, high or low confidence

Heterogeneity: HARD is better

Tot. Sep. soft > hard

Conditions: strong or weak similarity, high or low confidence

Size: HARD is better

Uni. dev. soft > hard

Conditions: strong or weak similarity, high or low confidence

HARD ENSEMBLES GIVE BETTER BUSINESS SEGMENTS

rising questions do we cause

Anita Prinzie, Nicole Huyghe

anita@solutions2.be

www.solutions2.be

References

• Fred, A.L.N. and Jain, A.K. (2005), Combining Multiple Clusterings Using Evidence Accumulation, IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6), 835-850.

• Lange, T., Roth, V., Braun, M.L. and Buhmann, J.M. (2004), Stability-Based Validation of Clustering Solutions, Neural Computation, 16, 1299-1323.

• Halkidi, M., Vazirgiannis, M. and Batistakis, Y. (2000), Quality Scheme Assessment in the Clustering Process, Proc. of the 4th European Conference on Principles of Data Mining and Knowledge Discovery, 265-276.

• Hubert, L. and Arabie, P. (1985), Comparing Partitions, Journal of Classification, 2(1), 193-218.

• Nieweglowski, L. (2007), CLV package, R software.

• Martin, A.D., Quinn, K.M. and Park, J.H. (2003-2012), Markov Chain Monte Carlo Package (MCMCpack), R software.
