Hierarchical Stability Based Model Selection for Data Clustering

04/22/23 1

Hierarchical Stability Based Model Selection for Data ClusteringBing Yin

Advisor: Greg Hamerly

04/22/23 2

RoadmapWhat is clustering?

What is model selection for clustering algorithms?

Stability Based Model Selection: Proposals and Problems

Hierarchical Stability Based Model Selection ● Algorithm ● Unimodality Test ● Experiments

Future work

Main Contribution● Extended the concept of stability to hierarchical stability.● Solved the symmetric data sets problem.● Make stability a competing tool for model selection.

04/22/23 3

What is clustering?

Given:● data set of “objects”● some relations between those objects: similarities, distances, neighborhoods,

connections,…

Goal: Find meaningful groups of objects s. t. ● objects in the same group are “similar” ● objects in different groups are “dissimilar”

Clustering is: ● a form of unsupervised learning ● a method of data exploration

04/22/23 4

What is clustering? An ExampleImage Segmentation: Micro array Analysis:

Serum Stimulation of Human Fibroblasts(Eisen,Spellman,PNAS,1998)

● 9800 spots representing 8600 genes

● 12 samples taken over 24 hour period

● Clusters can be roughly categorized as gene involved inA: cholesterol biosynthesisB: the cell cycleC: the immediate-early responseD: signaling and angiogenesisE: wound healing and tissue remodelingDocument Clustering

Post-search GroupingData MiningSocial Network AnalysisGene Family Grouping…

04/22/23 5

What is clustering? An Algorithm

K-Means algorithm (Lloyd, 1957)Given: data points X1,…,Xn d, number K clusters to find.1. Randomly initialize the centers m1

0,…,mK0.

2. Iterate until convergence: 2.1 Assign each point to the closest center according to Euclidean distance, i.e., define clusters C1

i+1,…,CKi+1 by

Xs Cki+1 where ||Xs-mk

i||2 < ||Xs-mli||2, l=1 to K

2.2 Compute the new cluster centers bymk

i+1 = Xs / |Cki+1|

What is optimized?Minimizing within-cluster distances:

04/22/23 6

What is model selection? Clustering algorithms need to know the K before running.

The correct answer of K for a given data is unknown

So we need a better way to find this K and also the positions of the K centers

This can be intuitively called model selection for clustering algorithms.

Existing model selection method:● Bayesian Information Criterion● Gap statistics● Projection Test …● Stability based approach

04/22/23 7

Stability Based Model SelectionThe basic idea:

● scientific truth should be reproducible in experiments.

Repeatedly run a clustering algorithm on the same data with parameter K and get a collection of clustering:

● If K is the correct model, clustering should be similar to each other● If K is a wrong model, clustering may be quite different from each other

This fact is referred as the stability of K (Ulrike von Luxburg,2007)

04/22/23 8

Stability Based Model Selection(2)Example on the toy data:

If we can mathematically define this stability score for K, then stability can be used to find the correct model for the given data.

04/22/23 9

Define the Stability

Variation of Information (VI)● Clustering C1: X1,…,Xk and Clustering C2: X’1,…,X’k on date X

● The prob. point p belongs in Xi is :

● The entropy of C1:

● The joint prob. p in Xi and X’j is P(i,j) with entropy:

● The VI is defined as:

VI indicates a distance between two clustering.

04/22/23 10

Define the stability (2)

Calculate the VI score for a single K ● Clustering the data using K-Means for K clusters, run M times ● Calculate pair wise VI of these M clustering. ● Average the VI and use it as the VI score for K

The calculated VI score for K indicates instability of K

Try this over different K

The K with lowest VI score/instability is chosen as the correct model

04/22/23 11

Define the Stability(3)An good example of Stability

An bad example of Stability: symmetric data

Why?Because Clustering data into 9 clusters apparently has more grouping choices than clustering them into 3.

04/22/23 12

Hierarchical StabilityProblems with the concept of stability introduced above:

● Symmetric Data Set● Only local optimization – the smaller K

Proposed solution● Analyze the stability in an hierarchical manner ● Do Unimodality Test to detect the termination of the recursion

04/22/23 13

Hierarchical StabilityGiven: Data set XHS-means:● 1. Test if X is not a unimodal cluster

● 2. If yes, find the optimal K for X by analyzing stability; otherwise, X is a single cluster, return.

● 3. Partition X into K subsets

● 4. For each subset, recursively perform this algorithm from step 1

● 5. Merge answers from each subset as answer for current data

04/22/23 14

Unimodality Test - 2 Unimodality testFact: sum of squared Gaussians follows 2 distribution.

● If x1,…,xd are d independent Gaussian variables, then S = x1

2+…+xd2 follows 2 distribution of degree d.

For given data set X, calculate Si=Xi12+…+Xid

2

● If X is a single Gaussian, then S follows 2 of degree d● Otherwise, S is not a 2 distribution.

04/22/23 15

Unimodality Test - Gap Test

Fact: the within cluster dispersion drops most apparently with the correct K (Tibshirani, 2000)

Given: Data set X, candidate k● cluster X to k clusters and get within cluster dispersion Wk

● generate uniform data sets, cluster to k clusters, calculate W*k (averaged)● gap(k) = W*k – Wk

● select smallest k s. t. gap(k)>gap(k+1)

● we use it in another way: just ask k=1?

04/22/23 16

ExperimentsSynthetic data

● Both Gaussian Distribution and Uniform Distribution● In dimensions from 2 up to 20● c-separation between each cluster center and its nearest neighbor is 4● 200 points in each cluster, 10 clusters in total

Handwritten Digits● U.S. Postal Service handwritten digits● 9298 instances in 256 dimensions● 10 true clusters (maybe!)

KDDD Control Curves● 600 instances in 60 dimensions● 6 true clusters, each has 100 instances

Synthetic Gaussian(10 true clusters)

Synthetic Uniform(10 true clusters)

Handwritten Digits(10 true clusters)

KDDD Control Curves(6 true clusters)

HS-means 101 101 60 6.50.5Lange Stability 6.51.5 71 20 30

PG-means 101 19.51.5 201 171

04/22/23 17

Experiments – symmetric dataHS-means Lange Stability

04/22/23 18

Future Work● Better Unimodality Testing approach.● More detailed comparison on the performance with existing method like within cluster distance, VI metric and so on.● Improve the speed of the algorithm.

04/22/23 19

Questions and Comments

Thank you!

Documents

Hierarchical Stability Based Model Selection for Data Clustering