04/22/23 1
Hierarchical Stability Based Model Selection for Data Clustering
Bing Yin
Advisor: Greg Hamerly
Roadmap
What is clustering?
What is model selection for clustering algorithms?
Stability Based Model Selection: Proposals and Problems
Hierarchical Stability Based Model Selection
   ● Algorithm
   ● Unimodality Test
   ● Experiments
Future work
Main Contribution
● Extended the concept of stability to hierarchical stability.
● Solved the symmetric data sets problem.
● Made stability a competitive tool for model selection.
What is clustering?
Given:
● a data set of “objects”
● some relations between those objects: similarities, distances, neighborhoods, connections, …
Goal: Find meaningful groups of objects such that
● objects in the same group are “similar”
● objects in different groups are “dissimilar”
Clustering is:
● a form of unsupervised learning
● a method of data exploration
What is clustering? An Example
Image Segmentation
Microarray Analysis: Serum Stimulation of Human Fibroblasts (Eisen, Spellman, PNAS, 1998)
● 9800 spots representing 8600 genes
● 12 samples taken over a 24-hour period
● Clusters can be roughly categorized as genes involved in
   A: cholesterol biosynthesis
   B: the cell cycle
   C: the immediate-early response
   D: signaling and angiogenesis
   E: wound healing and tissue remodeling
Document Clustering
Post-search Grouping
Data Mining
Social Network Analysis
Gene Family Grouping
…
What is clustering? An Algorithm
K-Means algorithm (Lloyd, 1957)
Given: data points $X_1,\dots,X_n \in \mathbb{R}^d$, number $K$ of clusters to find.
1. Randomly initialize the centers $m_1^0,\dots,m_K^0$.
2. Iterate until convergence:
   2.1 Assign each point to the closest center according to Euclidean distance, i.e., define clusters $C_1^{i+1},\dots,C_K^{i+1}$ by
       $X_s \in C_k^{i+1}$ where $\|X_s - m_k^i\|^2 \le \|X_s - m_l^i\|^2$ for $l = 1,\dots,K$
   2.2 Compute the new cluster centers by
       $m_k^{i+1} = \sum_{X_s \in C_k^{i+1}} X_s / |C_k^{i+1}|$
What is optimized?
Minimizing within-cluster distances:
$\min_{C_1,\dots,C_K} \sum_{k=1}^{K} \sum_{X_s \in C_k} \|X_s - m_k\|^2$
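The two steps of the K-Means iteration can be sketched in a few lines of plain Python. This is only an illustrative sketch: the names `kmeans`, `squared_distance` and `within_cluster_distance` are hypothetical, the initial centers are drawn from the data points (the slide only says "randomly initialize"), and the toy data at the end is made up for the example.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd's iteration; returns (centers, labels)."""
    rng = random.Random(seed)
    # Step 1: use K distinct data points as initial centers (an assumption).
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Step 2.1: assign each point to its closest center.
        new_labels = [min(range(k), key=lambda j: squared_distance(p, centers[j]))
                      for p in points]
        if new_labels == labels:  # assignments unchanged: converged
            break
        labels = new_labels
        # Step 2.2: move each center to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return centers, labels

def within_cluster_distance(points, centers, labels):
    """The quantity K-Means minimizes: summed squared distance to own center."""
    return sum(squared_distance(p, centers[lab]) for p, lab in zip(points, labels))

# Toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = kmeans(points, 2)
objective = within_cluster_distance(points, centers, labels)
```

Like any Lloyd-style iteration, this only reaches a local minimum of the objective, which is one reason repeated runs can disagree.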
What is model selection?
Clustering algorithms need to know K before running.
The correct value of K for a given data set is unknown.
So we need a better way to find this K and also the positions of the K centers.
This can be intuitively called model selection for clustering algorithms.
Existing model selection methods:
● Bayesian Information Criterion
● Gap statistic
● Projection Test
● …
● Stability based approach
Stability Based Model Selection
The basic idea:
● scientific truth should be reproducible in experiments.
Repeatedly run a clustering algorithm on the same data with parameter K and get a collection of clusterings:
● If K is the correct model, the clusterings should be similar to each other
● If K is a wrong model, the clusterings may be quite different from each other
This property is referred to as the stability of K (Ulrike von Luxburg, 2007)
Stability Based Model Selection (2)
Example on the toy data:
If we can mathematically define this stability score for K, then stability can be used to find the correct model for the given data.
Define the Stability
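Variation of Information (VI)
● Clustering $C_1$: $X_1,\dots,X_k$ and clustering $C_2$: $X'_1,\dots,X'_k$ on data $X$
● The probability that a point p belongs to $X_i$ is: $P(i) = |X_i| / |X|$
● The entropy of $C_1$: $H(C_1) = -\sum_{i=1}^{k} P(i) \log P(i)$
● The joint probability that p is in $X_i$ and $X'_j$ is $P(i,j) = |X_i \cap X'_j| / |X|$, with entropy: $H(C_1,C_2) = -\sum_{i=1}^{k} \sum_{j=1}^{k} P(i,j) \log P(i,j)$
● The VI is defined as: $VI(C_1,C_2) = 2H(C_1,C_2) - H(C_1) - H(C_2)$
VI measures the distance between two clusterings.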
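As a rough illustration, the VI between two clusterings can be computed directly from their label vectors. The sketch below assumes the standard entropy-based form, VI = 2·H(C1, C2) − H(C1) − H(C2), with natural logarithms; the function name `variation_of_information` and the example label vectors are hypothetical.

```python
import math
from collections import Counter

def variation_of_information(labels_a, labels_b):
    """VI between two clusterings given as equal-length label sequences."""
    n = len(labels_a)
    count_a = Counter(labels_a)                 # cluster sizes |X_i|
    count_b = Counter(labels_b)                 # cluster sizes |X'_j|
    count_ab = Counter(zip(labels_a, labels_b)) # overlap sizes |X_i ∩ X'_j|

    def entropy(counts):
        # -sum p log p over the non-empty cells
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    return 2.0 * entropy(count_ab) - entropy(count_a) - entropy(count_b)

# Same partition with renamed labels: distance 0.
vi_same = variation_of_information([0, 0, 1, 1], [1, 1, 0, 0])
# Two "orthogonal" partitions of four points: distance 2*ln(2).
vi_different = variation_of_information([0, 0, 1, 1], [0, 1, 0, 1])
```

Because only co-occurrence counts are used, the score does not depend on how the clusters happen to be numbered in each run.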
Define the stability (2)
Calculate the VI score for a single K
● Cluster the data using K-Means with K clusters, and repeat this M times
● Calculate the pairwise VI of these M clusterings
● Average the VI values and use the average as the VI score for K
The calculated VI score for K indicates the instability of K.
Try this over different values of K.
The K with the lowest VI score (instability) is chosen as the correct model.
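The scoring procedure can be sketched as follows, assuming the M label vectors for each candidate K have already been produced by repeated clustering runs. The names `instability`, `choose_k` and the hand-written label vectors in `runs` are hypothetical and only serve the illustration.

```python
import math
from collections import Counter
from itertools import combinations

def variation_of_information(labels_a, labels_b):
    """VI = 2*H(A,B) - H(A) - H(B), computed from label counts."""
    n = len(labels_a)

    def entropy(labels):
        return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

    joint = list(zip(labels_a, labels_b))
    return 2.0 * entropy(joint) - entropy(labels_a) - entropy(labels_b)

def instability(labelings):
    """Average pairwise VI over the M clusterings obtained for one K."""
    pairs = list(combinations(labelings, 2))
    return sum(variation_of_information(a, b) for a, b in pairs) / len(pairs)

def choose_k(labelings_by_k):
    """Pick the K whose repeated clusterings have the lowest average VI."""
    scores = {k: instability(runs) for k, runs in labelings_by_k.items()}
    return min(scores, key=scores.get), scores

# Made-up label vectors: three runs per candidate K on four data points.
runs = {
    2: [[0, 0, 1, 1], [1, 1, 0, 0], [0, 0, 1, 1]],   # always the same split
    3: [[0, 0, 1, 2], [0, 1, 1, 2], [0, 1, 2, 2]],   # split changes per run
}
best_k, scores = choose_k(runs)
```

In this toy input K = 2 always yields the same partition, so its instability is zero and it is selected.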
Define the Stability (3)
A good example of stability
A bad example of stability: symmetric data
Why? Because clustering the data into 9 clusters clearly allows more grouping choices than clustering it into 3.
Hierarchical Stability
Problems with the concept of stability introduced above:
● Symmetric Data Set
● Only local optimization – the smaller K
Proposed solution
● Analyze the stability in a hierarchical manner
● Do a Unimodality Test to detect the termination of the recursion
Hierarchical Stability
Given: Data set X
HS-means:
● 1. Test whether X is a unimodal cluster
● 2. If X is not unimodal, find the optimal K for X by analyzing stability; otherwise, X is a single cluster, return.
● 3. Partition X into K subsets
● 4. For each subset, recursively perform this algorithm from step 1
● 5. Merge the answers from each subset as the answer for the current data
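The recursion in these five steps can be sketched with the three building blocks passed in as functions. Everything below is an assumption made for illustration: the name `hs_means`, its signature, and the deliberately crude 1-D stand-ins for the unimodality test, the stability-based choice of K, and the partitioning step.

```python
def hs_means(data, is_unimodal, choose_k, partition):
    """Recursive skeleton of the five steps; returns a list of clusters."""
    # Steps 1-2: a unimodal (or trivially small) data set is one cluster.
    if len(data) < 2 or is_unimodal(data):
        return [data]
    # Step 2: otherwise pick K for this level (e.g. by stability analysis).
    k = choose_k(data)
    # Step 3: partition the data into K subsets, dropping empty ones.
    subsets = [s for s in partition(data, k) if s]
    if len(subsets) < 2:  # guard against a split that separates nothing
        return [data]
    # Steps 4-5: recurse into each subset and merge the answers.
    clusters = []
    for subset in subsets:
        clusters.extend(hs_means(subset, is_unimodal, choose_k, partition))
    return clusters

# Toy 1-D stand-ins (not the tests from the slides):
def toy_is_unimodal(data):
    return max(data) - min(data) < 3        # "narrow" sets count as unimodal

def toy_choose_k(data):
    return 2                                 # always split in two

def toy_partition(data, k):
    mid = (min(data) + max(data)) / 2.0      # split at the middle of the range
    return [[x for x in data if x <= mid], [x for x in data if x > mid]]

clusters = hs_means([0, 1, 2, 10, 11, 12, 30, 31, 32],
                    toy_is_unimodal, toy_choose_k, toy_partition)
```

Passing the tests in as callables keeps the recursion separate from the choice of unimodality test, which the slides treat as interchangeable.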
Unimodality Test - $\chi^2$ Unimodality Test
Fact: the sum of squared Gaussians follows a $\chi^2$ distribution.
● If $x_1,\dots,x_d$ are d independent standard Gaussian variables, then $S = x_1^2 + \dots + x_d^2$ follows a $\chi^2$ distribution with d degrees of freedom.
For a given data set X, calculate $S_i = X_{i1}^2 + \dots + X_{id}^2$
● If X is a single Gaussian, then S follows a $\chi^2$ distribution with d degrees of freedom
● Otherwise, S does not follow a $\chi^2$ distribution.
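One way to turn this fact into a number is to compare the empirical distribution of the S_i values with the χ² distribution. The sketch below is an assumption-laden illustration, not the test used in the slides: it standardizes each coordinate separately (so it assumes uncorrelated coordinates with non-zero variance), measures the fit with a Kolmogorov-Smirnov distance, and uses the hypothetical names `chi2_cdf` and `chi2_ks_statistic`.

```python
import math
import random

def chi2_cdf(s, d):
    """CDF of the chi-square distribution with d degrees of freedom (series form)."""
    if s <= 0:
        return 0.0
    a, x = d / 2.0, s / 2.0
    if x > a + 700:  # far right tail; avoids underflow in exp() below
        return 1.0
    term = math.exp(a * math.log(x) - x - math.lgamma(a + 1))
    total, n = term, 1
    while term > 1e-15 * total:
        term *= x / (a + n)
        total += term
        n += 1
    return min(total, 1.0)

def chi2_ks_statistic(data):
    """Largest gap between the empirical CDF of S_i and the chi-square CDF."""
    n, d = len(data), len(data[0])
    columns = list(zip(*data))
    means = [sum(col) / n for col in columns]
    stds = [math.sqrt(sum((v - m) ** 2 for v in col) / n)
            for col, m in zip(columns, means)]
    # S_i = sum of squared standardized coordinates of point i.
    s_values = sorted(
        sum(((v - m) / sd) ** 2 for v, m, sd in zip(row, means, stds))
        for row in data
    )
    worst = 0.0
    for i, s in enumerate(s_values):
        f = chi2_cdf(s, d)
        worst = max(worst, (i + 1) / n - f, f - i / n)
    return worst

rng = random.Random(0)
one_gaussian = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(2000)]
two_gaussians = [[rng.gauss(c, 1), rng.gauss(c, 1)]
                 for c in (-5, 5) for _ in range(1000)]
stat_one = chi2_ks_statistic(one_gaussian)   # small: S looks chi-square
stat_two = chi2_ks_statistic(two_gaussians)  # large: S does not
```

A small statistic is consistent with a single Gaussian; where to put the accept/reject threshold is a separate choice that this sketch leaves open.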
Unimodality Test - Gap Test
Fact: the within-cluster dispersion drops most sharply at the correct K (Tibshirani, 2000)
Given: data set X, candidate k
● cluster X into k clusters and get the within-cluster dispersion $W_k$
● generate uniform data sets, cluster each into k clusters, and calculate $W^*_k$ (averaged)
● $\mathrm{gap}(k) = W^*_k - W_k$
● select the smallest k such that $\mathrm{gap}(k) > \mathrm{gap}(k+1)$
● we use it in another way: we only ask whether k = 1
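The "is k = 1?" question can be sketched as a comparison of gap(1) with gap(2). The sketch below makes several assumptions that go beyond the slide: it compares log-dispersions, as in the published gap statistic, rather than raw dispersions; it omits the standard-error correction; it draws reference data uniformly from the bounding box of X; and it uses a bare-bones k-means. The names `dispersion`, `gap` and `looks_like_one_cluster` are hypothetical.

```python
import math
import random

def dispersion(points, k, rng):
    """Within-cluster sum of squared distances after a basic k-means run."""
    centers = rng.sample(points, k)
    labels = None
    for _ in range(100):
        new_labels = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return sum(sum((a - b) ** 2 for a, b in zip(p, centers[lab]))
               for p, lab in zip(points, labels))

def gap(points, k, rng, n_refs=10):
    """Average log-dispersion of uniform reference sets minus that of the data."""
    lows = [min(c) for c in zip(*points)]
    highs = [max(c) for c in zip(*points)]
    ref_logs = []
    for _ in range(n_refs):
        ref = [[rng.uniform(lo, hi) for lo, hi in zip(lows, highs)]
               for _ in points]
        ref_logs.append(math.log(dispersion(ref, k, rng)))
    return sum(ref_logs) / n_refs - math.log(dispersion(points, k, rng))

def looks_like_one_cluster(points, seed=0):
    """Ask only whether k = 1 wins: gap(1) > gap(2)."""
    rng = random.Random(seed)
    return gap(points, 1, rng) > gap(points, 2, rng)

data_rng = random.Random(1)
two_blobs = [[data_rng.gauss(c, 1), data_rng.gauss(0, 1)]
             for c in (0, 10) for _ in range(100)]
two_blobs_single = looks_like_one_cluster(two_blobs)
```

For two well-separated blobs, splitting the data reduces the dispersion far more than splitting uniform reference data does, so gap(2) exceeds gap(1) and the set is not treated as a single cluster.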
Experiments
Synthetic data
● Both Gaussian and uniform distributions
● In dimensions from 2 up to 20
● c-separation between each cluster center and its nearest neighbor is 4
● 200 points in each cluster, 10 clusters in total
Handwritten Digits
● U.S. Postal Service handwritten digits
● 9298 instances in 256 dimensions
● 10 true clusters (maybe!)
KDD Control Curves
● 600 instances in 60 dimensions
● 6 true clusters, each with 100 instances
Method            Synthetic Gaussian    Synthetic Uniform     Handwritten Digits    KDD Control Curves
                  (10 true clusters)    (10 true clusters)    (10 true clusters)    (6 true clusters)
HS-means          10±1                  10±1                  6±0                   6.5±0.5
Lange Stability   6.5±1.5               7±1                   2±0                   3±0
PG-means          10±1                  19.5±1.5              20±1                  17±1
Experiments – symmetric data
[Figure: results on symmetric data; panels: HS-means, Lange Stability]
Future Work
● A better Unimodality Testing approach.
● A more detailed performance comparison with existing methods, e.g., on within-cluster distance, the VI metric and so on.
● Improve the speed of the algorithm.
Questions and Comments
Thank you!