Object Oriented Data Analysis, Last Time
• Finished Q-Q Plots
– Assess variability with Q-Q Envelope Plot
• SigClust
– When is a cluster “really there”?
– Statistic: 2-means Cluster Index
– Gaussian null distribution
– Fit to data (for HDLSS data, using invariance)
– P-values by simulation
– Breast Cancer Data
More on K-Means Clustering
Classical Algorithm (from MacQueen, 1967)
• Start with initial means
• Cluster: assign each data point to the closest mean
• Recompute each class mean
• Stop when assignments no longer change
Demo from: http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html
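The four steps listed above can be sketched in a few lines of Python. This is a minimal illustration, not the code behind the demo applet: the function name `kmeans`, the `max_iter` safeguard, the rule for empty classes, and the six toy data points are all assumptions made for the example.

```python
import math

def kmeans(points, centers, max_iter=100):
    """Batch k-means following the four steps on the slide (illustrative)."""
    centers = [tuple(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        # Cluster: assign each data point to its closest mean.
        new_labels = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        # Stop when no assignment changes.
        if new_labels == labels:
            break
        labels = new_labels
        # Recompute each class mean (an empty class keeps its old mean).
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centers

# Hypothetical toy data: two tight groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (6.0, 5.0), (5.0, 6.0)]
labels, centers = kmeans(points, centers=[(0.0, 0.0), (5.0, 5.0)])
print(labels)   # class label of each point
print(centers)  # final class means
```

With these starting means the first assignment already separates the two groups, so the loop stops on its second pass.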
More on K-Means Clustering
Raw Data
2 Starting Centers
More on K-Means Clustering
Assign Each Data Point To Nearest Center
Recompute Mean
Re-assign
More on K-Means Clustering
Recompute Mean
Re-Assign Data Points To Nearest Center
More on K-Means Clustering
Recompute Mean
Re-Assign Data Points To Nearest Center
More on K-Means Clustering
Recompute Mean
Final Assignment
More on K-Means Clustering
New Example
Raw Data
Deliberately Strange Starting Centers
More on K-Means Clustering
Assign Clusters To Given Means
Note poor clustering
More on K-Means Clustering
Recompute Mean
Re-assign
Shows Improvement
More on K-Means Clustering
Recompute Mean
Re-assign
Shows Improvement
Now very good
More on K-Means Clustering
Different Example
Best 2-means Cluster?
Local Minima?
More on K-Means Clustering
Assign
Recompute Mean
Re-assign
Note poor clustering
More on K-Means Clustering
Recompute Mean
Final Assignment
Stuck in Local Min
More on K-Means Clustering
Same Data
But slightly different starting points
Impact???
More on K-Means Clustering
Assign
Recompute Mean
Re-assign
Note poor clustering
More on K-Means Clustering
Recompute Mean
Final Assignment
Now get Global Min
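The dependence on starting points can be reproduced with a very small example. The sketch below is an illustration under stated assumptions: the four data points (corners of a 4-by-1 rectangle) and both pairs of starting means are hypothetical and chosen for clarity, they are not the data from the demo, and the two starts are not "slightly" different. The helper `within_ss` is the within-class sum of squares, the quantity that k-means decreases at each step.

```python
import math

def kmeans(points, centers, max_iter=100):
    """Batch k-means: assign to closest mean, recompute means, repeat."""
    centers = [tuple(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        if new_labels == labels:  # no change: stop
            break
        labels = new_labels
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centers

def within_ss(points, labels, centers):
    """Sum of squared distances from each point to its class mean."""
    return sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))

# Hypothetical data: corners of a rectangle, 4 wide and 1 high.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

# Start 1: means stacked vertically, gives a bottom/top split that never changes.
stuck_labels, stuck_centers = kmeans(points, [(2.0, 0.0), (2.0, 1.0)])
# Start 2: means placed left and right, gives the left/right split.
best_labels, best_centers = kmeans(points, [(0.0, 0.5), (4.0, 0.5)])

print(stuck_labels, within_ss(points, stuck_labels, stuck_centers))
print(best_labels, within_ss(points, best_labels, best_centers))
```

Both runs stop because no assignment changes, but the first stops at a within-class sum of squares of 16 while the second reaches 1, so the first is stuck in a local minimum.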
More on K-Means Clustering
??? Next time:
Redo above, using my own Matlab calculations
That way can show each step
And get right answers.
More on K-Means Clustering
Now explore starting values:
• Approach: randomly choose 2 data points as starting means
• Does this give stable solutions?
• Explore for different point configurations
• And try 100 random choices
• Do 2-d examples for easy visualization
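A sketch of this experiment, written in Python rather than the Matlab used for the slides. Everything specific here is an assumption for illustration: the seed, the sample sizes, the cluster locations, and the function names. The Cluster Index (CI) is taken to be the within-class sum of squares about the class means divided by the total sum of squares about the overall mean, so values near 0 indicate a clean split.

```python
import math
import random

def kmeans(points, centers, max_iter=100):
    """Batch k-means: assign to closest mean, recompute means, repeat."""
    centers = [tuple(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centers

def cluster_index(points, labels, centers):
    """Within-class sum of squares over total sum of squares."""
    mean = tuple(sum(c) / len(points) for c in zip(*points))
    within = sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))
    total = sum(math.dist(p, mean) ** 2 for p in points)
    return within / total

rng = random.Random(0)  # arbitrary seed, for reproducibility only
# Hypothetical 2-d normal mixture: two well separated clusters of 20 points.
points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(20)]
points += [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(20)]

cis = []
for _ in range(100):
    start = rng.sample(points, 2)  # 2 randomly chosen data points
    labels, centers = kmeans(points, start)
    cis.append(cluster_index(points, labels, centers))

print(min(cis), max(cis))
```

If the smallest and largest CI agree, every random start ended at the same answer; a spread between them signals local minima.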
More on K-Means Clustering
2 Clusters: Raw Data (Normal mixture)
More on K-Means Clustering
2 Clusters: Cluster Index, based on 100 Random Starts
More on K-Means Clustering
2 Clusters: Chosen Clustering
More on K-Means Clustering
2 Clusters Results
• All starts end up with good answer
• Answer is very good (CI = 0.03)
• No obvious local minima
More on K-Means Clustering
Stretched Gaussian: Raw Data
More on K-Means Clustering
Stretched Gaussian: CI, based on 100 Random Starts
More on K-Means Clustering
Stretched Gaussian: Chosen Clustering
More on K-Means Clustering
Stretched Gaussian Results
• All starts end up with same answer
• Answer is less good (CI = 0.35)
• No obvious local minima
More on K-Means Clustering
Standard Gaussian: Raw Data
More on K-Means Clustering
Standard Gaussian: CI, based on 100 Random Starts
More on K-Means Clustering
Standard Gaussian: Chosen Clustering
More on K-Means Clustering
Standard Gaussian Results
• All starts end up with same answer
• Answer even less good (CI = 0.62)
• No obvious local minima
• So still stable, despite poor CI
More on K-Means Clustering
4 Balanced Clusters: Raw Data (Normal mixture)
More on K-Means Clustering
4 Balanced Clusters: CI, based on 100 Random Starts
More on K-Means Clustering
4 Balanced Clusters 100 Random Starts
• Many different solutions appear
• I.e. there are many local minima
• Sorting on CI (bottom) shows how many
• 2 seem smaller than others
• What are other local minima?
Understand with deeper visualization
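Sorting the final CIs and tabulating repeated values is one simple way to see how many distinct local minima the random starts reach. The sketch below assumes a hypothetical data set of four balanced clusters at the corners of a square, an arbitrary seed, and rounding to four decimals to decide that two CIs are "the same"; none of these come from the slides. Equal CIs do not prove identical clusterings, so this is only a first look.

```python
import math
import random
from collections import Counter

def kmeans(points, centers, max_iter=100):
    """Batch k-means: assign to closest mean, recompute means, repeat."""
    centers = [tuple(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centers

def cluster_index(points, labels, centers):
    """Within-class sum of squares over total sum of squares."""
    mean = tuple(sum(c) / len(points) for c in zip(*points))
    within = sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))
    return within / sum(math.dist(p, mean) ** 2 for p in points)

rng = random.Random(1)  # arbitrary seed
corners = [(0, 0), (0, 6), (6, 0), (6, 6)]  # hypothetical cluster centers
points = [
    (cx + rng.gauss(0, 1), cy + rng.gauss(0, 1))
    for cx, cy in corners
    for _ in range(10)
]

cis = []
for _ in range(100):
    labels, centers = kmeans(points, rng.sample(points, 2))  # 2-means
    cis.append(cluster_index(points, labels, centers))

sorted_cis = sorted(cis)
counts = Counter(round(ci, 4) for ci in sorted_cis)
for ci, n in sorted(counts.items()):
    print(f"CI = {ci:.4f}: reached from {n} of 100 starts")
```

Each printed row corresponds to one plateau in a plot of the sorted CIs.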
More on K-Means Clustering
4 Balanced Clusters: Class Assignment Image Plot
More on K-Means Clustering
4 Balanced Clusters: Vertically Regroup (better view?)
More on K-Means Clustering
4 Balanced Clusters: Choose cases to “flip” – color cases
More on K-Means Clustering
4 Balanced Clusters: “flip”, shows local min clusters
More on K-Means Clustering
4 Balanced Clusters: sort columns, for better visualization
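The "flip" step is needed because 2-means labels are arbitrary: two runs can find the same split while calling the classes 0/1 in opposite order. A minimal sketch of one way to do the flip, assuming 0/1 labels and the convention that the first case is always labelled 0 (the function name `canonical` and the four toy runs are hypothetical):

```python
def canonical(labels):
    """Flip 0/1 labels when needed so that the first case is labelled 0."""
    return tuple(lab ^ labels[0] for lab in labels)

# Hypothetical class assignments from four random starts (one row per start).
runs = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],  # same split as the first run, labels swapped
    [0, 1, 1, 1],
    [1, 0, 0, 0],  # same split as the third run, labels swapped
]

flipped = [canonical(r) for r in runs]
# Sorting groups identical rows, so each distinct clustering forms one band.
image_rows = sorted(flipped)
distinct = sorted(set(flipped))
print(image_rows)
print(len(distinct), "distinct clusterings")
```

After flipping, identical rows in the class assignment image correspond to starts that ended in the same local minimum.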
More on K-Means Clustering
4 Balanced Clusters: CI, based on 100 Random Starts
More on K-Means Clustering
4 Balanced Clusters: Color according to local minima
More on K-Means Clustering
4 Balanced Clusters: Chosen Clustering, smallest CI
More on K-Means Clustering
4 Balanced Clusters: Chosen Clustering, 2nd smallest CI
More on K-Means Clustering
4 Balanced Clusters: Chosen Clustering, larger 3rd CI
More on K-Means Clustering
4 Balanced Clusters: Chosen Clustering, larger 4th CI
More on K-Means Clustering
4 Balanced Clusters: Chosen Clustering, larger 5th CI
More on K-Means Clustering
4 Balanced Clusters: Chosen Clustering, larger 6th CI
More on K-Means Clustering
4 Balanced Clusters Results
• Many Local Minima
• Two good ones appear often (2-2 splits)
• 4 worse ones (1-3 splits, less common)
• 1 with single strange point
• Overall very unstable
• Raises concern over starting values
More on K-Means Clustering
4 Unbalanced Clusters: Raw Data (try for stability)
More on K-Means Clustering
4 Unbalanced Clusters: CI, based on 100 Random Starts
More on K-Means Clustering
4 Unbalanced Clusters: Recolor by CI
More on K-Means Clustering
4 Unbalanced Clusters: Chosen Clustering, smallest CI
More on K-Means Clustering
4 Unbalanced Clusters: Chosen Clustering, 2nd smallest CI
More on K-Means Clustering
4 Unbalanced Clusters: Chosen Clustering, larger 3rd CI
More on K-Means Clustering
4 Unbalanced Clusters Results
• Fewer Local Minima (more stable)
• Two good ones appear often (2-2 splits)
• Single 1-3 split less common
• Previous instability caused by balance?
• Maybe stability OK after all?
More on K-Means Clustering
Data on Circle: Raw Data (maximal instability?)
More on K-Means Clustering
Data on Circle: CI, based on 100 Random Starts
More on K-Means Clustering
Data on Circle: Recolor by CI
More on K-Means Clustering
Data on Circle: Chosen Clustering, smallest CI
More on K-Means Clustering
Data on Circle: Chosen Clustering, 2nd smallest CI
More on K-Means Clustering
Data on Circle: Chosen Clustering, 3rd smallest CI
More on K-Means Clustering
Data on Circle Results
• Seems to have many local minima
• Several are the same?
• Could be programming error?
• But clear this is an unstable example
K-Means Clustering Caution
• This is all a personal view
• Others would present different aspects
• E.g. replace Euclidean dist. by others
• E.g. other types of clustering
• E.g. heat-map dendrogram views
…
SigClust Breast Cancer Data
K-means Clustering & Starting Values
Try 100 random Starts
For full data set: Study Final CIs
• Shows just two solutions
Study changes in data, with image view
• Shows little difference between these
Overall: Typical for clusters that can split
When split is clear, easily find it
SigClust Random Restarts, Full Data
SigClust Breast Cancer Data
For full Chuck Class (e.g. Luminal B): Study Final CIs
• Shows several solutions
Study changes in data, with image view
• Shows multiple, divergent minima
Overall: Typical for “terminal” clusters
When no clear split, many local optima appear
Could base test on number of local optima???
SigClust Random Restarts, Luminal B
SigClust Breast Cancer Data
??? Next time: show many more of these
To better build this case….