Object Oriented Data Analysis, Last Time


• Finished Q-Q Plots

– Assess variability with Q-Q Envelope Plot

• SigClust

– When is a cluster “really there”?

– Statistic: 2-means Cluster Index

– Gaussian null distribution

– Fit to data (for HDLSS data, using invariance)

– P-values by simulation

– Breast Cancer Data
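
As a reminder of the statistic named above, here is a minimal sketch of the 2-means Cluster Index (CI). It assumes the usual definition: within-class sum of squares about the class means, divided by the total sum of squares about the overall mean. The function name `cluster_index` and the toy data are illustrative only and do not come from the slides.

```python
import numpy as np

def cluster_index(X, labels):
    """Within-class sum of squares about the class means, divided by
    the total sum of squares about the overall mean (assumed definition)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    total_ss = ((X - X.mean(axis=0)) ** 2).sum()
    within_ss = 0.0
    for k in np.unique(labels):
        Xk = X[labels == k]
        within_ss += ((Xk - Xk.mean(axis=0)) ** 2).sum()
    return within_ss / total_ss

# Hypothetical toy data: two tight groups that are far apart.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
ci_good = cluster_index(X, [0, 0, 1, 1])  # split between the groups
ci_bad = cluster_index(X, [0, 1, 0, 1])   # split across the groups
print(ci_good, ci_bad)
```

A CI near 0 indicates a clear split; a CI near 1 means the two classes explain almost none of the variation.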

More on K-Means Clustering

Classical Algorithm (from MacQueen, 1967)

• Start with initial means

• Cluster: assign each data point to the closest mean

• Recompute each class mean

• Stop when no assignments change

Demo from: http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html
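
The four steps of the classical algorithm can be sketched as follows. This is an illustrative implementation, not the applet's code: the function name `k_means`, the `max_iter` safeguard, and the rule that an empty class keeps its old mean are all assumptions made here.

```python
import numpy as np

def k_means(X, init_means, max_iter=100):
    X = np.asarray(X, dtype=float)
    means = np.array(init_means, dtype=float)  # start with initial means
    labels = None
    for _ in range(max_iter):
        # Cluster: each data point goes to its closest mean (Euclidean).
        d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        # Stop when no assignment changes.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Recompute each class mean (an empty class keeps its old mean).
        for k in range(len(means)):
            if np.any(labels == k):
                means[k] = X[labels == k].mean(axis=0)
    return labels, means

# Hypothetical toy data: corners of a wide rectangle.
X = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]])
labels, means = k_means(X, init_means=[[0.0, 0.0], [4.0, 0.0]])
print(labels, means)
```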

More on K-Means Clustering

Raw Data

2 Starting Centers

More on K-Means Clustering

Assign Each Data Point To Nearest Center

Recompute Mean

Re-assign

More on K-Means Clustering

Recompute Mean

Re-Assign Data Points To Nearest Center

More on K-Means Clustering

Recompute Mean

Re-Assign Data Points To Nearest Center

More on K-Means Clustering

Recompute Mean

Final Assignment

More on K-Means Clustering

New Example

Raw Data

Deliberately Strange Starting Centers

More on K-Means Clustering

Assign Clusters To Given Means

Note poor clustering

More on K-Means Clustering

Recompute Mean

Re-assign

Shows Improvement

More on K-Means Clustering

Recompute Mean

Re-assign

Shows Improvement

Now very good

More on K-Means Clustering

Different Example

Best 2-means Cluster?

Local Minima?

More on K-Means Clustering

Assign

Recompute Mean

Re-assign

Note poor clustering

More on K-Means Clustering

Recompute Mean

Final Assignment

Stuck in Local Min

More on K-Means Clustering

Same Data

But slightly different starting points

Impact???

More on K-Means Clustering

Assign

Recompute Mean

Re-assign

Note poor clustering

More on K-Means Clustering

Recompute Mean

Final Assignment

Now get Global Min
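
The effect of starting points can be reproduced on a tiny example. The sketch below is not the data from the demo: it uses four hypothetical points at the corners of a wide rectangle, and the helper `two_means_ci` is an assumed name. One pair of starting means gets stuck in a local minimum, while a slightly different pair reaches the global minimum.

```python
import numpy as np

def two_means_ci(X, starts, max_iter=100):
    """Run the assign/recompute loop from the given starting means and
    return (labels, Cluster Index)."""
    means = np.array(starts, dtype=float)
    labels = None
    for _ in range(max_iter):
        d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for k in range(len(means)):
            if np.any(labels == k):
                means[k] = X[labels == k].mean(axis=0)
    within = sum(((X[labels == k] - means[k]) ** 2).sum()
                 for k in range(len(means)))
    total = ((X - X.mean(axis=0)) ** 2).sum()
    return labels, within / total

X = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]])

# Starts stacked vertically: the top/bottom split is already stable.
labels_stuck, ci_stuck = two_means_ci(X, [[2.0, 0.0], [2.0, 1.0]])
# Slightly different starts: the left/right split is found.
labels_good, ci_good = two_means_ci(X, [[1.0, 0.0], [3.0, 1.0]])
print(labels_stuck, ci_stuck)
print(labels_good, ci_good)
```

Both runs stop because no assignment changes, yet the first CI is 16/17 and the second is 1/17, so stopping alone does not certify a global minimum.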

More on K-Means Clustering

??? Next time:

Redo above, using my own Matlab calculations

That way can show each step

And get right answers.

More on K-Means Clustering

Now explore starting values:

• Approach: randomly choose 2 data points as starting means

• Does this give stable solutions?

• Explore for different point configurations

• And try 100 random choices

• Do 2-d examples for easy visualization
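
The experiment outlined above might look like the following sketch. The data are a hypothetical 2-d normal mixture generated here, not the data behind the slides. The seed, sample sizes, and helper name `two_means_ci` are likewise assumptions.

```python
import numpy as np

def two_means_ci(X, starts, max_iter=100):
    """Assign/recompute loop from the given starting means; returns the CI."""
    means = np.array(starts, dtype=float)
    labels = None
    for _ in range(max_iter):
        d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for k in range(len(means)):
            if np.any(labels == k):
                means[k] = X[labels == k].mean(axis=0)
    within = sum(((X[labels == k] - means[k]) ** 2).sum()
                 for k in range(len(means)))
    return within / ((X - X.mean(axis=0)) ** 2).sum()

rng = np.random.default_rng(0)
# Hypothetical 2-d normal mixture with two components of 50 points each.
X = np.vstack([rng.normal([-3.0, 0.0], 1.0, size=(50, 2)),
               rng.normal([3.0, 0.0], 1.0, size=(50, 2))])

cis = []
for _ in range(100):
    # Starting means: 2 distinct data points chosen at random.
    idx = rng.choice(len(X), size=2, replace=False)
    cis.append(two_means_ci(X, X[idx]))
cis = np.sort(np.array(cis))
print(cis[0], cis[-1])
```

Plotting the sorted `cis` gives a view of stability: a flat line suggests a single solution, while steps suggest several local minima.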

More on K-Means Clustering

2 Clusters: Raw Data (Normal mixture)

More on K-Means Clustering

2 Clusters: Cluster Index, based on 100 Random Starts

More on K-Means Clustering

2 Clusters: Chosen Clustering

More on K-Means Clustering

2 Clusters Results

• All starts end up with good answer

• Answer is very good (CI = 0.03)

• No obvious local minima

More on K-Means Clustering

Stretched Gaussian: Raw Data

More on K-Means Clustering

Stretched Gaussian: CI, based on 100 Random Starts

More on K-Means Clustering

Stretched Gaussian: Chosen Clustering

More on K-Means Clustering

Stretched Gaussian Results

• All starts end up with same answer

• Answer is less good (CI = 0.35)

• No obvious local minima

More on K-Means Clustering

Standard Gaussian: Raw Data

More on K-Means Clustering

Standard Gaussian: CI, based on 100 Random Starts

More on K-Means Clustering

Standard Gaussian: Chosen Clustering

More on K-Means Clustering

Standard Gaussian Results

• All starts end up with same answer

• Answer even less good (CI = 0.62)

• No obvious local minima

• So still stable, despite poor CI
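
As a rough check on the size of this CI, here is a worked population-level calculation. It is an idealization, not the sample computation behind the slides: it assumes a standard Gaussian in d dimensions split into two classes by a hyperplane through the mean, with CI defined as within-class over total sum of squares.

```latex
\begin{align*}
Z_1 \sim N(0,1): \quad E\,|Z_1| &= \sqrt{2/\pi}, \\
\text{within-class variance along the split direction} &= 1 - \bigl(E\,|Z_1|\bigr)^2 = 1 - \frac{2}{\pi}, \\
\mathrm{CI}_{\mathrm{pop}} &= \frac{(d-1) + \left(1 - 2/\pi\right)}{d} = 1 - \frac{2}{\pi d}, \\
d = 2: \quad \mathrm{CI}_{\mathrm{pop}} &= 1 - \frac{1}{\pi} \approx 0.68 .
\end{align*}
```

Under these assumptions the reported CI = 0.62 is of the expected order. A 2-means fit optimizes the split for the particular sample, so a sample value somewhat below the population figure is plausible.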

More on K-Means Clustering

4 Balanced Clusters: Raw Data (Normal mixture)

More on K-Means Clustering

4 Balanced Clusters: CI, based on 100 Random Starts

More on K-Means Clustering

4 Balanced Clusters 100 Random Starts

• Many different solutions appear

• I.e. there are many local minima

• Sorting on CI (bottom) shows how many

• 2 seem smaller than others

• What are other local minima?

Understand with deeper visualization

More on K-Means Clustering

4 Balanced Clusters: Class Assignment Image Plot

More on K-Means Clustering

4 Balanced Clusters: Vertically Regroup (better view?)

More on K-Means Clustering

4 Balanced Clusters: Choose cases to “flip” – color cases

More on K-Means Clustering

4 Balanced Clusters: Choose cases to “flip” – color cases

More on K-Means Clustering

4 Balanced Clusters: “flip”, shows local min clusters

More on K-Means Clustering

4 Balanced Clusters: sort columns, for better visualization
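
One plausible reading of the “flip” step is label alignment: two runs can give the same split with the class names swapped, and flipping one of them makes the match visible. The sketch below illustrates that idea on made-up class assignments. The layout (rows are restarts, columns are cases), the helper name `canonical_labels`, and the rule "case 0 is always in class 0" are assumptions, not taken from the slides.

```python
import numpy as np

def canonical_labels(labels):
    """Flip a 2-class labelling, if needed, so that case 0 is in class 0.
    Labellings that differ only by swapped class names then coincide."""
    labels = np.asarray(labels)
    return labels if labels[0] == 0 else 1 - labels

# Hypothetical class assignments from 5 restarts (rows) on 6 cases (columns).
runs = np.array([[0, 0, 0, 1, 1, 1],
                 [1, 1, 1, 0, 0, 0],   # same split as row 0, names swapped
                 [0, 1, 0, 1, 0, 1],
                 [1, 0, 1, 0, 1, 0],   # same split as row 2, names swapped
                 [0, 0, 0, 1, 1, 1]])

flipped = np.array([canonical_labels(r) for r in runs])
# Distinct rows after flipping correspond to distinct solutions.
distinct, counts = np.unique(flipped, axis=0, return_counts=True)
print(distinct)
print(counts)
```

After flipping, the five runs reduce to two distinct solutions, which is the kind of grouping the sorted image plot is meant to reveal.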

More on K-Means Clustering

4 Balanced Clusters: CI, based on 100 Random Starts

More on K-Means Clustering

4 Balanced Clusters: Color according to local minima

More on K-Means Clustering

4 Balanced Clusters: Chosen Clustering, smallest CI

More on K-Means Clustering

4 Balanced Clusters: Chosen Clustering, 2nd small CI

More on K-Means Clustering

4 Balanced Clusters: Chosen Clustering, larger 3rd CI

More on K-Means Clustering

4 Balanced Clusters: Chosen Clustering, larger 4th CI

More on K-Means Clustering

4 Balanced Clusters: Chosen Clustering, larger 5th CI

More on K-Means Clustering

4 Balanced Clusters: Chosen Clustering, larger 6th CI

More on K-Means Clustering

4 Balanced Clusters Results

• Many Local Minima

• Two good ones appear often (2-2 splits)

• 4 worse ones (1-3 splits less common)

• 1 with single strange point

• Overall very unstable

• Raises concern over starting values

More on K-Means Clustering

4 Unbalanced Clusters: Raw Data (try for stability)

More on K-Means Clustering

4 Unbalanced Clusters: CI, based on 100 Random Starts

More on K-Means Clustering

4 Unbalanced Clusters: Recolor by CI

More on K-Means Clustering

4 Unbalanced Clusters: Chosen Clustering, smallest CI

More on K-Means Clustering

4 Unbalanced Clusters: Chosen Clustering, 2nd small CI

More on K-Means Clustering

4 Unbalanced Clusters: Chosen Clustering, larger 3rd CI

More on K-Means Clustering

4 Unbalanced Clusters Results

• Fewer Local Minima (more stable)

• Two good ones appear often (2-2 splits)

• Single 1-3 split less common

• Previous instability caused by balance?

• Maybe stability OK after all?

More on K-Means Clustering

Data on Circle: Raw Data (maximal instability?)

More on K-Means Clustering

Data on Circle: CI, based on 100 Random Starts

More on K-Means Clustering

Data on Circle: Recolor by CI

More on K-Means Clustering

Data on Circle: Chosen Clustering, smallest CI

More on K-Means Clustering

Data on Circle: Chosen Clustering, 2nd small CI

More on K-Means Clustering

Data on Circle: Chosen Clustering, 3rd small CI

More on K-Means Clustering

Data on Circle Results

• Seems many local minima

• Several are the same?

• Could be programming error?

• But clear this is an unstable example

K-Means Clustering Caution

• This is all a personal view

• Others would present different aspects

• E.g. replace Euclidean distance by others

• E.g. other types of clustering

• E.g. heat-map dendrogram views

SigClust Breast Cancer Data

K-means Clustering & Starting Values

Try 100 random Starts

For full data set: Study Final CIs

• Shows just two solutions

Study changes in data, with image view

• Shows little difference between these

Overall: Typical for clusters that can split

When split is clear, easily find it

SigClust Random Restarts, Full Data

SigClust Random Restarts, Full Data

SigClust Breast Cancer Data

For full Chuck Class (e.g. Luminal B): Study Final CIs

• Shows several solutions

Study changes in data, with image view

• Shows multiple, divergent minima

Overall: Typical for “terminal” clusters

When no clear split, many local optima appear

Could base test on number of local optima???

SigClust Random Restarts, Luminal B

SigClust Random Restarts, Luminal B

SigClust Breast Cancer Data

??? Next time: show many more of these

To better build this case…