CLUSTER ANALYSIS
Introduction
• Cluster analysis is a technique for grouping individuals or objects hierarchically into unknown groups suggested by the data.
• Cluster analysis can be considered an alternative to Factor Analysis.
• Cluster analysis differs from discriminant analysis.
o In cluster analysis the group membership is unknown prior to the analysis.
• In the biological sciences, an area where cluster analysis has been widely used is taxonomy.
o In taxonomy individuals are classified into arbitrary groups based on measurements of the individuals.
o The classification moves from the most general to the most specific.
Kingdom → Phylum → Subphylum → Class → Order → Family → Genus → Species
• In economics, cluster analysis can be used for data mining.
o For example, in a market survey you could classify patrons into groups based on their answers to many questions.
• Warnings for cluster analysis.
o Groupings from cluster analysis can be different based on the method of analysis used.
o Since the groups are not known a priori, it can be difficult to determine if the results make sense in the context of the research being conducted.
o Knowledge of the population you are sampling and common sense are two important tools when it comes to interpreting results from cluster analysis.
Basic Concepts of Cluster Analysis
• Cluster analysis can be divided into two basic steps:
1. Initial analysis of data.
2. Analytical clustering using one of many methods of amalgamation.
Initial analysis
o It is always a good idea before any statistical analysis to plot a scatter diagram of your data to see if there are any irregularities that need to be addressed using a transformation.
o A common transformation in multivariate analyses is to “standardize” your data so that each variable has a mean of 0 and a variance of 1.0:
$$\text{Standardized } Y_i = \frac{Y_i - \bar{Y}}{S_Y}$$
where $\bar{Y}$ is the mean and $S_Y$ is the standard deviation of the variable.
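The standardization above can be sketched in a few lines of Python. This is only an illustration, not the SAS procedure used in these notes: the function name `standardize` and the three sample values are made up, and the sample standard deviation (n - 1 denominator) is assumed.

```python
from statistics import mean, stdev

def standardize(values):
    """Return (y - mean) / standard deviation for every y in values."""
    y_bar = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 denominator), an assumption
    return [(y - y_bar) / s for y in values]

# Hypothetical measurements of one variable
y = [2.0, 4.0, 6.0]
z = standardize(y)
print(z)
```

After the transformation the variable has mean 0 and variance 1, so variables measured on very different scales contribute comparably to the distances computed later.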
o If in visualizing your data you seem to see clusters that are elliptical in shape, you want to use a transformation that will make the resultant pooled within-cluster covariance matrix spherical.
v The PROC ACECLUS (Approximate Covariance Estimation for Clustering) procedure in SAS will perform the transformation.
v Neither cluster membership nor the number of clusters needs to be known.
Analytical clustering
Distance Measures
o Distance measures can be studied in large data sets to determine similarities or clusters.
o The opposite of similarity is distance.
o Distance values can be calculated for each pair of observations.
o Statistical methods to calculate distance are very sensitive to outliers. So you are encouraged to run diagnostics on your data to identify outliers and remove them if necessary.
o The most commonly used distance measurement is the Euclidean distance.
$$\text{Distance}(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$$
o Different methods to determine distance will provide different results.
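As a minimal sketch of the distance step, the Python below computes the Euclidean distance for each pair of observations. The helper names `euclidean` and `pairwise_distances` and the three observations are hypothetical and chosen only for illustration.

```python
import math
from itertools import combinations

def euclidean(x, y):
    """Square root of the summed squared differences between x and y."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def pairwise_distances(rows):
    """Map each pair of row indices (i, j), i < j, to its distance."""
    return {(i, j): euclidean(rows[i], rows[j])
            for i, j in combinations(range(len(rows)), 2)}

# Three hypothetical observations measured on two variables
obs = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
d = pairwise_distances(obs)
for pair, dist in d.items():
    print(pair, dist)
```

With n observations there are n(n - 1)/2 pairs, so a single extreme observation affects n - 1 of the distances, which is one way to see why outliers matter so much.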
Cluster Analysis Process
o In the initial step of the cluster analysis, each individual begins in its own cluster.
o In subsequent rounds of analyses, the closest clusters are merged, so the entries are placed into fewer and fewer clusters.
o At the end of the cluster analysis, all individuals are in the same cluster.
o During the various rounds of cluster analysis, the distances between new clusters must be determined and we need to be able to determine when two clusters are sufficiently close to be linked together.
o Two of the most common methods of cluster analysis are:
§ Unweighted Pair-Group Method with Arithmetic Mean (UPGMA): the distance between any two clusters is the average distance between all pairs of individuals, one from each cluster.
§ Ward’s Method: a minimum variance method that uses an ANOVA approach. At each step of the cluster analysis, the method merges the two clusters whose union gives the smallest increase in the within-cluster sum of squares.
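The two linkage rules can be sketched in Python as below. This is a rough illustration rather than the algorithm PROC CLUSTER implements internally. It follows the agglomerative order reported in the SAS output later in these notes, where every entry starts as its own cluster. The function names and the five sample points are hypothetical, and the Ward cost uses the usual formula n_a n_b / (n_a + n_b) times the squared distance between cluster centroids.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def upgma_distance(a, b):
    """Average distance over all pairs with one member from each cluster."""
    return sum(euclidean(p, q) for p in a for q in b) / (len(a) * len(b))

def centroid(cluster):
    return [sum(col) / len(cluster) for col in zip(*cluster)]

def ward_cost(a, b):
    """Increase in the within-cluster sum of squares if a and b are merged."""
    d2 = sum((u - v) ** 2 for u, v in zip(centroid(a), centroid(b)))
    return len(a) * len(b) / (len(a) + len(b)) * d2

def agglomerate(points, n_clusters, linkage):
    """Merge the closest pair of clusters until n_clusters remain."""
    clusters = [[p] for p in points]  # every point starts as its own cluster
    while len(clusters) > n_clusters:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        # Ties are broken by taking the first minimum encountered
        i, j = min(pairs, key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Five hypothetical points forming three visually obvious groups
points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [10.0, 0.0]]
print(agglomerate(points, 3, upgma_distance))
print(agglomerate(points, 3, ward_cost))
```

On well-separated data like this toy set both rules give the same grouping. On real data they can disagree, as the barley example later in these notes shows.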
Estimating the Number of Clusters
o Three methods that can be used to estimate the number of clusters are the:
1. Cubic clustering criterion (CCC) method: the estimated number of clusters occurs at the start of a peak on the graph. There may be more than one peak per plot.
2. Pseudo F: the estimated number of clusters occurs at the start of peaks on the graph. There may be more than one peak per plot.
3. Pseudo t²: the graph is read right to left. The estimated number of clusters occurs at the start of a peak. There may be more than one peak per plot.
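To make the pseudo F statistic less of a black box, the sketch below computes it from R², assuming it is the usual ratio of between-cluster to within-cluster mean squares, [R²/(k - 1)] / [(1 - R²)/(n - k)]. The function name `pseudo_f` is hypothetical. The inputs are taken from the Ward's method cluster history in the example later in these notes (100 observations, R² = .504 at three clusters and .315 at two clusters).

```python
def pseudo_f(r_square, n_obs, n_clusters):
    """Between-cluster mean square divided by within-cluster mean square."""
    between = r_square / (n_clusters - 1)
    within = (1.0 - r_square) / (n_obs - n_clusters)
    return between / within

# R-square values read from the Ward's method output in the example
print(round(pseudo_f(0.504, 100, 3), 1))
print(round(pseudo_f(0.315, 100, 2), 1))
```

Rounded to one decimal these agree with the Pseudo F Statistic column of that output (49.3 and 45.1), which supports the assumed formula. Small discrepancies are possible elsewhere because the printed R² values are rounded to three decimals.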
Precautions When Using Cluster Analysis
• Unless there is considerable separation between inherent groups when you view the scatter plots, it is not realistic to expect Cluster Analysis to provide clear results.
• Cluster Analysis is very sensitive to outliers.
• Different Cluster Analysis methods may give you very different results.
• If you have large amounts of data, one simple method of validating the results from Cluster Analysis is to conduct the analysis separately on two halves of your data. It is preferable to assign the individuals to the two halves at random.
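The random split for this kind of validation could be done as in the sketch below. The function name `split_halves`, the seed, and the entry numbers 1 to 100 are assumptions for illustration only. Each half would then be clustered separately and the resulting groupings compared.

```python
import random

def split_halves(ids, seed=0):
    """Randomly assign the ids to two halves of (nearly) equal size."""
    rng = random.Random(seed)  # fixed seed so the split can be reproduced
    shuffled = list(ids)
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]

# Hypothetical entry numbers 1..100
half_a, half_b = split_halves(range(1, 101), seed=42)
print(len(half_a), len(half_b))
```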
Example of Cluster Analysis
• In this example, I am using data from the PhD dissertation of one of my students (Sintayehu Daba). Sintayehu is evaluating barley lines from three regions: Ethiopia and Kenya, ICARDA, and North Dakota, USA. Sintayehu collected data on many different plant characters, agronomic traits, and disease resistance. In the analysis, I am trying to determine if cluster analysis will successfully separate the data into distinct clusters based on the data collected.
• SAS Commands
options pageno=1;
data all;
  input Entry Source Color Hull_cover Row Orrow DH DM NB SC NKS NSP NTP PLH SL TKW HLW GYH PC Plump LOD;
  datalines;
11 1 1 1 2 12 81.6 131.5 2.8 4.1 48.6 7.5 7.1 109.9 7.1 42.8 59.3 4.6 14.1 71.5 8
12 1 1 1 2 12 68.4 115.9 1.2 8.1 16.4 3.4 4.6 103.9 5.5 33.3 54.8 2.6 10.7 35.7 20
13 1 2 1 2 12 86.8 136.2 3.2 4.7 25.7 7.3 7.4 126.3 7.5 51.9 59.3 3.7 12.4 88.1 13
30 1 1 1 2 12 80.8 132.4 3.4 3.2 27.9 8.1 8.1 105.6 8.0 51.6 61.8 4.4 12.1 93.0 10
39 1 1 1 2 12 80.0 123.4 1.3 8.7 26.2 6.8 6.8 108.7 6.2 53.8 61.1 4.8 11.9 90.9 25
56 1 1 1 2 12 81.0 125.4 4.3 4.7 39.2 9.8 8.8 103.7 5.2 56.4 61.0 4.6 12.3 94.1 20
72 1 1 1 2 12 85.8 134.0 1.0 5.8 33.0 9.1 10.1 96.5 6.4 37.5 60.3 3.1 10.8 70.0 10
231 1 1 1 2 12 84.9 133.8 3.1 3.4 48.2 8.4 8.7 107.1 7.2 45.0 57.4 4.9 11.8 80.6 3
232 1 1 1 2 12 82.1 133.9 2.7 4.7 43.6 4.1 6.7 108.9 7.6 63.5 61.4 4.9 14.2 92.8 5
233 1 1 1 2 12 84.1 132.9 6.7 2.7 40.6 6.1 7.7 117.9 7.6 57.3 60.9 5.2 14.0 86.2 10
241 1 1 1 2 12 84.1 131.9 6.7 2.7 36.6 8.1 8.7 130.9 6.6 59.7 63.5 5.6 13.4 86.6 10
1 1 2 1 6 16 87.9 133.4 4.5 2.0 48.8 7.5 7.7 119.2 5.6 36.4 59.0 5.0 . 64.9 0
2 1 1 1 6 16 83.4 129.3 1.3 6.1 49.0 8.0 7.9 116.2 7.4 36.5 57.1 3.9 11.0 50.6 18
3 1 2 1 6 16 80.8 123.4 3.5 3.9 44.4 3.3 3.4 105.1 7.0 39.0 59.0 3.4 9.7 74.9 15
4 1 1 1 6 16 85.7 132.9 2.9 2.8 49.1 6.1 6.1 105.6 5.2 40.2 59.9 3.6 10.7 72.2 8
5 1 1 1 6 16 83.2 128.7 2.4 7.3 47.0 5.7 5.4 104.7 5.2 35.0 54.8 3.3 11.5 65.5 15
6 1 1 1 6 16 85.4 132.4 4.7 2.2 34.0 10.0 11.0 130.2 8.0 45.8 57.0 5.5 11.7 59.6 5
7 1 1 1 6 16 83.0 126.4 2.3 3.7 44.2 6.8 5.8 107.7 9.2 44.6 60.2 4.8 13.1 84.2 5
.
.
.
. 89.5 7.6 36.1 59.0 2.5 9.6 64.5 0
82 5 1 1 2 52 79.9 131.3 1.1 7.9 23.9 11.6 12.1 80.6 6.9 35.2 60.3 2.1 10.3 66.5 0
227 5 1 1 2 52 86.2 138.9 1.1 8.1 27.6 7.2 7.7 78.3 6.8 35.4 58.9 2.1 9.3 50.8 1
228 5 1 1 2 52 85.1 145.5 1.3 6.9 26.9 11.1 11.1 89.4 7.2 48.6 64.2 2.3 9.5 95.2 0
263 5 1 1 2 52 80.2 132.3 0.9 7.7 27.1 8.8 8.8 87.9 6.9 32.9 59.0 2.0 9.5 58.8 0
210 5 1 1 6 56 84.9 135.4 0.9 7.4 53.5 4.7 4.5 79.4 7.4 36.3 57.4 2.6 12.3 93.7 0
211 5 1 1 6 56 83.7 137.5 1.0 7.7 45.3 8.2 8.3 84.7 6.0 29.9 60.8 1.6 12.1 58.0 0
212 5 1 1 6 56 91.2 137.4 1.3 7.2 47.8 5.3 5.3 81.4 6.7 29.6 58.6 1.4 11.1 48.2 0
213 5 1 1 6 56 86.4 140.4 0.8 7.6 49.5 7.3 6.9 86.9 6.8 30.5 58.6 2.5 10.6 65.9 0
214 5 1 1 6 56 85.1 139.3 1.1 7.7 53.5 6.3 6.6 87.8 7.5 34.4 59.3 2.4 10.6 71.4 0
215 5 1 1 6 56 88.4 141.3 1.0 6.6 52.9 7.3 6.6 90.4 7.2 30.8 60.4 2.9 10.4 64.2 0
216 5 1 1 6 56 80.4 133.7 1.0 7.4 42.3 6.1 6.3 84.7 6.6 31.7 58.6 2.0 11.8 76.8 0
217 5 1 1 6 56 83.6 137.2 1.3 7.7 50.5 5.8 5.9 85.7 7.0 30.2 58.2 1.8 10.7 60.5 0
218 5 1 1 6 56 86.0 140.6 1.2 7.4 48.9 7.2 7.4 85.8 7.0 33.1 57.7 1.9 12.1 90.4 0
219 5 1 1 6 56 84.0 142.0 1.0 7.6 47.6 7.9 8.3 84.7 7.3 31.4 59.6 1.4 13.3 88.9 0
220 5 1 1 6 56 82.9 140.0 1.0 6.8 56.6 6.1 6.4 104.9 7.5 31.9 62.0 2.4 10.5 62.0 0
221 5 1 1 6 56 82.7 136.1 1.0 7.2 53.4 9.1 9.7 101.9 7.2 32.6 58.6 2.2 11.7 55.8 0
229 5 1 1 6 56 85.0 140.8 0.6 7.6 52.4 6.4 6.5 103.5 7.5 33.3 60.5 2.1 11.9 64.0 0
237 5 1 1 6 56 83.5 133.7 1.0 7.6 45.8 6.3 6.4 87.6 7.0 33.5 59.2 2.5 11.1 68.0 0
;;
data two;
  set all;
  if row=2;   /* subsetting IF: keep only the observations with Row = 2 */
ods graphics on;
ods rtf file='cluster.rtf';

/* UPGMA (average linkage) clustering */
proc cluster data=two method=ave print=15 ccc pseudo;
  var row Color Hull_cover DH DM NB SC NKS NSP NTP PLH SL TKW HLW GYH PC Plump LOD;
  copy orrow;
  title 'Cluster Analysis Using the UPGMA Method';
run;
/* Cut the tree at three clusters and save the cluster assignments */
proc tree noprint ncl=3 out=out;
  copy row Color Hull_cover DH DM NB SC NKS NSP NTP PLH SL TKW HLW GYH PC Plump LOD orrow;
run;
/* Cross-tabulate cluster membership against Orrow */
proc freq;
  tables cluster*orrow / nopercent norow nocol plot=none;
run;
/* Canonical discriminant analysis of the clusters, then plot the first two canonical variables */
proc candisc noprint out=can;
  class cluster;
  var row Color Hull_cover DH DM NB SC NKS NSP NTP PLH SL TKW HLW GYH PC Plump LOD;
run;
proc sgplot data=can;
  scatter y=can2 x=can1 / group=cluster;
run;

/* Ward's minimum variance clustering */
/* Note: Row appears twice in this VAR list, which is why the Ward output reports 19 eigenvalues instead of 18 */
proc cluster data=two method=ward print=15 ccc pseudo;
  var row Color Hull_cover Row DH DM NB SC NKS NSP NTP PLH SL TKW HLW GYH PC Plump LOD;
  copy orrow;
  title 'Cluster analysis Using Wards Method';
run;
proc tree noprint ncl=3 out=out;
  copy row Color Hull_cover DH DM NB SC NKS NSP NTP PLH SL TKW HLW GYH PC Plump LOD orrow;
run;
proc freq;
  tables cluster*orrow / nopercent norow nocol plot=none;
run;
proc candisc noprint out=can;
  class cluster;
  var row Color Hull_cover DH DM NB SC NKS NSP NTP PLH SL TKW HLW GYH PC Plump LOD;
run;
proc sgplot data=can;
  scatter y=can2 x=can1 / group=cluster;
run;
ods rtf close;
ods graphics off;
Cluster Analysis Using the UPGMA Method
The CLUSTER Procedure Average Linkage Cluster Analysis
Eigenvalues of the Covariance Matrix
Eigenvalue Difference Proportion Cumulative
1 295.143472 165.626898 0.5182 0.5182
2 129.516574 62.636260 0.2274 0.7456
3 66.880314 39.119795 0.1174 0.8631
4 27.760519 8.322771 0.0487 0.9118
5 19.437749 4.612745 0.0341 0.9460
6 14.825003 9.040107 0.0260 0.9720
7 5.784896 2.344945 0.0102 0.9821
8 3.439951 1.163995 0.0060 0.9882
9 2.275956 0.180191 0.0040 0.9922
10 2.095765 1.271980 0.0037 0.9959
11 0.823785 0.268282 0.0014 0.9973
12 0.555504 0.023777 0.0010 0.9983
13 0.531727 0.300812 0.0009 0.9992
14 0.230914 0.039611 0.0004 0.9996
15 0.191304 0.166683 0.0003 1.0000
16 0.024621 0.024621 0.0000 1.0000
17 0.000000 0.000000 0.0000 1.0000
18 0.000000 0.0000 1.0000
Root-Mean-Square Total-Sample Standard Deviation 5.624935
Root-Mean-Square Distance Between Observations 33.74961
Cluster History
Number of  Clusters Joined Freq  Semipartial  R-Square  Approximate  Cubic       Pseudo F   Pseudo     Norm RMS  Tie
Clusters                         R-Square               Expected     Clustering  Statistic  t-Squared  Distance
                                                        R-Square     Criterion
15         CL27 CL41       10    0.0086       .799      .812         -1.2        24.1       4.7        0.6206
14         CL52 CL39       4     0.0064       .792      .803         -.95        25.3       4.6        0.6224
13         CL19 CL23       25    0.0108       .782      .793         -1.0        26.0       4.3        0.6322
12         CL42 CL22       5     0.0089       .773      .783         -.83        27.2       3.9        0.6993
11         CL21 CL16       42    0.0371       .736      .771         -2.8        24.8       16.2       0.7249
10         CL12 OB45       6     0.0064       .729      .758         -1.9        26.9       1.6        0.7305
9          CL13 CL18       28    0.0201       .709      .743         -2.1        27.7       7.0        0.7772
8          CL11 CL14       46    0.0262       .683      .725         -2.6        28.3       8.3        0.788
7          CL9 CL10        34    0.0282       .655      .704         -2.9        29.4       7.7        0.7945
6          CL15 CL7        44    0.0619       .593      .678         -4.6        27.4       15.5       0.8501
5          CL6 CL8         90    0.2228       .370      .645         -12         14.0       49.7       0.9624
4          CL26 OB94       7     0.0218       .348      .600         -11         17.1       10.1       1.1609
3          CL5 CL33        92    0.0461       .302      .533         -8.5        21.0       6.7        1.2466
2          OB2 CL4         8     0.0305       .272      .397         -4.0        36.6       5.6        1.3979
1          CL3 CL2         100   0.2717       .000      .000         0.00        .          36.6       1.6047
• The semipartial R2 measures the loss of homogeneity when two clusters are merged: it is the decrease in R2 caused by that merge. This value reflects decreasing homogeneity of members in a cluster as clusters are combined to make new clusters.
• R2 reflects the differences between clusters, so you want this value to be high. At the start of the clustering process all entries are their own cluster; thus, the R2 is 1. As more clusters are combined, the R2 value should decrease. At the end of the analysis when all observations are in the same cluster, the R2 value should theoretically be 0.
• The approximate expected R2 value is part of the output presented when the CCC value is
requested. The approximate expected R2 value reflects an estimated value given a uniform null hypothesis.
• Ties
o At each level of the clustering process, PROC CLUSTER identifies the pair of clusters with the minimum distance between them. Sometimes there can be two or more pairs of clusters with the same minimum distance. This often occurs with discrete data. In such cases the tie must be broken in some arbitrary way. If there are ties, then the results of the cluster analysis depend on the order of the observations in the data set.
o A tie means that at a particular step in the cluster analysis, two or more pairs of clusters had the same minimum distance, so some of the clusters formed at that step, and possibly at later steps, are not uniquely determined. Ties that occur early in the cluster analysis usually have little effect on the later stages. Ties that occur in the middle parts of the cluster analysis should be investigated. Ties that occur late in the cluster analysis are a sign that a solid or concrete solution may not be possible.
o There are routines you can run to determine if ties are affecting the outcome of your analyses.
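What a tie looks like at the very first merge can be sketched as below. This is not one of the SAS routines mentioned above, only a hypothetical check (`closest_pairs` and the four observations are made up) that lists every pair of observations sharing the minimum Euclidean distance.

```python
import math
from itertools import combinations

def closest_pairs(rows, tol=1e-12):
    """Return every pair of row indices whose distance equals the minimum."""
    dists = {(i, j): math.dist(rows[i], rows[j])
             for i, j in combinations(range(len(rows)), 2)}
    d_min = min(dists.values())
    return [pair for pair, d in dists.items() if abs(d - d_min) <= tol]

# Hypothetical discrete-valued observations: two pairs are exactly 1 unit apart
obs = [[0, 0], [0, 1], [5, 5], [5, 6]]
tied = closest_pairs(obs)
print(tied)  # more than one pair means the first merge depends on data order
```

One simple follow-up, under the same assumptions, is to rerun the clustering with the observations in a different order and see whether the final groups change.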
Cluster Analysis Using the UPGMA Method
The CLUSTER Procedure Average Linkage Cluster Analysis
Table of CLUSTER by Orrow (Using Non-standardized Data)
CLUSTER Orrow
Frequency 12 22 32 42 52 Total
1 10 37 2 39 4 92
2 0 0 0 0 7 7
3 1 0 0 0 0 1
Total 11 37 2 39 11 100
Frequency Missing = 1
Table of CLUSTER by Orrow (Using Standardized Data)
CLUSTER Orrow
Frequency 12 22 32 42 52 Total
1 5 26 2 39 11 83
2 5 11 0 0 0 16
3 1 0 0 0 0 1
Total 11 37 2 39 11 100
Frequency Missing = 1
Cluster Analysis Using the UPGMA Method
The CLUSTER Procedure Average Linkage Cluster Analysis
Non-standardized Data
Cluster analysis Using Wards Method
The CLUSTER Procedure Ward's Minimum Variance Cluster Analysis
Eigenvalues of the Covariance Matrix
Eigenvalue Difference Proportion Cumulative
1 295.143472 165.626898 0.5182 0.5182
2 129.516574 62.636260 0.2274 0.7456
3 66.880314 39.119795 0.1174 0.8631
4 27.760519 8.322771 0.0487 0.9118
5 19.437749 4.612745 0.0341 0.9460
6 14.825003 9.040107 0.0260 0.9720
7 5.784896 2.344945 0.0102 0.9821
8 3.439951 1.163995 0.0060 0.9882
9 2.275956 0.180191 0.0040 0.9922
10 2.095765 1.271980 0.0037 0.9959
11 0.823785 0.268282 0.0014 0.9973
12 0.555504 0.023777 0.0010 0.9983
13 0.531727 0.300812 0.0009 0.9992
14 0.230914 0.039611 0.0004 0.9996
15 0.191304 0.166683 0.0003 1.0000
16 0.024621 0.024621 0.0000 1.0000
17 0.000000 0.000000 0.0000 1.0000
18 0.000000 0.000000 0.0000 1.0000
19 0.000000 0.0000 1.0000
Root-Mean-Square Total-Sample Standard Deviation 5.47491
Root-Mean-Square Distance Between Observations 33.74961
Cluster History
Number of  Clusters Joined Freq  Semipartial  R-Square  Approximate  Cubic       Pseudo F   Pseudo     Tie
Clusters                         R-Square               Expected     Clustering  Statistic  t-Squared
                                                        R-Square     Criterion
15         CL22 CL39       16    0.0090       .828      .812         1.63        29.2       6.4
14         CL26 CL18       28    0.0091       .819      .803         1.55        29.9       5.9
13         CL15 CL33       19    0.0120       .807      .793         1.28        30.3       6.1
12         CL28 CL43       5     0.0143       .793      .783         0.89        30.6       5.5
11         CL17 CL16       14    0.0170       .776      .771         0.40        30.8       5.2
10         OB2 OB94        2     0.0177       .758      .758         0.01        31.3       .
9          CL31 CL19       9     0.0185       .739      .743         -.22        32.3       7.4
8          CL9 CL23        15    0.0242       .715      .725         -.64        33.0       6.7
7          CL34 CL14       37    0.0244       .691      .704         -.82        34.6       14.5
6          CL13 CL12       24    0.0246       .666      .678         -.72        37.5       8.0
5          CL11 CL6        38    0.0392       .627      .645         -1.0        39.9       9.5
4          CL10 CL8        17    0.0603       .567      .600         -1.8        41.8       10.2
3          CL21 CL5        46    0.0624       .504      .533         -1.3        49.3       13.7
2          CL4 CL7         54    0.1888       .315      .397         -2.7        45.1       42.3
1          CL3 CL2         100   0.3154       .000      .000         0.00        .          45.1
Cluster analysis Using Wards Method
The FREQ Procedure
Table of CLUSTER by Orrow (Non-standardized Data)
CLUSTER Orrow
Frequency 12 22 32 42 52 Total
1 0 4 0 32 1 37
2 9 31 2 4 0 46
3 2 2 0 3 10 17
Total 11 37 2 39 11 100
Frequency Missing = 1
Table of CLUSTER by Orrow (Using Standardized Data)
CLUSTER Orrow
Frequency 12 22 32 42 52 Total
1 1 22 2 38 0 63
2 9 15 0 0 0 24
3 1 0 0 1 11 13
Total 11 37 2 39 11 100
Frequency Missing = 1