6. Cluster Analysis

Cluster analysis, also called segmentation analysis or taxonomy analysis, seeks to identify homogeneous subgroups of cases in a population. That is, cluster analysis seeks to identify a set of groups which both minimize within-group variation and maximize between-group variation. Hierarchical clustering lets the user select a definition of distance, then a linking method for forming clusters, and then determine how many clusters best suit the data. In k-means clustering the researcher specifies the number of clusters in advance, and the algorithm then assigns cases to the k clusters. K-means clustering is much less computer-intensive and is therefore sometimes preferred when datasets are very large (e.g., more than 1,000 cases). Finally, two-step clustering creates pre-clusters and then clusters the pre-clusters.

A Famous Example of Cluster Analysis

The discovery of white dwarfs and red giants is a famous example of cluster analysis. Stars were plotted by the astronomers Hertzsprung and Russell according to two features, log luminosity and log temperature. Three clusters emerged: white dwarfs, red giants, and the main sequence between them.

Cluster          Temperature    Luminosity
White Dwarfs     medium         low
Main Sequence    wide range     low when T is low, high when T is high
Red Giants       medium-low     high

Key Concepts and Terms

Cluster formation is the selection of the procedure for determining how clusters are created and how the calculations are done. In agglomerative hierarchical clustering every case is initially considered a cluster, then the two cases with the lowest distance (or highest similarity) are combined into a cluster. The case with the lowest distance to either of the first two is considered next. If that third case is closer to a fourth case than it is to either of the first two, the third and fourth cases become the second two-case cluster; if not, the third case is added to the first cluster. The process is repeated, adding cases to existing clusters, creating new clusters, or combining clusters, until the desired final number of clusters is reached. (There is also divisive clustering, which works in the opposite direction, starting with all cases in one large cluster. Hierarchical cluster analysis can use either agglomerative or divisive strategies.)

Distance. The first step in cluster analysis is establishing the similarity or distance matrix. This matrix is a table in which both the rows and the columns are the units of analysis and the cell entries are a measure of similarity or distance for any pair of cases. Euclidean distance is the most common distance measure. For a pair of cases plotted on two variables, which form the x and y axes, the Euclidean distance is the square root of the sum of the squared x difference and the squared y difference: d = sqrt((x1 - x2)^2 + (y1 - y2)^2). Sometimes the square of the Euclidean distance is used instead. When two or more variables are used to define distance, the one with the larger magnitude will dominate, so it is common to first standardize all variables (a procedure sketch follows at the end of this section).

Similarity. Distance measures how far apart two observations are; cases which are alike have a low distance. Similarity measures how alike two cases are; cases which are alike have a high similarity.

Cluster Method

There are different methods of computing the distance between clusters.

Average linkage is the mean distance between all possible inter- or intra-cluster pairs of cases. The clusters merged are those for which the average distance between all pairs in the resulting cluster is as small as possible, so this method is appropriate when the research purpose is homogeneity within clusters.

Ward's method calculates the sum of squared Euclidean distances from each case in a cluster to the cluster mean on all variables. The pair of clusters merged is the one that increases this sum the least. This is an ANOVA-type approach and is preferred by some researchers for that reason.

Centroid method. The distance between two clusters is the Euclidean distance between their centroids (the cluster means on all variables); the pair of clusters with the smallest centroid distance is merged.

K-means algorithm. The algorithm uses Euclidean distance and requires the user to specify the number of clusters in advance.

Agglomeration Schedule

SAS displays this in the Cluster History table. In this table the rows are stages of clustering, numbered from 1 to (n - 1); the (n - 1)th stage puts all the cases in one cluster. The algorithm uses Euclidean distance to combine clusters. The 0th row (not shown) has all the observations as one-point clusters. At Stage 1, the two clusters with the least distance are combined into one cluster. The process continues until all the cases are collected into one cluster. The metric Norm RMS Dist increases as the number of clusters is reduced, and a jump in this metric indicates the appropriate number of clusters: if there is a significant jump from Stage i to Stage (i + 1), we stop the clustering process after completing Stage i and identify the clusters at that stage.
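A minimal sketch of the distance step described above, assuming a hypothetical data set named loyalty with clustering variables store and brand and an identifier resp (names chosen for illustration only): the variables are standardized, then the pairwise Euclidean distance matrix is computed.

/* Standardize each variable to mean 0 and standard deviation 1
   so that neither variable dominates the distance. */
proc stdize data=loyalty method=std out=loyalty_std;
   var store brand;
run;

/* Pairwise Euclidean distances between all cases. */
proc distance data=loyalty_std method=euclid out=distmat;
   var interval(store brand);
   id resp;
run;

proc print data=distmat noobs;
run;

The OUT= data set holds the distance matrix; hierarchical clustering with PROC CLUSTER can also compute the distances itself from the raw coordinates, as in the example further below.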
Dendrograms (cluster analysis tree charts) show the relative size of the distances at which clusters were combined. The bigger the distance at which a merge occurs, the more the clustering involved combining unlike entities, which may be undesirable. Clusters with low distance/high similarity are close together. Cases with low distance are linked by a line a short distance from the left of the dendrogram, indicating that they were agglomerated into a cluster at a low distance coefficient, i.e., that they are alike. When, on the other hand, the linking line is toward the right of the dendrogram, the linkage occurs at a high distance coefficient, indicating that the cases or clusters were agglomerated even though they were much less alike.

Cluster centers are the average values, on all clustering variables, of each cluster's members.

Profiling the Clusters

We profile each cluster by interpreting its centroid (the mean of the cluster on each variable). This information is available when SAS is run with the k-means method. We interpret each cluster by its high or low values on each variable. We may also profile the clusters on variables that were not used in the clustering process, either by examining the means of such variables within each cluster or by running a discriminant analysis on them.

Validating the Clusters

Because of the non-statistical aspects of cluster analysis, we need to conduct a validation exercise to ensure that the result is generalizable to future observations. Some ways to do this:
1. Cluster using different distance measures.
2. Cluster using different methods.
3. Split the data into two halves, run the program on each half, profile each cluster, and compare the profiles (a sketch of this check follows).
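A minimal sketch of the split-half check in item 3, again assuming a hypothetical data set named loyalty with clustering variables store and brand:

/* Randomly split the data into two halves. */
data half1 half2;
   set loyalty;
   if ranuni(20240501) < 0.5 then output half1;
   else output half2;
run;

/* Cluster each half separately (three clusters assumed),
   then compare the cluster means (profiles) across the halves. */
proc fastclus data=half1 maxclusters=3 out=out1 mean=means1;
   var store brand;
run;

proc fastclus data=half2 maxclusters=3 out=out2 mean=means2;
   var store brand;
run;

proc print data=means1; run;
proc print data=means2; run;

If the cluster profiles from the two halves are similar, the cluster solution is more likely to generalize.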

Example

Example data. Suppose a marketing researcher wishes to determine market segments in a community based on patterns of loyalty to brands and stores. The values for 7 respondents on two measures of loyalty, store loyalty and brand loyalty, are given in the worksheet tables at the end of this section (Hair et al., Multivariate Data Analysis).

Average Linkage Method

SAS Instructions
1. Analyze > Multivariate > Cluster Analysis
2. Data: Drop the variables into the analysis variables bucket
3. Cluster Method: Average Linkage
4. Plots: Tree Design
5. Results: Display Output
6. Run
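The point-and-click steps above correspond roughly to the following procedure code (a sketch; loyalty, store, brand, and resp are assumed names for the example data set, its two variables, and the respondent identifier):

/* Average linkage hierarchical clustering; the STD option could be
   added to standardize the variables first. */
proc cluster data=loyalty method=average outtree=tree_avg;
   var store brand;
   id resp;
run;

/* Dendrogram (tree chart) and the three-cluster membership. */
proc tree data=tree_avg nclusters=3 out=clus_avg;
   id resp;
run;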

Eigenvalues of the Covariance Matrix

     Eigenvalue    Difference    Proportion    Cumulative
1    4.02480073    1.14483955    0.5829        0.5829
2    2.87996118                  0.4171        1.0000

Root-Mean-Square Total-Sample Standard Deviation    1.858058
Root-Mean-Square Distance Between Observations      3.716117

Cluster History

NCL    Clusters Joined    FREQ    Norm RMS Dist    Tie
6      OB5  OB6           2       0.3806
5      OB2  OB3           2       0.5382           T
4      CL5  OB4           3       0.6592
3      CL6  OB7           3       0.712
2      CL4  CL3           6       0.9702
1      OB1  CL2           7       1.3045

The increase in Norm RMS Dist is modest down to three clusters but jumps from 0.712 to 0.9702 when the solution is forced to two clusters, suggesting a three-cluster solution: {OB2, OB3, OB4}, {OB5, OB6, OB7}, and {OB1}.

Ward's Minimum Variance Method

SAS Instructions
1. Analyze > Multivariate > Cluster Analysis
2. Data: Drop the variables into the analysis variables bucket
3. Cluster Method: Ward's Minimum Variance Method
4. Plots: Tree Design
5. Results: Display Output
6. Run
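The equivalent procedure call is a one-word change from the average linkage sketch (again assuming the hypothetical loyalty data set):

proc cluster data=loyalty method=ward outtree=tree_ward;
   var store brand;
   id resp;
run;

proc tree data=tree_ward nclusters=3 out=clus_ward;
   id resp;
run;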

Eigenvalues of the Covariance Matrix

     Eigenvalue    Difference    Proportion    Cumulative
1    4.02480073    1.14483955    0.5829        0.5829
2    2.87996118                  0.4171        1.0000

Root-Mean-Square Total-Sample Standard Deviation    1.858058
Root-Mean-Square Distance Between Observations      3.716117

Cluster History

NCL    Clusters Joined    FREQ    SPRSQ     RSQ     Tie
6      OB5  OB6           2       0.0241    .976
5      OB2  OB3           2       0.0483    .928    T
4      CL5  OB4           3       0.0805    .847
3      CL6  OB7           3       0.1046    .743
2      OB1  CL4           4       0.3420    .401    T
1      CL2  CL3           7       0.4006    .000

SPRSQ (semipartial R-squared) is the decrease in R-squared caused by the merge at that stage. It stays small through NCL = 3 and then jumps to 0.3420 when moving to two clusters, while R-squared falls from .743 to .401; this again points to a three-cluster solution.

K-Means Method

SAS Instructions
1. Analyze > Multivariate > Cluster Analysis
2. Data: Drop the variables into the analysis variables bucket
3. Cluster Method: K-Means; Maximum number of clusters: 3 (why 3? because the cluster histories from the hierarchical runs above suggest a three-cluster solution)
4. Run
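The corresponding procedure is PROC FASTCLUS (a sketch; loyalty, store, and brand are the assumed data set and variable names):

/* k-means with 3 clusters. */
proc fastclus data=loyalty maxclusters=3 out=clusout;
   var store brand;
run;

The OUT= data set adds a CLUSTER variable (the assigned cluster) and a DISTANCE variable (the distance from each case to its cluster seed), which is the output data listed below.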

Initial Seeds

Cluster    Store Loyalty    Brand Loyalty
1          3.000000000      2.000000000
2          7.000000000      7.000000000
3          2.000000000      7.000000000

Minimum Distance Between Initial Seeds = 5

Iteration History

Iteration    Criterion    Relative Change in Cluster Seeds
                              1         2         3
(iteration rows not shown)

Criterion Based on Final Seeds = 0.8729

Cluster Summary

Cluster    Frequency    RMS Std Deviation    Maximum Distance from Seed to Observation    Radius Exceeded    Nearest Cluster    Distance Between Cluster Centroids
(rows not shown)

Statistics for Variables

Variable    Total STD    Within STD    R-Square    RSQ/(1-RSQ)
(rows not shown)

Pseudo F Statistic = 5.77
Approximate Expected Over-All R-Squared = .
Cubic Clustering Criterion = .

Cluster Means

Cluster    Store Loyalty    Brand Loyalty
1          3.000000000      2.000000000
2          6.333333333      5.666666667
3          3.333333333      6.333333333

Cluster Standard Deviations

Cluster    Store Loyalty    Brand Loyalty
(rows not shown)

Output Data

Resp    Store    Brand    Cluster    Distance
A       3        2        1          0
B       4        5        3          1.49071198
C       4        7        3          0.94280904
D       2        7        3          1.49071198
E       6        6        2          0.47140452
F       7        7        2          1.49071198
G       6        4        2          1.69967317
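To profile the k-means clusters, the cluster means above can be reproduced from the OUT= data set, and variables not used in the clustering can be profiled the same way (a sketch; clusout is the hypothetical output data set from the PROC FASTCLUS call above):

proc means data=clusout mean std;
   class cluster;
   var store brand;
run;

A discriminant analysis (for example, PROC DISCRIM with CLUSTER as the class variable) is another way to profile the clusters on variables that were not used to form them.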


Worksheet: Example Data and Distances (Hair et al., Multivariate Data Analysis)

Original and standardized observations for the seven respondents:

        Original                          Standardized
Resp    Store Loyalty    Brand Loyalty    Store Loyalty    Brand Loyalty
A       3                2                -0.87            -1.80
B       4                5                -0.32            -0.23
C       4                7                -0.32             0.83
D       2                7                -1.42             0.83
E       6                6                 0.79             0.30
F       7                7                 1.34             0.83
G       6                4                 0.79            -0.75

Mean    4.5714285714     5.4285714286
s       1.8126539343     1.9023794624

Squared Euclidean distances between the original observations:

       1:A    2:B    3:C    4:D    5:E    6:F    7:G
1:A    0      10     26     26     25     41     13
2:B    10     0      4      8      5      13     5
3:C    26     4      0      4      5      9      13
4:D    26     8      4      0      17     25     25
5:E    25     5      5      17     0      2      4
6:F    41     13     9      25     2      0      10
7:G    13     5      13     25     4      10     0

Mean-Square Distance Between Observations = 13.8095238095
Root-Mean-Square Distance Between Observations = 3.7161167648
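A minimal sketch that reads the seven respondents into SAS and reproduces the squared-distance matrix above (loyalty is an assumed data set name):

data loyalty;
   input resp $ store brand;
   datalines;
A 3 2
B 4 5
C 4 7
D 2 7
E 6 6
F 7 7
G 6 4
;
run;

/* Squared Euclidean distances between the original observations. */
proc distance data=loyalty method=sqeuclid out=sqdist;
   var interval(store brand);
   id resp;
run;

proc print data=sqdist noobs;
run;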

Worksheet: Within-Cluster Sums of Squares at Stage 5

Squared distance of each respondent from its cluster centroid, for the two-cluster partition {A} and {B, C, D, E, F, G}:

Resp    Store Loyalty    Brand Loyalty    Centroid (Store)    Centroid (Brand)    SS
A       3                2                3.0000              2.0000              0.000
B       4                5                4.8333              6.0000              1.694
C       4                7                4.8333              6.0000              1.694
D       2                7                4.8333              6.0000              9.028
E       6                6                4.8333              6.0000              1.361
F       7                7                4.8333              6.0000              5.694
G       6                4                4.8333              6.0000              5.361

Stage 5 (moving from three clusters to two): for each cluster kept intact, the other two clusters are merged and the total within-cluster sum of squares of the resulting two-cluster solution is computed. The candidate with the smallest total SS is chosen, as in Ward's method.

Cluster Kept       Resulting Clusters            Centroid 1      Centroid 2      SS1       SS2      SS
CL3 = {5, 6, 7}    {1, 2, 3, 4} & {5, 6, 7}      (3.25, 5.25)    (6.33, 5.67)    19.500    5.333    24.833
CL4 = {2, 3, 4}    {1, 5, 6, 7} & {2, 3, 4}      (5.5, 4.75)     (3.33, 6.33)    23.750    5.333    29.083
{1}                {2, 3, 4, 5, 6, 7} & {1}      (4.83, 6)       (3, 2)          24.833    0        24.833

The first and third candidates tie at a total SS of 24.833 (hence the tie flag at NCL = 2 in the Ward cluster history); SAS joins OB1 with CL4, giving the clusters {1, 2, 3, 4} and {5, 6, 7}.
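The SS columns can be checked with the CSS (corrected sum of squares) statistic (a sketch; stage5 and candidate are hypothetical names for a data set holding the seven respondents and one of the candidate two-cluster assignments above):

proc means data=stage5 css;
   class candidate;
   var store brand;
run;

Summing the CSS of store and brand within each candidate cluster gives the SS1 and SS2 values in the worksheet.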
