Class 6 Cluster Analysis

Cluster Analysis

Analysis and Output Interpretation

using

Hierarchical Cluster Technique & SPSS 6.00

Dr. Rohit Vishal Kumar
Reader, Department of Marketing
Xavier Institute of Social Service
PO Box No 7, Purulia Road
Ranchi - 834001
Email: [email protected]@gmail.com

All trademarks & Copyrights Acknowledged.
Presentation Copyright Rohit Vishal Kumar 2002

Cluster Analysis - Introduction

• Cluster analysis is a multivariate analysis technique that seeks to organize information about variables so that relatively homogeneous groups, or "clusters," can be formed. The clusters formed with this family of methods should be highly internally homogeneous (members are similar to one another) and highly externally heterogeneous (members are not like members of other clusters).

• Although cluster analysis is relatively simple and can use a variety of input data, it is a relatively new technique and is not supported by a comprehensive body of statistical literature. Most of the guidelines for using cluster analysis are therefore rules of thumb, and some authors caution that researchers should use cluster analysis with care.

Cluster Analysis - Key Features

• Cluster analysis is not so much a typical statistical test as a "collection" of different algorithms that "put objects into clusters."

• Cluster analysis methods are mostly used when we do not have any a priori hypotheses, but are still in the exploratory phase of research. In a sense, cluster analysis finds the "most significant solution possible." Therefore, statistical significance testing is really not appropriate here.

Cluster Analysis - Applications

• Medicine: clustering diseases, cures for diseases, or symptoms of diseases can lead to very useful classification and better diagnosis.

• Psychiatry: the correct diagnosis of clusters of symptoms such as paranoia, schizophrenia, etc. is essential for successful therapy.

• Archeology: researchers have attempted to establish taxonomies of stone tools, funeral objects, etc. by applying cluster analytic techniques.

• Marketing: researchers have used cluster analysis to identify the closeness or difference (real or perceived) between brand images, to identify relatively homogeneous market segments, and to identify similarities between communication ideas.

In general, whenever one needs to classify a "mountain" of information into manageable, meaningful piles, cluster analysis is of great utility.

Four Common Distance Measures

• Euclidean distance. This is probably the most commonly chosen type of distance. It is simply the geometric distance in the multidimensional space. It is computed as:

distance(x,y) = \left\{ \sum_i (x_i - y_i)^2 \right\}^{1/2}

• Note: Euclidean (and squared Euclidean) distances are usually computed from raw data, and not from standardized data.

• Advantage: the distance between any two objects is not affected by the addition of new objects to the analysis, which may be outliers.

• Disadvantage: The distances can be greatly affected by differences in scale among the dimensions from which the distances are computed.

For example, if one of the dimensions denotes a measured length in centimeters, and you then convert it to millimeters (by multiplying the values by 10), the resulting Euclidean or squared Euclidean distances (computed from multiple dimensions) can be greatly affected, and consequently, the results of cluster analyses may be very different.
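
The scale effect described above can be checked with a few lines of Python. This is only a sketch: the two objects and their values are hypothetical, chosen so that one dimension is a length and the other a rating on a 1-7 scale.

```python
import math

def euclidean(x, y):
    """Geometric (straight-line) distance between two points."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Hypothetical objects: (length in centimeters, rating on a 1-7 scale).
a = (10.0, 2.0)
b = (12.0, 6.0)

def length_to_mm(point):
    """Convert only the first dimension from centimeters to millimeters."""
    return (point[0] * 10, point[1])

d_cm = euclidean(a, b)                              # sqrt(2^2 + 4^2)
d_mm = euclidean(length_to_mm(a), length_to_mm(b))  # sqrt(20^2 + 4^2)

print(f"length in cm: {d_cm:.3f}")
print(f"length in mm: {d_mm:.3f}")
```

With the length in centimeters the rating contributes 16 of the 20 squared units; with the length in millimeters the length contributes 400 of 416. The same two objects are therefore "close" or "far" mainly because of the unit chosen.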

• Squared Euclidean distance. One may want to square the standard Euclidean distance in order to place progressively greater weight on objects that are further apart. This distance is computed as:

distance(x,y) = \sum_i (x_i - y_i)^2

• City-block (Manhattan) distance. This distance is simply the sum of the absolute differences across dimensions. In most cases, this distance measure yields results similar to the simple Euclidean distance. However, note that in this measure, the effect of single large differences (outliers) is dampened (since they are not squared). The city-block distance is computed as:

distance(x,y) = \sum_i |x_i - y_i|

• Chebychev distance. This distance measure may be appropriate in cases when one wants to define two objects as "different" if they are different on any one of the dimensions. The Chebychev distance is computed as:

distance(x,y) = \max_i |x_i - y_i|
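
The four measures can be written side by side in Python. This is a minimal sketch following the formulas above; the function names are illustrative, and the two sample rows are respondents 10 and 14 from the raw data of the worked example in this presentation.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def squared_euclidean(x, y):
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

def city_block(x, y):
    # Sum of absolute differences across dimensions.
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def chebychev(x, y):
    # Largest absolute difference on any single dimension.
    return max(abs(xi - yi) for xi, yi in zip(x, y))

respondent_10 = (3, 5, 3, 6, 4, 6)
respondent_14 = (4, 6, 4, 6, 4, 7)

for measure in (euclidean, squared_euclidean, city_block, chebychev):
    print(measure.__name__, measure(respondent_10, respondent_14))
```

The two respondents differ by one scale point on four of the six statements, so the Euclidean distance is 2.0, the squared Euclidean and city-block distances are both 4, and the Chebychev distance is 1.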

Cluster Analysis

The Example and SPSS Procedure

The Raw Data

Respondent   V1   V2   V3   V4   V5   V6
     1        6    4    7    3    2    3
     2        2    3    1    4    5    4
     3        7    2    6    4    1    3
     4        4    6    4    5    3    6
     5        1    3    2    2    6    4
     6        6    4    6    3    3    4
     7        5    3    6    3    3    4
     8        7    3    7    4    1    4
     9        2    4    3    3    6    3
    10        3    5    3    6    4    6
    11        1    3    2    3    5    3
    12        5    4    5    4    2    4
    13        2    2    1    5    4    4
    14        4    6    4    6    4    7
    15        6    5    4    2    1    4
    16        3    5    4    6    4    7
    17        4    4    7    2    2    5
    18        3    7    2    6    4    3
    19        4    6    3    7    2    7
    20        2    3    2    4    7    2

The above data was collected from 20 respondents. The respondents were asked to rate the following statements on a 7-point scale:

V1 : Shopping is fun
V2 : Shopping is bad for your budget
V3 : I combine shopping with eating out
V4 : I try to get the best buys while shopping
V5 : I don’t care about shopping
V6 : You can save money by comparing prices

SCALE USED

1 = Completely Disagree, 4 = Neither Agree Nor Disagree, 7 = Completely Agree

SPSS Screen 1

The data entry screen in SPSS

SPSS Screen 2 : Hierarchical Cluster

Choose Statistics -> Data Reduction -> Hierarchical Cluster

We are shown the Hierarchical Cluster screen as follows:

1. Select all six variables (V1-V6) and transfer them to the Variable(s) box

2. Select Cluster “Cases”

3. Select Display “Statistics” and “Plots”

4. Press the “Statistics” button

SPSS Screen 3 : Hierarchical Cluster

On pressing the “Statistics” button we are shown the following screen.

1. “Agglomeration Schedule” and “Cluster Membership -> None” should be checked by default. If not, select these options

2. Press “Continue”

3. Select “Plots” on Screen 2

SPSS Screen 4 : Hierarchical Cluster

On pressing the “Plots” button we are shown the following screen.

1. Select “Dendrogram”

2. Select “All Icicles”

3. Select Orientation “Vertical”

4. Select “Methods” on Screen 2

SPSS Screen 5 : Hierarchical Cluster

On pressing the “Methods” button we are shown the following screen.

1. In Cluster Method, choose “Between Group Linkage”

2. In Measure, select “Interval” and then “Squared Euclidean Distances”

3. In “Transform Values”, select “none” in the Standardize dropdown list

4. Select “Continue”

5. In Screen 2 select “OK”
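
For readers without SPSS, the choices made on the screens above (cluster cases, between-groups average linkage, squared Euclidean distance, no standardization) can be sketched in plain Python. This is an illustrative re-implementation, not SPSS's own code. In particular, the rule for breaking ties between equally close pairs (keep the last tied pair in scan order) is an assumption chosen because it agrees with the first stages of the printed schedule; with many tied distances in this data, a different rule can give a different merge order, so later stages should be compared against the SPSS output rather than taken for granted.

```python
# Raw data from the example: respondent -> ratings on V1..V6.
DATA = {
    1: (6, 4, 7, 3, 2, 3),   2: (2, 3, 1, 4, 5, 4),
    3: (7, 2, 6, 4, 1, 3),   4: (4, 6, 4, 5, 3, 6),
    5: (1, 3, 2, 2, 6, 4),   6: (6, 4, 6, 3, 3, 4),
    7: (5, 3, 6, 3, 3, 4),   8: (7, 3, 7, 4, 1, 4),
    9: (2, 4, 3, 3, 6, 3),  10: (3, 5, 3, 6, 4, 6),
    11: (1, 3, 2, 3, 5, 3), 12: (5, 4, 5, 4, 2, 4),
    13: (2, 2, 1, 5, 4, 4), 14: (4, 6, 4, 6, 4, 7),
    15: (6, 5, 4, 2, 1, 4), 16: (3, 5, 4, 6, 4, 7),
    17: (4, 4, 7, 2, 2, 5), 18: (3, 7, 2, 6, 4, 3),
    19: (4, 6, 3, 7, 2, 7), 20: (2, 3, 2, 4, 7, 2),
}

def sq_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def between_groups_distance(members_a, members_b, data):
    """Average squared Euclidean distance over all cross-cluster pairs."""
    total = sum(sq_euclidean(data[i], data[j])
                for i in members_a for j in members_b)
    return total / (len(members_a) * len(members_b))

def agglomerate(data):
    """Return the schedule as (stage, cluster_1, cluster_2, coefficient)."""
    clusters = {case: [case] for case in data}  # label -> member cases
    schedule = []
    while len(clusters) > 1:
        labels = sorted(clusters)
        best = None
        for hi in range(1, len(labels)):
            for lo in range(hi):
                a, b = labels[lo], labels[hi]
                dist = between_groups_distance(clusters[a], clusters[b], data)
                # "<=" keeps the LAST tied pair found (assumed tie rule).
                if best is None or dist <= best[0]:
                    best = (dist, a, b)
        dist, a, b = best
        clusters[a].extend(clusters.pop(b))  # merged cluster keeps lower label
        schedule.append((len(schedule) + 1, a, b, dist))
    return schedule

schedule = agglomerate(DATA)
for stage, a, b, coefficient in schedule:
    print(f"{stage:>2}  {a:>2}  {b:>2}  {coefficient:10.6f}")
```

Each printed row has the same meaning as the first four columns of an SPSS agglomeration schedule: the stage, the labels of the two clusters combined, and the average squared Euclidean distance at which they were joined.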

Cluster Analysis

The SPSS Output

SPSS Output 1 : Hierarchical Cluster

The following output “Proximities” is displayed by SPSS

Data Information

20 unweighted cases accepted.
0 cases rejected because of missing value.
Squared Euclidean measure used.

* * * * * * * * * * * * * * P R O X I M I T I E S * * * * * * * * * * * * * *

Agglomeration Schedule using Average Linkage (Between Groups)

         Clusters Combined                   Stage Cluster 1st Appears    Next
Stage   Cluster 1   Cluster 2   Coefficient   Cluster 1   Cluster 2      Stage
   1       14          16         2.000000        0           0             3
   2        6           7         2.000000        0           0             7
   3       10          14         3.000000        0           1             8
   4        2          13         3.000000        0           0            14
   5        5          11         3.000000        0           0             9
   6        3           8         3.000000        0           0            15
   7        6          12         4.000000        2           0            10
   8        4          10         4.333333        0           3            11
   9        5           9         4.500000        5           0            12
  10        1           6         5.000000        0           7            13
  11        4          19         7.250000        8           0            17
  12        5          20         7.333333        9           0            14
  13        1          17         8.250000       10           0            15
  14        2           5        10.750000        4          12            18
  15        1           3        11.300000       13           6            16
  16        1          15        14.000000       15           0            19
  17        4          18        20.200001       11           0            18
  18        2           4        38.611111       14          17            19
  19        1           2        48.291668       16          18             0

SPSS Output 1 : Hierarchical Cluster

The Analysis : Proximities

•The "average linkage (between groups)" clustering method was used.

•There were a total of 20 data points. In the first stage two data points (14 and 16) were combined. This information is provided in the “Cluster 1” and “Cluster 2” columns under “Clusters Combined”.

•The squared Euclidean distance between data points 14 and 16 is equal to 2.00. This is shown in the column “Coefficient”. Once a cluster has more than one member, the coefficient is the average squared Euclidean distance over all pairs of cases drawn from the two clusters being combined.

•The column entitled "Stage Cluster 1st Appears" indicates the stage at which each of the two clusters being combined was itself formed. An entry of 0 means that the item is a single case rather than a previously formed cluster, so the entries 0 and 0 at stage 1 show that two individual cases were joined. The first non-zero entry appears at stage 3, where case 10 is combined with the cluster formed at stage 1 (cases 14 and 16, labelled 14).

•The “Next Stage” column gives the stage at which the cluster formed in the current stage is next combined with another case or cluster. For stage 1 the entry is 3. If we look at stage 3, we find that case 10 was combined with the cluster containing cases 14 and 16.
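
The coefficients at stages 1 and 3 can be checked by hand from the raw data. The short sketch below does the arithmetic, assuming (as the method name suggests) that between-groups average linkage averages the squared Euclidean distances over all pairs of cases drawn from the two clusters.

```python
def sq_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

# Ratings on V1..V6 taken from the raw data table.
case_10 = (3, 5, 3, 6, 4, 6)
case_14 = (4, 6, 4, 6, 4, 7)
case_16 = (3, 5, 4, 6, 4, 7)

# Stage 1: two single cases, so the coefficient is their distance.
stage_1 = sq_euclidean(case_14, case_16)

# Stage 3: case 10 joins the cluster {14, 16}; average the two distances.
stage_3 = (sq_euclidean(case_10, case_14) + sq_euclidean(case_10, case_16)) / 2

print(stage_1, stage_3)
```

Cases 14 and 16 differ by one point on V1 and V2 only, giving 2. Case 10 lies at squared distance 4 from case 14 and 2 from case 16, so the average is 3.0, matching the coefficients 2.000000 and 3.000000 in the schedule.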

SPSS Output 2 : Hierarchical Cluster

The following output “Icicle Plot” is displayed by SPSS

Vertical Icicle Plot using Average Linkage (Between Groups)

(Down) Number of Clusters (Across) Case Label and number

    18 19 16 14 10 4  20 9  11 5  13 2  15 8  3  17 12 7  6  1
 1 +XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
 2 +XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX  XXXXXXXXXXXXXXXXXXXXXX
 3 +XXXXXXXXXXXXXXXX  XXXXXXXXXXXXXXXX  XXXXXXXXXXXXXXXXXXXXXX
 4 +X  XXXXXXXXXXXXX  XXXXXXXXXXXXXXXX  XXXXXXXXXXXXXXXXXXXXXX
 5 +X  XXXXXXXXXXXXX  XXXXXXXXXXXXXXXX  X  XXXXXXXXXXXXXXXXXXX
 6 +X  XXXXXXXXXXXXX  XXXXXXXXXXXXXXXX  X  XXXX  XXXXXXXXXXXXX
 7 +X  XXXXXXXXXXXXX  XXXXXXXXXX  XXXX  X  XXXX  XXXXXXXXXXXXX
 8 +X  XXXXXXXXXXXXX  XXXXXXXXXX  XXXX  X  XXXX  X  XXXXXXXXXX
 9 +X  XXXXXXXXXXXXX  X  XXXXXXX  XXXX  X  XXXX  X  XXXXXXXXXX
10 +X  X  XXXXXXXXXX  X  XXXXXXX  XXXX  X  XXXX  X  XXXXXXXXXX
11 +X  X  XXXXXXXXXX  X  XXXXXXX  XXXX  X  XXXX  X  XXXXXXX  X
12 +X  X  XXXXXXXXXX  X  X  XXXX  XXXX  X  XXXX  X  XXXXXXX  X
13 +X  X  XXXXXXX  X  X  X  XXXX  XXXX  X  XXXX  X  XXXXXXX  X
14 +X  X  XXXXXXX  X  X  X  XXXX  XXXX  X  XXXX  X  X  XXXX  X
15 +X  X  XXXXXXX  X  X  X  XXXX  XXXX  X  X  X  X  X  XXXX  X
16 +X  X  XXXXXXX  X  X  X  X  X  XXXX  X  X  X  X  X  XXXX  X
17 +X  X  XXXXXXX  X  X  X  X  X  X  X  X  X  X  X  X  XXXX  X
18 +X  X  XXXX  X  X  X  X  X  X  X  X  X  X  X  X  X  XXXX  X
19 +X  X  XXXX  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X

The Analysis : Icicle Plot

•The icicle plot shows how the clusters are combined. It is read from bottom to top.

•Initially each of the 20 cases is treated as a cluster of its own. In the row labelled 19 the first combination has been made, leaving 19 clusters.

•The icicle plot represents the whole process of cluster formation in pictorial form. For example, if we take the row labelled 7 we see that there are 7 clusters, each denoted by an unbroken series of X's:

X  XXXXXXXXXXXXX  XXXXXXXXXX  XXXX  X  XXXX  XXXXXXXXXXXXX

•Each subsequent step leads to the formation of a new cluster in one of the following three ways:

–Two individual cases are grouped together
–A case is joined to an already existing cluster
–Two clusters are grouped together



The following output “Dendrogram” is displayed by SPSS

Dendrogram using Average Linkage (Between Groups)

Rescaled Distance Cluster Combine

   C A S E    0         5        10        15        20        25
Label    Num  +---------+---------+---------+---------+---------+

Case 14   14   -+
Case 16   16   -+-+
Case 10   10   -+ +-+
Case 4     4   ---+ +-------------+
Case 19   19   -----+             +-------------------+
Case 18   18   -------------------+                   |
Case 2     2   -+-------+                             +---------+
Case 13   13   -+       |                             |         |
Case 5     5   -+-+     +-----------------------------+         |
Case 11   11   -+ +-+   |                                       |
Case 9     9   ---+ +---+                                       |
Case 20   20   -----+                                           |
Case 3     3   -+---------+                                     |
Case 8     8   -+         |                                     |
Case 6     6   -+-+       +-+                                   |
Case 7     7   -+ |       | |                                   |
Case 12   12   ---+---+   | +-----------------------------------+
Case 1     1   ---+   +---+ |
Case 17   17   -------+     |
Case 15   15   -------------+

The Analysis : Dendrogram

•The dendrogram is a graphical output which is useful in identifying the clusters. It is read from left to right.

•Vertical lines represent the clusters that are joined together. The position of the vertical line on the scale indicates the distance at which the clusters were joined. Because many of the distances in the early stages are of similar magnitude, it is difficult to tell the sequence in which some of the early clusters were formed. However, it is clear that in the last two stages, the distances at which the clusters are combined are large. This information is useful in deciding the number of clusters to retain.
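
One way to act on this reading is to replay the agglomeration schedule and stop just before the largest increase in the coefficient. The sketch below hard-codes the (Cluster 1, Cluster 2, Coefficient) columns of the SPSS schedule shown earlier. Stopping at the largest jump is a rule of thumb assumed here for illustration, not something SPSS decides for you.

```python
# (cluster_1, cluster_2, coefficient) for stages 1..19 of the SPSS schedule.
SCHEDULE = [
    (14, 16, 2.0), (6, 7, 2.0), (10, 14, 3.0), (2, 13, 3.0), (5, 11, 3.0),
    (3, 8, 3.0), (6, 12, 4.0), (4, 10, 4.333333), (5, 9, 4.5), (1, 6, 5.0),
    (4, 19, 7.25), (5, 20, 7.333333), (1, 17, 8.25), (2, 5, 10.75),
    (1, 3, 11.3), (1, 15, 14.0), (4, 18, 20.200001), (2, 4, 38.611111),
    (1, 2, 48.291668),
]
N_CASES = 20

def clusters_at(n_clusters):
    """Replay the schedule until only n_clusters clusters remain."""
    clusters = {case: {case} for case in range(1, N_CASES + 1)}
    for first, second, _ in SCHEDULE[: N_CASES - n_clusters]:
        # The merged cluster keeps the label of the first cluster.
        clusters[first] |= clusters.pop(second)
    return sorted(sorted(members) for members in clusters.values())

# Increase in the coefficient from each stage to the next.
coefficients = [coef for _, _, coef in SCHEDULE]
jumps = [later - earlier
         for earlier, later in zip(coefficients, coefficients[1:])]

# Stop before the stage that causes the largest jump (rule of thumb).
biggest = max(range(len(jumps)), key=jumps.__getitem__)
suggested = N_CASES - (biggest + 1)

print("suggested number of clusters:", suggested)
for members in clusters_at(suggested):
    print(members)
```

For this schedule the largest increase is from stage 17 (20.2) to stage 18 (38.6), which suggests retaining three clusters: respondents 1, 3, 6, 7, 8, 12, 15, 17; respondents 2, 5, 9, 11, 13, 20; and respondents 4, 10, 14, 16, 18, 19.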


Cluster Analysis

Exercises and Final Notes

Practice Example

• The following data was collected for US basketball champions:

– Height : Height in inches
– Weight : Weight in pounds
– FGPct : Field goal percentage
– Points : Average points per game
– Rebounds : Average rebounds per game

Champion         Height   Weight   FGPct   Points   Rebounds
Jabbar K.A.        86      230     55.9     24.6      11.2
Barry R            79      205     44.9     23.2       6.7
Baylor E           77      225     43.1     27.4      13.5
Bird L             81      220     50.3     25.0      10.2
Chamberlain W      85      275     54.0     30.1      22.9
Cousy B            73      175     37.5     18.4       5.2
Erving J           79      200     50.6     24.2       8.5
Johnson M          81      215     53.0     19.5       7.4
Jordan M           78      195     51.3     32.6       6.2
Robertson O        77      210     48.5     25.7       7.5
Russell B          82      220     44.0     15.1      22.6
West J             75      180     47.4     27.0       5.8

• Conduct a Hierarchical Cluster Analysis using

a) Height, Weight, FGPct, Points and Rebounds
b) Height, FGPct, Points and Rebounds
c) FGPct, Points and Rebounds

Analyse the dendrograms to identify how the clusters have changed between (a), (b) and (c).
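
Before running the exercise it can help to see how much a single variable contributes to a distance. The sketch below is a hint, not a solution to the exercise: it computes the squared Euclidean distance between two of the players on each of the three variable sets. The dictionary and function names are illustrative.

```python
COLUMNS = ("Height", "Weight", "FGPct", "Points", "Rebounds")

# Values copied from the practice table.
PLAYERS = {
    "Jabbar K.A.":   (86, 230, 55.9, 24.6, 11.2),
    "Barry R":       (79, 205, 44.9, 23.2, 6.7),
    "Baylor E":      (77, 225, 43.1, 27.4, 13.5),
    "Bird L":        (81, 220, 50.3, 25.0, 10.2),
    "Chamberlain W": (85, 275, 54.0, 30.1, 22.9),
    "Cousy B":       (73, 175, 37.5, 18.4, 5.2),
    "Erving J":      (79, 200, 50.6, 24.2, 8.5),
    "Johnson M":     (81, 215, 53.0, 19.5, 7.4),
    "Jordan M":      (78, 195, 51.3, 32.6, 6.2),
    "Robertson O":   (77, 210, 48.5, 25.7, 7.5),
    "Russell B":     (82, 220, 44.0, 15.1, 22.6),
    "West J":        (75, 180, 47.4, 27.0, 5.8),
}

VARIABLE_SETS = {
    "a": ("Height", "Weight", "FGPct", "Points", "Rebounds"),
    "b": ("Height", "FGPct", "Points", "Rebounds"),
    "c": ("FGPct", "Points", "Rebounds"),
}

def sq_distance(name_1, name_2, variables):
    """Squared Euclidean distance using only the named variables."""
    idx = [COLUMNS.index(v) for v in variables]
    x, y = PLAYERS[name_1], PLAYERS[name_2]
    return sum((x[i] - y[i]) ** 2 for i in idx)

distances = {
    key: sq_distance("Jabbar K.A.", "Chamberlain W", variables)
    for key, variables in VARIABLE_SETS.items()
}
for key, value in distances.items():
    print(f"({key}) {value:.2f}")
```

For this pair the 45-pound difference in weight alone contributes 2025 squared units, so the distance is about 2196.75 with all five variables but only about 171.75 once Weight is dropped. Expect the dendrogram for (a) to be driven largely by Weight for the same reason.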

Warning

• We have only shown the output of a hierarchical Cluster Analysis

• Similar Interpretations may or may not be applicable to non-hierarchical Cluster Analysis

• The analysis software used was SPSS® 6.0. The output may vary with the type of analysis tool selected

• Cluster Analysis should be run more than once using different distance measures and results compared before a final interpretation is attempted.

Thank You

Feel free to revert with your comments and suggestions
