Research Methodology: Tools
Applied Data Analysis (with SPSS)
Lecture 04: Cluster Analysis
March 2011
Prof. Dr. Jürg Schwarz [email protected]
MSc Business Administration
Slide 2
Contents
Aims ___________________________________________________________________________________________________ 5
Introduction _____________________________________________________________________________________________ 6
Outline _________________________________________________________________________________________________ 9
Concepts of Cluster Analysis______________________________________________________________________________ 10
Cluster Analysis with SPSS: A detailed example ______________________________________________________________ 24
Slide 3
Table of contents
Aims ___________________________________________________________________________________________________ 5
Aims of the lecture .................................................................................................................................................................................................5
Introduction _____________________________________________________________________________________________ 6
Example .................................................................................................................................................................................................................6
Outline _________________________________________________________________________________________________ 9
Concepts of Cluster Analysis______________________________________________________________________________ 10
Key steps in using a cluster analysis ....................................................................................................................................................................10
How to measure proximity....................................................................................................................................................................................11
Proximity measure with interval variables.............................................................................................................................................................13
Proximity measure with binary variables...............................................................................................................................................................15
How to form Clusters............................................................................................................................................................................................18
How to define similarity? ............................................................................................................................................................................................................18
Cluster formation tree (rules for cluster formation) ....................................................................................................................................................................20
Pros and cons ............................................................................................................................................................................................................................21
Example of hierarchical method: Single linkage (nearest neighbor) .........................................................................................................................................22
Example of hierarchical method: Complete linkage (furthest neighbor) ....................................................................................................................................23
Slide 4
Cluster Analysis with SPSS: A detailed example ______________________________________________________________ 24
Marketing research: Customer survey on brand awareness.................................................................................................................................24
SPSS Elements: <Analyze><Classify><Hierarchical ...........................................................................................................................................25
First step: Measure of distance or similarity between objects ...............................................................................................................................27
Output ........................................................................................................................................................................................................................................27
Second step: Formation of clusters ......................................................................................................................................................................28
Between-groups linkage ............................................................................................................................................................................................................28
Dendrogram ...............................................................................................................................................................................................................................29
Third step: Determining the number of clusters ....................................................................................................................................................31
Fourth step: Display and save cluster membership ..............................................................................................................................................33
Output table of cluster membership...........................................................................................................................................................................................33
Saving the cluster membership..................................................................................................................................................................................................34
Scatter plot: <Graphs><Chart Builder…>..................................................................................................................................................................................35
Fifth step: Interpretation of clusters ......................................................................................................................................................................37
Taking into account means ........................................................................................................................................................................................................37
Example of Lecture 01: Marketing survey on consumer buying behavior.................................................................................................................................37
Slide 5
Aims
Aims of the lecture
You know different types of measures of distance / similarity
You know the key steps in conducting a cluster analysis.
You can conduct a cluster analysis with SPSS
(Hierarchical agglomerative methods: Between-groups linkage and Ward)
In particular, you know how to …
◦ choose the appropriate measure of distance / similarity
◦ interpret the agglomeration schedule
◦ use the dendrogram to determine the number of clusters
◦ interpret the meaning of a cluster
Slide 6
Introduction
Example
Marketing research: Customer survey on brand awareness ("Markenbewusstsein")
Bra
nd a
ware
ness [
Index]
Yearly income [Index]
Survey features
Sample of n = 150 customers
Brand awareness index consist of 3 items:
◦ How likely is it that you will use the brand again in the future?
◦ How likely would you be to recommend the brand to your friends?
◦ Overall, how satisfied are you with the brand?
Also included in the dataset:
◦ yearly income
Slide 7
Question
Is there a linear relation between brand awareness and yearly income?
Hypothesis: The higher a person's income, the higher his/her brand awareness.
Conduct regression analysis with SPSS
Bra
nd a
ware
ness [
Index]
Output (summarized)
Overall model test (F-test)
Significance p = .014
Test of coefficients
Constant p = .000
Income p = .014
Coefficient of determination
R Square = .040
It is a really poor model
It seems to have structure in the
brand awareness dataset.
Yearly income [Index]
Slide 8
Question
Is there structure in the brand awareness dataset?
Are there clusters for the combination of yearly income and brand awareness?
Conduct cluster analysis
Bra
nd a
ware
ness [
Index]
Yearly income [Index]
Output
SPSS identified 3 distinct clusters
Interpretation
People with low income are least aware
because they lack money.
People with middle income have the
highest brand awareness because of
the dream of being richer.
People with high income are moderately
brand aware because they have a
certain status but don't need to show off.
Slide 9
Outline
Cluster analysis is a multivariate procedure for detecting natural groupings in data.
The grouping is based on the scores of several measures (e.g. income and awareness).
Bra
nd a
ware
ness [
Index]
Yearly income [Index]
Goals in conducting cluster analysis
Elements within a group should be as
similar as possible
<=> distance d should be small
Similarities between the groups should be
minimal
<=> distance D should be large
Features
Because all information is used for
grouping, cluster analysis is more
objective than just a subjective impression.
There is no optical illusion.
D
d
Slide 10
Concepts of Cluster Analysis
Key steps in using a cluster analysis
1. Measure of distance or similarity between objects (also called proximity measure)
◦ Depends on type of data: interval, counts, binary
◦ Distance: geometrical measure. Similarity: content-related measure
2. Formation of clusters
◦ Calculation of proximity matrix
◦ Many different procedures: Hierarchical / non-hierarchical, agglomerative / divisive etc.
3. Tools / criteria for determining the number of clusters
◦ Tools: Agglomeration schedule, structural chart, dendrogram, icicle plot ("Eiszapfen-Plot")
◦ Criteria (not available in SPSS): F-value, information criterion, etc.
4. Display and save cluster membership
◦ Done by SPSS
5. Interpretation of clusters
◦ Taking into account means (possibly variances) of cluster members
Slide 11
How to measure proximity
From dataset ...
Variable 1 Variable 2 Variable 3 : Variable j
Object 1
Object 2
Object 3
:
Object k
... to proximity matrix (done by SPSS internally)
Object 1 Object 2 Object 3 : Object k
Object 1
Object 2
Object 3
:
Object k
raw data
distance or similarity
Slide 12
Different proximity measures, depending on type of data
Measure allows specifying the distance (d) or similarity (s) to be used in clustering.
Interval (e.g. brand awareness, yearly income)
◦ Euclidean distance (d)
◦ City block distance (d)
◦ Pearson correlation (d) :
Counts (e.g. number of clients)
◦ Chi-square measure (s)
◦ Phi-square measure (s) :
Binary (e.g. yes/no, female/male)
◦ Euclidean distance (d)
◦ Russel and Rao (s)
◦ Simple matching (s)
◦ Dice (s)
(only a selection of 27!)
Slide 13
Proximity measure with interval variables
Example: Brand awareness
Theorem of Pythagoras about right triangle
cba cba 22222 =+=>=+
Distance between "pers_001" and "pers_002"
[ ][ ]
407.1
488.1490.0
73.195.297.067.1d
2/1
2/122
002,001
=
+=
−+−=
Coordinates {x-axis, y-axis}
a
bc
0.97 1.67
2.95
1.73
1.407
a
bc
0.97 1.67
2.95
1.73
1.407
{1.67, 1.73}
{0.97, 2.95}
Slide 14
Generalized equation
Minkowski distance (Hermann Minkowski, 1864 - 1909, German physicist) r/1
J
1j
r
ljkjl,k xxd
−= ∑
=
r = Minkowski's constant
dk,l = Distance between objects k and l (e.g. distance between persons 001 and 002)
J = Number of cluster variables (e.g. variables income and awareness)
xkj, xlj = Values of variable j of objects k and l (e.g. income of persons 001 and 002)
Values of Minkowski's constant
◦ r = 1: City block distance (also called L1-norm)
◦ r = 2: Euclidean distance (also called L2-norm)
City block distance
Manhattan distance
Taxi distance
L2
L1
L2
L1
Slide 15
Proximity measure with binary variables
Example: Car configuration
Identification of similarities between two objects by means of comparison
ABS Airbag ESP Navi Metallic
Mercedes 0 1 1 1 0
BMW 0 1 1 0 1
Case D A A C B
0 = feature not present 1 = feature present
Configuration
4 Cases
A = Feature exists in both comparison objects
B, C = Feature exists in one comparison object
D = Feature exists in none of the comparison objects
Non-existence is also an important similarity in proximity definition
Slide 16
Binary proximity measures
Proximity measure between two objects i and j depends on whether and how
the cases are included and how they are weighted (weights α, δi und λ).
General case: Simple Matching Coefficient*
ij
a dS
a (b c) d1
2
α ⋅ + δ ⋅=
α ⋅ + λ + + δ ⋅
Variants Description Definition
Russel und Rao Case d reduces proximity ij
aS
a b c d=
+ + +
Simple matching Case d raises proximity ij
a dS
a b c d
+=
+ + +
Dice Case d is not taken into account
Similar features are weighted more ij
2aS
2a b c=
+ +
*Sokal, R.R. and Michener, C.D., Statistical method for evaluating systematic relationships, *University of Kansas science bulletin, 38:1409--1438, 1958.
a = Number of cases of case "A" b = Number of cases of case "B" :
Slide 17
Example: Car configuration
ABS Airbag ESP Navi Metallic
Mercedes 0 1 1 1 0
BMW 0 1 1 0 1
Case D A A C B
0 = feature not present 1 = feature present
Configuration
Measure Proximity
Russel and Rao ij
2 2S 0.4
2 1 1 1 5= = =
+ + +
Simple matching ij
2 1 3S 0.6
2 1 1 1 5
+= = =
+ + +
Dice ij
2 2 4S 0.67
2 2 1 1 6
⋅= = =
⋅ + +
Some remarks
Sij varies between 0 and 1
There is no "right" proximity measure
Important question/decision:
Is non-existence important?
(<=> taking case d into account?)
Count of cases
a = 2
b = 1
c = 1
d = 1
Slide 18
How to form Clusters
Cluster A Cluster B
1.
2.
3.
How to define similarity?
Similarity between cluster A and cluster B is measured by …
1. Nearest neighbor (also called single linkage in the cluster formation tree on slide 20)
... the minimum of all possible distances between the cases in cluster A and the cases in B.
2. Centroid clustering (also called other linkage)
... the distance between the centroids of cluster A and of cluster B.
3. Furthest neighbor (also called complete linkage)
... the maximum of all possible distances between the cases in cluster A and the cases in B.
Slide 19
Similarity between cluster A and cluster B is measured by …
Between-groups linkage (also called average linkage)
... the average of all the possible distances between the cases in cluster A and the cases in B.
Within-groups linkage (also called other linkage)
... the average of all the possible distances between the cases within a single new cluster
determined by combining cluster A and cluster B.
Median clustering (also called other linkage)
... the distance between the SPSS determined median for the cases in cluster A and the median
for the cases in cluster B.
Special case, taking into account sum of squares
Ward’s method
For a cluster the sum of squares is the sum of squared distances of each case from the centroid.
d1
d 2
Sum of squared distances
∑=
=++k
1i
2
i
2
2
2
1 d ...dd
Slide 20
Cluster formation tree (rules for cluster formation)
There are several types of clustering procedures:
DivisiveAgglomerative
Variance
methods
Linkage
methods
Ward’s procedure
Singlelinkage
Clusteralgorithms
Hierarchical
Completelinkage
Averagelinkage
k-Meansprocedure
Non-hierarchical
Other linkage
Non-hierarchical clustering is also called k-means clustering.
Average linkage between groups is the default in SPSS ("Between-groups linkage")
used in this course
Slide 21
Pros and cons
Hierarchical clustering
◦ No a priori decision about the number of clusters
◦ Can be very slow
Non-hierarchical clustering
◦ Need to specify the number of clusters (can be an arbitrary number)
◦ Faster, more reliable
Features
Procedure Proximity measure Remark
Single linkage distance or similarity tendency to form chains
Complete linkage distance or similarity tendency to smaller groups of same size
Average linkage distance or similarity "between" single and complete linkage
Other linkage only distance No remark
Ward's method only distance tendency to groups of same size
Slide 22
Example of hierarchical method: Single linkage (nearest neighbor)
◦ Tendency to form chains
◦ Suitable for the detection of outliers
◦ Close groups are badly separated
Cognitive Psychology Unit at Saarland University (www.uni-saarland.de) (Date of access: March, 2011)
nearest neighbor
Step k Step k + 1
"chain"
Slide 23
Example of hierarchical method: Complete linkage (furthest neighbor)
◦ Tendency to form smaller groups with same size
◦ Not suitable for detecting outliers
Cognitive Psychology Unit at Saarland University (www.uni-saarland.de) (Date of access: March, 2011)
furthest neighbor
Step k Step k + 1
Slide 24
Cluster Analysis with SPSS: A detailed example
Marketing research: Customer survey on brand awareness
Data
Random sub-sample of n = 15
(Why this small sub-sample?
Just to keep track of what SPSS does.)
Data set: cluster_small.sav
Syntax: cluster_small.sps
Bra
nd a
ware
ness [
Index]
Yearly income [Index]
Slide 25
SPSS Elements: <Analyze><Classify><Hierarchical ..
Slide 26
Syntax
CLUSTER income awareness Variables included
/METHOD BAVERAGE Cluster method "Between-groups linkage"
/MEASURE= EUCLID Proximity measure "Euclidian distance"
/ID=person Labels cases in plots and tables
/PRINT SCHEDULE Schedule of cluster algorithm
/PRINT DISTANCE Matrix with distances ("Proximity Matrix")
/PLOT DENDROGRAM VICICLE. Tools for determining the number of clusters
Cluster method "Between-groups linkage" (which is default)
<=> A better choice would be Ward's method. Here "Between-groups linkage" was used to show in detail how SPSS runs a cluster analysis.
Proximity measure "Euclidian distance" (used to show how SPSS performs a cluster analysis)
<=> The squared Euclidean measure (which is default) should be used when the BAVERAGE, CENTROID, MEDIAN, or WARD cluster method is requested.
Slide 27
First step: Measure of distance or similarity between objects
Output
Proximity Matrix (Distances or similarities between items)
Values represent Euclidian distances
Example:
Distance between cases 9 and 7
:
:
Slide 28
Second step: Formation of clusters
Between-groups linkage
Stage 1: Cases 7 and 9 have smallest distance ("Coefficients" = .203) => first cluster {7,9}
First cluster {7,9} will be clustered with case 10 in stage 5 => cluster {7,9,10}
Stage 2: Cases 13 and 14 have second smallest distance => second cluster {13,14}
Second cluster {13,14} will be clustered with case 11 in stage 3 => cluster {11,13,14}
:
Agglomeration schedule: Displays
the clusters combined at each stage.
Slide 29
Dendrogram
Stage 1
Stage 5
Stage 2
Stage 3
Slide 30
Icicle plot
14 clusters: Cases 7 and 9 in one cluster, all others each in their own clusters.
13 clusters: 7 and 9 in one cluster, 13 and 14 in one cluster, all others each in their clusters.
12 clusters: 7 and 9 in one cluster, 11, 13 and 14 in one cluster, all others each in their clusters.
:
The figure is called an
icicle plot because the
columns look like icicles
hanging from above.
The plot shows how cases
are merged into clusters.
Read it from bottom to top
Slide 31
Third step: Determining the number of clusters
0) Theoretical and empirical reasons (But, be careful about optical illusion!)
In the case of brand awareness there are some indications for three clusters.
A) Elbow criterion in the structure chart (can't be done with SPSS, but with Excel)
0.0
0.5
1.0
1.5
2.0
2.5
3.0
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Please note: Mostly there is large effect from cluster 1 to cluster 2 which is not the "elbow".
Pro
xim
ity (
"Coeff
icie
nts
")
Number of clusters (=sample size - "Stage")
elbow => choose 3 clusters
Slide 32
B) Dendrogram
Choose the number of clusters within the largest increase in heterogeneity
Standardized distance
Largest increase in heterogeneity
Slide 33
Fourth step: Display and save cluster membership
Output table of cluster membership
Example of brand awareness: assumed 3 clusters
If you're not sure
about the number of
clusters, choose a full
range
Slide 34
Saving the cluster membership
Used for drawing a scatter plot, for example.
Range of solutions: 2 to 5
Example of brand awareness: assumed 3 clusters
Slide 35
Scatter plot: <Graphs><Chart Builder…>
Slide 36
One point was assigned incorrectly
Slide 37
Fifth step: Interpretation of clusters
In the case of the brand awareness example, the interpretation is obvious and straightforward.
Taking into account means
The means of the clusters with respect to the original variables
indicate how the clusters can be interpreted.
Example of Lecture 01: Marketing survey on consumer buying behavior
Questionnaire to ask people about their attitudes.
Among other questions:
"What is your general attitude to life?" (variable x1)
"What is your attitude to innovation?" (variable x2)
"What is your willingness to take risks?" (variable x3)
Scales of variables vary
from extremely negative (1)
to extremely positive (7)
general
attitude to life
attitude to
innovation
willingness
to take risks
Person A 1 2 2
Person B 1 3 3
Person C 2 4 2
Person D 5 4 3
Person E 5 4 4
Person F 7 6 7
Ob
jec
ts
Attributes
Data of 6 people
Slide 38
general
attitude to life
attitude to
innovation
willingness
to take risks
(A, B, C) 1.3 3 2.3
(D, E) 5 4 3.5
(F) 7 6 7Clu
ste
r
Attributes
Mean of clusters, regarding the cluster variables
Cluster 1 (A, B, C): pessimistic people who live in fear
Cluster 2 (D, E): slightly optimistic normalos
Cluster 3 (F): life-affirming adventurer