Download pdf - Research Methodology: Tools 04_EN.pdfResearch Methodology: Tools Applied Data Analysis (with SPSS) Lecture 04: Cluster Analysis March 2011 Prof. Dr. Jürg Schwarz juerg.schwarz @hslu.ch

Research Methodology: Tools

Applied Data Analysis (with SPSS)

Lecture 04: Cluster Analysis

March 2011

Prof. Dr. Jürg Schwarz [email protected]

MSc Business Administration

Slide 2

Contents

Aims ___________________________________________________________________________________________________ 5

Introduction _____________________________________________________________________________________________ 6

Outline _________________________________________________________________________________________________ 9

Concepts of Cluster Analysis______________________________________________________________________________ 10

Cluster Analysis with SPSS: A detailed example ______________________________________________________________ 24

Slide 3

Table of contents

Aims ___________________________________________________________________________________________________ 5

Aims of the lecture .................................................................................................................................................................................................5

Introduction _____________________________________________________________________________________________ 6

Example .................................................................................................................................................................................................................6

Outline _________________________________________________________________________________________________ 9

Concepts of Cluster Analysis______________________________________________________________________________ 10

Key steps in using a cluster analysis ....................................................................................................................................................................10

How to measure proximity....................................................................................................................................................................................11

Proximity measure with interval variables.............................................................................................................................................................13

Proximity measure with binary variables...............................................................................................................................................................15

How to form Clusters............................................................................................................................................................................................18

How to define similarity? ............................................................................................................................................................................................................18

Cluster formation tree (rules for cluster formation) ....................................................................................................................................................................20

Pros and cons ............................................................................................................................................................................................................................21

Example of hierarchical method: Single linkage (nearest neighbor) .........................................................................................................................................22

Example of hierarchical method: Complete linkage (furthest neighbor) ....................................................................................................................................23

Slide 4

Cluster Analysis with SPSS: A detailed example ______________________________________________________________ 24

Marketing research: Customer survey on brand awareness.................................................................................................................................24

SPSS Elements: <Analyze><Classify><Hierarchical ...........................................................................................................................................25

First step: Measure of distance or similarity between objects ...............................................................................................................................27

Output ........................................................................................................................................................................................................................................27

Second step: Formation of clusters ......................................................................................................................................................................28

Between-groups linkage ............................................................................................................................................................................................................28

Dendrogram ...............................................................................................................................................................................................................................29

Third step: Determining the number of clusters ....................................................................................................................................................31

Fourth step: Display and save cluster membership ..............................................................................................................................................33

Output table of cluster membership...........................................................................................................................................................................................33

Saving the cluster membership..................................................................................................................................................................................................34

Scatter plot: <Graphs><Chart Builder…>..................................................................................................................................................................................35

Fifth step: Interpretation of clusters ......................................................................................................................................................................37

Taking into account means ........................................................................................................................................................................................................37

Example of Lecture 01: Marketing survey on consumer buying behavior.................................................................................................................................37

Slide 5

Aims

Aims of the lecture

You know different types of measures of distance / similarity

You know the key steps in conducting a cluster analysis.

You can conduct a cluster analysis with SPSS

(Hierarchical agglomerative methods: Between-groups linkage and Ward)

In particular, you know how to …

◦ choose the appropriate measure of distance / similarity

◦ interpret the agglomeration schedule

◦ use the dendrogram to determine the number of clusters

◦ interpret the meaning of a cluster

Slide 6

Introduction

Example

Marketing research: Customer survey on brand awareness ("Markenbewusstsein")

Bra

nd a

ware

ness [

Index]

Yearly income [Index]

Survey features

Sample of n = 150 customers

Brand awareness index consist of 3 items:

◦ How likely is it that you will use the brand again in the future?

◦ How likely would you be to recommend the brand to your friends?

◦ Overall, how satisfied are you with the brand?

Also included in the dataset:

◦ yearly income

Slide 7

Question

Is there a linear relation between brand awareness and yearly income?

Hypothesis: The higher a person's income, the higher his/her brand awareness.

Conduct regression analysis with SPSS

Bra

nd a

ware

ness [

Index]

Output (summarized)

Overall model test (F-test)

Significance p = .014

Test of coefficients

Constant p = .000

Income p = .014

Coefficient of determination

R Square = .040

It is a really poor model

It seems to have structure in the

brand awareness dataset.


Slide 8

Question

Is there structure in the brand awareness dataset?

Are there clusters for the combination of yearly income and brand awareness?

Conduct cluster analysis

Bra

nd a

ware

ness [

Index]


Output

SPSS identified 3 distinct clusters

Interpretation

People with low income are least aware

because they lack money.

People with middle income have the

highest brand awareness because of

the dream of being richer.

People with high income are moderately

brand aware because they have a

certain status but don't need to show off.

Slide 9

Outline

Cluster analysis is a multivariate procedure for detecting natural groupings in data.

The grouping is based on the scores of several measures (e.g. income and awareness).

Bra

nd a

ware

ness [

Index]


Goals in conducting cluster analysis

Elements within a group should be as

similar as possible

<=> distance d should be small

Similarities between the groups should be

minimal

<=> distance D should be large

Features

Because all information is used for

grouping, cluster analysis is more

objective than just a subjective impression.

There is no optical illusion.

D

d

Slide 10

Concepts of Cluster Analysis

Key steps in using a cluster analysis

1. Measure of distance or similarity between objects (also called proximity measure)

◦ Depends on type of data: interval, counts, binary

◦ Distance: geometrical measure. Similarity: content-related measure

2. Formation of clusters

◦ Calculation of proximity matrix

◦ Many different procedures: Hierarchical / non-hierarchical, agglomerative / divisive etc.

3. Tools / criteria for determining the number of clusters

◦ Tools: Agglomeration schedule, structural chart, dendrogram, icicle plot ("Eiszapfen-Plot")

◦ Criteria (not available in SPSS): F-value, information criterion, etc.

4. Display and save cluster membership

◦ Done by SPSS

5. Interpretation of clusters

◦ Taking into account means (possibly variances) of cluster members

Slide 11

How to measure proximity

From dataset ...

Variable 1 Variable 2 Variable 3 : Variable j

Object 1

Object 2

Object 3

:

Object k

... to proximity matrix (done by SPSS internally)

Object 1 Object 2 Object 3 : Object k

Object 1

Object 2

Object 3

:

Object k

raw data

distance or similarity

Slide 12

Different proximity measures, depending on type of data

Measure allows specifying the distance (d) or similarity (s) to be used in clustering.

Interval (e.g. brand awareness, yearly income)

◦ Euclidean distance (d)

◦ City block distance (d)

◦ Pearson correlation (d) :

Counts (e.g. number of clients)

◦ Chi-square measure (s)

◦ Phi-square measure (s) :

Binary (e.g. yes/no, female/male)

◦ Euclidean distance (d)

◦ Russel and Rao (s)

◦ Simple matching (s)

◦ Dice (s)

(only a selection of 27!)

Slide 13

Proximity measure with interval variables

Example: Brand awareness

Theorem of Pythagoras about right triangle

cba cba 22222 =+=>=+

Distance between "pers_001" and "pers_002"

[ ][ ]

407.1

488.1490.0

73.195.297.067.1d

2/1

2/122

002,001

=

+=

−+−=

Coordinates {x-axis, y-axis}

a

bc

0.97 1.67

2.95

1.73

1.407

a

bc

0.97 1.67

2.95

1.73

1.407

{1.67, 1.73}

{0.97, 2.95}

Slide 14

Generalized equation

Minkowski distance (Hermann Minkowski, 1864 - 1909, German physicist) r/1

J

1j

r

ljkjl,k xxd

−= ∑

=

r = Minkowski's constant

dk,l = Distance between objects k and l (e.g. distance between persons 001 and 002)

J = Number of cluster variables (e.g. variables income and awareness)

xkj, xlj = Values of variable j of objects k and l (e.g. income of persons 001 and 002)

Values of Minkowski's constant

◦ r = 1: City block distance (also called L1-norm)

◦ r = 2: Euclidean distance (also called L2-norm)

City block distance

Manhattan distance

Taxi distance

L2

L1

L2

L1

Slide 15

Proximity measure with binary variables

Example: Car configuration

Identification of similarities between two objects by means of comparison

ABS Airbag ESP Navi Metallic

Mercedes 0 1 1 1 0

BMW 0 1 1 0 1

Case D A A C B

0 = feature not present 1 = feature present

Configuration

4 Cases

A = Feature exists in both comparison objects

B, C = Feature exists in one comparison object

D = Feature exists in none of the comparison objects

Non-existence is also an important similarity in proximity definition

Slide 16

Binary proximity measures

Proximity measure between two objects i and j depends on whether and how

the cases are included and how they are weighted (weights α, δi und λ).

General case: Simple Matching Coefficient*

ij

a dS

a (b c) d1

2

α ⋅ + δ ⋅=

α ⋅ + λ + + δ ⋅

Variants Description Definition

Russel und Rao Case d reduces proximity ij

aS

a b c d=

+ + +

Simple matching Case d raises proximity ij

a dS

a b c d

+=

+ + +

Dice Case d is not taken into account

Similar features are weighted more ij

2aS

2a b c=

+ +

*Sokal, R.R. and Michener, C.D., Statistical method for evaluating systematic relationships, *University of Kansas science bulletin, 38:1409--1438, 1958.

a = Number of cases of case "A" b = Number of cases of case "B" :

Slide 17

Example: Car configuration

ABS Airbag ESP Navi Metallic

Mercedes 0 1 1 1 0

BMW 0 1 1 0 1

Case D A A C B

0 = feature not present 1 = feature present

Configuration

Measure Proximity

Russel and Rao ij

2 2S 0.4

2 1 1 1 5= = =

+ + +

Simple matching ij

2 1 3S 0.6

2 1 1 1 5

+= = =

+ + +

Dice ij

2 2 4S 0.67

2 2 1 1 6

⋅= = =

⋅ + +

Some remarks

Sij varies between 0 and 1

There is no "right" proximity measure

Important question/decision:

Is non-existence important?

(<=> taking case d into account?)

Count of cases

a = 2

b = 1

c = 1

d = 1

Slide 18

How to form Clusters

Cluster A Cluster B

1.

2.

3.

How to define similarity?

Similarity between cluster A and cluster B is measured by …

1. Nearest neighbor (also called single linkage in the cluster formation tree on slide 20)

... the minimum of all possible distances between the cases in cluster A and the cases in B.

2. Centroid clustering (also called other linkage)

... the distance between the centroids of cluster A and of cluster B.

3. Furthest neighbor (also called complete linkage)

... the maximum of all possible distances between the cases in cluster A and the cases in B.

Slide 19

Similarity between cluster A and cluster B is measured by …

Between-groups linkage (also called average linkage)

... the average of all the possible distances between the cases in cluster A and the cases in B.

Within-groups linkage (also called other linkage)

... the average of all the possible distances between the cases within a single new cluster

determined by combining cluster A and cluster B.

Median clustering (also called other linkage)

... the distance between the SPSS determined median for the cases in cluster A and the median

for the cases in cluster B.

Special case, taking into account sum of squares

Ward’s method

For a cluster the sum of squares is the sum of squared distances of each case from the centroid.

d1

d 2

Sum of squared distances

∑=

=++k

1i

2

i

2

2

2

1 d ...dd

Slide 20

Cluster formation tree (rules for cluster formation)

There are several types of clustering procedures:

DivisiveAgglomerative

Variance

methods

Linkage

methods

Ward’s procedure

Singlelinkage

Clusteralgorithms

Hierarchical

Completelinkage

Averagelinkage

k-Meansprocedure

Non-hierarchical

Other linkage

Non-hierarchical clustering is also called k-means clustering.

Average linkage between groups is the default in SPSS ("Between-groups linkage")

used in this course

Slide 21

Pros and cons

Hierarchical clustering

◦ No a priori decision about the number of clusters

◦ Can be very slow

Non-hierarchical clustering

◦ Need to specify the number of clusters (can be an arbitrary number)

◦ Faster, more reliable

Features

Procedure Proximity measure Remark

Single linkage distance or similarity tendency to form chains

Complete linkage distance or similarity tendency to smaller groups of same size

Average linkage distance or similarity "between" single and complete linkage

Other linkage only distance No remark

Ward's method only distance tendency to groups of same size

Slide 22

Example of hierarchical method: Single linkage (nearest neighbor)

◦ Tendency to form chains

◦ Suitable for the detection of outliers

◦ Close groups are badly separated

Cognitive Psychology Unit at Saarland University (www.uni-saarland.de) (Date of access: March, 2011)

nearest neighbor

Step k Step k + 1

"chain"

Slide 23

Example of hierarchical method: Complete linkage (furthest neighbor)

◦ Tendency to form smaller groups with same size

◦ Not suitable for detecting outliers

Cognitive Psychology Unit at Saarland University (www.uni-saarland.de) (Date of access: March, 2011)

furthest neighbor

Step k Step k + 1

Slide 24

Cluster Analysis with SPSS: A detailed example

Marketing research: Customer survey on brand awareness

Data

Random sub-sample of n = 15

(Why this small sub-sample?

Just to keep track of what SPSS does.)

Data set: cluster_small.sav

Syntax: cluster_small.sps

Bra

nd a

ware

ness [

Index]


Slide 25

SPSS Elements: <Analyze><Classify><Hierarchical ..

Slide 26

Syntax

CLUSTER income awareness Variables included

/METHOD BAVERAGE Cluster method "Between-groups linkage"

/MEASURE= EUCLID Proximity measure "Euclidian distance"

/ID=person Labels cases in plots and tables

/PRINT SCHEDULE Schedule of cluster algorithm

/PRINT DISTANCE Matrix with distances ("Proximity Matrix")

/PLOT DENDROGRAM VICICLE. Tools for determining the number of clusters

Cluster method "Between-groups linkage" (which is default)

<=> A better choice would be Ward's method. Here "Between-groups linkage" was used to show in detail how SPSS runs a cluster analysis.

Proximity measure "Euclidian distance" (used to show how SPSS performs a cluster analysis)

<=> The squared Euclidean measure (which is default) should be used when the BAVERAGE, CENTROID, MEDIAN, or WARD cluster method is requested.

Slide 27

First step: Measure of distance or similarity between objects

Output

Proximity Matrix (Distances or similarities between items)

Values represent Euclidian distances

Example:

Distance between cases 9 and 7

:

:

Slide 28

Second step: Formation of clusters

Between-groups linkage

Stage 1: Cases 7 and 9 have smallest distance ("Coefficients" = .203) => first cluster {7,9}

First cluster {7,9} will be clustered with case 10 in stage 5 => cluster {7,9,10}

Stage 2: Cases 13 and 14 have second smallest distance => second cluster {13,14}

Second cluster {13,14} will be clustered with case 11 in stage 3 => cluster {11,13,14}

:

Agglomeration schedule: Displays

the clusters combined at each stage.

Slide 29

Dendrogram

Stage 1

Stage 5

Stage 2

Stage 3

Slide 30

Icicle plot

14 clusters: Cases 7 and 9 in one cluster, all others each in their own clusters.

13 clusters: 7 and 9 in one cluster, 13 and 14 in one cluster, all others each in their clusters.

12 clusters: 7 and 9 in one cluster, 11, 13 and 14 in one cluster, all others each in their clusters.

:

The figure is called an

icicle plot because the

columns look like icicles

hanging from above.

The plot shows how cases

are merged into clusters.

Read it from bottom to top

Slide 31

Third step: Determining the number of clusters

0) Theoretical and empirical reasons (But, be careful about optical illusion!)

In the case of brand awareness there are some indications for three clusters.

A) Elbow criterion in the structure chart (can't be done with SPSS, but with Excel)

0.0

0.5

1.0

1.5

2.0

2.5

3.0

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Please note: Mostly there is large effect from cluster 1 to cluster 2 which is not the "elbow".

Pro

xim

ity (

"Coeff

icie

nts

")

Number of clusters (=sample size - "Stage")

elbow => choose 3 clusters

Slide 32

B) Dendrogram

Choose the number of clusters within the largest increase in heterogeneity

Standardized distance

Largest increase in heterogeneity

Slide 33

Fourth step: Display and save cluster membership

Output table of cluster membership

Example of brand awareness: assumed 3 clusters

If you're not sure

about the number of

clusters, choose a full

range

Slide 34

Saving the cluster membership

Used for drawing a scatter plot, for example.

Range of solutions: 2 to 5

Example of brand awareness: assumed 3 clusters

Slide 35

Scatter plot: <Graphs><Chart Builder…>

Slide 36

One point was assigned incorrectly

Slide 37

Fifth step: Interpretation of clusters

In the case of the brand awareness example, the interpretation is obvious and straightforward.

Taking into account means

The means of the clusters with respect to the original variables

indicate how the clusters can be interpreted.

Example of Lecture 01: Marketing survey on consumer buying behavior

Questionnaire to ask people about their attitudes.

Among other questions:

"What is your general attitude to life?" (variable x1)

"What is your attitude to innovation?" (variable x2)

"What is your willingness to take risks?" (variable x3)

Scales of variables vary

from extremely negative (1)

to extremely positive (7)

general

attitude to life

attitude to

innovation

willingness

to take risks

Person A 1 2 2

Person B 1 3 3

Person C 2 4 2

Person D 5 4 3

Person E 5 4 4

Person F 7 6 7

Ob

jec

ts

Attributes

Data of 6 people

Slide 38

general

attitude to life

attitude to

innovation

willingness

to take risks

(A, B, C) 1.3 3 2.3

(D, E) 5 4 3.5

(F) 7 6 7Clu

ste

r

Attributes

Mean of clusters, regarding the cluster variables

Cluster 1 (A, B, C): pessimistic people who live in fear

Cluster 2 (D, E): slightly optimistic normalos

Cluster 3 (F): life-affirming adventurer