
Stability of Cluster Analysis

STATISTIK AUSTRIA

Die Informationsmanager

Matthias Templ & Peter Filzmoser, Vienna University of Technology

Vienna, June 16, 2006

Stability of Cluster Analysis

For real data sets without an obvious grouping structure, the stability of the clusters depends on:

1. Input data - the selection of variables

2. Preparation of the data

3. Distance measure used ∗

4. Clustering method

5. Number of clusters

Changing one parameter may result in completely different cluster results.

∗ if a distance measure must be chosen

1. Input Data - Variable Selection

> library(mvoutlier)

> library(cluster)

> data(humus)

> a <- agnes(t(prepare(humus[, -c(1:3)])))
> plot(a, which.plots = 2, col = c(4), col.main = 3, col.sub = 2)


[Figure: Dendrogram of agnes(x = t(prepare(humus[, -c(1:3)]))), agglomerative coefficient = 0.47; the y-axis shows the height, the leaves are the variables Ag, Bi, Pb, Rb, Tl, Ba, Si, K, Mn, Zn, Al, Be, La, Y, Th, U, Cr, V, Fe, Sc, As, Co, Cu, Ni, Mo, Cd, P, B, Mg, Na, pH, Ca, Sr, Hg, S, N, C, H, LOI, Sb, Cond.]

A chemical process can be seen in more detail in a map (later) by choosing similar variables.

Page 2: Stability of Cluster Analysis - R: The R Project for Statistical …Filzmoser.pdf · Stability of Cluster Analysis ST A TISTIK AUSTRIA Die Infor mationsmanager Matthias Templ & Peter

1. Input Data - Variable Selection

A selection of variables may be useful when clustering high-dimensional data because ...

• clustering with all variables may hide underlying processes

• we want to see some processes in more detail

• the inclusion of one irrelevant variable may hide the real clusters in the data

One easy way (amongst others) to select variables is to graphically inspect a dendrogram obtained from a hierarchical clustering of the variables.
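A minimal sketch of this approach, reusing the agnes() call from the previous slide; the subset Co, Ni, Cu, As, Mo picked at the end is the pollution-related group used later in these slides and is only one possible choice:

> library(mvoutlier)                        # humus data and prepare()
> library(cluster)                          # agnes()
> data(humus)
> x <- humus[, -c(1:3)]                     # drop the first three (non-element) columns, as above
> a <- agnes(t(prepare(x)))                 # hierarchical clustering of the variables
> plot(a, which.plots = 2)                  # inspect the dendrogram ...
> x.sub <- x[, c("Co", "Ni", "Cu", "As", "Mo")]   # ... and keep one branch of similar variables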

2. Data Preparation

Most real data sets in practice have some or all of these properties:

• neither normal nor log-normal

• strongly skewed

• often multi-modal distributions

• dependencies between observations

• weak clustering structures

• data includes outliers

• variables show a striking difference in the amount of variability

2. Data Preparation

If a good clustering structure exists for a variable, we expect a distribution with two or more modes. A transformation (e.g. a Box-Cox transformation) will preserve the modes but remove large skewness.

Standardisation of the variables is needed if the variables show a striking difference in the amount of variability.

Outliers can influence the clustering (depending on which clustering algorithm is chosen), so removing outliers before clustering may be useful. Finding outliers is not a trivial task, especially in high dimensions (you can do this e.g. with the package mvoutlier from Filzmoser et al. (2005)).
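A minimal sketch of such a preparation step. prepare() is assumed to handle the transformation and standardisation of the selected columns, as on the other slides; the outlier flagging via the adjusted quantile plot aq.plot() from mvoutlier and the restriction to the five-element subset are illustrative choices, not the authors' exact workflow:

> library(mvoutlier)
> data(humus)
> x <- prepare(humus[, c("Co", "Ni", "Cu", "As", "Mo")])   # transform and standardise a variable subset
> out <- aq.plot(x)                         # robust multivariate outlier detection (adjusted quantile plot)
> x.clean <- x[!out$outliers, ]             # drop the flagged observations before clustering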

3. Distance Measure

[Figure: boxplots of the Rand index for 500 bootstrap samples, one box for each combination of clustering method (pam, kmeans) and dissimilarity (euclidean, manhattan, gower, rf); x-axis: Rand index, from 0.2 to 1.0.]

Comparing the clustered data and clustered subsets of the data with the Rand index. Distance measures which result in high Rand indices should be chosen.
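The figure itself is not reproduced here, but a rough sketch of the underlying idea (not the authors' original code) could look as follows: the full data and random subsets are clustered with pam() on different dissimilarities from daisy(), and the partitions are compared with a hand-written Rand index. The random-forest (rf) dissimilarity from the slide is omitted, the number of clusters is a hypothetical choice, and 20 subsets stand in for the 500 bootstrap samples:

> library(cluster)
> library(mvoutlier)
> data(humus)
> rand.index <- function(a, b) {            # plain Rand index of two partitions
+   tab <- table(a, b); n <- length(a); sc <- function(m) sum(choose(m, 2))
+   (choose(n, 2) + 2 * sc(tab) - sc(rowSums(tab)) - sc(colSums(tab))) / choose(n, 2)
+ }
> x <- as.data.frame(prepare(humus[, c("Co", "Ni", "Cu", "As", "Mo")]))
> k <- 3                                    # hypothetical number of clusters
> ri <- sapply(c("euclidean", "manhattan", "gower"), function(m) {
+   full <- pam(daisy(x, metric = m), k)$clustering      # clustering of the full data
+   replicate(20, {
+     idx <- sample(nrow(x), round(0.7 * nrow(x)))        # cluster a random 70% subset
+     sub <- pam(daisy(x[idx, ], metric = m), k)$clustering
+     rand.index(full[idx], sub)                          # agreement on the common observations
+   })
+ })
> boxplot(as.data.frame(ri), ylab = "Rand index")         # prefer distances with high Rand indices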

Page 3: Stability of Cluster Analysis - R: The R Project for Statistical …Filzmoser.pdf · Stability of Cluster Analysis ST A TISTIK AUSTRIA Die Infor mationsmanager Matthias Templ & Peter

4. Clustering Method

[Figure: the same Rand index boxplots as on the previous slide, one box for each combination of pam/kmeans with the euclidean, manhattan, gower and rf dissimilarities.]

Comparing the clustered data and clustered subsets of the data with the Rand index. Algorithms which result in high Rand indices may be chosen.
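The same idea can be used to compare algorithms; a rough sketch (again not the authors' original code) comparing pam() and kmeans() on Euclidean distances, with the Rand index function repeated from the previous sketch:

> library(cluster)
> library(mvoutlier)
> data(humus)
> rand.index <- function(a, b) {            # Rand index, as in the previous sketch
+   tab <- table(a, b); n <- length(a); sc <- function(m) sum(choose(m, 2))
+   (choose(n, 2) + 2 * sc(tab) - sc(rowSums(tab)) - sc(colSums(tab))) / choose(n, 2)
+ }
> x <- as.matrix(prepare(humus[, c("Co", "Ni", "Cu", "As", "Mo")]))
> k <- 3                                    # hypothetical number of clusters
> full <- list(pam = pam(x, k)$clustering, kmeans = kmeans(x, k)$cluster)
> ri <- sapply(names(full), function(m) replicate(20, {
+   idx <- sample(nrow(x), round(0.7 * nrow(x)))          # cluster a random 70% subset
+   sub <- if (m == "pam") pam(x[idx, ], k)$clustering else kmeans(x[idx, ], k)$cluster
+   rand.index(full[[m]][idx], sub)
+ }))
> boxplot(as.data.frame(ri), ylab = "Rand index")         # prefer algorithms with high Rand indices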

5. Number of Clusters

[Figure: wb.ratio versus number of clusters (2 to 8), one panel each for bclust, clara, cmeans, kcca/Kmedians, Mclust and specc/Polydot, with separate curves for the euclidean, gower, manhattan, rf and no dissimilarity; data: the humus subset Co, Ni, Cu, As, Mo (pollution).]
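A sketch for a single curve of this plot, assuming the wb.ratio (the ratio of the average within-cluster to the average between-cluster dissimilarity) is taken from fpc::cluster.stats(); clara() stands in here for the other algorithms shown in the panels:

> library(cluster)
> library(fpc)
> library(mvoutlier)
> data(humus)
> x <- as.matrix(prepare(humus[, c("Co", "Ni", "Cu", "As", "Mo")]))   # data: Humus Co Ni Cu As Mo
> d <- dist(x)                                    # Euclidean distances
> wb <- sapply(2:8, function(k)
+   cluster.stats(d, clara(x, k)$clustering)$wb.ratio)
> plot(2:8, wb, type = "b",
+      xlab = "number of clusters", ylab = "wb.ratio")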


Example

[Figure: the nine resulting clusters, each labelled with its size and separation: obs = 27 (separation = 2.021), 35 (2.672), 126 (3.813), 22 (2.464), 74 (2.555), 34 (2.366), 166 (2.027), 100 (2.118), 33 (2.49).]


Pollution

The highest pollution is visualised by cluster 9; this can be seen in the graphic on the right. Co, Cu and Ni, e.g., are typical elements reflecting pollution.

Mclust on the scaled and transformed humus data; validity measure on each cluster; cluster size; visualising all clusters, each in its own map.

Seaspray

Cluster 5 (the greyscale depends on the validity measure in each cluster).
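A minimal sketch of the Mclust step described above, run on the transformed and standardised five-element subset; the nine components follow the example figure, while the coordinate column names XCOO and YCOO used for the rough map are an assumption, not taken from the slides:

> library(mclust)
> library(mvoutlier)
> data(humus)
> x <- as.data.frame(prepare(humus[, c("Co", "Ni", "Cu", "As", "Mo")]))
> mc <- Mclust(x, G = 9)                   # model-based clustering with nine components
> table(mc$classification)                 # cluster sizes, cf. the obs counts in the example figure
> ## rough map of a single cluster (coordinate names XCOO/YCOO are assumed):
> plot(humus$XCOO, humus$YCOO, pch = 16, col = "grey80", asp = 1, xlab = "", ylab = "")
> points(humus$XCOO[mc$classification == 9],
+        humus$YCOO[mc$classification == 9], pch = 16)    # highlight one cluster in its own map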

Conclusions

• Applying cluster analysis to real data yields highly unstable results, for many reasons.

• The selection of variables and of the optimal number of clusters for real data is a non-trivial task.

• Cluster analysis can be seen as exploratory data analysis to get ideas about your data.

• Interactive tools which allow for various methods are very helpful.