Hierarchical clustering
Rebecca C. Steorts, Duke University
STA 325, Chapter 10 ISL
Agenda
- K-means versus hierarchical clustering
- Agglomerative vs divisive clustering
- Dendrogram (tree)
- Hierarchical clustering algorithm
- Single, complete, and average linkage
- Application to genomic data (PCA versus hierarchical clustering)
From K-means to Hierarchical clustering

Recall two properties of K-means clustering:

1. It fits exactly K clusters (as specified).
2. The final clustering assignment depends on the chosen initial cluster centers.

- Assume pairwise dissimilarities d_ij between data points.
- Hierarchical clustering produces a consistent result, without the need to choose initial starting positions (or the number of clusters).

Catch: we must choose a way to measure the dissimilarity between groups, called the linkage.

- Given the linkage, hierarchical clustering produces a sequence of clustering assignments.
- At one end, all points are in their own cluster; at the other end, all points are in one cluster.
Agglomerative vs divisive clustering
Agglomerative (i.e., bottom-up):

- Start with all points in their own group.
- Until there is only one cluster, repeatedly merge the two groups that have the smallest dissimilarity.

Divisive (i.e., top-down):

- Start with all points in one cluster.
- Until all points are in their own cluster, repeatedly split a group into two, creating the biggest dissimilarity.

Agglomerative strategies are simpler, so we'll focus on them. Divisive methods are still important, but you can read about them on your own if you want to learn more.
Simple example
Given these data points, an agglomerative algorithm might decide on a clustering sequence as follows:
(Figure: seven labeled points plotted as Dimension 1 versus Dimension 2.)
Step 1: {1}, {2}, {3}, {4}, {5}, {6}, {7};
Step 2: {1}, {2, 3}, {4}, {5}, {6}, {7};
Step 3: {1, 7}, {2, 3}, {4}, {5}, {6};
Step 4: {1, 7}, {2, 3}, {4, 5}, {6};
Step 5: {1, 7}, {2, 3, 6}, {4, 5};
Step 6: {1, 7}, {2, 3, 4, 5, 6};
Step 7: {1, 2, 3, 4, 5, 6, 7}.
Algorithm
1. Place each data point into its own singleton group.
2. Repeat: iteratively merge the two closest groups.
3. Until: all the data are merged into a single cluster.
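A minimal R sketch of this procedure (not from the slides; the toy data are invented), using the built-in hclust function, which carries out exactly this agglomerative loop:

set.seed(1)
x <- matrix(runif(14), ncol = 2)   # seven points in two dimensions

# hclust starts from singletons and repeatedly merges the two
# closest groups until all points sit in a single cluster
hc <- hclust(dist(x), method = "complete")

hc$merge   # one row per iteration: which two groups were merged
plot(hc)   # the full sequence of merges, drawn as a tree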
Example

(Figures: a worked example of the agglomerative algorithm, one slide per merge, shown over Iterations 2 through 24.)
Clustering
Suppose you are using the above algorithm to cluster the data points into groups.

- How do you know when to stop?
- How should we compare the data points?
Let’s investigate this further!
Agglomerative clustering
- Each level of the resulting tree is a segmentation of the data.
- The algorithm results in a sequence of groupings.
- It is up to the user to choose a "natural" clustering from this sequence.
Dendrogram

We can also represent the sequence of clustering assignments as a dendrogram:
(Figure: the seven-point example, Dimension 1 versus Dimension 2, alongside its dendrogram; the leaves are ordered 1, 7, 4, 5, 6, 2, 3, with merge heights between roughly 0.1 and 0.7 on the Height axis.)
Note that cutting the dendrogram horizontally partitions the data points into clusters.
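In R, this horizontal cut is what cutree computes; a minimal sketch, reusing the hc object from the earlier toy example:

# Cut the tree at height 0.5: all merges above that height are undone
cutree(hc, h = 0.5)

# Equivalently, ask for a fixed number of clusters, say three
cutree(hc, k = 3)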
Dendrogram

- Agglomerative clustering is monotonic: the similarity between merged clusters is monotone decreasing with the level of the merge.
- Dendrogram: plot each merge at the (negative) similarity between the two merged groups.
- This provides an interpretable visualization of the algorithm and data.
- It is a useful summarization tool, which is part of why hierarchical clustering is popular.
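This monotonicity is easy to check in R, since an hclust fit stores its merge levels in the height component; a small sketch on simulated data:

set.seed(1)
x <- matrix(rnorm(40), ncol = 2)
hc <- hclust(dist(x), method = "average")

# For a monotone linkage the merge heights never decrease,
# so the dendrogram has no inversions
all(diff(hc$height) >= 0)   # TRUE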
Group similarity

Given a distance similarity measure (say, Euclidean) between points, the user has many choices on how to define intergroup similarity.

1. Single linkage: the similarity of the closest pair

   d_SL(G, H) = min_{i ∈ G, j ∈ H} d_ij

2. Complete linkage: the similarity of the furthest pair

   d_CL(G, H) = max_{i ∈ G, j ∈ H} d_ij

3. Group average: the average similarity between groups

   d_GA(G, H) = (1 / (N_G N_H)) ∑_{i ∈ G} ∑_{j ∈ H} d_ij
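All three choices are available through the method argument of R's hclust; a minimal sketch on invented data:

set.seed(1)
x <- matrix(rnorm(40), ncol = 2)
d <- dist(x)   # pairwise Euclidean distances d_ij

hc.single   <- hclust(d, method = "single")    # closest pair
hc.complete <- hclust(d, method = "complete")  # furthest pair
hc.average  <- hclust(d, method = "average")   # group average

# Compare the resulting trees side by side
par(mfrow = c(1, 3))
plot(hc.single); plot(hc.complete); plot(hc.average)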
Single Linkage

In single linkage (i.e., nearest-neighbor linkage), the dissimilarity between G, H is the smallest dissimilarity between two points in opposite groups:

   d_single(G, H) = min_{i ∈ G, j ∈ H} d_ij

Example (dissimilarities d_ij are distances, groups are marked by colors): the single linkage score d_single(G, H) is the distance of the closest pair.
Single Linkage Example

Here n = 60, X_i ∈ R^2, d_ij = ‖X_i − X_j‖_2. Cutting the tree at h = 0.9 gives the clustering assignments marked by colors.
(Figure: the 60 points colored by cluster, alongside the single linkage dendrogram cut at height 0.9.)
Cut interpretation: for each point X_i, there is another point X_j in its cluster with d_ij ≤ 0.9.
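A sketch of this cut in R (simulated data standing in for the slide's 60 points):

set.seed(1)
x <- matrix(rnorm(120), ncol = 2)   # n = 60 points in R^2
hc.single <- hclust(dist(x), method = "single")

# Keep only merges at height <= 0.9; each resulting cluster is a
# chain of points linked by distances no larger than 0.9
clusters <- cutree(hc.single, h = 0.9)
plot(x, col = clusters, pch = 19)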
Complete Linkage

In complete linkage (i.e., furthest-neighbor linkage), the dissimilarity between G, H is the largest dissimilarity between two points in opposite groups:

   d_complete(G, H) = max_{i ∈ G, j ∈ H} d_ij

Example (dissimilarities d_ij are distances, groups are marked by colors): the complete linkage score d_complete(G, H) is the distance of the furthest pair.
Complete Linkage Example

Same data as before. Cutting the tree at h = 5 gives the clustering assignments marked by colors.
(Figure: the same 60 points colored by cluster, alongside the complete linkage dendrogram cut at height 5.)
Cut interpretation: for each point X_i, every other point X_j in its cluster satisfies d_ij ≤ 5.
Average Linkage

In average linkage, the dissimilarity between G, H is the average dissimilarity over all points in opposite groups:

   d_average(G, H) = (1 / (n_G · n_H)) ∑_{i ∈ G, j ∈ H} d_ij

Example (dissimilarities d_ij are distances, groups are marked by colors): the average linkage score d_average(G, H) is the average distance across all pairs. (The plot only shows distances between the blue points and one red point.)
Average linkage example
Same data as before. Cutting the tree at h = 2.5 gives the clustering assignments marked by the colors.
(Figure: the same 60 points colored by cluster, alongside the average linkage dendrogram cut at height 2.5.)
Cut interpretation: there really isn’t a good one!
Properties of intergroup similarity
- Single linkage can produce "chaining," where a sequence of close observations in different groups causes early merges of those groups.
- Complete linkage has the opposite problem: it might not merge close groups because of outlier members that are far apart.
- Group average represents a natural compromise, but it depends on the scale of the similarities. Applying a monotone transformation to the similarities can change the results.
Things to consider
- Hierarchical clustering should be treated with caution.
- Different decisions about group similarities can lead to vastly different dendrograms.
- The algorithm imposes a hierarchical structure on the data, even data for which such structure is not appropriate.
Application on genomic data
- Unsupervised methods are often used in the analysis of genomic data.
- PCA and hierarchical clustering are very common tools. We will explore both on a genomic data set.
- We illustrate these methods on the NCI60 cancer cell line microarray data, which consists of 6,830 gene expression measurements on 64 cancer cell lines.
Application on genomic data
library(ISLR)
nci.labs <- NCI60$labs
nci.data <- NCI60$data

- Each cell line is labeled with a cancer type.
- We do not make use of the cancer types in performing PCA and clustering, as these are unsupervised techniques.
- After performing PCA and clustering, we will check to see the extent to which these cancer types agree with the results of these unsupervised techniques.
Exploring the data
dim(nci.data)

## [1]   64 6830

# cancer types for the cell lines
nci.labs[1:4]

## [1] "CNS"   "CNS"   "CNS"   "RENAL"

table(nci.labs)

## nci.labs
## BREAST CNS COLON K562A-repro K562B-repro LEUKEMIA
##      7   5     7           1           1        6
## MCF7A-repro MCF7D-repro MELANOMA NSCLC OVARIAN PROSTATE
##           1           1        8     9       6        2
## RENAL UNKNOWN
##     9       1
PCA
pr.out <- prcomp(nci.data, scale=TRUE)
PCA
We now plot the first few principal component score vectors, in order to visualize the data.

First, we create a simple function that assigns a distinct color to each element of a numeric vector. The function will be used to assign a color to each of the 64 cell lines, based on the cancer type to which it corresponds.
Simple color function
# Input: a vector
# Output: a vector of colors, one per element,
# with a distinct color for each unique value
Cols <- function(vec) {
  cols <- rainbow(length(unique(vec)))
  return(cols[as.numeric(as.factor(vec))])
}
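The plotting code itself is not shown on the slides; a likely version, along the lines of the ISL lab:

par(mfrow = c(1, 2))
plot(pr.out$x[, 1:2], col = Cols(nci.labs), pch = 19,
     xlab = "Z1", ylab = "Z2")
plot(pr.out$x[, c(1, 3)], col = Cols(nci.labs), pch = 19,
     xlab = "Z1", ylab = "Z3")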
PCA

(Figure: score plots of the cell lines on the first principal components, Z1 versus Z2 and Z1 versus Z3, colored by cancer type.)

On the whole, cell lines corresponding to a single cancer type do tend to have similar values on the first few PC score vectors. This indicates that cell lines from the same cancer type tend to have pretty similar gene expression levels.
Proportion of Variance Explained

summary(pr.out)

## Importance of components%s:
##                            PC1      PC2      PC3      PC4      PC5
## Standard deviation     27.8535 21.48136 19.82046 17.03256 15.97181
## Proportion of Variance  0.1136  0.06756  0.05752  0.04248  0.03735
## Cumulative Proportion   0.1136  0.18115  0.23867  0.28115  0.31850
##                            PC6      PC7      PC8      PC9     PC10
## Standard deviation    15.72108 14.47145 13.54427 13.14400 12.73860
## Proportion of Variance 0.03619  0.03066  0.02686  0.02529  0.02376
## Cumulative Proportion  0.35468  0.38534  0.41220  0.43750  0.46126
##                           PC11     PC12     PC13     PC14     PC15
## Standard deviation    12.68672 12.15769 11.83019 11.62554 11.43779
## Proportion of Variance 0.02357  0.02164  0.02049  0.01979  0.01915
## Cumulative Proportion  0.48482  0.50646  0.52695  0.54674  0.56590
##                           PC16     PC17     PC18     PC19    PC20
## Standard deviation    11.00051 10.65666 10.48880 10.43518 10.3219
## Proportion of Variance 0.01772  0.01663  0.01611  0.01594  0.0156
## Cumulative Proportion  0.58361  0.60024  0.61635  0.63229  0.6479
##                           PC21    PC22     PC23     PC24     PC25    PC26
## Standard deviation    10.14608 10.0544  9.90265  9.64766  9.50764 9.33253
## Proportion of Variance 0.01507  0.0148  0.01436  0.01363  0.01324 0.01275
## Cumulative Proportion  0.66296  0.6778  0.69212  0.70575  0.71899 0.73174
##                           PC27   PC28     PC29     PC30     PC31    PC32
## Standard deviation     9.27320 9.0900  8.98117  8.75003  8.59962 8.44738
## Proportion of Variance 0.01259 0.0121  0.01181  0.01121  0.01083 0.01045
## Cumulative Proportion  0.74433 0.7564  0.76824  0.77945  0.79027 0.80072
##                           PC33    PC34    PC35    PC36    PC37    PC38
## Standard deviation     8.37305 8.21579 8.15731 7.97465 7.90446 7.82127
## Proportion of Variance 0.01026 0.00988 0.00974 0.00931 0.00915 0.00896
## Cumulative Proportion  0.81099 0.82087 0.83061 0.83992 0.84907 0.85803
##                           PC39    PC40    PC41   PC42    PC43   PC44
## Standard deviation     7.72156 7.58603 7.45619 7.3444 7.10449 7.0131
## Proportion of Variance 0.00873 0.00843 0.00814 0.0079 0.00739 0.0072
## Cumulative Proportion  0.86676 0.87518 0.88332 0.8912 0.89861 0.9058
##                           PC45   PC46    PC47    PC48    PC49    PC50
## Standard deviation     6.95839 6.8663 6.80744 6.64763 6.61607 6.40793
## Proportion of Variance 0.00709 0.0069 0.00678 0.00647 0.00641 0.00601
## Cumulative Proportion  0.91290 0.9198 0.92659 0.93306 0.93947 0.94548
##                           PC51    PC52    PC53    PC54    PC55    PC56
## Standard deviation     6.21984 6.20326 6.06706 5.91805 5.91233 5.73539
## Proportion of Variance 0.00566 0.00563 0.00539 0.00513 0.00512 0.00482
## Cumulative Proportion  0.95114 0.95678 0.96216 0.96729 0.97241 0.97723
##                           PC57   PC58    PC59    PC60    PC61    PC62
## Standard deviation     5.47261 5.2921 5.02117 4.68398 4.17567 4.08212
## Proportion of Variance 0.00438 0.0041 0.00369 0.00321 0.00255 0.00244
## Cumulative Proportion  0.98161 0.9857 0.98940 0.99262 0.99517 0.99761
##                           PC63      PC64
## Standard deviation     4.04124 2.148e-14
## Proportion of Variance 0.00239 0.000e+00
## Cumulative Proportion  1.00000 1.000e+00
Proportion of Variance Explained

plot(pr.out)

(Figure: the default plot of pr.out, a bar plot of the variances of the leading principal components.)
PCA

(Figure: the proportion of variance explained (PVE) of the principal components of the NCI60 cancer cell line microarray data set. Left: the PVE of each principal component. Right: the cumulative PVE of the principal components. Together, all principal components explain 100% of the variance.)
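The code for these two panels is not shown on the slides; a likely version, following the ISL lab:

pve <- 100 * pr.out$sdev^2 / sum(pr.out$sdev^2)

par(mfrow = c(1, 2))
plot(pve, type = "o", ylab = "PVE",
     xlab = "Principal Component", col = "blue")
plot(cumsum(pve), type = "o", ylab = "Cumulative PVE",
     xlab = "Principal Component", col = "brown3")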
Conclusions from the Scree Plot
- We see that together, the first seven principal components explain around 40% of the variance in the data.
- This is not a huge amount of the variance.
- Looking at the scree plot, we see that while each of the first seven principal components explains a substantial amount of variance, there is a marked decrease in the variance explained by further principal components.
- That is, there is an elbow in the plot after approximately the seventh principal component. This suggests that there may be little benefit to examining more than seven or so principal components (though even examining seven principal components may be difficult).
Hierarchical Clustering to NCI60 Data
- We now proceed to hierarchically cluster the cell lines in the NCI60 data, with the goal of finding out whether or not the observations cluster into distinct types of cancer.
- To begin, we standardize the variables to have mean zero and standard deviation one.
- As mentioned earlier, this step is optional and should be performed only if we want each gene to be on the same scale.

sd.data=scale(nci.data)
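The next three slides show the resulting dendrograms; a sketch of the plotting code that likely produced them, following the ISL lab:

data.dist <- dist(sd.data)

par(mfrow = c(1, 3))
plot(hclust(data.dist), labels = nci.labs,
     main = "Complete Linkage", xlab = "", sub = "", ylab = "")
plot(hclust(data.dist, method = "average"), labels = nci.labs,
     main = "Average Linkage", xlab = "", sub = "", ylab = "")
plot(hclust(data.dist, method = "single"), labels = nci.labs,
     main = "Single Linkage", xlab = "", sub = "", ylab = "")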
Hierarchical Clustering to NCI60 Data

(Figure: complete linkage dendrogram of the 64 cell lines, with leaves labeled by cancer type.)
Hierarchical Clustering to NCI60 Data

(Figure: average linkage dendrogram of the 64 cell lines, with leaves labeled by cancer type.)
Hierarchical Clustering to NCI60 Data

(Figure: single linkage dendrogram of the 64 cell lines, with leaves labeled by cancer type.)

We see that the choice of linkage certainly does affect the results obtained.
Hierarchical Clustering to NCI60 Data
- Typically, single linkage will tend to yield trailing clusters: very large clusters onto which individual observations attach one by one.
- On the other hand, complete and average linkage tend to yield more balanced, attractive clusters.
- For this reason, complete and average linkage are generally preferred to single linkage.
Complete linkage
- We will use complete linkage hierarchical clustering for the analysis that follows.
- We can cut the dendrogram at the height that will yield a particular number of clusters, say four.

hc.out=hclust(dist(sd.data))
hc.clusters=cutree(hc.out,4)
table(hc.clusters,nci.labs)

## nci.labs
## hc.clusters BREAST CNS COLON K562A-repro K562B-repro LEUKEMIA MCF7A-repro
##           1      2   3     2           0           0        0           0
##           2      3   2     0           0           0        0           0
##           3      0   0     0           1           1        6           0
##           4      2   0     5           0           0        0           1
## nci.labs
## hc.clusters MCF7D-repro MELANOMA NSCLC OVARIAN PROSTATE RENAL UNKNOWN
##           1           0        8     8       6        2     8       1
##           2           0        0     1       0        0     1       0
##           3           0        0     0       0        0     0       0
##           4           1        0     0       0        0     0       0
Complete linkage

All the leukemia cell lines fall in cluster 3, while the breast cancer cell lines are spread out over three different clusters. We can plot the cut on the dendrogram that produces these four clusters. This is the height that results in four clusters. (It is easy to verify that the resulting clusters are the same as the ones we obtained using cutree(hc.out,4).)
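The plotting call is not on the slide; a sketch following the ISL lab, where a cut height of 139 is the value that yields four clusters:

par(mfrow = c(1, 1))
plot(hc.out, labels = nci.labs)
abline(h = 139, col = "red")   # horizontal cut producing four clusters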
(Figure: the complete linkage cluster dendrogram of dist(sd.data), with the cut producing four clusters drawn as a horizontal line.)
K-means versus complete linkage?
How do these NCI60 hierarchical clustering results compare to what we get if we perform K-means clustering with K = 4?

set.seed(2)
km.out <- kmeans(sd.data, 4, nstart=20)
km.clusters <- km.out$cluster
table(km.clusters, hc.clusters)

## hc.clusters
## km.clusters  1  2  3  4
##           1 11  0  0  9
##           2  0  0  8  0
##           3  9  0  0  0
##           4 20  7  0  0
K-means versus complete linkage?
- We see that the four clusters obtained using hierarchical clustering and K-means clustering are somewhat different.
- Cluster 2 in K-means clustering is identical to cluster 3 in hierarchical clustering.
- However, the other clusters differ: for instance, cluster 4 in K-means clustering contains a portion of the observations assigned to cluster 1 by hierarchical clustering, as well as all of the observations assigned to cluster 2 by hierarchical clustering.