Lecture 6
Statistical Lecture
─ Cluster Analysis
Cluster Analysis
• Grouping similar objects to produce a classification
• Useful when, a priori, the structure of the data is unknown
• Involves assessing the relative distances between points
Clustering Algorithms
• Partitioning :
Divide the data set into k clusters where k needs to be specified beforehand, e.g.
k-means.
Clustering Algorithms (cont.)
• Hierarchical :
– Agglomerative methods :
Start with the situation where each object forms its own little cluster, and then successively merge clusters until only one large cluster remains
– Divisive methods :
Start by considering the whole data set as one cluster, and then split up clusters until each object is separate
Caution
• Most users are interested in the main structure of their data, consisting of a few large clusters
• When forming larger clusters, agglomerative methods may make wrong decisions in their early steps; once a step is wrong, everything built on it is wrong
• For divisive methods, the larger clusters are determined first, so they are less likely to suffer from such early mistakes
Agglomerative Hierarchical Clustering Procedure
(1) Each observation begins in a cluster by itself
(2) The two closest clusters are merged to form a new cluster that replaces the two old clusters
(3) Repeat (2) until only one cluster is left
The various clustering methods differ in how the distance between two clusters is computed.
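As an illustration (not from the original slides), the whole agglomerative procedure is available in SciPy, where the method argument selects how the inter-cluster distance is computed; a minimal sketch on made-up toy data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)),   # toy data: two well-separated blobs
               rng.normal(5, 1, (20, 2))])

# 'method' picks the inter-cluster distance: 'single', 'complete', 'average',
# 'centroid', 'median', 'ward', or 'weighted' (McQuitty)
Z = linkage(X, method='average')                  # steps (1)-(3) above, recorded as a tree
labels = fcluster(Z, t=2, criterion='maxclust')   # cut the tree into 2 clusters
print(labels)
```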
Remarks
• For coordinate data, variables with large variances tend to have more effect on the resulting clusters than those with small variances
• Scaling or transforming the variables might be needed
• Standardization (standardizing the variables to mean 0 and standard deviation 1) or principal components can be useful but is not always appropriate
• Outliers should be removed before analysis
Remarks (cont.)
• Nonlinear transformations of the variables may change the number of population clusters and should therefore be approached with caution
• For most applications, the variables should be transformed so that equal differences are of equal practical importance
• An interval scale of measurement is required if raw data are used as input. Ordinal or ranked coordinate data are generally not appropriate
Notation
n number of observations
v number of variables if data are coordinates
G number of clusters at any given level of the hierarchy
xi ith observation
Ck kth cluster, subset of {1, 2, …, n}
Nk number of observations in Ck
Notation (cont.)
\bar{x} sample mean vector
\bar{x}_k mean vector for cluster C_k
||x|| Euclidean length of the vector x, that is, the square root of the sum of the squares of the elements of x
T = \sum_{i=1}^{n} ||x_i - \bar{x}||^2, the total sum of squares
W_k = \sum_{i \in C_k} ||x_i - \bar{x}_k||^2, the within-cluster sum of squares for C_k
Notation (cont.)
P_G = \sum_j W_j, where the summation is over the G clusters at the Gth level of the hierarchy
B_{kl} = W_m - W_k - W_l, if C_m = C_k \cup C_l
d(x, y) any distance or dissimilarity measure between observations or vectors x and y
D_{kl} any distance or dissimilarity measure between clusters C_k and C_l
Clustering Method ─ Average Linkage
The distance between two clusters is defined by
D_{kl} = \frac{1}{N_k N_l} \sum_{i \in C_k} \sum_{j \in C_l} d(x_i, x_j)
If d(x, y) = ||x - y||^2, then
D_{kl} = ||\bar{x}_k - \bar{x}_l||^2 + \frac{W_k}{N_k} + \frac{W_l}{N_l}
The combinatorial formula is
D_{jm} = \frac{N_k D_{jk} + N_l D_{jl}}{N_m}, if C_m = C_k \cup C_l
Average Linkage
• The distance between clusters is the average distance between pairs of observations, one in each cluster
• It tends to join clusters with small variance and is slightly biased toward producing clusters with the same variance
Centroid Method
The distance between two clusters is defined by
D_{kl} = ||\bar{x}_k - \bar{x}_l||^2
If d(x, y) = ||x - y||^2, then the combinatorial formula is
D_{jm} = \frac{N_k D_{jk} + N_l D_{jl}}{N_m} - \frac{N_k N_l D_{kl}}{N_m^2}
Centroid Method
• The distance between two clusters is defined as the squared Euclidean distance between their centroids or means
• It is more robust to outliers than most other hierarchical methods but in other respects may not perform as well as Ward’s method or average linkage
Complete Linkage
The distance between two clusters is defined by
D_{kl} = \max_{i \in C_k} \max_{j \in C_l} d(x_i, x_j)
The combinatorial formula is
D_{jm} = \max(D_{jk}, D_{jl})
Complete Linkage
• The distance between two clusters is the maximum distance between an observation in one cluster and an observation in the other cluster
• It is strongly biased toward producing clusters with roughly equal diameters and can be severely distorted by moderate outliers
Single Linkage
The distance between two clusters is defined by
D_{kl} = \min_{i \in C_k} \min_{j \in C_l} d(x_i, x_j)
The combinatorial formula is
D_{jm} = \min(D_{jk}, D_{jl})
Single Linkage
• The distance between two clusters is the minimum distance between an observation in one cluster and an observation in the other cluster
• It sacrifices performance in the recovery of compact clusters in return for the ability to detect elongated and irregular clusters
Ward’s Minimum-Variance Method
The distance between two clusters is defined by
D_{kl} = B_{kl} = \frac{||\bar{x}_k - \bar{x}_l||^2}{\frac{1}{N_k} + \frac{1}{N_l}}
If d(x, y) = ||x - y||^2, then the combinatorial formula is
D_{jm} = \frac{(N_j + N_k) D_{jk} + (N_j + N_l) D_{jl} - N_j D_{kl}}{N_j + N_m}
Ward’s Minimum-Variance Method
• The distance between two clusters is the ANOVA sum of squares between the two clusters added up over all the variables
• It tends to join clusters with a small number of observations
• It is strongly biased toward producing clusters with roughly the same number of observations
• It is also very sensitive to outliers
Assumptions for WMVM
• Multivariate normal mixture
• Equal spherical covariance matrices
• Equal sampling probabilities
Remarks
• Single linkage tends to lead to the formation of long straggly clusters
• Average, complete linkage and Ward’s method often find spherical clusters even when the data appear to contain clusters of other shapes
McQuitty’s Similarity Analysis
The combinatorial formula is
D_{jm} = \frac{D_{jk} + D_{jl}}{2}

Median Method
If d(x, y) = ||x - y||^2, then the combinatorial formula is
D_{jm} = \frac{D_{jk} + D_{jl}}{2} - \frac{D_{kl}}{4}
Kth-nearest Neighbor Method
• Prespecify k
• Let rk(x) be the distance from point x to the kth nearest observation
• Consider a closed sphere centered at x with radius rk(x), say Sk(x)
Kth-nearest Neighbor Method
• The estimated density at x is defined by
\hat{f}(x) = \frac{\#\text{ of observations within } S_k(x)}{n \cdot \text{Volume of } S_k(x)}
• For any two observations x_i and x_j,
d^*(x_i, x_j) = \begin{cases} \frac{1}{2}\left(\frac{1}{\hat{f}(x_i)} + \frac{1}{\hat{f}(x_j)}\right) & \text{if } d(x_i, x_j) \le \max(r_k(x_i), r_k(x_j)) \\ \infty & \text{otherwise} \end{cases}
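A hedged sketch of this construction (my reading of the slides, with the sphere volume computed by the standard v-dimensional formula; the function name is mine):

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import gamma

def knn_density_dissimilarity(X, k):
    n, v = X.shape
    dist, _ = cKDTree(X).query(X, k=k + 1)    # k+1: the query point itself comes back first
    r = dist[:, -1]                           # r_k(x): distance to the kth nearest neighbor
    vol = np.pi ** (v / 2) / gamma(v / 2 + 1) * r ** v   # volume of the sphere S_k(x)
    f = (k + 1) / (n * vol)                   # observations within S_k(x), incl. x itself
    d = np.full((n, n), np.inf)               # d*(x_i, x_j) = infinity otherwise
    for i in range(n):
        for j in range(n):
            if np.linalg.norm(X[i] - X[j]) <= max(r[i], r[j]):
                d[i, j] = 0.5 * (1 / f[i] + 1 / f[j])
    return d
```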
K-Means Algorithm
• It is intended for use with large data sets, from approximately 100 to 100,000 observations
• With small data sets, the results may be highly sensitive to the order of the observations in the data set
• It combines an effective method for finding initial clusters with a standard iterative algorithm for minimizing the sum of squared distances from the cluster means
K-Means Algorithm
• Specify the number of clusters, say k
• A set of k points called cluster seeds is selected as a first guess of the means of the k clusters
• Each observation is assigned to the nearest seed to form temporary clusters
• The seeds are then replaced by the means of the temporary clusters
• The process is repeated until no further changes occur in the clusters
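The loop just described fits in a few lines of NumPy; a minimal sketch (the seeding here is simply the first k observations, whereas the seed rules on the next slides are richer, and empty clusters are not handled):

```python
import numpy as np

def kmeans(X, k, max_iter=100):
    seeds = X[:k].copy()                              # naive first guess of the k means
    for _ in range(max_iter):
        # assign each observation to the nearest seed (temporary clusters)
        d = np.linalg.norm(X[:, None, :] - seeds[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # replace the seeds by the means of the temporary clusters
        new_seeds = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_seeds, seeds):             # no further change: done
            break
        seeds = new_seeds
    return labels, seeds
```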
Cluster Seeds
• Select the first complete (no missing values) observation as the first seed
• The next complete observation that is separated from the first seed by at least the prespecified distance becomes the second seed
• Later observations are selected as new seeds if they are separated from all previous seeds by at least the radius, as long as the maximum number of seeds is not exceeded
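A sketch of just this selection rule (the replacement tests on the following slides are omitted; the function name is mine, not SAS's):

```python
import numpy as np

def select_seeds(X, max_seeds, radius):
    seeds = []
    for x in X:
        if np.isnan(x).any():        # only complete observations qualify
            continue
        if all(np.linalg.norm(x - s) >= radius for s in seeds):
            seeds.append(x)          # far enough from every previous seed
        if len(seeds) == max_seeds:
            break
    return np.array(seeds)
```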
Cluster Seeds
If an observation is complete but fails to qualify as a new seed, two tests can be made to see if the observation can replace one of the old seeds.
Cluster Seeds (cont.)
• An old seed is replaced if the distance between the observation and the closest seed is greater than the minimum distance between seeds. The seed that is replaced is selected from the two seeds that are closest to each other. The seed that is replaced is the one of these two with the shortest distance to the closest of the remaining seed when the other seed is replaced by the current observation
Cluster Seeds (cont.)
• If the observation fails the first test for seed replacement, a second test is made. The observation replaces the nearest seed if the smallest distance from the observation to all seeds other than the nearest one is greater than the shortest distance from the nearest seed to all other seeds. If this test is failed, go on to the next observation.
Dissimilarity Matrices
n × n dissimilarity matrix
\begin{pmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{pmatrix}
where d(i, j) = d(j, i) measures the "difference" or dissimilarity between the objects i and j.
Dissimilarity Matrices
d usually satisfies
• d(i, i) = 0
• d(i, j) ≥ 0
• d(i, j) = d(j, i)
Dissimilarity
Interval-scaled variables: continuous measurements on a (roughly) linear scale (temperature, height, weight, etc.)
• d(i, j) = \sqrt{\sum_{f=1}^{v} (x_{if} - x_{jf})^2} (Euclidean distance)
• d(i, j) = \sum_{f=1}^{v} |x_{if} - x_{jf}| (Manhattan distance)
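For instance, SciPy computes both dissimilarity matrices directly (an illustrative sketch, not part of the slides):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[1.0, 2.0],
              [3.0, 5.0],
              [0.0, 1.0]])
# pdist returns the condensed vector of pairwise dissimilarities;
# squareform expands it to the full symmetric n-by-n matrix
print(squareform(pdist(X, metric='euclidean')))
print(squareform(pdist(X, metric='cityblock')))   # Manhattan distance
```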
Dissimilarity(cont.)
• The choice of measurement units strongly affects the resulting clustering
• The variable with the largest dispersion will have the largest impact on the clustering
• If all variables are considered equally important, the data need to be standardized first
Standardization
Z_{if} = \frac{x_{if} - m_f}{s_f}
• Mean absolute deviation (robust):
s_f = \frac{1}{n} \sum_{i=1}^{n} |x_{if} - m_f|, where m_f = \frac{1}{n} \sum_{i=1}^{n} x_{if}
• Median absolute deviation (robust):
s_f = \operatorname{median}_i |x_{if} - m_f|, where m_f = \operatorname{median}_i \{x_{if}\}
• Usual standard deviation:
s_f = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_{if} - m_f)^2}, where m_f = \frac{1}{n} \sum_{i=1}^{n} x_{if}
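The three choices of m_f and s_f above as a hedged sketch, applied column-wise to a data matrix:

```python
import numpy as np

def standardize(X, method='std'):
    if method == 'mean_abs':              # robust: mean absolute deviation
        m = X.mean(axis=0)
        s = np.abs(X - m).mean(axis=0)
    elif method == 'median_abs':          # robust: median absolute deviation
        m = np.median(X, axis=0)
        s = np.median(np.abs(X - m), axis=0)
    else:                                 # usual standard deviation
        m = X.mean(axis=0)
        s = X.std(axis=0, ddof=1)         # ddof=1 gives the 1/(n-1) version
    return (X - m) / s                    # Z_if = (x_if - m_f) / s_f
```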
Continuous Ordinal Variables
These are continuous measurements on an unknown scale, or where only the ordering is known but not the actual magnitude.
• Replace the x_{if} by their ranks r_{if} \in \{1, …, M_f\}
• Transform the scale to [0, 1] as follows:
Z_{if} = \frac{r_{if} - 1}{M_f - 1}
• Compute the dissimilarities as for interval-scaled variables
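A small worked sketch of the rank transform (tie handling via average ranks is my choice here):

```python
import numpy as np
from scipy.stats import rankdata

x = np.array([3.1, 9.9, 4.7, 4.7, 0.2])
r = rankdata(x)           # ranks r_if (ties receive average ranks)
M = r.max()               # M_f: the highest rank
z = (r - 1) / (M - 1)     # Z_if = (r_if - 1) / (M_f - 1), now in [0, 1]
print(z)
```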
Ratio-Scaled Variables
These are positive continuous measurements on a nonlinear scale, such as an exponential scale. One example would be the growth of a bacterial population (say, with a growth function Ae^{Bt}). Such variables can be treated in three ways:
• Simply as interval-scaled variables, though this is not recommended as it can distort the measurement scale
• As continuous ordinal data
• By first transforming the data (perhaps by taking logarithms), and then treating the results as interval-scaled variables
Discrete Ordinal Variables
A variable of this type has M possible values
(scores) which are ordered.
The dissimilarities are computed in the same
way as for continuous ordinal variables.
Nominal Variables
• Such a variable has M possible values, which are not ordered.
• The dissimilarity between objects i and j is usually defined as
d(i, j) = \frac{\#\text{ of variables taking different values for } i \text{ and } j}{\text{total number of variables}}
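A one-line sketch of this mismatch proportion (the function name is mine):

```python
import numpy as np

def nominal_dissimilarity(xi, xj):
    xi, xj = np.asarray(xi), np.asarray(xj)
    return np.mean(xi != xj)   # (# of variables with different values) / (total # of variables)

print(nominal_dissimilarity(['red', 'small', 'round'],
                            ['red', 'large', 'square']))   # 2 of 3 differ -> 0.666...
```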
Symmetric Binary Variables
Two possible values, coded 0 and 1, which are equally important (such as male and female). Consider the contingency table of the objects i and j:

i \ j      1      0
1          a      b
0          c      d

d(i, j) = \frac{b + c}{a + b + c + d}
Asymmetric Binary Variables
Two possible values, one of which carries
more importance than the other.
The most meaningful outcome is coded as 1,
and the less meaningful outcome as 0.
Typically, 1 stands for the presence of a certain attribute (e.g., a particular disease), and 0 for its absence.
Asymmetric Binary Variables
i \ j      1      0
1          a      b
0          c      d

d(i, j) = \frac{b + c}{a + b + c}
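Both binary dissimilarities from the 2×2 table, as a hedged sketch:

```python
import numpy as np

def binary_dissimilarity(xi, xj, symmetric=True):
    xi, xj = np.asarray(xi), np.asarray(xj)
    a = np.sum((xi == 1) & (xj == 1))   # 1-1 matches
    b = np.sum((xi == 1) & (xj == 0))
    c = np.sum((xi == 0) & (xj == 1))
    d = np.sum((xi == 0) & (xj == 0))   # 0-0 matches
    if symmetric:
        return (b + c) / (a + b + c + d)
    return (b + c) / (a + b + c)        # asymmetric: 0-0 matches carry no information

print(binary_dissimilarity([1, 0, 1, 1], [1, 1, 0, 1], symmetric=False))   # (1+1)/(2+1+1) = 0.5
```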
Cluster Analysis of Flying Mileages Between 10 American Cities
ATLANTA             0
CHICAGO           587     0
DENVER           1212   920     0
HOUSTON           701   940   879     0
LOS ANGELES      1936  1745   831  1374     0
MIAMI             604  1188  1726   968  2339     0
NEW YORK          748   713  1631  1420  2451  1092     0
SAN FRANCISCO    2139  1858   949  1645   347  2594  2571     0
SEATTLE          2182  1737  1021  1891   959  2734  2408   678     0
WASHINGTON D.C.   543   597  1494  1220  2300   923   205  2442  2329     0
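The outputs on the following slides were produced by SAS PROC CLUSTER. A rough SciPy analogue of the average-linkage run is sketched below (SciPy reports raw merge distances while SAS normalizes them by the RMS distance, so the heights differ by a constant factor):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

cities = ['ATLANTA', 'CHICAGO', 'DENVER', 'HOUSTON', 'LOS ANGELES',
          'MIAMI', 'NEW YORK', 'SAN FRANCISCO', 'SEATTLE', 'WASHINGTON D.C.']
rows = [[587], [1212, 920], [701, 940, 879], [1936, 1745, 831, 1374],
        [604, 1188, 1726, 968, 2339], [748, 713, 1631, 1420, 2451, 1092],
        [2139, 1858, 949, 1645, 347, 2594, 2571],
        [2182, 1737, 1021, 1891, 959, 2734, 2408, 678],
        [543, 597, 1494, 1220, 2300, 923, 205, 2442, 2329]]
D = np.zeros((10, 10))
for i, r in enumerate(rows, start=1):
    D[i, :len(r)] = r
D += D.T                                        # full symmetric mileage matrix
Z = linkage(squareform(D), method='average')    # condensed distances in, merge tree out
print(Z[:, :3])   # each row: the two clusters joined and their merge distance
```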
The CLUSTER Procedure
Average Linkage Cluster Analysis
Cluster History
NCL  Clusters Joined                 FREQ   PSF   PST2   Norm RMS Dist   Tie
9 NEW YORK WASHINGTON D.C. 2 66.7 . 0.1297
8 LOS ANGELES SAN FRANCISCO 2 39.2 . 0.2196
7 ATLANTA CHICAGO 2 21.7 . 0.3715
6 CL7 CL9 4 14.5 3.4 0.4149
5 CL8 SEATTLE 3 12.4 7.3 0.5255
4 DENVER HOUSTON 2 13.9 . 0.5562
3 CL6 MIAMI 5 15.5 3.8 0.6185
2 CL3 CL4 7 16.0 5.3 0.8005
1 CL2 CL5 10 . 16.0 1.2967
Root-Mean-Square Distance Between Observations = 1580.242
Average Linkage Cluster Analysis
The CLUSTER Procedure
Centroid Hierarchical Cluster Analysis
Cluster History
NCL  Clusters Joined                 FREQ   PSF   PST2   Norm Cent Dist   Tie
9 NEW YORK WASHINGTON D.C. 2 66.7 . 0.1297
8 LOS ANGELES SAN FRANCISCO 2 39.2 . 0.2196
7 ATLANTA CHICAGO 2 21.7 . 0.3715
6 CL7 CL9 4 14.5 3.4 0.3652
5 CL8 SEATTLE 3 12.4 7.3 0.5139
4 DENVER CL5 4 12.4 2.1 0.5337
3 CL6 MIAMI 5 14.2 3.8 0.5743
2 CL3 HOUSTON 6 22.1 2.6 0.6091
1 CL2 CL4 10 . 22.1 1.173
Root-Mean-Square Distance Between Observations = 1580.242
Centroid Hierarchical Cluster Analysis
The CLUSTER Procedure
Single Linkage Cluster Analysis
Cluster History
NCL  Clusters Joined                 FREQ   Norm Min Dist   Tie
9 NEW YORK WASHINGTON D.C. 2 0.1447
8 LOS ANGELES SAN FRANCISCO 2 0.2449
7 ATLANTA CL9 3 0.3832
6 CL7 CHICAGO 4 0.4142
5 CL6 MIAMI 5 0.4262
4 CL8 SEATTLE 3 0.4784
3 CL5 HOUSTON 6 0.4947
2 DENVER CL4 4 0.5864
1 CL3 CL2 10 0.6203
Mean Distance Between Observations = 1417.133
Single Linkage Cluster Analysis
The CLUSTER Procedure
Ward's Minimum Variance Cluster Analysis
Cluster History
NCL  Clusters Joined                 FREQ   SPRSQ   RSQ    PSF   PST2   Tie
9  NEW YORK WASHINGTON D.C.  2  0.0019  .998  66.7  .
8  LOS ANGELES SAN FRANCISCO  2  0.0054  .993  39.2  .
7 ATLANTA CHICAGO 2 0.0153 .977 21.7 .
6 CL7 CL9 4 0.0296 .948 14.5 3.4
5 DENVER HOUSTON 2 0.0344 .913 13.2 .
4 CL8 SEATTLE 3 0.0391 .874 13.9 7.3
3 CL6 MIAMI 5 0.0586 .816 15.5 3.8
2 CL3 CL5 7 0.1488 .667 16.0 5.3
1 CL2 CL4 10 0.6669 .000 . 16.0
Root-Mean-Square Distance Between Observations = 1580.242
Ward's Minimum Variance Cluster Analysis
Fisher (1936) Iris Data
Initial Seeds
Cluster SepalLength SepalWidth PetalLength PetalWidth
1 43.00000000 30.00000000 11.00000000 1.00000000
2 77.00000000 26.00000000 69.00000000 23.00000000
Minimum Distance Between Initial Seeds = 70.85196
The FASTCLUS Procedure
Replace=FULL Radius=0 Maxclusters=2 Maxiter=10 Converge=0.02
Fisher (1936) Iris Data
The FASTCLUS Procedure
Replace=FULL Radius=0 Maxclusters=2 Maxiter=10 Converge=0.02
Iteration History
Iteration   Criterion   Relative Change in Cluster Seeds
                             1         2
1 11.0638 0.1904 0.3163
2 5.3780 0.0596 0.0264
3 5.0718 0.0174 0.00766
Convergence criterion is satisfied.
Criterion Based on Final Seeds = 5.0417
Fisher (1936) Iris Data
The FASTCLUS Procedure
Cluster Summary
Cluster   Frequency   RMS Std Deviation   Maximum Distance from Seed to Observation   Radius Exceeded   Nearest Cluster   Distance Between Cluster Centroids
1 53 3.7050 21.1621 2 39.2879
2 97 5.6779 24.6430 1 39.2879
Fisher (1936) Iris Data
The FASTCLUS Procedure
Statistics for Variables
Variable Total STD Within STD R-Square RSQ/(1-RSQ)
SepalLength 8.28066 5.49313 0.562896 1.287784
SepalWidth 4.35866 3.70393 0.282710 0.394137
PetalLength 17.65298 6.80331 0.852470 5.778291
PetalWidth 7.62238 3.57200 0.781868 3.584390
OVER-ALL 10.69224 5.07291 0.776410 3.472463
Pseudo F Statistic = 513.92
Approximate Expected Over-All R-Squared = 0.51539
Cubic Clustering Criterion = 14.806
WARNING: The two above values are invalid for correlated variables.

F = \frac{R^2/(c-1)}{(1-R^2)/(n-c)}, where c = number of clusters and n = number of observations
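A quick check of this formula against the printed output, using the overall R-square above:

```python
r2, c, n = 0.776410, 2, 150        # values from the FASTCLUS output above
F = (r2 / (c - 1)) / ((1 - r2) / (n - c))
print(F)   # about 513.9, matching "Pseudo F Statistic = 513.92" up to rounding of R-square
```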
Fisher (1936) Iris Data
The FASTCLUS Procedure
Replace=FULL Radius=0 Maxclusters=2 Maxiter=10 Converge=0.02
Cluster Means
Cluster SepalLength SepalWidth PetalLength PetalWidth
1 50.05660377 33.69811321 15.60377358 2.90566038
2 63.01030928 28.86597938 49.58762887 16.95876289
Cluster Standard Deviations
Cluster SepalLength SepalWidth PetalLength PetalWidth
1 3.427350930 4.396611045 4.404279486 2.105525249
2 6.336887455 3.267991438 7.800577673 4.155612484
Fisher (1936) Iris Data
The FREQ Procedure
Table of CLUSTER by Species
(cell entries: Frequency / Percent / Row Pct / Col Pct)

CLUSTER           Setosa   Versicolor   Virginica    Total
1   Frequency         50            3           0       53
    Percent        33.33         2.00        0.00    35.33
    Row Pct        94.34         5.66        0.00
    Col Pct       100.00         6.00        0.00
2   Frequency          0           47          50       97
    Percent         0.00        31.33       33.33    64.67
    Row Pct         0.00        48.45       51.55
    Col Pct         0.00        94.00      100.00
Total                 50           50          50      150
                   33.33        33.33       33.33   100.00
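For comparison (not from the slides), a loose scikit-learn analogue of this FASTCLUS run; the seeding differs (k-means++ rather than FASTCLUS's seed rules), so cluster numbering and exact counts may not match:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(iris.data)
# cross-tabulate the k-means labels against the true species
print(pd.crosstab(km.labels_,
                  pd.Categorical.from_codes(iris.target, iris.target_names)))
```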
Fisher (1936) Iris Data
The FASTCLUS Procedure
Replace=FULL Radius=0 Maxclusters=3 Maxiter=10 Converge=0.02
Initial Seeds
Cluster   SepalLength   SepalWidth   PetalLength   PetalWidth
1  58.00000000  40.00000000  12.00000000  2.00000000
2 77.00000000 38.00000000 67.00000000 22.00000000
3 49.00000000 25.00000000 45.00000000 17.00000000
Minimum Distance Between Initial Seeds = 38.23611
Fisher (1936) Iris Data
The FASTCLUS Procedure
Replace=FULL Radius=0 Maxclusters=3 Maxiter=10 Converge=0.02
Iteration History
Iteration   Criterion   Relative Change in Cluster Seeds
                             1         2         3
1 6.7591 0.2652 0.3205 0.2985
2 3.7097 0 0.0459 0.0317
3 3.6427 0 0.0182 0.0124
Convergence criterion is satisfied.
Criterion Based on Final Seeds = 3.6289
Fisher (1936) Iris Data
Cluster Summary
Cluster   Frequency   RMS Std Deviation   Maximum Distance from Seed to Observation   Radius Exceeded   Nearest Cluster   Distance Between Cluster Centroids
1 50 2.7803 12.4803 3 33.5693
2 38 4.0168 14.9736 3 17.9718
3 62 4.0398 16.9272 2 17.9718
Fisher (1936) Iris Data
Statistics for Variables
Variable Total STD Within STD R-Square RSQ/(1-RSQ)
SepalLength 8.28066 4.39488 0.722096 2.598359
SepalWidth 4.35866 3.24816 0.452102 0.825156
PetalLength 17.65298 4.21431 0.943773 16.784895
PetalWidth 7.62238 2.45244 0.897872 8.791618
OVER-ALL 10.69224 3.66198 0.884275 7.641194
Pseudo F Statistic = 561.63
Approximate Expected Over-All R-Squared = 0.62728
Cubic Clustering Criterion = 25.021
WARNING: The two above values are invalid for correlated variables.
Fisher (1936) Iris Data
The FASTCLUS Procedure
Replace=FULL Radius=0 Maxclusters=3 Maxiter=10 Converge=0.02
Cluster Means
Cluster   SepalLength   SepalWidth   PetalLength   PetalWidth
1  50.06000000  34.28000000  14.62000000  2.46000000
2 68.50000000 30.73684211 57.42105263 20.71052632
3 59.01612903 27.48387097 43.93548387 14.33870968
Cluster Standard Deviations
Cluster   SepalLength   SepalWidth   PetalLength   PetalWidth
1  3.524896872  3.790643691  1.736639965  1.053855894
2 4.941550255 2.900924461 4.885895746 2.798724562
3 4.664100551 2.962840548 5.088949673 2.974997167
Fisher (1936) Iris Data
The FREQ Procedure
Table of CLUSTER by Species
(cell entries: Frequency / Percent / Row Pct / Col Pct)

CLUSTER           Setosa   Versicolor   Virginica    Total
1   Frequency         50            0           0       50
    Percent        33.33         0.00        0.00    33.33
    Row Pct       100.00         0.00        0.00
    Col Pct       100.00         0.00        0.00
2   Frequency          0            2          36       38
    Percent         0.00         1.33       24.00    25.33
    Row Pct         0.00         5.26       94.74
    Col Pct         0.00         4.00       72.00
3   Frequency          0           48          14       62
    Percent         0.00        32.00        9.33    41.33
    Row Pct         0.00        77.42       22.58
    Col Pct         0.00        96.00       28.00
Total                 50           50          50      150
                   33.33        33.33       33.33   100.00