
Cluster Analysis (Lecture# 07-08)


Page 1: Cluster Analysis (Lecture# 07-08)

Cluster Analysis (Lecture# 07-08)

Dr. Tahseen Ahmed Jilani, Assistant Professor

Member IEEE-CIS, IFSA, IRSS

Department of Computer Science, University of Karachi

References:
1. Richard A. Johnson and Dean W. Wichern, "Applied Multivariate Statistical Analysis", Pearson Education.
2. Mehmed Kantardzic, "Data Mining: Concepts, Models, Methods and Algorithms".

Page 2: Cluster Analysis (Lecture# 07-08)

Cluster Analysis

• Cluster analysis is a set of methodologies for the automatic classification of samples into a number of groups using a measure of association, so that the samples in one group are similar and samples belonging to different groups are not similar.

• The input to a cluster-analysis system is a set of samples and a measure of similarity (or dissimilarity) between two samples. The output is a number of groups (clusters) that form a partition, or a structure of partitions, of the data set.

• An additional result of cluster analysis is a generalized description of every cluster, which is especially important for a deeper analysis of the data set's characteristics.

Page 3: Cluster Analysis (Lecture# 07-08)

Clustering Concepts

• Samples for clustering are represented as a vector of measurements or, more formally, as a point in a multidimensional space.

• Samples within a valid cluster are more similar to each other than they are to a sample belonging to a different cluster. Clustering methodology is therefore particularly appropriate for exploring interrelationships among samples and making a preliminary assessment of the sample structure.

• Humans perform competitively with automatic clustering procedures in one, two, or three dimensions, but most real problems involve clustering in higher dimensions. It is very difficult for humans to intuitively interpret data embedded in a high-dimensional space.

Page 4: Cluster Analysis (Lecture# 07-08)

Clustering Concepts: Example

• Table 6.1 shows a simple example of clustering information for nine customers, distributed across three clusters. Two features describe the customers: the first is the number of items a customer bought, and the second is the price paid for each.

Table 6.1: Sample set of clusters consisting of similar objects

             # of items   Price
Cluster 1         2        1700
                  3        2000
                  4        2300
Cluster 2        10        1800
                 12        2100
                 11        2500
Cluster 3         2         100
                  3         200
                  3         350

Page 5: Cluster Analysis (Lecture# 07-08)

Decision Surfaces, Coarse and Fine Clustering

• Even this simple example and interpretation of a cluster's characteristics shows that cluster analysis (in some references also called unsupervised classification) refers to situations in which the objective is to construct decision boundaries (classification surfaces) from an unlabeled training data set. The samples in these data sets have only input dimensions, so the learning process is classified as unsupervised.

• Clustering is a very difficult problem because data can reveal clusters with different shapes and sizes in an n-dimensional data space.

• To compound the problem further, the number of clusters in the data often depends on the resolution (fine vs. coarse) with which we view the data.

Page 6: Cluster Analysis (Lecture# 07-08)

Visual Clustering for Low-Dimensional Data with a Small Number of Samples

• Figure 6.1a shows a set of points (samples in a two-dimensional space) scattered on a 2D plane.

• This kind of arbitrariness in the number of clusters, as shown in Figures (b) and (c), is a major problem in clustering. What will happen in 3D, or in the k-th dimension?

Figure: Cluster analysis of points in a 2D space

Page 7: Cluster Analysis (Lecture# 07-08)

How Will Clustering Work in Higher Dimensions?

• The example on the last slide is for a 2D data set only. How do we perform clustering for a data set with 15 fields (15-D) for each record (sample)?

• Accordingly, we need an objective criterion for clustering. To describe this criterion, we have to introduce a more formalized approach for describing the basic concepts and the clustering process.

• An input to a cluster analysis can be described as an ordered pair (X, s), or (X, d), where X is a set of descriptions of samples, and s and d are measures of similarity or dissimilarity (distance) between samples, respectively. The output from the clustering system is a partition Λ = {G_1, G_2, …, G_N}, where each G_k, k = 1, …, N, is a crisp subset of X and the subsets are non-overlapping.

Page 8: Cluster Analysis (Lecture# 07-08)

Types of Cluster Representations

There are several schemata for a formal description of discovered clusters:

• Represent a cluster of points in an n-dimensional space (samples) by its centroid or by a set of distant (border) points in the cluster.

• Represent a cluster graphically using nodes in a clustering tree.

• Represent clusters by using logical expressions on sample attributes.

Page 9: Cluster Analysis (Lecture# 07-08)

Common Problems with Clustering Algorithms

• The availability of a vast collection of clustering algorithms, in the literature and in different software environments, can easily confound a user attempting to select an approach suitable for the problem at hand.

• It is important to mention that no clustering technique is universally applicable for uncovering the variety of structures present in multidimensional data sets. The user's understanding of the problem and of the corresponding data types will be the best criteria for selecting the appropriate method.

Page 10: Cluster Analysis (Lecture# 07-08)

Most Common Clustering Approaches

• Hierarchical Clustering: These methods organize data into a nested sequence of groups, which can be displayed in the form of a dendrogram or tree structure.

• Iterative Square-Error Partitional Clustering: These methods attempt to obtain the partition that minimizes the within-cluster scatter or maximizes the between-cluster scatter. They are nonhierarchical because all resulting clusters are groups of samples at the same level of partition. To guarantee that an optimal solution has been obtained, one would have to examine all possible partitions of N samples of n dimensions into K clusters (for a given K), but that exhaustive search is not computationally feasible.

Page 11: Cluster Analysis (Lecture# 07-08)

Different Measures of Similarity/Dissimilarity in Clustering Algorithms

• Since similarity is fundamental to the definition of a cluster, this measure must be chosen very carefully, because the quality of a clustering process depends on this decision.

• A sample x (or feature vector, observation) is a single data vector used by the clustering algorithm in a space of samples X.

• We assume that each sample x_i ∈ X, i = 1, …, n, is represented by a vector x_i = {x_i1, x_i2, …, x_im}. The value m is the number of dimensions (features) of a sample, while n is the total number of samples, belonging to the sample domain X, prepared for the clustering process.

Page 12: Cluster Analysis (Lecture# 07-08)

Similarity Measure or Dissimilarity Measure

• It is most common to calculate, instead of the similarity measure s(x, x′), the dissimilarity d(x, x′) between two samples, using a distance measure defined on the feature space.

• A distance measure may be a metric or a quasi-metric on the sample space, and it is used to quantify the dissimilarity of samples.

• A distance d(x, x′) is small when x and x′ are similar; if x and x′ are not similar, d(x, x′) is large. We assume without loss of generality that the distance measure is also symmetric: d(x, x′) = d(x′, x).

Page 13: Cluster Analysis (Lecture# 07-08)

Similarity Measure or Dissimilarity Measure

• Most efforts to produce a rather simple group structure from a complex data set require a measure of "closeness" or "similarity". There is often a great deal of subjectivity involved in the choice of a similarity measure.

• Important considerations include the nature of the variables (discrete, continuous, binary), the scales of measurement (nominal, ordinal, interval, ratio), and subject-matter knowledge.

• When samples are clustered, proximity is usually indicated by some sort of distance.

Page 14: Cluster Analysis (Lecture# 07-08)

Well-known Dissimilarity Measures: Distance/Metric

• Euclidean distance:
\[ d_2(x_i, x_j) = \left( \sum_{k=1}^{m} (x_{ik} - x_{jk})^2 \right)^{1/2} \]

• L1 or city-block distance:
\[ d_1(x_i, x_j) = \sum_{k=1}^{m} \lvert x_{ik} - x_{jk} \rvert \]

• Minkowski metric:
\[ d_p(x_i, x_j) = \left( \sum_{k=1}^{m} \lvert x_{ik} - x_{jk} \rvert^{p} \right)^{1/p} \]

• Cosine-correlation metric:
\[ s_{\cos}(x_i, x_j) = \frac{\sum_{k=1}^{m} x_{ik}\, x_{jk}}{\left( \sum_{k=1}^{m} x_{ik}^2 \right)^{1/2} \left( \sum_{k=1}^{m} x_{jk}^2 \right)^{1/2}} \]

• Canberra metric:
\[ d_{can}(x_i, x_j) = \sum_{k=1}^{m} \frac{\lvert x_{ik} - x_{jk} \rvert}{x_{ik} + x_{jk}} \]

• Czekanowski coefficient:
\[ d_{cz}(x_i, x_j) = 1 - \frac{2 \sum_{k=1}^{m} \min(x_{ik}, x_{jk})}{\sum_{k=1}^{m} (x_{ik} + x_{jk})} \]
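As a quick illustration of these measures, the following MATLAB sketch (not part of the original slides; the variable names are assumed) computes each of them for two sample vectors:

% Illustrative sketch: the dissimilarity measures above for two sample vectors.
xi = [2 1700];                                      % e.g., a customer from Cluster 1 in Table 6.1
xj = [10 1800];                                     % e.g., a customer from Cluster 2
d_euclid      = sqrt(sum((xi - xj).^2));            % Euclidean distance
d_cityblock   = sum(abs(xi - xj));                  % L1 / city-block distance
p = 3;
d_minkowski   = sum(abs(xi - xj).^p)^(1/p);         % Minkowski metric of order p
s_cosine      = sum(xi.*xj) / (norm(xi)*norm(xj));  % cosine-correlation similarity
d_canberra    = sum(abs(xi - xj) ./ (xi + xj));     % Canberra metric (nonnegative features)
d_czekanowski = 1 - 2*sum(min(xi, xj)) / sum(xi + xj);   % Czekanowski coefficient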

Page 15: Cluster Analysis (Lecture# 07-08)

Similarity for Qualitative Features

• Computing distances or measures of similarity between samples that have some or all non-continuous features is problematic, since the different types of features are not comparable and one standard measure is not applicable.

• A conventional method for obtaining a distance measure between two samples x_i and x_j represented with binary features is to use the 2 × 2 contingency table for samples x_i and x_j, as shown in Table 6.2.

Page 16: Cluster Analysis (Lecture# 07-08)

Qualitative Similarity Measures

Table 6.2: 2 × 2 contingency table for binary features of samples x_i and x_j

                 x_j
                 1     0
     x_i   1     a     b
           0     c     d

• Simple Matching Coefficient (SMC):
\[ S_{smc}(x_i, x_j) = \frac{a + d}{a + b + c + d} \]

• Jaccard's Coefficient:
\[ S_{jc}(x_i, x_j) = \frac{a}{a + b + c} \]

• Rao's Coefficient:
\[ S_{rc}(x_i, x_j) = \frac{a}{a + b + c + d} \]
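For concreteness, here is a small MATLAB sketch (assumed, not from the slides) that counts the contingency-table entries a, b, c, d for two binary feature vectors and evaluates the three coefficients:

xi = [0 0 0 1 1 1];                  % e.g., individual 1 coded on X1..X6 (see the next slides)
xj = [1 1 1 0 1 0];                  % e.g., individual 2
a = sum(xi == 1 & xj == 1);          % 1-1 matches
b = sum(xi == 1 & xj == 0);          % 1-0 mismatches
c = sum(xi == 0 & xj == 1);          % 0-1 mismatches
d = sum(xi == 0 & xj == 0);          % 0-0 matches
S_smc = (a + d) / (a + b + c + d);   % simple matching coefficient (1/6 for this pair)
S_jc  = a / (a + b + c);             % Jaccard's coefficient
S_rc  = a / (a + b + c + d);         % Rao's coefficient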

Page 17: Cluster Analysis (Lecture# 07-08)

Example: Qualitative Similarity Measures

• Consider five individuals who possess the following characteristics:

               Height  Weight  Eye Color  Hair Color  Handedness  Gender
Individual 1     68      140     green      blond       right      female
Individual 2     73      185     brown      brown       right      male
Individual 3     67      165     blue       blond       right      male
Individual 4     64      120     brown      brown       right      female
Individual 5     76      210     brown      brown       left       male

• Define six binary variables X1, X2, X3, X4, X5 and X6:

X1 = 1 if height ≥ 72, 0 if height < 72
X2 = 1 if weight ≥ 150, 0 if weight < 150
X3 = 1 if brown eyes, 0 otherwise
X4 = 1 if blond hair, 0 otherwise
X5 = 1 if right-handed, 0 otherwise
X6 = 1 if female, 0 if male

Page 18: Cluster Analysis (Lecture# 07-08)

• The table below shows the scores of individuals 1 and 2 on the p = 6 binary variables. Applying the simple matching coefficient, which gives equal weight to matches, we compute (a + d)/6 = (1 + 0)/6 = 1/6.

                   Indiv 2
                   1     0    Total
  Indiv 1    1     1     2      3
             0     3     0      3
       Total       4     2      6

• The second table shows the similarity matrix for all five individuals (pairwise):

                    Individual
              1     2     3     4     5
       1      1     --    --    --    --
       2     1/6    1     --    --    --
       3     4/6   3/6    1     --    --
       4     4/6   3/6   2/6    1     --
       5      0    5/6   2/6   2/6    1
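A minimal MATLAB sketch (assumed, not part of the original slides) that codes the five individuals on X1-X6 and reproduces the pairwise simple-matching similarity matrix above:

B = [0 0 0 1 1 1;    % individual 1
     1 1 1 0 1 0;    % individual 2
     0 1 0 1 1 0;    % individual 3
     0 0 1 0 1 1;    % individual 4
     1 1 1 0 0 0];   % individual 5
p = size(B, 2);
S = zeros(5);
for i = 1:5
    for j = 1:5
        S(i, j) = sum(B(i, :) == B(j, :)) / p;   % (a + d) / p for each pair
    end
end
disp(S)   % e.g., S(2,1) = 1/6, S(3,1) = 4/6, S(5,2) = 5/6, S(5,1) = 0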

Page 19: Cluster Analysis (Lecture# 07-08)

Mutual Neighbor Distance (MND) for Categorical Samples

Page 20: Cluster Analysis (Lecture# 07-08)

Types of Hierarchical Clustering

• Most procedures for hierarchical clustering are not based on the concept of optimization; the goal is to find some approximate, suboptimal solution, using iterations to improve the partitions until convergence.

• Algorithms for hierarchical cluster analysis are divided into two categories: divisible algorithms and agglomerative algorithms.

  – A Divisible Algorithm starts from the entire set of samples X and divides it into a partition of subsets, then divides each subset into smaller sets, and so on. Thus, a divisible algorithm generates a sequence of partitions ordered from a coarser one to a finer one.

Page 21: Cluster Analysis (Lecture# 07-08)

Agglomerative Algorithm

• An agglomerative algorithm first regards each object as an initial cluster. The clusters are merged into a coarser partition, and the merging process proceeds until the trivial partition is obtained: all objects are in one large cluster.

• This process of clustering is a bottom-up process, where partitions go from a finer one to a coarser one.

• In general, agglomerative algorithms are more frequently used in real-world applications than divisible methods.

Page 22: Cluster Analysis (Lecture# 07-08)

Types of Agglomerative Hierarchical Clustering Algorithms

• Most agglomerative hierarchical clustering algorithms are variants of the single-link or complete-link algorithms. These two basic algorithms differ only in the way they characterize the similarity between a pair of clusters.

• In the single-link method, the distance between two clusters is the minimum of the distances between all pairs of samples drawn from the two clusters (one element from the first cluster, the other from the second).

• In the complete-link algorithm, the distance between two clusters is the maximum of all distances between pairs drawn from the two clusters. A graphical illustration of these two distance measures is given in Figure 6.5.

Page 23: Cluster Analysis (Lecture# 07-08)

Diagrammatic Presentation of Single- and Complete-Link Agglomerative Hierarchical Clustering Algorithms

• Linkage methods are suitable for clustering samples as well as variables.

Page 24: Cluster Analysis (Lecture# 07-08)

The Basic Steps of Agglomerative Clustering

• Place each sample in its own cluster. Construct the list of inter-cluster distances for all distinct unordered pairs of samples, and sort this list in ascending order.

• Step through the sorted list of distances, forming, for each distinct threshold value d_k, a graph of the samples in which pairs of samples closer than d_k are connected into a new cluster by a graph edge. If all the samples are members of a connected graph, stop. Otherwise, repeat this step.

• The output of the algorithm is a nested hierarchy of graphs, which can be cut at the desired dissimilarity level, forming a partition (clusters) identified by simple connected components in the corresponding subgraph.

Page 25: Cluster Analysis (Lecture# 07-08)

Types of Agglomerative Hierarchical Clustering Algorithms

• The results of both divisible and agglomerative clustering methods may be displayed in the form of a two-dimensional diagram known as a dendrogram (tree diagram).

Page 26: Cluster Analysis (Lecture# 07-08)

Single Linkage Method

• The inputs to a single-link method can be distances or similarities between samples.

• Groups are formed from the individual samples by merging the corresponding objects, say U and V, to get the cluster (UV).

• The distances between (UV) and any other cluster W are computed by d_(UV)W = min{d_UW, d_VW}.

• The result of merging clusters to form new clusters can be shown graphically using a dendrogram (tree diagram).

Page 27: Cluster Analysis (Lecture# 07-08)

Example: Single-Link Agglomerative Clustering Method

• Consider the hypothetical distances between pairs of five objects given by the matrix D below.

• Treating each object as a cluster, we commence clustering by merging the two closest items.

• Since min(d_ik) = d_53 = 2, samples 3 and 5 are merged to form a new cluster (35).

• To implement the next level/iteration of the clustering, we need the distances between the cluster (35) and the remaining samples/clusters 1, 2, and 4.

           1    2    3    4    5
D =   1    0
      2    9    0
      3    3    7    0
      4    6    5    9    0
      5   11   10    2    8    0

Page 28: Cluster Analysis (Lecture# 07-08)

Example (Continued): Steps 2 and 3

• d_(35)1 = min{d_31, d_51} = min{3, 11} = 3
• d_(35)2 = min{d_32, d_52} = min{7, 10} = 7
• d_(35)4 = min{d_34, d_54} = min{9, 8} = 8

• Deleting the rows and columns of the initial distance matrix that correspond to objects 3 and 5, and adding a row and column for the cluster (35), we obtain the new distance matrix:

          (35)    1    2    4
   (35)    0
    1      3      0
    2      7      9    0
    4      8      6    5    0

• Here min(d) = 3 = d_(35)1, so we merge cluster (35) and sample 1 to form the cluster (135).

• d_(135)2 = min{d_(35)2, d_12} = min{7, 9} = 7
• d_(135)4 = min{d_(35)4, d_14} = min{8, 6} = 6

          (135)   2    4
  (135)     0
    2       7     0
    4       6     5    0

Page 29: Cluster Analysis (Lecture# 07-08)

• The minimum nearest-neighbor distance between pairs of clusters is now d_24 = 5, so we merge samples 2 and 4 to form the cluster (24).

• Finally, d_(135)(24) = min{d_(135)2, d_(135)4} = min{7, 6} = 6, and the clusters (135) and (24) are merged at distance 6.

          (135)  (24)
  (135)     0
   (24)     6     0
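The whole example can be checked with MATLAB's linkage function; here is a sketch (assumed, not on the original slides) using the pairwise distances from matrix D in pdist order:

% Distances in pdist order: d12 d13 d14 d15 d23 d24 d25 d34 d35 d45
y = [9 3 6 11 7 5 10 9 2 8];
Z = linkage(y, 'single');   % single-link merge history: (3,5) at 2, +1 at 3, (2,4) at 5, all at 6
dendrogram(Z)               % draws the corresponding tree diagram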

Page 30: Cluster Analysis (Lecture# 07-08)


Data Collected on 22 U.S. Public Utility Companies for the year 1975

  Company X1 X2 X3 X4 X5 X6 X7 X8

1 Arizona Public Service 1.06 9.2 151 54.4 1.6 9077 0 0.628

2 Boston Edison Co. 0.89 10.3 202 57.9 2.2 5088 25.3 1.555

3 Central Louisiana Electric Co. 1.43 6.4 113 53 3.4 9212 0 1.058

4 Commonwealth Edison Co. 1.02 11.2 168 56 0.3 6423 34.3 0.7

5 Consolidated Edison Co. 1.49 8.8 192 51.2 1 3300 15.6 2.044

6 Florida Power and Light Co. 1.32 13.5 111 60 -2.2 11127 22.5 1.241

7 Hawaiian Electric Co. 1.22 12.2 175 67.6 2.2 7642 0 1.652

8 Idaho Power Co. 1.1 9.2 245 57 3.3 13082 0 0.309

9 Kentucky Utilities Co. 1.34 13 168 60.4 7.2 8406 0 0.862

10 Madison Gas and Electric Co. 1.12 12.4 197 53 2.7 6455 39.2 0.623

11 Nevada Power Co. 0.75 7.5 173 51.5 6.5 17441 0 0.768

12 New England Electric Co. 1.13 10.9 178 62 3.7 6154 0 1.897

13 Northern States Power Co. 1.15 12.7 199 53.7 6.4 7179 50.2 0.527

14 Oklahoma Gas and Electric Co. 1.09 12 96 49.8 1.4 9673 0 0.588

15 Pacific Gas and Electric Co. 0.96 7.6 164 62.2 -0.1 6468 0.9 1.4

16 Puget Sound Power and Light Co. 1.16 6.4 252 56 9.2 15991 0 0.62

17 San Diego Gas and Electric Co. 0.76 9.9 136 61.9 9 5714 8.3 1.92

18 The Southern Co. 1.05 12.6 150 56.7 2.7 10140 0 1.108

19 Texas Utilities Co. 1.16 11.7 104 54 2.1 13507 0 0.636

20 Wisconsin Electric Power Co. 1.2 11.8 148 59.9 3.5 7287 41.1 0.702

21 United Illuminating Co. 1.04 8.6 204 61 3.5 6650 0 2.116

22 Virginia Electric & Power Co. 1.07 9.3 174 54.3 5.9 10093 26.6 1.306

Page 31: Cluster Analysis (Lecture# 07-08)

• X1 = Fixed-charge coverage ratio (income/debt)
• X2 = Rate of return on capital
• X3 = Cost per kW capacity in place
• X4 = Annual load factor
• X5 = Peak kWh demand growth from 1974 to 1975
• X6 = Sales (kWh use per year)
• X7 = Percent nuclear
• X8 = Total fuel cost (cents per kWh)

Page 32: Cluster Analysis (Lecture# 07-08)

Correlation Matrix Between Pairs of Variables Using MATLAB

>> X = [put all data inside the square brackets]
>> Y = corr(X)    % output will be an 8x8 matrix of correlations

 1.0000   0.1598  -0.1028  -0.0820  -0.2618  -0.1517   0.0448  -0.0134
 0.1598   1.0000  -0.3108   0.1881  -0.2618  -0.2486   0.3973  -0.1432
-0.1028  -0.3108   1.0000   0.1003   0.3611   0.0280   0.1147   0.0052
-0.0820   0.1881   0.1003   1.0000  -0.0100  -0.2879  -0.1642   0.4855
-0.2618  -0.2618   0.3611  -0.0100   1.0000   0.2793  -0.0699  -0.0656
-0.1517  -0.2486   0.0280  -0.2879   0.2793   1.0000  -0.3737  -0.5605
 0.0448   0.3973   0.1147  -0.1642  -0.0699  -0.3737   1.0000  -0.1851
-0.0134  -0.1432   0.0052   0.4855  -0.0656  -0.5605  -0.1851   1.0000

Page 33: Cluster Analysis (Lecture# 07-08)

MATLAB Code for the Single Linkage Method for Agglomerative Clustering

MATLAB CODE

>> X = [ all data in this matrix ]
>> Y = pdist(X)
>> Z = linkage(Y)    % single linkage is the default method
>> [H,T] = dendrogram(Z, 'colorthreshold', 'default');

Page 34: Cluster Analysis (Lecture# 07-08)

Types of Agglomerative Hierarchical Clustering Algorithms

[Figure: single-linkage dendrogram for the 22 utility companies; the vertical axis shows merge distances from 0 to 2500, and the leaves appear in the order 4, 10, 15, 21, 12, 17, 7, 13, 20, 2, 1, 3, 14, 18, 22, 9, 6, 5, 8, 19, 11, 16.]

Page 35: Cluster Analysis (Lecture# 07-08)

Complete Linkage Method

• Do it yourself.

Page 36: Cluster Analysis (Lecture# 07-08)

Partitional Clustering

• Every partitional-clustering algorithm obtains a single partition of the data instead of a clustering structure, such as the dendrogram produced by a hierarchical technique.

• Partitional methods have an advantage in applications involving large data sets, for which the construction of a dendrogram is computationally very complex.

Page 37: Cluster Analysis (Lecture# 07-08)

Criterion / Performance / Objective Function

• Partitional techniques usually produce clusters by optimizing a criterion function defined either locally (on a subset of samples) or globally (over all of the samples). Thus we say that a clustering criterion can be either global or local; the two cases are defined on the next slide.

Page 38: Cluster Analysis (Lecture# 07-08)

Criterion Function: Global Criteria and Local Criteria

• A global criterion, such as the Euclidean square-error measure, represents each cluster by a prototype or centroid and assigns the samples to clusters according to the most similar prototypes.

• A local criterion, such as the minimal mutual neighbor distance (MND), forms clusters by utilizing the local structure or context in the data. Identifying high-density regions in the data space is therefore a basic criterion for forming clusters.

Page 39: Cluster Analysis (Lecture# 07-08)

Other Criterion Functions

• MSE
• RMSE
• Absolute MSE
• Other statistical criteria

Page 40: Cluster Analysis (Lecture# 07-08)

Mean Square Error Based on Euclidean Distance

• The most commonly used partitional-clustering strategy is based on the square-error criterion.

• The general objective is to obtain the partition that, for a fixed number of clusters, minimizes the total square-error.

• Suppose that the given set of N samples in an n-dimensional space has somehow been partitioned into K clusters {C_1, C_2, …, C_K}. Each C_k has n_k samples and each sample is in exactly one cluster, so that Σ n_k = N, where k = 1, …, K.

Page 41: Cluster Analysis (Lecture# 07-08)

MSE Based on Euclidean Distance

• The mean vector M_k of cluster C_k is defined as the centroid (mean) of the cluster:
\[ M_k = \frac{1}{n_k} \sum_{i=1}^{n_k} x_{ik} \]
where x_{ik} is the i-th sample belonging to cluster C_k.

• The square-error for cluster C_k is the sum of the squared Euclidean distances between each sample in C_k and its centroid. This error is also called the within-cluster variation:
\[ e_k^2 = \sum_{i=1}^{n_k} \lVert x_{ik} - M_k \rVert^2 \]

Page 42: Cluster Analysis (Lecture# 07-08)

K-Means Partitional-Clustering Algorithm

• The square-error for the entire clustering space containing K clusters is the sum of the within-cluster variations:
\[ E_K^2 = \sum_{k=1}^{K} e_k^2 \]

• The objective of a square-error clustering method is to find a partition containing K clusters that minimizes E_K^2 for a given K.

• The K-means partitional-clustering algorithm is the simplest and most commonly used algorithm employing a square-error criterion.

• It starts with a random initial partition and keeps reassigning the samples to clusters, based on the similarity between samples and clusters, until a convergence criterion is met.

Page 43: Cluster Analysis (Lecture# 07-08)

Diagrammatic Presentation of the K-Means/Centroids Algorithm

Page 44: Cluster Analysis (Lecture# 07-08)

K-Means Partitional-Clustering Algorithm

• Typically, the convergence criterion is met when there is no reassignment of any sample from one cluster to another that would decrease the total squared error.

• The K-means algorithm is popular because it is easy to implement, and its time and space complexity is relatively small.

• A major problem with this algorithm is that it is sensitive to the selection of the initial partition and may converge to a local minimum of the criterion function if the initial partition is not properly chosen.

• The simple K-means partitional-clustering algorithm is computationally efficient and gives surprisingly good results if the clusters are compact, hyperspherical in shape, and well separated in the feature space.

Page 45: Cluster Analysis (Lecture# 07-08)

Basic Steps of the K-Means Algorithm

1. Select an initial partition with K clusters containing randomly chosen samples, and compute the centroids of the clusters.

2. Generate a new partition by assigning each sample to the closest cluster center.

3. Compute new cluster centers as the centroids of the clusters.

4. Repeat steps 2 and 3 until an optimum value of the criterion function is found (or until the cluster membership stabilizes).
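A minimal sketch of these four steps in MATLAB (assumed helper code with a hypothetical name, not the Statistics Toolbox kmeans function and not part of the original lecture):

function [idx, M] = simple_kmeans(X, K)
% X: n-by-m data matrix, K: number of clusters; idx: cluster labels, M: centroids.
n = size(X, 1);
idx = randi(K, n, 1);                        % step 1: random initial partition
for iter = 1:100                             % step 4: repeat until membership stabilizes
    M = zeros(K, size(X, 2));
    for k = 1:K                              % steps 1 & 3: centroids of current clusters
        M(k, :) = mean(X(idx == k, :), 1);   % (a robust version would handle empty clusters)
    end
    [~, newIdx] = min(pdist2(X, M), [], 2);  % step 2: assign each sample to closest center
    if isequal(newIdx, idx), break; end
    idx = newIdx;
end
end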

Page 46: Cluster Analysis (Lecture# 07-08)

Example

• Let us analyze the steps of the K-means algorithm on the simple data set given in Figure 6.6. Suppose that the required number of clusters is two and, initially, the clusters are formed from a random distribution of the samples.

• Step 1: x1 = (0, 2), x2 = (0, 0), x3 = (1.5, 0), x4 = (5, 0), x5 = (5, 2)

C1 = {x1, x2, x4} and C2 = {x3, x5}. The centroids of these two clusters are:

M1 = ((0 + 0 + 5)/3, (2 + 0 + 0)/3) = (1.66, 0.66)
M2 = ((1.5 + 5)/2, (0 + 2)/2) = (3.25, 1.00)

Page 47: Cluster Analysis (Lecture# 07-08)

Example (Continued)

• The within-cluster variations, after the initial random distribution of samples, are:

e1² = [(0 - 1.66)² + (2 - 0.66)²] + [(0 - 1.66)² + (0 - 0.66)²] + [(5 - 1.66)² + (0 - 0.66)²] = 19.36
e2² = [(1.5 - 3.25)² + (0 - 1.00)²] + [(5 - 3.25)² + (2 - 1.00)²] = 8.12

• And the total square error is

E² = e1² + e2² = 19.36 + 8.12 = 27.48

Page 48: Cluster Analysis (Lecture# 07-08)

Example: K-Means Clustering (Continued)

• Step 2:

• When we reassign all samples according to the minimum distance from the centroids M1 and M2, the new redistribution of samples among the clusters is:

d(M1, x1) = 2.14 and d(M2, x1) = 3.40  →  x1 ∈ C1
d(M1, x2) = 1.79 and d(M2, x2) = 3.40  →  x2 ∈ C1
d(M1, x3) = 0.83 and d(M2, x3) = 2.01  →  x3 ∈ C1
d(M1, x4) = 3.41 and d(M2, x4) = 2.01  →  x4 ∈ C2
d(M1, x5) = 3.60 and d(M2, x5) = 2.01  →  x5 ∈ C2

Page 49: Cluster Analysis (Lecture# 07-08)

Example: K-Means Clustering (Continued)

• C1 = {x1, x2, x3} and C2 = {x4, x5} have the new centroids

M1 = (0.5, 0.67) and M2 = (5.0, 1.0)

• The corresponding within-cluster variations and the total square error are

e1² = 4.17, e2² = 2.00, and E² = e1² + e2² = 6.17

• After the first iteration, the total square error is significantly reduced (from 27.48 to 6.17). In this simple example, the first iteration was also the final one, because if we analyze the distances between the new centroids and the samples, all samples are assigned to the same clusters as before. There is no reassignment, and therefore the algorithm halts.
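This worked example can be reproduced with the Statistics Toolbox kmeans function, seeding it with the centroids of the initial random partition (a sketch, assumed rather than taken from the slides):

X  = [0 2; 0 0; 1.5 0; 5 0; 5 2];   % samples x1 .. x5
M0 = [1.66 0.66; 3.25 1.00];        % centroids of the initial partition C1 = {x1,x2,x4}, C2 = {x3,x5}
[idx, M, sumd] = kmeans(X, 2, 'Start', M0);
% idx  -> [1 1 1 2 2]': final clusters C1 = {x1,x2,x3}, C2 = {x4,x5}
% M    -> final centroids (0.5, 0.67) and (5.0, 1.0)
% sumd -> within-cluster squared errors, approximately [4.17; 2.00]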

Page 50: Cluster Analysis (Lecture# 07-08)

The Reasons Behind the Popularity of the K-Means Algorithm

• In summary, only the K-means algorithm and its equivalent in the artificial-neural-networks domain, the Kohonen neural networks, have been applied for clustering on large data sets. Other approaches have been tested, typically, on small data sets.

• Its time complexity is O(nkl), where n is the number of samples, k is the number of clusters, and l is the number of iterations taken by the algorithm to converge. Typically, k and l are fixed in advance, so the algorithm has linear time complexity in the size of the data set.

• Its space complexity is O(k + n), and if it is possible to store all the data in primary memory, access time to all elements is very fast and the algorithm is very efficient.

Page 51: Cluster Analysis (Lecture# 07-08)

The Reasons Behind the Popularity of the K-Means Algorithm

• It is an order-independent algorithm. For a given initial distribution of clusters, it generates the same partition of the data at the end of the partitioning process, irrespective of the order in which the samples are presented to the algorithm.

A big frustration in using iterative partitional-clustering programs is the lack of guidelines for choosing the number of clusters K, apart from:

• the ambiguity about the best direction for the initial partition,
• updating the partition,
• adjusting the number of clusters, and
• the stopping criterion.

• In the presence of outliers, an alternative is the k-medoids algorithm.

Page 52: Cluster Analysis (Lecture# 07-08)


Data Collected on 22 U.S. Public Utility Companies for the year 1975

  Company X1 X2 X3 X4 X5 X6 X7 X8

1 Arizona Public Service 1.06 9.2 151 54.4 1.6 9077 0 0.628

2 Boston Edison Co. 0.89 10.3 202 57.9 2.2 5088 25.3 1.555

3 Central Louisiana Electric Co. 1.43 6.4 113 53 3.4 9212 0 1.058

4 Commonwealth Edison Co. 1.02 11.2 168 56 0.3 6423 34.3 0.7

5 Consolidated Edison Co. 1.49 8.8 192 51.2 1 3300 15.6 2.044

6 Florida Power and Light Co. 1.32 13.5 111 60 -2.2 11127 22.5 1.241

7 Hawaiian Electric Co. 1.22 12.2 175 67.6 2.2 7642 0 1.652

8 Idaho Power Co. 1.1 9.2 245 57 3.3 13082 0 0.309

9 Kentucky Utilities Co. 1.34 13 168 60.4 7.2 8406 0 0.862

10 Madison Gas and Electric Co. 1.12 12.4 197 53 2.7 6455 39.2 0.623

11 Nevada Power Co. 0.75 7.5 173 51.5 6.5 17441 0 0.768

12 New England Electric Co. 1.13 10.9 178 62 3.7 6154 0 1.897

13 Northern States Power Co. 1.15 12.7 199 53.7 6.4 7179 50.2 0.527

14 Oklahoma Gas and Electric Co. 1.09 12 96 49.8 1.4 9673 0 0.588

15 Pacific Gas and Electric Co. 0.96 7.6 164 62.2 -0.1 6468 0.9 1.4

16 Puget Sound Power and Light Co. 1.16 6.4 252 56 9.2 15991 0 0.62

17 San Diego Gas and Electric Co. 0.76 9.9 136 61.9 9 5714 8.3 1.92

18 The Southern Co. 1.05 12.6 150 56.7 2.7 10140 0 1.108

19 Texas Utilities Co. 1.16 11.7 104 54 2.1 13507 0 0.636

20 Wisconsin Electric Power Co. 1.2 11.8 148 59.9 3.5 7287 41.1 0.702

21 United Illuminating Co. 1.04 8.6 204 61 3.5 6650 0 2.116

22 Virginia Electric & Power Co. 1.07 9.3 174 54.3 5.9 10093 26.6 1.306

Page 53: Cluster Analysis (Lecture# 07-08)

MATLAB Code for the K-Means Algorithm and Silhouette Plot

idx4 = kmeans(X, 4, 'dist', 'city', 'display', 'iter');
% OR
idx5 = kmeans(X, 5, 'dist', 'city', 'replicates', 5);

[silh5, h] = silhouette(X, idx5, 'city');
xlabel('Silhouette Value')
ylabel('Cluster')
mean(silh5)    % average silhouette width over all samples

Page 54: Cluster Analysis (Lecture# 07-08)


Company   X2   X3   X4   X5   X6   X7   X8   Cluster

1 9.2 151 54.4 1.6 9077 0 0.628 2

2 10.3 202 57.9 2.2 5088 25.3 1.555 4

3 6.4 113 53 3.4 9212 0 1.058 4

4 11.2 168 56 0.3 6423 34.3 0.7 4

5 8.8 192 51.2 1 3300 15.6 2.044 4

6 13.5 111 60 -2.2 11127 22.5 1.241 4

7 12.2 175 67.6 2.2 7642 0 1.652 4

8 9.2 245 57 3.3 13082 0 0.309 2

9 13 168 60.4 7.2 8406 0 0.862 2

10 12.4 197 53 2.7 6455 39.2 0.623 4

11 7.5 173 51.5 6.5 17441 0 0.768 2

12 10.9 178 62 3.7 6154 0 1.897 1

13 12.7 199 53.7 6.4 7179 50.2 0.527 3

14 12 96 49.8 1.4 9673 0 0.588 1

15 7.6 164 62.2 -0.1 6468 0.9 1.4 1

16 6.4 252 56 9.2 15991 0 0.62 3

17 9.9 136 61.9 9 5714 8.3 1.92 3

18 12.6 150 56.7 2.7 10140 0 1.108 1

19 11.7 104 54 2.1 13507 0 0.636 2

20 11.8 148 59.9 3.5 7287 41.1 0.702 1

21 8.6 204 61 3.5 6650 0 2.116 --

22 9.3 174 54.3 5.9 10093 26.6 1.306 --

Page 55: Cluster Analysis (Lecture# 07-08)

Thank You