
Partitional and Hierarchical Clustering

Lecture 22

Based on slides of Dr. Ikle and Chapter 8 of Tan, Steinbach, Kumar

1. What is Cluster Analysis?
Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups. Inter-cluster distances are maximized; intra-cluster distances are minimized.

Examples of Clustering
- Biology: kingdom, phylum, class, order, family, genus, and species
- Information Retrieval: search engine query = movie, clusters = reviews, trailers, stars, theaters
- Climate: clusters = regions of similar climate
- Psychology and Medicine: patterns in the spatial or temporal distribution of a disease
- Business: segment customers into groups for marketing activities

Two Reasons for Clustering
- Clustering for understanding (see the examples above)
- Clustering for utility:
  - Summarization: different algorithms can run faster on a data set summarized by clustering
  - Compression: storing cluster information is more efficient than storing the entire data set (example: quantization)
  - Finding nearest neighbors

Applications of Cluster Analysis
Understanding: group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations.

Summarization: reduce the size of large data sets.

[Figure: clustering precipitation in Australia]

CLUSTERING: Introduction
Clustering:
- No class to be predicted
- Groups objects based solely on their attributes
- Objects within clusters are similar to each other
- Objects in different clusters are dissimilar to each other
- Depends on the similarity measure
- Clustering as unsupervised classification

CLUSTERING: Introduction
What constitutes a cluster? The answer is not always clear.

CLUSTERING: Introduction
[Figure: grouping of text into clusters]

CLUSTERING: Introduction
Types of clustering
Partitional:
- Simple division of instances into mutually exclusive (non-overlapping) clusters
- Determine the desired number of clusters
- Iteratively reallocate objects to clusters

CLUSTERING: Introduction
Types of clustering
Hierarchical:
- Clusters are allowed to have (nested) subclusters
- Uses previous clusters to find subclusters
- Types of hierarchical algorithms:
  - Agglomerative (bottom-up): merge smaller clusters
  - Divisive (top-down): divide larger clusters

CLUSTERING: Common Algorithms
Partitional: K-means, K-medoids
Density-based: DBSCAN, SNN

2. K-Means Clustering
- K-means clustering is one of the most common and popular techniques
- Each cluster is associated with a centroid (center point), often the mean; it is the cluster prototype
- Each point is assigned to the cluster with the closest centroid
- The number of clusters, K, must be specified ahead of time

The most common version of k-means minimizes the sum of the squared distances of each point from its cluster center (page 500).
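Written out, this is the sum of squared errors (SSE) objective; the form below is the standard one, though the textbook's notation may differ slightly:

    SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \operatorname{dist}(c_i, x)^2

where C_i is the i-th cluster and c_i is its centroid.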

- For a given set of cluster centers, each point should (obviously) be matched to the nearest center
- For a given cluster, the best center is the mean
- The basic algorithm is to iterate over these two relationships (see the sketch below)
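A minimal NumPy sketch of this alternating iteration (assign points to the nearest center, then recompute each center as the mean). The function name, random initialization, and convergence test are illustrative choices, not the lecture's own code:

import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Basic k-means on an (n, d) array X with k clusters."""
    rng = np.random.default_rng(seed)
    # initialize centroids by picking k distinct data points at random
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iters):
        # assignment step: each point joins the cluster of its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: each centroid moves to the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # convergence: stop when the centroids move less than a small amount
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

For example, labels, centers = kmeans(np.random.rand(100, 2), k=3) partitions 100 random 2-D points into three clusters.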

CLUSTERING: K-means pseudo-code
1. Choose the number of clusters, k
2. Initialize k centroids (randomly, for example)
3. Repeat:
   - Form k clusters by assigning each point to its nearest centroid
   - Recalculate the centroids
   Until convergence (the centroids move less than some amount)

CLUSTERING: K-means complexity
Time complexity: O(n*k*l*d), where n = number of points, k = number of clusters, l = number of iterations, d = number of attributes.
Space complexity: O(k + n).

K-means disadvantages
- A bad choice for k may yield poor results
- The fixed number of clusters makes it difficult to determine the best value for k
- Dependent upon the choice of initial centroids
- Really only works well for spherical cluster shapes

CLUSTERING: K-means advantages
- Simple and effective: conceptually, to implement, and to run
- Other algorithms require more parameters to adjust
- Relatively quick

CLUSTERING: K-means solutions
- One does not always need high-quality clusters
- Use multiple runs to help with the initial centroids and choose the best final result (probability is not your friend)
- Run with different values for k and keep the best result
- Produce a hierarchical clustering: set k = 2 and repeat recursively within each cluster (sketched below)

3. Hierarchical Clustering
- Produces a set of nested clusters organized as a hierarchical tree
- Can be visualized as a dendrogram: a tree-like diagram that records the sequences of merges or splits
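The last of the K-means workarounds above (set k = 2 and recurse within each cluster) is essentially bisecting k-means, and the sequence of splits it performs is exactly the kind of nested, tree-structured clustering just described. A rough sketch, reusing the kmeans function from the earlier snippet; the split-the-largest-cluster rule is an assumption, not something the slides specify:

def bisecting_kmeans(X, k):
    """Grow k clusters by repeatedly 2-splitting existing clusters."""
    clusters = [X]
    while len(clusters) < k:
        # pick the largest remaining cluster to split (one common heuristic;
        # splitting the cluster with the largest SSE is another);
        # assumes the chosen cluster has at least 2 points
        i = max(range(len(clusters)), key=lambda j: len(clusters[j]))
        target = clusters.pop(i)
        labels, _ = kmeans(target, k=2)
        clusters.append(target[labels == 0])
        clusters.append(target[labels == 1])
    return clusters

Recording the order of splits gives the nested (hierarchical) structure; returning only the final list gives a flat partition.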

Strengths of Hierarchical Clustering
- Do not have to assume any particular number of clusters: any desired number of clusters can be obtained by cutting the dendrogram at the proper level

- Clusters may correspond to meaningful taxonomies, for example in the biological sciences (e.g., the animal kingdom, phylogeny reconstruction)

Hierarchical Clustering
Two main types of hierarchical clustering:
- Agglomerative: start with the points as individual clusters; at each step, merge the closest pair of clusters until only one cluster (or k clusters) is left

- Divisive: start with one, all-inclusive cluster; at each step, split a cluster until each cluster contains a single point (or there are k clusters)

Traditional hierarchical algorithms use a similarity or distance matrix and merge or split one cluster at a time.
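For orientation, here is how this looks with SciPy's hierarchical clustering utilities (the toy data and the choice of single linkage are just an illustration): linkage builds the full merge sequence from pairwise distances, dendrogram draws the tree, and fcluster cuts it into any desired number of clusters, matching the "cut the dendrogram at the proper level" point above.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from scipy.spatial.distance import pdist

X = np.random.default_rng(1).random((20, 2))     # toy 2-D points
Z = linkage(pdist(X), method="single")           # full merge sequence (MIN / single link)

labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 flat clusters
dendrogram(Z)                                    # draw the tree of merges
plt.show()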

Agglomerative Clustering Algorithm
The more popular hierarchical clustering technique.

The basic algorithm is straightforward:
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat:
   - Merge the two closest clusters
   - Update the proximity matrix
   Until only a single cluster remains
The key operation is the computation of the proximity of two clusters; the different approaches to defining the distance between clusters distinguish the different algorithms. (A code sketch of this loop appears below.)

Starting Situation
Start with clusters of individual points and a proximity matrix.

[Figure: proximity matrix over points p1-p5, with each point its own cluster]
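A minimal sketch of the agglomerative loop above, operating directly on a precomputed point-to-point distance matrix and using single link (MIN) as the cluster proximity; the linkage choice and function name are illustrative:

import numpy as np

def agglomerative(dist, k=1):
    """Single-link agglomerative clustering on an (n, n) distance matrix dist.
    Returns the clusters (lists of point indices) remaining when k are left."""
    clusters = [[i] for i in range(len(dist))]   # start: every point is its own cluster
    while len(clusters) > k:
        # find the two closest clusters; under single link (MIN), the cluster
        # distance is the distance between their closest pair of points
        best_a, best_b, best_d = 0, 1, np.inf
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if d < best_d:
                    best_a, best_b, best_d = a, b, d
        # merge the pair; the "proximity matrix update" is implicit here because
        # cluster distances are recomputed from point distances on every pass
        clusters[best_a].extend(clusters[best_b])
        del clusters[best_b]
    return clusters

As written this recomputes everything each pass and is slower than O(N^3); real implementations keep and update the proximity matrix instead, which is the situation the next slides walk through.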

Intermediate Situation
After some merging steps, we have some clusters.

[Figure: clusters C1-C5 and their proximity matrix]

Intermediate Situation
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.

[Figure: clusters C1-C5 and their proximity matrix, with C2 and C5 marked for merging]

After Merging
The question is: how do we update the proximity matrix?

[Figure: proximity matrix after merging, with the row and column for C2 U C5 marked "?"]

How to Define Inter-Cluster Similarity
Given the proximity matrix over the original points, the similarity between two clusters can be defined as:
- MIN
- MAX
- Group Average
- Distance Between Centroids
- Other methods driven by an objective function (Ward's Method uses squared error)

Cluster Similarity: MIN or Single Link
The similarity of two clusters is based on the two most similar (closest) points in the different clusters. It is determined by one pair of points, i.e., by one link in the proximity graph.
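These definitions translate directly into how a merged cluster's row of the proximity matrix is filled in. A small sketch (helper name and interface are illustrative) computing the first three from a point-to-point distance matrix:

import numpy as np

def cluster_distance(dist, A, B, method="min"):
    """Distance between clusters A and B (lists of point indices),
    given a point-to-point distance matrix dist."""
    pair_dists = np.array([dist[i, j] for i in A for j in B])
    if method == "min":        # single link: closest pair of points
        return pair_dists.min()
    if method == "max":        # complete link: farthest pair of points
        return pair_dists.max()
    if method == "average":    # group average: mean of all pairwise distances
        return pair_dists.mean()
    raise ValueError(method)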

Hierarchical Clustering: MIN
[Figure: nested clusters and the corresponding dendrogram for six points]

Strength of MIN
[Figure: original points and the two clusters found]

Can handle non-elliptical shapes.

Limitations of MIN
[Figure: original points and the two clusters found]

Sensitive to noise and outliers.

Cluster Similarity: MAX or Complete Linkage
The similarity of two clusters is based on the two least similar (most distant) points in the different clusters. It is determined by all pairs of points in the two clusters.

Hierarchical Clustering: MAX
[Figure: nested clusters and the corresponding dendrogram]

Strength of MAX
[Figure: original points and the two clusters found]

Less susceptible to noise and outliers.

Limitations of MAX

[Figure: original points and the two clusters found]
Tends to break large clusters.
Biased towards globular clusters.

Cluster Similarity: Group Average
The proximity of two clusters is the average of the pairwise proximities between points in the two clusters.

Need to use average connectivity for scalability, since total proximity favors large clusters.
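In symbols (the standard form; the textbook's notation may differ slightly):

    \mathrm{proximity}(C_i, C_j) = \frac{\sum_{p \in C_i,\, q \in C_j} \mathrm{proximity}(p, q)}{|C_i| \cdot |C_j|}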

Hierarchical Clustering: Group Average
[Figure: nested clusters and the corresponding dendrogram]

Hierarchical Clustering: Group Average
A compromise between single and complete link.

Strengths: less susceptible to noise and outliers.

Limitations: biased towards globular clusters.

Cluster Similarity: Ward's Method
The similarity of two clusters is based on the increase in squared error when the two clusters are merged. Similar to group average if the distance between points is the squared distance.

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means; can be used to initialize K-means.

Hierarchical Clustering: Comparison
[Figure: the same six points clustered with MIN, MAX, group average, and Ward's method]

Hierarchical Clustering: Time and Space Requirements
O(N^2) space, since it uses the proximity matrix (N is the number of points).

O(N^3) time in many cases: there are N steps, and at each step the proximity matrix, of size N^2, must be updated and searched. The complexity can be reduced to O(N^2 log N) time for some approaches.

Hierarchical Clustering: Problems and Limitations
- Once a decision is made to combine two clusters, it cannot be undone

- No objective function is directly minimized

- Different schemes have problems with one or more of the following: sensitivity to noise and outliers; difficulty handling different-sized clusters and convex shapes; breaking large clusters

Example from earlier in the lecture: discovered clusters of stocks with similar price fluctuations, labeled by industry group.

Cluster 1 (Technology1-DOWN): Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN

Cluster 2 (Technology2-DOWN): Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN

Cluster 3 (Financial-DOWN): Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN

Cluster 4 (Oil-UP): Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP

Similarity matrix for five items (I1-I5):

      I1    I2    I3    I4    I5
I1  1.00  0.90  0.10  0.65  0.20
I2  0.90  1.00  0.70  0.60  0.50
I3  0.10  0.70  1.00  0.40  0.30
I4  0.65  0.60  0.40  1.00  0.80
I5  0.20  0.50  0.30  0.80  1.00
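If this matrix is worked through by hand as a hierarchical clustering exercise, the result can be checked mechanically. A sketch, assuming similarities are converted to distances as 1 - s (that conversion is an assumption, not stated in the slides); drawing the dendrogram requires matplotlib:

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

sim = np.array([[1.00, 0.90, 0.10, 0.65, 0.20],
                [0.90, 1.00, 0.70, 0.60, 0.50],
                [0.10, 0.70, 1.00, 0.40, 0.30],
                [0.65, 0.60, 0.40, 1.00, 0.80],
                [0.20, 0.50, 0.30, 0.80, 1.00]])
dist = 1.0 - sim                                  # similarity -> dissimilarity (assumed)
Z = linkage(squareform(dist), method="single")    # try "complete" or "average" too
dendrogram(Z, labels=["I1", "I2", "I3", "I4", "I5"])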
