
Introduction to Data Mining

Hierarchical Clustering

CPSC/AMTH 445a/545a

Guy Wolf
[email protected]

Yale University, Fall 2016

Outline

1 Hierarchical clustering
  Divisive & agglomerative approaches
  Dendrogram visualization
  Bisecting k-means

2 Agglomerative clustering
  Single linkage
  Complete linkage
  Average linkage
  Ward's method

3 Large-scale clustering
  CURE
  BIRCH
  Chameleon

Hierarchical clustering

Question: how many clusters should we find in the data?

Suggestion: why not consider all options in a single hierarchy?

Hierarchical clustering

A hierarchical approach can be useful when considering versatile cluster shapes:

(figures omitted: 2-means and 10-means partitions of the same data)

By first detecting many small clusters, and then merging them, we can uncover patterns that are challenging for partitional methods.

Hierarchical clustering: Divisive & agglomerative approaches

Hierarchical clustering methods produce a set of nested clusters organized in a hierarchy tree.

The cluster hierarchy is typically visualized using dendrograms.
Such approaches are applied either to provide a multiresolution data organization, or to alleviate computational challenges when clustering big datasets.
In general, two approaches are used to build nested clusters: divisive clustering and agglomerative clustering.

Divisive approaches start with the entire data as one cluster, and then iteratively split "loose" clusters until a stopping criterion (e.g., k clusters, or tight enough clusters) is satisfied.
Agglomerative approaches start with small tight clusters, or even with single-point clusters, and then iteratively merge close clusters until only a single one remains.

Hierarchical clustering: Dendrogram visualization

A dendrogram is a tree graph that visualizes a sequence of cluster merges or divisions:

(dendrogram figure omitted)

Divisive methods take a top-down root → leaves approach.
Agglomerative ones take a bottom-up leaves → root approach.
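To make the dendrogram concrete, here is a minimal sketch (not from the slides) using SciPy's hierarchical-clustering utilities: `linkage` records the bottom-up merge sequence, and `dendrogram` draws it leaves → root. The toy data and all variable names are illustrative.

```python
# Minimal dendrogram sketch (illustrative; assumes SciPy and matplotlib).
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
# Toy data: three Gaussian blobs in the plane.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(20, 2))
               for c in [(0, 0), (3, 0), (0, 3)]])

# Bottom-up (agglomerative) merge sequence; each row of Z records one merge.
Z = linkage(X, method="average")

dendrogram(Z)  # leaves -> root visualization of the merges
plt.xlabel("data points")
plt.ylabel("merge distance")
plt.show()
```

Cutting the tree at any height yields one particular flat clustering, which is exactly the "all options in a single hierarchy" idea above.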

Hierarchical clustering: Bisecting k-means

Bisecting k-means is a divisive algorithm that applies k-means iteratively to bisect the data into clusters.

Bisecting k-means
  Use 2-means to split the data into two clusters [1]
  While there are fewer than k clusters:
    Select C as the cluster with the highest SSE
    Use 2-means to split C into two clusters [1]
    Replace C with the two new clusters

The hierarchical approach in this case is used to mitigate some of the weaknesses of the original k-means algorithm, and not for data organization purposes.

[1] Choose the best split (by SSE) out of t attempts.
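A minimal sketch of this loop, assuming scikit-learn's `KMeans` for each 2-means bisection (its `n_init=t` argument plays the role of the "t attempts" footnote); the function and variable names here are illustrative, not from the slides.

```python
# Bisecting k-means sketch (illustrative; assumes scikit-learn).
import numpy as np
from sklearn.cluster import KMeans

def sse(points):
    """Sum of squared distances of the points to their centroid."""
    return ((points - points.mean(axis=0)) ** 2).sum()

def bisecting_kmeans(X, k, t=10, seed=0):
    clusters = [X]                       # start with all data as one cluster
    while len(clusters) < k:
        # Select the cluster with the highest SSE...
        i = int(np.argmax([sse(c) for c in clusters]))
        worst = clusters.pop(i)
        # ...and split it with 2-means, keeping the best of t attempts by SSE.
        labels = KMeans(n_clusters=2, n_init=t,
                        random_state=seed).fit_predict(worst)
        clusters += [worst[labels == 0], worst[labels == 1]]
    return clusters

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
parts = bisecting_kmeans(X, k=4)
print([len(p) for p in parts])
```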

Hierarchical clustering: Bisecting k-means

Example (sequence of figures omitted)

Agglomerative clustering

Agglomerative clustering approaches are more popular than divisive ones. They all use variations of the following simple algorithm:

Agglomerative clustering paradigm
  Build a singleton cluster for each data point
  Repeat the following steps:
    Find the two closest clusters
    Merge these two clusters together
  Until there is only a single cluster

Two main choices distinguish agglomerative clustering algorithms:
1 How to quantify proximity between clusters
2 How to merge clusters and efficiently update this proximity

With proper implementation, this approach is also helpful for big-data processing, since each iteration considers a smaller, coarse-grained version of the dataset.
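To make the paradigm concrete, here is a naive sketch (illustrative names; quadratic memory and roughly cubic time, whereas real implementations maintain and update a proximity matrix). The cluster-proximity function is exactly the pluggable choice discussed above; single linkage is used as a placeholder.

```python
# Naive agglomerative clustering sketch (illustrative, not efficient).
import numpy as np
from itertools import combinations

def cluster_dist(A, B):
    """Pluggable cluster proximity; here: single linkage (min pairwise distance)."""
    return min(np.linalg.norm(a - b) for a in A for b in B)

def agglomerate(X):
    clusters = [[x] for x in X]          # a singleton cluster per data point
    merges = []
    while len(clusters) > 1:
        # Find the two closest clusters...
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]]))
        # ...and merge them.
        merges.append((i, j, cluster_dist(clusters[i], clusters[j])))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]                  # j > i, so index i stays valid
    return merges                        # the merge sequence is the dendrogram

rng = np.random.default_rng(0)
print(agglomerate(rng.normal(size=(8, 2)))[:3])
```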


Agglomerative clustering: Linkage

How do we quantify distance or similarity between clusters?

Suggestion #1: represent clusters by centroids and use the distance/similarity between them.

Problem: this approach ignores the shapes of the clusters.

Suggestion #2: combine the pairwise distances between each point in one cluster and each point in the other cluster. This approach is called linkage.

Agglomerative clustering: Single linkage

Single linkage uses the minimal distance (or maximal similarity) between a point in one cluster and a point in the other cluster.

Only one inter-cluster link determines the distance, while many other links can be significantly weaker.
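As a sketch, the single-link distance between two point sets reduces to the minimum entry of their pairwise-distance matrix (here via SciPy's `cdist`; the function name and data are illustrative):

```python
# Single linkage sketch: minimum pairwise distance between two clusters.
import numpy as np
from scipy.spatial.distance import cdist

def single_link(A, B):
    return cdist(A, B).min()   # one shortest inter-cluster link decides everything

A = np.array([[0.0, 0.0], [0.0, 1.0]])
B = np.array([[2.0, 0.0], [5.0, 5.0]])
print(single_link(A, B))       # 2.0: only the closest pair matters
```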


Agglomerative clustering: Complete linkage

Complete linkage uses the maximal distance (or minimal similarity) between a point in one cluster and a point in the other cluster.

In some sense, all inter-cluster links are considered, since all of them must be strong (short) for the cluster distance to be small.


Agglomerative clustering: Average linkage

Average linkage uses the mean distance (or similarity) between points in one cluster and points in the other cluster.

It is less susceptible to noise and outliers than single and complete linkage, but biased toward globular clusters.
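Complete and average linkage are the same computation as single linkage with a different reduction over the same pairwise-distance matrix; a sketch mirroring the `single_link` function above:

```python
# Complete and average linkage as reductions of the pairwise-distance matrix.
import numpy as np
from scipy.spatial.distance import cdist

def complete_link(A, B):
    return cdist(A, B).max()    # the weakest (longest) link decides the distance

def average_link(A, B):
    return cdist(A, B).mean()   # mean over all |A|*|B| inter-cluster links

A = np.array([[0.0, 0.0], [0.0, 1.0]])
B = np.array([[2.0, 0.0], [5.0, 5.0]])
print(complete_link(A, B), average_link(A, B))
```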


Agglomerative clustering: Ward's method

Instead of considering connectivity between clusters, we can also consider the impact of merging clusters on their quality.

Ward's method compares the total SSE of the two clusters to the SSE of the single cluster obtained by merging them.

It is similar to average linkage with squared distances as dissimilarities.
Like average linkage, it is biased toward globular clusters while being somewhat stable to noise and outliers.

Ward's method provides an agglomerative/hierarchical analogue of k-means.
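A sketch of Ward's merge criterion, with illustrative names: the cost of a merge is the increase in total SSE it causes. The same quantity also has a closed form in terms of the two centroids and sizes, which is what makes efficient proximity updates possible.

```python
# Ward's criterion sketch: SSE increase caused by merging two clusters.
import numpy as np

def sse(P):
    return ((P - P.mean(axis=0)) ** 2).sum()

def ward_cost(A, B):
    """Increase in total SSE if clusters A and B are merged into one."""
    return sse(np.vstack([A, B])) - sse(A) - sse(B)

rng = np.random.default_rng(0)
A = rng.normal(0.0, 0.5, size=(30, 2))
B = rng.normal(3.0, 0.5, size=(40, 2))

# Closed form: (|A||B| / (|A|+|B|)) * ||centroid(A) - centroid(B)||^2,
# which agrees with the direct computation above.
m, n = len(A), len(B)
closed = m * n / (m + n) * np.sum((A.mean(0) - B.mean(0)) ** 2)
print(ward_cost(A, B), closed)
```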

Large-scale clustering

Hierarchical clustering is useful not only for data organization, but also for large-scale data processing, even without special interpretability.

A common approach for clustering big data is to iteratively coarse-grain the data to reduce its size, until a desired resolution (e.g., number or size of clusters) is reached. Each coarse-graining iteration is achieved by finding (and merging) small tight clusters.

Two of the main challenges in implementing such approaches are:

1 Finding a compact representation of clusters that allows merging and comparisons,

2 An efficient data-scanning strategy during the initial cluster construction process and the coarse-graining iterations.

Such methods also apply various advanced implementation techniques that are beyond the scope of this course.

Large-scale clustering: CURE

CURE (Clustering Using REpresentatives) extends the idea of k-means by choosing a small set of r points to represent each cluster instead of a single centroid.

Given a cluster, CURE chooses a set of representative points {x1, . . . , xr} using the following steps:

  Compute the centroid c of the cluster.
  Set x1 to be the point in the cluster farthest from c.
  For i = 2, . . . , r, set xi to be the point farthest from all previous representatives x1, . . . , xi−1.

Notice that these representatives aim to capture the border of the cluster rather than its center, unlike a k-means centroid.

Large-scale clustering: CURE

Using border points as cluster representatives allows CURE to capture non-globular and concave-shaped clusters.

However, the farthest-point selection is sensitive to noise and outliers, so CURE "shrinks" these points toward the cluster center.

Each representative xi is replaced with x̃i = xi − α(xi − c). Since the shrinkage is proportional to the distance from the center, outliers are affected by it more than other points.

The shrinkage factor α controls the correction magnitude; setting α = 1 gives the classic centroid-based cluster representation.
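A sketch of both steps (farthest-point selection followed by the shrinkage toward the centroid), assuming Euclidean distance and interpreting "farthest from all previous representatives" as maximizing the minimum distance to them, as in farthest-point sampling; all names are illustrative.

```python
# CURE representative selection and shrinkage sketch (illustrative).
import numpy as np

def cure_representatives(P, r=4, alpha=0.2):
    c = P.mean(axis=0)
    # x1: the point farthest from the centroid.
    reps = [P[np.argmax(np.linalg.norm(P - c, axis=1))]]
    # x2..xr: each maximizing the minimum distance to the chosen representatives.
    for _ in range(r - 1):
        d = np.min([np.linalg.norm(P - x, axis=1) for x in reps], axis=0)
        reps.append(P[np.argmax(d)])
    reps = np.array(reps)
    # Shrink toward the center: x_i <- x_i - alpha * (x_i - c).
    # Farther points (e.g., outliers) move more; alpha=1 collapses all reps to c.
    return reps - alpha * (reps - c)

rng = np.random.default_rng(0)
P = rng.normal(size=(100, 2))
print(cure_representatives(P, r=4, alpha=0.2))
```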

Large-scale clustering: CURE

Using the cluster representatives, CURE applies a single-link agglomerative clustering approach, based on the minimal distance between representatives.

Additionally, instead of clustering the entire data at once, CURE partitions the dataset into smaller local partitions. The agglomerative clustering is then applied to each of them (e.g., in parallel). Finally, the coarse-grained data from these clusterings is merged together, and agglomerative clustering is applied to this cluster collection.

The hierarchical approach in CURE is mainly aimed at coping with computational challenges, rather than at finding a hierarchical data organization.

Large-scale clustering: CURE

The full CURE algorithm also includes sampling and outlier-removal steps. (pipeline figure omitted)

More details can be found in: CURE: an efficient clustering algorithm for large databases (Guha, Rastogi, & Shim, 1998).

Large-scale clustering: BIRCH

BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is an efficient centroid-based clustering method that aims to reduce the memory-related overheads of the clustering process.

The main principle in BIRCH is to use a single scan of the data to produce tight clusters that can then be iteratively merged to form a cluster hierarchy.

To enable this scan, the algorithm requires a compressed in-memory data structure with efficient amortized insertion time for each newly scanned data point.

Large-scale clustering: BIRCH

The BIRCH algorithm introduces the notion of Clustering Features (CF) to compactly represent clusters:

Cluster features
Given a cluster C = {x1, . . . , xm} ⊆ ℝⁿ, its cluster features are the triple CF = (m, LS, SS) ∈ ℝ × ℝⁿ × ℝⁿ, where LS = x1 + · · · + xm (the linear sum) and SS[j] = Σᵢ (xi[j])² (the per-coordinate sum of squares), j = 1, . . . , n.

Notice that given two disjoint clusters C1 and C2, the features of C1 ∪ C2 are easily merged as CF1,2 = CF1 + CF2 (entrywise addition).
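A sketch of the CF triple and its additive merge (illustrative names, not from the BIRCH paper):

```python
# Clustering-feature (CF) sketch: (m, LS, SS) with additive merging.
import numpy as np
from dataclasses import dataclass

@dataclass
class CF:
    m: int            # number of points
    LS: np.ndarray    # linear sum of the points
    SS: np.ndarray    # per-coordinate sum of squares

    @staticmethod
    def of(P):
        return CF(len(P), P.sum(axis=0), (P ** 2).sum(axis=0))

    def __add__(self, other):
        # CF of a disjoint union is the entrywise sum of the two CFs.
        return CF(self.m + other.m, self.LS + other.LS, self.SS + other.SS)

rng = np.random.default_rng(0)
P1, P2 = rng.normal(size=(10, 3)), rng.normal(size=(20, 3))
merged = CF.of(P1) + CF.of(P2)
print(np.allclose(merged.LS, CF.of(np.vstack([P1, P2])).LS))  # True
```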

Large-scale clustering: BIRCH

Not only are the CFs easily merged, but they also hold sufficient information for computing many important cluster properties, such as the centroid, radius, diameter, and SSE.

Examples

Centroid: c = (1/m) Σ_{x∈C} x = (1/m) LS.

Radius: consider R² = (1/m) Σ_{x∈C} ‖x − c‖². Then

  Σ_{x∈C} ‖x − c‖² = Σ_{x∈C} ‖x‖² + m‖c‖² − 2 Σ_{x∈C} ⟨x, c⟩,

but Σ_{x∈C} ⟨x, c⟩ = ⟨Σ_{x∈C} x, c⟩ = m‖c‖², and Σ_{x∈C} ‖x‖² = Σ_{x∈C} Σⱼ (x[j])² = ‖SS‖₁. Thus we get

  R = √( (1/m) ‖SS‖₁ − (1/m²) ‖LS‖₂² ).
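A quick numerical check of these identities (a sketch with illustrative names, mirroring the CF triple above):

```python
# Check: centroid and radius recovered from (m, LS, SS) alone.
import numpy as np

rng = np.random.default_rng(0)
P = rng.normal(size=(50, 4))
m, LS, SS = len(P), P.sum(axis=0), (P ** 2).sum(axis=0)

c = LS / m                                              # centroid from the CF
R = np.sqrt(SS.sum() / m - (LS ** 2).sum() / m ** 2)    # radius from the CF

print(np.allclose(c, P.mean(axis=0)))                     # True
print(np.isclose(R, np.sqrt(((P - c) ** 2).sum() / m)))   # True
```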

Large-scale clustering: BIRCH

The clusters in BIRCH are built incrementally by scanning the dataset and inserting each data point into the closest cluster. These insertions amount to simple updates of the cluster's CF.

However, the clusters are constrained to have a bounded diameter, and if no cluster can absorb a data point, the scan creates a new singleton cluster.

To enable efficient nearest-cluster searches, the CFs are organized in a balanced CF-tree. The size of the tree is determined by technical considerations (e.g., memory size) in order to minimize the paging and I/O overheads of the scanning process.
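A flat sketch of the absorb-or-create decision (a plain list of CFs, ignoring the tree for clarity; all names illustrative). The diameter is computed from the CF alone via D² = (2m·ΣSS − 2‖LS‖²)/(m(m−1)), which follows the same kind of derivation as the radius above.

```python
# BIRCH-style absorb-or-create sketch (flat list of CFs; no tree, illustrative).
import numpy as np

def diameter_sq(m, LS, SS):
    """Average squared inter-point distance of a cluster, from its CF alone."""
    if m < 2:
        return 0.0
    return (2 * m * SS.sum() - 2 * (LS ** 2).sum()) / (m * (m - 1))

def insert_point(cfs, x, max_diameter):
    """Absorb x into the closest CF if the diameter bound allows; else new CF."""
    if cfs:
        # Closest cluster by centroid distance (centroid = LS / m).
        i = min(range(len(cfs)),
                key=lambda j: np.linalg.norm(cfs[j][1] / cfs[j][0] - x))
        m, LS, SS = cfs[i]
        cand = (m + 1, LS + x, SS + x ** 2)       # CF after absorbing x
        if diameter_sq(*cand) <= max_diameter ** 2:
            cfs[i] = cand                         # absorb: just update the CF
            return
    cfs.append((1, x.copy(), x ** 2))             # otherwise: new singleton CF

rng = np.random.default_rng(0)
cfs = []
for x in rng.normal(size=(200, 2)):
    insert_point(cfs, x, max_diameter=1.0)
print(len(cfs), "leaf clusters")
```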

Large-scale clustering: BIRCH

Each node in the tree is limited to hold at most B clusters.

The leaves of this tree hold tight clusters, while the other nodes hold super-clusters, which also correspond to branches & child nodes.

Large-scale clustering: BIRCH

When a new point is scanned, the algorithm recursively finds the closest CF in each node (starting from the root) and follows the corresponding branch, traversing the CF tree until the closest CF is found in a leaf node.

Once a CF is found in a leaf node, the algorithm checks whether it can absorb the data point under the bounded-diameter constraint. If the data point is absorbed by a cluster, its CF is updated accordingly; otherwise a new CF is created in the leaf node.

If the leaf node now has more than B clusters, it is split in two, and the CF entries in its parent node are updated to replace the CF entry of the removed branch with two CF entries for the added branches.

In any case, each update also triggers updates in all the ancestor nodes on the path toward the root of the CF tree. Similar to leaf-node updates, these updates may cause some internal nodes to be split. If the root splits, it becomes an internal node and a new root is created.

Large-scale clustering: BIRCH

The full BIRCH algorithm uses the following steps: (pipeline figure omitted)

More details can be found in: BIRCH: an efficient data clustering method for very large databases (Zhang, Ramakrishnan, & Livny, 1996).

Large-scale clustering: Chameleon

Chameleon uses graph partitioning together with graph-oriented agglomerative clustering to enable efficient and robust clustering of big datasets. It follows three main phases:

Preprocessing phase: Chameleon starts by computing a sparse k-NN graph to capture local relationships between data points.

Notice that k-NN neighborhoods are more robust in variable-density data.
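A sketch of this preprocessing step using scikit-learn's `kneighbors_graph` to build the sparse k-NN graph (the subsequent partitioning phase would need a graph-partitioning library such as METIS, which is outside this sketch; data and parameters are illustrative):

```python
# Sparse k-NN graph sketch for Chameleon's preprocessing phase.
import numpy as np
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))

# Sparse adjacency with edge weights = distances to each point's k neighbors.
G = kneighbors_graph(X, n_neighbors=10, mode="distance")
print(G.shape, G.nnz)   # (300, 300), ~300*10 stored edges
```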

Large-scale clustering: Chameleon

Partitioning phase: multilevel graph partitioning is used to find many well-connected clusters in the data.

The working assumption of the algorithm is that these should be subclusters of the true data clusters.

Hierarchical phase: agglomerative clustering is applied to iteratively merge (sub)clusters.

Instead of using linkage, Chameleon considers the interconnectivity and closeness between two clusters.
Each iteration merges a pair of clusters with the highest relative interconnectivity and relative closeness.

Large-scale clustering: Chameleon

Interconnectivity between clusters is defined as the sum of the edge weights that cross from one cluster to the other.

Closeness between clusters is defined as the average of these weights.

The relative version of each quantity is obtained by normalizing with the average of the corresponding quantity measured over bisections that split each cluster into two equal-size parts.

More details can be found in: Chameleon: Hierarchical clustering using dynamic modeling (Karypis, Han, & Kumar, 1999).
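A sketch of the absolute (non-relative) quantities, computed from a weighted adjacency matrix W of similarities and an index set per cluster; the relative normalization over internal bisections is omitted, and averaging over existing crossing edges (rather than all pairs) is an assumption of this sketch.

```python
# Chameleon-style interconnectivity and closeness sketch (absolute versions).
import numpy as np

def interconnectivity(W, a, b):
    """Sum of edge weights crossing between clusters a and b."""
    return W[np.ix_(a, b)].sum()

def closeness(W, a, b):
    """Average weight of the crossing edges (zeros treated as absent edges)."""
    cross = W[np.ix_(a, b)]
    nz = cross[cross > 0]
    return nz.mean() if nz.size else 0.0

# Toy 4-node similarity graph: two tight pairs with one weak crossing edge.
W = np.array([[0.0, 0.9, 0.1, 0.0],
              [0.9, 0.0, 0.0, 0.0],
              [0.1, 0.0, 0.0, 0.8],
              [0.0, 0.0, 0.8, 0.0]])
a, b = [0, 1], [2, 3]
print(interconnectivity(W, a, b), closeness(W, a, b))  # 0.1 0.1
```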

Summary

Hierarchical clustering provides a multiscale data organization.

Dendrograms are typically used for visualizing the recovered nested cluster structure.
Agglomerative clustering is a popular approach for building cluster hierarchies, based either on linkage or on the impact of merges on a suitable cluster-quality measure.
It is also useful as a coarse-graining tool for scalable data processing.
CURE performs scalable clustering based on sampling, partitioning, and representative selection.
BIRCH uses an efficient CF-tree construction to optimize the memory-handling overheads of the clustering process.
Chameleon is based on sparse-graph partitioning and an alternative cluster proximity that combines closeness and interconnectivity.

Many different methods are available to improve & extend these principles in both general and application-specific settings.