
K-Center and Dendrogram Clustering

Jia Li
Department of Statistics, The Pennsylvania State University
Email: [email protected]
http://www.stat.psu.edu/∼jiali


K-center Clustering

▶ Let A be a set of n objects.
▶ Partition A into K sets C_1, C_2, ..., C_K.
▶ Cluster size of C_k: the least value D for which all points in C_k are
  1. within distance D of each other, or
  2. within distance D/2 of some point called the cluster center.
▶ Let the cluster size of C_k be D_k.
▶ The cluster size of the partition S is

    D = max_{k=1,...,K} D_k .

▶ Goal: given K, find min_S D(S) (see the sketch below).
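To make the objective concrete, here is a minimal Python sketch (the function names are illustrative) that evaluates the pairwise-sense cluster size D_k and the partition size D:

    import itertools

    def cluster_size(cluster, dist):
        # Pairwise cluster size D_k: largest distance between two objects in the cluster
        return max((dist(x, y) for x, y in itertools.combinations(cluster, 2)),
                   default=0.0)

    def partition_size(partition, dist):
        # Cluster size D of a partition: the worst D_k over all clusters
        return max(cluster_size(c, dist) for c in partition)

    # Example with Euclidean distance on 2-D points
    euclid = lambda a, b: ((a[0] - b[0])**2 + (a[1] - b[1])**2) ** 0.5
    print(partition_size([[(0, 0), (1, 0)], [(5, 5), (9, 5)]], euclid))  # 4.0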


Comparison with k-means

▶ Assume the distance between vectors is the squared Euclidean distance.
▶ K-means:

    min_S ∑_{k=1}^{K} ∑_{i: x_i ∈ C_k} (x_i − μ_k)^T (x_i − μ_k)

  where μ_k is the centroid of cluster C_k. In particular,

    μ_k = (1/N_k) ∑_{i: x_i ∈ C_k} x_i .

▶ K-center:

    min_S max_{k=1,...,K} max_{i: x_i ∈ C_k} (x_i − μ_k)^T (x_i − μ_k) ,

  where μ_k is called the "centroid", but may not be the mean vector.
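The two objectives differ only in how per-point errors are aggregated; a minimal NumPy sketch (illustrative, with cluster means standing in for the centroids):

    import numpy as np

    def kmeans_objective(clusters):
        # Sum of squared distances of every point to its cluster mean
        return sum(((c - c.mean(axis=0)) ** 2).sum() for c in clusters)

    def kcenter_objective(clusters):
        # Worst squared distance of any point to its cluster mean
        return max((((c - c.mean(axis=0)) ** 2).sum(axis=1)).max() for c in clusters)

    clusters = [np.array([[0., 0.], [1., 0.]]),
                np.array([[5., 5.], [6., 5.], [9., 5.]])]
    print(kmeans_objective(clusters))   # aggregates all errors (average behavior)
    print(kcenter_objective(clusters))  # only the single worst point matters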


▶ Another formulation of k-center:

    min_S max_{k=1,...,K} max_{i,j: x_i, x_j ∈ C_k} L(x_i, x_j) ,

  where L(x_i, x_j) denotes any distance between a pair of objects.


[Figure: original unclustered data]


[Figure: clustering by k-means. K-means focuses on the average distance.]

[Figure: clustering by k-center. K-center focuses on the worst case.]


Greedy Algorithm

▶ Choose a subset H from S consisting of K points that are farthest apart from each other.
▶ Each point h_k ∈ H represents one cluster C_k.
▶ Point x_i is assigned to cluster C_k if

    L(x_i, h_k) = min_{k'=1,...,K} L(x_i, h_{k'}) .

▶ Only the pairwise distances L(x_i, x_j) for x_i, x_j ∈ S are needed. Hence, x_i can be a non-vector representation of the objects.


▶ The greedy algorithm achieves an approximation factor of 2 as long as the distance measure L satisfies the triangle inequality. That is, if

    D* = min_S max_{k=1,...,K} max_{i,j: x_i, x_j ∈ C_k} L(x_i, x_j) ,

  then the greedy algorithm guarantees that

    D ≤ 2D* .

▶ The relation also holds if the cluster size is defined in the sense of centralized clustering.


Pseudo Code

▶ H denotes the set of cluster representative objects {h_1, ..., h_K} ⊂ S.
▶ Let cluster(x_i) be the identity of the cluster that x_i ∈ S belongs to.
▶ Let dist(x_i) be the distance between x_i and its closest cluster representative object:

    dist(x_i) = min_{h_j ∈ H} L(x_i, h_j) .


▶ Pseudo code (a runnable Python version follows):

  1. Randomly select an object x_j from S; let h_1 = x_j and H = {h_1}.
  2. for j = 1 to n
         dist(x_j) = L(x_j, h_1)
         cluster(x_j) = 1
  3. for i = 2 to K
         D = max_{x_j ∈ S \ H} dist(x_j)
         choose h_i ∈ S \ H such that dist(h_i) = D
         H = H ∪ {h_i}
         for j = 1 to n
             if L(x_j, h_i) ≤ dist(x_j)
                 dist(x_j) = L(x_j, h_i)
                 cluster(x_j) = i
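A direct transcription of the pseudo code in Python (a minimal sketch; greedy_kcenter and dist_fn are illustrative names, and dist_fn can be any pairwise distance L; cluster labels are 0-based here):

    import random

    def greedy_kcenter(S, K, dist_fn):
        # Step 1: pick a random first representative
        h1 = random.randrange(len(S))
        H = [h1]
        # Step 2: distance of every object to its (only) representative
        dist = [dist_fn(S[j], S[h1]) for j in range(len(S))]
        cluster = [0] * len(S)
        # Step 3: repeatedly add the object farthest from all current representatives
        for i in range(1, K):
            hi = max((j for j in range(len(S)) if j not in H),
                     key=lambda j: dist[j])
            H.append(hi)
            for j in range(len(S)):
                d = dist_fn(S[j], S[hi])
                if d <= dist[j]:        # new representative is at least as close
                    dist[j] = d
                    cluster[j] = i
        return [S[j] for j in H], cluster

    # Example: 2 clusters of 2-D points under Euclidean distance
    euclid = lambda a, b: ((a[0] - b[0])**2 + (a[1] - b[1])**2) ** 0.5
    centers, labels = greedy_kcenter([(0, 0), (1, 0), (5, 5), (6, 5)], 2, euclid)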


Algorithm Property

▶ The running time of the algorithm is O(Kn).
▶ Let the partition obtained by the greedy algorithm be S̃ and the optimal partition be S*.
▶ Let the cluster size of S̃ be D̃ and that of S* be D*, where cluster size is defined in the pairwise-distance sense.
▶ It can be proved that D̃ ≤ 2D*.
▶ The approximation factor of 2 also holds if the cluster size of a partition S is defined in the sense of centralized clustering.


Key Ideas for Proof

▶ Let D_j be the cluster size of the partition generated by {h_1, ..., h_j}.
▶ D_1 ≥ D_2 ≥ D_3 ≥ ··· .
▶ For all i < j, L(h_i, h_j) ≥ D_{j−1}. For every j, there exists i < j such that L(h_i, h_j) = D_{j−1}.
▶ Consider the optimal partition S* with K clusters and the minimum size D*. Suppose the greedy algorithm generates the centroids {h_1, ..., h_K, h_{K+1}}.
▶ By the pigeonhole principle, at least two of these centroids fall into one cluster of the partition S*; call them h_i and h_j with 1 ≤ i < j ≤ K + 1. Then L(h_i, h_j) ≤ 2D*, by the triangle inequality and the fact that they lie in the same cluster. Also, L(h_i, h_j) ≥ D_{j−1} ≥ D_K. Thus D_K ≤ 2D*.


Proof

▶ Let δ̃ = max_{x_j ∈ S \ H} min_{h_k ∈ H} L(x_j, h_k).
▶ Let h_{K+1} be the object in S \ H such that min_{h_k ∈ H} L(h_{K+1}, h_k) = δ̃.
▶ By definition, L(h_{K+1}, h_k) ≥ δ̃ for all k = 1, ..., K.
▶ Let H_k = {h_1, ..., h_k}, for k = 1, 2, ..., K.


▶ Consider the distance between any h_i and h_j, with i < j ≤ K without loss of generality. According to the greedy algorithm,

    min_{h_k ∈ H_{j−1}} L(h_j, h_k) ≥ min_{h_k ∈ H_{j−1}} L(x_l, h_k)

  for any x_l ∈ S \ H_j. Since h_{K+1} ∈ S \ H and S \ H ⊂ S \ H_j,

    L(h_j, h_i) ≥ min_{h_k ∈ H_{j−1}} L(h_j, h_k)
               ≥ min_{h_k ∈ H_{j−1}} L(h_{K+1}, h_k)
               ≥ min_{h_k ∈ H} L(h_{K+1}, h_k)
               = δ̃ .


▶ We have shown that for any i < j ≤ K + 1,

    L(h_i, h_j) ≥ δ̃ .

▶ Consider the partition C*_1, C*_2, ..., C*_K formed by S*. At least two of the K + 1 objects h_1, ..., h_{K+1} will be covered by one cluster. Without loss of generality, assume h_i and h_j belong to the same cluster in S*. Then L(h_i, h_j) ≤ D*.
▶ Since L(h_i, h_j) ≥ δ̃, it follows that δ̃ ≤ D*.
▶ Consider any two objects x_η and x_ζ in any cluster represented by h_k. By the definition of δ̃, L(x_η, h_k) ≤ δ̃ and L(x_ζ, h_k) ≤ δ̃. Hence, by the triangle inequality,

    L(x_η, x_ζ) ≤ L(x_η, h_k) + L(x_ζ, h_k) ≤ 2δ̃ .

▶ Hence

    D̃ ≤ 2δ̃ ≤ 2D* .


▶ For centralized clustering:
  ▶ Let D̃ = max_{k=1,...,K} max_{x_j ∈ C̃_k} L(x_j, h_k). Define D* similarly.
  ▶ The step "L(h_i, h_j) ≤ D*" in the proof modifies to L(h_i, h_j) ≤ 2D*, by the triangle inequality.
  ▶ D̃ = δ̃ ≤ L(h_i, h_j) ≤ 2D*.
▶ A step-by-step illustration of the k-center clustering is provided next.

[Figures: step-by-step greedy k-center clustering of the data into 2, 3, and 4 clusters]

Applications to Image Segmentation

[Figure: four panels. Original image; segmentation using K-means with LBG initialization; segmentation using K-center; segmentation by K-means using K-center for initialization.]

[Figure: scatter plots for LUV color components with K-center clustering]

[Figure: K-means with LBG initialization]

[Figure: K-means with K-center initialization]

[Figures: comparison of segmentation results. Left: original images. Middle: K-means with K-center initialization. Right: K-means with LBG initialization using the same number of clusters as in the K-center case.]


Agglomerative Clustering

▶ Generate clusters in a hierarchical way.
▶ Let the data set be A = {x_1, ..., x_n}.
▶ Start with n clusters, each containing one data point.
▶ Merge the two clusters with the minimum pairwise distance.
▶ Update the between-cluster distances.
▶ Iterate the merging procedure.
▶ The clustering procedure can be visualized by a tree structure called a dendrogram (see the sketch after this list).
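As an aside, a minimal sketch of this whole pipeline using SciPy (assuming scipy and matplotlib are installed; linkage performs the agglomerative merging and dendrogram draws the tree):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

    X = np.random.rand(100, 2)                       # 100 two-dimensional points
    Z = linkage(X, method='single')                  # agglomerative merges, single-link
    labels = fcluster(Z, t=9, criterion='maxclust')  # cut the tree into 9 clusters
    dendrogram(Z)                                    # visualize the merge history
    plt.show()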


▶ How should the between-cluster distance be defined?
  ▶ For clusters containing only one data point, the between-cluster distance is the between-object distance.
  ▶ For clusters containing multiple data points, the between-cluster distance is an agglomerative version of the between-object distances. Examples: the minimum or maximum between-object distance for objects in the two clusters.
  ▶ The agglomerative between-cluster distance can often be computed recursively.


Example Distances

▶ Suppose clusters r and s are merged into a new cluster t. Let k be any other cluster.
▶ Denote the between-cluster distance by D(·, ·).
▶ How do we get D(t, k) from D(r, k) and D(s, k)?
▶ Single-link clustering:

    D(t, k) = min(D(r, k), D(s, k))

  D(t, k) is the minimum distance between two objects in clusters t and k respectively.
▶ Complete-link clustering:

    D(t, k) = max(D(r, k), D(s, k))

  D(t, k) is the maximum distance between two objects in clusters t and k respectively.


▶ How do we get D(t, k) from D(r, k) and D(s, k)?
▶ Average linkage clustering:

  Unweighted case:

    D(t, k) = (n_r / (n_r + n_s)) D(r, k) + (n_s / (n_r + n_s)) D(s, k)

  Weighted case:

    D(t, k) = (1/2) D(r, k) + (1/2) D(s, k)

  D(t, k) is the average distance between two objects in clusters t and k respectively.
  In the unweighted case, the number of elements in each cluster is taken into consideration, while in the weighted case each cluster is weighted equally; objects in smaller clusters are therefore weighted more heavily than those in larger clusters.


▶ How do we get D(t, k) from D(r, k) and D(s, k)?
▶ Centroid clustering:

  Unweighted case:

    D(t, k) = (n_r / (n_r + n_s)) D(r, k) + (n_s / (n_r + n_s)) D(s, k)
              − (n_r n_s / (n_r + n_s)²) D(r, s)

  Weighted case:

    D(t, k) = (1/2) D(r, k) + (1/2) D(s, k) − (1/4) D(r, s)

  A centroid is computed for each cluster, and the distance between clusters is given by the distance between their respective centroids.


▶ How do we get D(t, k) from D(r, k) and D(s, k)?
▶ Ward's clustering:

    D(t, k) = ((n_r + n_k) / (n_r + n_s + n_k)) D(r, k)
              + ((n_s + n_k) / (n_r + n_s + n_k)) D(s, k)
              − (n_k / (n_r + n_s + n_k)) D(r, s)

  Merge the two clusters for which the change in the variance of the clustering is minimized. The variance of a cluster is defined as the sum of squared errors between each object in the cluster and the centroid of the cluster.
▶ The dendrogram generated by single-link clustering tends to look like a chain. Clusters generated by complete-link may not be well separated. Other methods are intermediate between the two. (All of these update rules are instances of one recursion; see the sketch below.)
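Each linkage above reduces to the same recursive update of D(t, k); a minimal sketch (the function name is illustrative) collecting the rules from the preceding slides:

    def update_distance(method, D_rk, D_sk, D_rs, n_r, n_s, n_k):
        # Recursive between-cluster distance D(t, k) after merging r and s into t
        if method == 'single':
            return min(D_rk, D_sk)
        if method == 'complete':
            return max(D_rk, D_sk)
        if method == 'average':      # unweighted average linkage
            return (n_r * D_rk + n_s * D_sk) / (n_r + n_s)
        if method == 'centroid':     # unweighted centroid clustering
            n_t = n_r + n_s
            return (n_r * D_rk + n_s * D_sk) / n_t - (n_r * n_s / n_t**2) * D_rs
        if method == 'ward':
            n_t = n_r + n_s + n_k
            return ((n_r + n_k) * D_rk + (n_s + n_k) * D_sk - n_k * D_rs) / n_t
        raise ValueError(f"unknown method: {method}")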


Pseudo Code

1. Begin with n clusters, each containing one object. Number the clusters 1 through n.
2. Compute the between-cluster distance D(r, s) as the between-object distance of the two objects in r and s respectively, for r, s = 1, 2, ..., n. Let the square matrix D = (D(r, s)).
3. Find the most similar pair of clusters r and s, that is, the pair whose D(r, s) is minimum among all the pairwise distances.
4. Merge r and s into a new cluster t. Compute the between-cluster distance D(t, k) for all k ≠ r, s. Delete the rows and columns corresponding to r and s in D, and add a new row and column corresponding to cluster t.
5. Repeat Steps 3 and 4 a total of n − 1 times, until there is only one cluster left.
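A direct, if naive, Python transcription of these steps (a sketch; it reuses the illustrative update_distance from above and returns the merge history that a dendrogram would draw):

    import itertools

    def agglomerate(objects, dist_fn, method='single'):
        # Step 1: one singleton cluster per object
        clusters = {i: [i] for i in range(len(objects))}
        # Step 2: initial between-cluster distances, stored as a dict keyed by (r, s)
        D = {(r, s): dist_fn(objects[r], objects[s])
             for r, s in itertools.combinations(range(len(objects)), 2)}
        merges, t = [], len(objects)
        while len(clusters) > 1:
            r, s = min(D, key=D.get)          # Step 3: most similar pair
            merges.append((r, s, D[(r, s)]))
            for k in clusters:                # Step 4: distances to the new cluster t
                if k not in (r, s):
                    D[(k, t)] = update_distance(
                        method,
                        D[(min(r, k), max(r, k))], D[(min(s, k), max(s, k))],
                        D[(r, s)],
                        len(clusters[r]), len(clusters[s]), len(clusters[k]))
            clusters[t] = clusters.pop(r) + clusters.pop(s)
            D = {pair: d for pair, d in D.items()
                 if r not in pair and s not in pair}
            t += 1                            # Step 5: repeat until one cluster remains
        return merges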


[Figure: agglomerative clustering of a data set (100 points) into 9 clusters. Left: single-link. Right: complete-link.]


[Figure: agglomerative clustering of a data set (100 points) into 9 clusters. Left: average linkage. Right: Ward's clustering.]
