Data Clustering
Héctor Corrada Bravo
University of Maryland, College Park, USA
CMSC 644: 2019-03-06


Motivating Example
Time series dataset of mortgage affordability as calculated and distributed by Zillow: https://www.zillow.com/research/data/.

The dataset consists of monthly mortgage affordability values for 76 counties with data from 1979 to 2017.


Motivating Example"To calculate mortgage affordability, we first calculate themortgage payment for the median­valued home in ametropolitan area by using the metro­level Zillow Home ValueIndex for a given quarter and the 30­year fixed mortgageinterest rate during that time period, provided by the FreddieMac Primary Mortgage Market Survey (based on a 20 percentdown payment)."


Motivating Example"Then, we consider what portion of the monthly medianhousehold income (U.S. Census) goes toward this monthlymortgage payment. Median household income is available witha lag. "


Motivating Example
Can we partition counties into groups of counties with similar value trends across time?


Cluster Analysis
The high-level goal of cluster analysis is to organize objects (observations) that are similar to each other into groups.

We want objects within a group to be more similar to each other than objects in different groups.

Central to this high-level goal is how to measure the degree of similarity between objects.

A clustering method then uses the similarity measure provided to it to group objects into clusters.


Cluster Analysis
Result of the k-means algorithm partitioning the data into 9 clusters.

The darker series within each cluster shows the average time series within the cluster.


Dissimilarity-based Clustering
For certain algorithms, instead of similarity we work with dissimilarity, often represented as distances.

When we have observations defined over attributes, or predictors, we define dissimilarity based on these attributes.


Dissimilarity-based Clustering
Given measurements $x_{ij}$ for observations $i = 1, \ldots, N$ over predictors $j = 1, \ldots, p$.

Suppose we define a per-predictor dissimilarity $d_j(x_{ij}, x_{i'j})$; we can then define a dissimilarity between objects as

$d(x_i, x_{i'}) = \sum_{j=1}^{p} d_j(x_{ij}, x_{i'j})$


Dissimilarity-based Clustering
In the k-means algorithm, and many other algorithms, the most common usage is squared distance

$d_j(x_{ij}, x_{i'j}) = (x_{ij} - x_{i'j})^2$

We can use different dissimilarities, for example

$d_j(x_{ij}, x_{i'j}) = |x_{ij} - x_{i'j}|$

which may affect our choice of clustering algorithm later on.
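As a minimal sketch (the function and toy vectors below are illustrative, not from the slides), the per-predictor dissimilarities combine into an object-level dissimilarity like this:

    import numpy as np

    def dissimilarity(x, y, per_feature="squared"):
        """Total dissimilarity d(x, y) = sum_j d_j(x_j, y_j)."""
        x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
        if per_feature == "squared":
            dj = (x - y) ** 2        # d_j = (x_ij - x_i'j)^2
        elif per_feature == "absolute":
            dj = np.abs(x - y)       # d_j = |x_ij - x_i'j|
        else:
            raise ValueError(per_feature)
        return dj.sum()

    # toy example: two observations with p = 3 predictors
    xi, xip = [1.0, 2.0, 0.5], [0.0, 2.5, 0.5]
    print(dissimilarity(xi, xip, "squared"))   # 1.25
    print(dissimilarity(xi, xip, "absolute"))  # 1.5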


Dissimilarity-based Clustering
For categorical variables, we could set

$d_j(x_{ij}, x_{i'j}) = \begin{cases} 0 & \text{if } x_{ij} = x_{i'j} \\ 1 & \text{otherwise} \end{cases}$


Dissimilarity-based Clustering
If the values of the categorical variable have an intrinsic similarity, generalize using a symmetric matrix $L$ with elements

$L_{rr'} = L_{r'r}$, $L_{rr} = 0$, and $L_{rr'} \geq 0$ otherwise.

This may of course lead to a dissimilarity that is not a proper distance.


Limitations of Distance-based clustering
When working with distances, we are quickly confronted with the curse of dimensionality.

One flavor: in high dimensions, all points are equidistant.


The curse of dimensionality
Consider the case where we have many covariates. We want to use a distance-based clustering method.

Basically, we need to define distance and look for small multi-dimensional "balls" around the data points. With many covariates this becomes difficult.


The curse of dimensionality
Imagine we have equally spaced data and that each covariate is in $[0, 1]$. We want something that clusters points into local regions, containing some reasonably small amount of data points (say 10%). Let's imagine these are high-dimensional cubes.

If we have $p$ covariates and we are forming $p$-dimensional cubes, then each side of the cube must have size $l$ determined by $l \times l \times \cdots \times l = l^p = 0.10$.


The curse of dimensionality
If the number of covariates is $p = 10$, then $l = 0.1^{1/10} \approx 0.8$. So it really isn't local! If we reduce the percent of data we consider to 1%, $l \approx 0.63$. Still not very local.

If we keep reducing the size of the neighborhoods we will end up with a very small number of data points in each cluster and require a large number of clusters.
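A quick numeric check of this calculation:

    # side length of a p-dimensional cube that captures a fraction `frac` of uniform data
    for p in (1, 2, 10, 100):
        for frac in (0.10, 0.01):
            l = frac ** (1 / p)
            print(f"p={p:>3}, fraction={frac:.2f} -> side length l={l:.2f}")
    # for p=10: 0.10**(1/10) is about 0.79 and 0.01**(1/10) is about 0.63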


K-means Clustering
A commonly used algorithm to perform clustering is the K-means algorithm.

It is appropriate when using squared Euclidean distance as the measure of object dissimilarity:

$d(x_i, x_{i'}) = \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 = \|x_i - x_{i'}\|^2$


K-means Clustering
K-means partitions observations into $K$ clusters, with $K$ provided as a parameter.

Given some clustering, or partition, $C$, the assignment of observation $x_i$ to cluster $k \in \{1, \ldots, K\}$ is denoted $C(i) = k$.

K-means minimizes this clustering criterion:

$W(C) = \frac{1}{2} \sum_{k=1}^{K} \sum_{i:\, C(i)=k} \sum_{i':\, C(i')=k} \|x_i - x_{i'}\|^2$


K-means Clustering
This is equivalent to minimizing

$W(C) = \frac{1}{2} \sum_{k=1}^{K} N_k \sum_{i:\, C(i)=k} \|x_i - \bar{x}_k\|^2$

with:

$\bar{x}_k = (\bar{x}_{k1}, \ldots, \bar{x}_{kp})$, where $\bar{x}_{kj}$ is the average of predictor $j$ over the observations assigned to cluster $k$, and $N_k$ is the number of observations assigned to cluster $k$.


K-means Clustering

$W(C) = \frac{1}{2} \sum_{k=1}^{K} N_k \sum_{i:\, C(i)=k} \|x_i - \bar{x}_k\|^2$

Minimize the total distance given by each observation to the mean (centroid) of the cluster to which the observation is assigned.


K-means Clustering
An iterative algorithm is used to minimize this criterion:

1. Initialize by choosing $K$ observations as centroids $m_1, m_2, \ldots, m_K$
2. Assign each observation $i$ to the cluster with the nearest centroid, i.e., set $C(i) = \arg\min_{1 \leq k \leq K} \|x_i - m_k\|^2$
3. Update centroids: $m_k = \bar{x}_k$
4. Iterate steps 2 and 3 until convergence
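A minimal numpy sketch of these four steps (an illustrative sketch, not the reference implementation used for the figures):

    import numpy as np

    def kmeans(X, K, n_iter=100, seed=0):
        """Lloyd's algorithm: X is an (n, p) array; returns (assignments, centroids)."""
        rng = np.random.default_rng(seed)
        n = X.shape[0]
        m = X[rng.choice(n, size=K, replace=False)]      # step 1: K observations as centroids
        for _ in range(n_iter):
            d = ((X[:, None, :] - m[None, :, :]) ** 2).sum(axis=2)   # squared distances, (n, K)
            C = d.argmin(axis=1)                          # step 2: assign to nearest centroid
            new_m = np.array([X[C == k].mean(axis=0) if np.any(C == k) else m[k]
                              for k in range(K)])         # step 3: update centroids
            if np.allclose(new_m, m):                     # step 4: stop at convergence
                break
            m = new_m
        return C, m

    # usage on synthetic data
    X = np.random.default_rng(1).normal(size=(200, 5))
    C, m = kmeans(X, K=3)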


K-means Clustering
Here we illustrate the k-means algorithm over four iterations on our example data with $K = 4$.


K-means Clustering
The criterion $W(C)$ is reduced in each iteration, so the algorithm is assured to converge.

The criterion is not convex, so the clustering we obtain may not be globally optimal.

In practice, the algorithm is run with multiple initializations (step 1) and the best clustering achieved is used.


K-means Clustering
Also, selection of observations as centroids can be improved using the K-means++ algorithm:

1. Choose an observation as centroid $m_1$ uniformly at random
2. To choose centroid $m_k$, compute for each observation $i$ not chosen as a centroid the distance to the nearest centroid, $d_i = \min_{1 \leq l < k} \|x_i - m_l\|^2$
3. Set centroid $m_k$ to an observation randomly chosen with probability $\frac{e^{d_i}}{\sum_{i'} e^{d_{i'}}}$
4. Iterate steps 2 and 3 until $K$ centroids are chosen
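A sketch of this initialization as written above, i.e., sampling with probability proportional to $e^{d_i}$ (note that the original K-means++ paper instead samples proportionally to the squared distance itself); the names are illustrative:

    import numpy as np

    def kmeanspp_init(X, K, seed=0):
        """Choose K centroids: the first uniformly, the rest weighted by distance to the nearest centroid."""
        rng = np.random.default_rng(seed)
        n = X.shape[0]
        centroids = [X[rng.integers(n)]]                  # step 1: m_1 uniformly at random
        for _ in range(1, K):
            d = ((X[:, None, :] - np.array(centroids)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
            w = np.exp(d - d.max())                       # e^{d_i}, shifted for numerical stability
            prob = w / w.sum()                            # probability e^{d_i} / sum_{i'} e^{d_{i'}}
            centroids.append(X[rng.choice(n, p=prob)])    # step 3: sample the next centroid
        return np.array(centroids)

    m0 = kmeanspp_init(np.random.default_rng(2).normal(size=(100, 2)), K=4)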


Choosing the number of clusters
The number of clusters $K$ must be determined before running the K-means algorithm.

There is no clean, direct method for choosing the number of clusters to use in the K-means algorithm (e.g., no cross-validation method).


Choosing the number of clusters
Looking at the criterion $W(C)$ alone is not sufficient, as the criterion becomes smaller as the value of $K$ is increased.


Choosing the number of clusters
We can use properties of this plot for ad-hoc selection.

Suppose there is a true underlying number $K^*$ of clusters in the data:

improvement in the $W_K(C)$ statistic will be fast for values of $K \leq K^*$ and slower for values of $K > K^*$.


Choosing the number of clusters
Improvement in the $W_K(C)$ statistic will be fast for values of $K \leq K^*$.

In this case, there will be a cluster which contains observations belonging to two of the true underlying clusters, and therefore will have poor within-cluster similarity.

As $K$ is increased, these observations may then be separated into separate clusters, providing a sharp improvement in the $W_K(C)$ statistic.


Choosing the number of clusters
Improvement in the $W_K(C)$ statistic will be slower for values of $K > K^*$.

In this case, observations belonging to a single true cluster are split into multiple clusters, all with generally high within-cluster similarity.

Splitting these clusters further will not improve the $W_K(C)$ statistic very sharply.


Choosing the number of clusters
The curve will therefore have an inflection point around $K^*$.


Choosing the number of clusters
The gap statistic is used to identify the inflection point in the curve.

It compares the behavior of the $W_K(C)$ statistic based on the data with the behavior of the $W_K(C)$ statistic for data generated uniformly at random over the range of the data.

It chooses the $K$ that maximizes the gap between these two $W_K(C)$ curves.
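A rough sketch of this comparison, using scikit-learn's KMeans to obtain the $W_K(C)$ values (the number of reference datasets and the omission of the standard-error rule are simplifications of my own):

    import numpy as np
    from sklearn.cluster import KMeans

    def gap_statistic(X, K_max=8, n_ref=10, seed=0):
        """Compare log W_K on the data with log W_K on uniform reference data."""
        rng = np.random.default_rng(seed)
        lo, hi = X.min(axis=0), X.max(axis=0)
        gaps = []
        for K in range(1, K_max + 1):
            w_data = KMeans(n_clusters=K, n_init=10, random_state=seed).fit(X).inertia_
            w_ref = [KMeans(n_clusters=K, n_init=10, random_state=seed)
                     .fit(rng.uniform(lo, hi, size=X.shape)).inertia_
                     for _ in range(n_ref)]
            gaps.append(np.mean(np.log(w_ref)) - np.log(w_data))   # gap(K)
        return np.array(gaps)

    X = np.random.default_rng(1).normal(size=(150, 2))
    print(np.argmax(gap_statistic(X)) + 1)   # suggested K (largest gap)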


Choosing the number of clusters
For this dataset, the gap statistic suggests there is no clear cluster structure and therefore $K = 1$ is the best choice.

A choice of $K = 4$ is also appropriate.



Large-scale clustering
Cost of K-means as presented:

Each iteration: compute the distance from each point to each centroid, $O(knp)$.

This implies we have to do multiple passes over the entire dataset.

Not good for massive datasets.

Can we do this in "almost" a single pass?


Large-scale clustering (BFR Algorithm)
1. Select $k$ points as before
2. Process the data file in chunks:
   Set the chunk size so each chunk can be processed in main memory.
   We will use memory for workspace, so not the entire memory is available.


Large-scale clustering (BFR Algorithm)
For each chunk:

All points sufficiently close to the centroid of one of the $k$ clusters are assigned to that cluster (Discard Set).

Remaining points are clustered (e.g., using $k$-means with some value of $k$). Two cases:
- Clusters with more than one point (Compressed Set)
- Singleton clusters (Retained Set)

Try to merge clusters in the Compressed Set.


Large-scale clustering (BFR Algorithm)
What is sufficiently close?

"Weighted" distance to centroid below some threshold:

$\sqrt{\sum_{i=1}^{d} \left( \frac{p_i - c_i}{\sigma_i} \right)^2}$

where $c_i$ is the cluster mean for feature $i$ and $\sigma_i$ is the cluster standard deviation of feature $i$.
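A minimal sketch of this check, assuming the cluster centroid and per-feature standard deviations are already available (the threshold below is an arbitrary illustration, not a value from the slides):

    import numpy as np

    def weighted_distance(point, centroid, std):
        """Normalized distance of a point to a cluster centroid (per-feature standardization)."""
        point, centroid, std = (np.asarray(a, dtype=float) for a in (point, centroid, std))
        return np.sqrt(np.sum(((point - centroid) / std) ** 2))

    def sufficiently_close(point, centroid, std, threshold=3.0):
        """Point joins the cluster if its weighted distance falls below the chosen threshold."""
        return weighted_distance(point, centroid, std) < threshold

    print(weighted_distance([1.0, 2.0], [0.0, 0.0], [1.0, 2.0]))   # sqrt(1 + 1), about 1.41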


Large-scale clustering (BFR Algorithm)

Assumption: points are distributed along axis-parallel ellipses


Large-scale clustering (BFR Algorithm)
Under this assumption, we only need to store means and variances to calculate distances.

We can do this by storing for each cluster $j$:
- $N_j$: the number of points assigned to cluster $j$
- $s_{ij}$: the sum of values of feature $i$ in cluster $j$
- $s^2_{ij}$: the sum of squares of values of feature $i$ in cluster $j$

Constant amount of space to represent a cluster.

Exercise: show these are sufficient to calculate the weighted distance.
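One way to see why these summaries suffice (a sketch; the class and variable names are mine, not from the slides): the centroid, the per-feature standard deviations, and hence the weighted distance can all be recovered from $N_j$ and the per-feature sums and sums of squares.

    import numpy as np

    class ClusterSummary:
        """BFR-style summary of one cluster: N, per-feature sums, per-feature sums of squares."""
        def __init__(self, p):
            self.n = 0
            self.s = np.zeros(p)    # s_ij:   sum of feature values
            self.s2 = np.zeros(p)   # s^2_ij: sum of squared feature values

        def add(self, x):
            x = np.asarray(x, dtype=float)
            self.n += 1
            self.s += x
            self.s2 += x ** 2

        def centroid(self):
            return self.s / self.n                            # c_i = s_i / N

        def std(self):
            var = self.s2 / self.n - self.centroid() ** 2     # Var = E[x^2] - (E[x])^2
            return np.sqrt(np.maximum(var, 1e-12))            # guard against tiny negative values

        def weighted_distance(self, x):
            return np.sqrt(np.sum(((np.asarray(x) - self.centroid()) / self.std()) ** 2))

    # usage: summarize a few points, then score a new point
    cs = ClusterSummary(p=2)
    for pt in ([1.0, 2.0], [2.0, 4.0], [3.0, 6.0]):
        cs.add(pt)
    print(cs.centroid(), cs.std(), cs.weighted_distance([2.0, 5.0]))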



Large-scale clustering (BFR Algorithm)
This is used to represent (final) clusters in the Discard Set and (partial) clusters in the Compressed Set.

The only points explicitly in memory are those in the Retained Set.

Points outside of the Retained Set are never kept in memory (they are written out along with their cluster assignment).


Large-scale clustering (BFR Algorithm)

Merging clusters in the Compressed Set

Example: merge if the variance of the combined cluster is sufficiently close to the variances of the separate clusters.
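A sketch of one such test using the (N, SUM, SUMSQ) summaries (the tolerance and the exact criterion are illustrative choices, not the BFR specification):

    import numpy as np

    def combine(a, b):
        """Summaries add component-wise: (N, SUM, SUMSQ)."""
        return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

    def variance(summary):
        n, s, s2 = summary
        return s2 / n - (s / n) ** 2

    def should_merge(a, b, tol=1.2):
        """Merge if the combined per-feature variance is not much larger than either part's."""
        v = variance(combine(a, b))
        return bool(np.all(v <= tol * np.maximum(variance(a), variance(b))))

    # toy summaries for two clusters over a single feature: (N, sum, sum of squares)
    a = (100, np.array([100.0]), np.array([110.0]))    # mean 1.00, variance 0.10
    b = (100, np.array([105.0]), np.array([120.25]))   # mean 1.05, variance 0.10
    print(should_merge(a, b))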



Large-scale clustering (BFR Algorithm)
After all data is processed:

Assign points in the Retained Set to the cluster with the nearest centroid.

Merge partial clusters in the Compressed Set with the final clusters in the Discard Set.

Or, flag all of these as outliers.


Soft K-means Clustering
Instead of the combinatorial approach of the $K$-means algorithm, take a more direct probabilistic approach to modeling the distribution $\Pr(X)$.

Assume each of the $K$ clusters corresponds to a multivariate distribution $\Pr_k(X)$.

$\Pr(X)$ is given by a mixture of these distributions: $\Pr(X) = \sum_{k=1}^{K} \pi_k \Pr_k(X)$.


Soft K-means Clustering
Specifically, take $\Pr_k(X)$ as a multivariate normal distribution $f_k(X) = N(\mu_k, \sigma_k^2 I)$

and mixture density $f(X) = \sum_{k=1}^{K} \pi_k f_k(X)$.


Soft K-means Clustering
Use Maximum Likelihood to estimate the parameters

$\theta = (\mu_1, \ldots, \mu_K, \sigma_1^2, \ldots, \sigma_K^2, \pi_1, \ldots, \pi_K)$

based on their log-likelihood

$\ell(\theta; X) = \sum_{i=1}^{N} \log \left[ \sum_{k=1}^{K} \pi_k f_k(x_i; \theta) \right]$
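A small numpy sketch of evaluating this log-likelihood under the spherical Gaussian mixture above (the function names and toy data are mine; the log-sum-exp shift is only for numerical stability):

    import numpy as np

    def log_gaussian(X, mu, sigma2):
        """Log density of N(mu, sigma2 * I) evaluated at the rows of X."""
        p = X.shape[1]
        sq = np.sum((X - mu) ** 2, axis=1)
        return -0.5 * (p * np.log(2 * np.pi * sigma2) + sq / sigma2)

    def mixture_log_likelihood(X, pi, mu, sigma2):
        """ell(theta; X) = sum_i log( sum_k pi_k f_k(x_i; theta) )."""
        log_fk = np.stack([np.log(pi[k]) + log_gaussian(X, mu[k], sigma2[k])
                           for k in range(len(pi))], axis=1)        # (N, K)
        m = log_fk.max(axis=1, keepdims=True)
        return np.sum(m.squeeze(1) + np.log(np.exp(log_fk - m).sum(axis=1)))

    # toy data and a two-component guess
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, size=(50, 2)), rng.normal(3, 1, size=(50, 2))])
    print(mixture_log_likelihood(X, pi=[0.5, 0.5],
                                 mu=[np.zeros(2), np.full(2, 3.0)], sigma2=[1.0, 1.0]))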


Soft K-means Clustering
Maximizing this likelihood directly is computationally difficult.

Use the Expectation-Maximization (EM) algorithm instead.


Soft K-means Clustering
Consider unobserved latent variables $\Delta_{ik}$ taking values 0 or 1.

$\Delta_{ik} = 1$ specifies that observation $x_i$ was generated by component $k$ of the mixture distribution.


Soft K-means Clustering
Now set $\Pr(\Delta_{ik} = 1) = \pi_k$, and assume we observed values for the indicator variables $\Delta_{ik}$.

We can write the log-likelihood of our parameters in this case as

$\ell_0(\theta; X, \Delta) = \sum_{i=1}^{N} \sum_{k=1}^{K} \Delta_{ik} \log f_k(x_i; \theta) + \sum_{i=1}^{N} \sum_{k=1}^{K} \Delta_{ik} \log \pi_k$


Soft K-means Clustering
Maximum likelihood estimates:

$\hat{\mu}_k = \frac{\sum_{i=1}^{N} \Delta_{ik} x_i}{\sum_{i=1}^{N} \Delta_{ik}}$

$\hat{\sigma}^2_k = \frac{\sum_{i=1}^{N} \Delta_{ik} (x_i - \hat{\mu}_k)^2}{\sum_{i=1}^{N} \Delta_{ik}}$

$\hat{\pi}_k = \frac{\sum_{i=1}^{N} \Delta_{ik}}{N}$



Soft K-means Clustering
Of course, this result depends on observing values for $\Delta_{ik}$, which we don't observe. Use an iterative approach as well:

given the current estimate of the parameters $\theta$, maximize

$E(\ell_0(\theta'; X, \Delta) \mid X, \theta)$

We can prove that maximizing this quantity also maximizes the likelihood we need, $\ell(\theta; X)$.


Soft K-means Clustering
In the mixture case, what is the function we would maximize?

Define

$\gamma_{ik}(\theta) = E(\Delta_{ik} \mid X_i, \theta) = \Pr(\Delta_{ik} = 1 \mid X_i, \theta)$


Soft K-means Clustering
Use Bayes' Rule to write this in terms of the multivariate normal densities with respect to the current estimates $\theta$:

$\gamma_{ik} = \frac{\Pr(X_i \mid \Delta_{ik} = 1) \Pr(\Delta_{ik} = 1)}{\Pr(X_i)} = \frac{f_k(x_i; \mu_k, \sigma_k^2)\, \pi_k}{\sum_{l=1}^{K} f_l(x_i; \mu_l, \sigma_l^2)\, \pi_l}$


Soft K-means Clustering
The quantity $\gamma_{ik}(\theta)$ is referred to as the responsibility of cluster $k$ for observation $i$, according to the current parameter estimate $\theta$.


Soft K-means Clustering
Then the expectation we are maximizing is given by

$E(\ell_0(\theta'; X, \Delta) \mid X, \theta) = \sum_{i=1}^{N} \sum_{k=1}^{K} \gamma_{ik}(\theta) \log f_k(x_i; \theta') + \sum_{i=1}^{N} \sum_{k=1}^{K} \gamma_{ik}(\theta) \log \pi'_k$


Soft K-means Clustering
We can now give a complete specification of the EM algorithm for mixture model clustering.

1. Take initial guesses for the parameters $\theta$
2. Expectation Step: compute the responsibilities $\gamma_{ik}(\theta)$
3. Maximization Step: estimate new parameters based on the responsibilities, as below
4. Iterate steps 2 and 3 until convergence


Soft K-means Algorithm
Estimates in the Maximization step are given by

$\hat{\mu}_k = \frac{\sum_{i=1}^{N} \gamma_{ik}(\theta)\, x_i}{\sum_{i=1}^{N} \gamma_{ik}(\theta)}$

$\hat{\sigma}^2_k = \frac{\sum_{i=1}^{N} \gamma_{ik}(\theta)\, (x_i - \hat{\mu}_k)^2}{\sum_{i=1}^{N} \gamma_{ik}(\theta)}$

and

$\hat{\pi}_k = \frac{\sum_{i=1}^{N} \gamma_{ik}(\theta)}{N}$
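Putting the Expectation and Maximization steps together, a compact numpy sketch of the full iteration under the spherical Gaussian model above (the initialization, variance floor, and fixed iteration count are my own choices, not part of the algorithm's specification):

    import numpy as np

    def em_soft_kmeans(X, K, n_iter=100, seed=0):
        """EM for a mixture of K spherical Gaussians N(mu_k, sigma2_k * I)."""
        rng = np.random.default_rng(seed)
        n, p = X.shape
        mu = X[rng.choice(n, size=K, replace=False)]    # initial guesses for the parameters
        sigma2 = np.full(K, X.var())
        pi = np.full(K, 1.0 / K)
        for _ in range(n_iter):
            # E-step: responsibilities gamma_ik proportional to pi_k f_k(x_i)
            sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)        # (n, K)
            log_f = -0.5 * (p * np.log(2 * np.pi * sigma2) + sq / sigma2)   # log f_k(x_i)
            log_w = np.log(pi) + log_f
            log_w -= log_w.max(axis=1, keepdims=True)                       # stabilize
            gamma = np.exp(log_w)
            gamma /= gamma.sum(axis=1, keepdims=True)
            # M-step: weighted-average updates
            Nk = gamma.sum(axis=0)                                          # sum_i gamma_ik
            mu = (gamma.T @ X) / Nk[:, None]
            sq_new = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
            sigma2 = np.maximum((gamma * sq_new).sum(axis=0) / (p * Nk), 1e-6)
            pi = Nk / n
        return pi, mu, sigma2, gamma

    # usage on synthetic data with two groups
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 1, size=(100, 2)), rng.normal(4, 1, size=(100, 2))])
    pi, mu, sigma2, gamma = em_soft_kmeans(X, K=2)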


Soft K-means Algorithm
The name "soft" K-means refers to the fact that parameter estimates for each cluster are obtained by weighted averages across all observations.

This is an instance of the Expectation-Maximization algorithm, which we will see again later this semester.
