CHAPTER 7:
Clustering
Eick: K-Means and EM (modified transparencies from E. Alpaydin, Introduction to Machine Learning, The MIT Press, 2004, with new transparencies added)
Last updated: February 25, 2014
k-Means Clustering

Problem: Find k reference vectors M = {m_1, ..., m_k} (representatives/centroids) and a 1-of-k coding scheme (X -> M), represented by the b_i^t, that minimize the "reconstruction error"

$$E(\{\mathbf{m}_i\}_{i=1}^{k} \mid X) = \sum_{t=1}^{N}\sum_{i=1}^{k} b_i^t \,\lVert \mathbf{x}^t - \mathbf{m}_i \rVert^2$$

Both b_i^t and M have to be found so that E is minimized.

Step 1: for fixed M, find the best b_i^t:

$$b_i^t = \begin{cases} 1 & \text{if } \lVert \mathbf{x}^t - \mathbf{m}_i \rVert = \min_j \lVert \mathbf{x}^t - \mathbf{m}_j \rVert \\ 0 & \text{otherwise} \end{cases}$$

Step 2: for fixed b_i^t, find M:

$$\mathbf{m}_i = \frac{\sum_t b_i^t \mathbf{x}^t}{\sum_t b_i^t}$$
Problem: b_i^t depends on M! Only an iterative solution can be found! (Significantly different from the book.)
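A minimal NumPy sketch of these two alternating steps (the function name `kmeans` and its defaults are illustrative, not from the transparencies):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: alternate the two steps until the centroids stabilize."""
    rng = np.random.default_rng(seed)
    M = X[rng.choice(len(X), size=k, replace=False)]            # random initial centroids
    for _ in range(n_iter):
        # Step 1 (fix M, choose b): assign each x^t to its nearest centroid
        dist = np.linalg.norm(X[:, None, :] - M[None, :, :], axis=2)   # N x k distances
        labels = dist.argmin(axis=1)                                   # 1-of-k coding
        # Step 2 (fix b, choose M): each centroid becomes the mean of its assigned points
        new_M = np.array([X[labels == i].mean(axis=0) if np.any(labels == i) else M[i]
                          for i in range(k)])
        if np.allclose(new_M, M):
            break
        M = new_M
    return M, labels
```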
k-means Clustering
k-Means Clustering Reformulated

Given: dataset X, k (and a random seed).
Problem: Find k vectors M = {m_1, ..., m_k} ⊆ R^d and a 1-of-k coding scheme b_i^t (b^t: X -> M) that minimize the error

$$E(M \mid X) = \sum_{t=1}^{N}\sum_{i=1}^{k} b_i^t \,\lVert \mathbf{x}^t - \mathbf{m}_i \rVert^2$$

subject to: b_i^t ∈ {0,1} and Σ_i b_i^t = 1 for t = 1, ..., N ("coding scheme").
k-Means Clustering Generalized

Given: dataset X, k, distance function d.
Problem: Find k vectors M = {m_1, ..., m_k} ⊆ R^d and a 1-of-k coding scheme b_i^t (b^t: X -> M) that minimize the error

$$E'(M \mid X) = \sum_{t=1}^{N}\sum_{i=1}^{k} b_i^t \, d(\mathbf{x}^t, \mathbf{m}_i)^2$$

subject to: b_i^t ∈ {0,1} and Σ_i b_i^t = 1 for t = 1, ..., N ("coding scheme").
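As an illustration of the generalized formulation, the sketch below parameterizes only the assignment step by the distance function d (the names `assign`, `l1`, `l2` are hypothetical). Note that for distances other than the Euclidean one the optimal representative in the update step is in general not the mean, e.g., the coordinate-wise median for L1 (k-medians) or a medoid for an arbitrary d.

```python
import numpy as np

def assign(X, M, d):
    """Assignment step of generalized k-means: pick, for each object,
    the representative m_i that minimizes d(x^t, m_i)."""
    return np.array([min(range(len(M)), key=lambda i: d(x, M[i])) for x in X])

# Example distance functions that could be plugged in:
l2 = lambda x, m: np.linalg.norm(x - m)     # Euclidean (standard k-means)
l1 = lambda x, m: np.abs(x - m).sum()       # Manhattan (leads to k-medians)
```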
K-mean’s Complexity—O(kntd)
K-means performs t iterations of E/M-steps; in each iteration the following computations have to be done: E-Step:
1. Find the nearest centroid for each object; because there are n objects and k centroids n*k distances have to be computed and compared; the complexity of computing a distance is O(d) yielding: O(k*n*d)
2. Assign objects to clusters/update clusters (O(n)) M-Step: Compute centroid for each cluster (O(n))
We obtain: O(t*(k*n*d + n + n))=O(t*k*n*d)
Semiparametric Density Estimation

Parametric: Assume a single model for p(x | C_i) (Chapters 4 and 5).
Semiparametric: p(x | C_i) is a mixture of densities; multiple possible explanations/prototypes.
Nonparametric: No model; the data speaks for itself (Chapter 8).
Mixture Densities
$$p(\mathbf{x}) = \sum_{i=1}^{k} p(\mathbf{x} \mid G_i)\, P(G_i)$$

where G_i are the components/groups/clusters,
P(G_i) are the mixture proportions (priors),
p(x | G_i) are the component densities.

Gaussian mixture: p(x | G_i) ~ N(μ_i, Σ_i) with parameters Φ = {P(G_i), μ_i, Σ_i}_{i=1}^{k}.

Unlabeled sample X = {x^t}_t (unsupervised learning).
Idea: use a density function for each cluster or use multiple density functions for a single class
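A small sketch of evaluating such a Gaussian mixture density at a point x (the helper names `gaussian_pdf` and `mixture_density` are illustrative):

```python
import numpy as np

def gaussian_pdf(x, mu, cov):
    """Density of N(mu, cov) at a d-dimensional point x."""
    d = len(mu)
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff) / norm

def mixture_density(x, priors, means, covs):
    """p(x) = sum_i p(x | G_i) * P(G_i)."""
    return sum(P * gaussian_pdf(x, mu, S)
               for P, mu, S in zip(priors, means, covs))
```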
Expectation-Maximization (EM) Clustering

Assumptions: the dataset X is created by a mixture of k multivariate Gaussians G_1, ..., G_k with priors P(G_1), ..., P(G_k).

Example (2-D dataset):
p(x|G_1) ~ N((2,3), 1),  P(G_1) = 0.5
p(x|G_2) ~ N((-4,1), 2), P(G_2) = 0.4
p(x|G_3) ~ N((0,-9), 3), P(G_3) = 0.1

EM's task: Given X, reconstruct the k Gaussians and their priors; each Gaussian is a model of one cluster in the dataset.

Forming of clusters: Assign x_i to the cluster G_r that has the maximum value of p(x_i|G_r)·P(G_r); alternatively, set h_i^r ∝ p(x_i|G_r)·P(G_r) and use it as a soft degree of membership (soft EM).
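A hedged sketch of the 2-D example above, reading N(μ, s) as a spherical Gaussian with covariance s·I (an assumption; the transparency does not spell out the covariance structure) and forming hard clusters by the argmax rule. All variable and helper names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# The slide's example mixture, with scalar s interpreted as covariance s * I.
priors = np.array([0.5, 0.4, 0.1])
means  = np.array([[2.0, 3.0], [-4.0, 1.0], [0.0, -9.0]])
covs   = [s * np.eye(2) for s in (1.0, 2.0, 3.0)]

# Generate a dataset: pick a component by its prior, then sample from it.
comp = rng.choice(3, size=500, p=priors)
X = np.array([rng.multivariate_normal(means[c], covs[c]) for c in comp])

def density(x, mu, cov):
    diff = x - mu
    return np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff) / \
           np.sqrt((2 * np.pi) ** 2 * np.linalg.det(cov))

# Hard cluster formation: assign x to the G_r maximizing p(x | G_r) * P(G_r).
labels = np.array([np.argmax([priors[r] * density(x, means[r], covs[r])
                              for r in range(3)]) for x in X])
```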
Performance Measure of EM

EM maximizes the log likelihood of a mixture model: we choose the Φ that makes X most likely.

Assume hidden variables z which, when known, make the optimization much simpler. z gives the probability of an object belonging to a particular cluster; z_i^t is not known and has to be estimated; h_i^t denotes its estimator.

EM maximizes the log likelihood of the sample, but instead of maximizing L(Φ|X) directly, it maximizes L(Φ|X,Z), expressed in terms of x and z.
$$\mathcal{L}(\Phi \mid X) = \log \prod_t p(\mathbf{x}^t \mid \Phi) = \sum_t \log \sum_{i=1}^{k} p(\mathbf{x}^t \mid G_i)\, P(G_i)$$
E- and M-steps
Iterate the two steps:
1. E-step: Estimate z given X and the current Φ.
2. M-step: Find the new Φ' given z, X, and the old Φ.

$$\text{E-step:}\quad Q(\Phi \mid \Phi^{l}) = E\big[\,\mathcal{L}_C(\Phi \mid X, Z) \;\big|\; X, \Phi^{l}\,\big]$$
$$\text{M-step:}\quad \Phi^{l+1} = \arg\max_{\Phi}\, Q(\Phi \mid \Phi^{l})$$

(Z: assignments of samples to clusters; Φ^l: current model.)

An increase in Q increases the incomplete likelihood:

$$\mathcal{L}(\Phi^{l+1} \mid X) \ge \mathcal{L}(\Phi^{l} \mid X)$$

Remark: z_i^t denotes the unknown membership of the t-th example.
EM in Gaussian Mixtures

Assume p(x | G_i) ~ N(μ_i, Σ_i); z_i^t is the unknown membership of x^t in G_i.

E-step: Estimate the mixture probabilities for X = {x^1, ..., x^N}:

$$E[z_i^t \mid X, \Phi^{l}] = \frac{p(\mathbf{x}^t \mid G_i, \Phi^{l})\, P(G_i)}{\sum_j p(\mathbf{x}^t \mid G_j, \Phi^{l})\, P(G_j)} \equiv h_i^t$$

M-step: Obtain Φ^{l+1}, namely (P(G_i), m_i, S_i)^{l+1} for i = 1, ..., k, using the estimated labels in place of the unknown labels:

$$P(G_i) = \frac{\sum_t h_i^t}{N}, \qquad \mathbf{m}_i^{l+1} = \frac{\sum_t h_i^t \mathbf{x}^t}{\sum_t h_i^t}, \qquad \mathbf{S}_i^{l+1} = \frac{\sum_t h_i^t (\mathbf{x}^t - \mathbf{m}_i^{l+1})(\mathbf{x}^t - \mathbf{m}_i^{l+1})^{T}}{\sum_t h_i^t}$$

Remarks: h_i^t plays the role of b_i^t in K-Means; h_i^t acts as an estimator for the unknown label z_i^t.
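A compact NumPy sketch of these E- and M-steps for a Gaussian mixture (the initialization choices and the small regularization term added to the covariances are assumptions, not part of the transparencies):

```python
import numpy as np

def em_gmm(X, k, n_iter=50, seed=0):
    """EM for a mixture of k Gaussians; h[t, i] estimates the unknown label z_i^t."""
    N, d = X.shape
    rng = np.random.default_rng(seed)
    P = np.full(k, 1.0 / k)                             # priors P(G_i)
    m = X[rng.choice(N, size=k, replace=False)].copy()  # means, seeded with data points
    S = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])  # covariances
    for _ in range(n_iter):
        # E-step: h_i^t = p(x^t | G_i) P(G_i) / sum_j p(x^t | G_j) P(G_j)
        h = np.empty((N, k))
        for i in range(k):
            diff = X - m[i]
            inv = np.linalg.inv(S[i])
            maha = np.einsum('td,de,te->t', diff, inv, diff)       # squared Mahalanobis
            norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(S[i]))
            h[:, i] = P[i] * np.exp(-0.5 * maha) / norm
        h /= h.sum(axis=1, keepdims=True)
        # M-step: re-estimate priors, means and covariances from the soft labels h
        Nk = h.sum(axis=0)
        P = Nk / N
        m = (h.T @ X) / Nk[:, None]
        for i in range(k):
            diff = X - m[i]
            S[i] = (h[:, i, None] * diff).T @ diff / Nk[i] + 1e-6 * np.eye(d)
    return P, m, S, h
```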
EM Means Expectation-Maximization

1. E-step (Expectation): Determine object memberships in clusters (from the model).
2. M-step (Maximization): Create the model by Maximum Likelihood Estimation (from the cluster memberships).

Computations in the two steps (K-means / EM):
1. Compute b_i^t / h_i^t.
2. Compute the centroid / the (prior, mean, covariance matrix).
Remark: K-means uses a simple form of the EM-procedure!
Commonalities between K-Means and EM
1. Both start with random clusters and rely on a two-step approach to optimize the objective function using the EM procedure (see the previous transparency).
2. Both use the same optimization scheme for an objective function f(a_1, ..., a_m, b_1, ..., b_k): alternately optimize the a-values (keeping the b-values fixed) and then the b-values (keeping the a-values fixed) until convergence is reached. Consequently, both algorithms only find a local optimum of the objective function and are sensitive to initialization.
3. Both assume the number of clusters k is known
Differences between K-Means and EM I
1. K-means is distance-based and relies on 1-NN queries to form clusters. EM is density-based/probabilistic; EM usually works with multivariate Gaussians but can be generalized to other probability distributions.
2. K-means minimizes the squared distance of an object to its cluster prototype (usually the centroid). EM maximizes the log-likelihood of a sample given a model, p(X|Φ); the models are assumed to be mixtures of k Gaussians and their priors.
3. K-means is a hard clustering algorithm, EM is a soft clustering algorithm: h_i^t ∈ [0,1].
Differences between K-Means and EM II
4. K-means cluster models are just the k centroids; EM models are the k (prior, mean, covariance matrix) triples.
5. EM directly deals with dependencies between attributes through its density estimation approach: the degree to which an object x belongs to a cluster c depends on the product of c's prior with a Gaussian density determined by the Mahalanobis distance between x and c's mean; therefore, EM clusters do not depend on the units of measurement or the orientation of attributes in space (see the sketch below).
6. The distance metric can be viewed as an input parameter of k-means, and generalizations of k-means have been proposed that use different distance functions. EM implicitly relies on the Mahalanobis distance, which is part of its density estimation approach.
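A short sketch of the Mahalanobis distance behind points 5 and 6: rescaling an attribute (e.g., metres to centimetres) rescales the covariance as well, so the distance, and hence the cluster assignment, is unchanged. The example vectors and matrices are made up for illustration:

```python
import numpy as np

def mahalanobis2(x, mu, cov):
    """Squared Mahalanobis distance (x - mu)^T S^{-1} (x - mu); the Gaussian
    density used in EM's E-step is a monotone function of this quantity."""
    diff = x - mu
    return diff @ np.linalg.inv(cov) @ diff

x, mu = np.array([1.0, 2.0]), np.array([0.0, 0.0])
S = np.array([[2.0, 0.3], [0.3, 1.0]])
T = np.diag([100.0, 1.0])                 # unit change on the first attribute
d1 = mahalanobis2(x, mu, S)
d2 = mahalanobis2(T @ x, T @ mu, T @ S @ T.T)
assert np.isclose(d1, d2)                 # distance is invariant to the rescaling
```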
Mixture of Mixtures
In classification, the input comes from a mixture of classes (supervised).
If each class is also a mixture, e.g., of Gaussians (unsupervised), we have a mixture of mixtures:
$$p(\mathbf{x} \mid C_i) = \sum_{j=1}^{k_i} p(\mathbf{x} \mid G_{ij})\, P(G_{ij})$$
$$p(\mathbf{x}) = \sum_{i=1}^{K} p(\mathbf{x} \mid C_i)\, P(C_i)$$
Choosing k
Defined by the application, e.g., image quantization.
Plot the data (after PCA) and check for clusters.
Incremental (leader-cluster) algorithm: add clusters one at a time until an "elbow" appears (reconstruction error / log likelihood / intergroup distances); a sketch follows below.
Manual check for meaning.
Run with multiple k-values and compare the results.
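A sketch of the elbow heuristic using the reconstruction error; it reuses the kmeans() sketch from the earlier k-means transparency and assumes a dataset X is already available:

```python
import numpy as np

def reconstruction_error(X, M, labels):
    """Sum of squared distances of each object to its cluster centroid."""
    return sum(((X[labels == i] - M[i]) ** 2).sum() for i in range(len(M)))

# Try several k-values and look for the point where the error curve flattens:
# errors = {k: reconstruction_error(X, *kmeans(X, k)) for k in range(1, 11)}
```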