CHAPTER 7:
Clustering
Eick: K-Means and EM (modified transparencies from E. Alpaydin, Introduction to Machine Learning, The MIT Press, 2004, with new transparencies added)
Last updated: February 25, 2014
k-Means Clustering

Problem: Find k reference vectors M = {m_1, ..., m_k} (representatives/centroids) and a 1-of-k coding scheme (X -> M), represented by the b_i^t, that minimize the "reconstruction error"

$$E(\{\mathbf{m}_i\}_{i=1}^{k} \mid X) = \sum_{t=1}^{N}\sum_{i=1}^{k} b_i^t \,\lVert \mathbf{x}^t - \mathbf{m}_i \rVert^2$$

Both b_i^t and M have to be found so that E is minimized.

Step 1: for fixed M, find the best b_i^t:

$$b_i^t = \begin{cases} 1 & \text{if } \lVert \mathbf{x}^t - \mathbf{m}_i \rVert = \min_j \lVert \mathbf{x}^t - \mathbf{m}_j \rVert \\ 0 & \text{otherwise} \end{cases}$$

Step 2: for fixed b_i^t, find M:

$$\mathbf{m}_i = \frac{\sum_t b_i^t \mathbf{x}^t}{\sum_t b_i^t}$$
Problem: b_i^t depends on M! Only an iterative solution can be found! (Significantly different from the book.)
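A minimal NumPy sketch of these two alternating steps (the function name `kmeans` and its defaults are illustrative, not from the transparencies):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: alternate the two steps until the centroids stabilize."""
    rng = np.random.default_rng(seed)
    M = X[rng.choice(len(X), size=k, replace=False)]            # random initial centroids
    for _ in range(n_iter):
        # Step 1 (fix M, choose b): assign each x^t to its nearest centroid
        dist = np.linalg.norm(X[:, None, :] - M[None, :, :], axis=2)   # N x k distances
        labels = dist.argmin(axis=1)                                   # 1-of-k coding
        # Step 2 (fix b, choose M): each centroid becomes the mean of its assigned points
        new_M = np.array([X[labels == i].mean(axis=0) if np.any(labels == i) else M[i]
                          for i in range(k)])
        if np.allclose(new_M, M):
            break
        M = new_M
    return M, labels
```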
k-means Clustering
k-Means Clustering Reformulated

Given: dataset X, k (and a random seed).
Problem: Find k vectors M = {m_1, ..., m_k} ⊆ R^d and a 1-of-k coding scheme b_i^t (b^t: X -> M) that minimize the error

$$E(M \mid X) = \sum_{t=1}^{N}\sum_{i=1}^{k} b_i^t \,\lVert \mathbf{x}^t - \mathbf{m}_i \rVert^2$$

subject to: b_i^t ∈ {0,1} and Σ_i b_i^t = 1 for t = 1, ..., N ("coding scheme").
k-Means Clustering Generalized

Given: dataset X, k, distance function d.
Problem: Find k vectors M = {m_1, ..., m_k} ⊆ R^d and a 1-of-k coding scheme b_i^t (b^t: X -> M) that minimize the error

$$E'(M \mid X) = \sum_{t=1}^{N}\sum_{i=1}^{k} b_i^t \, d(\mathbf{x}^t, \mathbf{m}_i)^2$$

subject to: b_i^t ∈ {0,1} and Σ_i b_i^t = 1 for t = 1, ..., N ("coding scheme").
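As an illustration of the generalized formulation, the sketch below parameterizes only the assignment step by the distance function d (the names `assign`, `l1`, `l2` are hypothetical). Note that for distances other than the Euclidean one the optimal representative in the update step is in general not the mean, e.g., the coordinate-wise median for L1 (k-medians) or a medoid for an arbitrary d.

```python
import numpy as np

def assign(X, M, d):
    """Assignment step of generalized k-means: pick, for each object,
    the representative m_i that minimizes d(x^t, m_i)."""
    return np.array([min(range(len(M)), key=lambda i: d(x, M[i])) for x in X])

# Example distance functions that could be plugged in:
l2 = lambda x, m: np.linalg.norm(x - m)     # Euclidean (standard k-means)
l1 = lambda x, m: np.abs(x - m).sum()       # Manhattan (leads to k-medians)
```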
K-mean’s Complexity—O(kntd)
K-means performs t iterations of E/M-steps; in each iteration the following computations have to be done: E-Step:
1. Find the nearest centroid for each object; because there are n objects and k centroids n*k distances have to be computed and compared; the complexity of computing a distance is O(d) yielding: O(k*n*d)
2. Assign objects to clusters/update clusters (O(n)) M-Step: Compute centroid for each cluster (O(n))
We obtain: O(t*(k*n*d + n + n))=O(t*k*n*d)
Semiparametric Density Estimation

Parametric: Assume a single model for p(x | C_i) (Chapters 4 and 5).
Semiparametric: p(x | C_i) is a mixture of densities; multiple possible explanations/prototypes.
Nonparametric: No model; the data speaks for itself (Chapter 8).
Mixture Densities
$$p(\mathbf{x}) = \sum_{i=1}^{k} p(\mathbf{x} \mid G_i)\, P(G_i)$$

where G_i are the components/groups/clusters,
P(G_i) are the mixture proportions (priors),
p(x | G_i) are the component densities.

Gaussian mixture: p(x | G_i) ~ N(μ_i, Σ_i) with parameters Φ = {P(G_i), μ_i, Σ_i}_{i=1}^{k}.

Unlabeled sample X = {x^t}_t (unsupervised learning).
Idea: use a density function for each cluster or use multiple density functions for a single class
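A small sketch of evaluating such a Gaussian mixture density at a point x (the helper names `gaussian_pdf` and `mixture_density` are illustrative):

```python
import numpy as np

def gaussian_pdf(x, mu, cov):
    """Density of N(mu, cov) at a d-dimensional point x."""
    d = len(mu)
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff) / norm

def mixture_density(x, priors, means, covs):
    """p(x) = sum_i p(x | G_i) * P(G_i)."""
    return sum(P * gaussian_pdf(x, mu, S)
               for P, mu, S in zip(priors, means, covs))
```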
Expectation-Maximization (EM) Clustering

Assumptions: the dataset X is created by a mixture of k multivariate Gaussians G_1, ..., G_k with priors P(G_1), ..., P(G_k).

Example (2-D dataset):
p(x|G_1) ~ N((2,3), 1),  P(G_1) = 0.5
p(x|G_2) ~ N((-4,1), 2), P(G_2) = 0.4
p(x|G_3) ~ N((0,-9), 3), P(G_3) = 0.1

EM's task: Given X, reconstruct the k Gaussians and their priors; each Gaussian is a model of one cluster in the dataset.

Forming of clusters: Assign x_i to the cluster G_r that has the maximum value of p(x_i|G_r)·P(G_r); alternatively, set h_i^r ∝ p(x_i|G_r)·P(G_r) and use it as a soft degree of membership (soft EM).
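A hedged sketch of the 2-D example above, reading N(μ, s) as a spherical Gaussian with covariance s·I (an assumption; the transparency does not spell out the covariance structure) and forming hard clusters by the argmax rule. All variable and helper names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# The slide's example mixture, with scalar s interpreted as covariance s * I.
priors = np.array([0.5, 0.4, 0.1])
means  = np.array([[2.0, 3.0], [-4.0, 1.0], [0.0, -9.0]])
covs   = [s * np.eye(2) for s in (1.0, 2.0, 3.0)]

# Generate a dataset: pick a component by its prior, then sample from it.
comp = rng.choice(3, size=500, p=priors)
X = np.array([rng.multivariate_normal(means[c], covs[c]) for c in comp])

def density(x, mu, cov):
    diff = x - mu
    return np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff) / \
           np.sqrt((2 * np.pi) ** 2 * np.linalg.det(cov))

# Hard cluster formation: assign x to the G_r maximizing p(x | G_r) * P(G_r).
labels = np.array([np.argmax([priors[r] * density(x, means[r], covs[r])
                              for r in range(3)]) for x in X])
```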
Performance Measure of EM

EM maximizes the log likelihood of a mixture model: we choose the Φ that makes X most likely.

Assume hidden variables z which, when known, make the optimization much simpler. z gives the probability of an object belonging to a particular cluster; z_i^t is not known and has to be estimated; h_i^t denotes its estimator.

EM maximizes the log likelihood of the sample, but instead of maximizing L(Φ|X) directly, it maximizes L(Φ|X,Z), expressed in terms of x and z.
$$\mathcal{L}(\Phi \mid X) = \log \prod_t p(\mathbf{x}^t \mid \Phi) = \sum_t \log \sum_{i=1}^{k} p(\mathbf{x}^t \mid G_i)\, P(G_i)$$
E- and M-steps
Iterate the two steps:
1. E-step: Estimate z given X and the current Φ.
2. M-step: Find the new Φ' given z, X, and the old Φ.

$$\text{E-step:}\quad Q(\Phi \mid \Phi^{l}) = E\big[\,\mathcal{L}_C(\Phi \mid X, Z) \;\big|\; X, \Phi^{l}\,\big]$$
$$\text{M-step:}\quad \Phi^{l+1} = \arg\max_{\Phi}\, Q(\Phi \mid \Phi^{l})$$

(Z: assignments of samples to clusters; Φ^l: current model.)

An increase in Q increases the incomplete likelihood:

$$\mathcal{L}(\Phi^{l+1} \mid X) \ge \mathcal{L}(\Phi^{l} \mid X)$$

Remark: z_i^t denotes the unknown membership of the t-th example.
EM in Gaussian Mixtures

Assume p(x | G_i) ~ N(μ_i, Σ_i); z_i^t is the unknown membership of x^t in G_i.

E-step: Estimate the mixture probabilities for X = {x^1, ..., x^N}:

$$E[z_i^t \mid X, \Phi^{l}] = \frac{p(\mathbf{x}^t \mid G_i, \Phi^{l})\, P(G_i)}{\sum_j p(\mathbf{x}^t \mid G_j, \Phi^{l})\, P(G_j)} \equiv h_i^t$$

M-step: Obtain Φ^{l+1}, namely (P(G_i), m_i, S_i)^{l+1} for i = 1, ..., k, using the estimated labels in place of the unknown labels:

$$P(G_i) = \frac{\sum_t h_i^t}{N}, \qquad \mathbf{m}_i^{l+1} = \frac{\sum_t h_i^t \mathbf{x}^t}{\sum_t h_i^t}, \qquad \mathbf{S}_i^{l+1} = \frac{\sum_t h_i^t (\mathbf{x}^t - \mathbf{m}_i^{l+1})(\mathbf{x}^t - \mathbf{m}_i^{l+1})^{T}}{\sum_t h_i^t}$$

Remarks: h_i^t plays the role of b_i^t in K-Means; h_i^t acts as an estimator for the unknown label z_i^t.
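A compact NumPy sketch of these E- and M-steps for a Gaussian mixture (the initialization choices and the small regularization term added to the covariances are assumptions, not part of the transparencies):

```python
import numpy as np

def em_gmm(X, k, n_iter=50, seed=0):
    """EM for a mixture of k Gaussians; h[t, i] estimates the unknown label z_i^t."""
    N, d = X.shape
    rng = np.random.default_rng(seed)
    P = np.full(k, 1.0 / k)                             # priors P(G_i)
    m = X[rng.choice(N, size=k, replace=False)].copy()  # means, seeded with data points
    S = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])  # covariances
    for _ in range(n_iter):
        # E-step: h_i^t = p(x^t | G_i) P(G_i) / sum_j p(x^t | G_j) P(G_j)
        h = np.empty((N, k))
        for i in range(k):
            diff = X - m[i]
            inv = np.linalg.inv(S[i])
            maha = np.einsum('td,de,te->t', diff, inv, diff)       # squared Mahalanobis
            norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(S[i]))
            h[:, i] = P[i] * np.exp(-0.5 * maha) / norm
        h /= h.sum(axis=1, keepdims=True)
        # M-step: re-estimate priors, means and covariances from the soft labels h
        Nk = h.sum(axis=0)
        P = Nk / N
        m = (h.T @ X) / Nk[:, None]
        for i in range(k):
            diff = X - m[i]
            S[i] = (h[:, i, None] * diff).T @ diff / Nk[i] + 1e-6 * np.eye(d)
    return P, m, S, h
```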
EM Means Expectation-Maximization

1. E-step (Expectation): Determine object memberships in clusters (from the model).
2. M-step (Maximization): Create the model by Maximum Likelihood Estimation (from the cluster memberships).

Computations in the two steps (K-means / EM):
1. Compute b_i^t / h_i^t.
2. Compute the centroid / the (prior, mean, covariance matrix).
Remark: K-means uses a simple form of the EM-procedure!
Commonalities between K-Means and EM
1. Both start with random clusters and rely on a two-step approach to optimize the objective function using the EM procedure (see the previous transparency).
2. Both use the same optimization scheme for an objective function f(a_1, ..., a_m, b_1, ..., b_k): alternately optimize the a-values (keeping the b-values fixed) and then the b-values (keeping the a-values fixed) until convergence is reached. Consequently, both algorithms only find a local optimum of the objective function and are sensitive to initialization.
3. Both assume the number of clusters k is known
Differences between K-Means and EM I
1. K-means is distance-based and relies on 1-NN queries to form clusters. EM is density-based/probabilistic; EM usually works with multivariate Gaussians but can be generalized to other probability distributions.
2. K-means minimizes the squared distance of an object to its cluster prototype (usually the centroid). EM maximizes the log-likelihood of a sample given a model, p(X|Φ); the models are assumed to be mixtures of k Gaussians and their priors.
3. K-means is a hard clustering algorithm, EM is a soft clustering algorithm: h_i^t ∈ [0,1].
Differences between K-Means and EM II
4. K-means cluster models are just the k centroids; EM models are the k (prior, mean, covariance matrix) triples.
5. EM directly deals with dependencies between attributes through its density estimation approach: the degree to which an object x belongs to a cluster c depends on the product of c's prior with a Gaussian density determined by the Mahalanobis distance between x and c's mean; therefore, EM clusters do not depend on the units of measurement or the orientation of attributes in space (see the sketch below).
6. The distance metric can be viewed as an input parameter of k-means, and generalizations of k-means have been proposed that use different distance functions. EM implicitly relies on the Mahalanobis distance, which is part of its density estimation approach.
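A short sketch of the Mahalanobis distance behind points 5 and 6: rescaling an attribute (e.g., metres to centimetres) rescales the covariance as well, so the distance, and hence the cluster assignment, is unchanged. The example vectors and matrices are made up for illustration:

```python
import numpy as np

def mahalanobis2(x, mu, cov):
    """Squared Mahalanobis distance (x - mu)^T S^{-1} (x - mu); the Gaussian
    density used in EM's E-step is a monotone function of this quantity."""
    diff = x - mu
    return diff @ np.linalg.inv(cov) @ diff

x, mu = np.array([1.0, 2.0]), np.array([0.0, 0.0])
S = np.array([[2.0, 0.3], [0.3, 1.0]])
T = np.diag([100.0, 1.0])                 # unit change on the first attribute
d1 = mahalanobis2(x, mu, S)
d2 = mahalanobis2(T @ x, T @ mu, T @ S @ T.T)
assert np.isclose(d1, d2)                 # distance is invariant to the rescaling
```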
Mixture of Mixtures
In classification, the input comes from a mixture of classes (supervised).
If each class is also a mixture, e.g., of Gaussians (unsupervised), we have a mixture of mixtures:
$$p(\mathbf{x} \mid C_i) = \sum_{j=1}^{k_i} p(\mathbf{x} \mid G_{ij})\, P(G_{ij})$$
$$p(\mathbf{x}) = \sum_{i=1}^{K} p(\mathbf{x} \mid C_i)\, P(C_i)$$
Choosing k
Defined by the application, e.g., image quantization.
Plot the data (after PCA) and check for clusters.
Incremental (leader-cluster) algorithm: add clusters one at a time until an "elbow" appears (reconstruction error / log likelihood / intergroup distances); a sketch follows below.
Manual check for meaning.
Run with multiple k-values and compare the results.
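A sketch of the elbow heuristic using the reconstruction error; it reuses the kmeans() sketch from the earlier k-means transparency and assumes a dataset X is already available:

```python
import numpy as np

def reconstruction_error(X, M, labels):
    """Sum of squared distances of each object to its cluster centroid."""
    return sum(((X[labels == i] - M[i]) ** 2).sum() for i in range(len(M)))

# Try several k-values and look for the point where the error curve flattens:
# errors = {k: reconstruction_error(X, *kmeans(X, k)) for k in range(1, 11)}
```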