Segmentation:
Clustering, Graph Cut and EM
Ying Wu
Electrical Engineering and Computer Science
Northwestern University, Evanston, IL 60208
http://www.eecs.northwestern.edu/~yingwu
Outline
Motivations and Applications
Image Segmentation by Clustering
  K-Means Algorithm
  Self-Organizing Map
Image Segmentation by Graph Cut
  Basic Idea
  Block-diagonalization
Segmentation by Expectation-Maximization
  Missing Data Problem
  E-M Iteration
  Remaining Issues
Segmentation is a Fundamental Problem
To group similar components, such as image pixels, image regions, or even video clips.
It is an ill-posed problem! How do we define the similarity measure?
Background Subtraction
Video surveillance applications
Separate the “foreground” from “background”
Assume fixed camera
Subtract background from images
Adaptive scheme
$$ B_{n+1} = \frac{w_a F + \sum_i w_i B_{n-i}}{w_c} $$
by selecting $w_a$, $w_i$ and $w_c$.
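The adaptive scheme above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: F is read as the current frame, `history` holds the past background estimates B_n, B_{n-1}, ..., the weights are chosen so that w_c equals the sum of the other weights, and the thresholding step and all function names are hypothetical additions that are not on the slide.

```python
import numpy as np

def update_background(frame, history, w_a, w_i, w_c):
    """B_{n+1} = (w_a * F + sum_i w_i[i] * B_{n-i}) / w_c."""
    weighted = w_a * frame
    for w, past_background in zip(w_i, history):
        weighted = weighted + w * past_background
    return weighted / w_c

def foreground_mask(frame, background, threshold):
    """Assumed decision rule: a pixel is foreground if it differs enough."""
    return np.abs(frame - background) > threshold

# Toy 4x4 grayscale example with one bright "foreground" pixel.
history = [np.full((4, 4), 10.0), np.full((4, 4), 12.0)]  # B_n, B_{n-1}
frame = np.full((4, 4), 11.0)
frame[1, 2] = 200.0

# Assumption: w_c normalizes the weights (0.2 + 0.5 + 0.3 = 1.0).
new_bg = update_background(frame, history, w_a=0.2, w_i=[0.5, 0.3], w_c=1.0)
mask = foreground_mask(frame, history[0], threshold=30.0)
```

A small w_a makes the background adapt slowly, so a briefly visible object barely changes the estimate.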
Object Modeling
Represent an object by regions
First step for recognition
Represent a scene by a set of layers
Motion segmentation
Image segmentation vs. motion segmentation
Basic Approaches
Segmentation by clustering
Segmentation by graph cut
Segmentation by EM algorithm
K-Means Clustering
Assume the number of clusters, K , is given.
Use the center $C_i$ of each cluster to represent that cluster.
How do we determine the identity of a data point?
Need to define a distance measure D(x, y), e.g., $D(x, y) = \|x - y\|^2$.
Winner takes all:
$$ l_k(x_k) = \arg\min_i D(x_k, C_i) = \arg\min_i \|x_k - C_i\|^2 $$
where $l_k$ is the label for the data point $x_k$.
K-means finds the clusters to minimize the total distortion.
$$ \phi(X, C) = \sum_{i \in C} \sum_{j \in i\text{-th cluster}} \|x_j - C_i\|^2 $$
K-Means Clustering
To minimize φ, K-means algorithm iterates between two steps:
Labelling: assume the p-th iteration ends with a set of cluster centers $C_i^{(p)}$, i = 1, ..., K. We label each data point based on this set of cluster centers, i.e., for every $x_k$, find
$$ l_k^{(p+1)}(x_k) = \arg\min_i \|x_k - C_i^{(p)}\|^2 $$
and group the data points that belong to the same cluster:
$$ \Omega_j = \{x_k : l_k^{(p+1)}(x_k) = j\} $$
Re-centering: re-calculate the centers:
$$ C_i^{(p+1)} = \frac{\sum_{x_k \in \Omega_i} x_k}{|\Omega_i|} $$
Iterate between labelling and re-centering until convergence.
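The two steps can be sketched as follows. This is a minimal NumPy illustration, not a reference implementation: the random initialization from K data points, the stopping test, and the handling of empty clusters (the old center is kept) are assumptions, and the function names are hypothetical.

```python
import numpy as np

def assign_labels(X, centers):
    """Labelling step: index of the nearest center for every data point."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Assumed initialization: K distinct data points chosen at random.
    centers = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(n_iter):
        labels = assign_labels(X, centers)
        # Re-centering step: mean of each cluster (keep old center if empty).
        new_centers = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centers[i]
            for i in range(K)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    labels = assign_labels(X, centers)
    distortion = ((X - centers[labels]) ** 2).sum()  # phi(X, C)
    return centers, labels, distortion

# Two well-separated 2-D blobs as toy data.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(5.0, 0.1, (20, 2))])
centers, labels, distortion = kmeans(X, K=2)
```

Neither step can increase the distortion, which is why the iteration settles, although only to a local minimum that depends on the initial centers.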
Self-Organizing Map (SOM)
SOM can be used for visualizing high-dim data
Map to a low-dim space based on competitive learning
A two-layer neural network
[Figure: a two-layer network in which the inputs x1, x2, x3 are connected through weights to the outputs ξ1, ξ2, ..., ξm-1, ξm]
The number of neurons in the input layer is the same as the dimension of the input vector.
Connection weights Wk for each output neuron.
Competitive Learning
For an input x , all neurons compete against each other
The winner is the one whose weight is the closest to the input:
$$ y_k^* = \arg\min_i D(x_k, W_i) $$
The index of the winner is taken as the output of SOM.
Adjust the weight of the winner
Train the neurons nearby, and counter-train those far away.
A window function $\Lambda(|y - y_k^*|)$ and the Hebbian learning rule:
$$ W(t+1) = W(t) + \eta(t)\,\Lambda(|y - y_k^*|)\,(x_k - W_{y_k^*}(t)) $$
Intuition: the input data point attracts the neurons inside the window toward its location, but pushes the neurons outside the window farther away.
Relation to vector quantization (VQ) and K-means clustering?
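One competitive-learning step can be sketched as below for a 1-D chain of output neurons. Several choices here are assumptions rather than the slide's definitions: the window is +1 inside a fixed radius and a small negative constant outside, each neuron moves relative to its own weight vector (the usual SOM convention), and the names `window` and `som_step` are hypothetical.

```python
import numpy as np

def window(grid_dist, radius=1, inhibition=0.05):
    """Assumed window: train neurons near the winner, counter-train the rest."""
    return np.where(grid_dist <= radius, 1.0, -inhibition)

def som_step(W, x, eta, radius=1, inhibition=0.05):
    """One update for output neurons arranged on a 1-D chain.

    W has shape (m, d): row i is the weight vector of output neuron i.
    """
    # Competition: the winner is the neuron whose weight is closest to x.
    winner = int(np.argmin(((W - x) ** 2).sum(axis=1)))
    # Distance on the output grid (not in input space) to the winner.
    grid_dist = np.abs(np.arange(len(W)) - winner)
    lam = window(grid_dist, radius, inhibition)
    # Neurons inside the window move toward x, those outside move away.
    W_new = W + eta * lam[:, None] * (x - W)
    return W_new, winner

W = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [5.0, 5.0]])
x = np.array([1.2, 1.2])
W_new, winner = som_step(W, x, eta=0.5)
```

If the window is shrunk to the winner alone, the update reduces to an online version of moving the nearest center toward each sample, which is one way to approach the VQ and K-means question above.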
Adjacency Graph and Affinity Matrix
We can represent the set of data $x_1, \ldots, x_N$ by a graph G = {V, E}
Each vertex represents an individual data point
Each edge represents the adjacency of two data points
And the weight of the edge represents the affinity of the two points
For example
$$ A_{ij} = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right) $$
i.e., the similarity of two points
Thus, the data set can be viewed as a weighted adjacency graph
More importantly, it can also be viewed as an affinity matrix A
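As a small illustration, the Gaussian affinity above can be computed for all pairs at once. The function name and the toy data are hypothetical, and σ is a free scale parameter that has to be chosen for the data at hand.

```python
import numpy as np

def affinity_matrix(X, sigma):
    """A_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)) for the rows x_i of X."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

# Two nearby points and one distant point.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0]])
A = affinity_matrix(X, sigma=1.0)
```

Here the entry for the two nearby points is exp(-0.5), while the entries linking either of them to the distant point are practically zero.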
Block-diagonalization: Idea
If the data are grouped, then the affinity matrix is pretty much block-diagonal
Now, clustering can be treated as the task of finding the best permutation to block-diagonalize A
More specifically, the sum of the affinity values in the off-diagonal blocks is minimized
or, equivalently, the sum over the diagonal blocks is maximized
Block-diagonalization: Formulation
Introduce an association vector (i.e., a projection) $w_k$ for each cluster component,
$$ w_k = [w_{k1}, w_{k2}, \ldots, w_{kN}]^T $$
where $w_{ki}$ is the association of $x_i$ to the cluster k.
Positive $w_{ki}$ indicates that $x_i$ is in cluster k to some extent, and negative otherwise.
Usually, such a projection vector is normalized, i.e., we have:
$$ w_k^T w_k = 1, \quad \forall k = 1, \ldots, K $$
Now we can formulate the problem as
$$ w_k^* = \arg\max_{w_k} w_k^T A w_k \quad \text{s.t.} \quad w_k^T w_k = 1 $$
Spectral Analysis
The solution is easy
The Lagrangian
$$ L = w_k^T A w_k + \lambda (1 - w_k^T w_k) $$
It is clear that
$$ \frac{\partial L}{\partial w_k} = 2 A w_k - 2 \lambda w_k = 0 \;\Rightarrow\; A w_k = \lambda w_k $$
What is this?
$w_k$, an eigenvector, indicates the association of the data with cluster k
The “size” of the cluster is given by the eigenvalue λ
More significantly, we don’t need to know K in advance!
The number of significant eigenvalues λ tells us K
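A toy check of these statements, assuming an idealized affinity matrix with two perfect blocks of sizes 3 and 2 (all within-cluster affinities 1, all others 0). The helper name and the membership threshold are hypothetical choices made for the illustration.

```python
import numpy as np

def leading_eigenpairs(A, n):
    """Return the n largest eigenvalues of symmetric A and their eigenvectors."""
    vals, vecs = np.linalg.eigh(A)          # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:n]      # indices of the n largest
    return vals[order], vecs[:, order]

# Idealized block-diagonal affinity: cluster {0, 1, 2} and cluster {3, 4}.
A = np.zeros((5, 5))
A[:3, :3] = 1.0
A[3:, 3:] = 1.0

vals, vecs = leading_eigenpairs(A, 2)
# Assumed rule: entries that are clearly nonzero mark the cluster members.
members = np.abs(vecs) > 1e-6
```

In this idealized case the two significant eigenvalues are 3 and 2, the cluster sizes, and all remaining eigenvalues are zero, so counting the significant ones recovers K = 2.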
A Problem
Ideally, we can check the values of wki for grouping
But life is always complicated
Suppose A has two identical eigenvalues
$$ A w_1 = \lambda w_1, \quad A w_2 = \lambda w_2 $$
It is easy to see that any linear combination of $w_1$ and $w_2$ is also a valid eigenvector
$$ A(a_1 w_1 + a_2 w_2) = \lambda (a_1 w_1 + a_2 w_2) $$
This means that we cannot simply use the values of $w = a_1 w_1 + a_2 w_2$ for grouping now
Instead of using the 1-D subspace, we need to go to the 2-D subspace spanned by $\{w_1, w_2\}$
If all the K clusters are more or less of the same size, we’ll have K similar eigenvalues. Then we have to go to the K-D subspace. This is the worst case.
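The following sketch illustrates why the 2-D subspace still works, assuming an idealized affinity matrix with two equal-sized blocks so that the leading eigenvalue is repeated. Whatever basis the solver returns for that eigenspace, each data point's row in the eigenvector matrix is shared by all points of its cluster.

```python
import numpy as np

# Two perfect clusters of equal size: the leading eigenvalue 3 is repeated.
A = np.zeros((6, 6))
A[:3, :3] = 1.0
A[3:, 3:] = 1.0

vals, vecs = np.linalg.eigh(A)   # eigenvalues in ascending order
V = vecs[:, -2:]                 # basis of the 2-D leading eigenspace

# Row i of V is the 2-D coordinate of data point i in that subspace.
same_cluster_rows_match = np.allclose(V[0], V[1]) and np.allclose(V[3], V[4])
cross_cluster_dot = float(V[0] @ V[3])
```

The individual columns of V may be arbitrary mixtures of the two cluster indicators, but the rows remain constant within a cluster and orthogonal across clusters, so grouping the rows (for instance with K-means, as an assumed follow-up step) recovers the clusters.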
Graph Cut
We may view the problem from another point of view: graph cut
We still represent the data set by the affinity graph
Suppose we want to divide the data set into two clusters; we need to find the set of “weakest links” between the subgraphs, each of which corresponds to one cluster
A set of edges whose removal separates the graph into two disconnected subgraphs is called a cut
Now, we need to find a minimum cut, i.e., the set of “weakest links”
But there is a degenerate solution here: separating an isolated vertex gives the minimum cut
In other words, the cut does not balance the sizes of the clusters
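Normalized Cut
So, the cut needs to be normalized.
Suppose we partition V into A and B. $x \in \{-1, 1\}^N$ is the indicator: $x_i = 1$ if the i-th data point is in A, and $-1$ otherwise.
Let $d_i = \sum_j A_{ij}$ be the total connection from the i-th data point to all others
Define the normalized cut
$$ \mathrm{NCut}(A, B) = \frac{\mathrm{cut}(A, B)}{\mathrm{asso}(A, V)} + \frac{\mathrm{cut}(B, A)}{\mathrm{asso}(B, V)} = \frac{\sum_{x_i > 0,\, x_j < 0} -A_{ij} x_i x_j}{\sum_{x_i > 0} d_i} + \frac{\sum_{x_i < 0,\, x_j > 0} -A_{ij} x_i x_j}{\sum_{x_i < 0} d_i} $$
Denote
$$ D = \mathrm{diag}\{d_1, \ldots, d_N\}, \quad k = \frac{\sum_{x_i > 0} d_i}{\sum_i d_i}, \quad b = \frac{k}{1 - k} $$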
Normalized Cut
Define y = (1 + x) − b(1 − x)
Shi & Malik (1997) gave a nice formulation [1]
$$ \min_x \mathrm{NCut}(x) = \min_y \frac{y^T (D - A) y}{y^T D y} \quad \text{s.t.} \quad y_i \in \{1, -b\}, \; y^T D \mathbf{1} = 0 $$
This amounts to solving a generalized eigenvalue problem under constraints
$$ (D - A) y = \lambda D y $$
They showed that the eigenvector associated with the second smallest eigenvalue can be used to bipartition the graph
[1] J. Shi and J. Malik, Normalized Cuts and Image Segmentation, CVPR’97
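A minimal NumPy sketch of the resulting bipartition follows. It solves the generalized problem through the equivalent symmetric form D^{-1/2}(D − A)D^{-1/2} z = λz with y = D^{-1/2} z, and it splits the relaxed eigenvector at zero; the zero threshold, the function name, and the toy affinity matrix are assumptions made for illustration.

```python
import numpy as np

def ncut_bipartition(A):
    """Bipartition from the second smallest generalized eigenvector.

    Solves (D - A) y = lambda D y via the symmetric matrix
    D^{-1/2} (D - A) D^{-1/2}, then recovers y = D^{-1/2} z.
    """
    d = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L_sym = d_inv_sqrt[:, None] * (np.diag(d) - A) * d_inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(L_sym)      # eigenvalues in ascending order
    y = d_inv_sqrt * vecs[:, 1]             # second smallest eigenvalue
    # Assumed splitting rule: threshold the relaxed solution at zero.
    return y > 0, vals[1]

# Two groups of three points: strong links inside, weak links across.
A = np.full((6, 6), 0.01)
A[:3, :3] = 1.0
A[3:, 3:] = 1.0

in_first_part, lam = ncut_bipartition(A)
```

The sign of the eigenvector is arbitrary, so which group ends up labelled `True` carries no meaning; only the split itself does.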
Generative Model and Missing Data
Assume each image pixel is produced by a probability density associated with one of the g image segments.
The data generation process: we first choose an image segment, and then generate the pixel from that segment's density, which gives:
$$ p(x) = \sum_i p(x|\theta_i)\,\pi_i $$
where $\pi_i$ is the prior for the i-th image segment, and $\theta_i$ is its parameter.
We can use a Gaussian for each component:
$$ p(x|\theta_i) \sim G(\mu_i, \Sigma_i) $$
Associate a label $l_k$ with each $x_k$ to indicate its identity
This mixture model is a generative model.
The data labels are missing.
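The generation process can be sketched directly: pick a segment from the prior, then draw the pixel from that segment's Gaussian. The function name and the parameter values below are hypothetical; in a real segmentation problem only the samples would be observed, not the labels returned here.

```python
import numpy as np

def sample_mixture(pi, mus, Sigmas, n, seed=0):
    """Draw n samples from a Gaussian mixture, returning samples and labels."""
    rng = np.random.default_rng(seed)
    # Step 1: choose an image segment for every sample according to pi.
    labels = rng.choice(len(pi), size=n, p=pi)
    # Step 2: generate each sample from the chosen segment's Gaussian.
    X = np.stack([rng.multivariate_normal(mus[l], Sigmas[l]) for l in labels])
    return X, labels

# Hypothetical two-segment model in a 2-D feature space.
pi = [0.3, 0.7]
mus = [np.zeros(2), np.full(2, 10.0)]
Sigmas = [np.eye(2), np.eye(2)]
X, labels = sample_mixture(pi, mus, Sigmas, n=500)
```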
Formulation
So, our task is to do the inverse.
Given a set of data points (image pixels) $X = \{x_k, k = 1, \ldots, N\}$, we need to estimate the parameters $\theta_i$, $\pi_i$, and to estimate the labels for all the data points by:
$$ l_j^* = \arg\max_k p(l_j = k | x_j, \Theta), \quad \forall x_j $$
which maximizes the posterior probability of the label of $x_j$.
Maximum Likelihood Estimation.
The likelihood of the data set can be written as:
$$ p(X|\Theta) = \prod_j \left( \sum_{i=1}^{g} p(x_j|\theta_i)\,\pi_i \right) $$
Usually, we use the log likelihood:
$$ \log p(X|\Theta) = \sum_j \log\left( \sum_{i=1}^{g} p(x_j|\theta_i)\,\pi_i \right) $$
But this is very ugly (why?) and intractable!
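One way to see the difficulty, assuming Gaussian components as on the previous slide, is to set the derivative of the log likelihood with respect to one mean to zero. The index m below is introduced only for the inner sum.

```latex
\frac{\partial}{\partial \mu_i} \log p(X|\Theta)
  = \sum_j \frac{\pi_i \, p(x_j|\theta_i)}{\sum_{m=1}^{g} \pi_m \, p(x_j|\theta_m)}
    \, \Sigma_i^{-1} (x_j - \mu_i) = 0
```

The weight in front of each term depends on the parameters of every component, including $\mu_i$ itself, so the stationarity conditions are coupled and cannot be solved for $\mu_i$ in closed form.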
Missing Data and Indicator Variable
Introduce an indicator variable z:
$$ z = [z_1, z_2, \ldots, z_g]^T $$
If a data point x is drawn from the k-th component, then $z_k = 1$, and all other $z_i = 0$, $i \neq k$.
This indicator variable tells the identity of a data point
It is the missing part!
Why do we need it?
Good News!
Let’s form the complete data:
$$ y_k = \begin{bmatrix} x_k \\ z_k \end{bmatrix} $$
And the complete data set is $Y = \{y_k, k = 1, \ldots, N\}$.
The likelihood of the complete data point $y_k$:
$$ p(y_k|\Theta) = \sum_{i=1}^{g} z_{ki}\, p(x_k|\theta_i) $$
$$ \log p(y_k|\Theta) = \sum_{i=1}^{g} z_{ki} \log p(x_k|\theta_i) $$
So, for the whole data set, we have
$$ p(Y|\Theta) = \prod_{k=1}^{N} \sum_{i=1}^{g} z_{ki}\, p(x_k|\theta_i) $$
Good News and Bad News
And thus:
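$$ \log p(Y|\Theta) = \sum_{k=1}^{N} \log\left( \sum_{i=1}^{g} z_{ki}\, p(x_k|\theta_i) \right) = \sum_{k=1}^{N} \sum_{i=1}^{g} z_{ki} \log p(x_k|\theta_i) $$
The second equality holds because exactly one $z_{ki}$ equals 1 for each k.
Because we eliminate the summation inside the log, the ML estimation becomes easier:
$$ \Theta^* = \arg\max_\Theta \log p(Y|\Theta) $$
However, the bad news is that the indicator variable $z_k$ makes the ML estimation difficult, since we do not know $z_k$.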
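A quick numeric check of why the indicator removes the sum inside the log, for a single data point. The three density values are made-up numbers, and the indicator says the point came from the third component.

```python
import numpy as np

# Hypothetical values of p(x_k | theta_i) for g = 3 components.
p = np.array([0.2, 0.05, 0.6])
# One-hot indicator z_k: the point was drawn from the third component.
z = np.array([0.0, 0.0, 1.0])

log_of_sum = np.log(np.dot(z, p))    # log( sum_i z_ki p(x_k | theta_i) )
sum_of_logs = np.dot(z, np.log(p))   # sum_i z_ki log p(x_k | theta_i)
```

Both expressions equal log(0.6), because the one-hot vector keeps exactly one term of the sum.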
Expectation-Maximization Iteration
Fortunately, life won’t be too bad.
A quite interesting phenomenon:
  If we know the $z_k$, i.e., we know the identity of each data point, we can easily estimate the density parameters Θ by ML.
  At the same time, if we know the density parameters, we can easily solve for the indicator variables $z_k$ by MAP.
This phenomenon suggests an iterative procedure:
  E-step: compute the expected value of the complete data, here only $E[z_k]$;
  M-step: maximize the log likelihood of the complete data to estimate Θ.
It converges to a local maximum of the likelihood.
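EM for Image Segmentation
Let’s apply EM to image segmentation:
E-step:
$$ \begin{aligned} E[z_{ki}] &= 1 \cdot p(\text{$k$th pixel comes from $i$th component}) + 0 \cdot p(\text{$k$th pixel doesn't come from $i$th component}) \\ &= p(\text{$k$th pixel comes from $i$th component}) \\ &= \frac{\pi_i\, p(x_k|\theta_i)}{\sum_{j=1}^{g} \pi_j\, p(x_k|\theta_j)} \end{aligned} $$
M-step:
$$ \pi_i = \frac{1}{r} \sum_{l=1}^{r} p(i|x_l, \Theta) $$
$$ \mu_i = \frac{\sum_{l=1}^{r} x_l\, p(i|x_l, \Theta)}{\sum_{l=1}^{r} p(i|x_l, \Theta)} $$
$$ \Sigma_i = \frac{\sum_{l=1}^{r} p(i|x_l, \Theta)\,(x_l - \mu_i)(x_l - \mu_i)^T}{\sum_{l=1}^{r} p(i|x_l, \Theta)} $$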
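The E-step and M-step above can be put together as in the sketch below. It reads r as the number of pixels and p(i | x_l, Θ) as the E-step responsibility; the initialization (caller-supplied means, uniform priors, the overall data covariance for every component), the small regularizer added to each covariance, and the fixed iteration count are assumptions, and the function names are hypothetical.

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """Multivariate normal density evaluated at every row of X."""
    d = X.shape[1]
    diff = X - mu
    solved = np.linalg.solve(Sigma, diff.T).T
    expo = -0.5 * (diff * solved).sum(axis=1)
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(Sigma))
    return np.exp(expo) / norm

def e_step(X, pi, mus, Sigmas):
    """Responsibilities E[z_ki] = pi_i p(x_k|theta_i) / sum_j pi_j p(x_k|theta_j)."""
    weighted = np.stack(
        [pi[i] * gaussian_pdf(X, mus[i], Sigmas[i]) for i in range(len(pi))],
        axis=1,
    )
    return weighted / weighted.sum(axis=1, keepdims=True)

def m_step(X, resp, reg=1e-6):
    """Update pi_i, mu_i, Sigma_i from the responsibilities."""
    r, d = X.shape
    totals = resp.sum(axis=0)
    pi = totals / r
    mus = (resp.T @ X) / totals[:, None]
    Sigmas = []
    for i in range(resp.shape[1]):
        diff = X - mus[i]
        cov = (resp[:, i, None] * diff).T @ diff / totals[i]
        Sigmas.append(cov + reg * np.eye(d))  # assumed regularizer
    return pi, mus, np.stack(Sigmas)

def em_gmm(X, mus, n_iter=30):
    r, d = X.shape
    mus = np.asarray(mus, dtype=float)
    g = len(mus)
    pi = np.full(g, 1.0 / g)
    Sigmas = np.stack([np.cov(X.T).reshape(d, d)] * g)
    for _ in range(n_iter):
        resp = e_step(X, pi, mus, Sigmas)
        pi, mus, Sigmas = m_step(X, resp)
    labels = e_step(X, pi, mus, Sigmas).argmax(axis=1)
    return pi, mus, Sigmas, labels

# Toy "image": 400 gray levels drawn from two intensity populations.
rng = np.random.default_rng(0)
pixels = np.concatenate(
    [rng.normal(50.0, 5.0, 200), rng.normal(200.0, 5.0, 200)]
).reshape(-1, 1)
pi, mus, Sigmas, labels = em_gmm(pixels, mus=[[pixels.min()], [pixels.max()]])
```

Starting the means at the darkest and brightest pixel is a deliberate choice for this toy case: since EM only finds a local maximum, a poor start (for example, two nearly identical means) can converge very slowly or to a poor solution.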
Remaining Issues
Structural parameters
  EM assumes a known number of components
  A common problem in clustering
  What if we don’t know it?
  Minimum Description Length (MDL) principle in theory
  Cross-validation in practice
Curse of dimensionality
  What if the dimensionality of x is very high?
  Too many parameters to estimate
  Requires a huge amount of training data
  Otherwise, the estimation is heavily biased
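To make "too many parameters" concrete, the sketch below counts the free parameters of a g-component Gaussian mixture, assuming full (unconstrained symmetric) covariance matrices as in the M-step formulas; the function name is hypothetical.

```python
def gmm_param_count(g, d):
    """Free parameters of a g-component, d-dimensional Gaussian mixture."""
    priors = g - 1                       # the pi_i must sum to one
    means = g * d                        # one d-vector per component
    covariances = g * d * (d + 1) // 2   # one symmetric d x d matrix each
    return priors + means + covariances

small = gmm_param_count(g=3, d=2)
large = gmm_param_count(g=3, d=100)
```

The covariance term grows quadratically with the dimension, which is why the amount of training data needed rises so quickly.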