
Segmentation: Clustering, Graph Cut and EM

Ying Wu

Electrical Engineering and Computer Science

Northwestern University, Evanston, IL 60208

[email protected]

http://www.eecs.northwestern.edu/~yingwu


Outline

Motivations and Applications

Image Segmentation by Clustering
    K-Means Algorithm
    Self-Organizing Map

Image Segmentation by Graph Cut
    Basic Idea
    Block-diagonalization

Segmentation by Expectation-Maximization
    Missing Data Problem
    E-M Iteration
    Issues Remained

Segmentation is a Fundamental Problem

To group similar components, such as image pixels, image regions, or even video clips.

It is an ill-posed problem!

How do we define the similarity measurement?

Background Subtraction

Video surveillance applications

Separate the “foreground” from “background”

Assume fixed camera

Subtract background from images

Adaptive scheme

$$B_{n+1} = \frac{w_a F + \sum_i w_i B_{n-i}}{w_c}$$

by selecting $w_a$, $w_i$ and $w_c$.
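The adaptive update above can be sketched in a few lines. This is only an illustration under stated assumptions: F is read as the current frame and B_n as the n-th background estimate, images are flat lists of gray levels, and the function names, the weights, and the threshold are hypothetical choices, not values from the slides.

```python
def update_background(frame, past_backgrounds, w_a, w_i, w_c):
    """Return B_{n+1} = (w_a * F + sum_i w_i[i] * B_{n-i}) / w_c, pixel by pixel.

    frame: current image F as a flat list of pixel values (assumed layout)
    past_backgrounds: [B_n, B_{n-1}, ...], each a flat list like frame
    w_i: one weight per past background
    """
    if len(w_i) != len(past_backgrounds):
        raise ValueError("need one weight per past background")
    new_background = []
    for p in range(len(frame)):
        total = w_a * frame[p]
        for weight, background in zip(w_i, past_backgrounds):
            total += weight * background[p]
        new_background.append(total / w_c)
    return new_background


def foreground_mask(frame, background, threshold):
    """Mark pixels whose absolute difference from the background exceeds threshold."""
    return [abs(f - b) > threshold for f, b in zip(frame, background)]


# Toy 4-pixel "image"; choosing w_c = w_a + sum(w_i) keeps the intensity scale.
frame = [10.0, 10.0, 200.0, 10.0]
history = [[10.0, 10.0, 10.0, 10.0], [12.0, 8.0, 10.0, 10.0]]
background = update_background(frame, history, w_a=0.1, w_i=[0.6, 0.3], w_c=1.0)
mask = foreground_mask(frame, background, threshold=50.0)
```

With these toy numbers only the third pixel, which jumps to 200, is flagged as foreground, while the background estimate at that pixel moves only slightly toward the new value.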


Object Modeling

Represent an object by regions

First step for recognition

Represent a scene by a set of layers

Motion segmentation

Image segmentation vs. motion segmentation

Basic Approaches

Segmentation by clustering

Segmentation by graph cut

Segmentation by EM algorithm



K-Means Clustering

Assume the number of clusters, K , is given.

Use the center $C_i$ of each cluster to represent that cluster.

How do we determine the identity of a data point?

Need to define a distance measurement $D(x, y)$, e.g., $D(x, y) = \|x - y\|^2$.

Winner takes all:

$$l_k(x_k) = \arg\min_i D(x_k, C_i) = \arg\min_i \|x_k - C_i\|^2$$

where $l_k$ is the label for the data point $x_k$.

K-means finds the clusters to minimize the total distortion:

$$\phi(X, C) = \sum_{i \in C} \sum_{j \in i\text{-th cluster}} \|x_j - C_i\|^2$$

K-Means Clustering

To minimize $\phi$, the K-means algorithm iterates between two steps:

Labelling: assume the $p$-th iteration ends up with a set of cluster centers $C_i^{(p)}$, $i = 1, \ldots, K$. We label each data point based on this set of cluster centers, i.e., $\forall x_k$, find

$$l_k^{(p+1)}(x_k) = \arg\min_i \|x_k - C_i^{(p)}\|^2$$

and group the data points that belong to the same cluster:

$$\Omega_j = \{x_k : l_k(x_k) = j\}$$

Re-centering: re-calculate the centers:

$$C_i^{(p+1)} = \frac{\sum_{x_k \in \Omega_i} x_k}{|\Omega_i|}$$

Iterate between labelling and re-centering until convergence.
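The two steps can be sketched as follows. This is a minimal illustration, assuming points are tuples of floats, the initial centers are supplied by the caller, and a cluster that ends up empty simply keeps its old center (an assumption; the slides do not say how to handle that case). The name `kmeans` and its return values are illustrative.

```python
def squared_distance(x, c):
    """D(x, c) = ||x - c||^2."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def kmeans(points, initial_centers, max_iters=100):
    """Alternate labelling and re-centering until the labels stop changing."""
    centers = [list(c) for c in initial_centers]
    labels = None
    for _ in range(max_iters):
        # Labelling: winner takes all, each point joins its nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda i: squared_distance(x, centers[i]))
            for x in points
        ]
        if new_labels == labels:
            break  # converged: no label changed
        labels = new_labels
        # Re-centering: each center becomes the mean of its members.
        for i in range(len(centers)):
            members = [x for x, l in zip(points, labels) if l == i]
            if members:  # assumption: an empty cluster keeps its old center
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    # Total distortion phi(X, C) for the final labelling.
    distortion = sum(squared_distance(x, centers[l]) for x, l in zip(points, labels))
    return labels, centers, distortion


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers, distortion = kmeans(points, initial_centers=[(0, 0), (10, 10)])
```

On this toy data the first three points end up in one cluster and the last three in the other. The result depends on the initial centers, since the iteration only reaches a local minimum of the distortion.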


Self-Organizing Map (SOM)

SOM can be used for visualizing high-dim data

Map to a low-dim space based on competitive learning

A two-layer neural network

[Figure: a two-layer network in which the inputs x1, x2, x3 are connected through weights to the outputs ξ1, ξ2, ..., ξm-1, ξm]

The number of neurons in the input layer is the same as the dimension of the input vector.

Connection weights Wk for each output neuron.


Competitive Learning

For an input x , all neurons compete against each other

The winner is the one whose weight is the closest to the input:

$$y_k^* = \arg\min_i D(x_k, W_i)$$

The index of the winner is taken as the output of SOM.

Adjust the weight of the winner

Train the neurons nearby, and counter-train those far away.

A window function $\Lambda(|y - y_k^*|)$ and the Hebbian learning rule:

$$W(t+1) = W(t) + \eta(t)\,\Lambda(|y - y_k^*|)\,(x_k - W_{y_k^*}(t))$$

Intuition: the input data point attracts the neurons inside the window toward its location, but pushes the neurons outside the window away.

Relation to vector quantization (VQ) and K-means clustering?
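One competitive-learning update can be sketched as below. Several things here are assumptions rather than content from the slides: the output neurons sit on a 1-D chain, the window Λ is +1 within `radius` of the winner and a small negative constant `-push` elsewhere, and each neuron y is moved relative to its own weight, i.e. the common SOM form x - W_y(t). The function name `som_step` is illustrative.

```python
def som_step(weights, x, eta, radius, push=0.1):
    """One competitive-learning update; returns the index of the winner.

    weights: list of weight vectors W_i, one per output neuron (updated in place)
    x: input vector
    eta: learning rate eta(t) for this step
    radius: half-width of the window, measured in neuron indices
    push: strength of the counter-training outside the window (assumed)
    """
    def dist2(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    # Competition: the winner is the neuron whose weight is closest to x.
    winner = min(range(len(weights)), key=lambda i: dist2(x, weights[i]))

    for y, w in enumerate(weights):
        # Window function: train inside the window, counter-train outside.
        window = 1.0 if abs(y - winner) <= radius else -push
        for d in range(len(w)):
            w[d] += eta * window * (x[d] - w[d])
    return winner


weights = [[0.0], [1.0], [2.0]]
winner = som_step(weights, x=[1.2], eta=0.5, radius=0)
```

Here neuron 1 wins and moves toward the input (1.0 to 1.1), while neurons 0 and 2 are nudged away from it. If the window is shrunk to the winner alone and the counter-training is switched off (`push=0`), the update is an online nearest-center move, which is one way to see the relation to VQ and K-means.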



Adjacency Graph and Affinity Matrix

We can represent the set of data $x_1, \ldots, x_N$ by a graph $G = \{V, E\}$

Each vertex represents an individual data point

Each edge represents the adjacency of two data points

And the weight of the edge represents the affinity of the two points

For example

$$A_{ij} = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$$

i.e., the similarity of two points

Thus, the data set can be viewed as a weighted adjacency graph

More importantly, it can also be viewed as an affinity matrix A
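As a small illustration, the Gaussian affinity can be computed directly from its definition. The function name `affinity_matrix`, the toy points, and the choice of σ = 1 are assumptions made for this sketch.

```python
import math


def affinity_matrix(points, sigma):
    """A[i][j] = exp(-||x_i - x_j||^2 / (2 * sigma^2)) for every pair of points."""
    n = len(points)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            A[i][j] = math.exp(-d2 / (2.0 * sigma ** 2))
    return A


# Two well-separated pairs of points, listed group by group.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
A = affinity_matrix(points, sigma=1.0)
```

Because the points are listed group by group, the large entries of `A` fall in two 2x2 blocks on the diagonal and the cross-group entries are close to zero. The value of σ controls how quickly affinity decays with distance.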


Block-diagonalization: Idea

If the data are grouped, then the affinity matrix is approximately block-diagonal

Now, clustering can be treated as the task of finding the best permutation to block-diagonalize $A$

More specifically, the sum of the affinity values in the off-diagonal blocks is minimized

or the sum of the affinity values in the diagonal blocks is maximized

Block-diagonalization: Formulation

Introduce an association vector (i.e., a projection) $w_k$ for each cluster component,

$$w_k = \begin{bmatrix} w_{k1} \\ w_{k2} \\ \vdots \\ w_{kN} \end{bmatrix}$$

where $w_{ki}$ is the association of $x_i$ to the cluster $k$.

Positive $w_{ki}$ indicates that $x_i$ is in cluster $k$ to some extent, and negative otherwise

Usually, such a projection vector is normalized, i.e., we have:

$$w_k^T w_k = 1, \quad \forall k = 1, \ldots, K$$

Now we can formulate the problem as

$$w_k^* = \arg\max_{w_k} w_k^T A w_k \quad \text{s.t.} \quad w_k^T w_k = 1$$

Spectral Analysis

The solution is easy

The Lagrangian

$$L = w_k^T A w_k + \lambda(1 - w_k^T w_k)$$

It is clear that

$$\frac{\partial L}{\partial w_k} = 2Aw_k - 2\lambda w_k = 0 \;\Rightarrow\; Aw_k = \lambda w_k$$

What is this?

$w_k$, an eigenvector, indicates the association of data with cluster $k$

The “size” of the cluster is given by the eigenvalue λ

More significantly, we don’t need to know K in advance!

The significant λs tell K
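A rough way to see this numerically is power iteration, which converges to the eigenvector of the largest eigenvalue. The sketch below assumes a symmetric affinity matrix with non-negative entries and a unique largest eigenvalue; the function name `leading_eigenpair` and the toy matrix are illustrative, and a real application would use a linear algebra library instead.

```python
import math


def leading_eigenpair(A, iters=500):
    """Power iteration: returns (lambda, w) with A w ~ lambda w and w^T w = 1."""
    n = len(A)
    w = [1.0 / math.sqrt(n)] * n  # normalized starting vector
    for _ in range(iters):
        Aw = [sum(A[i][j] * w[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in Aw))
        w = [v / norm for v in Aw]  # keep the constraint w^T w = 1
    Aw = [sum(A[i][j] * w[j] for j in range(n)) for i in range(n)]
    lam = sum(w[i] * Aw[i] for i in range(n))  # Rayleigh quotient w^T A w
    return lam, w


# Block-diagonal toy affinity: points 0-1 form one group, points 2-3 another.
A = [
    [1.0, 0.9, 0.0, 0.0],
    [0.9, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.5],
    [0.0, 0.0, 0.5, 1.0],
]
lam, w = leading_eigenpair(A)
```

For this matrix the leading eigenvalue is 1.9 and the eigenvector is nonzero only on points 0 and 1, so its entries pick out the first group. The second group corresponds to the next eigenvalue, 1.5.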


A Problem

Ideally, we can check the values of wki for grouping

But life is always complicated

Suppose A has two identical eigenvalues

Aw1 = λw1, and Aw2 = λw2

It is easy to see that any linear combination of $w_1$ and $w_2$ also gives a valid eigenvector

$$A(a_1 w_1 + a_2 w_2) = \lambda(a_1 w_1 + a_2 w_2)$$

This means that we cannot simply use the values of $w = a_1 w_1 + a_2 w_2$ for grouping now

Instead of using the 1-D subspace, we need to go to the 2-D subspace spanned by $w_1, w_2$

If all the $K$ clusters are more or less of the same size, we’ll have $K$ similar eigenvalues. Then we have to go to the $K$-D subspace. This is the worst case.

Graph Cut

We may view the problem from another point of view: graph cut

We still represent the data set by the affinity graph

Suppose we want to divide the data set into two clusters: we need to find the set of “weakest links” between the subgraphs, each of which corresponds to one cluster

A set of edges whose removal separates the graph into two parts is called a cut

Now, we need to find a minimum cut for the “weakest links”

But we have a degenerate case here: separating an isolated vertex gives the minimum cut

In other words, the cut does not balance the sizes of the clusters

Normalized Cut

So, the cut needs to be normalized.

Suppose we partition $V$ into $A$ and $B$. $z \in \{-1, 1\}^N$ is the indicator: $z_i = 1$ if $x_i$ is in $A$, and $-1$ otherwise.

Let $d_i = \sum_j A_{ij}$ be the total connection from $x_i$ to all others (here $A_{ij}$ is still the affinity, while $A$ and $B$ in $NCut(A, B)$ name the two parts)

Define the normalized cut

$$NCut(A, B) = \frac{cut(A, B)}{asso(A, V)} + \frac{cut(B, A)}{asso(B, V)} = \frac{\sum_{z_i > 0,\, z_j < 0} -A_{ij} z_i z_j}{\sum_{z_i > 0} d_i} + \frac{\sum_{z_i < 0,\, z_j > 0} -A_{ij} z_i z_j}{\sum_{z_i < 0} d_i}$$

Denote

$$D = \mathrm{diag}\{d_1, \ldots, d_N\}, \quad k = \frac{\sum_{z_i > 0} d_i}{\sum_i d_i}, \quad b = \frac{k}{1 - k}$$

Normalized Cut

Define $y = (1 + z) - b(1 - z)$

Shi & Malik (1997) gave a nice formulation [1]

$$\min_z NCut(z) = \min_y \frac{y^T (D - A) y}{y^T D y} \quad \text{s.t.} \quad y_i \in \{1, -b\}, \; y^T D \mathbf{1} = 0$$

This is to solve a generalized EVD under constraints

$$(D - A) y = \lambda D y$$

They showed that the eigenvector associated with the 2nd smallest eigenvalue is able to bipartition the graph

[1] J. Shi and J. Malik, Normalized Cuts and Image Segmentation, CVPR’97
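The difference between the plain cut and the normalized cut can be checked on a tiny graph. The sketch below evaluates both quantities for a given two-way partition; it does not solve the eigenproblem. The weight matrix `W`, the function names, and the choice of zero self-affinity are assumptions made for the illustration.

```python
def cut_value(W, in_a):
    """Sum of the weights on edges going from part A to part B."""
    n = len(W)
    return sum(W[i][j] for i in range(n) for j in range(n)
               if in_a[i] and not in_a[j])


def ncut_value(W, in_a):
    """NCut(A, B) = cut / asso(A, V) + cut / asso(B, V) for a symmetric W."""
    n = len(W)
    degree = [sum(row) for row in W]  # d_i: total connection of vertex i
    cut = cut_value(W, in_a)
    asso_a = sum(degree[i] for i in range(n) if in_a[i])
    asso_b = sum(degree[i] for i in range(n) if not in_a[i])
    return cut / asso_a + cut / asso_b


# Two tight pairs {0, 1} and {2, 3}, weakly linked, plus a nearly isolated vertex 4.
W = [
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.2, 0.0, 0.0],
    [0.0, 0.2, 0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.1],
    [0.0, 0.0, 0.0, 0.1, 0.0],
]
isolate = [False, False, False, False, True]   # A = {4}
balanced = [True, True, False, False, False]   # A = {0, 1}

cut_isolate, cut_balanced = cut_value(W, isolate), cut_value(W, balanced)
ncut_isolate, ncut_balanced = ncut_value(W, isolate), ncut_value(W, balanced)
```

The plain cut prefers to split off vertex 4 (0.1 against 0.2), while the normalized cut strongly prefers the balanced split (about 0.17 against about 1.02), which is the behaviour the normalization is meant to produce.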



Generative Model and Missing Data

Assume each image pixel is produced by a probability density associated with one of the $g$ image segments.

The data generation process: we first choose an image segment, and then generate the pixel based on:

$$p(x) = \sum_i p(x|\theta_i)\,\pi_i$$

where $\pi_i$ is the prior for the $i$-th image segment, and $\theta_i$ is the parameter.

We can use a Gaussian for each component:

$$p(x|\theta_i) \sim G(\mu_i, \Sigma_i)$$

Associate a label $l_k$ with each $x_k$ for its identity

This mixture model is a generative model.

The data labels are missing.
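The generation process can be sketched as a two-stage sampler: pick a segment according to the priors, then draw the pixel from that segment's Gaussian. The sketch assumes scalar (gray-level) pixels, so each component has a mean and a standard deviation instead of a full covariance; the function name and the parameter values are illustrative.

```python
import random


def sample_mixture(n, priors, means, sigmas, seed=0):
    """Draw n (pixel, label) pairs from a 1-D Gaussian mixture."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    samples = []
    for _ in range(n):
        # Step 1: choose the image segment according to the priors pi_i.
        label = rng.choices(range(len(priors)), weights=priors)[0]
        # Step 2: generate the pixel from that segment's density.
        x = rng.gauss(means[label], sigmas[label])
        samples.append((x, label))
    return samples


samples = sample_mixture(200, priors=[0.3, 0.7], means=[50.0, 180.0], sigmas=[5.0, 10.0])
pixels = [x for x, _ in samples]   # what we observe
labels = [l for _, l in samples]   # what is missing in the segmentation problem
```

In segmentation only `pixels` is observed; `labels` is exactly the missing part that has to be recovered.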


Formulation

So, our task is to do the inverse.

Given a set of data points (image pixels) $X = \{x_k, k = 1, \ldots, N\}$, we need to estimate the parameters $\theta_i$, $\pi_i$, and to estimate the labels for all the data points by:

$$l_j^* = \arg\max_k p(l_j = k \,|\, x_j, \Theta), \quad \forall x_j$$

i.e., each $x_j$ takes the label with the largest posterior probability.

Maximum Likelihood Estimation. The likelihood of the data set can be written as:

$$p(X|\Theta) = \prod_j \left( \sum_{i=1}^{g} p(x_j|\theta_i)\,\pi_i \right)$$

Usually, we use the log likelihood:

$$\log p(X|\Theta) = \sum_j \log\left( \sum_{i=1}^{g} p(x_j|\theta_i)\,\pi_i \right)$$

But this is very ugly (why?) and intractable!
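Evaluating this log likelihood for given parameters is straightforward; the difficulty lies in maximizing it, because the sum sits inside the log. A minimal sketch for scalar pixels with 1-D Gaussian components (an assumption made for brevity; the function names are illustrative):

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def log_likelihood(xs, priors, means, sigmas):
    """log p(X | Theta) = sum_j log( sum_i p(x_j | theta_i) * pi_i )."""
    total = 0.0
    for x in xs:
        mix = sum(p * gaussian_pdf(x, m, s) for p, m, s in zip(priors, means, sigmas))
        total += math.log(mix)  # the sum over components stays inside the log
    return total


single = log_likelihood([0.0], priors=[1.0], means=[0.0], sigmas=[1.0])
split = log_likelihood([0.0], priors=[0.5, 0.5], means=[0.0, 0.0], sigmas=[1.0, 1.0])
```

Both calls describe the same density, so they return the same value, -0.5 log(2π). Setting the derivative of this function to zero couples all components through the inner sum, which is why there is no closed-form maximizer.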

Missing Data and Indicator Variable

Introduce an indicator variable z:

$$z = \begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_g \end{bmatrix}$$

If a data point $x$ is drawn from the $k$-th component, then $z_k = 1$, and all other $z_i = 0$ for $i \neq k$.

This indicator variable tells the identity of a data point

It is the missing part!

Why do we need it?


Good News!

Let’s form the complete data:

$$y_k = \begin{bmatrix} x_k \\ z_k \end{bmatrix}$$

And the complete data set is $Y = \{y_k, k = 1, \ldots, N\}$.

The likelihood of the complete data point $y_k$:

$$p(y_k|\Theta) = \sum_{i=1}^{g} z_{ki}\, p(x_k|\theta_i)$$

$$\log p(y_k|\Theta) = \sum_{i=1}^{g} z_{ki} \log p(x_k|\theta_i)$$

So, for the whole data set, we have

$$p(Y|\Theta) = \prod_{k=1}^{N} \sum_{i=1}^{g} z_{ki}\, p(x_k|\theta_i)$$

Good News and Bad News

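And thus:

$$\log p(Y|\Theta) = \sum_{k=1}^{N} \log\left( \sum_{i=1}^{g} z_{ki}\, p(x_k|\theta_i) \right) = \sum_{k=1}^{N} \sum_{i=1}^{g} z_{ki} \log p(x_k|\theta_i)$$

The second equality holds because exactly one $z_{ki}$ equals 1 for each $k$. Because we eliminate the summation terms inside the log, the ML estimation becomes easier:

$$\Theta^* = \arg\max_\Theta \log p(Y|\Theta)$$

However, the bad news is that the indicator variables $z_k$ make the ML difficult, since we do not know $z_k$.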
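If the indicators were known, the complete-data log likelihood would reduce to a plain sum of per-component log densities. A small sketch, again assuming scalar pixels and 1-D Gaussians, with one-hot lists standing in for the vectors z_k (names are illustrative):

```python
import math


def log_gaussian(x, mu, sigma):
    """Log density of a 1-D Gaussian with mean mu and standard deviation sigma."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))


def complete_log_likelihood(xs, zs, means, sigmas):
    """log p(Y | Theta) = sum_k sum_i z_ki * log p(x_k | theta_i)."""
    total = 0.0
    for x, z in zip(xs, zs):
        for i, z_ki in enumerate(z):
            if z_ki:  # only the component that generated x_k contributes
                total += z_ki * log_gaussian(x, means[i], sigmas[i])
    return total


xs = [0.0, 5.0]
zs = [[1, 0], [0, 1]]  # first pixel from component 0, second from component 1
value = complete_log_likelihood(xs, zs, means=[0.0, 5.0], sigmas=[1.0, 1.0])
```

Each pixel touches the parameters of only one component, so the maximization splits into independent per-component problems.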


Expectation-Maximization Iteration

Fortunately, life won’t be too bad.

A quite interesting phenomenon:

    If we know such $z_k$, i.e., we know the identity of each data point, we can easily estimate the density parameters $\Theta$ based on ML.

    At the same time, if we know the density parameters, we can easily solve for the indicator variables $z_k$ based on MAP.

This phenomenon suggests an iterative procedure:

    E-step: computing an expected value of the complete data, here only $E[z_k]$;

    M-step: maximizing the log likelihood of the complete data to estimate $\Theta$.

It converges to a local maximum of the likelihood.


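EM for Image Segmentation

Let’s apply EM to image segmentation:

E-step:

$$\begin{aligned}
E[z_{ki}] &= 1 \cdot p(k\text{th pixel comes from } i\text{th component}) + 0 \cdot p(k\text{th pixel doesn't come from } i\text{th component}) \\
&= p(k\text{th pixel comes from } i\text{th component}) \\
&= \frac{\pi_i\, p(x_k|\theta_i)}{\sum_{j=1}^{g} \pi_j\, p(x_k|\theta_j)}
\end{aligned}$$

M-step:

$$\pi_i = \frac{1}{r} \sum_{l=1}^{r} p(i|x_l, \Theta)$$

$$\mu_i = \frac{\sum_{l=1}^{r} x_l\, p(i|x_l, \Theta)}{\sum_{l=1}^{r} p(i|x_l, \Theta)}$$

$$\Sigma_i = \frac{\sum_{l=1}^{r} p(i|x_l, \Theta)\,[(x_l - \mu_i)(x_l - \mu_i)^T]}{\sum_{l=1}^{r} p(i|x_l, \Theta)}$$

Here $r$ is the number of pixels (written $N$ on the earlier slides), and $p(i|x_l, \Theta)$ is the E-step value $E[z_{li}]$.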
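The E-step and M-step can be sketched together as one update function. This is a simplified illustration, not a full segmentation pipeline: pixels are assumed to be scalar gray levels, so each covariance reduces to a single variance, and a small variance floor is added as a safeguard that is not part of the slides. The function names and the toy data are illustrative.

```python
import math


def gaussian_pdf(x, mu, var):
    """Density of a 1-D Gaussian with mean mu and variance var."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def em_step(xs, priors, means, variances):
    """One EM iteration; returns updated parameters and the responsibilities."""
    g, r = len(priors), len(xs)
    # E-step: resp[l][i] = E[z_li] = pi_i p(x_l|theta_i) / sum_j pi_j p(x_l|theta_j)
    resp = []
    for x in xs:
        joint = [priors[i] * gaussian_pdf(x, means[i], variances[i]) for i in range(g)]
        total = sum(joint)
        resp.append([j / total for j in joint])
    # M-step: re-estimate pi_i, mu_i and the variance from the soft assignments.
    new_priors, new_means, new_vars = [], [], []
    for i in range(g):
        weight = sum(resp[l][i] for l in range(r))
        mu = sum(resp[l][i] * xs[l] for l in range(r)) / weight
        var = sum(resp[l][i] * (xs[l] - mu) ** 2 for l in range(r)) / weight
        new_priors.append(weight / r)
        new_means.append(mu)
        new_vars.append(max(var, 1e-6))  # assumption: floor to avoid collapse
    return new_priors, new_means, new_vars, resp


xs = [0.0, 0.2, -0.2, 10.0, 10.2, 9.8]
priors, means, variances = [0.5, 0.5], [1.0, 9.0], [1.0, 1.0]
for _ in range(20):
    priors, means, variances, resp = em_step(xs, priors, means, variances)
labels = [max(range(len(priors)), key=lambda i: row[i]) for row in resp]
```

On this toy data the means settle near 0 and 10 with equal priors, and taking the larger responsibility per pixel gives the segmentation labels. A different initialization could end in a different local maximum.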


Issues Remained

Structural parameters
    EM assumes a known number of components
    A common problem in clustering
    What if we don’t know it?
    Minimum Description Length (MDL) principle in theory
    Cross-validation in practice

Curse of dimensionality
    What if the dimensionality of x is very high?
    Too many parameters to estimate
    Requires a huge amount of training data
    Otherwise, the estimation is heavily biased
