Generative Models for Image Understanding

Nebojsa Jojic and Thomas Huang
Beckman Institute and ECE Dept.
University of Illinois

Problem: Summarization of High Dimensional Data

• Pattern Analysis:
  – For several classes c = 1, ..., C of the data, define probability distribution functions p(x|c)

• Compression:
  – Define a probabilistic model p(x) and devise an optimal coding approach

• Video Summary:
  – Drop most of the frames in a video sequence and keep interesting information that summarizes it.

Generative density modeling

• Find a probability model that
  – reflects desired structure
  – randomly generates plausible images
  – represents the data by parameters

• ML estimation

• p(image|class) used for recognition, detection, ...

Problems we attacked

• Transformation as a discrete variable in generative models of intensity images
• Tracking articulated objects in dense stereo maps
• Unsupervised learning for video summary

• Idea - the structure of the generative model reveals the interesting objects we want to extract.

Mixture of Gaussians

[Graphical model: c → z]

• The cluster prior is $P(c) = \pi_c$
• The probability of pixel intensities z given that the image is from cluster c is $p(z|c) = \mathcal{N}(z; \mu_c, \Phi_c)$
• Parameters $\pi_c$, $\mu_c$ and $\Phi_c$ represent the data
• For input z, the cluster responsibilities are
  $P(c|z) = \frac{p(z|c)\,P(c)}{\sum_{c'} p(z|c')\,P(c')}$
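The responsibility formula is Bayes' rule applied to the mixture. The sketch below is one way to evaluate it, assuming diagonal covariances (consistent with the diag(...) variance update used later in the deck) and working in log space so that high-dimensional images do not underflow. The function names and the toy numbers are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def log_gaussian_diag(z, mean, var):
    """Log density of N(z; mean, diag(var)) for one flattened image z."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (z - mean) ** 2 / var)

def responsibilities(z, priors, means, variances):
    """Return P(c|z) for every cluster c via Bayes' rule, computed in log space."""
    log_joint = np.array([
        np.log(priors[c]) + log_gaussian_diag(z, means[c], variances[c])
        for c in range(len(priors))
    ])
    log_joint -= log_joint.max()      # subtract the max to avoid underflow
    joint = np.exp(log_joint)
    return joint / joint.sum()

# Toy example (assumed values): two clusters over 3-pixel "images".
priors = np.array([0.6, 0.4])
means = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
variances = np.ones((2, 3))
post = responsibilities(np.array([0.9, 1.1, 1.0]), priors, means, variances)
```

Here `post` sums to one, and the second cluster receives the larger responsibility because the input lies close to its mean.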

Example: Simulation

[Figure, two builds: with $\pi_1 = 0.6$ and $\pi_2 = 0.4$, first draw the cluster from $P(c) = \pi_c$ (c = 1 in the first build, c = 2 in the second), then draw the image z from $p(z|c) = \mathcal{N}(z; \mu_c, \Phi_c)$.]

Example: Learning - E step

[Figure, two builds: for each image z from the data set, compute the responsibilities P(c|z) under the current parameters, with $\pi_1 = 0.5$, $\pi_2 = 0.5$.]

• First image: P(c=1|z) = 0.52, P(c=2|z) = 0.48
• Second image: P(c=1|z) = 0.48, P(c=2|z) = 0.52

Example: Learning - M step

[Graphical model: c → z, with $\pi_1 = 0.5$, $\pi_2 = 0.5$]

• Set $\mu_1$ to the average of $z\,P(c=1|z)$
• Set $\mu_2$ to the average of $z\,P(c=2|z)$
• Set $\Phi_1$ to the average of $\mathrm{diag}((z-\mu_1)^T (z-\mu_1))\,P(c=1|z)$
• Set $\Phi_2$ to the average of $\mathrm{diag}((z-\mu_2)^T (z-\mu_2))\,P(c=2|z)$
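The M step can be sketched as responsibility-weighted averages. This is a minimal version under two assumptions: the "average" is normalized by the total responsibility of each cluster (the usual EM update), and covariances are diagonal. The names `m_step`, `Z` and `R` are illustrative.

```python
import numpy as np

def m_step(Z, R):
    """Z: (T, N) images as rows; R: (T, C) responsibilities P(c|z_t).
    Returns priors (C,), means (C, N) and diagonal variances (C, N)."""
    weight = R.sum(axis=0)                          # effective count per cluster
    priors = weight / Z.shape[0]                    # average of P(c|z)
    means = (R.T @ Z) / weight[:, None]             # weighted average of z
    sq = (Z[:, None, :] - means[None, :, :]) ** 2   # (T, C, N) squared deviations
    variances = (R[:, :, None] * sq).sum(axis=0) / weight[:, None]
    return priors, means, variances

# Toy data (assumed): four 2-pixel images with hard assignments to two clusters.
Z = np.array([[0.0, 0.0], [2.0, 2.0], [10.0, 10.0], [12.0, 12.0]])
R = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
priors, means, variances = m_step(Z, R)
```

With these hard assignments each cluster mean is the plain average of its two images, and each variance is the average squared deviation from that mean.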

Transformation as a Discrete Latent Variable

with Brendan J. Frey

Computer Science, University of Waterloo, Canada
Beckman Institute & ECE, Univ of Illinois at Urbana

Kind of data we’re interested in

Even after tracking, the features still have unknown positions, rotations, scales, levels of shearing, ...

One approach

[Diagram: Images → Normalization (labor) → Normalized images → Pattern Analysis]

Our approach

[Diagram: Images → Joint Normalization and Pattern Analysis]

What transforming an image does in the vector space of pixel intensities

• A continuous transformation moves an image along a continuous curve

• Our subspace model should assign images near this nonlinear manifold to the same point in the subspace

Tractable approaches to modeling the transformation manifold

• Linear approximation - good locally

• Discrete approximation - good globally

Adding “transformation” as a discrete latent variable

• Say there are N pixels

• We assume we are given a set of sparse N x N transformation generating matrices $G_1, \ldots, G_l, \ldots, G_L$

• These generate points $G_l z$ from a point z
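To make the transformation matrices concrete, here is a sketch that builds one such matrix for a translation. It assumes images are flattened row by row and that pixels shifted past an edge wrap around; the slides do not say how boundaries are handled, so the wrap-around is an assumption, as is the helper name `shift_matrix`. A dense array is used for brevity even though only one entry per row is nonzero.

```python
import numpy as np

def shift_matrix(height, width, dy, dx):
    """N x N permutation matrix G (N = height * width) such that
    G @ z.ravel() is the image z cyclically shifted by (dy, dx)."""
    n = height * width
    G = np.zeros((n, n))
    for r in range(height):
        for c in range(width):
            src = r * width + c
            dst = ((r + dy) % height) * width + (c + dx) % width
            G[dst, src] = 1.0      # exactly one nonzero entry per row and column
    return G

z = np.arange(12.0).reshape(3, 4)
G = shift_matrix(3, 4, dy=1, dx=-1)   # one pixel down, one pixel left
x = (G @ z.ravel()).reshape(3, 4)
```

A set of L such matrices, one per allowed shift, plays the role of $G_1, \ldots, G_L$; applying `G` here gives the same result as `np.roll(z, shift=(1, -1), axis=(0, 1))`.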

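Transformed Mixture of Gaussians

[Graphical model: c → z → x ← l]

• $P(c) = \pi_c$
• $p(z|c) = \mathcal{N}(z; \mu_c, \Phi_c)$
• $P(l) = \rho_l$
• $p(x|z,l) = \mathcal{N}(x; G_l z, \Psi)$

• $\rho_l$, $\pi_c$, $\mu_c$ and $\Phi_c$ represent the data

• The cluster/transformation responsibilities, P(c,l|x), are quite easy to compute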
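One way to see why the joint responsibilities are easy to compute: integrating out z leaves a Gaussian, $p(x|c,l) = \mathcal{N}(x; G_l \mu_c, G_l \Phi_c G_l^T + \Psi)$, so $P(c,l|x) \propto \pi_c \rho_l\, p(x|c,l)$. The sketch below evaluates this directly with full covariance matrices, which is only practical for tiny images. The function names, the diagonal $\Phi_c$ and $\Psi$, and the 4-pixel toy data are all assumptions for illustration.

```python
import numpy as np

def log_gaussian(x, mean, cov):
    """Log density of N(x; mean, cov) with a full covariance matrix."""
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    quad = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (len(x) * np.log(2.0 * np.pi) + logdet + quad)

def joint_responsibilities(x, pi, rho, mu, Phi, Psi, G):
    """Return a (C, L) array of P(c, l | x).
    mu: (C, N); Phi: (C, N) diagonal variances; Psi: (N,) diagonal noise
    variances; G: list of L transformation matrices of shape (N, N)."""
    log_post = np.empty((len(pi), len(rho)))
    for c in range(len(pi)):
        for l in range(len(rho)):
            mean = G[l] @ mu[c]
            cov = G[l] @ np.diag(Phi[c]) @ G[l].T + np.diag(Psi)
            log_post[c, l] = (np.log(pi[c]) + np.log(rho[l])
                              + log_gaussian(x, mean, cov))
    log_post -= log_post.max()
    post = np.exp(log_post)
    return post / post.sum()

# Tiny 1-D "image" with 4 pixels; transformations are cyclic shifts by -1, 0, +1.
I4 = np.eye(4)
G = [np.roll(I4, s, axis=0) for s in (-1, 0, 1)]
mu = np.array([[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]])
Phi = np.full((2, 4), 0.01)
Psi = np.full(4, 0.01)
pi = np.array([0.5, 0.5])
rho = np.full(3, 1.0 / 3.0)
x = np.array([0.0, 1.0, 0.0, 0.0])   # first cluster mean shifted right by one pixel
post = joint_responsibilities(x, pi, rho, mu, Phi, Psi, G)
```

In this toy case nearly all of the posterior mass lands on the first cluster combined with the right shift, i.e. `post[0, 2]`.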

Example: Simulation

G1 = shift left and up, G2 = I, G3 = shift right and up

[Figure: sampling from the model with c = 1 and l = 1: draw z from p(z|c), then generate x from p(x|z,l).]

ML estimation of a Transformed Mixture of Gaussians using EM

[Graphical model: c → z → x ← l]

• E step: Compute P(l|x), P(c|x) and p(z|c,x) for each x in data

• M step: Set
  – $\pi_c$ = avg of P(c|x)
  – $\rho_l$ = avg of P(l|x)
  – $\mu_c$ = avg mean of p(z|c,x)
  – $\Phi_c$ = avg variance of p(z|c,x)
  – $\Psi$ = avg var of $p(x - G_l z\,|\,x)$

Face Clustering

Examples of 400 outdoor images of 2 people (44 x 28 pixels)

Mixture of Gaussians
15 iterations of EM (MATLAB takes 1 minute)

[Figure: cluster means for c = 1, c = 2, c = 3, c = 4]

30 iterations of EM

[Figure: cluster means for c = 1, c = 2, c = 3, c = 4]

Transformed mixture of Gaussians

Video Analysis Using Generative Models

with Brendan Frey, Nemanja Petrovic and Thomas Huang

Idea

• Use generative models of video sequences to do unsupervised learning

• Use the resulting model for video summarization, filtering, stabilization, recognition of objects, retrieval, etc.

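Transformed Hidden Markov Model

[Graphical model: two time slices, t-1 and t, each with c → z → x ← l; the class and transformation at time t depend on the past through P(c,l|past).]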

THMM Transition Models

• Independent probability distributions for class and transformations; relative motion
  $P(c_t, l_t \,|\, \text{past}) = P(c_t \,|\, c_{t-1})\, P(d(l_t, l_{t-1}))$

• Relative motion dependent on the class
  $P(c_t, l_t \,|\, \text{past}) = P(c_t \,|\, c_{t-1})\, P(d(l_t, l_{t-1}) \,|\, c_t)$

• Autoregressive model for transformation distribution
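The first transition model factorizes into a class transition and a distribution over relative motion. A minimal sketch of assembling it follows, assuming a 1-D grid of shifts, $d(l_t, l_{t-1}) = l_t - l_{t-1}$, and renormalization of each row at the grid edges. These choices, the function name and the numbers are assumptions, since the slides leave them unspecified.

```python
import numpy as np

def thmm_transition(class_trans, motion_prob, num_shifts):
    """Build P(c_t, l_t | c_{t-1}, l_{t-1}) = P(c_t | c_{t-1}) P(d(l_t, l_{t-1})).
    class_trans: (C, C) matrix with rows indexed by c_{t-1}.
    motion_prob: dict mapping displacement d = l_t - l_{t-1} to its probability.
    Returns an array indexed [c_prev, l_prev, c_next, l_next]."""
    shift_trans = np.zeros((num_shifts, num_shifts))
    for l_prev in range(num_shifts):
        for l_next in range(num_shifts):
            shift_trans[l_prev, l_next] = motion_prob.get(l_next - l_prev, 0.0)
    # Displacements that would leave the grid are dropped, so renormalize each row.
    shift_trans /= shift_trans.sum(axis=1, keepdims=True)
    return np.einsum("ab,ij->aibj", class_trans, shift_trans)

class_trans = np.array([[0.9, 0.1], [0.2, 0.8]])
motion = {-1: 0.25, 0: 0.5, 1: 0.25}      # small relative motions are favored
T = thmm_transition(class_trans, motion, num_shifts=5)
```

Because the motion term depends only on the displacement, it needs far fewer parameters than a full table over all pairs of transformations.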

Inference in THMM

• Tasks:
  – Find the most likely state at time t given the whole observed sequence {xt} and the model parameters (class means and variances, transition probabilities, etc.)
  – Find the distribution over states for each time t
  – Find the most likely state sequence
  – Learn the parameters that maximize the likelihood of the observed data
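For the "most likely state sequence" task, the standard tool for hidden Markov models is the Viterbi algorithm. The sketch below is a generic log-space version in which each state index stands for one (c, l) pair. It is not taken from the slides, and the two-state numbers are made up for illustration.

```python
import numpy as np

def viterbi(log_init, log_trans, log_obs):
    """Most likely state sequence of an HMM.
    log_init: (S,) log P(s_1); log_trans: (S, S) log P(s_t = j | s_{t-1} = i);
    log_obs: (T, S) log p(x_t | s_t). Returns a list of T state indices."""
    T, S = log_obs.shape
    score = log_init + log_obs[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_trans      # cand[i, j]: arrive in j from i
        back[t] = cand.argmax(axis=0)          # best predecessor of each state
        score = cand.max(axis=0) + log_obs[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):              # follow the back-pointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]

log_init = np.log([0.5, 0.5])
log_trans = np.log([[0.9, 0.1], [0.1, 0.9]])
# Frame likelihoods p(x_t | s_t): the third frame weakly favors state 1.
log_obs = np.log([[0.9, 0.1], [0.9, 0.1], [0.4, 0.6], [0.9, 0.1]])
path = viterbi(log_init, log_trans, log_obs)
```

In this example the sticky transitions outweigh the weak evidence in the third frame, so the decoded sequence stays in state 0 throughout. That smoothing effect is what makes the temporal model useful on noisy video.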

Video Summary and Filtering

[Graphical model: c → z → x ← l, with $p(z|c) = \mathcal{N}(z; \mu_c, \Phi_c)$ and $p(x|z,l) = \mathcal{N}(x; G_l z, \Psi)$]

• Video summary
• Image segmentation
• Removal of sensor noise
• Image stabilization

Example: Learning

DATA
• Hand-held camera
• Moving subject
• Cluttered background

• 1 class, 121 translations (11 vertical and 11 horizontal shifts)
• 5 classes

[Figures: the learned classes c for the 1-class and 5-class models]

Examples

• Normalized sequence

• Simulated sequence

• De-noising

• Seeing through distractions

Future work

• Fast approximate learning and inference

• Multiple layers

• Learning transformations from images

Nebojsa Jojic: www.ifp.uiuc.edu/~jojic

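Subspace models of images

Example: image $z \in \mathbb{R}^{1200}$, $z = f(y)$ with $y \in \mathbb{R}^{2}$

[Figure: the two coordinates of y correspond to "Frown" and "Shut eyes".]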

Factor analysis (generative PCA)

[Graphical model: y → z]

• $p(y) = \mathcal{N}(y; 0, I)$
• The density of pixel intensities z given subspace point y is $p(z|y) = \mathcal{N}(z; \mu + \Lambda y, \Phi)$
• Manifold: $f(y) = \mu + \Lambda y$, linear
• Parameters $\mu$, $\Lambda$ represent the manifold
• Observing z induces a Gaussian p(y|z):
  $\mathrm{COV}[y|z] = (\Lambda^T \Phi^{-1} \Lambda + I)^{-1}$
  $E[y|z] = \mathrm{COV}[y|z]\, \Lambda^T \Phi^{-1} (z - \mu)$
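The Gaussian posterior p(y|z) of a factor analyzer has the standard closed form $\mathrm{COV}[y|z] = (\Lambda^T \Phi^{-1} \Lambda + I)^{-1}$ and $E[y|z] = \mathrm{COV}[y|z]\,\Lambda^T \Phi^{-1}(z - \mu)$. The sketch below computes it assuming a diagonal $\Phi$ stored as a vector; the function name and the random test values are illustrative.

```python
import numpy as np

def fa_posterior(z, mu, Lam, Phi):
    """Posterior p(y|z) for p(y) = N(0, I) and p(z|y) = N(mu + Lam y, diag(Phi)).
    Returns the mean and covariance of the Gaussian p(y|z)."""
    K = Lam.shape[1]
    LtPinv = Lam.T / Phi                   # Lam^T Phi^{-1}, using the diagonal Phi
    cov = np.linalg.inv(LtPinv @ Lam + np.eye(K))
    mean = cov @ LtPinv @ (z - mu)
    return mean, cov

# Random 6-pixel model with a 2-D subspace (assumed values).
rng = np.random.default_rng(0)
mu = rng.normal(size=6)
Lam = rng.normal(size=(6, 2))
Phi = rng.uniform(0.5, 1.5, size=6)
z = rng.normal(size=6)
mean, cov = fa_posterior(z, mu, Lam, Phi)
```

Only a K x K matrix is inverted, where K is the subspace dimension, so the cost stays low even when the number of pixels is large.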

Example: Simulation

[Figure, three builds: draw y from $p(y) = \mathcal{N}(y; 0, I)$, whose two coordinates are "Frown" (Frn) and "Shut eyes" (SE), then draw the image z from $p(z|y) = \mathcal{N}(z; \mu + \Lambda y, \Phi)$.]

Transformed Component Analysis

[Graphical model: y → z → x ← l]

• $p(y) = \mathcal{N}(y; 0, I)$
• $p(z|y) = \mathcal{N}(z; \mu + \Lambda y, \Phi)$
• $P(l) = \rho_l$
• The probability of observed image x is $p(x|z,l) = \mathcal{N}(x; G_l z, \Psi)$

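Example: Simulation

G1 = shift left & up, G2 = I, G3 = shift right & up

[Figure: sampling from the TCA model: draw y with coordinates "Frown" (Frn) and "Shut eyes" (SE), generate z, choose l = 3, and generate the observed image x.]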

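Example: Inference

G1 = shift left & up, G2 = I, G3 = shift right & up

[Figure: for an observed image x, each hypothesis l = 1, 2, 3 is tried and the corresponding z and y (Frn, SE) are inferred. Two of the hypotheses are labeled "Garbage". The figure reports the posteriors P(l=1|x), P(l=2|x) and P(l=3|x).]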
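The posterior over transformations can be sketched by integrating out y and z, which gives $p(x|l) = \mathcal{N}(x; G_l \mu, G_l (\Lambda \Lambda^T + \Phi) G_l^T + \Psi)$ and hence $P(l|x) \propto \rho_l\, p(x|l)$. The code below is a small illustration under assumptions: diagonal $\Phi$ and $\Psi$, cyclic 1-D shifts in place of the 2-D shifts on the slide, and made-up parameter values.

```python
import numpy as np

def transformation_posterior(x, rho, mu, Lam, Phi, Psi, G):
    """Return P(l|x) for each transformation l in a TCA model.
    Phi and Psi are (N,) diagonal variances; G is a list of (N, N) matrices."""
    latent_cov = Lam @ Lam.T + np.diag(Phi)       # covariance of z
    log_post = np.empty(len(G))
    for l, Gl in enumerate(G):
        diff = x - Gl @ mu
        cov = Gl @ latent_cov @ Gl.T + np.diag(Psi)
        _, logdet = np.linalg.slogdet(cov)
        quad = diff @ np.linalg.solve(cov, diff)
        # The 2*pi constant is identical for every l and cancels on normalizing.
        log_post[l] = np.log(rho[l]) - 0.5 * (logdet + quad)
    log_post -= log_post.max()
    post = np.exp(log_post)
    return post / post.sum()

# 5-pixel 1-D "image"; G1 = shift left, G2 = I, G3 = shift right (cyclic).
I5 = np.eye(5)
G = [np.roll(I5, s, axis=0) for s in (-1, 0, 1)]
mu = np.array([0.0, 0.0, 4.0, 0.0, 0.0])
Lam = np.array([[0.0], [0.5], [0.0], [-0.5], [0.0]])
Phi = np.full(5, 0.01)
Psi = np.full(5, 0.01)
rho = np.full(3, 1.0 / 3.0)

# Generate one observation with l = 3 (index 2), then infer the transformation.
rng = np.random.default_rng(1)
y = rng.normal(size=1)
z = mu + Lam @ y + np.sqrt(Phi) * rng.normal(size=5)
x = G[2] @ z + np.sqrt(Psi) * rng.normal(size=5)
post = transformation_posterior(x, rho, mu, Lam, Phi, Psi, G)
```

The wrong hypotheses place the bright pixel where the model expects a dark one, so they explain x poorly and the posterior concentrates on the shift that generated it.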

EM algorithm for TCA

• Initialize $\mu$, $\Lambda$, $\Phi$, $\Psi$ to random values
• E Step
  – For each training case $x^{(t)}$, infer $q^{(t)}(l,z,y) = p(l,z,y \,|\, x^{(t)})$
• M Step
  – Compute $\mu^{new}$, $\Lambda^{new}$, $\Phi^{new}$, $\Psi^{new}$, $\rho^{new}$ to maximize
    $\sum_t E[\log p(y)\, p(z|y)\, P(l)\, p(x^{(t)}|z,l)]$,
    where E[] is wrt $q^{(t)}(l,z,y)$
• Each iteration increases log p(Data)

A tough toy problem

• 144 images of 9 x 9 pixels
• 1 shape (pyramid)
• 3-D lighting
• cluttered background
• 25 possible locations

[Figure: 1st 8 principal components]

TCA:
• 3 components
• 81 transformations
  - 9 horiz shifts
  - 9 vert shifts
• 10 iters of EM
• Model generates realistic examples

[Figure: the 3 learned components, columns :1, :2 and :3]

Expression modeling

• 100 16 x 24 training images

• variation in expression

• imperfect alignment

PCA: Mean + 1st 10 principal components

Factor Analysis: Mean + 10 factors after 70 iterations of EM

TCA: Mean + 10 factors after 70 iterations of EM

Fantasies from FA model

Fantasies from TCA model

Modeling handwritten digits

• 200 8 x 8 images of each digit

• preprocessing normalizes vert/horiz translation and scale

• different writing angles (shearing) - see “7”

TCA:
  - 29 shearing + translation combinations
  - 10 components per digit
  - 30 iterations of EM per digit

Mean of each digit

Transformed means

FA: Mean + 10 components per digit

TCA: Mean + 10 components per digit

Classification Performance

• Training: 200 cases/digit, 20 components, 50 EM iters

• Testing: 1000 cases, p(x|class) used for classification

• Results:

Method                               Error rate
k-nearest neighbors (optimized k)    7.6%
Factor analysis                      3.2%
Transformed component analysis       2.7%

• Bonus: P(l|x) infers the writing angle!

Wrap-up

• Papers, MATLAB scripts:
  www.ifp.uiuc.edu/~jojic
  www.cs.uwaterloo.ca/~frey

• Other domains: audio, bioinformatics, …

• Other latent image models, p(z)
  – mixtures of factor analyzers (NIPS99)
  – layers, multiple objects, occlusions
  – time series (in preparation)

Wrap-up

• Discrete+Linear Combination: Set some components equal to derivatives of $\mu$ wrt transformations

• Multiresolution approach

• Fast variational methods, belief propagation,...

Other generative models

• Modeling human appearance in stereo images: articulated, self-occluding Gaussians
