
What is it? When would you use it? Why does it work? How do you implement it? Where does it stand in relation to other methods?


Page 1: What  is it?  When  would you use it? Why  does it work?  How  do you implement it? Where  does it stand in relation to other methods?

What is it?

When would you use it?

Why does it work?

How do you implement it?

Where does it stand in relation to other methods?

EM algorithm reading group

Introduction & Motivation

Theory

Practical

Comparison with other methods

Page 2:

Expectation Maximization (EM)

• Iterative method for parameter estimation where you have missing data

• Has two steps: Expectation (E) and Maximization (M)

• Applicable to a wide range of problems

• An old idea (late 1950s), formalized by Dempster, Laird and Rubin in 1977

• Subject of much investigation. See the McLachlan & Krishnan book (1997).

Page 3:

Applications of EM (1)

• Fitting mixture models

Page 4:

Applications of EM (2)

• Probabilistic Latent Semantic Analysis (pLSA)

– Technique from the text community

P(w,d) = P(d) Σ_z P(w|z) P(z|d)

[Figure: two equivalent graphical models over the latent topic Z, words W, and documents D]

Page 5:

Applications of EM (3)

• Learning parts and structure models

Page 6:

Applications of EM (4)

• Automatic segmentation of layers in video

http://www.psi.toronto.edu/images/figures/cutouts_vid.gif

Page 7:

Motivating example

[Figure: 1-D data points plotted on an axis from -4 to 5]

Data: x_1, ..., x_N (1-D, i.i.d.)

Model: mixture of Gaussians

Parameters: θ = {π_c, μ_c, σ_c}

OBJECTIVE: Fit a mixture of Gaussians model with C=2 components,

P(x|θ) = Σ_{c=1}^{C} π_c N(x; μ_c, σ_c²), where Σ_c π_c = 1,

keeping π and σ fixed, i.e. only estimating the means μ_1, μ_2.
Page 8:

Likelihood function

Likelihood is a function of the parameters θ; probability is a function of the random variable x. DIFFERENT TO LAST PLOT.

Page 9:

Probabilistic model

Imagine the model generating the data.

Need to introduce a label, z, for each data point.

The label is called a latent variable (also called hidden, unobserved, or missing).

[Figure: 1-D data points on an axis from -4 to 5, coloured by component c]

Simplifies the problem: if we knew the labels, we could decouple the components and estimate parameters separately for each one.
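The generative view above can be sketched in code: first draw a latent label z, then draw x from that component. The mixture parameters here are hypothetical, chosen only for illustration.

```python
import random

def sample_mixture(n, pi=(0.5, 0.5), mu=(-2.0, 2.0), sigma=(0.7, 0.7)):
    """Generate n points from a two-component 1-D Gaussian mixture.

    Returns (z, x) pairs; in practice only x is observed and z is latent.
    """
    data = []
    for _ in range(n):
        z = 0 if random.random() < pi[0] else 1   # latent label
        data.append((z, random.gauss(mu[z], sigma[z])))
    return data
```

Discarding the z column gives exactly the kind of unlabelled 1-D data in the motivating example.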

Page 10:

Intuition of EM

E-step: Compute a distribution over the labels of the points, using the current parameters.

M-step: Update the parameters using the current guess of the label distribution.

[Figure: fitted components over successive E- and M-steps]
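The alternating E/M picture above, specialized to a two-component 1-D Gaussian mixture, might look like this. A minimal sketch, not the deck's own code; the initialization and iteration count are arbitrary choices.

```python
import math

def em_gmm_1d(data, n_iters=50):
    """EM for a two-component 1-D Gaussian mixture (means, variances, weights)."""
    mu = [min(data), max(data)]      # crude but deterministic initialization
    var = [1.0, 1.0]
    pi = [0.5, 0.5]
    for _ in range(n_iters):
        # E-step: responsibilities resp[n][c] = P(z_n = c | x_n, theta)
        resp = []
        for x in data:
            w = [pi[c] * math.exp(-(x - mu[c]) ** 2 / (2 * var[c]))
                 / math.sqrt(2 * math.pi * var[c]) for c in range(2)]
            s = sum(w)
            resp.append([wc / s for wc in w])
        # M-step: weighted maximum-likelihood updates
        for c in range(2):
            nc = sum(r[c] for r in resp)
            mu[c] = sum(r[c] * x for r, x in zip(resp, data)) / nc
            var[c] = sum(r[c] * (x - mu[c]) ** 2
                         for r, x in zip(resp, data)) / nc + 1e-6
            pi[c] = nc / len(data)
    return mu, var, pi
```

With well-separated data this converges in a few iterations; with overlapping components it can get stuck in a local maximum, as the practical section discusses.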

Page 11:

Theory

Page 12:

Some definitions

Observed data: x (continuous, i.i.d.)

Latent variables: z (discrete, taking values 1 ... C)

Iteration index: t

Log-likelihood [incomplete log-likelihood (ILL)]: log p(x|θ)

Complete log-likelihood (CLL): log p(x,z|θ)

Expected complete log-likelihood (ECLL): the expectation of log p(x,z|θ) under a distribution q(z)

Page 13:

Use Jensen's inequality

Lower bound on log-likelihood

AUXILIARY FUNCTION
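The bound itself did not survive extraction; in standard notation it is presumably the following, where q(z) is any distribution over the labels:

```latex
\log p(x\mid\theta)
  = \log \sum_z q(z)\,\frac{p(x,z\mid\theta)}{q(z)}
  \;\ge\; \sum_z q(z)\,\log\frac{p(x,z\mid\theta)}{q(z)}
  \;=\; \mathcal{F}(q,\theta)
```

The inequality is Jensen's, applied to the concave log; F(q,θ) is the auxiliary function named above.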

Page 14:

Jensen's Inequality

Jensen's inequality: for a real continuous concave function f and weights λ_i ≥ 0 with Σ_i λ_i = 1,

f(Σ_i λ_i x_i) ≥ Σ_i λ_i f(x_i)

Equality holds when all x_i are the same.

1. Definition of concavity: for 0 ≤ λ ≤ 1, f(λ x_1 + (1−λ) x_2) ≥ λ f(x_1) + (1−λ) f(x_2).

2. By induction, this extends to any convex combination of n points.
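A quick numeric check of the inequality for the concave log (the weights and points are arbitrary illustrative values):

```python
import math

# Jensen's inequality for the concave log:
# log(sum_i w_i * x_i) >= sum_i w_i * log(x_i), for weights w_i >= 0 summing to 1
w = [0.25, 0.25, 0.5]
x = [1.0, 2.0, 4.0]
lhs = math.log(sum(wi * xi for wi, xi in zip(w, x)))   # log(2.75) ~ 1.012
rhs = sum(wi * math.log(xi) for wi, xi in zip(w, x))   # ~ 0.866
assert lhs >= rhs
```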

Page 15:

Recall key result : Auxiliary function is LOWER BOUND on likelihood

EM is alternating ascent

Alternately improve q, then θ:

Improving the auxiliary function is guaranteed to improve the likelihood itself.

Page 16:

E-step: Choosing the optimal q(z|x,θ)

It turns out that q(z|x,θ) = p(z|x,θ^t) is the best choice.
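Why this choice is best (reconstructed reasoning, since the slide's equations were lost): plugging the posterior into the auxiliary function makes the bound tight,

```latex
\mathcal{F}\big(p(z\mid x,\theta^t),\,\theta^t\big)
  = \sum_z p(z\mid x,\theta^t)\,\log\frac{p(x,z\mid\theta^t)}{p(z\mid x,\theta^t)}
  = \sum_z p(z\mid x,\theta^t)\,\log p(x\mid\theta^t)
  = \log p(x\mid\theta^t)
```

using p(x,z|θ) = p(z|x,θ) p(x|θ). Since the bound can never exceed log p(x|θ), no other q does better.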

Page 17:

E-step: What do we actually compute?

nComponents x nPoints matrix (columns sum to 1):

[Figure: 2 × 6 responsibility matrix; rows = Component 1, Component 2; columns = Point 1 ... Point 6]

Responsibility of component c for point n: r_cn = p(z_n = c | x_n, θ^t)
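A sketch of computing that matrix for a 1-D Gaussian mixture (the helper name `responsibilities` is hypothetical, not from the slides):

```python
import math

def responsibilities(xs, mu, var, pi):
    """Return r[c][n] = P(z_n = c | x_n, theta): an nComponents x nPoints
    matrix whose columns sum to 1."""
    C, N = len(mu), len(xs)
    r = [[0.0] * N for _ in range(C)]
    for n, x in enumerate(xs):
        # weighted component densities pi_c * N(x; mu_c, var_c)
        w = [pi[c] * math.exp(-(x - mu[c]) ** 2 / (2 * var[c]))
             / math.sqrt(2 * math.pi * var[c]) for c in range(C)]
        total = sum(w)
        for c in range(C):
            r[c][n] = w[c] / total   # normalize so the column sums to 1
    return r
```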

Page 18:

M-Step

Auxiliary function separates into an ECLL term and an entropy term:
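In symbols (reconstructed, using the notation of the definitions slide):

```latex
\mathcal{F}(q,\theta)
  = \underbrace{\sum_z q(z)\,\log p(x,z\mid\theta)}_{\text{ECLL}}
    \;+\; \underbrace{\Big(-\sum_z q(z)\,\log q(z)\Big)}_{\text{entropy } H(q)}
```

Only the ECLL term depends on θ, so the M-step needs to maximize the ECLL alone.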

Page 19:

M-Step

From previous slide:

Recall definition of ECLL:

From E-step

Let's see what happens for the Gaussian mixture model.
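Pages 20-25 carry the derivation, but their equations were not recovered; for a 1-D Gaussian mixture the closed-form updates it leads to are the standard ones:

```latex
r_{cn} = p(z_n = c \mid x_n, \theta^t),\qquad
\pi_c^{t+1} = \frac{1}{N}\sum_n r_{cn},\qquad
\mu_c^{t+1} = \frac{\sum_n r_{cn}\,x_n}{\sum_n r_{cn}},\qquad
(\sigma_c^2)^{t+1} = \frac{\sum_n r_{cn}\,(x_n - \mu_c^{t+1})^2}{\sum_n r_{cn}}
```

Each component's parameters are responsibility-weighted versions of the usual maximum-likelihood estimates.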

Page 20:
Page 21:
Page 22:
Page 23:
Page 24:
Page 25:

Practical

Page 26:

Practical issues

Initialization
• Mean of data + random offset
• K-means

Termination
• Max # iterations
• Log-likelihood change
• Parameter change

Convergence
• Local maxima
• Annealed methods (DAEM)
• Birth/death processes (SMEM)

Numerical issues
• Inject noise into the covariance matrix to prevent blow-up
• A single point gives infinite likelihood

Number of components
• Open problem
• Minimum description length
• Bayesian approaches
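Two of the issues above, termination and the single-point blow-up, can be sketched as follows (the constants are hypothetical; tune them per problem):

```python
VAR_FLOOR = 1e-4   # hypothetical floor applied after each M-step:
                   # keeps a component from collapsing onto a single point

def should_stop(ll_old, ll_new, iteration, tol=1e-6, max_iters=200):
    """Termination test: EM never decreases the log-likelihood, so a small
    change (or hitting an iteration cap) is a natural stopping rule."""
    return iteration >= max_iters or (ll_new - ll_old) < tol
```

A parameter-change test works the same way, comparing successive θ instead of successive log-likelihoods.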

Page 27:

Local minima

Page 28:

Robustness of EM

Page 29:

What EM won't do

• Pick the structure of the model
– # components
– graph structure

• Find the global maximum

• Always have nice closed-form updates
– may need to optimize within the E/M step

• Avoid computational problems
– may need sampling methods for computing expectations

Page 30:

Comparison with other methods

Page 31:

Why not use standard optimization methods?

In favour of EM:

• No step size to choose

• Works directly in the parameter space of the model, so parameter constraints are obeyed

• Fits naturally into the graphical model framework

• Supposedly faster

Page 32:

[Figure: convergence comparison of gradient ascent, Newton's method, and EM]

Page 33:

[Figure: second convergence comparison of gradient ascent, Newton's method, and EM]

Page 34:

Acknowledgements

Shameless stealing of figures and equations and explanations from:

Frank Dellaert
Michael Jordan
Yair Weiss