Maximum Likelihood (ML), Expectation Maximization (EM)
Pieter Abbeel, UC Berkeley EECS
Many slides adapted from Thrun, Burgard and Fox, Probabilistic Robotics


Page 1: Maximum Likelihood (ML), Expectation Maximization (EM)

Maximum Likelihood (ML), Expectation Maximization (EM)

Pieter Abbeel, UC Berkeley EECS

Many slides adapted from Thrun, Burgard and Fox, Probabilistic Robotics

Page 2: Outline

Maximum likelihood (ML)

Priors, and maximum a posteriori (MAP)

Cross-validation

Expectation Maximization (EM)


Page 3: Thumbtack

Let θ = P(up), 1-θ = P(down)

How to determine θ?

Empirical estimate: 8 up, 2 down → θ ≈ 8/10 = 0.8

Page 4: A Thumbtack Experiment

http://web.me.com/todd6ton/Site/Classroom_Blog/Entries/2009/10/7_A_Thumbtack_Experiment.html

Page 5: Maximum Likelihood

θ = P(up), 1-θ = P(down)

Observe: a sequence of 10 flips with 8 up and 2 down

Likelihood of the observation sequence depends on θ: L(θ) = θ^8 (1-θ)^2

Maximum likelihood finds θ_ML = argmax_θ L(θ)

Setting the derivative to zero gives extrema at θ = 0, θ = 1, θ = 0.8

Inspection of each extremum yields θ_ML = 0.8

Page 6: Maximum Likelihood

More generally, consider a binary-valued random variable with θ = P(1), 1-θ = P(0); assume we observe n1 ones and n0 zeros

Likelihood: L(θ) = θ^n1 (1-θ)^n0

Derivative: dL/dθ = n1 θ^(n1-1) (1-θ)^n0 − n0 θ^n1 (1-θ)^(n0-1)

Hence we have for the extrema: θ = 0, θ = 1, θ = n1/(n0+n1)

n1/(n0+n1) is the maximum: θ_ML = n1/(n0+n1) = the empirical counts, i.e., the observed fraction of ones
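As a quick sanity check (not part of the original slides), a small Python sketch comparing the closed-form estimate n1/(n0+n1) against a brute-force grid maximization of the likelihood, using the thumbtack counts:

    import numpy as np

    n1, n0 = 8, 2  # observed counts: ones (up) and zeros (down)

    # Closed-form ML estimate: the empirical frequency of ones.
    theta_ml = n1 / (n0 + n1)

    # Brute-force check: evaluate the likelihood theta^n1 * (1-theta)^n0 on a grid.
    thetas = np.linspace(1e-6, 1 - 1e-6, 100001)
    likelihood = thetas**n1 * (1 - thetas)**n0
    theta_grid = thetas[np.argmax(likelihood)]

    print(theta_ml, theta_grid)  # both approximately 0.8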

Page 7: Log-likelihood

The function log(x) is a monotonically increasing function of x

Hence for any (positive-valued) function f: argmax_x f(x) = argmax_x log f(x)

In practice it is often more convenient to optimize the log-likelihood rather than the likelihood itself

Example: log L(θ) = log [θ^n1 (1-θ)^n0] = n1 log θ + n0 log(1-θ)

Page 8

Reconsider thumbtacks: 8 up, 2 down

Definition: A function f is concave if and only if f(λx1 + (1-λ)x2) ≥ λ f(x1) + (1-λ) f(x2) for all x1, x2 and all λ in [0,1]

Concave functions are generally easier to maximize than non-concave functions

[Plots: the likelihood θ^8 (1-θ)^2 is not concave; the log-likelihood 8 log θ + 2 log(1-θ) is concave]

Page 9: Concavity and Convexity

f is concave if and only if f(λx1 + (1-λ)x2) ≥ λ f(x1) + (1-λ) f(x2) for all x1, x2 and all λ in [0,1]

“Easy” to maximize

[Figure: a concave curve through x1 and x2, with the chord λx1+(1-λ)x2 lying below the curve]

f is convex if and only if f(λx1 + (1-λ)x2) ≤ λ f(x1) + (1-λ) f(x2) for all x1, x2 and all λ in [0,1]

“Easy” to minimize

[Figure: a convex curve through x1 and x2, with the chord lying above the curve]

Page 10: ML for Multinomial

Consider having received samples x(1), …, x(m), each taking one of K possible values; let n_k denote the number of samples equal to k
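A standard sketch of this derivation (not necessarily the slide's exact steps), with m = Σ_k n_k:

    log L(θ) = Σ_k n_k log θ_k,   subject to Σ_k θ_k = 1
    Lagrangian: Σ_k n_k log θ_k + λ (1 − Σ_k θ_k)
    ∂/∂θ_k: n_k/θ_k − λ = 0  ⇒  θ_k = n_k/λ
    Σ_k θ_k = 1  ⇒  λ = Σ_k n_k = m  ⇒  θ_k,ML = n_k/m  (the empirical frequencies)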

Page 11: ML for Fully Observed HMM

Given samples: fully observed state/observation sequences x_0, z_0, x_1, z_1, …, x_T, z_T

Dynamics model: P(x_{t+1} | x_t)

Observation model: P(z_t | x_t)

This decomposes into independent ML problems, one for each conditional P(x_{t+1} | x_t = i) and one for each P(z_t | x_t = i)
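A minimal Python sketch of the resulting counting estimators (my own notation), assuming a single fully observed sequence of discrete states xs and observations zs:

    import numpy as np

    def ml_fully_observed_hmm(xs, zs, num_states, num_obs):
        """ML estimates of P(x_{t+1} | x_t) and P(z_t | x_t) from one observed sequence."""
        trans = np.zeros((num_states, num_states))
        emit = np.zeros((num_states, num_obs))
        for t in range(len(xs) - 1):
            trans[xs[t], xs[t + 1]] += 1   # count transitions
        for t in range(len(xs)):
            emit[xs[t], zs[t]] += 1        # count emissions
        # Normalize each row into a conditional distribution
        # (states that never occur would need smoothing, e.g. Laplace counts).
        trans /= trans.sum(axis=1, keepdims=True)
        emit /= emit.sum(axis=1, keepdims=True)
        return trans, emit

    A, B = ml_fully_observed_hmm([0, 0, 1, 1, 0], [1, 0, 1, 1, 0], num_states=2, num_obs=2)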

Page 12: ML for Exponential Distribution

Consider having received samples 3.1, 8.2, 1.7

Source: Wikipedia

Page 13: ML for Exponential Distribution

Consider having received samples x(1), …, x(m)

Source: Wikipedia
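A standard derivation (not necessarily in the slide's exact notation), applied to the samples 3.1, 8.2, 1.7 from the previous page:

    p(x ; λ) = λ e^(−λx),  x ≥ 0
    log L(λ) = Σ_i log p(x(i) ; λ) = m log λ − λ Σ_i x(i)
    d/dλ log L(λ) = m/λ − Σ_i x(i) = 0  ⇒  λ_ML = m / Σ_i x(i) = 1 / (sample mean)
    For 3.1, 8.2, 1.7:  λ_ML = 3 / 13.0 ≈ 0.23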

Page 14: Uniform

Consider having received samples x(1), …, x(m) from a uniform distribution; the ML estimate places the support as tightly as possible around the observed samples (e.g., for Uniform[0, θ], θ_ML = max_i x(i))

Page 15: ML for Gaussian

Consider having received samples x(1), …, x(m); the ML estimates are μ_ML = (1/m) Σ_i x(i) and σ²_ML = (1/m) Σ_i (x(i) − μ_ML)²
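A brief Python sketch (not from the slides; the sample values are made up) of these closed-form Gaussian ML estimates, i.e. the sample mean and the uncorrected sample variance:

    import numpy as np

    x = np.array([3.1, 8.2, 1.7, 4.4, 5.0])   # hypothetical samples

    mu_ml = x.mean()                        # mu_ML = (1/m) * sum_i x(i)
    sigma2_ml = ((x - mu_ml) ** 2).mean()   # sigma^2_ML = (1/m) * sum_i (x(i) - mu_ML)^2
    print(mu_ml, sigma2_ml)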

Page 16: ML for Conditional Gaussian

Equivalently:

More generally:


Page 17: ML for Conditional Gaussian

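The standard result is that ML estimation of a conditional Gaussian P(y | x) = N(y ; θᵀx, σ²) reduces to least squares; a Python sketch with synthetic data (all names my own):

    import numpy as np

    rng = np.random.default_rng(0)
    m = 200
    X = rng.normal(size=(m, 3))                          # inputs x(i)
    theta_true = np.array([1.0, -2.0, 0.5])
    y = X @ theta_true + rng.normal(scale=0.3, size=m)   # y(i) = theta^T x(i) + Gaussian noise

    # Maximizing the conditional likelihood minimizes sum_i (y(i) - theta^T x(i))^2.
    theta_ml, *_ = np.linalg.lstsq(X, y, rcond=None)

    # The ML estimate of the noise variance is the mean squared residual.
    sigma2_ml = np.mean((y - X @ theta_ml) ** 2)
    print(theta_ml, sigma2_ml)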

Page 18: ML for Conditional Multivariate Gaussian


Page 19: Aside: Key Identities for Derivation on Previous Slide

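Matrix-calculus facts typically used for this kind of derivation (standard identities, not necessarily the exact list on the slide):

    ∂/∂A tr(A B) = Bᵀ
    ∂/∂A log |A| = (A⁻¹)ᵀ          (A invertible)
    ∂/∂x xᵀ A x = (A + Aᵀ) x
    aᵀ B a = tr(B a aᵀ)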

Page 20: ML Estimation in Fully Observed Linear Gaussian Bayes Filter Setting

Consider the Linear Gaussian setting: linear dynamics and observation models with additive Gaussian noise

Fully observed, i.e., given both the states x_0, …, x_T and the observations z_0, …, z_T

Two separate ML estimation problems for conditional multivariate Gaussians:

1: the dynamics model P(x_{t+1} | x_t)

2: the observation model P(z_t | x_t)
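A small Python sketch of what the two least-squares problems look like, assuming the model x_{t+1} = A x_t + w_t, z_t = C x_t + v_t with zero-mean Gaussian noise (offset and control-input terms, if present on the slide, are omitted here):

    import numpy as np

    def ml_linear_gaussian(xs, zs):
        """xs: (T+1, n) observed states, zs: (T+1, k) observations.
        Returns estimates of A, Q (dynamics) and C, R (observations)."""
        X_prev, X_next = xs[:-1], xs[1:]
        # 1: dynamics model x_{t+1} ~ N(A x_t, Q)
        A = np.linalg.lstsq(X_prev, X_next, rcond=None)[0].T
        resid = X_next - X_prev @ A.T
        Q = resid.T @ resid / len(resid)
        # 2: observation model z_t ~ N(C x_t, R)
        C = np.linalg.lstsq(xs, zs, rcond=None)[0].T
        resid = zs - xs @ C.T
        R = resid.T @ resid / len(resid)
        return A, Q, C, R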

Page 21: Priors --- Thumbtack

Let θ = P(up), 1-θ = P(down)

How to determine θ?

ML estimate: 5 up, 0 down → θ_ML = 1

Laplace estimate: add a fake count of 1 for each outcome → θ = (5+1)/(5+0+2) = 6/7

Page 22: Priors --- Thumbtack

Alternatively, consider θ to be a random variable

Prior: P(θ) ∝ θ(1-θ)

Measurements: P(x | θ)

Posterior: P(θ | x) ∝ P(x | θ) P(θ)

Maximum a posteriori (MAP) estimation = find the θ that maximizes the posterior
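With this prior and n1 up / n0 down observations, the computation goes as follows (standard algebra, not necessarily the slide's exact steps):

    P(θ | x) ∝ P(x | θ) P(θ) ∝ θ^n1 (1−θ)^n0 · θ(1−θ) = θ^(n1+1) (1−θ)^(n0+1)
    ⇒ θ_MAP = (n1 + 1) / (n1 + n0 + 2)
    i.e. exactly the Laplace estimate: one fake count added to each outcome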

Page 23: Priors --- Beta Distribution


Figure source: Wikipedia

Page 24: Priors --- Dirichlet Distribution

Generalizes the Beta distribution to more than two outcomes

MAP estimate corresponds to adding fake counts n1, …, nK

Page 25: MAP for Mean of Univariate Gaussian

Assume variance known. (Can be extended to also find MAP for variance.)

Prior: μ ~ N(μ_0, σ_0²)
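Under the usual assumption of a Gaussian prior μ ~ N(μ_0, σ_0²) on the mean, with known variance σ², the standard result is:

    P(μ | x(1), …, x(m)) ∝ exp(−(μ − μ_0)² / (2σ_0²)) · Π_i exp(−(x(i) − μ)² / (2σ²))
    ⇒ μ_MAP = (σ² μ_0 + σ_0² Σ_i x(i)) / (σ² + m σ_0²)
    (a precision-weighted average of the prior mean and the sample mean)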

Page 26: MAP for Univariate Conditional Linear Gaussian

Assume variance known. (Can be extended to also find MAP for variance.)

Prior:


[Interpret!]

Page 27: MAP for Univariate Conditional Linear Gaussian: Example

[Figure: true function, noisy samples, ML fit, and MAP fit compared]

Page 28: Cross Validation

The choice of prior will heavily influence the quality of the result

Fine-tune the choice of prior through cross-validation:

1. Split the data into a “training” set and a “validation” set

2. For a range of priors:
   Train: compute θ_MAP on the training set
   Cross-validate: evaluate performance on the validation set by evaluating the likelihood of the validation data under the θ_MAP just found

3. Choose the prior with the highest validation score; for this prior, compute θ_MAP on the combined (training + validation) set

Typical training / validation splits:

1-fold: 70/30, random split

10-fold: partition the data into 10 sets; average performance over the 10 runs in which each set in turn is the validation set and the other 9 are the training set
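A small Python sketch of the procedure for the thumbtack setting, where the “prior” being tuned is the number of fake counts α added to each outcome (the candidate grid and all names are my own):

    import numpy as np

    def map_estimate(n1, n0, alpha):
        # MAP / smoothed estimate with `alpha` fake counts per outcome.
        return (n1 + alpha) / (n1 + n0 + 2 * alpha)

    def validation_log_likelihood(theta, val):
        n1 = val.sum()
        n0 = len(val) - n1
        return n1 * np.log(theta) + n0 * np.log(1 - theta)

    rng = np.random.default_rng(0)
    data = rng.random(30) < 0.8            # synthetic flips with P(up) = 0.8
    train, val = data[:21], data[21:]       # roughly a 70/30 split

    alphas = [0.1, 0.5, 1.0, 2.0, 5.0]
    best_alpha = max(alphas, key=lambda a: validation_log_likelihood(
        map_estimate(train.sum(), len(train) - train.sum(), a), val))

    # Refit on training + validation data with the selected prior strength.
    theta_final = map_estimate(data.sum(), len(data) - data.sum(), best_alpha)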

Page 29: Outline

Maximum likelihood (ML)

Priors, and maximum a posteriori (MAP)

Cross-validation

Expectation Maximization (EM)


Page 30: Mixture of Gaussians

Generally: p(z) = Σ_k θ_k N(z ; μ_k, Σ_k)

Example:

ML Objective: given data z(1), …, z(m), maximize Σ_i log Σ_k θ_k N(z(i) ; μ_k, Σ_k)

Setting the derivatives w.r.t. θ, μ, Σ equal to zero does not allow solving for their ML estimates in closed form

We can evaluate the objective, so in principle we can perform local optimization (see future lectures). In this lecture: the “EM” algorithm, which is typically used to efficiently optimize the objective (locally)
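A short Python sketch (my own illustration, not from the slides) that evaluates this log-likelihood for a 1-D two-component mixture; it is also useful later for checking that each EM iteration increases the score:

    import numpy as np

    def normal_pdf(z, mu, sigma):
        return np.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

    def mog_log_likelihood(z, weights, means, sigmas):
        # sum_i log( sum_k w_k * N(z(i) ; mu_k, sigma_k^2) )
        density = sum(w * normal_pdf(z, mu, s)
                      for w, mu, s in zip(weights, means, sigmas))
        return np.log(density).sum()

    z = np.array([-7.1, -8.0, 6.9, 7.6, 8.2])
    print(mog_log_likelihood(z, [0.5, 0.5], [-7.5, 7.5], [1.0, 1.0]))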

Page 31: Expectation Maximization (EM)

Example:

Model: a mixture of two Gaussians with unknown means μ1, μ2

Goal: given data z(1), …, z(m) (but no x(i) observed), find maximum likelihood estimates of μ1, μ2

EM basic idea: if the x(i) were known, we would have two easy-to-solve separate ML problems

EM iterates over:
E-step: for i = 1, …, m, fill in the missing data x(i) according to what is most likely given the current model μ
M-step: run ML for the completed data, which gives a new model μ

Page 32: EM Derivation

EM solves a Maximum Likelihood problem of the form: max_θ log P(z ; θ) = max_θ log Σ_x P(x, z ; θ)

θ: parameters of the probabilistic model we try to find
x: unobserved variables
z: observed variables

Jensen’s Inequality

Page 33: Jensen’s Inequality

Jensen’s inequality: for a concave function f, f(E[X]) ≥ E[f(X)]

Illustration: P(X=x1) = 1-λ, P(X=x2) = λ, so E[X] = (1-λ)x1 + λx2
[Figure: a concave curve; f(E[X]) lies above the corresponding point on the chord between f(x1) and f(x2)]

Page 34: EM Derivation (ctd)

EM Algorithm: Iterate

1. E-step: Compute q_i(x) = P(x | z(i) ; θ(t))

2. M-step: Compute θ(t+1) = argmax_θ Σ_i Σ_x q_i(x) log P(x, z(i) ; θ)

Jensen’s Inequality: equality holds when the function inside the expectation is affine. This is achieved for q_i(x) = P(x | z(i) ; θ), which makes P(x, z(i) ; θ) / q_i(x) constant in x.

M-step optimization can be done efficiently in most cases
E-step is usually the more expensive step
It does not fill in the missing data x with hard values, but finds a distribution q(x)
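In outline, the bound being used is the standard one (my sketch, not necessarily the slide's exact algebra):

    log P(z ; θ) = log Σ_x P(x, z ; θ)
                 = log Σ_x q(x) · [ P(x, z ; θ) / q(x) ]
                 ≥ Σ_x q(x) log [ P(x, z ; θ) / q(x) ]      (Jensen, since log is concave)
    with equality for q(x) = P(x | z ; θ), because then P(x, z ; θ)/q(x) = P(z ; θ) is constant in x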

Page 35: EM Derivation (ctd)

M-step objective is upper-bounded by true objective

M-step objective is equal to true objective at current parameter estimate


Improvement in true objective is at least as large as improvement in M-step objective

Page 36: EM 1-D Example --- 2 iterations

Estimate 1-d mixture of two Gaussians with unit variance:

one parameter μ ; μ1 = μ − 7.5, μ2 = μ + 7.5

Page 37: EM for Mixture of Gaussians

X ~ Multinomial distribution, P(X = k ; θ) = θ_k

Z ~ N(μ_k, Σ_k)

Observed: z(1), z(2), …, z(m)

Page 38: EM for Mixture of Gaussians

E-step:

M-step:

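A compact Python sketch of one EM iteration for a 1-D mixture of Gaussians (my own variable names, not the slide's notation): the E-step computes responsibilities, the M-step re-estimates weights, means, and variances from the soft counts.

    import numpy as np

    def em_step(z, weights, means, variances):
        """One EM iteration for a 1-D Gaussian mixture; z is a 1-D data array."""
        z = z[:, None]                                  # shape (m, 1) for broadcasting
        # E-step: responsibilities r[i, k] = P(x(i) = k | z(i) ; current parameters)
        dens = (weights * np.exp(-0.5 * (z - means) ** 2 / variances)
                / np.sqrt(2 * np.pi * variances))
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from soft counts
        nk = r.sum(axis=0)
        weights = nk / len(z)
        means = (r * z).sum(axis=0) / nk
        variances = (r * (z - means) ** 2).sum(axis=0) / nk
        return weights, means, variances

    z = np.array([-8.0, -7.2, -6.9, 7.1, 7.8, 8.3])
    params = (np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0]))
    for _ in range(20):
        params = em_step(z, *params)
    print(params)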

Page 39: ML Objective HMM

Given samples: observation sequences z_0, z_1, …, z_T (the states x_t are not observed)

Dynamics model: P(x_{t+1} | x_t)

Observation model: P(z_t | x_t)

ML objective: maximize log P(z_0, …, z_T ; θ), which requires summing over all possible hidden state sequences

No simple decomposition into independent ML problems for each P(x_{t+1} | x_t) and each P(z_t | x_t)

No closed-form solution is found by setting the derivatives equal to zero

Page 40: EM for HMM --- M-step

θ and γ are computed from “soft” counts

Page 41: EM for HMM --- E-step

No need to compute the conditional over the full joint hidden state sequence

Run the smoother to find the smoothed marginals P(x_t | z_0, …, z_T) and pairwise marginals P(x_t, x_{t+1} | z_0, …, z_T)

Page 42: ML Objective for Linear Gaussians

Linear Gaussian setting:

Given: the observations z_0, z_1, …, z_T (the states x_t are not observed)

ML objective:

EM derivation: same as for the HMM

Page 43: EM for Linear Gaussians --- E-Step

Forward: run the (Kalman) filter to compute P(x_t | z_0, …, z_t)

Backward: run the smoother to compute P(x_t | z_0, …, z_T)

Page 44: EM for Linear Gaussians --- M-step


[Updates for A, B, C, d. TODO: Fill in once found/derived.]

Page 45: EM for Linear Gaussians --- The Log-likelihood

When running EM, it can be good to keep track of the log-likelihood score --- it is supposed to increase every iteration


Page 46: EM for Extended Kalman Filter Setting

As the linearization is only an approximation, when performing the updates, we might end up with parameters that result in a lower (rather than higher) log-likelihood score

Solution: instead of updating the parameters to the newly estimated ones, interpolate between the previous parameters and the newly estimated ones. Perform a “line-search” to find the setting that achieves the highest log-likelihood score

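A sketch in Python of the interpolation / line-search idea (hypothetical names; `loglik` stands for whatever routine scores a flat parameter vector): try several step sizes between the old and newly estimated parameters and keep the best-scoring one. Including step size 0 guarantees the score never decreases.

    import numpy as np

    def damped_update(theta_old, theta_new, loglik, steps=(1.0, 0.5, 0.25, 0.1, 0.0)):
        """Interpolate between the previous and newly estimated parameters (flat arrays).

        `loglik(theta)` returns the log-likelihood score of a candidate parameter vector;
        the candidate with the highest score is returned."""
        candidates = [(1 - a) * theta_old + a * theta_new for a in steps]
        scores = [loglik(theta) for theta in candidates]
        return candidates[int(np.argmax(scores))]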