Learning Partially Observed GM: the Expectation-Maximization Algorithm — Kayhan Batmanghelich



Page 1:

Learning Partially Observed GM: the Expectation-Maximization Algorithm

Kayhan Batmanghelich

1

Page 2:

Recall: Learning Graphical Models

• Scenarios:
  • completely observed GMs (directed, undirected)
  • partially or unobserved GMs (directed, undirected; the undirected case is an open research topic)
• Estimation principles:
  • Maximum likelihood estimation (MLE)
  • Bayesian estimation
  • Maximal conditional likelihood
  • Maximal "margin"
  • Maximum entropy
• We use "learning" as a name for the process of estimating the parameters, and in some cases the topology of the network, from data.

© Eric Xing @ CMU, 2005-2015 2

Page 3:

Recall: Approaches to inference

• Exact inference algorithms:
  • the elimination algorithm
  • message-passing algorithms (sum-product, belief propagation)
  • the junction tree algorithm
• Approximate inference techniques:
  • stochastic simulation / sampling methods
  • Markov chain Monte Carlo (MCMC) methods
  • variational algorithms

© Eric Xing @ CMU, 2005-2015 3

Page 4:

Partially observed GMs

• Speech recognition

[Figure: hidden Markov model for speech — a chain of hidden states Y1 … YT with transition model A, each emitting an observation X1 … XT]

4© Eric Xing @ CMU, 2005-2015

Page 5:

Partially observed GM

• Biological evolution

[Figure: an unobserved ancestral nucleotide (?) evolves over T years under substitution processes Q_h and Q_m into observed descendants, e.g., A and C, within sequences such as AGAGAC]

5© Eric Xing @ CMU, 2005-2015

Page 6:

Mixture Models

6© Eric Xing @ CMU, 2005-2015

Page 7:

Mixture Models, cont'd

• A density model p(x) may be multi-modal.
• We may be able to model it as a mixture of uni-modal distributions (e.g., Gaussians).
• Each mode may correspond to a different sub-population (e.g., male and female).

[Figure: a multi-modal density decomposed into a mixture of uni-modal (Gaussian) components]

7© Eric Xing @ CMU, 2005-2015

Page 8:

Unobserved Variables

• A variable can be unobserved (latent) because:
  • it is an imaginary quantity meant to provide a simplified and abstract view of the data-generation process (e.g., speech recognition models, mixture models, …)
  • it is a real-world object and/or phenomenon that is difficult or impossible to measure (e.g., the temperature of a star, causes of a disease, evolutionary ancestors, …)
  • it is a noisy measurement of a real-world object (i.e., the true value is unobserved).
• Example: discrete latent variables can be used to partition/cluster data into sub-groups.
• Example: continuous latent variables (factors) can be used for dimensionality reduction (factor analysis, etc.).

8© Eric Xing @ CMU, 2005-2015

Page 9:

Gaussian Mixture Models (GMMs)

• Consider a mixture of K Gaussian components:

$$p(x_n \mid \mu, \Sigma) = \sum_k \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)$$

where $\pi_k$ is the mixture proportion and $\mathcal{N}(x_n \mid \mu_k, \Sigma_k)$ the mixture component.

• This model can be used for unsupervised clustering.
• This model (fit by AutoClass) has been used to discover new kinds of stars in astronomical data, etc.

9© Eric Xing @ CMU, 2005-2015

Page 10:

Gaussian Mixture Models (GMMs)

• Consider a mixture of K Gaussian components.
• Z is a latent class-indicator vector:

$$p(z_n) = \mathrm{Cat}(z_n; \pi) = \prod_k (\pi_k)^{I(z_n = k)}$$

• X is a conditional Gaussian variable with a class-specific mean/covariance:

$$p(x_n \mid z_n = k; \{\mu_k, \Sigma_k\}_{k=1}^K) = \frac{1}{(2\pi)^{m/2} \det(\Sigma_k)^{1/2}} \exp\!\left(-\tfrac{1}{2}(x_n - \mu_k)^T \Sigma_k^{-1} (x_n - \mu_k)\right)$$

• The likelihood of a sample:

$$p(x_n \mid \{\mu_k, \Sigma_k\}_{k=1}^K) = \sum_k p(z_n = k; \pi)\, p(x_n \mid z_n = k; \{\mu_k, \Sigma_k\}_{k=1}^K) = \sum_k \pi_k\, \mathcal{N}(x_n; \mu_k, \Sigma_k)$$

where $\pi_k$ is the mixture proportion and $\mathcal{N}(x_n; \mu_k, \Sigma_k)$ the mixture component.

[Graphical model: latent z_n → observed x_n]
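As a concrete illustration of the last formula, here is a minimal NumPy/SciPy sketch that evaluates p(x_n) = Σ_k π_k N(x_n; μ_k, Σ_k); the toy parameters and variable names are illustrative, not from the slides.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Toy 2-component, 2-D mixture (illustrative parameters).
pi = np.array([0.4, 0.6])                       # mixture proportions, sum to 1
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), 0.5 * np.eye(2)]

def gmm_density(x, pi, mus, Sigmas):
    """p(x) = sum_k pi_k N(x; mu_k, Sigma_k)."""
    return sum(p * multivariate_normal.pdf(x, mean=m, cov=S)
               for p, m, S in zip(pi, mus, Sigmas))

x_n = np.array([1.0, 1.5])
print(gmm_density(x_n, pi, mus, Sigmas))        # likelihood of one sample
```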

Page 11:

Why is Learning Harder?

• In fully observed iid settings, the log likelihood decomposes into a sum of local terms (at least for directed models):

$$\ell_c(\theta; D) = \log p(x, z \mid \theta) = \log p(z \mid \theta_z) + \log p(x \mid z, \theta_x)$$

• With latent variables, all the parameters become coupled together via marginalization:

$$\ell(\theta; D) = \log \sum_z p(x, z \mid \theta) = \log \sum_z p(z \mid \theta_z)\, p(x \mid z, \theta_x)$$

11© Eric Xing @ CMU, 2005-2015

Page 12:

Toward the EM algorithm

• Recall MLE for completely observed data.
• Data log-likelihood:

$$\ell(\theta; D) = \log \prod_n p(x_n, z_n) = \log \prod_n p(z_n \mid \pi)\, p(x_n \mid z_n, \mu, \sigma)$$
$$= \sum_n \log \prod_k \pi_k^{z_n^k} + \sum_n \log \prod_k \mathcal{N}(x_n; \mu_k, \sigma)^{z_n^k}$$
$$= \sum_n \sum_k z_n^k \log \pi_k \;-\; \sum_n \sum_k z_n^k \frac{1}{2\sigma^2}(x_n - \mu_k)^2 + C$$

• Separate MLE:

$$\hat{\pi}_{k,\mathrm{MLE}} = \arg\max_\pi \ell(\theta; D), \qquad \hat{\mu}_{k,\mathrm{MLE}} = \arg\max_\mu \ell(\theta; D) \;\Rightarrow\; \hat{\mu}_{k,\mathrm{MLE}} = \frac{\sum_n z_n^k x_n}{\sum_n z_n^k}$$

• What if we do not know z_n?

12© Eric Xing @ CMU, 2005-2015

[Graphical model: latent z_n → observed x_n, plate over N. Callouts: "Let's pretend z_n is observed"; "Let's assume σ is known".]
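If z_n really were observed, the separate MLEs above reduce to counts and per-cluster averages. A small sketch under that assumption (hard labels z in {0, …, K−1}; the names and toy data are illustrative):

```python
import numpy as np

def mle_fully_observed(x, z, K):
    """MLE with observed indicators: pi_k = (# points in k)/N, mu_k = mean of cluster k."""
    N = len(x)
    pi_hat = np.array([(z == k).sum() / N for k in range(K)])
    mu_hat = np.array([x[z == k].mean() for k in range(K)])
    return pi_hat, mu_hat

# Toy 1-D data with known labels.
x = np.array([0.1, -0.2, 0.3, 5.1, 4.8, 5.2])
z = np.array([0, 0, 0, 1, 1, 1])
print(mle_fully_observed(x, z, K=2))   # pi ≈ [0.5, 0.5], mu ≈ [0.07, 5.03]
```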

Page 13:

Question

• "… We solve problem X using Expectation-Maximization …" What does that mean?
• E:
  • What do we take the expectation with?
  • What do we take the expectation over?
• M:
  • What do we maximize?
  • What do we maximize with respect to?

13© Eric Xing @ CMU, 2005-2015

Page 14:

Recall: K-means

• "E-step" (hard assignment):

$$z_n^{(t)} = \arg\max_k \; -\,(x_n - \mu_k^{(t)})^T \big(\Sigma_k^{(t)}\big)^{-1} (x_n - \mu_k^{(t)})$$

• "M-step" (centroid update):

$$\mu_k^{(t+1)} = \frac{\sum_n \delta(z_n^{(t)}, k)\, x_n}{\sum_n \delta(z_n^{(t)}, k)}$$

14© Eric Xing @ CMU, 2005-2015
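A minimal sketch of one K-means iteration matching the two updates above, assuming plain Euclidean distance (Σ_k = I); the function and variable names are illustrative:

```python
import numpy as np

def kmeans_step(X, mu):
    """One K-means iteration: hard assignment, then mean update."""
    # 'E-step': assign each point to its nearest centroid.
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)    # (N, K) squared distances
    z = d2.argmin(axis=1)
    # 'M-step': recompute each centroid as the mean of its points.
    mu_new = np.array([X[z == k].mean(axis=0) if (z == k).any() else mu[k]
                       for k in range(len(mu))])
    return z, mu_new

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 4.0])
mu = X[np.random.choice(len(X), 2, replace=False)]          # random initial centroids
for _ in range(10):
    z, mu = kmeans_step(X, mu)
print(mu)                                                    # roughly [0,0] and [4,4]
```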

Page 15:

Expectation-Maximization

• Start: "guess" the centroid μ_k and covariance Σ_k of each of the K clusters.
• Loop: alternate the E-step and M-step described on the next slides.

15© Eric Xing @ CMU, 2005-2015

Page 16:

Example: Gaussian mixture model

• A mixture of K Gaussians; the likelihood of a sample is

$$p(x_n \mid \mu, \Sigma) = \sum_{z_n} p(z_n \mid \pi)\, p(x_n \mid z_n, \mu, \Sigma) = \sum_k \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)$$

• Z is a latent class-indicator vector:

$$p(z_n) = \mathrm{Cat}(z_n; \pi) = \prod_k (\pi_k)^{I(z_n = k)}$$

• X is a conditional Gaussian variable with class-specific mean/covariance:

$$p(x_n \mid z_n^k = 1, \mu, \Sigma) = \frac{1}{(2\pi)^{m/2} |\Sigma_k|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x_n - \mu_k)^T \Sigma_k^{-1} (x_n - \mu_k)\right)$$

• The expected complete log likelihood:

$$\langle \ell_c(\theta; x, z) \rangle = \sum_n \big\langle \log p(z_n \mid \pi) \big\rangle_{p(z \mid x)} + \sum_n \big\langle \log p(x_n \mid z_n, \mu, \Sigma) \big\rangle_{p(z \mid x)}$$
$$= \sum_n \sum_k \langle z_n^k \rangle \log \pi_k \;-\; \frac{1}{2} \sum_n \sum_k \langle z_n^k \rangle \Big( (x_n - \mu_k)^T \Sigma_k^{-1} (x_n - \mu_k) + \log |\Sigma_k| + C \Big)$$

16© Eric Xing @ CMU, 2005-2015

[Graphical model: latent z_n → observed x_n, plate over N, with $p(z_n) = \mathrm{Cat}(z_n; \pi) = \prod_k (\pi_k)^{I(z_n = k)}$]

Page 17:

E-step

• We maximize $\langle \ell_c(\theta) \rangle$ iteratively using the following procedure:

─ Expectation step: compute the expected value of the sufficient statistics of the hidden variables (i.e., z) given the current estimate of the parameters (i.e., π and μ):

$$\tau_n^{k(t)} = \langle z_n^k \rangle_{q^{(t)}} = p(z_n^k = 1 \mid x, \mu^{(t)}, \Sigma^{(t)}) = \frac{\pi_k^{(t)}\, \mathcal{N}(x_n \mid \mu_k^{(t)}, \Sigma_k^{(t)})}{\sum_i \pi_i^{(t)}\, \mathcal{N}(x_n \mid \mu_i^{(t)}, \Sigma_i^{(t)})}$$

• Here we are essentially doing inference.

18© Eric Xing @ CMU, 2005-2015

The responsibility τ acts like a soft count.
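A possible implementation of this E-step for a Gaussian mixture, computing the soft counts τ_n^{k(t)}; the array names (`X`, `pi`, `mus`, `Sigmas`) are illustrative:

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, pi, mus, Sigmas):
    """tau[n, k] = pi_k N(x_n; mu_k, Sigma_k) / sum_i pi_i N(x_n; mu_i, Sigma_i)."""
    N, K = X.shape[0], len(pi)
    tau = np.zeros((N, K))
    for k in range(K):
        tau[:, k] = pi[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
    tau /= tau.sum(axis=1, keepdims=True)   # normalize over components
    return tau

# Usage: tau = e_step(X, pi, mus, Sigmas); each row of tau sums to 1.
```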

Page 18:

M-step

• We maximize $\langle \ell_c(\theta) \rangle$ iteratively using the following procedure:

─ Maximization step: compute the parameters given the current expected values of the hidden variables:

$$\pi_k^* = \arg\max_\pi \langle \ell_c(\theta) \rangle, \quad \frac{\partial \langle \ell_c(\theta) \rangle}{\partial \pi_k} = 0 \;\; \forall k, \;\; \text{s.t.}\; \sum_k \pi_k = 1 \;\Rightarrow\; \pi_k^* = \frac{\sum_n \langle z_n^k \rangle_{q^{(t)}}}{N} = \frac{\sum_n \tau_n^{k(t)}}{N} = \frac{\langle n_k \rangle}{N}$$

$$\mu_k^* = \arg\max_\mu \langle \ell_c(\theta) \rangle \;\Rightarrow\; \mu_k^{(t+1)} = \frac{\sum_n \tau_n^{k(t)}\, x_n}{\sum_n \tau_n^{k(t)}}$$

$$\Sigma_k^* = \arg\max_\Sigma \langle \ell_c(\theta) \rangle \;\Rightarrow\; \Sigma_k^{(t+1)} = \frac{\sum_n \tau_n^{k(t)}\, (x_n - \mu_k^{(t+1)})(x_n - \mu_k^{(t+1)})^T}{\sum_n \tau_n^{k(t)}}$$

Fact:

$$\frac{\partial \log |A^{-1}|}{\partial A^{-1}} = A^{T}, \qquad \frac{\partial\, x^T A^{-1} x}{\partial A^{-1}} = x x^{T}$$

• This is isomorphic to MLE except that the variables that are hidden are replaced by their expectations (in general they will be replaced by their corresponding "sufficient statistics").

19© Eric Xing @ CMU, 2005-2015
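The corresponding M-step updates in code, a sketch assuming the responsibilities `tau` come from the E-step above (all names are illustrative):

```python
import numpy as np

def m_step(X, tau):
    """Closed-form M-step: soft counts, weighted means, and weighted covariances."""
    N, D = X.shape
    Nk = tau.sum(axis=0)                         # soft counts <n_k>
    pi = Nk / N                                  # pi_k = <n_k> / N
    mus = (tau.T @ X) / Nk[:, None]              # mu_k = sum_n tau_nk x_n / <n_k>
    Sigmas = []
    for k in range(tau.shape[1]):
        diff = X - mus[k]
        Sigmas.append((tau[:, k, None] * diff).T @ diff / Nk[k])
    return pi, mus, Sigmas

# Usage: pi, mus, Sigmas = m_step(X, e_step(X, pi, mus, Sigmas))
```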

Page 19:

Compare: K-means and EM

• K-means
  • In the K-means "E-step" we do hard assignment:

  $$z_n^{(t)} = \arg\max_k \; -\,(x_n - \mu_k^{(t)})^T \big(\Sigma_k^{(t)}\big)^{-1} (x_n - \mu_k^{(t)})$$

  • In the K-means "M-step" we update the means as the weighted sum of the data, but the weights are 0 or 1:

  $$\mu_k^{(t+1)} = \frac{\sum_n \delta(z_n^{(t)}, k)\, x_n}{\sum_n \delta(z_n^{(t)}, k)}$$

• EM
  • E-step (soft assignment):

  $$\tau_n^{k(t)} = \langle z_n^k \rangle_{q^{(t)}} = p(z_n^k = 1 \mid x, \mu^{(t)}, \Sigma^{(t)}) = \frac{\pi_k^{(t)}\, \mathcal{N}(x_n \mid \mu_k^{(t)}, \Sigma_k^{(t)})}{\sum_i \pi_i^{(t)}\, \mathcal{N}(x_n \mid \mu_i^{(t)}, \Sigma_i^{(t)})}$$

  • M-step (weighted mean update):

  $$\mu_k^{(t+1)} = \frac{\sum_n \tau_n^{k(t)}\, x_n}{\sum_n \tau_n^{k(t)}}$$

The EM algorithm for mixtures of Gaussians is like a "soft version" of the K-means algorithm.

20© Eric Xing @ CMU, 2005-2015
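Putting the two steps together gives the "soft K-means" loop. A compact, self-contained sketch on toy data (the initialization, iteration count, and all names are illustrative choices, not prescribed by the slides):

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def em_gmm(X, K, iters=50, seed=0):
    """EM for a Gaussian mixture: soft assignments instead of K-means' hard ones."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pi = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, K, replace=False)]                  # random data points as means
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    for _ in range(iters):
        # E-step: responsibilities (soft counts).
        tau = np.column_stack([pi[k] * mvn.pdf(X, mus[k], Sigmas[k]) for k in range(K)])
        tau /= tau.sum(1, keepdims=True)
        # M-step: weighted MLE updates.
        Nk = tau.sum(0)
        pi = Nk / N
        mus = (tau.T @ X) / Nk[:, None]
        for k in range(K):
            d = X - mus[k]
            Sigmas[k] = (tau[:, k, None] * d).T @ d / Nk[k] + 1e-6 * np.eye(D)
    return pi, mus, Sigmas

X = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + 5.0])
print(em_gmm(X, K=2)[1])   # estimated means, roughly [0, 0] and [5, 5]
```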

Page 20:

Theory underlying EM

• What are we doing?
• Recall that according to MLE, we intend to learn the model parameters that maximize the likelihood of the data.
• But we do not observe z, so computing

$$\ell(\theta; D) = \log \sum_z p(x, z \mid \theta) = \log \sum_z p(z \mid \theta_z)\, p(x \mid z, \theta_x)$$

is difficult!
• What shall we do?

21© Eric Xing @ CMU, 2005-2015

Page 21:

Complete & Incomplete Log Likelihoods

• Complete log likelihood
  Let X denote the observable variable(s) and Z the latent variable(s). If Z could be observed, then

  $$\ell_c(\theta; x, z) \;\stackrel{\text{def}}{=}\; \log p(x, z \mid \theta)$$

  • Recall that in this case the objective for, e.g., MLE decomposes into a sum of factors, and the parameter for each factor can be estimated separately (cf. MLE for fully observed models).
  • But given that Z is not observed, ℓ_c() is a random quantity and cannot be maximized directly.

• Incomplete log likelihood
  With z unobserved, our objective becomes the log of a marginal probability:

  $$\ell(\theta; x) = \log p(x \mid \theta) = \log \sum_z p(x, z \mid \theta)$$

  • This objective won't decouple.

22© Eric Xing @ CMU, 2005-2015

Page 22:

Expected Complete Log Likelihood

• For any distribution q(z), define the expected complete log likelihood:

$$\langle \ell_c(\theta; x, z) \rangle_q \;\stackrel{\text{def}}{=}\; \sum_z q(z \mid x)\, \log p(x, z \mid \theta)$$

  • A deterministic function of θ.
  • Linear in ℓ_c(), so it inherits its factorizability.
  • Does maximizing this surrogate yield a maximizer of the likelihood?

• Jensen's inequality:

$$\ell(\theta; x) = \log p(x \mid \theta) = \log \sum_z p(x, z \mid \theta) = \log \sum_z q(z \mid x)\, \frac{p(x, z \mid \theta)}{q(z \mid x)} \;\ge\; \sum_z q(z \mid x)\, \log \frac{p(x, z \mid \theta)}{q(z \mid x)}$$

$$\Rightarrow\quad \ell(\theta; x) \;\ge\; \langle \ell_c(\theta; x, z) \rangle_q + H_q$$

23© Eric Xing @ CMU, 2005-2015
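The bound is easy to check numerically. A small sketch for a two-component 1-D mixture and a single observation, comparing ℓ(θ; x) against ⟨ℓ_c⟩_q + H_q for an arbitrary q and for the posterior (all numbers are illustrative):

```python
import numpy as np
from scipy.stats import norm

pi = np.array([0.3, 0.7])
mu = np.array([0.0, 4.0])
sd = np.array([1.0, 1.0])
x = 1.5

joint = pi * norm.pdf(x, mu, sd)          # p(x, z=k | theta)
log_lik = np.log(joint.sum())             # l(theta; x), the incomplete log likelihood

def lower_bound(q):
    """<l_c>_q + H_q = sum_k q_k log p(x, z=k) - sum_k q_k log q_k."""
    return np.sum(q * np.log(joint)) - np.sum(q * np.log(q))

q_arbitrary = np.array([0.5, 0.5])
q_posterior = joint / joint.sum()
print(log_lik, lower_bound(q_arbitrary), lower_bound(q_posterior))
# log_lik >= lower_bound(q) for any q, with equality at the posterior.
```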

Page 23:

Lower Bounds and Free Energy

• For fixed data x, define a functional called the free energy:

$$F(q, \theta) \;\stackrel{\text{def}}{=}\; \sum_z q(z \mid x)\, \log \frac{p(x, z \mid \theta)}{q(z \mid x)} \;\le\; \ell(\theta; x)$$

• The EM algorithm is coordinate ascent on F:
  • E-step: $q^{t+1} = \arg\max_q F(q, \theta^t)$
  • M-step: $\theta^{t+1} = \arg\max_\theta F(q^{t+1}, \theta^t)$

24© Eric Xing @ CMU, 2005-2015


Page 24:

E-step: maximization of the expected ℓ_c w.r.t. q

• Claim:

$$q^{t+1} = \arg\max_q F(q, \theta^t) = p(z \mid x, \theta^t)$$

• The best solution is the posterior over the latent variables given the data and the parameters. Often we need this at test time anyway (e.g., to perform classification).
• Proof (easy): this setting attains the bound $\ell(\theta; x) \ge F(q, \theta)$:

$$F\big(p(z \mid x, \theta^t), \theta^t\big) = \sum_z p(z \mid x, \theta^t)\, \log \frac{p(x, z \mid \theta^t)}{p(z \mid x, \theta^t)} = \sum_z p(z \mid x, \theta^t)\, \log p(x \mid \theta^t) = \log p(x \mid \theta^t) = \ell(\theta^t; x)$$

• Can also show this result using variational calculus, or from the fact that

$$\ell(\theta; x) - F(q, \theta) = \mathrm{KL}\big(q \,\|\, p(z \mid x, \theta)\big)$$

25 © Eric Xing @ CMU, 2005-2015
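The identity ℓ(θ; x) − F(q, θ) = KL(q ‖ p(z|x, θ)) can likewise be verified in a few lines on the same kind of toy mixture (numbers and names are illustrative):

```python
import numpy as np
from scipy.stats import norm

pi, mu, sd, x = np.array([0.3, 0.7]), np.array([0.0, 4.0]), np.array([1.0, 1.0]), 1.5
joint = pi * norm.pdf(x, mu, sd)          # p(x, z=k | theta)
log_lik = np.log(joint.sum())             # l(theta; x)
post = joint / joint.sum()                # p(z | x, theta)

q = np.array([0.2, 0.8])                  # any distribution over z
F = np.sum(q * np.log(joint / q))         # free energy F(q, theta)
kl = np.sum(q * np.log(q / post))         # KL(q || p(z|x, theta))

print(np.isclose(log_lik - F, kl))        # True: the gap is exactly the KL term
```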

Page 25: Learning Partially Observed GM: the Expectation ...Mixture Models, con'd •A density model p(x) may be multi-modal. •We may be able to model it as a mixture of uni-modal distributions

E-step ≡ plug in the posterior expectation of the latent variables

• Without loss of generality, assume that p(x, z | θ) is a generalized exponential-family distribution:

$$p(x, z \mid \theta) = \frac{1}{Z(\theta)}\, h(x, z)\, \exp\!\Big\{ \sum_i \theta_i f_i(x, z) \Big\}, \qquad A(\theta) = \log Z(\theta)$$

• Special case: if p(X | Z) are GLMs, then $f_i(x, z) = \eta_i^T(z)\, \xi_i(x)$.

• The expected complete log likelihood under $q^{t+1} = p(z \mid x, \theta^t)$ is

$$\langle \ell_c(\theta; x, z) \rangle_{q^{t+1}} = \sum_z q(z \mid x, \theta^t)\, \log p(x, z \mid \theta) = \sum_i \theta_i \big\langle f_i(x, z) \big\rangle_{q(z \mid x, \theta^t)} - A(\theta) + \text{const.}$$

$$\overset{\text{GLIM}}{=}\; \sum_i \theta_i \big\langle \eta_i(z) \big\rangle_{q(z \mid x, \theta^t)}\, \xi_i(x) - A(\theta) + \text{const.}$$

26© Eric Xing @ CMU, 2005-2015

Page 26:

M-step: maximization of the expected ℓ_c w.r.t. θ

• Note that the free energy breaks into two terms:

$$F(q, \theta) = \sum_z q(z \mid x)\, \log \frac{p(x, z \mid \theta)}{q(z \mid x)} = \sum_z q(z \mid x)\, \log p(x, z \mid \theta) - \sum_z q(z \mid x)\, \log q(z \mid x) = \langle \ell_c(\theta; x, z) \rangle_q + H_q$$

• Thus, in the M-step, maximizing with respect to θ for fixed q, we only need to consider the first term:

$$\theta^{t+1} = \arg\max_\theta \langle \ell_c(\theta; x, z) \rangle_{q^{t+1}} = \arg\max_\theta \sum_z q^{t+1}(z \mid x)\, \log p(x, z \mid \theta)$$

• Under the optimal q^{t+1}, this is equivalent to solving a standard MLE of the fully observed model p(x, z | θ), with the sufficient statistics involving z replaced by their expectations w.r.t. p(z | x, θ).

27© Eric Xing @ CMU, 2005-2015

Page 27:

Example: HMM

• Supervised learning: estimation when the "right answer" is known.
• Examples:
  GIVEN: a genomic region x = x1…x1,000,000 where we have good (experimental) annotations of the CpG islands.
  GIVEN: the casino player allows us to observe him one evening, as he changes dice and produces 10,000 rolls.

• Unsupervised learning: estimation when the "right answer" is unknown.
• Examples:
  GIVEN: the porcupine genome; we don't know how frequent the CpG islands are there, nor do we know their composition.
  GIVEN: 10,000 rolls of the casino player, but we don't see when he changes dice.

• QUESTION: update the parameters θ of the model to maximize P(x | θ): maximum likelihood (ML) estimation.

28© Eric Xing @ CMU, 2005-2015

Page 28:

Hidden Markov Model: from static to dynamic mixture models

• Dynamic mixture (HMM): a chain of hidden states Y1 … YT, with transition model A, each state emitting an observation X1 … XT.
• Static mixture: a single latent class Y1 generating an observation X1, replicated over N i.i.d. samples.

The sequence (observations): speech signal, sequence of rolls, …
The underlying source (hidden states): phonemes, dice, …

29© Eric Xing @ CMU, 2005-2015

Page 29:

The Baum-Welch algorithm

30

[Figure: HMM with hidden chain Y1 … YT and observations X1 … XT]

• The complete log likelihood:

$$\ell_c(x, y; \theta) = \log p(x, y; \theta) = \log \prod_n \Big( p(y_1^n) \prod_{t=2}^{T} p(y_t^n \mid y_{t-1}^n) \prod_{t=1}^{T} p(x_t^n \mid y_t^n) \Big)$$

• The expected complete log likelihood:

$$\langle \ell_c(x, y; \theta) \rangle = \sum_n \big\langle \log p(y_1^n) \big\rangle_{p(y \mid x)} + \sum_n \sum_{t=2}^{T} \big\langle \log p(y_t^n \mid y_{t-1}^n) \big\rangle_{p(y \mid x)} + \sum_n \sum_{t=1}^{T} \big\langle \log p(x_t^n \mid y_t^n) \big\rangle_{p(y \mid x)}$$

where θ = (π, A, B) are the parameters of the model (initial-state, transition, and emission probabilities).

Page 30:

The Baum-Welch algorithm

31

[Figure: HMM with hidden chain Y1 … YT and observations X1 … XT]

• The complete log likelihood:

$$\ell_c(x, y; \theta) = \log p(x, y; \theta) = \log \prod_n \Big( p(y_1^n) \prod_{t=2}^{T} p(y_t^n \mid y_{t-1}^n) \prod_{t=1}^{T} p(x_t^n \mid y_t^n) \Big)$$

• E-step: fix θ and compute the marginal posteriors
  • $p(y_t^n = k \mid x^n; \theta)$
  • $p(y_t^n = k,\, y_{t-1}^n = j \mid x^n; \theta)$
• M-step: update θ by MLE (closed form) using these posteriors, remembering the soft counts.
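For the discrete-emission case, these E-step posteriors can be computed with the forward-backward recursions. A sketch with the usual per-step scaling; the parameter names (`pi0`, `A`, `B` for initial, transition, and emission probabilities) and the toy numbers are illustrative:

```python
import numpy as np

def forward_backward(obs, pi0, A, B):
    """Posterior marginals p(y_t = k | x) and pairwise p(y_t = j, y_{t+1} = k | x)
    for a discrete-emission HMM, using scaled forward-backward recursions."""
    T, K = len(obs), len(pi0)
    alpha = np.zeros((T, K)); beta = np.zeros((T, K)); c = np.zeros(T)
    # scaled forward pass
    alpha[0] = pi0 * B[:, obs[0]]
    c[0] = alpha[0].sum(); alpha[0] /= c[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        c[t] = alpha[t].sum(); alpha[t] /= c[t]
    # scaled backward pass
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / c[t + 1]
    gamma = alpha * beta                         # p(y_t | x): singleton soft counts
    xi = np.zeros((T - 1, K, K))                 # p(y_t, y_{t+1} | x): pairwise soft counts
    for t in range(T - 1):
        xi[t] = (alpha[t][:, None] * A *
                 (B[:, obs[t + 1]] * beta[t + 1])[None, :]) / c[t + 1]
    return gamma, xi

# Toy 2-state model with 2 observation symbols (illustrative numbers).
pi0 = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.2, 0.8]])
B = np.array([[0.9, 0.1], [0.3, 0.7]])
gamma, xi = forward_backward(np.array([0, 1, 1, 0]), pi0, A, B)
# M-step soft counts: pi0_new ∝ gamma[0]; A_new ∝ xi.sum(0); B_new from gamma.
```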

Page 31:

Extension to general BN

Page 32:

EM for general BNs: x represents both hidden and observed variables.

A bit of notation: see Section 11.2.4 of David Barber's book. 33

Page 33:

EM for general BNs: x represents both hidden and observed variables.

See Section 11.2.4 of David Barber's book. 34

Page 34:

EM for general BNs: x represents both hidden and observed variables.

See Section 11.2.4 of David Barber's book. 35

Page 35:

Summary: EM Algorithm

• A way of maximizing the likelihood function for latent variable models. It finds the MLE of the parameters when the original (hard) problem can be broken up into two (easy) pieces:
  1. Estimate some "missing" or "unobserved" data from the observed data and the current parameters.
  2. Using this "complete" data, find the maximum likelihood parameter estimates.
• Alternate between filling in the latent variables using the best guess (posterior) and updating the parameters based on this guess:
  • E-step: $q^{t+1} = \arg\max_q F(q, \theta^t)$
  • M-step: $\theta^{t+1} = \arg\max_\theta F(q^{t+1}, \theta^t)$
• In the M-step we optimize a lower bound on the likelihood. In the E-step we close the gap, making the bound equal to the likelihood.

39© Eric Xing @ CMU, 2005-2015

Page 36:

More Examples

Page 37:

Conditional mixture model: Mixture of experts

• We will model p(Y | X) using different experts, each responsible for a different region of the input space.
• A latent variable Z chooses the expert using a softmax gating function:

$$P(z^k = 1 \mid x) = \mathrm{softmax}_k(\xi_k^T x)$$

• Each expert can be a linear regression model:

$$P(y \mid x, z^k = 1) = \mathcal{N}(y;\; \theta_k^T x,\; \sigma_k^2)$$

• The posterior expert responsibilities are

$$P(z^k = 1 \mid x, y, \theta) = \frac{p(z^k = 1 \mid x)\, p_k(y \mid x, \theta_k, \sigma_k^2)}{\sum_j p(z^j = 1 \mid x)\, p_j(y \mid x, \theta_j, \sigma_j^2)}$$

41© Eric Xing @ CMU, 2005-2015

Page 38:

EM for conditional mixture model

• Model:

$$P(y \mid x) = \sum_k p(z^k = 1 \mid x, \xi)\, p(y \mid z^k = 1, x, \theta, \sigma)$$

• The objective function (expected complete log likelihood):

$$\langle \ell_c(\theta; x, y, z) \rangle = \sum_n \big\langle \log p(z_n \mid x_n, \xi) \big\rangle + \sum_n \big\langle \log p(y_n \mid x_n, z_n, \theta, \sigma) \big\rangle$$
$$= \sum_n \sum_k \langle z_n^k \rangle \log \mathrm{softmax}_k(\xi_k^T x_n) \;-\; \sum_n \sum_k \frac{\langle z_n^k \rangle}{2\sigma_k^2}(y_n - \theta_k^T x_n)^2 \;-\; \sum_n \sum_k \frac{\langle z_n^k \rangle}{2} \log \sigma_k^2 + C$$

• EM:
  • E-step:

  $$\tau_n^{k(t)} = P(z_n^k = 1 \mid x_n, y_n, \theta^{(t)}) = \frac{p(z_n^k = 1 \mid x_n)\, p_k(y_n \mid x_n, \theta_k, \sigma_k^2)}{\sum_j p(z_n^j = 1 \mid x_n)\, p_j(y_n \mid x_n, \theta_j, \sigma_j^2)}$$

  • M-step:
    • Use the normal equation for standard linear regression, $\theta = (X^T X)^{-1} X^T Y$, but with the data re-weighted by τ (homework).
    • Use IRLS and/or weighted IRLS to update {ξ_k, θ_k, σ_k} based on the data pairs (x_n, y_n), with weights τ_n^{k(t)} (homework?).

42© Eric Xing @ CMU, 2005-2015
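A sketch of the M-step pieces named above: each expert θ_k solves a τ-weighted normal equation, σ_k is a weighted residual standard deviation, and the gating parameters ξ are moved by a simple gradient step on the weighted log-softmax term (used here as a stand-in for the IRLS update mentioned in the slide). All function and variable names are illustrative.

```python
import numpy as np

def weighted_least_squares(X, y, w):
    """theta_k = argmin sum_n w_n (y_n - theta^T x_n)^2 (re-weighted normal equation)."""
    W = np.diag(w)
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def m_step_experts(X, y, tau):
    """Per-expert weighted regression and variance; tau[n, k] = responsibility."""
    thetas, sigmas = [], []
    for k in range(tau.shape[1]):
        th = weighted_least_squares(X, y, tau[:, k])
        resid = y - X @ th
        thetas.append(th)
        sigmas.append(np.sqrt(np.sum(tau[:, k] * resid ** 2) / tau[:, k].sum()))
    return np.array(thetas), np.array(sigmas)

def gating_gradient_step(X, tau, Xi, lr=0.1):
    """One gradient-ascent step on sum_n sum_k tau_nk log softmax(Xi x_n)_k
    (a simple substitute for the IRLS update)."""
    P = softmax(X @ Xi.T)              # (N, K) gating probabilities
    grad = (tau - P).T @ X             # gradient w.r.t. Xi, shape (K, D)
    return Xi + lr * grad
```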

Page 39:

Hierarchical mixture of experts

• This is like a soft version of a depth-2 classification/regression tree.
• P(Y | X, G1, G2) can be modeled as a GLIM, with parameters dependent on the values of G1 and G2 (which specify a "conditional path" to a given leaf in the tree).

43© Eric Xing @ CMU, 2005-2015

Page 40:

Mixture of overlapping experts

• By removing the X → Z arc, we can make the partitions independent of the input, thus allowing overlap.
• This is a mixture of linear regressors; each subpopulation has a different conditional mean.

$$P(z^k = 1 \mid x, y, \theta) = \frac{p(z^k = 1)\, p_k(y \mid x, \theta_k, \sigma_k^2)}{\sum_j p(z^j = 1)\, p_j(y \mid x, \theta_j, \sigma_j^2)}$$

44© Eric Xing @ CMU, 2005-2015

Page 41:

A Report Card for EM

• Some good things about EM:
  • no learning-rate (step-size) parameter
  • automatically enforces parameter constraints
  • very fast for low dimensions
  • each iteration is guaranteed to improve the likelihood
• Some bad things about EM:
  • can get stuck in local optima
  • can be slower than conjugate gradient (especially near convergence)
  • requires an expensive inference step
  • is a maximum likelihood/MAP method

47© Eric Xing @ CMU, 2005-2015