Learning Partially Observed GM: The Expectation-Maximization Algorithm
Kayhan Batmanghelich
1
Recall: Learning Graphical Models
• Scenarios:
  • completely observed GMs
    • directed
    • undirected
  • partially or unobserved GMs
    • directed
    • undirected (an open research topic)
• Estimation principles:
  • Maximum likelihood estimation (MLE)
  • Bayesian estimation
  • Maximal conditional likelihood
  • Maximal "margin"
  • Maximum entropy
• We use learning as a name for the process of estimating the parameters, and in some cases the topology of the network, from data.
© Eric Xing @ CMU, 2005-2015 2
Recall: Approaches to inference
• Exact inference algorithms
  • The elimination algorithm
  • Message-passing algorithm (sum-product, belief propagation)
  • The junction tree algorithms
• Approximate inference techniques
  • Stochastic simulation / sampling methods
  • Markov chain Monte Carlo methods
  • Variational algorithms
© Eric Xing @ CMU, 2005-2015 3
Partially observed GMs
• Speech recognition
[Figure: hidden Markov model for speech recognition, a state chain X1, X2, X3, …, XT with transition matrix A, emitting observations Y1, Y2, Y3, …, YT]
4© Eric Xing @ CMU, 2005-2015
Partially observed GM
• Biological evolution
[Figure: evolutionary tree, an unobserved ancestor ("?") evolves over T years along branches with substitution processes Qh and Qm into observed descendant nucleotides (A, C, and sequences such as AGAGAC)]
5© Eric Xing @ CMU, 2005-2015
Mixture Models
6© Eric Xing @ CMU, 2005-2015
Mixture Models, cont'd
• A density model p(x) may be multi-modal.
• We may be able to model it as a mixture of uni-modal distributions (e.g., Gaussians).
• Each mode may correspond to a different sub-population (e.g., male and female).
7© Eric Xing @ CMU, 2005-2015
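To make the multi-modality point concrete, here is a minimal sketch (not part of the original slides) that samples from a two-component 1-D Gaussian mixture by ancestral sampling; the proportions, means, and variances are made-up illustration values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D mixture of two sub-populations (e.g., two height distributions).
pis    = np.array([0.4, 0.6])      # mixture proportions, sum to 1
mus    = np.array([165.0, 178.0])  # component means
sigmas = np.array([6.0, 7.0])      # component standard deviations

# Ancestral sampling: draw the latent component z first, then x | z.
N = 10_000
z = rng.choice(len(pis), size=N, p=pis)
x = rng.normal(mus[z], sigmas[z])
# A histogram of x shows two modes, although each component is uni-modal.
```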
Unobserved Variables
• A variable can be unobserved (latent) because:
  • it is an imaginary quantity meant to provide a simplified and abstract view of the data-generation process
    • e.g., speech recognition models, mixture models …
  • it is a real-world object and/or phenomenon that is difficult or impossible to measure
    • e.g., the temperature of a star, causes of a disease, evolutionary ancestors …
  • it is a noisy measurement of a real-world object (i.e., the true value is unobserved).
• Example: Discrete latent variables can be used to partition/cluster data into sub-groups.
• Example: Continuous latent variables (factors) can be used for dimensionality reduction (factor analysis, etc.).
8© Eric Xing @ CMU, 2005-2015
Gaussian Mixture Models (GMMs)
• Consider a mixture of K Gaussian components:

  p(x_n \mid \pi, \mu, \Sigma) = \sum_k \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)

  (mixture proportion π_k; mixture component N(x_n | µ_k, Σ_k))
• This model can be used for unsupervised clustering.
• This model (fit by AutoClass) has been used to discover new kinds of stars in astronomical data, etc.
9© Eric Xing @ CMU, 2005-2015
Gaussian Mixture Models (GMMs)
• Consider a mixture of K Gaussian components:
• Z is a latent class indicator vector:

  p(z_n) = \mathrm{Cat}(z_n; \pi) = \prod_k (\pi_k)^{I(z_n = k)}

• X is a conditional Gaussian variable with a class-specific mean/covariance:

  p(x_n \mid z_n = k; \{\mu_k, \Sigma_k\}_{k=1}^K) = \frac{1}{(2\pi)^{m/2} |\Sigma_k|^{1/2}} \exp\left( -\tfrac{1}{2}(x_n - \mu_k)^T \Sigma_k^{-1} (x_n - \mu_k) \right)

• The likelihood of a sample:

  p(x_n \mid \{\mu_k, \Sigma_k\}_{k=1}^K) = \sum_k p(z_n = k; \pi) \, p(x_n \mid z_n = k; \{\mu_k, \Sigma_k\}_{k=1}^K)
                                          = \sum_{z_n} \prod_k \big( \pi_k \, \mathcal{N}(x_n; \mu_k, \Sigma_k) \big)^{I(z_n = k)}
                                          = \sum_k \pi_k \, \mathcal{N}(x_n; \mu_k, \Sigma_k)

  (mixture proportion π_k; mixture component N(x_n; µ_k, Σ_k))
10
[Figure: graphical model, latent indicator z_n with an arrow to observation x_n]
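As a sketch of how the marginal likelihood above can be evaluated numerically (helper names and toy parameter values are mine, not from the slides):

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Multivariate normal density N(x; mu, Sigma) at a single point x."""
    m = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)
    norm = (2 * np.pi) ** (m / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

def gmm_likelihood(x, pis, mus, Sigmas):
    """p(x | pi, mu, Sigma) = sum_k pi_k N(x; mu_k, Sigma_k)."""
    return sum(pi_k * gaussian_pdf(x, mu_k, Sig_k)
               for pi_k, mu_k, Sig_k in zip(pis, mus, Sigmas))

# Toy 2-D example with K = 2 components (parameter values are illustrative only).
pis    = [0.3, 0.7]
mus    = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), 2.0 * np.eye(2)]
print(gmm_likelihood(np.array([1.0, 1.0]), pis, mus, Sigmas))
```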
Why is Learning Harder?
• In fully observed i.i.d. settings, the log-likelihood decomposes into a sum of local terms (at least for directed models):

  \ell_c(\theta; D) = \log p(x, z \mid \theta) = \log p(z \mid \theta_z) + \log p(x \mid z, \theta_x)

• With latent variables, all the parameters become coupled together via marginalization:

  \ell(\theta; D) = \log \sum_z p(x, z \mid \theta) = \log \sum_z p(z \mid \theta_z) \, p(x \mid z, \theta_x)
11© Eric Xing @ CMU, 2005-2015
Toward the EM algorithm
• Recall MLE for completely observed data
• Data log-likelihood:

  \ell(\theta; D) = \log \prod_n p(z_n, x_n)
                  = \log \prod_n p(z_n \mid \pi) \, p(x_n \mid z_n, \mu, \sigma)
                  = \sum_n \log \prod_k \pi_k^{z_n^k} + \sum_n \log \prod_k \mathcal{N}(x_n; \mu_k, \sigma)^{z_n^k}
                  = \sum_n \sum_k z_n^k \log \pi_k \;-\; \sum_n \sum_k z_n^k \frac{(x_n - \mu_k)^2}{2\sigma^2} + C

• Separate MLE:

  \hat{\pi}_{k,\mathrm{MLE}} = \arg\max_{\pi} \ell(\theta; D)
  \hat{\mu}_{k,\mathrm{MLE}} = \arg\max_{\mu} \ell(\theta; D) \;\Rightarrow\; \hat{\mu}_{k,\mathrm{MLE}} = \frac{\sum_n z_n^k x_n}{\sum_n z_n^k}

• What if we do not know z_n?
12© Eric Xing @ CMU, 2005-2015
[Figure: plate model, latent indicator z_n with an arrow to observation x_n, plate over N samples]
Let's pretend z_n is observed.
Let's assume the variance σ is known.
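A minimal sketch of the completely observed case derived above: with one-hot labels z_n in hand (and the variance treated as known), the MLEs for π_k and µ_k are simple counts and averages. Function and variable names are illustrative, not from the slides.

```python
import numpy as np

def complete_data_mle(X, Z):
    """MLE of mixture proportions and means when the indicators Z are observed.

    X : (N, d) data, Z : (N, K) one-hot indicators z_n^k.
    pi_k = (1/N) sum_n z_n^k,   mu_k = sum_n z_n^k x_n / sum_n z_n^k
    """
    counts = Z.sum(axis=0)            # n_k, the number of points in cluster k
    pis = counts / len(X)
    mus = (Z.T @ X) / counts[:, None]
    return pis, mus
```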
Question
• "… We solve problem X using Expectation-Maximization …"
• What does it mean?
• E:
  • What do we take the expectation with?
  • What do we take the expectation over?
• M:
  • What do we maximize?
  • What do we maximize with respect to?
13© Eric Xing @ CMU, 2005-2015
Recall: K-means
• Hard assignment:

  z_n^{(t)} = \arg\min_k \, (x_n - \mu_k^{(t)})^T \Sigma_k^{-1} (x_n - \mu_k^{(t)})

• Centroid update:

  \mu_k^{(t+1)} = \frac{\sum_n \delta(z_n^{(t)}, k) \, x_n}{\sum_n \delta(z_n^{(t)}, k)}
14© Eric Xing @ CMU, 2005-2015
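A sketch of the two K-means updates above, in the common Euclidean special case (Σ_k = I); names are illustrative.

```python
import numpy as np

def kmeans_step(X, mus):
    """One K-means iteration: hard assignment, then centroid update (Sigma = I case)."""
    # Hard "E-step": assign each point to its nearest centroid.
    dists = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)   # (N, K) squared distances
    z = dists.argmin(axis=1)
    # "M-step": each centroid becomes the mean of the points assigned to it.
    new_mus = np.array([X[z == k].mean(axis=0) if np.any(z == k) else mus[k]
                        for k in range(len(mus))])
    return z, new_mus
```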
Expectation-Maximization
• Start: • "Guess" the centroid µk and coveriance Sk of each of the K clusters
• Loop
15© Eric Xing @ CMU, 2005-2015
Example: Gaussian mixture model
• A mixture of K Gaussians:
• Z is a latent class indicator vector:

  p(z_n) = \mathrm{Cat}(z_n; \pi) = \prod_k (\pi_k)^{z_n^k}

• X is a conditional Gaussian variable with class-specific mean/covariance:

  p(x_n \mid z_n^k = 1, \mu, \Sigma) = \frac{1}{(2\pi)^{m/2} |\Sigma_k|^{1/2}} \exp\left\{ -\tfrac{1}{2}(x_n - \mu_k)^T \Sigma_k^{-1} (x_n - \mu_k) \right\}

• The likelihood of a sample:

  p(x_n \mid \pi, \mu, \Sigma) = \sum_k p(z_n^k = 1 \mid \pi) \, p(x_n \mid z_n^k = 1, \mu, \Sigma)
                               = \sum_{z_n} \prod_k \big( \pi_k \, \mathcal{N}(x_n; \mu_k, \Sigma_k) \big)^{z_n^k}
                               = \sum_k \pi_k \, \mathcal{N}(x_n; \mu_k, \Sigma_k)

• The expected complete log-likelihood:

  \langle \ell_c(\theta; x, z) \rangle = \sum_n \big\langle \log p(z_n \mid \pi) \big\rangle_{p(z \mid x)} + \sum_n \big\langle \log p(x_n \mid z_n, \mu, \Sigma) \big\rangle_{p(z \mid x)}
                                       = \sum_n \sum_k \langle z_n^k \rangle \log \pi_k \;-\; \frac{1}{2} \sum_n \sum_k \langle z_n^k \rangle \Big( (x_n - \mu_k)^T \Sigma_k^{-1} (x_n - \mu_k) + \log |\Sigma_k| + C \Big)
16© Eric Xing @ CMU, 2005-2015
[Figure: plate model, latent indicator z_n with an arrow to observation x_n, plate over N samples]
E-step
• We maximize ⟨ℓ_c(θ)⟩ iteratively using the following procedure:
  ─ Expectation step: compute the expected value of the sufficient statistics of the hidden variables (i.e., z) given the current estimate of the parameters (i.e., π and µ):

  \tau_n^{k(t)} = \langle z_n^k \rangle_{q^{(t)}} = p(z_n^k = 1 \mid x_n, \mu^{(t)}, \Sigma^{(t)})
               = \frac{\pi_k^{(t)} \, \mathcal{N}(x_n \mid \mu_k^{(t)}, \Sigma_k^{(t)})}{\sum_i \pi_i^{(t)} \, \mathcal{N}(x_n \mid \mu_i^{(t)}, \Sigma_i^{(t)})}

  (like a soft count)
• Here we are essentially doing inference.
18© Eric Xing @ CMU, 2005-2015
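A sketch of this E-step for a GMM: the soft counts τ are just normalized, per-component weighted densities. It leans on scipy's multivariate normal density; names are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, pis, mus, Sigmas):
    """Soft counts tau[n, k] = pi_k N(x_n; mu_k, Sigma_k) / sum_i pi_i N(x_n; mu_i, Sigma_i)."""
    N, K = len(X), len(pis)
    tau = np.zeros((N, K))
    for k in range(K):
        tau[:, k] = pis[k] * multivariate_normal(mus[k], Sigmas[k]).pdf(X)
    tau /= tau.sum(axis=1, keepdims=True)   # normalize over components for each data point
    return tau
```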
M-step
• We maximize ⟨ℓ_c(θ)⟩ iteratively using the following procedure:
  ─ Maximization step: compute the parameters under the current estimates of the expected values of the hidden variables:

  \pi_k^* = \arg\max_{\pi} \langle \ell_c(\theta) \rangle, \quad \frac{\partial}{\partial \pi_k}\langle \ell_c(\theta) \rangle = 0 \;\forall k, \;\text{s.t. } \sum_k \pi_k = 1
    \;\Rightarrow\; \pi_k^{(t+1)} = \frac{\sum_n \tau_n^{k(t)}}{N} = \frac{\langle n_k \rangle}{N}

  \mu_k^* = \arg\max_{\mu} \langle \ell_c(\theta) \rangle
    \;\Rightarrow\; \mu_k^{(t+1)} = \frac{\sum_n \tau_n^{k(t)} x_n}{\sum_n \tau_n^{k(t)}}

  \Sigma_k^* = \arg\max_{\Sigma} \langle \ell_c(\theta) \rangle
    \;\Rightarrow\; \Sigma_k^{(t+1)} = \frac{\sum_n \tau_n^{k(t)} (x_n - \mu_k^{(t+1)})(x_n - \mu_k^{(t+1)})^T}{\sum_n \tau_n^{k(t)}}

  Fact:  \frac{\partial \log |A^{-1}|}{\partial A^{-1}} = A^T, \qquad \frac{\partial \, x^T A^{-1} x}{\partial A^{-1}} = x x^T

• This is isomorphic to MLE except that the variables that are hidden are replaced by their expectations (in general they will be replaced by their corresponding "sufficient statistics").
19© Eric Xing @ CMU, 2005-2015
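A matching sketch of the M-step updates above: the hidden indicators are replaced by the soft counts τ from the E-step (names are illustrative).

```python
import numpy as np

def m_step(X, tau):
    """MLE-style updates with hidden indicators replaced by soft counts tau[n, k]."""
    N, d = X.shape
    Nk = tau.sum(axis=0)                      # expected cluster sizes <n_k>
    pis = Nk / N                              # pi_k^(t+1)
    mus = (tau.T @ X) / Nk[:, None]           # mu_k^(t+1)
    Sigmas = []
    for k in range(tau.shape[1]):
        diff = X - mus[k]
        Sigmas.append((tau[:, k, None] * diff).T @ diff / Nk[k])   # Sigma_k^(t+1)
    return pis, mus, np.array(Sigmas)
```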
Compare: K-means and EM
• K-means
  • In the K-means "E-step" we do hard assignment:

    z_n^{(t)} = \arg\min_k \, (x_n - \mu_k^{(t)})^T \Sigma_k^{-1} (x_n - \mu_k^{(t)})

  • In the K-means "M-step" we update the means as the weighted sum of the data, but now the weights are 0 or 1:

    \mu_k^{(t+1)} = \frac{\sum_n \delta(z_n^{(t)}, k) \, x_n}{\sum_n \delta(z_n^{(t)}, k)}

• EM
  • E-step:

    \tau_n^{k(t)} = \langle z_n^k \rangle_{q^{(t)}} = p(z_n^k = 1 \mid x_n, \mu^{(t)}, \Sigma^{(t)})
                 = \frac{\pi_k^{(t)} \, \mathcal{N}(x_n \mid \mu_k^{(t)}, \Sigma_k^{(t)})}{\sum_i \pi_i^{(t)} \, \mathcal{N}(x_n \mid \mu_i^{(t)}, \Sigma_i^{(t)})}

  • M-step:

    \mu_k^{(t+1)} = \frac{\sum_n \tau_n^{k(t)} x_n}{\sum_n \tau_n^{k(t)}}

The EM algorithm for mixtures of Gaussians is like a "soft version" of the K-means algorithm.
20© Eric Xing @ CMU, 2005-2015
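To make the "soft K-means" analogy concrete, here is a sketch of the full EM loop, assuming e_step and m_step helpers like the ones sketched earlier (a fixed iteration count stands in for a proper convergence check):

```python
def em_gmm(X, pis, mus, Sigmas, n_iters=100):
    """Alternate soft assignment (E) and weighted re-estimation (M) for a GMM."""
    for _ in range(n_iters):
        tau = e_step(X, pis, mus, Sigmas)     # soft assignments (K-means: hard argmin)
        pis, mus, Sigmas = m_step(X, tau)     # tau-weighted estimates (K-means: plain means)
    return pis, mus, Sigmas, tau
```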
Theory underlying EM
• What are we doing?
• Recall that according to MLE, we intend to learn the model parameters that maximize the likelihood of the data.
• But we do not observe z, so computing

  \ell(\theta; D) = \log \sum_z p(x, z \mid \theta) = \log \sum_z p(z \mid \theta_z) \, p(x \mid z, \theta_x)

  is difficult!
• What shall we do?
21© Eric Xing @ CMU, 2005-2015
Complete & Incomplete Log Likelihoods• Complete log likelihood
Let X denote the observable variable(s), and Z denote the latent variable(s). If Z could be observed, then
• Recalled that in this case the objective for, e.g., MLE, decomposes into a sum of factors, the parameter for each factor can be estimated separately (c.f. MLE for fully observed models).
• But given that Z is not observed, lc() is a random quantity, cannot be maximized directly.
• Incomplete log likelihoodWith z unobserved, our objective becomes the log of a marginal probability:• This objective won't decouple
)|,(log),;(def
qq zxpzxc =l
å==z
c zxpxpx )|,(log)|(log);( qqql
22© Eric Xing @ CMU, 2005-2015
Expected Complete Log Likelihood
• For any distribution q(z), define the expected complete log-likelihood:

  \langle \ell_c(\theta; x, z) \rangle_q \overset{\text{def}}{=} \sum_z q(z \mid x) \log p(x, z \mid \theta)

  • A deterministic function of θ
  • Linear in ℓ_c(·), so it inherits its factorability
  • Does maximizing this surrogate yield a maximizer of the likelihood?
• Jensen's inequality:

  \ell(\theta; x) = \log p(x \mid \theta)
                  = \log \sum_z p(x, z \mid \theta)
                  = \log \sum_z q(z \mid x) \frac{p(x, z \mid \theta)}{q(z \mid x)}
                  \ge \sum_z q(z \mid x) \log \frac{p(x, z \mid \theta)}{q(z \mid x)}

  \Rightarrow \ell(\theta; x) \ge \langle \ell_c(\theta; x, z) \rangle_q + H_q
23© Eric Xing @ CMU, 2005-2015
Lower Bounds and Free Energy
• For fixed data x, define a functional called the free energy:

  F(q, \theta) \overset{\text{def}}{=} \sum_z q(z \mid x) \log \frac{p(x, z \mid \theta)}{q(z \mid x)} \;\le\; \ell(\theta; x)

• The EM algorithm is coordinate ascent on F:
  • E-step:  q^{t+1} = \arg\max_q F(q, \theta^t)
  • M-step:  \theta^{t+1} = \arg\max_\theta F(q^{t+1}, \theta)
24© Eric Xing @ CMU, 2005-2015
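For the GMM, the free energy can be evaluated directly when q is a responsibility matrix over data points, which makes a handy monotonicity check while debugging EM. A sketch under that assumption (names illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def free_energy(X, tau, pis, mus, Sigmas, eps=1e-12):
    """F(q, theta) = sum_n sum_k tau_nk [log pi_k + log N(x_n; mu_k, Sigma_k) - log tau_nk]."""
    F = 0.0
    for k in range(len(pis)):
        log_joint = np.log(pis[k] + eps) + multivariate_normal(mus[k], Sigmas[k]).logpdf(X)
        F += np.sum(tau[:, k] * (log_joint - np.log(tau[:, k] + eps)))
    return F
# F should never decrease across E- and M-steps, and equals the log-likelihood right after an E-step.
```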
E-step: maximization of expected ℓ_c w.r.t. q
• Claim:

  q^{t+1} = \arg\max_q F(q, \theta^t) = p(z \mid x, \theta^t)

• The best solution is the posterior over the latent variables given the data and the parameters. Often we need this at test time anyway (e.g., to perform classification).
• Proof (easy): this setting attains the bound ℓ(θ; x) ≥ F(q, θ):

  F(p(z \mid x, \theta^t), \theta^t) = \sum_z p(z \mid x, \theta^t) \log \frac{p(x, z \mid \theta^t)}{p(z \mid x, \theta^t)}
                                     = \sum_z p(z \mid x, \theta^t) \log p(x \mid \theta^t)
                                     = \log p(x \mid \theta^t) = \ell(\theta^t; x)

• Can also show this result using variational calculus or the fact that

  \ell(\theta; x) - F(q, \theta) = \mathrm{KL}\big( q \,\|\, p(z \mid x, \theta) \big)

25© Eric Xing @ CMU, 2005-2015
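A numerical illustration of the identity ℓ(θ; x) − F(q, θ) = KL(q ‖ p(z|x, θ)) in the GMM case, assuming the free_energy helper sketched above; setting q to the posterior responsibilities should make the gap vanish up to round-off. Names are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood(X, pis, mus, Sigmas):
    """log p(x | theta) = sum_n log sum_k pi_k N(x_n; mu_k, Sigma_k)."""
    dens = np.column_stack([pis[k] * multivariate_normal(mus[k], Sigmas[k]).pdf(X)
                            for k in range(len(pis))])
    return np.log(dens.sum(axis=1)).sum()

def kl_gap(X, q, pis, mus, Sigmas):
    """KL(q || p(z|x, theta)) summed over data points, computed as l(theta; x) - F(q, theta)."""
    return log_likelihood(X, pis, mus, Sigmas) - free_energy(X, q, pis, mus, Sigmas)
```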
E-step ≡ plug in posterior expectation of latent variables
• Without loss of generality, assume that p(x, z | θ) is a generalized exponential family distribution:

  p(x, z \mid \theta) = \frac{1}{Z(\theta)} h(x, z) \exp\left\{ \sum_i \theta_i f_i(x, z) \right\}

• Special case: if p(X | Z) are GLIMs, then f_i(x, z) = \eta_i(z)\,\xi_i(x)
• The expected complete log-likelihood under q^{t+1} = p(z | x, θ^t) is

  \langle \ell_c(\theta; x, z) \rangle_{q^{t+1}} = \sum_z q(z \mid x, \theta^t) \log p(x, z \mid \theta) - A(\theta)
                                                 = \sum_i \theta_i \langle f_i(x, z) \rangle_{q(z \mid x, \theta^t)} - A(\theta)

  and, in the GLIM case,

  \langle \ell_c(\theta; x, z) \rangle_{q^{t+1}} \overset{\text{GLIM}}{=} \sum_i \theta_i \langle \eta_i(z) \rangle_{q(z \mid x, \theta^t)} \, \xi_i(x) - A(\theta)
26© Eric Xing @ CMU, 2005-2015
M-step: maximization of expected ℓ_c w.r.t. θ
• Note that the free energy breaks into two terms:

  F(q, \theta) = \sum_z q(z \mid x) \log \frac{p(x, z \mid \theta)}{q(z \mid x)}
               = \sum_z q(z \mid x) \log p(x, z \mid \theta) - \sum_z q(z \mid x) \log q(z \mid x)
               = \langle \ell_c(\theta; x, z) \rangle_q + H_q

• Thus, in the M-step, maximizing with respect to θ for fixed q, we only need to consider the first term:

  \theta^{t+1} = \arg\max_\theta \langle \ell_c(\theta; x, z) \rangle_{q^{t+1}} = \arg\max_\theta \sum_z q^{t+1}(z \mid x) \log p(x, z \mid \theta)

• Under the optimal q^{t+1}, this is equivalent to solving a standard MLE of the fully observed model p(x, z | θ), with the sufficient statistics involving z replaced by their expectations w.r.t. p(z | x, θ).
27© Eric Xing @ CMU, 2005-2015
Example: HMM
• Supervised learning: estimation when the “right answer” is known
• Examples:
GIVEN: a genomic region x = x1…x1,000,000 where we have good (experimental)
annotations of the CpG islands
GIVEN: the casino player allows us to observe him one evening, as he changes dice and produces 10,000 rolls
• Unsupervised learning: estimation when the "right answer" is unknown
• Examples:
GIVEN: the porcupine genome; we don't know how frequent the CpG islands are there, nor do we know their composition
GIVEN: 10,000 rolls of the casino player, but we don't see when he changes dice
• QUESTION: Update the parameters θ of the model to maximize P(x|θ) --- maximum likelihood (ML) estimation
28© Eric Xing @ CMU, 2005-2015
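In the supervised case the MLE really is just counting: with the hidden path observed, the initial-state, transition, and emission probabilities are normalized counts. A sketch for a discrete HMM (the small pseudo-count is my own smoothing choice; names are illustrative):

```python
import numpy as np

def supervised_hmm_mle(state_seqs, obs_seqs, n_states, n_obs, pseudo=1e-3):
    """MLE of HMM parameters when the hidden state paths are observed: count and normalize."""
    pi = np.full(n_states, pseudo)               # initial-state counts
    A  = np.full((n_states, n_states), pseudo)   # transition counts
    B  = np.full((n_states, n_obs), pseudo)      # emission counts
    for states, obs in zip(state_seqs, obs_seqs):
        pi[states[0]] += 1
        for t in range(1, len(states)):
            A[states[t - 1], states[t]] += 1
        for s, o in zip(states, obs):
            B[s, o] += 1
    return pi / pi.sum(), A / A.sum(axis=1, keepdims=True), B / B.sum(axis=1, keepdims=True)
```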
Hidden Markov Model: from static to dynamic mixture models
• Dynamic mixture
[Figure: dynamic mixture (HMM), a chain X1, X2, X3, …, XT with transition matrix A, emitting Y1, Y2, Y3, …, YT]
• Static mixture
[Figure: static mixture, a single latent X1 emitting Y1, plate over N samples]
The underlying source: phonemes, dice, …
The sequence: speech signal, sequence of rolls, …
29© Eric Xing @ CMU, 2005-2015
The Baum-Welch algorithm
30
[Figure: HMM, hidden state chain y1, …, yT with transitions, emitting observations x1, …, xT]
• The complete log-likelihood:

  \ell_c(\mathbf{x}, \mathbf{y}; \theta) = \log p(\mathbf{x}, \mathbf{y}; \theta)
    = \log \prod_n \left( p(y_1^n) \prod_{t=2}^{T} p(y_t^n \mid y_{t-1}^n) \prod_{t=1}^{T} p(x_t^n \mid y_t^n) \right)

• θ, the initial-state, transition, and emission probabilities, are the parameters of the model.
The Baum-Welch algorithm
31
• The complete log-likelihood ℓ_c(x, y; θ) is as above; we iterate on its expectation:
• Fix θ and compute the marginal posteriors:
  • p(y_t^n = i | x^n; θ)
  • p(y_t^n = i, y_{t-1}^n = j | x^n; θ)
• Update θ by MLE (closed form); remember the soft counts
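A sketch of one Baum-Welch iteration for a discrete-observation HMM on a single sequence: a scaled forward-backward pass produces exactly the two marginal posteriors listed above, and the M-step is closed-form MLE on those soft counts. Variable names and the scaling scheme are my own choices, not from the slides.

```python
import numpy as np

def baum_welch_iteration(obs, pi, A, B):
    """One EM (Baum-Welch) iteration for a discrete HMM, single observation sequence.

    obs : (T,) observation indices; pi : (K,) initial dist.; A : (K, K) transitions;
    B : (K, V) emission probabilities.
    """
    obs = np.asarray(obs)
    T, K = len(obs), len(pi)

    # E-step: scaled forward-backward recursions.
    alpha = np.zeros((T, K)); scale = np.zeros(T)
    alpha[0] = pi * B[:, obs[0]]
    scale[0] = alpha[0].sum(); alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        scale[t] = alpha[t].sum(); alpha[t] /= scale[t]
    beta = np.zeros((T, K)); beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / scale[t + 1]

    # Marginal posteriors: gamma[t, i] = p(y_t = i | x), xi[t, i, j] = p(y_t = i, y_{t+1} = j | x).
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    xi = alpha[:-1, :, None] * A[None, :, :] * (B[:, obs[1:]].T * beta[1:])[:, None, :]
    xi /= xi.sum(axis=(1, 2), keepdims=True)

    # M-step: closed-form MLE with the soft counts.
    pi_new = gamma[0]
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_new = np.vstack([gamma[obs == v].sum(axis=0) for v in range(B.shape[1])]).T
    B_new /= gamma.sum(axis=0)[:, None]
    log_lik = np.log(scale).sum()   # log p(x | theta) from the forward pass
    return pi_new, A_new, B_new, log_lik
```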
Extension to general BN
EM for general BNs: x represents both hidden and observed variables.
A bit of notation and the general derivation: see section 11.2.4 of David Barber's book.
Summary: EM Algorithm
• A way of maximizing the likelihood function for latent variable models. Finds the MLE of the parameters
when the original (hard) problem can be broken up into two (easy) pieces:
1. Estimate some “missing” or “unobserved” data from observed data and current parameters.
2. Using this “complete” data, find the maximum likelihood parameter estimates.
• Alternate between filling in the latent variables using the best guess (posterior) and updating
the parameters based on this guess:
• E-step:  q^{t+1} = \arg\max_q F(q, \theta^t)
• M-step:  \theta^{t+1} = \arg\max_\theta F(q^{t+1}, \theta)
• In the M-step we optimize a lower bound on the likelihood. In the E-step we close the gap, making the bound tight (bound = likelihood).
39© Eric Xing @ CMU, 2005-2015
More Examples
Conditional mixture model: Mixture of experts
• We will model p(Y | X) using different experts, each responsible for a different region of the input space.
• The latent variable Z chooses an expert using a softmax gating function:

  P(z^k = 1 \mid x) = \mathrm{softmax}(\xi_k^T x)

• Each expert can be a linear regression model:

  P(y \mid x, z^k = 1) = \mathcal{N}(y; \theta_k^T x, \sigma_k^2)

• The posterior expert responsibilities are:

  P(z^k = 1 \mid x, y, \theta) = \frac{p(z^k = 1 \mid x)\, p_k(y \mid x, \theta_k, \sigma_k^2)}{\sum_j p(z^j = 1 \mid x)\, p_j(y \mid x, \theta_j, \sigma_j^2)}
41© Eric Xing @ CMU, 2005-2015
EM for conditional mixture model
• Model:

  P(y \mid x) = \sum_k p(z^k = 1 \mid x, \xi) \, p(y \mid z^k = 1, x, \theta_k, \sigma)

• The objective function (expected complete log-likelihood):

  \langle \ell_c(\theta; x, y, z) \rangle = \sum_n \big\langle \log p(z_n \mid x_n, \xi) \big\rangle_{p(z \mid x, y)} + \sum_n \big\langle \log p(y_n \mid x_n, z_n, \theta, \sigma) \big\rangle_{p(z \mid x, y)}
    = \sum_n \sum_k \langle z_n^k \rangle \log \mathrm{softmax}(\xi_k^T x_n) \;-\; \frac{1}{2\sigma^2} \sum_n \sum_k \langle z_n^k \rangle \Big( (y_n - \theta_k^T x_n)^2 + \log \sigma_k^2 + C \Big)

• EM:
  • E-step:

    \tau_n^{k(t)} = P(z_n^k = 1 \mid x_n, y_n, \theta^{(t)})
                 = \frac{p(z^k = 1 \mid x_n)\, p_k(y_n \mid x_n, \theta_k, \sigma_k^2)}{\sum_j p(z^j = 1 \mid x_n)\, p_j(y_n \mid x_n, \theta_j, \sigma_j^2)}

  • M-step:
    • Update θ_k using the normal equation for standard linear regression, θ = (X^T X)^{-1} X^T Y, but with the data re-weighted by τ_n^{k(t)} (homework).
    • Use IRLS and/or weighted IRLS to update {ξ_k, θ_k, σ_k} based on the data pairs (x_n, y_n), with weights τ_n^{k(t)} (homework?).
42© Eric Xing @ CMU, 2005-2015
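A sketch of one such EM iteration for the mixture of linear-regression experts: the E-step computes the responsibilities above, and the M-step re-fits each expert by τ-weighted least squares (the normal equations with re-weighted data). The gating update (IRLS on ξ) is omitted here, and all names are illustrative.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def moe_em_iteration(X, y, Xi, Theta, sigma2):
    """One EM iteration for a mixture of linear-regression experts.

    X : (N, d), y : (N,), Xi : (d, K) gating params, Theta : (d, K) expert params,
    sigma2 : (K,) expert noise variances. Only the experts are re-fit here.
    """
    K = Theta.shape[1]
    # E-step: responsibilities tau_nk proportional to p(z=k|x_n) N(y_n; theta_k^T x_n, sigma_k^2).
    gate = softmax(X @ Xi)                                    # (N, K) gating probabilities
    resid = y[:, None] - X @ Theta                            # (N, K) residuals per expert
    lik = np.exp(-0.5 * resid**2 / sigma2) / np.sqrt(2 * np.pi * sigma2)
    tau = gate * lik
    tau /= tau.sum(axis=1, keepdims=True)
    # M-step: tau-weighted least squares for each expert.
    for k in range(K):
        W = tau[:, k]
        Theta[:, k] = np.linalg.solve((X * W[:, None]).T @ X, X.T @ (W * y))
        sigma2[k] = np.sum(W * (y - X @ Theta[:, k])**2) / W.sum()
    return tau, Theta, sigma2
```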
Hierarchical mixture of experts
• This is like a soft version of a depth-2 classification/regression tree.
• P(Y | X, G1, G2) can be modeled as a GLIM, with parameters dependent on the values of G1 and G2 (which specify a "conditional path" to a given leaf in the tree).
43© Eric Xing @ CMU, 2005-2015
Mixture of overlapping experts
• By removing the X → Z arc, we can make the partitions independent of the input, thus allowing overlap.
• This is a mixture of linear regressors; each subpopulation has a different conditional mean:

  P(z^k = 1 \mid x, y, \theta) = \frac{p(z^k = 1)\, p_k(y \mid x, \theta_k, \sigma_k^2)}{\sum_j p(z^j = 1)\, p_j(y \mid x, \theta_j, \sigma_j^2)}
44© Eric Xing @ CMU, 2005-2015
A Report Card for EM
• Some good things about EM:
  • no learning rate (step-size) parameter
  • automatically enforces parameter constraints
  • very fast for low dimensions
  • each iteration guaranteed to improve the likelihood
• Some bad things about EM:
  • can get stuck in local optima of the likelihood
  • can be slower than conjugate gradient (especially near convergence)
  • requires an expensive inference step
  • is a maximum likelihood/MAP method
47© Eric Xing @ CMU, 2005-2015