Page 1: Genome Evolution. Amos Tanay 2010

Genome evolution, Lecture 9: Variational inference and Belief propagation

Page 2: Expectation-Maximization

Since $P(h,s|\theta) = P(h|s,\theta)\,P(s|\theta)$, the log-likelihood can be written as:

\[ \log P(s|\theta) = \log P(h,s|\theta) - \log P(h|s,\theta) \]

Averaging both sides over the posterior under the current parameters, $P(h|s,\theta^k)$:

\[ \log P(s|\theta) = \sum_h P(h|s,\theta^k)\log P(h,s|\theta) - \sum_h P(h|s,\theta^k)\log P(h|s,\theta) \]

Define:

\[ Q(\theta|\theta^k) = \sum_h P(h|s,\theta^k)\log P(h,s|\theta) \]

Then the change in log-likelihood between EM iterations is:

\[ \log P(s|\theta^{k+1}) - \log P(s|\theta^k) = Q(\theta^{k+1}|\theta^k) - Q(\theta^k|\theta^k) + \sum_h P(h|s,\theta^k)\log\frac{P(h|s,\theta^k)}{P(h|s,\theta^{k+1})} \]

The last term is a KL divergence and hence nonnegative, so maximizing $Q$ over $\theta$ never decreases the likelihood.
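As a minimal numerical sketch of the E- and M-steps (the two-coin mixture model, the counts, and the uniform prior over the hidden coin are illustrative choices, not the lecture's model):

```python
import numpy as np

# EM for a toy 2-coin mixture: hidden h picks one of two coins (uniform prior),
# observed s is the number of heads in n flips of that coin.
rng = np.random.default_rng(0)
n = 20
true_p = np.array([0.2, 0.8])
coins = rng.integers(0, 2, size=200)
s = rng.binomial(n, true_p[coins])

p = np.array([0.4, 0.6])                      # initial guess for theta
for _ in range(50):
    # E-step: posterior P(h|s, theta^k) from binomial log-likelihoods
    ll = s[:, None] * np.log(p) + (n - s[:, None]) * np.log(1 - p)
    post = np.exp(ll - ll.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)
    # M-step: argmax_theta Q(theta|theta^k) has a closed form here:
    # posterior-weighted heads divided by posterior-weighted flips
    p = (post * s[:, None]).sum(axis=0) / (post.sum(axis=0) * n)

print(np.round(np.sort(p), 2))
```

With the two coins well separated, the estimates land close to the true biases 0.2 and 0.8.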

Page 3: Log-likelihood to Free Energy

• We have so far worked on computing the likelihood directly:

\[ \log P(s|\theta) = \log \sum_h P(h,s|\theta) \]

• Computing the likelihood is hard. We can reformulate the problem by adding parameters and transforming it into an optimization problem. Given a trial function q, define the free energy of the model as:

\[ F(q,\theta) = \sum_h q(h)\log\frac{\Pr(h,s|\theta)}{q(h)} \]

• Better: when q is a distribution, the free energy bounds the likelihood:

\[ F(q,\theta) = \log\Pr(s|\theta) - \sum_h q(h)\log\frac{q(h)}{\Pr(h|s,\theta)} = \text{Likelihood} - D(q\,\|\,p(h|s)) \;\le\; \log\Pr(s|\theta) \]

• The free energy is exactly the likelihood when q is the posterior: setting $q(h) = \Pr(h|s,\theta)$ gives

\[ F(q,\theta) = \sum_h \Pr(h|s,\theta)\log\frac{\Pr(h,s|\theta)}{\Pr(h|s,\theta)} = \sum_h \Pr(h|s,\theta)\log\Pr(s|\theta) = \log\Pr(s|\theta) \]
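Both properties are easy to check numerically. A tiny sketch, with a made-up joint over one binary hidden variable:

```python
import numpy as np

# Joint p(h, s) for a binary hidden h at a fixed observed s (hypothetical numbers).
p_hs = np.array([0.1, 0.3])        # p(h=0, s), p(h=1, s)
log_ps = np.log(p_hs.sum())        # log-likelihood log p(s)

def free_energy(q):                # F(q) = sum_h q(h) log(p(h,s)/q(h))
    return np.sum(q * (np.log(p_hs) - np.log(q)))

q_bad = np.array([0.5, 0.5])       # an arbitrary trial distribution
q_post = p_hs / p_hs.sum()         # the posterior p(h|s)
print(free_energy(q_bad) <= log_ps)            # True: F lower-bounds log p(s)
print(np.isclose(free_energy(q_post), log_ps)) # True: tight at the posterior
```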

Page 4: Energy?? What energy?

• In statistical mechanics, a system at temperature T with states x and an energy function E(x) is characterized by Boltzmann's law:

\[ p(x) = \frac{1}{Z(T)}\, e^{-E(x)/T} \]

• Z is the partition function:

\[ Z(T) = \int e^{-E(x)/T}\, dx \]

• Given a model $p(h,s|\theta)$ (a BN), we can define the energy using Boltzmann's law (taking $T=1$):

\[ E(h,s) = -\log p(h,s|\theta) \]

• If we think of $P(h|s,\theta)$: it has exactly the Boltzmann form with

\[ E(h) = -\log p(h,s,\theta), \qquad Z = p(s) \]

Page 5: Free Energy and Variational Free Energy

\[ p(x) = \frac{1}{Z(T)}\, e^{-E(x)/T} \]

• The Helmholtz free energy is defined in physics as:

\[ F_H = -\log Z \]

• This free energy is important in statistical mechanics, but it is difficult to compute, as our probabilistic Z (= p(s)) is exactly the quantity we are trying to approximate.

• The variational transformation introduces trial functions q(h), and sets the variational free energy (or Gibbs free energy) to:

\[ F(q) = U(q) - H(q) \]

• The average energy is:

\[ U(q) = \sum_h q(h)E(h) \]

• The variational entropy is:

\[ H(q) = -\sum_h q(h)\log q(h) \]

• And as before:

\[ F(q) = F_H + D(q\,\|\,p) \]

(note the sign flip relative to the previous convention: this physics-style F upper-bounds $F_H$ and is minimized, rather than maximized, by the posterior).
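The decomposition $F(q) = U(q) - H(q) = F_H + D(q\,\|\,p)$ can be verified directly on a tiny system (the energies and trial distribution below are made up for illustration):

```python
import numpy as np

# Check F(q) = U(q) - H(q) = F_H + D(q || p) on a 3-state system, T = 1.
E = np.array([0.5, 1.0, 2.0])      # energies E(x) (arbitrary)
Z = np.exp(-E).sum()
p = np.exp(-E) / Z                 # Boltzmann distribution
F_H = -np.log(Z)                   # Helmholtz free energy

q = np.array([0.2, 0.5, 0.3])      # an arbitrary trial distribution
U = np.sum(q * E)                  # average energy
H = -np.sum(q * np.log(q))         # variational entropy
D = np.sum(q * np.log(q / p))      # KL divergence D(q || p)
print(np.isclose(U - H, F_H + D))  # True
```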

Page 6: Solving the variational optimization problem

• So instead of computing p(s), we can search for a q that optimizes the free energy:

\[ F(q) = U(q) - H(q), \qquad U(q) = -\sum_h q(h)\log p(h,s|\theta), \qquad H(q) = -\sum_h q(h)\log q(h) \]

• This is still as hard as before, but we can simplify the problem by restricting q (this is where the additional degrees of freedom become important)

• Minimizing U? Maximizing H? Minimizing U focuses q on the maximum-probability configurations; maximizing H spreads out the distribution. Minimizing F = U − H balances the two.

Page 7: Simplest variational approximation: Mean Field

• Let's assume complete independence among the posteriors of the random variables:

\[ q(h) = \prod_i q_i(h_i) \]

• Under this assumption we can try optimizing the $q_i$'s (looking for minimal energy!):

\[ \min_q F_{MF} = \min_{q_i} \Big[ -\sum_h \Big(\prod_i q_i(h_i)\Big)\log p(h,s|\theta) + \sum_i \sum_{h_i} q_i(h_i)\log q_i(h_i) \Big] \]

Page 8: Mean Field Inference

• We optimize iteratively:

• Select i (sequentially, or using any other schedule)

• Optimize $q_i$ to minimize $F_{MF}(q_1,..,q_i,…,q_n)$ while fixing all the other q's:

\[ \min_{q_i} F_{MF} = \min_{q_i} \Big[ -\sum_h \Big(\prod_k q_k(h_k)\Big)\log p(h,s|\theta) + \sum_k \sum_{h_k} q_k(h_k)\log q_k(h_k) \Big] \]

• Terminate when $F_{MF}$ cannot be improved further

• Remember: $F_{MF}$ always bounds the likelihood

• The $q_i$ optimization can usually be done efficiently
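A sketch of these coordinate updates (the 3-spin chain, the coupling J, and the ±1 encoding are illustrative choices, not the lecture's model). For a pairwise "agreement" model, the optimal $q_i$ given the others depends only on the neighbors' current means, so each update is a single tanh:

```python
import numpy as np

# Mean-field coordinate updates on a toy 3-spin chain (illustrative model):
# p(h) ∝ exp(J*(h1*h2 + h2*h3)), h_i in {-1,+1}, approximated by q = Π q_i.
J = 0.8
nbrs = {0: [1], 1: [0, 2], 2: [1]}
m = np.array([0.9, -0.2, 0.1])        # means E_q[h_i] of the initial q_i's
for _ in range(200):
    for i in range(3):
        # optimal q_i(h_i) ∝ exp(h_i * J * sum of neighbor means) => tanh mean
        m[i] = np.tanh(J * sum(m[j] for j in nbrs[i]))

print(np.round(m, 3))
# The exact marginal means are 0 by symmetry; MF breaks the symmetry and
# locks onto one of the two modes -- a preview of how crude q = Π q_i is.
```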

Page 9: Mean field for a simple-tree model

Just for illustration, since we know how to solve this one exactly:

\[ \Pr(h|s,\theta) \approx q(h) = \prod_i q_i(h_i) \]

We select a node and optimize its $q_i$ while making sure it remains a distribution:

\[ q_i(h_i) = \arg\max_{q_i} F_{MF}, \qquad F_{MF} = \sum_h \Big(\prod_j q_j(h_j)\Big)\log\Pr(h,s|\theta) - \sum_j \sum_{h_j} q_j(h_j)\log q_j(h_j) \]

To ease notation, assume the left (l) and right (r) children of node i are hidden. The energy decomposes, and only a few terms are affected by $q_i$:

\[ F_{MF}(q_i) = \sum_{h_i} q_i(h_i)\Big[ \sum_{h_{pa_i}} q_{pa_i}(h_{pa_i})\log\Pr(h_i|h_{pa_i}) + \sum_{h_l} q_l(h_l)\log\Pr(h_l|h_i) + \sum_{h_r} q_r(h_r)\log\Pr(h_r|h_i) - \log q_i(h_i) \Big] + C \]

Page 10: Mean field for a simple-tree model

Just for illustration, since we know how to solve this one exactly:

\[ \Pr(h|s,\theta) \approx q(h) = \prod_i q_i(h_i), \qquad q_i(h_i) = \arg\max_{q_i} F_{MF} \]

Maximizing $F_{MF}(q_i)$ under the constraint $\sum_{h_i} q_i(h_i) = 1$ gives the closed-form update:

\[ q_i(h_i) \propto \exp\Big( \sum_{h_{pa_i}} q_{pa_i}(h_{pa_i})\log\Pr(h_i|h_{pa_i}) + \sum_{h_l} q_l(h_l)\log\Pr(h_l|h_i) + \sum_{h_r} q_r(h_r)\log\Pr(h_r|h_i) \Big) \]

Page 11: Mean field for a phylo-hmm model

\[ \Pr(h|s,\theta) \approx q(h) = \prod_{i,j} q_{i,j}(h_i^j) \]

Now we don't know how to solve this exactly, but MF is still simple:

\[ F_{MF}(q) = -\sum_h \Big(\prod_{k,m} q_{k,m}(h_k^m)\Big)\log\Pr(h,s|\theta) + \sum_{k,m}\sum_{h_k^m} q_{k,m}(h_k^m)\log q_{k,m}(h_k^m) \]

In the phylo-HMM, each hidden variable $h_i^j$ (species i, position j) depends on its parent species and the previous position through factors of the form $\Pr(h_i^j \mid h_{pa_i}^j, h_{pa_i}^{j-1}, h_i^{j-1})$. When optimizing $q_{i,j}$, the energy again decomposes, and only the terms in which $h_i^j$ appears are affected: its own transition factor, the factors of its children l and r at positions j and j+1, and the factor of $h_i^{j+1}$. Each such term contributes the expectation of its log-probability under the current q's of the other variables, e.g.:

\[ \sum_{h_{pa_i}^j,\, h_{pa_i}^{j-1},\, h_i^{j-1}} q_{pa_i,j}(h_{pa_i}^j)\, q_{pa_i,j-1}(h_{pa_i}^{j-1})\, q_{i,j-1}(h_i^{j-1})\, \log\Pr(h_i^j \mid h_{pa_i}^j, h_{pa_i}^{j-1}, h_i^{j-1}) \]

[Figure: the phylo-HMM neighborhood of $h_i^j$: $h_{pa_i}^{j-1}, h_{pa_i}^j, h_i^{j-1}, h_i^{j+1}$ and the children $h_l^j, h_l^{j+1}, h_r^j, h_r^{j+1}$]

Page 12: Mean field for a phylo-hmm model

\[ \Pr(h|s,\theta) \approx q(h) = \prod_{i,j} q_{i,j}(h_i^j) \]

Now we don't know how to solve this exactly, but MF is still simple. As before, the optimal solution is derived by making $\log q_{i,j}$ equal the sum of the affected terms:

\[ q_{i,j}(h_i^j) \propto \exp\Big( E_q\big[\log\Pr(h_i^j \mid h_{pa_i}^j, h_{pa_i}^{j-1}, h_i^{j-1})\big] + E_q\big[\log\Pr(h_i^{j+1} \mid h_{pa_i}^{j+1}, h_{pa_i}^j, h_i^j)\big] + \sum_{c\in\{l,r\}}\Big( E_q\big[\log\Pr(h_c^j \mid \cdot)\big] + E_q\big[\log\Pr(h_c^{j+1} \mid \cdot)\big] \Big) \Big) \]

where each expectation is taken over the current q's of all variables other than $h_i^j$, and

\[ F_{MF}(q) = -\sum_h \Big(\prod_{k,m} q_{k,m}(h_k^m)\Big)\log\Pr(h,s|\theta) + \sum_{k,m}\sum_{h_k^m} q_{k,m}(h_k^m)\log q_{k,m}(h_k^m) \]

Page 13: Simple Mean Field is usually not a good idea

Why? Because the MF trial function is very crude:

\[ \Pr(h|s,\theta) \approx q(h) = \prod_i q_i(h_i) \]

For example, we said before that the joint posteriors cannot be approximated by an independent product of the hidden variables' posteriors:

\[ \Pr(h_i, h_{pa_i} \mid s,\theta) \ne \Pr(h_i|s,\theta)\,\Pr(h_{pa_i}|s,\theta) \]

[Figure: a tree with observed leaves A, C, A, C; each hidden ancestor is marginally A/C, but the ancestors are strongly correlated]

Page 14: Exploiting additional structure

The approximation specifies an independent distribution for each locus, but maintains the tree dependencies within a locus:

\[ \Pr(h|s,\theta) \approx \prod_j q_j(h^j), \qquad q_j(h^j) = q_j(h_1^j,\ldots,h_n^j) \]

We can greatly improve accuracy by generalizing the mean field algorithm using larger building blocks. We now optimize each tree $q_j$ separately, given the current potentials of the other trees.

The key point is that optimizing for any given tree is efficient: we just use a modified up-down algorithm.

Page 15: Tree based variational inference

\[ F_{MF}(q) = \sum_h \Big(\prod_m q_m(h^m)\Big)\log\Pr(h,s|\theta) - \sum_m \sum_{h^m} q_m(h^m)\log q_m(h^m) \]

Each tree is only affected by the tree before it and the tree after it. Collecting the terms that involve $q_j$:

\[ F_{MF}(q_j) = \sum_{h^j} q_j(h^j)\Big[ \sum_{h^{j-1}} q_{j-1}(h^{j-1})\log\Pr(h^j|h^{j-1}) + \sum_{h^{j+1}} q_{j+1}(h^{j+1})\log\Pr(h^{j+1}|h^j) - \log q_j(h^j) \Big] + C \]

Decomposing each transition probability over the nodes of the tree, $\Pr(h^j|h^{j-1}) = \prod_i \Pr(h_i^j \mid h_{pa_i}^j, h_{pa_i}^{j-1}, h_i^{j-1})$:

\[ F_{MF}(q_j) = \sum_{h^j} q_j(h^j)\Big[ \sum_{h^{j-1}} q_{j-1}(h^{j-1}) \sum_i \log\Pr(h_i^j \mid h_{pa_i}^j, h_{pa_i}^{j-1}, h_i^{j-1}) + \sum_{h^{j+1}} q_{j+1}(h^{j+1}) \sum_i \log\Pr(h_i^{j+1} \mid h_{pa_i}^{j+1}, h_{pa_i}^j, h_i^j) \Big] - \sum_{h^j} q_j(h^j)\log q_j(h^j) + C \]

Page 16: Tree based variational inference

Rearranging, the contribution of the neighboring trees to each node can be absorbed into a modified local potential (using the pairwise marginals of $q_{j-1}$ and $q_{j+1}$):

\[ \log\psi_i^j(h_i^j, h_{pa_i}^j) = \sum_{h_i^{j-1}, h_{pa_i}^{j-1}} q_{j-1}(h_i^{j-1}, h_{pa_i}^{j-1})\log\Pr(h_i^j \mid h_{pa_i}^j, h_{pa_i}^{j-1}, h_i^{j-1}) + \sum_{h_i^{j+1}, h_{pa_i}^{j+1}} q_{j+1}(h_i^{j+1}, h_{pa_i}^{j+1})\log\Pr(h_i^{j+1} \mid h_{pa_i}^{j+1}, h_{pa_i}^j, h_i^j) \]

so that:

\[ F_{MF}(q_j) = \sum_{h^j} q_j(h^j)\sum_i \log\psi_i^j(h_i^j, h_{pa_i}^j) - \sum_{h^j} q_j(h^j)\log q_j(h^j) + C \]

Compare this to the free energy of the simple tree:

\[ F_{SimpleTree}(q) = \sum_h q(h)\sum_i \log\Pr(h_i|h_{pa_i}) - \sum_h q(h)\log q(h) \]

We got the same functional form as we had for the simple tree, so we can use the up-down algorithm to optimize $q_j$.

Page 17: Chain cluster variational inference

We can use any partition of a BN into trees and derive a similar MF algorithm.

For example, instead of trees we can use the Markov chain in each species.

What will work better for us? It depends on the strength of the dependencies in each dimension – we should try to capture as much "dependency" as possible inside the blocks.

Page 18: Simple Tree: Inference as message passing

[Figure: a tree with observed data at the leaves; each subtree reports "You are P(H|our data)" to its parent, and the root concludes "I am P(H|all data)"]

Page 19: Factor graphs

Defining the joint probability for a set of random variables given:

1) Any set of node subsets (a hypergraph): $V,\; A = \{a \mid a \subseteq V\}$

2) Functions on the node subsets (potentials): $\phi_a(x_a)$

Joint distribution:

\[ \Pr(x) = \frac{1}{Z}\prod_a \phi_a(x_a) \]

Partition function:

\[ Z = \sum_x \prod_a \phi_a(x_a) \]

If the potentials are conditional probabilities, what will be Z? Not necessarily 1! (can you think of an example?)

Things are difficult when there are several modes.

[Figure: a bipartite graph of factor nodes and r.v. nodes]
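One concrete answer to the question, as a sketch (the CPD numbers are made up): with the full set of CPDs of a DAG as potentials, Z = 1; with conditional-probability potentials that do not cover every variable, Z need not be 1.

```python
import numpy as np
from itertools import product

# Brute-force partition function of a factor graph over binary x1, x2.
# Factors taken from the CPDs of the DAG x1 -> x2:
# phi_1(x1) = Pr(x1), phi_2(x1,x2) = Pr(x2|x1).
p_x1 = np.array([0.3, 0.7])
p_x2_given_x1 = np.array([[0.9, 0.1], [0.2, 0.8]])

Z = sum(p_x1[a] * p_x2_given_x1[a, b] for a, b in product([0, 1], repeat=2))
print(round(Z, 6))    # 1.0 -- CPDs covering every variable give Z = 1

# Conditional-probability potentials alone do not force Z = 1: drop phi_1
# and keep only phi(x1,x2) = Pr(x2|x1); each x1 row sums to 1, so Z = 2.
Z2 = sum(p_x2_given_x1[a, b] for a, b in product([0, 1], repeat=2))
print(round(Z2, 6))   # 2.0
```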

Page 20: Converting directional models to factor graphs

[Figure: a DBN and a PhyloHMM over the variables $h_i^{j-1}, h_i^j, h_i^{j+1}$ and $h_{pa_i}^{j-1}, h_{pa_i}^j, h_{pa_i}^{j+1}$, each converted into a factor graph]

Using the conditional probabilities as potentials, $\phi_a(x_a) = \Pr(x \mid pa_x)$, the resulting factor graph is well defined (although it contains loops!) and Z = 1. If the factors do not correspond to a DAG's CPDs, with every variable covered exactly once, then in general Z ≠ 1.

Page 21: More definitions

The model:

\[ \log\Pr(x) = \sum_a \log\phi_a(x_a) - \log Z \]

Potentials can be defined on discrete, real-valued, etc. variables. It is also common to define general log-linear models directly:

\[ \Pr(x) = \frac{1}{Z}\exp\Big(\sum_a w_a \log\phi_a(x_a)\Big) \]

Inference:

\[ \Pr(D|\theta) = \sum_{x\in D} \frac{1}{Z}\exp\Big(\sum_a w_a \log\phi_a(x_a)\Big), \qquad \Pr(x_i \mid D,\theta) = \frac{1}{Z}\sum_{x\,:\,D,\,x_i} \exp\Big(\sum_a w_a \log\phi_a(x_a)\Big) \Big/ \Pr(D|\theta) \]

Learning: find the factor parameterization

\[ \arg\max_\theta \Pr(D|\theta) \]

Page 22: Inference in factor graphs: Algorithms

Directed models are sometimes more natural and easier to understand. Their popularity stems from their original role of expressing knowledge in AI. They are not very natural for modeling physical phenomena, except for time-dependent processes.

Undirected models are analogous to well-developed models in statistical physics (e.g., spin glass models). We borrow computational ideas from the physicists (these people are big on approximations). The models are convex, which gives them important algorithmic properties.

Which algorithms still apply?

Dynamic programming: No (also not in BNs!)
Forward sampling (likelihood weighting): No
Metropolis/Gibbs: Yes
Mean field: Yes
Structural variational inference: Yes

Page 23: Belief propagation in a factor graph

\[ P(x|\theta) = \frac{1}{Z}\prod_a \phi_a(x_a) \]

• Remember, a factor graph is defined given a set of random variables (indices i, j, k…) and a set of factors on groups of variables (indices a, b…)

• $x_a$ refers to an assignment of values to the inputs of the factor a

• Z is the partition function (which is hard to compute)

• The BP algorithm is constructed by computing and updating messages:

• Messages from factors to variables: $m_{a\to i}(x_i)$

• Messages from variables to factors: $m_{i\to a}(x_i)$

Each message is a function from the values attainable by $x_i$ to the reals.

• Think of messages as transmitting beliefs:

a→i: "given my other input variables, and ignoring your message, you are x"

i→a: "given my other input factors and my potential, and ignoring your message, you are x"

Page 24: Messages update rules

Messages from variables to factors:

\[ m_{i\to a}(x_i) = \prod_{c\in N(i)\setminus a} m_{c\to i}(x_i) \]

Messages from factors to variables:

\[ m_{a\to i}(x_i) = \sum_{x_a\setminus x_i} \phi_a(x_a) \prod_{j\in N(a)\setminus i} m_{j\to a}(x_j) \]
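A sketch of these two rules on a 3-variable chain (the potentials are made up). On a tree, one inward and one outward pass suffices, and the resulting belief matches the brute-force marginal:

```python
import numpy as np

# Sum-product BP on the chain factor graph x1 -f12- x2 -f23- x3.
f12 = np.array([[1.0, 0.5], [0.5, 2.0]])
f23 = np.array([[2.0, 1.0], [0.2, 1.0]])

# Leaf-to-root then root-to-leaf schedule: variables forward products of
# incoming factor messages; factors marginalize potential times messages.
m_x1_f12 = np.ones(2)                      # x1 has no other neighbors
m_f12_x2 = f12.T @ m_x1_f12                # sum_x1 f12(x1,x2) * m(x1)
m_x2_f23 = m_f12_x2
m_f23_x3 = f23.T @ m_x2_f23
m_x3_f23 = np.ones(2)
m_f23_x2 = f23 @ m_x3_f23
m_x2_f12 = m_f23_x2
m_f12_x1 = f12 @ m_x2_f12

b2 = m_f12_x2 * m_f23_x2                   # belief b(x2) ∝ product of messages
b2 /= b2.sum()

# Exact marginal of x2 by brute-force enumeration of the joint
joint = np.array([[[f12[a, b] * f23[b, c] for c in range(2)]
                   for b in range(2)] for a in range(2)])
p2 = joint.sum(axis=(0, 2)); p2 /= p2.sum()
print(np.allclose(b2, p2))                 # True: BP is exact on a tree
```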

Page 25: The algorithm proceeds by updating messages

• Define the beliefs as approximating the single-variable posteriors $p(h_i|s)$:

\[ b_i(x_i) \propto \prod_{a\in N(i)} m_{a\to i}(x_i) \]

Algorithm:

Initialize all messages to uniform. Iterate until no message changes:

Update factor-to-variable messages; update variable-to-factor messages.

• Why is this different from the mean field algorithm, where $q(h) = \prod_i q_i(h_i)$?

Page 26: Beliefs on factor inputs

This is far from mean field, since for example:

\[ b_a(x_a) \propto \phi_a(x_a)\prod_{j\in N(a)} m_{j\to a}(x_j) = \phi_a(x_a)\prod_{j\in N(a)}\prod_{c\in N(j)\setminus a} m_{c\to j}(x_j) \]

The update rules can be viewed as derived from constraints on the beliefs:

1. A requirement on the variable beliefs ($b_i$):

\[ b_i(x_i) \propto \prod_{a\in N(i)} m_{a\to i}(x_i) \]

2. A requirement on the factor beliefs ($b_a$):

\[ b_a(x_a) \propto \phi_a(x_a)\prod_{j\in N(a)}\prod_{c\in N(j)\setminus a} m_{c\to j}(x_j) \]

3. A marginalization requirement:

\[ b_i(x_i) = \sum_{x_a\setminus x_i} b_a(x_a) \]

Page 27: BP on Tree = Up-Down

[Figure: a tree with hidden nodes $h_1, h_2, h_3$; leaves $s_1, s_2$ attached to $h_1$ through factors a, b, leaves $s_3, s_4$ attached to $h_2$ through factors d, e, and a factor c joining $h_1, h_2$ to the root $h_3$]

On a tree, the BP messages reproduce the up-down algorithm. The messages flowing toward the root are the "up" quantities, e.g., with the observed values of $s_1, s_2$ fixed:

\[ m_{a\to h_1}(h_1) = \phi_a(h_1, s_1) = \Pr(s_1|h_1), \qquad m_{b\to h_1}(h_1) = \Pr(s_2|h_1) \]

\[ up_{h_1}(x_i) = \Pr(s_1|h_1)\Pr(s_2|h_1) = m_{a\to h_1}(h_1)\, m_{b\to h_1}(h_1) \]

and the messages flowing away from the root are the "down" quantities, e.g.:

\[ down_{h_1}(x_i) = m_{c\to h_1}(h_1) = \sum_{h_2, h_3} up_{h_2}(h_2)\, down_{h_3}(h_3)\, \Pr(h_2|h_3)\Pr(h_1|h_3) \]

Page 28: Loopy BP is not guaranteed to converge

[Figure: two variables X, Y connected by two "disagree" factors, each with potential table 0 on the diagonal and 1 off it; the messages alternate forever between the states (1,0) and (0,1)]

This is not a hypothetical scenario – it frequently happens when there is too much symmetry. For example, most mutational effects are double-stranded and hence symmetric, which can result in such loops.
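The oscillation can be reproduced in a few lines; the hard 0/1 "disagree" potential and a slightly biased initialization are the only ingredients (a sketch, matching the figure's two-factor loop):

```python
import numpy as np

# Loopy BP on the smallest loop: variables X, Y joined by TWO "disagree"
# factors a, b with phi(x,y) = 1 iff x != y.  Synchronous updates oscillate.
phi = np.array([[0.0, 1.0], [1.0, 0.0]])

msgs = {                                     # factor -> variable messages
    ("a", "X"): np.array([0.9, 0.1]),        # slightly biased start
    ("b", "X"): np.array([0.5, 0.5]),
    ("a", "Y"): np.array([0.5, 0.5]),
    ("b", "Y"): np.array([0.5, 0.5]),
}
history = []
for _ in range(6):
    new = {
        ("a", "X"): phi @ msgs[("b", "Y")],  # m_{Y->a} = m_{b->Y}
        ("b", "X"): phi @ msgs[("a", "Y")],
        ("a", "Y"): phi @ msgs[("b", "X")],
        ("b", "Y"): phi @ msgs[("a", "X")],
    }
    msgs = {k: v / v.sum() for k, v in new.items()}
    history.append(msgs[("a", "X")].copy())

print(np.round(np.array(history), 2))        # period-2 cycle, no fixed point
```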

Page 29: The Bethe Free Energy

• LBP was introduced in several domains (BNs, coding), and is considered very practical in many cases.

• ..but unlike the variational approaches we studied before, it is not clear how it approximates the likelihood/partition function, even when it converges..

• In the early 2000s, Yedidia, Freeman and Weiss discovered a connection between the LBP algorithm and the Bethe free energy, developed by Hans Bethe to approximate the free energy in crystal field theory back in the 40's/50's:

\[ F_{Bethe} = U_{Bethe} - H_{Bethe} \]

\[ U_{Bethe} = -\sum_a\sum_{x_a} b_a(x_a)\log\phi_a(x_a) \]

\[ H_{Bethe} = -\sum_a\sum_{x_a} b_a(x_a)\log b_a(x_a) + \sum_i (d_i-1)\sum_{x_i} b_i(x_i)\log b_i(x_i) \]

($d_i$ is the number of factors touching variable i.)

• Compare to the variational free energy:

\[ F(q) = -\sum_h q(h)\log p(h,s|\theta) + \sum_h q(h)\log q(h) \]

Theorem: beliefs are LBP fixed points if and only if they are locally optimal for the Bethe free energy.
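A useful sanity check, sketched with made-up chain potentials: on a tree, plugging the true marginals into the Bethe free energy recovers −log Z exactly (the approximation is only approximate on loopy graphs).

```python
import numpy as np

# On a tree, the Bethe free energy at the TRUE marginals equals -log Z.
# Chain x1 - f12 - x2 - f23 - x3 with arbitrary positive potentials.
f12 = np.array([[1.0, 0.5], [0.5, 2.0]])
f23 = np.array([[2.0, 1.0], [0.2, 1.0]])

joint = np.einsum('ab,bc->abc', f12, f23)    # unnormalized joint
Z = joint.sum(); p = joint / Z
b12 = p.sum(axis=2); b23 = p.sum(axis=0)     # factor beliefs = pair marginals
b2 = p.sum(axis=(0, 2))                      # marginal of the middle variable

U = -(b12 * np.log(f12)).sum() - (b23 * np.log(f23)).sum()
# H_Bethe = sum_a H(b_a) - sum_i (d_i - 1) H(b_i); d_1 = d_3 = 1 drop out, d_2 = 2
H = -(b12 * np.log(b12)).sum() - (b23 * np.log(b23)).sum() \
    + (2 - 1) * (b2 * np.log(b2)).sum()
F_bethe = U - H
print(np.isclose(F_bethe, -np.log(Z)))       # True
```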

Page 30: Generalization: Regions-based free energy

• Start with a factor graph (X, A)

• Introduce regions $R = (X_R, A_R)$ and multipliers $c_R$

• We require that every factor and every variable be counted exactly once:

\[ \sum_{R:\, a\in A_R} c_R = 1 \;\;\forall a\in A, \qquad \sum_{R:\, i\in X_R} c_R = 1 \;\;\forall i\in X \]

• We will work with valid region graphs, in which each region contains all of its factors' variables:

\[ a\in A_R \Rightarrow N(a)\subseteq X_R \]

Region energy:

\[ E_R(x_R) = -\sum_{a\in A_R}\log\phi_a(x_a) \]

Region average energy:

\[ U_R(b_R) = \sum_{x_R} b_R(x_R)E_R(x_R) \]

Region entropy:

\[ H_R(b_R) = -\sum_{x_R} b_R(x_R)\log b_R(x_R) \]

Region free energy:

\[ F_R(b_R) = U_R(b_R) - H_R(b_R) \]

Region-based average energy, entropy and free energy:

\[ U(\{b_R\}) = \sum_R c_R U_R(b_R), \qquad H(\{b_R\}) = \sum_R c_R H_R(b_R), \qquad F(\{b_R\}) = U(\{b_R\}) - H(\{b_R\}) \]

Page 31: Bethe regions are the factor neighborhood sets and the single-variable regions

[Figure: three factors a, b, c over variables; each factor region gets multiplier 1: $c_a = c_b = c_c = 1$]

We compensate for the multiple counting of variables using the multiplicity constants:

\[ c_a = 1, \qquad c_i = 1 - d_i \]

We can add larger regions, as long as we update the multipliers so that every region compensates for its super-regions:

\[ c_R = 1 - \sum_{R'\supset R} c_{R'} \]
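A sketch of the counting-number bookkeeping on a small, hypothetical factor graph (the incidence structure below is made up): with $c_a = 1$ and $c_i = 1 - d_i$, the multipliers of the regions containing any variable sum to 1, as the validity condition requires.

```python
# Bethe counting numbers on a small factor graph: three pairwise factors
# a, b, c over variables 1, 2, 3 (hypothetical incidence).
N = {"a": {1, 2}, "b": {2, 3}, "c": {1, 3}}
variables = {1, 2, 3}

c = {f: 1 for f in N}                          # large regions: c_R = 1
deg = {i: sum(i in s for s in N.values()) for i in variables}
c.update({i: 1 - deg[i] for i in variables})   # singletons: c_i = 1 - d_i

# Validity: counting numbers of all regions containing variable i sum to 1
for i in sorted(variables):
    total = sum(c[f] for f in N if i in N[f]) + c[i]
    print(i, total)                            # each variable: total = 1
```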

Page 32: Multipliers compensate on average, not on entropy

Claim: For valid regions, if the regions' beliefs are exact, $b_R(x_R) = p(x_R)$, then the average region-based energy is exact:

\[ U(\{b_R\}) = \sum_R c_R \sum_{x_R} b_R(x_R)E_R(x_R) = -\sum_R c_R \sum_{a\in A_R}\sum_{x_a} p(x_a)\log\phi_a(x_a) = -\sum_a\sum_{x_a} p(x_a)\log\phi_a(x_a) = U \]

(the last step uses $\sum_{R:\, a\in A_R} c_R = 1$).

We cannot guarantee as much for the region-based entropy:

\[ H_R(b_R) = -\sum_{x_R} b_R(x_R)\log b_R(x_R) \]

Claim: the region-based entropy is exact when the model is a uniform distribution. Proof: exercise. This means that the entropy counts the correct number of degrees of freedom – e.g., for binary variables, $H = N\log 2$.

Definition: a region-based free energy approximation is said to be max-ent normal if its region-based entropy is maximized when the beliefs are uniform.

A non-max-ent approximation can minimize the region free energy by selecting erroneously high-entropy beliefs!

Page 33: Bethe's regions are max-ent normal

Claim: The Bethe regions give a max-ent normal approximation (i.e., the region-based entropy is maximized by the uniform distribution).

\[ H_{Bethe} = -\sum_a\sum_{x_a} b_a(x_a)\log b_a(x_a) + \sum_i (d_i-1)\sum_{x_i} b_i(x_i)\log b_i(x_i) \]

Rewriting (using the consistency of the beliefs):

\[ H_{Bethe} = \sum_i H(b_i) - \sum_a \sum_{x_a} b_a(x_a)\log\frac{b_a(x_a)}{\prod_{i\in N(a)} b_i(x_i)} = \sum_i H(b_i) - \sum_a I(b_a) \]

The entropy terms $H(b_i)$ are maximal on the uniform distribution, and the information terms $I(b_a)$ are nonnegative and equal 0 on the uniform distribution.

Page 34: Example: A non-max-ent approximation

Start with a complete graph on 6 binary variables and binary factors. Add all variable triplets, pairs and singletons as regions, and generate multipliers that guarantee consistency:

triplets: c = 1 (20 overall)
pairs: c = 1 − 4 = −3 (15 overall; each pair lies in 4 triplets)
singletons: c = 1 − 10 + 15 = 6 (6 overall)

Look at the following consistent beliefs:

\[ b_i(0) = b_i(1) = 1/2, \qquad b_{ij}(x_i,x_j) = \begin{cases}1/2 & x_i = x_j\\ 0 & \text{otherwise}\end{cases}, \qquad b_{ijk}(x_i,x_j,x_k) = \begin{cases}1/2 & x_i = x_j = x_k\\ 0 & \text{otherwise}\end{cases} \]

The entropy of every region is ln 2. The total region-based entropy is therefore:

\[ H = \sum_R c_R H_R = 20\ln 2 - 45\ln 2 + 36\ln 2 = 11\ln 2 \]

We claimed before that the entropy of the uniform distribution is counted exactly: 6 ln 2. These non-uniform beliefs get a higher region-based entropy, so the approximation is not max-ent normal.
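The counting arithmetic in this example can be checked mechanically (a sketch of the combinatorics only, not of the beliefs themselves):

```python
from math import comb, log

# Region counting numbers when all triplets, pairs and singletons over 6
# variables are regions: c is fixed top-down by c_R = 1 - sum over strict
# super-regions of c_{R'}.
n = 6
c_triplet = 1
c_pair = 1 - comb(n - 2, 1) * c_triplet        # each pair lies in n-2 triplets
c_single = 1 - comb(n - 1, 2) * c_triplet - (n - 1) * c_pair
print(c_pair, c_single)                        # -3 6

# Every region entropy of the given consistent beliefs is ln 2, so:
H = (comb(n, 3) * c_triplet + comb(n, 2) * c_pair + n * c_single) * log(2)
print(round(H / log(2)))                       # 11  (> 6 = exact uniform entropy)
```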

Page 35: Inference as minimization of region-based free energy

We want to solve a variational problem:

\[ \min_{\{b_R\}} F(\{b_R\}) \]

while enforcing constraints on the regions' beliefs:

\[ \sum_{x_R} b_R(x_R) = 1, \qquad \sum_{x_R\setminus x_{R'}} b_R(x_R) = b_{R'}(x_{R'}) \;\;\text{for } R'\subset R \]

Unlike the structured variational approximations we discussed before, and although the beliefs are (regionally) compatible, we can have optimal beliefs that do not represent any true global posterior distribution.

[Figure: three variables A, B, C in a loop of pairwise factors]

Optimal region beliefs are identical to the factors:

\[ b_{AB} = b_{BC} = \begin{pmatrix}0.4 & 0.1\\ 0.1 & 0.4\end{pmatrix}, \qquad b_{AC} = \begin{pmatrix}0.1 & 0.4\\ 0.4 & 0.1\end{pmatrix}, \qquad b_i = \begin{pmatrix}0.5\\ 0.5\end{pmatrix} \]

It can be shown that this cannot be the result of any joint distribution on the three variables (note the negative feedback loop here).
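The impossibility has a one-line proof that can be sketched numerically. Reading the example's pairwise beliefs as P(A=B) = P(B=C) = 0.8 and P(A=C) = 0.2 (my reading of the tables; the labeling of the three pairs is an assumption):

```python
# Pairwise-consistent beliefs that no joint distribution can produce.
# From the beliefs: P(A=B) = P(B=C) = 0.8, but P(A=C) = 0.2.
# Any joint distribution must satisfy the union bound
#   P(A=C) >= P(A=B and B=C) >= P(A=B) + P(B=C) - 1
p_AB, p_BC, p_AC = 0.8, 0.8, 0.2
lower_bound = p_AB + p_BC - 1
print(lower_bound, p_AC, p_AC >= lower_bound)   # 0.6 0.2 False -> impossible
```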

Page 36: Inference as minimization of region-based free energy

Claim: When it converges, LBP finds a minimum of the Bethe free energy.

Proof idea: we have an optimization problem (minimum energy) with constraints (beliefs are consistent and add up to 1). We write down a Lagrangian that expresses both the minimization goal and the constraints, and show that it is minimized when the LBP update rules hold.

\[ L = F_{Bethe} + \sum_a \gamma_a\Big(1-\sum_{x_a}b_a(x_a)\Big) + \sum_i \gamma_i\Big(1-\sum_{x_i}b_i(x_i)\Big) + \sum_i\sum_{a\in N(i)}\sum_{x_i}\lambda_{ai}(x_i)\Big[b_i(x_i) - \sum_{x_a\setminus x_i} b_a(x_a)\Big] \]

Important technical point: we shall assume that at the fixed point all beliefs are nonzero. This can be shown to hold if all factors are "soft" (do not contain zero values for any assignment).

Page 37: The Bethe Lagrangian

\[ F_{Bethe} = \sum_a\sum_{x_a} b_a(x_a)\big[\log b_a(x_a) - \log\phi_a(x_a)\big] - \sum_i (d_i-1)\sum_{x_i} b_i(x_i)\log b_i(x_i) \]

\[ L = F_{Bethe} + \sum_a \gamma_a\Big(1-\sum_{x_a}b_a(x_a)\Big) + \sum_i \gamma_i\Big(1-\sum_{x_i}b_i(x_i)\Big) + \sum_i\sum_{a\in N(i)}\sum_{x_i}\lambda_{ai}(x_i)\Big[b_i(x_i) - \sum_{x_a\setminus x_i} b_a(x_a)\Big] \]

The three constraint groups: large-region beliefs are normalized; variable-region beliefs are normalized; marginalization.

Page 38: The Bethe Lagrangian

Take the derivatives with respect to each $b_a$ and $b_i$:

\[ \frac{\partial L}{\partial b_a(x_a)} = \log b_a(x_a) + 1 - \log\phi_a(x_a) - \gamma_a - \sum_{i\in N(a)}\lambda_{ai}(x_i) = 0 \]

\[ \Rightarrow\; b_a(x_a) = \phi_a(x_a)\exp\Big(\gamma_a - 1 + \sum_{i\in N(a)}\lambda_{ai}(x_i)\Big) \]

\[ \frac{\partial L}{\partial b_i(x_i)} = -(d_i-1)\big(\log b_i(x_i)+1\big) - \gamma_i + \sum_{a\in N(i)}\lambda_{ai}(x_i) = 0 \]

\[ \Rightarrow\; b_i(x_i) = \exp\Big(\frac{1}{d_i-1}\Big(\sum_{a\in N(i)}\lambda_{ai}(x_i) - \gamma_i\Big) - 1\Big) \]

Page 39: Bethe minima are LBP fixed points

So here are the conditions:

\[ b_a(x_a) = \phi_a(x_a)\exp\Big(\gamma_a - 1 + \sum_{i\in N(a)}\lambda_{ai}(x_i)\Big) \propto \phi_a(x_a)\prod_{i\in N(a)}\exp\big(\lambda_{ai}(x_i)\big) \]

\[ b_i(x_i) = \exp\Big(\frac{1}{d_i-1}\Big(\sum_{a\in N(i)}\lambda_{ai}(x_i) - \gamma_i\Big) - 1\Big) \propto \exp\Big(\frac{1}{d_i-1}\sum_{a\in N(i)}\lambda_{ai}(x_i)\Big) \]

And we can solve them if we identify the multipliers with messages:

\[ \lambda_{ai}(x_i) = \log m_{i\to a}(x_i) = \sum_{c\in N(i)\setminus a}\log m_{c\to i}(x_i) \]

Giving us:

\[ b_a(x_a) \propto \phi_a(x_a)\prod_{i\in N(a)}\prod_{c\in N(i)\setminus a} m_{c\to i}(x_i) \]

\[ b_i(x_i) \propto \Big(\prod_{a\in N(i)}\prod_{c\in N(i)\setminus a} m_{c\to i}(x_i)\Big)^{\frac{1}{d_i-1}} = \prod_{a\in N(i)} m_{a\to i}(x_i) \]

We saw before that these conditions, together with the marginalization constraint, generate the update rules! So (L minimum → LBP fixed point) is proven. The other direction is quite direct – see the exercise.

LBP is in fact computing the Lagrange multipliers – a very powerful observation.

Page 40: Generalizing LBP for region graphs

A region graph is a graph on subsets of nodes in the factor graph, with valid multipliers (as defined above): regions $(X_R, A_R)$ and multipliers $c_R$, with

\[ \sum_{R:\, a\in A_R} c_R = 1, \qquad \sum_{R:\, i\in X_R} c_R = 1, \qquad a\in A_R \Rightarrow N(a)\subseteq X_R \]

Notation: D(R) – the descendants of R; P(R) – the parents of R.

Parent-to-child beliefs:

\[ b_R(x_R) \propto \prod_{a\in A_R}\phi_a(x_a)\; \prod_{P\in P(R)} m_{P\to R}(x_R)\; \prod_{D\in D(R)}\;\prod_{P'\in P(D)\setminus(D(R)\cup\{R\})} m_{P'\to D}(x_D) \]

Page 41: Generalizing LBP for region graphs

Parent-to-child algorithm – the message from a parent region P to its child R is updated by:

\[ m_{P\to R}(x_R) = \frac{\displaystyle\sum_{x_P\setminus x_R}\;\prod_{a\in A_P\setminus A_R}\phi_a(x_a)\;\prod_{(I,J)\in N(P,R)} m_{I\to J}(x_J)}{\displaystyle\prod_{(I,J)\in D(P,R)} m_{I\to J}(x_J)} \]

where, writing D(P)+P for $D(P)\cup\{P\}$:

N(P,R) = { (I,J) : I not in D(P)+P, J in D(P)+P but not in D(R)+R }

D(P,R) = { (I,J) : I in D(P)+P but not in D(R)+R, J in D(R)+R }

Page 42: GLBP in practice

LBP is very attractive for users: really simple to implement, and very fast.

LBP performance is limited by the size of the factor assignment spaces $X_a$, which can grow rapidly with the factors' degrees or with the size of large regions.

GLBP will be powerful when large regions can capture significant dependencies that are not captured by individual factors – think of small positive loops or other symmetric effects.

LBP messages can be computed synchronously (factors→variables→factors…); other scheduling options may boost performance considerably.

LBP is just one (quite indirect) way by which Bethe energies can be minimized. Other approaches are possible – some of which can be guaranteed to converge.

The Bethe/region energy minimization can be further constrained to force the beliefs to be realizable. This gives rise to the concept of the Wainwright–Jordan marginal polytope and convex algorithms over it.