On the Bethe approximation Adrian Weller Department of Statistics at Oxford University September 12, 2014 Joint work with Tony Jebara 1 / 46


On the Bethe approximation

Adrian Weller

Department of Statistics at Oxford University
September 12, 2014

Joint work with Tony Jebara


Outline

1 Background on inference in graphical models
  Examples
  What are the problems and why are they interesting?
  Belief propagation (BP) as sum-product message passing
  Variational perspective on inference

2 Bethe approximation
  Link to BP
  Other methods to minimize the Bethe free energy
  New approach: discretize to obtain an ε-approximate global optimum
    How to discretize such that an ε-approximation is guaranteed?
    How to search efficiently over the discretized space?

If time: understanding the two aspects of the Bethe approximation (entropy and polytope), and new work on clamping...

Questions (anytime!)


Background

Focus on undirected probabilistic graphical models, also called Markov random fields (MRFs).
Compact, powerful way to model dependencies among variables.

Many applications, including:

Systems biology (protein folding)

Social network analysis (friends, politics, terrorism)

Combinatorial problems (counting independent sets)

Computer vision (image denoising, depth perception)

Error-correcting codes (turbo codes, 3G/4G phones, satellite communication)


Example: image denoising

Inference is combining prior beliefs with observed evidence to form a prediction.

→ MAP inference


Notation

Focus on MRFs which are discrete and finite

n variables $V = \{X_1, \dots, X_n\}$ and (log) potential functions $\psi_c$ over subsets/factors $c$ of $V$, $c \in C \subseteq \mathcal{P}(V)$, which give higher score to sub-configurations with higher compatibility

Write $x = (x_1, \dots, x_n) \in \mathcal{X}$ for one particular complete configuration and $x_c$ for a configuration of the variables in $c$

$\psi_c$ maps each setting $x_c \mapsto \psi_c(x_c) \in \mathbb{R}$ [lookup table]

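\[ p(x) = \frac{1}{Z}\exp\Big(\sum_{c\in C}\psi_c(x_c)\Big) = \frac{e^{-E(x)}}{Z}, \qquad E(x) = -\sum_{c\in C}\psi_c(x_c), \]

where the partition function $Z = \sum_{x\in\mathcal{X}}\exp\big(\sum_{c\in C}\psi_c(x_c)\big)$ is the normalizing constant to ensure that probabilities sum to 1; $E$ is the energy (negative score), cf. physics.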
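As a small worked instance of these definitions (the model is an illustrative assumption, not taken from the talk), take two binary variables with a single factor $c = \{1, 2\}$ whose log-potential is $\psi_c(x_1, x_2) = w$ when $x_1 = x_2$ and $0$ otherwise:

```latex
\begin{align*}
Z &= \sum_{x \in \{0,1\}^2} \exp\big(\psi_c(x_1, x_2)\big) = e^{w} + 1 + 1 + e^{w} = 2e^{w} + 2, \\
p(0,0) &= p(1,1) = \frac{e^{w}}{2e^{w} + 2}, \qquad p(0,1) = p(1,0) = \frac{1}{2e^{w} + 2}, \\
E(0,0) &= E(1,1) = -w, \qquad E(0,1) = E(1,0) = 0 .
\end{align*}
```

For $w > 0$ the agreeing configurations have lower energy and higher probability, which is the sense in which a higher score means higher compatibility.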


Inference: 3 key problems

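\[ p(x) = \frac{1}{Z}\exp\Big(\sum_{c\in C}\psi_c(x_c)\Big) \]

MAP inference: identify a configuration of variables with maximum probability,
\[ x^* \in \arg\max_{x\in\mathcal{X}} \sum_{c\in C}\psi_c(x_c) \]

Marginal inference: compute the probability distribution of a subset of variables $x_c$,
\[ p(x_c) = \sum_{x'\in\mathcal{X}:\,x'_c = x_c} p(x') = \frac{\sum_{x'\in\mathcal{X}:\,x'_c = x_c}\exp\big(\sum_{c'\in C}\psi_{c'}(x'_{c'})\big)}{\sum_{x'\in\mathcal{X}}\exp\big(\sum_{c'\in C}\psi_{c'}(x'_{c'})\big)} \]

Evaluate the partition function,
\[ Z = \sum_{x\in\mathcal{X}}\exp\Big(\sum_{c\in C}\psi_c(x_c)\Big) \]

Great interest in finding classes of problems and approaches such that exact or approximate inference is tractable.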
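The three problems can be made concrete by brute-force enumeration on a tiny model. The sketch below is only illustrative: the chain, its potential tables and the names `factors`, `score` and `marginal` are assumptions made for the example, and enumeration costs $2^n$, which is exactly why tractable classes matter.

```python
import itertools
import math

# Hypothetical toy model: a chain X0 - X1 - X2 of binary variables.
# Each factor is (scope, lookup table of log-potentials psi_c).
factors = [
    ((0,), {(0,): 0.0, (1,): 0.5}),
    ((0, 1), {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 1.0}),
    ((1, 2), {(0, 0): 0.0, (0, 1): 2.0, (1, 0): 2.0, (1, 1): 0.0}),
]
n = 3


def score(x):
    """Sum of log-potentials psi_c(x_c) for a complete configuration x."""
    return sum(table[tuple(x[v] for v in scope)] for scope, table in factors)


configs = list(itertools.product([0, 1], repeat=n))

# Partition function: sum over all 2^n configurations.
Z = sum(math.exp(score(x)) for x in configs)

# MAP inference: a configuration with maximum score.
x_map = max(configs, key=score)


def marginal(subset):
    """Marginal distribution over the variables whose indices are in `subset`."""
    out = {}
    for x in configs:
        key = tuple(x[v] for v in subset)
        out[key] = out.get(key, 0.0) + math.exp(score(x)) / Z
    return out


p_x1 = marginal((1,))
```

Note that MAP inference never needs $Z$, whereas marginal inference is a ratio of two sums of the same kind as $Z$.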


Remark: conditioning on observed variables

\[ p(x) = \frac{1}{Z}\exp\Big(\sum_{c\in C}\psi_c(x_c)\Big) \]

Suppose $V$ is split into observed variables $Y = y$ and unobserved variables $X_U$, so $x = (x_u, y)$, $x_u \in \mathcal{X}_u$

\[ p(x_u \mid y) = \frac{p(x_u, y)}{p(y)} = \frac{p(x_u, y)}{\sum_{x'_u\in\mathcal{X}_u} p(x'_u, y)} \]

This is just a new smaller MRF with modified potentials on the variable set $X_U$

New partition function to normalize the new distribution

Hence the MRF framework is rich enough to handle conditioning

When we discuss MRFs, they might or might not have been based on conditioning on variables


Belief propagation (BP) for inference

Marginal inference via sum-product message passing

Send messages from variable $v \in V$ to factor $c \in C$

\[ m_{v\to c}(x_v) = \prod_{c^*\in C(v)\setminus\{c\}} m_{c^*\to v}(x_v) \]

Send messages from factor $c$ to variable $v$

\[ m_{c\to v}(x_v) = \sum_{x'_c:\,x'_v = x_v} \phi_c(x'_c) \prod_{v^*\in V(c)\setminus\{v\}} m_{v^*\to c}(x'_{v^*}) \]

where $\phi_c(x'_c) = \exp(\psi_c(x'_c))$

For MAP inference, use max-product: switch $\sum_{x'_c} \to \max_{x'_c}$

For acyclic models, converges to exact marginals efficiently (2 passes: collect from leaves to root, then distribute)
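The two message updates can be sketched directly. This is a minimal illustration rather than an efficient implementation: it uses a simple flooding schedule (every message updated each round) instead of the two-pass schedule, and the chain model, the factor names and the helper `brute_force_marginal` are assumptions made for the example.

```python
import itertools
import math

STATES = (0, 1)

# Hypothetical chain X0 - X1 - X2; each factor is (scope, table of phi_c = exp(psi_c)).
factors = {
    "a": ((0,), {(0,): 1.0, (1,): math.exp(0.5)}),
    "b": ((0, 1), {(0, 0): math.e, (0, 1): 1.0, (1, 0): 1.0, (1, 1): math.e}),
    "c": ((1, 2), {(0, 0): 1.0, (0, 1): math.exp(2.0), (1, 0): math.exp(2.0), (1, 1): 1.0}),
}
variables = (0, 1, 2)


def sum_product(factors, variables, n_rounds=10):
    """Flooding-schedule sum-product; returns normalized singleton beliefs."""
    pairs = [(c, v) for c, (scope, _) in factors.items() for v in scope]
    m_cv = {(c, v): {s: 1.0 for s in STATES} for c, v in pairs}  # factor -> variable
    for _ in range(n_rounds):
        # Variable -> factor: product of messages from the variable's other factors.
        m_vc = {
            (v, c): {
                s: math.prod(m_cv[(c2, v2)][s] for c2, v2 in pairs if v2 == v and c2 != c)
                for s in STATES
            }
            for c, v in pairs
        }
        # Factor -> variable: sum out every other variable in the factor's scope.
        new_cv = {}
        for c, v in pairs:
            scope, table = factors[c]
            msg = {s: 0.0 for s in STATES}
            for x_c, phi in table.items():
                term = phi
                for u, x_u in zip(scope, x_c):
                    if u != v:
                        term *= m_vc[(u, c)][x_u]
                msg[x_c[scope.index(v)]] += term
            new_cv[(c, v)] = msg
        m_cv = new_cv
    beliefs = {}
    for v in variables:
        b = {s: math.prod(m_cv[(c, v2)][s] for c, v2 in pairs if v2 == v) for s in STATES}
        total = sum(b.values())
        beliefs[v] = {s: b[s] / total for s in STATES}
    return beliefs


def brute_force_marginal(v):
    """Exact p(X_v = 1) by enumerating all configurations."""
    def weight(x):
        return math.prod(t[tuple(x[u] for u in scope)] for scope, t in factors.values())
    configs = list(itertools.product(STATES, repeat=len(variables)))
    return sum(weight(x) for x in configs if x[v] == 1) / sum(weight(x) for x in configs)


beliefs = sum_product(factors, variables)
```

Because the example graph is a tree, ten rounds are more than enough for the messages to stop changing, and the beliefs agree with the brute-force marginals. Messages are left unnormalized here, which is fine for a tiny model but would need rescaling on larger ones.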


What about cyclic (loopy) models?

Can triangulate and run junction tree

Exact solution but takes time exponential in treewidth

Or... just run loopy belief propagation (LBP) and hope

Often produces strikingly good results
But may not converge at all

Extensive literature on trying to understand LBP


Inference: a variational perspective

Recall
\[ p(x) = \frac{1}{Z}\exp\Big(\sum_{c\in C}\psi_c(x_c)\Big) = \frac{e^{-E(x)}}{Z} \]

KL-divergence between some distribution $q(x)$ and $p(x)$ is given by
\[ D(q\|p) = \sum_x q(x)\log\frac{q(x)}{p(x)} \ge 0, \quad \text{equality iff } q = p \]

Have
\begin{align*}
0 \le D(q\|p) &= \sum_x q(x)\log q(x) - \sum_x q(x)\log p(x) \\
&= -S(q) - \sum_x q(x)\,[-E(x) - \log Z] \\
&= \mathbb{E}_q(E(x)) - S(q) + \log Z
\end{align*}

where $S(q)$ is the standard Shannon entropy of $q$


Inference: a variational perspective

$0 \le D(q\|p) = \mathbb{E}_q(E(x)) - S(q) + \log Z$, equality iff $q = p$

Hence $\mathbb{E}_q(E(x)) - S(q) \ge -\log Z$

This function of distribution $q$ is called the (Gibbs) free energy: $\mathcal{F}_G(q) = \mathbb{E}_q(E(x)) - S(q)$

Minimizing it over all valid distributions $q$ yields $-\log Z$

And the argmin is attained exactly when $q = p$, the true distribution

Hence can think of inference as optimization

But still intractable in general...
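The variational statement can be checked numerically on a model small enough to enumerate. The sketch below assumes an arbitrary random energy table on three binary variables; the names `gibbs_free_energy` and `random_distribution` are introduced only for this illustration.

```python
import itertools
import math
import random

# Hypothetical 3-variable binary model with an arbitrary energy table E(x).
random.seed(0)
configs = list(itertools.product([0, 1], repeat=3))
energy = {x: random.uniform(-1.0, 1.0) for x in configs}

Z = sum(math.exp(-energy[x]) for x in configs)
p = {x: math.exp(-energy[x]) / Z for x in configs}


def gibbs_free_energy(q):
    """F_G(q) = E_q[E(x)] - S(q), using the convention 0 log 0 = 0."""
    avg_energy = sum(q[x] * energy[x] for x in configs)
    entropy = -sum(q[x] * math.log(q[x]) for x in configs if q[x] > 0)
    return avg_energy - entropy


def random_distribution():
    """An arbitrary valid distribution over all configurations."""
    w = [random.random() for _ in configs]
    total = sum(w)
    return {x: wi / total for x, wi in zip(configs, w)}


f_at_p = gibbs_free_energy(p)  # should equal -log Z
f_others = [gibbs_free_energy(random_distribution()) for _ in range(100)]
```

Every sampled `q` gives a free energy at least as large as the value at `p`, and the gap is exactly $D(q\|p)$. The optimization is over a table with $2^n$ entries, which is the intractability the slide refers to.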

END OF PART I


Part II: Bethe approximation

Seek to approximate the partition function $Z$
Also interested in approximate marginal inference (medical diagnosis, power network)

The Bethe approximation: what and why?

Introduced by Hans Bethe in the 1930s to study phase transitions in statistical physics. Wikipedia:
"Bethe left Germany in 1933, moving to England after receiving an offer as lecturer... He moved in with his friend Rudolf Peierls... This meant that Bethe had someone to speak to in German, and did not have to eat English food."

Found fresh application in machine learning
Direct connections to variational inference and belief propagation [YFW01]


Recall variational approach

\[ -\log Z = \min_{q\in\mathbb{M}} \mathcal{F}_G(q) = \min_{q\in\mathbb{M}} \mathbb{E}_q(E) - S(q(x)) \]

$\mathbb{M}$ is the marginal polytope, which comprises all globally valid probability distributions over all the variables, i.e. the convex hull of all $2^n$ configurations (for binary variables)
$\mathcal{F}_G$ is the Gibbs free energy, optimum at the true distribution

Bethe approximation has 2 aspects, both pairwise approximations:
1 Relax the marginal polytope $\mathbb{M}$ to the local polytope $\mathbb{L}$, which enforces only pairwise consistency, hence pseudo-marginals
2 Use Bethe entropy $S_B = \sum_{i\in V} S_i + \sum_{(i,j)\in E} (S_{ij} - S_i - S_j)$

Obtain Bethe partition function $Z_B$ at the global optimum

\[ -\log Z_B = \min_{q\in\mathbb{L}} \mathcal{F}(q) = \min_{q\in\mathbb{L}} \mathbb{E}_q(E) - S_B(q(x)) \]
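One way to see why the Bethe entropy is a natural choice is that it equals the true Shannon entropy when the model is a tree and the marginals are the true ones. The sketch below checks this on a hypothetical three-variable chain with random pairwise potentials; all names are illustrative.

```python
import itertools
import math
import random

random.seed(1)
n = 3
edges = [(0, 1), (1, 2)]  # hypothetical chain, i.e. a tree
configs = list(itertools.product([0, 1], repeat=n))

# Random pairwise log-potentials on the chain define a joint distribution p.
psi = {e: {ab: random.uniform(-1.0, 1.0) for ab in itertools.product([0, 1], repeat=2)}
       for e in edges}
weights = {x: math.exp(sum(psi[(i, j)][(x[i], x[j])] for i, j in edges)) for x in configs}
Z = sum(weights.values())
p = {x: w / Z for x, w in weights.items()}


def entropy(dist):
    """Shannon entropy of a distribution given as {outcome: probability}."""
    return -sum(v * math.log(v) for v in dist.values() if v > 0)


def marginal(subset):
    """True marginal of p over the variable indices in `subset`."""
    out = {}
    for x, px in p.items():
        key = tuple(x[v] for v in subset)
        out[key] = out.get(key, 0.0) + px
    return out


S_i = [entropy(marginal((i,))) for i in range(n)]
S_ij = {e: entropy(marginal(e)) for e in edges}

# Bethe entropy: sum_i S_i + sum_(i,j) (S_ij - S_i - S_j)
S_bethe = sum(S_i) + sum(S_ij[(i, j)] - S_i[i] - S_i[j] for i, j in edges)
S_true = entropy(p)
```

On a graph with cycles the two quantities generally differ, and $S_B$ can even be negative, which is the entropy aspect of the approximation.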


Connection to LBP

Obtain Bethe partition function $Z_B$ at the global optimum

\[ -\log Z_B = \min_{q\in\mathbb{L}} \mathcal{F} = \min_{q\in\mathbb{L}} \mathbb{E}_q(E) - S_B(q(x)) \]

[Figure: marginal polytope (global consistency) and local polytope (local consistency)]

$\mathcal{F}$ is called the Bethe free energy (approximates true free energy)

In a seminal paper, [YFW01] showed that fixed points of LBP correspond to stationary points of the Bethe free energy $\mathcal{F}$
Refined by [Hes02]: stable fixed points correspond to local minima of $\mathcal{F}$ (converse not true in general)


Other methods to minimize Bethe free energy F

LBP may be viewed as an algorithm to try to minimize $\mathcal{F}$
But may not converge, or may converge only to a local minimum

Spurred much effort to find convergent algorithms such as
  Gradient methods [WT01]
  Double loop methods, e.g. CCCP [Yui02] or [HAK03]

But still only to a local optimum, no time guarantee

For binary pairwise models
  Recent algorithm guaranteed to converge in polynomial time to an approximately stationary point of $\mathcal{F}$ [Shi12], restrictions on topology
  Our algorithm guaranteed to return an ε-approximation to the global optimum [WJ14]
  To our knowledge, no previously known methods guaranteed to return or approximate the global optimum


Binary pairwise MRFs

Main focus now on MRFs which are binary, i.e. all $X_i \in \{0, 1\}$, and pairwise, i.e. all potentials are over ≤ 2 variables

n variables $V = \{X_1, \dots, X_n\}$, singleton potentials $\psi_i(x_i)$

$x = (x_1, \dots, x_n) \in \{0, 1\}^n$ is one particular configuration

m edges $(i, j) \in E \subseteq V \times V$, pairwise potentials $\psi_{ij}(x_i, x_j)$

\[ p(x) = \frac{1}{Z}\exp\Big(\sum_{i\in V}\psi_i(x_i) + \sum_{(i,j)\in E}\psi_{ij}(x_i, x_j)\Big) \]

Can always reparameterize to a minimal representation $\{\theta_i : i \in V\}$, $\{W_{ij} : (i, j) \in E\}$ s.t. same distribution

\[ p(x) = \frac{1}{Z'}\exp\Big(\sum_{i\in V}\theta_i x_i + \sum_{(i,j)\in E}W_{ij}\,x_i x_j\Big) \]


Binary pairwise MRFs: simple example

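\[ p(x) = \frac{1}{Z}\exp\Big(\sum_{i\in V}\psi_i(x_i) + \sum_{(i,j)\in E}\psi_{ij}(x_i, x_j)\Big) \]

Can always reparameterize to a minimal representation $\{\theta_i : i \in V\}$, $\{W_{ij} : (i, j) \in E\}$ s.t. same distribution

\[ p(x) = \frac{1}{Z'}\exp\Big(\sum_{i\in V}\theta_i x_i + \sum_{(i,j)\in E}W_{ij}\,x_i x_j\Big) \]

[Diagram: nodes X1 and X2 joined by an edge]

local $\theta_1 = 4$
local $\theta_2 = -5$
edge $W_{12} = 3$

$W_{ij} > 0$ attractive
$W_{ij} < 0$ repulsive

ψ1(x1):
  x1    ψ1(x1)
  0     2
  1     4

ψ12(x1, x2):
  x1\x2    0     1
  0        1    −3
  1        3     2

ψ2(x2):
  x2        0     1
  ψ2(x2)   −1    −2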
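The reparameterization for this two-variable example can be reproduced by matching score differences relative to the configuration $x = (0, 0)$. The sketch below is one way to do it, not necessarily the construction used in the talk; the variable names are illustrative.

```python
import itertools
import math

# Potential tables from the two-variable example (keys are variable values).
psi1 = {0: 2.0, 1: 4.0}
psi2 = {0: -1.0, 1: -2.0}
psi12 = {(0, 0): 1.0, (0, 1): -3.0, (1, 0): 3.0, (1, 1): 2.0}

# Minimal representation: score of x minus score of (0, 0) must equal
# theta1*x1 + theta2*x2 + W12*x1*x2 for every configuration x.
W12 = psi12[(1, 1)] + psi12[(0, 0)] - psi12[(0, 1)] - psi12[(1, 0)]
theta1 = psi1[1] - psi1[0] + psi12[(1, 0)] - psi12[(0, 0)]
theta2 = psi2[1] - psi2[0] + psi12[(0, 1)] - psi12[(0, 0)]


def normalize(scores):
    """Turn a table of scores into a probability distribution."""
    Z = sum(math.exp(s) for s in scores.values())
    return {x: math.exp(s) / Z for x, s in scores.items()}


configs = list(itertools.product([0, 1], repeat=2))
p_tables = normalize({x: psi1[x[0]] + psi2[x[1]] + psi12[x] for x in configs})
p_minimal = normalize({x: theta1 * x[0] + theta2 * x[1] + W12 * x[0] * x[1] for x in configs})
```

The two score tables differ only by the constant $\psi_1(0) + \psi_2(0) + \psi_{12}(0, 0)$, which is absorbed into the partition function, so $Z' \ne Z$ while the distribution is unchanged.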


Bethe pseudo-marginals in the local polytope

\[ -\log Z_B = \min_{q\in\mathbb{L}} \mathcal{F} = \min_{q\in\mathbb{L}} \mathbb{E}_q(E) - S_B(q(x)) \]

Must identify $q(x) \in \mathbb{L}$ that minimizes $\mathcal{F}$

$q$ is defined by singleton pseudo-marginals $q_i = p(X_i = 1)\ \forall i \in V$ and pairwise $\mu_{ij}\ \forall (i, j) \in E$. Local polytope constraints imply

\[ \mu_{ij} = \begin{pmatrix} p(X_i = 0, X_j = 0) & p(X_i = 0, X_j = 1) \\ p(X_i = 1, X_j = 0) & p(X_i = 1, X_j = 1) \end{pmatrix} = \begin{pmatrix} 1 + \xi_{ij} - q_i - q_j & q_j - \xi_{ij} \\ q_i - \xi_{ij} & \xi_{ij} \end{pmatrix} \]

with the constraint that all terms ≥ 0 ⇒ $\xi_{ij} \in [\max(0, q_i + q_j - 1), \min(q_i, q_j)]$

[WT01] showed:

Minimizing $\mathcal{F}$, can solve explicitly for $\xi_{ij}(q_i, q_j, W_{ij})$

Here $W_{ij}$ is the associativity of the edge (as earlier)

Hence sufficient to search over $(q_1, \dots, q_n) \in [0, 1]^n$, but how?
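The explicit solution for $\xi_{ij}$ is not written out on the slide. The sketch below uses the quadratic obtained by setting $\partial\mathcal{F}/\partial\xi_{ij} = 0$ in the minimal representation, namely $\alpha\xi^2 - [1 + \alpha(q_i + q_j)]\xi + (1 + \alpha)q_i q_j = 0$ with $\alpha = e^{W_{ij}} - 1$; treat that closed form, and the names `xi_opt` and `pairwise_marginal`, as assumptions to be checked against [WT01].

```python
import math


def xi_opt(qi, qj, W):
    """Pairwise term xi_ij minimizing F for fixed singletons q_i, q_j and edge weight W."""
    alpha = math.exp(W) - 1.0
    if abs(alpha) < 1e-12:
        return qi * qj  # W = 0: the edge imposes no coupling
    Q = 1.0 + alpha * (qi + qj)
    disc = Q * Q - 4.0 * alpha * (1.0 + alpha) * qi * qj
    return (Q - math.sqrt(disc)) / (2.0 * alpha)


def pairwise_marginal(qi, qj, xi):
    """2x2 table mu_ij implied by the local polytope constraints."""
    return [[1.0 + xi - qi - qj, qj - xi],
            [qi - xi, xi]]


def odds_ratio(mu):
    """mu(1,1) * mu(0,0) / (mu(1,0) * mu(0,1)); equals exp(W) at the optimum."""
    return mu[1][1] * mu[0][0] / (mu[1][0] * mu[0][1])


qi, qj = 0.3, 0.6
xi_attractive = xi_opt(qi, qj, 2.0)
xi_repulsive = xi_opt(qi, qj, -2.0)
```

An attractive edge pushes $\xi_{ij}$ above the independent value $q_i q_j$ and a repulsive edge pushes it below, while both stay inside the interval required by the local polytope.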


Our approach: a mesh over Bethe pseudo-marginals

We discretize the space $(q_1, \dots, q_n) \in [0, 1]^n$ with a provably sufficient mesh $\mathcal{M}(\epsilon)$, fine enough s.t. the optimum discretized point $q^*$ has $\mathcal{F}(q^*) \le \min_{q\in\mathbb{L}} \mathcal{F}(q) + \epsilon$

[Figure: mesh of points in three dimensions, axes q1, q2, q3, each ranging over [0, 1]]


Key ideas to approximate $\log Z_B$ to within ε

Discretize to construct a provably sufficient mesh $\mathcal{M}(\epsilon)$:
  How to guarantee $\mathcal{F}(q^*) \le \min_{q\in\mathbb{L}} \mathcal{F}(q) + \epsilon$?
  How to search the large discrete mesh efficiently?

Developed two approaches:
  curvMesh bounds curvature [WJ13]
  gradMesh bounds gradients, typically much better (orders of magnitude) [WJ14]

If original model is attractive, i.e. $W_{ij} > 0\ \forall (i, j) \in E$ (submodular cost functions), then show the discretized multi-label problem is submodular [WJ13, KKL12]
  Hence, can be solved via graph cuts [SF06]
  $O(N^3)$ where $N = \sum_{i\in V} N_i$, with $N_i$ points in dim $i$ [cf. $\prod_{i\in V} N_i$]
  Obtain FPTAS with gradMesh, $N = O\big(\frac{nmW}{\epsilon}\big)$
  To compare, for curvMesh, $N = O\big(\epsilon^{-1/2}\, n^{7/4}\, \Delta^{3/4} \exp\big[\tfrac{1}{2}(W(1 + \Delta/2) + T)\big]\big)$


Bounding the locations of stationary points

For general edge types (associative or repulsive), let $W_i = \sum_{j\in N(i):\,W_{ij} > 0} W_{ij}$, $V_i = -\sum_{j\in N(i):\,W_{ij} < 0} W_{ij}$

Theorem (WJ13)

At any stationary point of the Bethe free energy, $\sigma(\theta_i - V_i) \le q_i \le \sigma(\theta_i + W_i)$

Developed an algorithm (Bethe bound propagation, BBP) that iteratively improves these bounds

[MK07] already had a similar algorithm, which finds ranges of possible beliefs in LBP: a bit slower but typically better

Use this to preprocess model to yield a smaller orthotope
  reduces search space directly
  for curvMesh lowers max curvature, hence coarser mesh
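The initial bounds in the theorem are cheap to compute. The sketch below takes $\sigma$ to be the logistic sigmoid and evaluates the bounds for the two-variable example used earlier in the talk ($\theta = (4, -5)$, $W_{12} = 3$); it covers only the first step, not the iterative BBP refinement, and the function names are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bethe_box(theta, W):
    """Bounds sigma(theta_i - V_i) <= q_i <= sigma(theta_i + W_i) for every variable.

    theta: list of singleton parameters; W: dict {(i, j): W_ij} over edges.
    """
    n = len(theta)
    W_pos = [0.0] * n  # W_i: sum of attractive weights incident to i
    V_neg = [0.0] * n  # V_i: minus the sum of repulsive weights incident to i
    for (i, j), w in W.items():
        for v in (i, j):
            if w > 0:
                W_pos[v] += w
            else:
                V_neg[v] -= w
    return [(sigmoid(t - v), sigmoid(t + w)) for t, v, w in zip(theta, V_neg, W_pos)]


theta = [4.0, -5.0]
W = {(0, 1): 3.0}
box = bethe_box(theta, W)

# On a tree the exact marginals are a stationary point of the Bethe free energy,
# so they must fall inside the box.
Z = 1.0 + math.exp(4.0) + math.exp(-5.0) + math.exp(4.0 - 5.0 + 3.0)
q_exact = [(math.exp(4.0) + math.exp(2.0)) / Z, (math.exp(-5.0) + math.exp(2.0)) / Z]
```

Here the box is already much smaller than $[0, 1]^2$, which is the sense in which preprocessing reduces the search space directly.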


Bethe free energy landscape (stylized)

Red dot shows the global optimum, we might return the green dot


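Curvature: all terms of the Hessian $H_{ij} = \frac{\partial^2\mathcal{F}}{\partial q_i\,\partial q_j}$

\[ H_{ii} = -\frac{d_i - 1}{q_i(1 - q_i)} + \sum_{j\in N(i)} \frac{q_j(1 - q_j)}{T_{ij}} \ge \frac{1}{q_i(1 - q_i)}, \]

\[ H_{ij} = \begin{cases} \dfrac{q_i q_j - \xi_{ij}}{T_{ij}} & (i, j) \in E \\ 0 & (i, j) \notin E,\ i \ne j \end{cases} \]

where $d_i$ is the degree of $X_i$ in the model, and

\[ T_{ij} = q_i q_j (1 - q_i)(1 - q_j) - (\xi_{ij} - q_i q_j)^2 \ge 0, \quad \text{equality iff } q_i \text{ or } q_j \in \{0, 1\} \]

Leads to bound on max second derivative in any direction (curvMesh)

The $q_i q_j - \xi_{ij}$ term is negative for an attractive edge, hence obtain the submodularity result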


gradMesh: analyze first derivatives of F

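\[ \frac{\partial\mathcal{F}}{\partial q_i} = -\theta_i + \log\frac{(1 - q_i)^{d_i - 1}\prod_{j\in N(i)}(q_i - \xi_{ij})}{q_i^{\,d_i - 1}\prod_{j\in N(i)}(1 + \xi_{ij} - q_i - q_j)} \quad \text{[WT01]} \]

Theorem (WJ14)

\[ -\theta_i + \log\frac{q_i}{1 - q_i} - W_i \;\le\; \frac{\partial\mathcal{F}}{\partial q_i} \;\le\; -\theta_i + \log\frac{q_i}{1 - q_i} + V_i \]

Upper and lower bounds are separated by a constant, and both are monotonically increasing with $q_i$

Within our search space, allows us to bound $\left|\frac{\partial\mathcal{F}}{\partial q_i}\right| \le D_i := V_i + W_i = \sum_{j\in N(i)} |W_{ij}|$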


gradMesh: search over purple region

[Figure: upper and lower bounds $f_i^U$, $f_i^L$ for the partial derivative $\partial\mathcal{F}/\partial q_i$, plotted against pseudo-marginal $q_i \in [0, 1]$. Marked points: $A_i$ ($q_i$ s.t. $f_i^U(q_i) = 0$) and $1 - B_i$ ($q_i$ s.t. $f_i^L(q_i) = 0$). The shaded area, the region of the Bethe box $[A_i, 1 - B_i]$, shows where the partial derivative can be 0. Annotation: $D_i = V_i + W_i - \log L_i - \log U_i$. Parameters used in this example: $\theta_i = 1$, $V_i = 2$, $W_i = 3$, $L_i = 1.8$, $U_i = 2.9$.]


gradMesh: complexity

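In search space,

\[ \left|\frac{\partial\mathcal{F}}{\partial q_i}\right| \le D_i := V_i + W_i = \sum_{j\in N(i)} |W_{ij}| \]

We can apportion ε error among n variables

Simple method: each gets $\frac{\epsilon}{n}$

Need $\text{gradient}_i \cdot \text{step}_i \approx \frac{\epsilon}{n}$.

Hence number of mesh points in dimension $i$,

\[ N_i \approx \frac{1}{\text{step}_i} \approx \frac{n}{\epsilon}\,\text{gradient}_i = O\Big(\frac{n}{\epsilon}\sum_{j\in N(i)} |W_{ij}|\Big) \]

Hence $N = \sum_i N_i = O\big(\frac{n}{\epsilon}\,mW\big)$

Various tricks in paper show how to improve performance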
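The counting argument above can be turned into a few lines of arithmetic. The sketch below applies only the simple rule $N_i \approx (n/\epsilon) D_i$ to a hypothetical 4-cycle; it ignores the refinements mentioned on the slide and the fact that the search range in each dimension is usually narrower than $[0, 1]$, and `grad_mesh_sizes` is a name chosen for this illustration.

```python
import math


def grad_mesh_sizes(W, n, eps):
    """Mesh points per dimension under the simple rule N_i ~ (n / eps) * D_i.

    W: dict {(i, j): W_ij}; n: number of variables; eps: target accuracy.
    D_i = sum over neighbours j of |W_ij| bounds |dF/dq_i| in the search space.
    """
    D = [0.0] * n
    for (i, j), w in W.items():
        D[i] += abs(w)
        D[j] += abs(w)
    # Keep at least one point per dimension, even for an isolated variable.
    return [max(1, math.ceil(n * d / eps)) for d in D]


# Hypothetical 4-cycle with all |W_ij| = 2 and eps = 0.5.
W = {(0, 1): 2.0, (1, 2): -2.0, (2, 3): 2.0, (3, 0): -2.0}
sizes = grad_mesh_sizes(W, n=4, eps=0.5)
N_sum = sum(sizes)         # N: size of the multi-label problem handed to graph cuts
N_prod = math.prod(sizes)  # number of points in the full product mesh
```

Since $\sum_i D_i = 2\sum_{(i,j)\in E} |W_{ij}| \le 2mW$ when every $|W_{ij}| \le W$, the total $N$ grows as $O(nmW/\epsilon)$, while the product mesh grows exponentially in $n$.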


Comparison of methods: left ε = 1, right ε = 0.1; (when fixed, W = 5, n = 10)

[Figure: log-scale plots of the number of mesh points N for curvMeshOrig, curvMeshNew and gradMesh; panels show N against n (5 to 20) and N against W (0 to 10), for each value of ε]


Example where LBP fails to converge, gradMesh works well

Power network of transformers

$X_i \in \{\text{stable}, \text{fail}\}$

Attractive edges between transformers

Would like to rank by marginal probability of failure $p(X_i)$

[Figure: power network graph with nodes numbered 1 to 55]


Recap

The Bethe approximation is often strikingly accurate.

New results:
  Novel formulation of the Hessian of the Bethe free energy $\mathcal{F}$
  Bounds on derivatives and locations of optima

First method guaranteed to return ε-approx global optimum $\log Z_B$, allows its accuracy to be tested rigorously

Provides benchmark against which to judge other heuristics (LBP, HAK etc.)

Useful in practice for small problems

FPTAS for attractive models, which was an open theoretical question

Further improvements in new work...


Understanding the Bethe approximation

Joint work with Kui Tang and David Sontag

Goal: separate and evaluate the two aspects of the Bethe approximation:

1 Relax the marginal polytope $\mathbb{M}$ to the local polytope $\mathbb{L}$, which enforces only pairwise consistency, hence pseudo-marginals

2 Use Bethe entropy $S_B = \sum_{i\in V} S_i + \sum_{(i,j)\in E} (S_{ij} - S_i - S_j)$

Consider marginal, cycle and local polytopes

Compare against tree-reweighted approximation (TRW)
  same polytopes
  concave upper-bounding entropy

Analytic and experimental results


Illustration of polytopes

[Figure: three nested polytopes]
marginal polytope: global consistency
cycle polytope: cycle consistency
local polytope: local consistency


Questions addressed include

Does tightening the relaxation of the marginal polytope always improve the Bethe approximation for $\log Z$?

  No (empirically usually very helpful for general models)

In attractive models, when local potentials are low and couplings high, why does the Bethe approximation perform poorly for marginals?

  Bethe entropy

In general models, for low couplings, the Bethe approximation performs much better than TRW, yet as coupling increases, this advantage disappears. How does this vary if we tighten the relaxation of the marginal polytope?

  Mixed, see Experiments


Tightening the polytope relaxation - does it always help?

No

Consider symmetric nonhomogeneous cycle, vary $W_{BC}$, $\theta_A = \theta_B = \theta_C = 0$

[Diagram: triangle on nodes A, B, C] $W_{AB} = W_{AC} = 10$, strongly attractive

[Figure: $\log Z$ against BC edge weight (−10 to 10) for true, Bethe, and Bethe+cycle]

Lemma: $\frac{\partial\log Z_B}{\partial W_{BC}} = \mu_{BC}(0, 0) + \mu_{BC}(1, 1)$, all singleton marginals $\frac{1}{2}$

For weakly attractive edge BC, cycle improves pairwise marginal (similar slopes near 0) but worsens partition function (gap between curves near 0)


Threshold result for attractive models due to SB entropy

Lemma: For a symmetric homogeneous $d$-regular MRF, $q = (\frac{1}{2}, \dots, \frac{1}{2})$ is a stationary point of $\mathcal{F}$ but not a minimum for $W > 2\log\frac{d}{d - 2}$ (uses earlier Hessian result)

Recall $\sum_i d_i = 2m$ (handshake lemma), hence $S_B = m S_{ij} + (n - 2m) S_i$. For large $W$, all probability mass is pulled onto the main diagonal, hence $S_{ij} \approx S_i$. For $m > n$, to avoid negative $S_B$, each entropy term → 0 by tending to pairwise $\begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}$ or symmetrically $\begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix}$.
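The threshold in the lemma is easy to evaluate, and for $K_5$ (which is 4-regular) it gives $2\log 2 \approx 1.386$. The sketch below also checks the sign change using the Hessian entries from earlier in the talk. That second function rests on assumptions made for this illustration: the minimal representation with score $W x_i x_j$ per edge, singleton terms chosen so the model is symmetric, and the resulting simplification of $H_{ii} + d\,H_{ij}$ at $q = \frac{1}{2}$.

```python
import math


def bethe_threshold(d):
    """Edge weight above which q = (1/2, ..., 1/2) stops being a minimum (needs d > 2)."""
    return 2.0 * math.log(d / (d - 2.0))


def curvature_along_ones(d, W):
    """Hessian eigenvalue at q = 1/2 for the all-ones direction.

    Assumes every edge has weight W in the minimal representation. At q = 1/2 the
    optimal pairwise term is xi = sigmoid(W / 2) / 2, and with T = xi * (1/2 - xi)
    the sum H_ii + d * H_ij simplifies to -4 * (d - 1) + d / xi.
    """
    xi = 0.5 / (1.0 + math.exp(-W / 2.0))
    return -4.0 * (d - 1) + d / xi


w_star = bethe_threshold(4)  # K5 is 4-regular
below = curvature_along_ones(4, 1.0)
above = curvature_along_ones(4, 1.75)
```

Positive curvature below the threshold and negative curvature above it is consistent with $q = \frac{1}{2}$ turning from a minimum into a non-minimal stationary point as $W$ increases.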

[Figure: Bethe free energy $E - S_B$ against $q \in [0, 1]$ for $K_5$, three panels: $W = 1$, $W = 1.38$, $W = 1.75$]


Also a polytope effect for frustrated cycles

A frustrated cycle has an odd number of repulsive edges; this pulls singleton marginals the other way, toward $\frac{1}{2}$

Seen Bethe entropy effect for attractive cycles

Also a polytope effect for frustrated cycles

Recall optimum energy on local polytope for a symmetric frustrated cycle is at $(\frac{1}{2}, \dots, \frac{1}{2})$

$C_5$ topology, $\theta_i \sim [0, T_{max}]$, all edges $W$

[Figure: two panels of average singleton marginal (0.5 to 1) against edge weight W (−10 to 10) for true, Bethe, and Bethe+cycle]


Experiments: general models, $\theta_i \sim [-2, 2]$ (attractive and repulsive edges), $K_{10}$ topology

[Figure: four panels against maximum coupling strength y (2 to 32):
  log partition error (Bethe+local, Bethe+cycle, Bethe+marg, TRW+local, TRW+cycle, TRW+marg);
  log partition error, local removed (Bethe+cycle, Bethe+marg, TRW+cycle, TRW+marg);
  singleton marginals, average ℓ1 error;
  pairwise marginals, average ℓ1 error]


Conclusions for general models

Big gains from cycle polytope (suggest Frank-Wolfe)

Not much additional gain from marginal polytope (computationally harder)

Bethe performs remarkably well

Better than TRW for $\log Z$, pairwise marginals

Less clear on singleton marginals: TRW better for very strong coupling

Still much to learn about why Bethe performs so well...


Summary

The Bethe approximation is remarkably effective for approximate inference

Novel results on Hessian of Bethe free energy

First algorithm for ε-approx of global optimum $\log Z_B$, FPTAS for attractive models

Contributions to understanding the Bethe approximation (polytope and entropy)

Where feasible, tightening to the cycle polytope can be very helpful

Additional results in new work (e.g. clamping)...

Thank you!


Attractive example: max score and value, with argmax

[Figure: two panels against $q_i$ (0 to 1): "Opt Score(C) and Value(−F), i=3/4" (y-axis: Score/Value); argmax singleton values (0 to 1)]


References

F. Korč, V. Kolmogorov, and C. Lampert. Approximating marginals using discrete energy minimization. Technical report, IST Austria, 2012.

J. Mooij and H. Kappen. Sufficient conditions for convergence of the sum-product algorithm. IEEE Transactions on Information Theory, 2007.

D. Schlesinger and B. Flach. Transforming an arbitrary minsum problem into a binary one. Technical report, Dresden University of Tech, 2006.

A. Weller and T. Jebara. Approximating the Bethe partition function. In UAI, 2014.

A. Weller, K. Tang, D. Sontag, and T. Jebara. Understanding the Bethe approximation: When and how can it go wrong? In UAI, 2014.

A. Weller and T. Jebara. Bethe bounds and approximating the global optimum. In AISTATS, 2013.

M. Welling and Y. Teh. Belief optimization for binary networks: A stable alternative to loopy belief propagation. In UAI, 2001.

J. Yedidia, W. Freeman, and Y. Weiss. Understanding belief propagation and its generalizations. In IJCAI, Distinguished Lecture Track, 2001.


Extra Slides with Supplementary Material

Supplementary Material (if time or questions)


Cycle polytope

A relaxation of the marginal polytope

Inherits all constraints of the local polytope, hence at least as tight

In addition, enforces consistency around any cycle

Cycle inequalities [B93]

$\forall$ cycles $C$ and every subset of edges $F \subseteq C$ with $|F|$ odd:

$$\sum_{(i,j) \in F} \big(\mu_{ij}(0,0) + \mu_{ij}(1,1)\big) + \sum_{(i,j) \in C \setminus F} \big(\mu_{ij}(1,0) + \mu_{ij}(0,1)\big) \geq 1.$$

Cycle polytope = marginal polytope for symmetric planar MRFs [B93]

Cycle polytope = TRI for binary pairwise [S10]
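The cycle inequalities can be tested directly on a set of pairwise pseudo-marginals. The sketch below is illustrative only: the function name, the dictionary layout `mu[(i, j)][a][b]`, and the two example points are assumptions, not taken from the slides. It enumerates every odd-sized edge subset F of one cycle, which is exponential in the cycle length and therefore only suitable for small cycles.

```python
from itertools import combinations

def violated_cycle_inequalities(cycle_edges, mu, tol=1e-12):
    """Return every odd-sized edge subset F of the cycle whose inequality fails.
    mu[(i, j)][a][b] is assumed to hold the pairwise pseudo-marginal mu_ij(a, b)."""
    violated = []
    for size in range(1, len(cycle_edges) + 1, 2):  # |F| odd
        for F in combinations(cycle_edges, size):
            # Edges in F contribute their "agree" mass, the rest their "disagree" mass.
            agree = sum(mu[e][0][0] + mu[e][1][1] for e in F)
            disagree = sum(mu[e][0][1] + mu[e][1][0]
                           for e in cycle_edges if e not in F)
            if agree + disagree < 1 - tol:
                violated.append(F)
    return violated

c5 = [(0, 1), (1, 2), (2, 3), (3, 4), (0, 4)]
# Locally consistent point with singleton marginals 1/2 and every edge fully
# "disagreeing": the kind of point a frustrated odd cycle favours.
frustrated = {e: [[0.0, 0.5], [0.5, 0.0]] for e in c5}
# Independent uniform marginals, which lie in the marginal polytope.
uniform = {e: [[0.25, 0.25], [0.25, 0.25]] for e in c5}

print(violated_cycle_inequalities(c5, frustrated))
print(violated_cycle_inequalities(c5, uniform))
```

The first point satisfies the local polytope constraints but fails the inequality with F equal to the whole cycle, which is one way to see why tightening to the cycle polytope matters for frustrated cycles.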


Threshold for attractive models $\xi_{ij}(q_i, q_j, W_{ij})$

[Figure: top row, Bethe free energy $E - S_B$ against $q$ for $K_5$ at $W = 1$, $W = 1.38$, $W = 1.75$; bottom row, Bethe entropy $S_B$ and energy $E$ against $q$ at $W = 1$ and at $W = 4.5$]


Experiments: Attractive models θi ∼ [−0.1, 0.1]

[Figure: three panels against maximum coupling strength $y$ (0.4 to 16): log partition error; singleton marginals, average $\ell_1$ error; pairwise marginals, average $\ell_1$ error (small scale). Legend: Bethe+local, Bethe+cycle, Bethe+marg, TRW+local, TRW+cycle, TRW+marg]

For this distribution of models, the polytope appears to make no difference

Though recall we showed theoretically it can


Clamping variables: Attractive binary pairwise models

$Z_B$ = optimal Bethe partition function for original model

Clamp variable $X_i$, form new approximation

$$Z_B^{(i)} = Z_B|_{X_i=0} + Z_B|_{X_i=1}.$$

Theorem (WJ14 NIPS)

For an attractive binary pairwise model and any variable $X_i$,

$$Z_B \leq Z_B^{(i)}.$$

Corollary

For an attractive binary pairwise model, $Z_B \leq Z$.

$\Rightarrow$ clamping only improves the estimate of the partition function.
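One way to see how the corollary follows from the theorem is to clamp repeatedly. The chain below is a sketch of that argument, not a transcription of the slide; it assumes that each clamped sub-model is still an attractive binary pairwise model, and $\mathrm{score}(x)$ is a placeholder name for the log of the unnormalized weight of configuration $x$.

```latex
% Apply the theorem to X_i, then again to every clamped sub-model, and so on.
\begin{align*}
Z_B &\le \sum_{x_i} Z_B|_{X_i = x_i} \\
    &\le \sum_{x_i, x_j} Z_B|_{X_i = x_i,\, X_j = x_j} \\
    &\le \cdots \le \sum_{x_1, \dots, x_n} Z_B|_{X = x}
     = \sum_{x} \exp\big(\mathrm{score}(x)\big) = Z .
\end{align*}
% Last step: once every variable is clamped nothing is left to approximate,
% so Z_B|_{X = x} is just the unnormalized weight of configuration x.
```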


Clamping variables: stronger result

For any $i \in V$, $x \in [0, 1]$, let $\log Z_{Bi}(x) = \max_{q \in [0,1]^n : q_i = x} -\mathcal{F}(q)$

Observe $\log Z_{Bi}(0) = \log Z_B|_{X_i=0}$, $\log Z_{Bi}(1) = \log Z_B|_{X_i=1}$

and $\log Z_B = \max_{q_i \in [0,1]} \log Z_{Bi}(q_i)$

Recall $S_i(x) = -x \log x - (1 - x) \log(1 - x)$ singleton entropy

Lemma: To prove clamping result, sufficient if $\log Z_{Bi}(q_i) \leq q_i \log Z_{Bi}(1) + (1 - q_i) \log Z_{Bi}(0) + S_i(q_i)$

Theorem (WJ14 NIPS)

For an attractive binary pairwise model, $\log Z_{Bi}(q_i) - S_i(q_i)$ is convex.

Uses earlier results on Hessian
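The two steps linking the convexity theorem, the lemma, and the clamping result can be written out as follows. This is a sketch reconstructed from the definitions on this slide; the shorthand $a$, $b$ and $A$ is introduced here for illustration and is not notation from the talk.

```latex
% Shorthand (assumed): a = \log Z_{Bi}(1), b = \log Z_{Bi}(0),
%                      A(q_i) = \log Z_{Bi}(q_i) - S_i(q_i).
% Step 1, convexity gives the lemma: S_i(0) = S_i(1) = 0, so A(1) = a, A(0) = b, and
\begin{align*}
A(q_i) \le q_i A(1) + (1 - q_i) A(0)
\;\Longrightarrow\;
\log Z_{Bi}(q_i) \le q_i a + (1 - q_i) b + S_i(q_i).
\end{align*}
% Step 2, the lemma gives the clamping result: maximize the right-hand side over q_i.
% Setting the derivative a - b + \log\frac{1 - q_i}{q_i} to zero gives
% q_i^* = e^a / (e^a + e^b), where the right-hand side equals \log(e^a + e^b), so
\begin{align*}
\log Z_B = \max_{q_i \in [0,1]} \log Z_{Bi}(q_i)
\le \log\big(e^{a} + e^{b}\big)
= \log\big(Z_B|_{X_i=1} + Z_B|_{X_i=0}\big)
= \log Z_B^{(i)} .
\end{align*}
```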
